CREATER: CTR-driven Advertising Text Generation with Controlled Pre-Training and Contrastive Fine-Tuning

This paper focuses on automatically generating the text of an ad, aiming for the generated text to capture user interest and achieve a higher click-through rate (CTR). We propose CREATER, a CTR-driven advertising text generation approach, to generate ad texts based on high-quality user reviews. To incorporate the CTR objective, our model learns from online A/B test data with contrastive learning, which encourages the model to generate ad texts that obtain higher CTR. To make use of large-scale unpaired reviews, we design a customized self-supervised objective that reduces the gap between pre-training and fine-tuning. Experiments on industrial datasets show that CREATER significantly outperforms current approaches. It has been deployed on a leading advertising platform and brings uplift on core online metrics.


Introduction
For businesses that want to promote their items and services, running online advertisements on advertising platforms is an effective way to achieve their marketing goals. To attract users and make them want to know more about the displayed items, advertisers design ad creatives (such as text, images and video). Figure 1 shows the creative of an ad in a news feed, which contains a text and an image.
An appropriate creative design that accurately captures user interest can improve the ad's click-through rate (CTR). CTR is a key metric that quantifies the effect of an ad, because a click is the precondition for any further user actions such as sharing and purchasing. Thus, designing ad creatives that achieve higher CTR is crucial for ad delivery.
Traditionally, advertisers need to manually design the creative of each ad, and then resort to online A/B test results to continually refine the initial creative to catch user interest. Such a trial-and-error process is labor-intensive and usually inefficient. For the text in a creative, due to the variability of language expressions, the text may need to be polished multiple times before an ideal version is obtained. To improve the efficiency of ad delivery for advertisers, especially for small advertisers who may not be able to afford professional writers, this paper focuses on automatically generating the text for an ad, with the goal that the generated text captures user interest and achieves higher CTR.
There are several challenges in achieving this goal. (I) First, it is important to choose a suitable source for generating ad texts. A straightforward source is the corresponding item's title on the landing page. However, a title is usually a mixture of item attributes and may not reflect user preference. In contrast, an ad text should contain insightful and informative content that can arouse users' purchasing desire. (II) Second, most current natural language generation (NLG) models are optimized with a cross-entropy criterion, which is misaligned with the CTR metric we care about. To encourage the model to generate texts that achieve higher CTR, there is a great need to incorporate the CTR objective into training. (III) Last but not least, a well-trained NLG model usually needs a large amount of paired data. However, it is costly to collect sufficient human-written ad texts, especially for small advertisers, so we face a low-resource problem.

Problem Formulation
Given a source $x$ and a control code $c$ for an ad, where the source is a high-quality user review of the ad item and the control code is an aspect term of that review used to guide generation, we aim to learn a generation model $p_\Theta(y \mid x, c)$ that can produce an appropriate ad text $y$ (where $\Theta$ denotes the trainable parameters of the model). Our goal is that the generated ad text captures user interest and attracts users to learn more about the ad item.
Proposed Approach: CREATER
Figure 2 illustrates the workflow of CREATER, which consists of two stages. The first stage is Controlled Pre-Training, which learns from unpaired user reviews to provide warm-starting for the low-resource scenario. The second stage is Contrastive Fine-Tuning, which further learns from online A/B test data that reflects user feedback, encouraging the model to generate ad texts that achieve higher CTR.

Stage 1: Controlled Pre-Training
We construct a large set of user reviews as the pre-training corpus $D_x$. Based on $D_x$, we extract a set of aspect terms $D_c$ using the off-the-shelf unsupervised model ABAE (He et al., 2017), where each aspect term is typically represented as a single word.
Recall that we aim to learn a generation model $p_\Theta(y \mid x, c)$, while the pre-training stage only makes use of the unpaired user reviews $D_x$. To ensure that the model benefits from pre-training, we propose a novel self-supervised objective customized to our scenario, which reduces the gap between pre-training and fine-tuning. The core idea is that, for each review $x \in D_x$, we construct an aspect-based pseudo-target $\tilde{y}$ from the review $x$ and mask this segment in $x$. The self-supervised objective is aspect-controlled generation: recover the segment $\tilde{y}$ given the masked review, under the guidance of the corresponding aspect term.
Aspect-Controlled Masking For a review $x \in D_x$, we tokenize it into a list of segments $[x_{\text{seg}_1}, x_{\text{seg}_2}, \ldots]$ based on punctuation and a dependency parser, where each segment $x_{\text{seg}_i}$ is a sub-sequence of $x$. Given an aspect term $c \in D_c$ that appears in the review $x$, we compute the matching score between $c$ and each $x_{\text{seg}_i}$ with a matching function $f(c, x_{\text{seg}_i})$. We then select the segment with the highest matching score as the pseudo-target $\tilde{y}$ for the given pair (source $x$, control code $c$):

$$\tilde{y} = \arg\max_{x_{\text{seg}_i}} f(c, x_{\text{seg}_i}).$$

For each triple (source $x$, control code $c$, pseudo-target $\tilde{y}$), our aspect-controlled masking strategy masks the review $x$ by replacing its pseudo-target $\tilde{y}$ with a special word "[MASK]". Thus we transform each triple $(x, c, \tilde{y})$ into a masked one $(\hat{x}, c, \tilde{y})$, where the masked review $\hat{x}$ is specific to the aspect term $c$.
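As a concrete illustration, below is a minimal sketch of aspect-controlled masking, assuming the lexical (TF-IDF) matching function mentioned in the implementation details and punctuation-only segmentation (the paper additionally uses a dependency parser); all names are illustrative and not the authors' code.

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

MASK = "[MASK]"

def split_segments(review):
    """Split a review into segments on punctuation (simplified; the paper
    additionally uses a dependency parser)."""
    return [s.strip() for s in re.split(r"[,.;!?]", review) if s.strip()]

def aspect_controlled_mask(review, aspect, vectorizer):
    """Return (masked_review, aspect, pseudo_target) for one review.

    The segment most similar to the aspect term (TF-IDF cosine similarity,
    standing in for the matching function f) becomes the pseudo-target and
    is replaced by [MASK] in the source."""
    segments = split_segments(review)
    vecs = vectorizer.transform([aspect] + segments).toarray()
    aspect_vec, seg_vecs = vecs[0], vecs[1:]
    sims = seg_vecs @ aspect_vec / (
        np.linalg.norm(seg_vecs, axis=1) * np.linalg.norm(aspect_vec) + 1e-8)
    pseudo_target = segments[int(sims.argmax())]
    masked_review = review.replace(pseudo_target, MASK, 1)
    return masked_review, aspect, pseudo_target

# toy usage (English for readability; the real corpus is Chinese reviews)
reviews = ["battery lasts two days, screen is bright, camera is sharp"]
vectorizer = TfidfVectorizer().fit(reviews)
print(aspect_controlled_mask(reviews[0], "battery", vectorizer))
```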

Figure 2: Overview of our proposed approach CREATER for CTR-driven advertising text generation.
Aspect-Controlled Generation Given a masked review $\hat{x}$ with an aspect term $c$, our self-supervised objective is to recover the masked segment (i.e., the pseudo-target $\tilde{y}$) of the original review $x$ under the control of $c$, i.e., to model $p_\Theta(\tilde{y} \mid \hat{x}, c)$. Such aspect-controlled generation forces the model to better understand the context of the masked input review. Compared with general pre-training models (Zhang et al., 2020; Lewis et al., 2020; Raffel et al., 2020), the proposed objective is customized to our scenario: the input $\hat{x}$ does not contain the content to be generated, which improves the ability to generate abstractive content rather than simply copying from the input.
Formally, we first prepend the control code $c$ to the masked source $\hat{x}$, and add a special word "[SEP]" between them. We then feed the concatenated sequence $[c, \text{[SEP]}, \hat{x}]$ into CREATER to generate the pseudo-target $\tilde{y} = [\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_T]$ (where $T$ denotes its length). The model architecture is a Transformer encoder-decoder (Vaswani et al., 2017), optimized via teacher forcing:

$$\mathcal{L}_{\text{pt}} = -\sum_{t=1}^{T} \log p_\Theta\big(\tilde{y}_t \mid \tilde{y}_{<t}, \mathbf{h}_{1:\tilde{N}}\big),$$

where $\tilde{N}$ is the length of the sequence $[c, \text{[SEP]}, \hat{x}]$, and $\mathbf{h}_i$ is the representation of the $i$-th input word.
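A minimal PyTorch sketch of this controlled pre-training step: the aspect term is prepended to the masked review with a [SEP] token, and the pseudo-target is recovered with teacher forcing. The model wrapper and tensor layout below are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ControlledSeq2Seq(nn.Module):
    """4-layer Transformer encoder-decoder with d_model=512, matching the
    configuration reported in the paper; tokenizer/vocabulary are assumed
    to exist elsewhere."""
    def __init__(self, vocab_size, d_model=512, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, num_encoder_layers=n_layers,
            num_decoder_layers=n_layers, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_in_ids):
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            tgt_in_ids.size(1)).to(src_ids.device)
        out = self.transformer(self.embed(src_ids), self.embed(tgt_in_ids),
                               tgt_mask=tgt_mask)
        return self.lm_head(out)  # (batch, tgt_len, vocab)

def pretraining_loss(model, aspect_ids, masked_review_ids, pseudo_target_ids,
                     sep_id, bos_id, pad_id):
    """Teacher-forcing loss for recovering the pseudo-target from the
    concatenated input [c, [SEP], masked review]."""
    batch = aspect_ids.size(0)
    sep = torch.full((batch, 1), sep_id, dtype=torch.long, device=aspect_ids.device)
    src = torch.cat([aspect_ids, sep, masked_review_ids], dim=1)
    bos = torch.full((batch, 1), bos_id, dtype=torch.long, device=aspect_ids.device)
    tgt_in = torch.cat([bos, pseudo_target_ids[:, :-1]], dim=1)  # shifted right
    logits = model(src, tgt_in)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           pseudo_target_ids.reshape(-1), ignore_index=pad_id)
```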

Stage 2: Contrastive Fine-Tuning
To incorporate the CTR objective during generation, we make use of existing online A/B test data that reflects user preference. Specifically, we construct a dataset $D$, where each sample is a tuple (source $x$, control code $c$, positive target $y^+$, negative target $y^-$). Both $y^+$ and $y^-$ are human-written ad texts (given $x$ and $c$), while $y^+$ achieved higher CTR than $y^-$ during online A/B testing.
Next, we start by describing a vanilla fine-tuning objective that only considers $y^+$. We then introduce two contrastive fine-tuning objectives that take full advantage of the online A/B test data.
Vanilla Fine-Tuning A straightforward objective is to maximize the generation probability of the positive target $y^+$:

$$\mathcal{L}_{\text{ft}} = -\sum_{t} \log p_\Theta\big(y^+_t \mid y^+_{<t}, x, c\big).$$

Obviously, this learning objective ignores the utility of negative targets.
To enhance the model's ability to discriminate between ad texts with different CTR, we propose to expose the decoder to both positive and negative ad texts by modeling their distinctness. Specifically, we leverage the paradigm of contrastive learning, where the positive/negative target (i.e., the ad text with higher/lower CTR) is used to construct a positive/negative paired instance, and introduce two contrastive learning based objectives to fine-tune the pre-trained model.

i. Margin-based Contrastive Fine-Tuning
We first propose to directly maximize the margin between the generation probabilities of the positive target $y^+$ and the negative target $y^-$. This yields the following loss function:

$$\mathcal{L}_{\text{cont}}^{\text{margin}} = \max\big(0,\; \gamma - \log p_\Theta(y^+ \mid x, c) + \log p_\Theta(y^- \mid x, c)\big),$$

where the margin $\gamma$ is a hyperparameter. Through this loss, the optimization procedure is encouraged to widen the probability gap between ad texts with distinct CTR.
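Both the vanilla and margin-based objectives can be expressed in terms of per-sequence log-probabilities. A sketch, reusing the illustrative `model(src_ids, tgt_in_ids)` interface from the pre-training sketch above (where `src_ids` encodes $[c, \text{[SEP]}, x]$):

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(model, src_ids, tgt_ids, bos_id, pad_id):
    """Sum of token log-probabilities of a target under the model
    (teacher forcing), ignoring padding positions."""
    bos = torch.full_like(tgt_ids[:, :1], bos_id)
    logits = model(src_ids, torch.cat([bos, tgt_ids[:, :-1]], dim=1))  # (B, T, V)
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, tgt_ids.unsqueeze(-1)).squeeze(-1)    # (B, T)
    mask = (tgt_ids != pad_id).float()
    return (token_logp * mask).sum(dim=1)                              # (B,)

def margin_contrastive_loss(model, src_ids, pos_ids, neg_ids,
                            bos_id, pad_id, gamma=1.0):
    """max(0, gamma - log p(y+ | x, c) + log p(y- | x, c)), batch-averaged.
    The vanilla fine-tuning loss is simply -sequence_log_prob(...) on y+."""
    logp_pos = sequence_log_prob(model, src_ids, pos_ids, bos_id, pad_id)
    logp_neg = sequence_log_prob(model, src_ids, neg_ids, bos_id, pad_id)
    return torch.clamp(gamma - logp_pos + logp_neg, min=0.0).mean()
```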
ii. InfoNCE-based Contrastive Fine-Tuning From the perspective of representation learning, we propose a contrastive loss based on InfoNCE (Oord et al., 2018), which maximizes the similarity between the source and the positive target, and minimizes that between the source and the negative target:

$$\mathcal{L}_{\text{cont}}^{\text{InfoNCE}} = -\log \frac{\exp\big(\mathrm{sim}(x, y^+)/\tau\big)}{\exp\big(\mathrm{sim}(x, y^+)/\tau\big) + \exp\big(\mathrm{sim}(x, y^-)/\tau\big)},$$

where $\tau$ is a temperature and $\mathrm{sim}(\cdot, \cdot)$ is a similarity function over encoder and decoder representations.
We apply mean-pooling over the top layer of the encoder/decoder to obtain their representations. Let $\mathbf{h}$, $\mathbf{z}^+$ and $\mathbf{z}^-$ denote the encoder representation and the decoder representations of the positive and negative targets, respectively. We then add two fully-connected layers on the encoder and decoder sides, transforming them into the same vector space, and use an inner product to obtain the similarity scores:

$$\mathrm{sim}(x, y^\pm) = (\mathbf{W}_e \mathbf{h})^\top (\mathbf{W}_d \mathbf{z}^\pm),$$

where $\mathbf{W}_e$ and $\mathbf{W}_d$ are learnable parameters.
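A sketch of the InfoNCE-based loss under these definitions, assuming access to the top-layer encoder/decoder hidden states and padding masks; the module and argument names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoNCEContrast(nn.Module):
    """Projects mean-pooled encoder/decoder states into a shared space (W_e, W_d)
    and applies a two-candidate InfoNCE loss over the positive/negative ad texts."""
    def __init__(self, d_model=512, tau=1.0):
        super().__init__()
        self.proj_enc = nn.Linear(d_model, d_model, bias=False)  # plays the role of W_e
        self.proj_dec = nn.Linear(d_model, d_model, bias=False)  # plays the role of W_d
        self.tau = tau

    @staticmethod
    def mean_pool(states, pad_mask):
        # states: (B, T, D); pad_mask: (B, T), 1 for real tokens, 0 for padding
        m = pad_mask.unsqueeze(-1).float()
        return (states * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)

    def forward(self, enc_states, enc_mask,
                dec_pos_states, pos_mask, dec_neg_states, neg_mask):
        h = self.proj_enc(self.mean_pool(enc_states, enc_mask))            # (B, D)
        z_pos = self.proj_dec(self.mean_pool(dec_pos_states, pos_mask))
        z_neg = self.proj_dec(self.mean_pool(dec_neg_states, neg_mask))
        sim_pos = (h * z_pos).sum(-1) / self.tau                           # inner products
        sim_neg = (h * z_neg).sum(-1) / self.tau
        logits = torch.stack([sim_pos, sim_neg], dim=1)                    # (B, 2)
        labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
        return F.cross_entropy(logits, labels)  # -log softmax picking the positive
```

During fine-tuning, this term would be weighted by α and added to the cross-entropy loss on the positive target, matching the combined objective below.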
Objective The final loss of the contrastive fine-tuning stage is the sum of $\mathcal{L}_{\text{ft}}$ and the contrastive loss:

$$\mathcal{L} = \mathcal{L}_{\text{ft}} + \alpha \mathcal{L}_{\text{cont}},$$

where $\alpha$ is a trade-off hyperparameter, and $\mathcal{L}_{\text{cont}}$ can be either margin-based or InfoNCE-based.

Comparative Approaches
We compare CREATER with two types of approaches. The first type contains non-CTR-driven approaches: (1) SEGEXT (Segment extraction) employs unsupervised aspect-controlled masking ( § 3.1) to return a segment of the source as the ad text. If the returned segment is too short to display, we append its left or right neighboring segment based on the matching score.
(2) PGNET (Pointer-Generator) is an RNN-based approach with a copying mechanism (See et al., 2017); (3) C-PGNET improves PGNET by adding the control code during decoding, where it acts on the generation gate; (4) TRM (Transformer) is the state-of-the-art architecture for text generation; (5) C-TRM improves TRM by adding the control code on both the encoder and decoder sides with the help of fusion layers; (6) C-TRM-RL fine-tunes C-TRM with reinforcement learning (RL), where an extra CTR regression model (trained on D) serves as the reward estimator producing the click probability of a generated text (Hughes et al., 2019). Negative targets are used to train the reward estimator, but are not explicitly used to optimize the generation model.
The second type contains CTR-driven approaches, which exploit the negative target $y^-$ during training to explicitly incorporate CTR information: (1) QUALITYMODEL employs click behavior as a quality measure for paired samples (Wang et al., 2019). It first builds a CTR latent space to represent source and target, and then computes the cosine similarity between them as the quality score of the sample. Quality scores are used to weight the cross-entropy objective and reduce the probability of generating low-quality texts; (2) CONTRAMODEL is a variant of CREATER that removes the controlled pre-training stage; (3) BART+CONTRAMODEL performs pre-training from scratch using the self-supervised objective of BART rather than our proposed one, and then performs fine-tuning with CONTRAMODEL.

Implementation Details
Both the encoder and the decoder of CREATER contain four layers, and the dimension of hidden representations produced by each layer is set to 512. For fair comparison, all comparative approaches based on Transformer employ the same architecture. For text preprocessing, we tokenize sources and targets into word sequences, so CREATER generates ad texts at word level. We restrict the maximum input length to 128 words. The overall parameter size is 129M. At the pre-training stage, we employ the Adafactor optimizer (Shazeer and Stern, 2018) with a mini-batch size of 4096 for 10 training epochs. Models are trained on 8 Tesla V100 32GB GPUs. We implement our approach with PyTorch and Transformers.
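For reference, a minimal optimizer setup sketch; the paper only states that Adafactor is used, so the keyword arguments below are the library defaults rather than the authors' settings, and `model` stands for the illustrative network from the earlier sketch.

```python
from transformers.optimization import Adafactor

# Adafactor with its default relative-step schedule (the learning rate is then
# derived internally); the paper does not report these hyperparameters, so
# treat them as placeholders.
optimizer = Adafactor(model.parameters(), scale_parameter=True,
                      relative_step=True, warmup_init=False, lr=None)
```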
In terms of the model for extracting the aspect term set, during early experiments we found that the performance of CREATER is not sensitive to it, and thus we employ the representative model ABAE. For the matching function f(·, ·) used in aspect-controlled masking to build the pre-training data, we tried a lexical-based option (similarity of sparse TF-IDF vectors) and an embedding-based one (similarity of averaged word embeddings), and found that the performance of the fine-tuned model is not sensitive to this choice. Thus we choose the former for simplicity.
At the fine-tuning stage, we set the mini-batch size to 1024 for 20 epochs. When the margin-based contrastive loss is used, the margin parameter γ is set to 1.0; when the InfoNCE-based contrastive loss is used, the temperature parameter τ is set to 1.0. We set the trade-off hyperparameter α to 1e-3 (searched from {1e-2, 1e-3, 1e-5}). We choose the checkpoint with the lowest perplexity on the validation set as the final model. At inference time, we use beam search to generate texts, with a beam size of 5. The BLEU metric is evaluated with NLTK, and the ROUGE metric is evaluated with pyrouge. All reported results of different approaches are obtained with the same random seed.
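At inference, decoding reduces to standard beam search over the [aspect, [SEP], review] input. A hedged sketch, assuming the fine-tuned model were exported as a Hugging Face seq2seq checkpoint (the checkpoint path and [SEP] handling are hypothetical; the production model is not public):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# hypothetical checkpoint path, used only to illustrate the decoding setup
tokenizer = AutoTokenizer.from_pretrained("path/to/creater-checkpoint")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/creater-checkpoint")

def generate_ad_text(aspect: str, review: str) -> str:
    """Prepend the aspect term to the review (as in training) and decode with beam size 5."""
    inputs = tokenizer(f"{aspect} [SEP] {review}", return_tensors="pt",
                       truncation=True, max_length=128)
    output_ids = model.generate(**inputs, num_beams=5, max_length=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```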

Performance Comparison
Table 2 shows the comparison results, where we report BLEU-4 and ROUGE-1/2/L (positive targets are regarded as the gold standard). It is natural that approaches considering aspect terms outperform those that do not perform controlling. CTR-driven approaches usually outperform non-CTR-driven ones, demonstrating that exposing the model to both positive and negative targets improves generation quality. QUALITYMODEL and CONTRAMODEL represent two paradigms for incorporating CTR information. CONTRAMODEL is superior to QUALITYMODEL, which indicates that directly modeling the distinctness as an auxiliary objective is more effective than weighting the original loss.
BART+CONTRAMODEL performs better than CONTRAMODEL by adding a pre-training stage. CREATER uses a customized controlled pre-training objective and achieves the best results. This verifies that designing a suitable self-supervised objective is crucial for improving generation.

Analysis of Contrastive Fine-Tuning CREATER exposes the model to both positive and negative targets to incorporate CTR information. Table 4 compares the two contrastive objectives. Both with and without pre-training, the best-performing model is based on contrastive learning. An interesting observation is that with pre-training, the InfoNCE-based model achieves the best performance, while the margin-based model outperforms the other variants when the model is not pre-trained. We suggest that because the InfoNCE-based loss is designed from the perspective of representation learning, and pre-training provides better text representations than training from scratch, the utility of the InfoNCE-based model is highlighted in that setting.

Human Evaluation
Each ad text is measured from three views: grammaticality, informativeness (whether its content reflects the key points of the aspect term and the source) and suitability (whether it is suitable for display). Each view is rated from 1 to 5 (5 is the best). We randomly choose fifty samples and invite three human judges.
Table 5 shows that CREATER performs well on most views and achieves the best ranking among the four comparative approaches, demonstrating its ability to generate fluent, informative and suitable ad texts. We found that the reason the informativeness and suitability of CREATER are not as high as those of human-written texts is that the faithfulness of generated texts is not always ideal. We leave this improvement to future work.
Figure 1: An illustration that shows the creative of an online advertisement in news feed on mobile.

Figure 3: Results with limited fine-tuning data. Dashed lines are the two strongest baselines trained on the whole data.

Table 1 :
Statistics of the datasets used in our experiments. "Avg. length" means the average number of characters in a sequence (review or ad text).
Each sample in D contains a user review, an aspect term, and a pair of human-written ad texts (positive ad text $y^+$, negative ad text $y^-$) in Chinese, collected through online A/B tests. Overall, the user reviews are ensured to be of high quality based on rules and filtering models. Each ad text is written by human editors given the review and aspect term, covering 4,047 advertisers. More details about data preprocessing and filtering can be found in Appendix A.1. We also build a large-scale review corpus $D_x$ for constructing the pre-training dataset via aspect-controlled masking ( § 3.1). Table 1 lists the statistics. We split D with a 7:1:2 ratio to obtain the training/development/test sets.

Table 2 :
Main results. "RG" stands for ROUGE. Both BLEU and ROUGE scores are multiplied by 100.

Table 3 :
Comparison of pre-training objectives.

Table 4 :
Comparison of contrastive learning objectives.