MediaHG: Rethinking Eye-catchy Features in Social Media Headline Generation

An attractive blog headline on social media platforms can immediately grab readers and trigger more clicks. However, a good headline should not only condense the main content but also be eye-catchy with domain platform features, which are determined by the website's users and objectives. With effective headlines, bloggers can obtain more site traffic and profits, while readers gain easier access to topics of interest. In this paper, we propose a disentanglement-based headline generation model, MediaHG (Social Media Headline Generation), which balances content and contextual features. Specifically, we first devise a sampling module for various document views and generate the corresponding headline candidates. Then, we incorporate contrastive learning and auxiliary multi-task learning to choose the best domain-suitable headline according to the disentangled budgets. Moreover, our separated processing provides more flexible adaptation to other headline generation tasks with special domain features. Our model is built from the content and headlines of 70k hot posts collected from REDBook, a Chinese social media platform for daily sharing. Experimental results with the ROUGE metric and human evaluation show improvement in the headline generation task for the platform.


Introduction
Nowadays, amid the massive flow of information on large-scale social network sites (such as Facebook and Instagram), users find it increasingly difficult to quickly obtain the information they want. As a result, people tend to focus on more niche platforms, which are often made up of groups of users with common personal interests. These vertical platforms not only make information search easier but also strengthen communities of users with the same interests. Users' attention is limited, so attractive headlines must catch their eyes at first glimpse. As the headline condenses the main topic into a concise and appealing description, a good headline can trigger a high click rate. Generating better headlines is thus significant for media platforms competing for users' limited attention and delivering better user experiences.
We conduct our research on a vertical Chinese social media platform, REDBook (Figure 2), because it is widely praised by hundreds of millions of users and targets a group with shared interests. As more than 70% of its users are female (officially reported data), the topics and tone of posts are feminine, with typical words such as "Babycare" and "Makeup". The platform breaks down identity restrictions and allows people to share their colorful lives and experiences, which are of reference value to users. We define the eye-catchy features of posts as their ability to attract users, which can be intuitively measured by the number of "likes", that is, the heat of posts on the platform. Producers obtain more traffic and profits as they receive more likes, and may then be subscribed to by more fans. The advertising revenue of bloggers is usually closely related to both the number and the profile of their followers. So only with the help of good headlines can bloggers attract their target user group.
By analyzing eye-catchy REDBook headlines with many likes (more than 2k), we find that both content and style influence attractiveness. Since the majority of users are women, topics of interest to women occupy a large part of the platform. Accordingly, headlines with domain-specific topics such as "Lipstick swatches" or "ootd" ("Outfit of the Day") are more appealing and receive more likes. Combined with hot topics, the style of the headline also has a great impact on eye-catchy features. For example, on the hot topic of lipstick sharing, the headline "New Lipsticks for Winter So Tender like a Creamy Almond Peach!!" ("冬季口红新品～奶茸茸的杏仁桃子好温柔" in Chinese) won over 13,000 likes, while another headline on the same topic with a plain description, "Lipstick Share, a New Style for Winter", got only 300 likes. We also find that the eye-catchy style of headlines on the platform is not very similar to that of news media headlines; it more closely resembles the tone of women talking and sharing with their friends.
Turning to related work, we found that recent research simply regards the headline generation task as a typical summarization task (Shu et al., 2018), focusing only on content parallel to the given reference summary and ignoring the domain eye-catchy features. Attractive headline generation has received less attention. A recent clickbait study (Xu et al., 2019) leverages adversarial training and an attractiveness-score module to guide the summarization process. Another (Jin et al., 2020) introduces a novel parameter-sharing scheme to disentangle the attractive style from the text. However, these previous works concentrate only on style and neglect the importance of content, which also matters in eye-catchy headline generation. A disentanglement module has been devised to divide style and content into latent spaces, but a style encoder in generator training is not flexible enough (Li et al., 2021).
To address this headline generation issue, we propose the MediaHG model, which disentangles the eye-catchy features as additional requirements in sequence-to-sequence training. The model is composed of a headline-candidate generator and an eye-catchy headline selector. In our setting, the neural abstractive model is responsible for headline generation, capturing the main topic of the input document, while the selection module with constraints encourages the adherence of generated headlines to domain eye-catchy features. Instead of confounding the eye-catchy features, we treat content feature and style feature extraction separately. In the generation period, we devise a random sampling module over different parts of the text and generate candidate headlines corresponding to the content. During the selection period, we leverage ranking-based contrastive learning (Hopkins and May, 2011; Zhong et al., 2020; Liu et al., 2021) and multi-task learning (Luong et al., 2015) to select the best headlines among the candidates. The selection is decided by coordinated quality scores of style-content attractiveness; we describe the specific quality metric model in detail below. Candidates are thus assigned probabilities according to their quality, which further influences the generation model. In other words, the headline generation model not only generates output headlines autoregressively but also estimates a probability distribution over candidate headlines.
Our main contributions are listed below:
• We propose a new headline generation model, MediaHG, to generate topic-catchy and contextually harmonized headlines for vertical niche platforms, enhancing the click ratio and drawing users' attention. While we base our experiments on a typical vertical platform, REDBook, our methods can be adapted to other platforms through the same platform-suitable feature extraction methods.
• Our model is shown to be effective by both automatic and human evaluation scores of fluency, consistency, and attractiveness, which means it achieves a style-content dual balance.
• To the best of our knowledge, it is the first research to focus on vertical interest platforms.
We also give a new definition of domain eye-catchy headlines, that is, attractive combinations of topic and style suitable to the platform users.

Related Work

Headline Generation Headline generation has mostly been treated as summarization with length control, such as length-controlled generation (Takase and Okazaki, 2019), a length-aware attention mechanism (Liu et al., 2022), and a length constraint optimization (Makino et al., 2019). Content guidance in GSum (Dou et al., 2020) is used as the input for its sequence-to-sequence model, and shifts in the guidance distribution would require further training. Attractive headline generation has received less attention from researchers. A sensation scorer (Xu et al., 2019) is designed to judge whether a headline is attractive and to guide headline generation by reinforcement learning. A parameter-sharing scheme (Jin et al., 2020) is introduced to further extract style from text. The style-content duality is considered with a VAE (Variational AutoEncoder) as a feature extractor (Li et al., 2021) and two disentangled space constraints in parallel tasks. In contrast, MediaHG allows flexible shifts in various guidance without expensive retraining.

Disentanglement Disentangling neural networks' latent space has been explored in the computer vision domain to factorize image features such as rotation and color (Chen et al., 2016; Higgins et al., 2017; Luan et al., 2017). Compared to computer vision, NLP tasks mainly treat sentiment as a salient style and focus on invariant representation learning, for instance controlling sentiment by training a discriminator (Hu et al., 2017). Disentangled representation learning has since been widely adopted in non-parallel text style transfer; for example, separate training with style-specific decoders and style embeddings is proposed by Fu et al. (2018). Some work also focuses on disentangling syntactic and semantic representations in text: VGVAE (Chen et al., 2019a) trains a generative model with multiple losses that exploit aligned paraphrastic sentences and word-order information to obtain better syntax and semantics representations. We utilize the core principle of disentanglement to separate different feature budgets.
Reranking Candidates Recent conditional generation work explores the idea of reranking candidates along different dimensions (Wan et al., 2015; Mizumoto and Matsumoto, 2016). Different search methods have been used in neural summarization models, such as greedy search in FactorSum (Fonseca et al., 2022) and beam search (Vijayakumar et al., 2016) in SimCLS (Liu and Liu, 2021), according to a learned evaluation function.
The Perturb-and-Select summarizer (Oved and Levy, 2021) performs random perturbations and uses similar ideas to generate candidates ranked according to a coherence model. Unlike SimCLS, which considers only intrinsic importance relative to the original document, our work considers both content and contextual eye-catchy budgets.

Method

In this section, we describe our approach in detail, as shown in Figure 3. Inspired by FactorSum (Fonseca et al., 2022), we treat the content importance model as sampling document views (intrinsic importance) and contextual features as additional budgets (extrinsic importance). We pre-train the headline generation model on our dataset REDBook (Table 1) to generate candidates with intrinsic content and latent features in Sec 3.1. Then, a specific metric M is employed to evaluate the effectiveness of different criteria, composed of scores from both content and domain style factors in Sec 3.2. To assign higher probabilities to a more suitable candidate, we use contrastive learning for better re-ranking. The construction of the metric M, optimized with a multi-task loss, is described in Sec 3.3.

Generate Candidates
The headline candidates are generated in two steps: document-part sampling and corresponding headline generation. The candidate document views are generated from different random samples of article parts. We hypothesize that the main topic of short posts (limited to 1000 Chinese characters) can be extracted from various sampled incomplete parts. We further assume that using samples allows the sequence-to-sequence model to focus on concise and appealing topics.
To generate multiple views for the same document, we implement the following steps:
• From a document D, we first split the sentences and generate a random sample collection of sentence subsets, called document views S_v.
• The number of sentences in each document view in S_v is controlled by the sampling parameter s_f.
For each document in RED-IN (Table 1), we repeat the sampling method and the headline generation step. When dealing with datasets containing documents and titles of various lengths, the hyperparameters may need to be tuned. According to the basic REDBook platform format restrictions, the lengths of the document and the title are limited to 1000 characters and 20 characters, respectively. We therefore choose k candidate document views, each containing s_f = 2/3 of the document's sentences. Different choices of k are discussed in the ablation study.
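The sampling steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the naive sentence split on the Chinese full stop, and the fixed random seed are our own assumptions.

```python
import random

def sample_document_views(document, k=5, s_f=2/3, seed=0):
    """Sample k incomplete document views, each keeping a fraction s_f
    of the document's sentences, with the original order preserved."""
    rng = random.Random(seed)
    # Naive sentence split on the Chinese full stop; real posts would
    # need a proper sentence segmenter.
    sentences = [s for s in document.split("。") if s]
    n_keep = max(1, round(len(sentences) * s_f))
    views = []
    for _ in range(k):
        idx = sorted(rng.sample(range(len(sentences)), n_keep))
        views.append("。".join(sentences[i] for i in idx))
    return views
```

Each view is then fed to the generator as if it were a full document, so k views yield k candidate headlines.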
Powerful sequence-to-sequence PLMs such as PEGASUS (Zhang et al., 2020) and BART (Lewis et al., 2019) are trained to estimate the probability of a token sequence by minimizing cross-entropy with respect to the data distribution. We hypothesize that these models generate good candidates that fulfill the content importance and latent style feature objectives described below.
Learning Maximum likelihood estimation (MLE) is the standard training algorithm. Given the training dataset X′ consisting of hot-post documents D^(i) and reference headlines H^(i), the loss is defined as the negative log-likelihood:

L_mle = -Σ_i log p_θ(H^(i) | D^(i))    (1)

where p_θ(H^(i) | D^(i)) is a distribution over the possible headlines H (Lewis et al., 2019).
For a specific sample (D^(i), H*^(i)), Eq. 1 is equivalent to minimizing the sum of the negative log-likelihoods of the tokens in the reference headline H* of length l, through the cross-entropy loss:

L_xent = -Σ_{j=1}^{l} Σ_h p_true(h | D, H*_{<j}) log p_θ(h | D, H*_{<j})    (2)

where H*_{<j} denotes the partial reference headline h*_0, ..., h*_{j-1}. p_true is defined as a one-hot distribution under the standard MLE framework:

p_true(h | D, H*_{<j}) = 1 if h = h*_j, and 0 otherwise    (3)

During the learning stage, we find the parameters θ* that minimize the loss above. Since the model is trained on a dataset with confounded features, the results are generated with both content and latent features.
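The teacher-forced, token-level loss can be illustrated with a small self-contained sketch. The dictionary-based representation of the model's per-step distributions is a simplification we introduce for illustration; a real model would produce these distributions from D and the reference prefix.

```python
import math

def token_nll(step_probs, reference):
    """Sum of per-token negative log-likelihoods, teacher-forced on the
    reference headline: step_probs[j] maps each vocabulary token to the
    model's probability at step j, conditioned on the reference prefix."""
    return -sum(math.log(step[tok]) for step, tok in zip(step_probs, reference))
```

With a one-hot p_true, the inner sum over the vocabulary in Eq. 2 reduces to exactly this sum of reference-token log-probabilities.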
Inference During the inference stage, the abstractive model g generates candidate headlines in an autoregressive manner. It is intractable to enumerate all possible candidate outputs, so methods such as beam search decoding (Sutskever et al., 2014) are used to reduce the search space.
Estimating the probability of the next token h_t is the key step during the search:

p_θ(h_t | D, H_{<t})    (4)

which differs from Eq. 3 in conditioning on the model's own previous predictions H_{<t} instead of the reference prefix H*_{<t}.
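Beam search itself can be sketched generically. This is a toy illustration: step_fn stands in for the abstractive model's next-token log-probability distribution and is our own abstraction, not an interface from the paper.

```python
import heapq

def beam_search(step_fn, beam_size, max_len, bos="<s>", eos="</s>"):
    """Minimal beam search: step_fn(prefix) returns a dict mapping each
    next token to its log-probability given the prefix generated so far.
    Returns (score, sequence) pairs, best first."""
    beams = [(0.0, [bos])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:          # finished hypotheses are kept as-is
                candidates.append((score, seq))
                continue
            for tok, lp in step_fn(tuple(seq)).items():
                candidates.append((score + lp, seq + [tok]))
        beams = heapq.nlargest(beam_size, candidates, key=lambda x: x[0])
    return beams
```

The surviving beams supply the candidate pool that the selection module re-ranks in the next section.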

Coordinating Headline Selection
Eq. 4 implies that the headline generation model g should assign a higher estimated probability to the better candidate headline during inference. However, this intuition is not directly captured by the standard MLE objective used in training: MLE provides no ordering over imperfect candidates, which leads to the existence of multiple plausible generations (Khayrallah et al., 2020). Therefore, we propose that the probability of a candidate should be well correlated with its quality as evaluated by an automatic feature metric M. Since it is intractable to enumerate all possible candidate outputs, we only require an accurate prediction of the ranking order of the most probable candidate headlines found via beam search (see Appendix A).
We use label smoothing (Szegedy et al., 2016) and maintain the general functional form of Eq. 3, but assign a total probability mass β to the non-reference candidates:

p_true(h | D, H*_{<j}) = 1 - β if h = h*_j, and β/(N - 1) otherwise    (5)

where N is the vocabulary size. Additionally, we encourage the coordination of probabilities and qualities among headline candidates by contrastive learning (Eq. 7). The candidate quality measure M in our work is defined with two scores: a content score and a style score, responsible for topic extraction and contextual media style features, respectively.
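The label-smoothed target distribution described above can be written out directly. A minimal sketch; the function name and the explicit vocabulary list are our own illustrative choices.

```python
def label_smoothed_targets(vocab, reference_token, beta=0.1):
    """Smoothed target distribution: the reference token keeps 1 - beta
    of the probability mass, and beta is spread uniformly over the
    remaining vocabulary tokens."""
    n_other = len(vocab) - 1
    return {tok: (1 - beta) if tok == reference_token else beta / n_other
            for tok in vocab}
```

Replacing the one-hot p_true with this distribution inside the cross-entropy sum yields the smoothed training objective.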

M(H_i, H*) = α · m_c(H_i, H*) + (1 - α) · m_s(H_i)    (6)

where the content score m_c measures the topic components of a candidate headline H_i extracted from the document, and the style score m_s measures the contextual latent features. We fine-tune the model with a contrastive loss (Hopkins and May, 2011; Zhong et al., 2020) that encourages the model to assign higher probabilities to more suitable candidates:

L_ctr = Σ_i Σ_{j>i} max(0, p_θ(H_j | D) - p_θ(H_i | D) + λ_ij)    (7)

where H_i and H_j are two different candidate headlines with M(H_i, H*) > M(H_j, H*), ∀i, j, i < j under the metric M, and λ_ij is the margin multiplied by the difference in rank between the candidates, i.e., λ_ij = (j − i) · λ.
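The pairwise ranking loss can be sketched as follows. This is an illustrative simplification: we pass in raw candidate scores in place of the model's (possibly length-normalized) log-probabilities, and the function name is our own.

```python
def ranking_contrastive_loss(log_probs, lam=0.01):
    """Margin ranking loss over candidates ordered by descending metric M:
    log_probs[i] is the model's score for the i-th best candidate. For
    each pair i < j, penalize the lower-ranked candidate scoring within
    the rank-scaled margin lam * (j - i) of the higher-ranked one."""
    loss = 0.0
    n = len(log_probs)
    for i in range(n):
        for j in range(i + 1, n):
            loss += max(0.0, log_probs[j] - log_probs[i] + (j - i) * lam)
    return loss
```

When the model's scores already respect the metric's ranking by at least the margin, every pairwise term is zero and the loss vanishes.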
Following multi-task fine-tuning (Edunov et al., 2017), we combine the contrastive (Eq. 7) and cross-entropy (Eq. 2) losses to preserve the generation ability of the pre-trained abstractive model:

L = L_xent + γ · L_ctr    (8)

where γ is the weight of the contrastive loss. We note that the contrastive and the cross-entropy losses effectively complement each other. Since the contrastive loss is defined on the eye-catchy features, the token-level cross-entropy loss serves as a normalization term ensuring a content-style balanced probability assignment. The resulting optimization loss can be used in the two-stage summarization pipeline.

Disentangled Space Constraint
The disentanglement scores framework shown in Figure 3 consists of content (intrinsic) constraints and appealing style (extrinsic) constraints.
Content Space Constraint As the style-oriented loss above already imposes constraints on the style information, this part discusses the content space constraint. Unlike the style constraint design, it is hard to find parallel sentences with the same content but different styles. Previous work DAHG (Li et al., 2021) used the prototype document and its most similar document to improve the classifier; however, the similarity precision is not clearly defined. The bag-of-words (BOW) method has been proposed to approximate content information for disentanglement in document style transfer tasks (John et al., 2018), but our generation objectives are concise headlines.
Inspired by the original BOW method, we use ROUGE (Lin, 2004) scores to measure the overlap of the main content.
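A unigram-overlap score in the spirit of ROUGE-1 can be sketched in a few lines. This is a simplified F1 over token multisets, not the official ROUGE implementation with stemming and bootstrapping.

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """ROUGE-1-style F1: unigram overlap between candidate and reference
    token lists, combining overlap precision and recall."""
    c, r = Counter(candidate), Counter(reference)
    overlap = sum((c & r).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    prec = overlap / sum(c.values())
    rec = overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)
```

A score of this kind can serve as the content component m_c when comparing a candidate headline against the document's main topic words.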
Style Space Constraint We design a multi-task loss that ensures the style information is contained in the space S. Although our dataset for style extraction is non-parallel, we assume that each sentence is labeled with its style (with domain eye-catchy features or not). We select eye-catchy headlines from the platform and plain sentences from other corpora to train the style classifier. Following previous work (Hu et al., 2017; Shen et al., 2017; Fu et al., 2018; Zhao et al., 2018), we treat each sentence as having a binary style tag (positive or negative).
To disentangle the style information, two headlines H_p and H_n with different labels are selected as the two inputs to the classifier. The headlines are embedded with the same matrix to obtain the representations h_p and h_n of H_p and H_n, respectively. A two-way softmax layer (equivalent to logistic regression) is applied to the style vector s:

y_s = softmax(W_ss · s + b_ss)    (9)

where θ_mul(s) = [W_ss; b_ss] are the parameters for multi-task learning of style, and y_s is the output of the softmax layer. The classifier is trained with a cross-entropy loss against the ground-truth distribution c_s(·):

L_style = -Σ c_s(·) log y_s(·)    (10)

This optimization can be viewed as a multi-task learning loss: it not only reconstructs the sentence but also predicts the possible style (Luong et al., 2015; John et al., 2018; Balikas et al., 2017).
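The two-way softmax layer can be written out explicitly. A minimal numerically-stable sketch with plain Python lists; the argument layout (a 2 x d weight matrix and a length-2 bias) is our own illustrative choice.

```python
import math

def style_softmax(h, W, b):
    """Two-way softmax over a style vector h: W is a 2 x d weight matrix,
    b a length-2 bias; returns the probability of each style tag."""
    logits = [sum(wi * hi for wi, hi in zip(row, h)) + bi
              for row, bi in zip(W, b)]
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]
```

With two output classes this is exactly logistic regression on the style representation, as the text notes.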

Dataset
We collect the dataset from the social media platform REDBook, which hosts plenty of life-sharing "hot post" records. As long as the content and headline are compelling enough, a post gets more exposure and attracts more followers. "Likes" measure a post's popularity and serve as proof of eye-catchy quality. We therefore filter 70k posts with more than 2k "likes" (Table 1) for the headline-generator training period. To extract global contextual features, we select the content and corresponding headlines of hot posts from different bloggers, to avoid personal-style influence. We randomly divide the REDBook dataset into training, validation, and test sets. For inference and the best-candidate selection task, we randomly choose 20k posts (RED-IN). Our dataset contains hot posts published during 2021; time is not an influencing factor, as the like counts have already accumulated.

Baselines
We select the related Seq2Seq summarization methods BART (Lewis et al., 2019) and PEGASUS (Zhang et al., 2020) as large pre-trained baselines standard in the literature. The tokenizers of BART and PEGASUS are also well established for Chinese datasets.
Implementation Details In the following experiments, we use either BART or PEGASUS as the backbone. We label our proposed method MediaHG with several variants: (1) MediaHG-BA is fine-tuned with eye-catchy features based on BART. (2) MediaHG(-PG) is fine-tuned with eye-catchy features based on PEGASUS. We also vary the eye-catchy feature influence: (3) MediaHG-c uses the content constraint only and MediaHG-s uses the style constraint only. The number of sampling times k also greatly influences the results, so we set (4) MediaHG-m with k = 10.

Experiment Settings
Consistent with the platform requirements, we set the maximum target title length to 20 characters for all models. According to the average lengths of documents and titles, we set the max length of the tokenizer to 512. The encoders and decoders of all Seq2Seq models use the same parameters as BART (Lewis et al., 2019) and PEGASUS (Zhang et al., 2020). As discussed in Sec 3.1, the number of sentences in the document views S_v is controlled by the sampling factor s_f = 2/3. The other hyperparameter k, controlling the number of samples |S_v|, is set to 5 and 10. During the inference period, where content and style budgets select the best headline, the eye-catchy feature budgets are set with α = 0.5 and γ = 0.1, respectively.
BLEU: To evaluate our model more comprehensively, we also use the BLEU metric (Papineni et al., 2002), which measures word overlap between the generated text and the ground truth.
Human Evaluation: As a single automatic metric can be misleading (Schluter, 2017), we add human evaluation to our work. We randomly sample 500 cases from the test set and ask three familiar and loyal REDBook users to score the headlines generated by BART, PEGASUS, and MediaHG. Reflecting the gender and age distribution of REDBook users, the reviewers consist of one man and two women, all about 30 years old.

Results
Overall Performance We compare our model with the baselines in Table 2. Firstly, PEGASUS still outperforms BART, which suggests our task needs somewhat more abstractive summarization. Secondly, our model achieves 21.46, 7.79, 19.05, and 11.26 in terms of ROUGE-1, ROUGE-2, ROUGE-L, and BLEU respectively, outperforming both PEGASUS and BART, which demonstrates the superiority of our model. Besides, MediaHG outperforms MediaHG-BA on all metrics. An example of headlines generated by BART, PEGASUS, and our model MediaHG can be found in Appendix B.
We also add MediaHG-c and MediaHG-s to observe the influence of the eye-catchy feature sets. MediaHG-c achieves better automatic scores than MediaHG-s, which means the content budget impacts the main topic extraction of the headline. The disparity between MediaHG and MediaHG-c illustrates the style-content duality of a good eye-catchy headline.
The MediaHG-m model shares the same parameters as MediaHG except for the number of sampling times k. The outperformance of MediaHG-m also confirms the importance of sampling. However, as the number of samples increases, experiment time increases correspondingly.
The human evaluation covers 3 aspects: sentence fluency, content faithfulness, and contextual eye-catchy requirements. The rating score for each model ranges from 1 to 3, with 3 being the best. Additionally, headlines with domain features, such as a feminine tone, receive better attractiveness scores.

Analysis
We further analyze the contribution of different module parts from diverse perspectives to gain more insights into our method.
Coefficients of the Losses The multi-task loss combines the cross-entropy loss and the contrastive loss. To study the influence of the contrastive learning module, we train our model with different contrastive learning coefficients γ. As the cross-entropy loss is necessary to predict sequential tokens and preserve the generation ability of the model, we only change the values of γ (shown in Figure 4). When γ is smaller than 0.1, a larger γ yields better performance; when γ is bigger than 0.1, a smaller γ leads to better performance. When γ is small, the contrastive learning module has a small positive impact on the whole model, so the results improve as it grows. As γ increases further, the contrastive term impacts the whole model too much and disturbs the training of headline generation.

Generation-Finetune as a Loop As our inference selection results are style-content dual, a new set of candidates can be generated in the same way as the pre-trained model, dynamically and continuously. Table 4 illustrates the effectiveness of this loop operation and demonstrates our method's potential for further improvement in headline generation.
We also compare against the no-sampling setting s_f = 1, i.e., using the full document. The results indicate the necessity of sampling for extracting the main topic.
The other hyperparameter, k, controlling the number of samples |S_v|, is varied from 4 to 10 according to actual needs. We see a rapid increase in scores from 4 to 5 and then a slower increase. As k rises, more candidates are given to the selector, so the scores increase accordingly. Given the dataset length features (Table 1), we set the maximum k to 10 to create varied but non-redundant results. At the same time, the experimental time increases considerably at this higher complexity.

Conclusion and Future
In this paper, we propose an eye-catchy headline generation model, MediaHG, for vertical-interest social media platforms. Our research is the first to focus on vertical-interest websites. As people's interests flourish with the information gap broken down, more websites will be designed to appeal to shared-interest groups. Our design allows the feature-extractor approach to be used flexibly with other websites' data. Both automatic and human evaluation show our improvement in headline generation.

Limitations
When dealing with texts of different lengths, selecting parts and generating headlines may result in redundant, similar candidates or insufficient information. It is therefore necessary to select appropriate model parameters according to the characteristics of the posts.

Figure 1: Comparison and improvement over previous work. (a) Previous Seq2Seq models input the whole document to generate a headline, resulting in the omission of the main topic and latent features. (b) In contrast, MediaHG samples document views to capture the main topic and selects the best headline with eye-catchy budgets. Moreover, the eye-catchy feature extractor is disentangled into content and style to achieve a duality balance.

Figure 2: REDBook homepage and a specific hot-post page. The layout of the browsing page shows only the posts' covers and titles, thus emphasizing the importance of the title.

Figure 3: Overview of MediaHG. We divide our model into 3 parts: a Sample Module for headline-candidate generation; a Headline Generator trained and run on different datasets; and Features Reranking to select the domain-best title according to the disentangled style and content budgets.

Figure 4: Model performance with different γ coefficients weighting the contrastive loss (Eq. 7). The number of samples |S_v| is controlled by the hyperparameter k.

Table 1: The different REDBook datasets used for headline-generation training and budget-constraint inference, respectively.
Table 3 lists the average scores of each model, demonstrating that MediaHG outperforms other baseline models.

Table 3: Fluency (Flu), consistency (Con), and attractiveness (Attr) comparison by human evaluation. Each value is the average score of the three labelers' results.