PP-Rec: News Recommendation with Personalized User Interest and Time-aware News Popularity

Personalized news recommendation methods are widely used in online news services. These methods usually recommend news based on the matching between news content and user interest inferred from historical behaviors. However, they usually have difficulty making accurate recommendations to cold-start users, and tend to recommend news similar to what users have already read. In general, popular news usually contain important information and can attract users with different interests. Besides, they are usually diverse in content and topic. Thus, in this paper we propose to incorporate news popularity information to alleviate the cold-start and diversity problems of personalized news recommendation. In our method, the ranking score for recommending a candidate news to a target user is the combination of a personalized matching score and a news popularity score. The former is used to capture the personalized user interest in news. The latter is used to measure the time-aware popularity of candidate news, which is predicted based on news content, recency, and real-time CTR using a unified framework. Besides, we propose a popularity-aware user encoder to eliminate the popularity bias in user behaviors for accurate interest modeling. Experiments on two real-world datasets show that our method can effectively improve the accuracy and diversity of news recommendation.


Introduction
Personalized news recommendation is a useful technique to help users alleviate information overload when visiting online news platforms (Wu et al., 2020d,b, 2021; Ge et al., 2020). Existing personalized news recommendation methods usually recommend news to a target user based on the matching between the content of candidate news and user interest inferred from previous behaviors (Zhu et al., 2019; Wu et al., 2019f). For example, Wu et al. (2019e) proposed to model news content from news titles based on multi-head self-attention. In addition, they modeled user interest from the previously clicked news articles with multi-head self-attention to capture the relatedness between different behaviors. Another approach uses a CNN to learn news embeddings from news titles and categories, and models both long-term and short-term user interests from news click behaviors. However, these personalized news recommendation methods usually have difficulty making accurate recommendations to cold-start users, since the behaviors of these users are very sparse and it is difficult to model their interest (Trevisiol et al., 2014). Besides, these methods tend to recommend news similar to what users have already read (Nguyen et al., 2014), which may hurt user experience and does not help users receive new information.
The motivation for this work is that popular news usually convey important information such as catastrophes, epidemics, presidential election and so on, as shown in Fig. 1. These popular news can attract many users to read and discuss even if they have different personal interest (Yang, 2016). In addition, popular news are diverse in content and can cover many different topics (Houidi et al., 2019). Thus, incorporating popular news has the potential to alleviate the cold-start and diversity problems in personalized news recommendation.
In this paper, we propose a new method named PP-Rec for news recommendation¹, which can consider not only personalized user interest in news but also the popularity of candidate news. In our method, the ranking score of recommending a candidate news to a target user is the combination of a personalized matching score and a news popularity score. The personalized matching score is used to measure personal user interest in the content of candidate news. The news popularity score is used to measure the time-aware popularity of candidate news. Since news popularity is influenced by many different factors such as content and freshness, we propose a unified model to predict time-aware news popularity based on news content, recency, and near real-time click-through rate (CTR). These two scores are combined via a personalized aggregator for news ranking, which can capture the personalized preferences of different users for popular news. Moreover, we propose a knowledge-aware news encoder to generate news content embeddings from both news texts and entities. Besides, since news popularity can affect users' click behaviors (Zheng et al., 2010) and lead to bias in behavior-based user interest modeling, we propose a popularity-aware user encoder which can consider the popularity bias in user behaviors and learn a more accurate user interest representation. Extensive experiments on two real-world datasets show that PP-Rec can effectively improve the performance of news recommendation in terms of both accuracy and diversity.

Personalized News Recommendation
Personalized news recommendation is widely used in online news platforms (Liu et al., 2010; Bansal et al., 2015; Wu et al., 2020d,c, 2019d). Existing personalized news recommendation methods usually rank candidate news for a target user based on the matching between news content and user interest (Wang et al., 2018; Wu et al., 2020a, 2019c). For example, Okura et al. (2017) learned news embeddings from news bodies via an auto-encoder and modeled user interests from the clicked news via a GRU network. The matching between news and user is formulated as the dot product of their embeddings. Wu et al. (2019e) used multi-head self-attention networks to generate news content embeddings from news titles and generate user interest embeddings from clicked news. They also used the dot product of user and news embeddings as personalized matching scores for news ranking. These personalized news recommendation methods usually model user interests from previous news click behaviors. However, it is difficult for these methods to make accurate recommendations to cold-start users whose behaviors are very sparse (Trevisiol et al., 2014). These users are very common in online news platforms, making the cold-start problem a critical issue in real systems (Sedhain et al., 2014). Although some methods were proposed to alleviate the cold-start problem in personalized recommendation (Sedhain et al., 2014; Trevisiol et al., 2014), they usually utilized side information (Son, 2016) such as social networks (Lin et al., 2014) to enhance user interest modeling. However, the side information used in these methods may be unavailable in news recommendation. In addition, these personalized methods tend to recommend news similar to what users have already read, which makes it difficult for users to receive new information and may hurt their news reading experience (Nguyen et al., 2014; Wu et al., 2019f). Different from these methods, in PP-Rec we consider not only users' personal interest in news but also the popularity of candidate news, which can alleviate both cold-start and diversity problems to some extent.

¹ https://github.com/JulySinceAndrew/PP-Rec

Popularity-based News Recommendation
Our work is also related to popularity-based news recommendation methods. Different from personalized news recommendation methods, which rank candidate news based on users' personal interests, popularity-based news recommendation methods rank candidate news based on their popularity (Phelan et al., 2009; Tatar et al., 2014; Lerman and Hogg, 2010; Szabo and Huberman, 2010; Jonnalagedda et al., 2016). A core problem in popularity-based news recommendation methods is how to estimate the popularity of candidate news accurately. Most existing methods estimated news popularity based on the statistics of users' interactions with news on online news platforms, such as the number of views and comments (Yang, 2016; Tatar et al., 2014; Lee et al., 2010). For example, Yang (2016) proposed to use the frequency of views to measure news popularity. Tatar et al. (2014) proposed to predict news popularity based on the number of comments on news via a linear model. Li et al. (2011) proposed to use the number of clicks on news to model their popularity and further adjust the ranking of news with the same topics based on their popularity. However, different news usually differ significantly in impression opportunities, and these view and comment numbers are biased by the number of impressions. Different from these methods, we use CTR to model news popularity, which can eliminate the impression bias. Besides CTR, we also incorporate the content and recency information of candidate news to predict the popularity of candidate news in a more comprehensive and time-aware manner.
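To make the impression-bias argument concrete, the following is a small illustrative sketch. The two news items and their counts are invented for illustration only and do not come from our datasets. Ranking by raw click counts favors the heavily exposed item, whereas ranking by CTR normalizes by impressions.

```python
def rank_by(news_stats, key):
    """Return news ids sorted by a popularity measure, most popular first."""
    ordered = sorted(news_stats.items(), key=lambda kv: key(kv[1]), reverse=True)
    return [news_id for news_id, _ in ordered]

# Hypothetical statistics: news "A" was shown far more often than news "B".
stats = {
    "A": {"impressions": 10000, "clicks": 200},  # CTR = 0.02
    "B": {"impressions": 500, "clicks": 50},     # CTR = 0.10
}

by_clicks = rank_by(stats, lambda s: s["clicks"])
by_ctr = rank_by(stats, lambda s: s["clicks"] / s["impressions"])
```

Here the click-count ranking puts "A" first only because it received twenty times more impressions, while the CTR ranking puts "B" first.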

Methodology
In this section, we introduce PP-Rec for news recommendation which can consider both the personal interest of users and the popularity of candidate news. First, we introduce the overall framework of PP-Rec, as shown in Fig. 2. Then we introduce the details of each module in PP-Rec, which are shown in Figs. 3, 4 and 5.

Framework of PP-Rec
In PP-Rec, the ranking score of recommending a candidate news to a target user is the combination of a personalized matching score $s_m$ and a news popularity score $s_p$. The personalized matching score is used to measure the user's personal interest in the content of candidate news, and is predicted based on the relevance between news content embedding and user interest embedding. The news content embedding is generated by a knowledge-aware news encoder from both news texts and entities. The user interest embedding is generated by a popularity-aware user encoder from the content of clicked news as well as their popularity. The news popularity score is used to measure the time-aware popularity of candidate news, which is predicted by a time-aware news popularity predictor based on news content, recency, and near real-time CTR.

Knowledge-aware News Encoder
First, we introduce the knowledge-aware news encoder, which is shown in Fig. 3. It learns news representations from both the text and the entities in the news title. Given a news title, we obtain the word embeddings from a word embedding dictionary pre-trained on a large-scale corpus to incorporate initial word-level semantic information. We also convert entities into embeddings based on pre-trained entity embeddings to incorporate knowledge information from knowledge graphs into our model.
There usually exists relatedness among entities in the same news. For example, the entity "MAC" that appears with the entity "Lancome" may indicate cosmetics, while it usually indicates computers when it appears with the entity "Apple". Thus, we utilize an entity multi-head self-attention (MHSA) network (Vaswani et al., 2017) to learn entity representations by capturing their relatedness. Besides, textual contexts are also informative for learning accurate entity representations. For example, the entity "MAC" usually indicates computers if its textual context is "Why do MAC need an ARM CPU?" and indicates cosmetics if its textual context is "MAC cosmetics expands AR try-on". Thus, we propose an entity multi-head cross-attention (MHCA) network to learn entity representations from the textual contexts. Then we formulate the unified representation of each entity as the summation of its representations learned by the MHSA and MHCA networks. Similarly, we use a word MHSA network to learn word representations by capturing the relatedness among words and a word MHCA network to capture the relatedness between words and entities. Then we build the unified word representation by adding its representations generated by the word MHSA and the word MHCA networks.
Since different entities usually contribute differently to news representation, we use an entity attention network to learn entity-based news representation e from entity representations. Similarly, we use a word attention network to learn word-based news representation w from word representations. Finally, we learn the unified news representation n with a weighted combination of e and w via an attention network.
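The attention-based pooling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact network used in PP-Rec: the function name `attention_pool`, the dot-product scoring, and the `query` vector (standing in for trainable parameters) are hypothetical choices made here for brevity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(vectors, query):
    """Pool a list of equal-length vectors into one vector.

    Each vector is scored by its dot product with `query` (an assumed
    stand-in for a learned attention query); the scores are normalized
    with softmax and used as weights in a weighted sum.
    """
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights

# Toy entity and word representations (2-dimensional for readability).
entity_reprs = [[1.0, 0.0], [0.0, 1.0]]
word_reprs = [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]

e, _ = attention_pool(entity_reprs, query=[0.0, 0.0])  # entity-based representation
w, _ = attention_pool(word_reprs, query=[0.0, 1.0])    # word-based representation
n, view_weights = attention_pool([e, w], query=[1.0, 0.0])  # unified representation
```

The same pooling routine is applied three times: over entities, over words, and finally over the two views e and w, which mirrors the weighted combination described in the text.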

Time-aware News Popularity Predictor
Next, we introduce the time-aware news popularity predictor, as shown in Fig. 4. It is used to predict time-aware news popularity based on news content, recency, and near real-time CTR information. Since popular news usually have a higher click probability than unpopular news, CTR can provide a good clue for identifying popular news (Jiang, 2016). Thus, we incorporate CTR into news popularity prediction. Besides, the popularity of a news article usually changes dynamically. Popular news may become less popular as they get out-of-date over time. Thus, we use user interactions in the most recent $t$ hours to calculate the near real-time CTR (denoted as $c_t$) for news popularity prediction. However, the accurate computation of CTR requires accumulating sufficient user interactions, which is challenging for newly published news.
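As a minimal sketch of how a near real-time CTR over the most recent hours could be computed from an impression log: the event format `(timestamp_in_hours, clicked)`, the function name, and the fallback value of 0.0 for news without impressions are assumptions made for illustration.

```python
def near_real_time_ctr(events, now, window_hours=1.0):
    """Compute CTR from impressions that fall in (now - window_hours, now].

    `events` is an iterable of (timestamp_in_hours, clicked) pairs for one
    news article. Returns 0.0 when the window holds no impressions, which
    is an assumed fallback for newly published news.
    """
    start = now - window_hours
    impressions = 0
    clicks = 0
    for timestamp, clicked in events:
        if start < timestamp <= now:
            impressions += 1
            clicks += int(clicked)
    return clicks / impressions if impressions else 0.0

# Toy log: three impressions inside the last hour, one older click outside it.
log = [(9.2, True), (9.5, False), (9.9, True), (8.0, True)]
ctr_now = near_real_time_ctr(log, now=10.0, window_hours=1.0)
ctr_new_article = near_real_time_ctr([], now=10.0)
```

The empty-log case illustrates the limitation noted above: a newly published article has no reliable CTR signal yet.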
Fortunately, news content is very informative for predicting news popularity. For example, news on breaking events such as earthquakes are usually popular since they contain important information for many of us. Thus, besides near real-time CTR, we incorporate news content into news popularity prediction. We apply a dense network to the news content embedding $\mathbf{n}$ to predict the content-based news popularity $\hat{p}_c$. Since news content is time-independent and cannot capture the dynamic change of news popularity, we incorporate news recency information, which is defined as the duration between the publish time and the prediction time. It can measure the freshness of news articles, which is useful for improving content-based popularity prediction. We quantify the news recency $r$ in hours and use a recency embedding layer to convert the quantified news recency into an embedding vector $\mathbf{r}$. Then we apply a dense network to $\mathbf{r}$ to predict the recency-aware content-based news popularity $\hat{p}_r$. Besides, since different news content usually have different lifecycles, we propose to model time-aware content-based news popularity $\hat{p}$ from $\hat{p}_c$ and $\hat{p}_r$ using a content-specific aggregator:

$$\hat{p} = \theta \cdot \hat{p}_c + (1 - \theta) \cdot \hat{p}_r, \qquad \theta = \sigma(\mathbf{W}^p \cdot [\mathbf{n}, \mathbf{r}] + b^p),$$

where $\theta \in (0, 1)$ means the content-specific gate, $\sigma(\cdot)$ means the sigmoid activation, $[\cdot, \cdot]$ means the concatenation operation, and $\mathbf{W}^p$ and $b^p$ are the trainable parameters. Finally, the final time-aware news popularity $s_p$ is formulated as a weighted summation of the content-based popularity $\hat{p}$ and the CTR-based popularity $c_t$, i.e., $s_p = w_c \cdot c_t + w_p \cdot \hat{p}$, where $w_c$ and $w_p$ are the trainable parameters.
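The aggregation steps of the popularity predictor can be sketched in a few lines. This is an illustrative sketch only: it assumes the gate mixes the two content-based scores as θ·p_c + (1 − θ)·p_r with θ computed from the concatenation of the content and recency embeddings, and all parameter values below are hand-set placeholders rather than trained weights. The dense networks that produce p_c and p_r are omitted, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def content_specific_gate(n, r, w_gate, b_gate):
    """theta = sigmoid(w_gate . [n, r] + b_gate), with [n, r] a concatenation."""
    concat = list(n) + list(r)
    z = sum(w * x for w, x in zip(w_gate, concat)) + b_gate
    return sigmoid(z)

def time_aware_popularity(p_c, p_r, theta, ctr, w_c, w_p):
    """Combine content-based, recency-aware, and CTR-based popularity.

    p_hat = theta * p_c + (1 - theta) * p_r   (content-specific aggregator)
    s_p   = w_c * ctr + w_p * p_hat           (final popularity score)
    """
    p_hat = theta * p_c + (1.0 - theta) * p_r
    return w_c * ctr + w_p * p_hat

# Placeholder inputs: zero gate weights give theta = 0.5 (an even mix).
theta = content_specific_gate(n=[0.3, 0.1], r=[0.2], w_gate=[0.0, 0.0, 0.0], b_gate=0.0)
s_p = time_aware_popularity(p_c=0.8, p_r=0.2, theta=theta, ctr=0.1, w_c=1.0, w_p=2.0)
```

With these placeholder values the mixed content-based popularity is 0.5 and the final score is 1.0 · 0.1 + 2.0 · 0.5 = 1.1.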

Popularity-aware User Encoder
Next, we introduce the popularity-aware user encoder in PP-Rec for user interest modeling, which is shown in Fig. 5. In general, news popularity can influence users' click behaviors and cause bias in behavior-based user interest modeling (Zheng et al., 2010). Eliminating the popularity bias in user behaviors can help model user interest from user behaviors more accurately. For example, a user may click the news "Justin Timberlake unveils the song" because he likes the songs of "Justin Timberlake", while he may click the news "House of Representatives impeaches President Trump" because it is popular and contains breaking information. Of these two behaviors, the former is more informative for modeling the user interest. Thus, we design a popularity-aware user encoder to learn user interest representation from both the content and the popularity of clicked news. It contains three components, which we introduce in detail below.
First, motivated by Wu et al. (2019e), we apply a news multi-head self-attention network to the representations of clicked news to capture their relatedness and learn contextual news representations. Second, we uniformly quantify the popularity of the $i$-th clicked news predicted by the time-aware news popularity predictor and convert it into an embedding vector $\mathbf{p}_i$ via a popularity embedding layer. Third, besides news popularity, news content is also useful for selecting informative news to model user interest (Wu et al., 2019a). Thus, we propose a content-popularity joint attention network (CPJA) to alleviate popularity bias and select important clicked news for user interest modeling, which is formulated as:

$$\alpha_i = \frac{\exp\left(\mathbf{q}^\top \tanh(\mathbf{W}^u [\mathbf{m}_i, \mathbf{p}_i])\right)}{\sum_{j=1}^{N} \exp\left(\mathbf{q}^\top \tanh(\mathbf{W}^u [\mathbf{m}_j, \mathbf{p}_j])\right)},$$

where $\alpha_i$ and $\mathbf{m}_i$ denote the attention weight and the contextual news representation of the $i$-th clicked news respectively, $N$ is the number of clicked news, and $\mathbf{q}$ and $\mathbf{W}^u$ are the trainable parameters. The final user interest embedding $\mathbf{u}$ is formulated as a weighted summation of the contextual news representations:

$$\mathbf{u} = \sum_{i=1}^{N} \alpha_i \cdot \mathbf{m}_i.$$
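The content-popularity joint attention can be sketched as below. This is a hedged illustration: it assumes an additive attention of the form q · tanh(W_u [m_i, p_i]) followed by a softmax, and the parameter values are hand-set for the example rather than learned. The function name `cpja_user_embedding` is hypothetical.

```python
import math

def cpja_user_embedding(news_reprs, pop_embs, W_u, q):
    """Attention over clicked news using both content and popularity.

    news_reprs: contextual representations m_i of clicked news
    pop_embs:   popularity embeddings p_i of the same clicked news
    W_u, q:     attention parameters (assumed shapes: W_u is a list of rows
                over the concatenation [m_i, p_i]; q matches the row count)
    Returns the attention weights and the user embedding u = sum_i a_i * m_i.
    """
    scores = []
    for m_i, p_i in zip(news_reprs, pop_embs):
        x = list(m_i) + list(p_i)
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W_u]
        scores.append(sum(a * h for a, h in zip(q, hidden)))
    mx = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(news_reprs[0])
    u = [sum(a * m[d] for a, m in zip(alphas, news_reprs)) for d in range(dim)]
    return alphas, u

# Two clicked news; the first has the larger (1-dimensional) popularity embedding.
m = [[1.0, 0.0], [0.0, 1.0]]
p = [[0.5], [0.2]]
# Hand-set parameters that penalize popularity, so the popular click gets less weight.
alphas, u = cpja_user_embedding(m, p, W_u=[[0.0, 0.0, -1.0]], q=[1.0])
```

With parameters that penalize the popularity dimension, the click on the more popular news receives a smaller attention weight, which is the behavior the encoder is meant to be able to learn.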

News Ranking and Model Training
In this section, we introduce how we rank the candidate news and train the model in detail. The ranking score of a candidate news for a target user is based on the combination of a personalized matching score $s_m$ and a news popularity score $s_p$. The former is computed based on the relevance between user embedding $\mathbf{u}$ and news embedding $\mathbf{n}$. Following Okura et al. (2017), we adopt the dot product to compute the relevance. The latter is predicted by the time-aware news popularity predictor. In addition, the relative importance of the personalized matching score and the news popularity score is usually different for different users. For example, the news popularity score is more important than the personalized matching score for cold-start users since the latter is derived from scarce behaviors and is usually inaccurate. Thus, we propose a personalized aggregator to combine the personalized matching score and news popularity score:

$$s = (1 - \eta) \cdot s_m + \eta \cdot s_p,$$

where $s$ denotes the ranking score, and the gate $\eta$ is computed based on the user representation $\mathbf{u}$ via a dense network with sigmoid activation. We use the BPR pairwise loss (Rendle et al., 2009) for model training. In addition, we adopt the negative sampling technique to select a negative sample for each positive sample from the same impression. The loss function is formulated as:

$$\mathcal{L} = -\frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \log\left(\sigma(s_i^p - s_i^n)\right),$$

where $s_i^p$ and $s_i^n$ denote the ranking scores of the $i$-th positive and negative sample respectively, and $\mathcal{D}$ denotes the training dataset.
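The score aggregation and the pairwise loss can be sketched as follows. This is a minimal illustration assuming the gate interpolates the two scores as (1 − η)·s_m + η·s_p and that the BPR loss is the mean of −log σ(s_pos − s_neg) over training pairs; the gate value is passed in directly instead of being produced by a dense network, and the function names are hypothetical.

```python
import math

def ranking_score(s_m, s_p, eta):
    """Personalized aggregator: eta in (0, 1) weights the popularity score."""
    return (1.0 - eta) * s_m + eta * s_p

def neg_log_sigmoid(x):
    """-log(sigmoid(x)) = log(1 + exp(-x)), computed without overflow."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

def bpr_loss(pos_scores, neg_scores):
    """Mean BPR pairwise loss over (positive, negative) score pairs."""
    losses = [neg_log_sigmoid(sp - sn) for sp, sn in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)

# A cold-start user would typically be given a larger eta than an active user.
score_active = ranking_score(s_m=2.0, s_p=4.0, eta=0.25)  # 0.75*2 + 0.25*4 = 2.5
score_cold = ranking_score(s_m=2.0, s_p=4.0, eta=0.75)    # 0.25*2 + 0.75*4 = 3.5

loss_tied = bpr_loss([0.0], [0.0])  # equals log(2) when scores are tied
loss_good = bpr_loss([5.0], [0.0])  # small: positive ranked well above negative
loss_bad = bpr_loss([0.0], [5.0])   # large: negative ranked above positive
```

The loss decreases as the positive sample's ranking score rises above that of the negative sample drawn from the same impression.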

Dataset and Experimental Settings
To the best of our knowledge, there is no off-the-shelf news recommendation dataset with news popularity information. Thus, we built two datasets ourselves. The first one is collected from the user logs of the Microsoft News website from October 19 to November 15, 2019, and is denoted as MSN. We use the user logs in the last week for evaluation and the others for model training and validation. The second dataset is collected from a commercial news feed in Microsoft from January 23 to April 23, 2020, and is denoted as Feeds. We use the logs in the last three weeks for evaluation and the rest for model training and validation. For both datasets, we randomly sample 500k impressions for model training, 100k impressions for validation, and 500k impressions for evaluation, respectively. The detailed statistics are listed in Table 1.
In our experiments, word embeddings are 300-dimensional and initialized with the GloVe embeddings (Pennington et al., 2014). The entity embeddings are 100-dimensional vectors pre-trained on knowledge tuples extracted from WikiData via TransE (Bordes et al., 2013). We use clicked and unclicked impressions in the most recent hour to compute the near real-time CTR. The recency and popularity embeddings are set to 100 dimensions and initialized randomly. All multi-head attention networks are set to have 20 attention heads, and the output dimension of each head is 20. All gate networks are implemented by a two-layer dense network with 100-dimensional hidden vectors. Dropout (Srivastava et al., 2014) is applied to PP-Rec to mitigate overfitting. The dropout probability is set to 0.2. Adam (Kingma and Ba, 2015) is used for model training with a learning rate of $10^{-4}$. Hyper-parameters of PP-Rec and baselines are tuned based on the validation set.

Performance Evaluation
We compare PP-Rec with two groups of baselines. The first group is popularity-based news recommendation methods, including: (1) ViewNum (Yang, 2016): using the number of news views to measure news popularity; (2) RecentPop (Ji et al., 2020): using the number of news views in recent time to measure news popularity; (3) SCENE (Li et al., 2011): using view frequency to measure news popularity and adjusting the ranking of news with the same topics based on their popularity; (4) CTR (Ji et al., 2020): using news CTR to measure news popularity. The second group is personalized news recommendation methods, containing: (1) EBNR (Okura et al., 2017): utilizing an auto-encoder to learn news representations and a GRU network to learn user representations; (2) DKN (Wang et al., 2018): utilizing a knowledge-aware CNN network to learn news representations from news titles and entities; (3) NAML (Wu et al., 2019a): utilizing an attention network to learn news representations from news title, body and category; (4) NPA (Wu et al., 2019b): utilizing personalized attention networks to learn news and user representations; (5) NRMS (Wu et al., 2019e): utilizing multi-head self-attention networks to learn both news and user representations; (6) LSTUR: modeling users' short-term interests via a GRU network and long-term interests via the user ID; (7) KRED (Liu et al., 2020): learning news representations from titles and entities via a knowledge graph attention network.
We repeat each experiment 5 times and show the average performance and standard deviation in Table 2, from which we have the following observations. First, among the popularity-based news recommendation methods, the CTR method outperforms the ViewNum method. This is because the number of news views is influenced by impression bias while CTR can eliminate the impression bias and better measure news popularity. Second, PP-Rec outperforms all popularity-based methods. This is because these methods usually recommend the same popular news to different users. However, different users might prefer different news according to their personalized interests, some of which are not popular and cannot be recommended by these popularity-based methods. In contrast, PP-Rec considers both popularity and personalization in news recommendation. Third, PP-Rec outperforms all personalized methods. This is because personalized methods usually recommend news based on the matching between news and user interest inferred from users' clicked news, and they ignore the popularity of each news. However, popular news usually contain important and eye-catching information and can attract the attention of many users with different interests. Different from these personalized methods, PP-Rec incorporates news popularity into personalized news recommendation, which can recommend popular news to users and improve the performance of news recommendation.

Performance on Cold-Start Users
We evaluate the performance of PP-Rec and several personalized methods on news recommendation for cold-start users. We compare PP-Rec with NAML, KRED, LSTUR and NRMS since they achieve good performance in Table 2. We evaluate their performance on recommending news to users with K ∈ {0, 1, 3, 5} historical clicked news. In the following sections, we only show experimental results on the MSN dataset since the results on the MSN and Feeds datasets are similar. As shown in Fig. 6, PP-Rec significantly outperforms the other personalized methods. This is because these personalized methods usually recommend news based on the matching between news and user interests. However, it is difficult for these methods to accurately model the personal interests of cold-start users from their scarce clicks and to help them find news they are interested in. Different from these methods, PP-Rec recommends news based on both personalized interest matching and news popularity. Popular news usually contain important information and can attract many users with different interests. Thus, incorporating news popularity into news recommendation can effectively improve the reading experience of cold-start users.

Recommendation Diversity
In this section, we evaluate the recommendation diversity of PP-Rec and other personalized methods. We use two metrics, i.e., intra-list average distance and new topic ratio, to measure the diversity of the top K (K ∈ {1, ..., 10}) recommended news. The former is used to measure the average distance between recommended news based on their representations, which is widely used in previous works (Zhang and Hurley, 2008; Chen et al., 2018). The latter is used to measure the topic similarity between recommended news and users' historical clicked news. It counts the number of topics of the top K recommended news which are clicked and are not included in the topics of users' historical clicked news. Besides, we use K to normalize the number. Figs. 7 and 8 show that PP-Rec can consistently improve the recommendation diversity. This is because these personalized methods recommend news to users based on the matching between news and user interest inferred from clicked news, making the recommended news tend to be similar to users' consumed news. Different from these methods, PP-Rec incorporates news popularity into news recommendation. Besides the news related to user interest, PP-Rec can also recommend popular news, which are very diverse in content and topics, to users. Thus, PP-Rec can enhance recommendation diversity.
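The two diversity metrics can be sketched as follows. Several details are assumptions made for this illustration: cosine distance is used as the distance between news representations, the new topic ratio counts distinct topics, and recommended news are given as hypothetical `(topic, clicked)` pairs in rank order. Representations are assumed to be non-zero vectors.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def intra_list_average_distance(vectors):
    """Average pairwise distance among the recommended news representations."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)

def new_topic_ratio(recommended, history_topics, k):
    """Distinct topics among clicked top-k news that are absent from history, divided by k."""
    new_topics = {
        topic
        for topic, clicked in recommended[:k]
        if clicked and topic not in history_topics
    }
    return len(new_topics) / k

# Toy top-3 list: representations plus (topic, clicked) labels.
top3_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
top3_labels = [("sports", True), ("crime", True), ("finance", False)]

ilad = intra_list_average_distance(top3_vectors)
ntr = new_topic_ratio(top3_labels, history_topics={"sports"}, k=3)
```

In this toy list two of the three pairs are orthogonal, so the intra-list average distance is 2/3, and only the clicked "crime" item introduces a topic outside the user's history, giving a new topic ratio of 1/3.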

Ablation Study
In this section, we conduct several ablation studies on PP-Rec. First, we verify the effectiveness of the two scores for candidate news ranking, i.e., the news popularity score and the personalized matching score, by removing them individually from PP-Rec. The experimental results are shown in Fig. 9. We have two findings from the results. First, after removing the news popularity score, the performance of PP-Rec declines. This is because PP-Rec incorporates news popularity into news recommendation via this score. In addition, popular news usually contain important information and can attract many users with different interests. Thus, recommending popular news can improve news recommendation accuracy. Second, removing the personalized matching score also hurts the recommendation accuracy. This is because this score measures user interest in news and incorporates personalized matching into news recommendation in PP-Rec. Since users like to click news related to their personalized interests, recommending news that users are interested in can effectively improve recommendation accuracy.
Figure 9: Effectiveness of personalized matching score and news popularity score.
Next, as shown in Fig. 10, we conduct an ablation study to verify the effectiveness of different information in the time-aware news popularity predictor by removing them individually. We have several observations from the results. First, removing news recency makes the performance of PP-Rec decline. This is because news popularity usually changes dynamically, and popular news will become unpopular once its information has expired. Since news recency can reflect the freshness of news information, incorporating it makes the news popularity modeling more accurate. Second, the performance of PP-Rec without news content also declines. This is because after removing it, PP-Rec predicts news popularity based only on the near real-time CTR and recency. However, it usually takes some time to accumulate enough impressions to calculate an accurate CTR. Thus, removing the news content prevents PP-Rec from effectively modeling the popularity of news that has just been published. Third, PP-Rec performs worse without the near real-time CTR. This is because near real-time CTR effectively measures the click probability of the news based on the behaviors of a large number of users in the recent period. Thus, removing the near real-time CTR makes PP-Rec lose much useful information for modeling the dynamic news popularity.

Case Study
We conduct a case study to show the effectiveness of PP-Rec. We compare PP-Rec with LSTUR since LSTUR achieves the best performance among baseline methods on the MSN dataset. In Fig. 11, we list the top 3 news recommended by the two methods to a randomly sampled user and their normalized popularity predicted by PP-Rec. We also list the user's clicked news. First, we find that the user clicked a news article on football, which is recommended by both LSTUR and PP-Rec. This is because the user has previously clicked three news articles on football, which indicates the user is interested in football. Thus, both LSTUR and PP-Rec recommend that news based on the personal interest of this user. Second, the user did not click the other news on football recommended by PP-Rec and LSTUR. This may be because recommending too much news with similar information may bore users, so that the user only clicks a part of them. This suggests that recommending news with diverse information may help improve users' reading experience. Third, the user clicked a news article on crime, which is only recommended by PP-Rec. This is because it is hard to predict the user's interest in criminal events from her clicks, making it difficult for LSTUR to recommend this news. Different from LSTUR, PP-Rec recommends news based on both personal user interest and news popularity. PP-Rec successfully predicts that this news is popular and recommends it. This case shows that PP-Rec can improve the recommendation accuracy and enhance the recommendation diversity by incorporating news popularity.

Conclusion
In this paper, we propose a new news recommendation method named PP-Rec to alleviate the cold-start and diversity problems of personalized news recommendation, which can consider both the personal interest of users and the popularity of candidate news. In our method, we rank the candidate news based on the combination of a personalized matching score and a news popularity score. We propose a unified model to predict time-aware news popularity based on news content, recency, and near real-time CTR. In addition, we propose a knowledge-aware news encoder to generate news content embeddings from news texts and entities, and a popularity-aware user encoder to generate user interest embeddings from the content and popularity of clicked news. Extensive experiments on two real-world datasets constructed from the logs of commercial news websites and feeds in Microsoft validate that our method can effectively improve the accuracy and diversity of news recommendation.