TrNews: Heterogeneous User-Interest Transfer Learning for News Recommendation

We investigate cross-corpus news recommendation for unseen users, a problem where traditional content-based recommendation techniques often fail. Fortunately, in real-world recommendation services, one publisher (e.g., daily news) may have accumulated a large corpus with many consumers that can benefit a newly deployed publisher (e.g., political news). To take advantage of the existing corpus, we propose a transfer learning model (dubbed TrNews) for news recommendation that transfers knowledge from a source corpus to a target corpus. To tackle the heterogeneity of user interests and of word distributions across corpora, we design a translator-based transfer-learning strategy to learn a representation mapping between source and target corpora. The learned translator can then be used to generate representations for unseen users. We show through experiments on real-world datasets that TrNews outperforms various baselines in terms of four metrics, and that our translator is more effective than existing transfer strategies.


Introduction
News recommendation is key to satisfying users' information needs in online services. Some news articles, such as breaking news, are manually selected by publishers and displayed to all users. The huge number of news articles generated every day, however, makes it impossible for editors and users to read through all of them, raising the issue of information overload. Online news platforms therefore provide personalized news recommendation by learning from users' past reading history, e.g., Google (Das et al., 2007), Yahoo (Trevisiol et al., 2014; Okura et al., 2017), and Bing news (Lu et al., 2015). When a new user joins the system (cold-start users) or a new article is just created (cold-start items), there are too few observations to train a reliable recommender system. Content-based techniques exploit the content information of news (e.g., words and tags), and hence new articles can be recommended to existing users (Pazzani and Billsus, 2007). Content-based recommendation, however, still suffers from data sparsity on the user side, since new users have no reading history from which to build a profile (Park and Chu, 2009).
Transfer learning is a common technique for alleviating data sparsity (Pan et al., 2010; Cantador et al., 2015). A user may have access to many websites such as Twitter.com and Youtube.com (Roy et al., 2012; Huang and Lin, 2016), and consume different categories of products such as movies and books (Li et al., 2009). In this case, transfer learning approaches can recommend articles to a new user in the target domain by exploiting knowledge about this user from relevant source domains.
A technical challenge for transfer learning approaches is that user interests are quite different across domains (corpora). For example, users do not use Twitter for a single purpose. A user may follow news about "Donald Trump" because she supports the Republican Party (in the political news domain), while she may follow the account @taylorswift13 ("Taylor Swift") because she loves music (in the entertainment news domain). Another challenge is that the word distribution and feature space differ across domains. For example, the vocabularies for describing political news and entertainment news are different. An illustration is depicted in Figure 1. As a result, the user profile computed from a user's news history is heterogeneous across domains.
Several strategies have been proposed for heterogeneous transfer learning. The transferable contextual bandit (TCB) learns a translation matrix to translate target feature examples to the source feature space. This linear mapping strategy is also used in collaborative cross networks (CoNet) (Hu et al., 2018) and deep dual transfer cross-domain recommendation (DDTCDR) (Li and Tuzhilin, 2020). To capture complex relations between source and target domains, a nonlinear mapping strategy is adopted in the embedding-and-mapping cross-domain recommendation (EMCDR) (Man et al., 2017), which learns a supervised regression between source and target factors using a multilayer perceptron (MLP).
Since aligned examples between source and target domains are limited, these mapping approaches may face overfitting issues.
To tackle the challenges of heterogeneous user interests and limited aligned data between domains, we propose a novel transfer learning model (TrNews) for cross-corpus news recommendation. TrNews builds a bridge between two base networks (one for each corpus, see Section 3.1.1) through the proposed translator-based transfer strategy. The translator in TrNews captures the relations between source and target domains by learning a nonlinear mapping between them (Section 3.2). The heterogeneity is alleviated by translating user interests across corpora, and TrNews uses the translator to transfer knowledge between source and target networks. TrNews alleviates the problem of limited aligned data through alternating training (Section 3.3). The learned translator is used to infer the representations of unseen users (Section 3.4): by "translating" the source representation of a user to the target domain, TrNews offers an easy way to create unseen users' target representations. TrNews outperforms state-of-the-art recommendation methods on four real-world datasets in terms of four metrics (Section 4.2), while having an explanation advantage: it can visualize how important each news article in the history is to a future article (Section 4.4).

Related Work
Content recommendation Content-based recommendation exploits the content information of items (e.g., news title and article body (Yan et al., 2012; Xiao et al., 2019; Ma et al., 2019; Wu et al., 2020; Hu et al., 2020), tags, vlogs (Gao et al., 2010)), builds a profile for each user, and then matches users to items (Lops et al., 2011; Yu et al., 2016; Wu et al., 2019b). It is effective for items with content or auxiliary information but suffers from data sparsity for users. DCT (Barjasteh et al., 2015) constructs a user-user similarity matrix from user demographic features including gender, age, occupation, and location (Park and Chu, 2009). NT-MF (Huang and Lin, 2016) constructs a user-user similarity matrix from Twitter texts. BrowseGraph (Trevisiol et al., 2014) addresses fresh news recommendation by constructing a graph from URL links between web pages. NAC (Rafailidis and Crestani, 2019) transfers from multiple source domains through the attention mechanism. PdMS (Felício et al., 2017) assumes that many recommender models are available to select items for a user, and introduces a multi-armed bandit for model selection. LLAE (Li et al., 2019a) requires a social network as side information for cold-start users. Different from the aforementioned works, we aim to recommend news to unseen users by transferring knowledge from a source domain to a target domain.

Transfer learning Transfer learning aims at improving the performance on a target domain by exploiting knowledge from source domains (Pan and Yang, 2009). A special setting is domain adaptation, where a source domain provides labeled training examples while the target domain provides the instances on which the model is meant to be deployed (Glorot et al., 2011; Li et al., 2019b). The coordinate system transfer (CST) (Pan et al., 2010) first learns the principal coordinates of users in the source domain, and then transfers them to the target domain as a warm-start initialization.
This is equivalent to an identity mapping from users' source representations to their corresponding target representations. TCB transforms the source representations to the target domain by a translation matrix; this linear strategy is also used in CoNet (Hu et al., 2018) and DDTCDR (Li and Tuzhilin, 2020). The nonlinear mapping strategy (Man et al., 2017; Fu et al., 2019) learns a supervised mapping function between source and target latent factors using neural networks. SSCDR (Kang et al., 2019) extends these to the semi-supervised mapping setting. Our translator is general enough to accommodate the identity, linear, and nonlinear transfer-learning strategies.

Architecture
The architecture of TrNews is shown in Figure 2a and has three parts: a source network for the source domain S, a target network for the target domain T, and a translator connecting them. The source and target networks are both instantiations of the base network (Section 3.1.1), and the translator enables knowledge transfer between the two networks (Section 3.2). We give an overview of TrNews before introducing the base network and the translator.

Target network The information flows from the input, i.e., a (user u, candidate news c_T) pair, to the output, i.e., the preference score r̂_{uc_T}, through three steps. First, the news encoder ψ_T computes a news representation from its content: the candidate news representation is ψ_T(c_T), and each article d_i in the user's history is encoded as ψ_T(d_i) from its content, where n_{uc_T} is the size of the history. Second, the user encoder φ_T computes the user representation φ_T(u) from her news history. Third, the neural collaborative filtering (CF) module f_T computes the preference score r̂_{uc_T} = f_T([φ_T(u), ψ_T(c_T)]). We denote the target network by the tuple (ψ_T, φ_T, f_T).

Source network Analogously to the three-step computation in the target network, we compute the preference score r̂_{uc_S} from the input (u, c_S) via (ψ_S, φ_S, f_S).

Translator The translator F learns a mapping from the user's source representation to her target representation, F : φ_S(u) → φ_T(u).

Base network
There is a base network for each of the two domains. It is an attentional network with three modules (ψ, φ, f): the news encoder ψ learns news representations, the user encoder φ learns user representations, and a neural collaborative filtering module f learns user preferences from reading behaviors.

News encoder The news encoder module learns a news representation from the article's content. It takes a news article c's word sequence d_c = [w_j]_{j=1}^{n_c} (n_c is the length of c) as input and outputs the representation ψ(c), computed from the embeddings e_w of its words, where e_w is the embedding of word w.

User encoder The user encoder module learns the user representation from her reading history. In detail, given a pair of user and candidate news (u, c), we compute the user representation φ(u|c) as the weighted sum of her historical news articles' representations, with the weights given by an attention function a over each history article and the candidate; a has parameters to be learned and we use an MLP to compute it. For a specific candidate news c, we limit the history to those articles read before it. For notational simplicity, we do not explicitly specify the candidate news when referring to a user representation, i.e., φ(u) is short for φ(u|c).

Neural CF The neural collaborative filtering module learns preferences from user-news interactions. It takes the concatenated representations of user and news, [φ(u), ψ(c)], as input and outputs the preference score r̂_{uc} = f([φ(u), ψ(c)]), where f is an MLP.
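The three modules above can be sketched as follows. This is a minimal numpy sketch of the (ψ, φ, f) interface, not the paper's trained model: the averaging news encoder, the random weights, and the attention/embedding sizes are illustrative stand-ins (the CF hidden sizes 80 and 40 mirror the implementation details given later).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def news_encoder(word_ids, emb):
    """psi(c): build a news representation from the word embeddings e_w.
    Averaging is a stand-in; any content encoder fits the interface."""
    return emb[word_ids].mean(axis=0)

def user_encoder(history_vecs, cand_vec, W_att, v_att):
    """phi(u|c): attention-weighted sum of the representations of the
    articles read before the candidate c."""
    def att(h):  # attention function a, an MLP over [psi(d_i), psi(c)]
        return v_att @ np.tanh(W_att @ np.concatenate([h, cand_vec]))
    weights = softmax(np.array([att(h) for h in history_vecs]))
    return weights @ np.stack(history_vecs)

def neural_cf(user_vec, cand_vec, W1, W2, w_out):
    """f: an MLP over the concatenation [phi(u), psi(c)], producing a
    preference score in (0, 1)."""
    h = np.maximum(0.0, W1 @ np.concatenate([user_vec, cand_vec]))
    h = np.maximum(0.0, W2 @ h)
    return 1.0 / (1.0 + np.exp(-(w_out @ h)))

# Toy forward pass with hypothetical sizes: vocab 100, embedding 16,
# history of 5 articles of 12 words each.
emb = rng.normal(scale=0.1, size=(100, 16))
W_att, v_att = rng.normal(scale=0.1, size=(8, 32)), rng.normal(scale=0.1, size=8)
W1 = rng.normal(scale=0.1, size=(80, 32))
W2 = rng.normal(scale=0.1, size=(40, 80))
w_out = rng.normal(scale=0.1, size=40)

history = [news_encoder(rng.integers(0, 100, size=12), emb) for _ in range(5)]
candidate = news_encoder(rng.integers(0, 100, size=12), emb)
u = user_encoder(history, candidate, W_att, v_att)
score = neural_cf(u, candidate, W1, W2, w_out)
```

The same code instantiates both the source network (ψ_S, φ_S, f_S) and the target network (ψ_T, φ_T, f_T); only the parameters differ.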

Translator
The target network suffers from the data sparsity issue of users who have no reading history. In this section, we propose a transfer learning component (i.e., the translator) to enable knowledge transfer for cross-domain news recommendation. The challenge is that user interests and word distributions are different across domains. For example, we compute the word clouds for two news corpora as shown in Figure 1. We can see that their word distributions are quite different and vocabularies are also different. Hence, user representations computed from their news history are heterogeneous across domains.
We build a translator, F : φ S (u) → φ T (u), to learn a mapping from a user's source representation to her target representation as shown in Figure 2b. This translator captures the relationship and heterogeneity across domains. The translator learns to approximate the target representation from the source representation.
The translator takes a user's source representation φ_S(u) as input, maps it to a hidden representation z_u via an encoder parameterized by θ, and then produces an approximated representation φ̂_S(u) via a decoder parameterized by θ′. The parameters Θ_F = {θ, θ′} of the translator are optimized to minimize the approximation error

L_F = Σ_{u ∈ U_0} ‖ φ̂_S(u) − H φ_T(u) ‖²,   (1)

where U_0 = U_S ∩ U_T, U_S and U_T are the user sets of the source and target domains, respectively, and H matches the dimensions of the source and target representations. Note that we do not minimize the approximation error between φ_S(u) and φ̂_S(u) as a standard autoencoder would, because our goal is to learn a mapping from a user's source representation to her corresponding target representation. After training, the learned mapping function is used to infer the representations of unseen users in the target domain (the inference process is described in Section 3.4). It fulfills knowledge transfer from the source to the target domain via a supervised learning process.

Extensions The translator can be generalized to multiple, say k, source domains: we learn k translators using the aligned examples from each source domain to the target domain, and then average (or concatenate) the k mapped representations into the final user representation. Another extension is to introduce denoising or stacking techniques into the translator framework, beyond the plain MLP structure of (Man et al., 2017).
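A minimal sketch of the translator follows, assuming equal source and target dimensions so the matrix H of Eq. (1) can be omitted; the layer sizes, initialization, and plain SGD loop are hypothetical, and the toy "aligned pairs" simply set φ_T(u) = 0.5 φ_S(u) to exercise the training step.

```python
import numpy as np

rng = np.random.default_rng(1)

class Translator:
    """Encoder-decoder translator F: phi_S(u) -> z_u -> approximation
    of phi_T(u), trained on aligned users u in U_0 = U_S ∩ U_T."""
    def __init__(self, d_src, d_hid, d_tgt, lr=0.02):
        self.We = rng.normal(scale=0.3, size=(d_hid, d_src))  # encoder (theta)
        self.Wd = rng.normal(scale=0.3, size=(d_tgt, d_hid))  # decoder (theta')
        self.lr = lr

    def forward(self, phi_s):
        z = np.tanh(self.We @ phi_s)   # hidden representation z_u
        return z, self.Wd @ z          # approximation of the target representation

    def train_step(self, phi_s, phi_t):
        """One SGD step on 0.5 * ||F(phi_S(u)) - phi_T(u)||^2."""
        z, out = self.forward(phi_s)
        err = out - phi_t
        grad_z = self.Wd.T @ err                                # backprop before update
        self.Wd -= self.lr * np.outer(err, z)
        self.We -= self.lr * np.outer(grad_z * (1.0 - z**2), phi_s)
        return float(err @ err)

tr = Translator(d_src=16, d_hid=8, d_tgt=16)
losses = []
for _ in range(2000):
    phi_s = rng.normal(size=16)
    losses.append(tr.train_step(phi_s, 0.5 * phi_s))  # toy aligned pair
```

The small hidden layer reflects the "small-waist" autoencoder-like shape discussed in Section 4.3, which helps resist overfitting when aligned users are scarce.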

Model learning
We learn TrNews in two stages. First, we train the source network using source training examples D S and train the target network using target training examples D T , respectively. Second, we train the translator by pairs of user representations computed on-the-fly from source and target networks. We introduce these two stages in detail.
First, TrNews optimizes the parameters of the target network Θ_T = {θ_{φ_T}, θ_{ψ_T}, θ_{f_T}} and the source network Θ_S = {θ_{φ_S}, θ_{ψ_S}, θ_{f_S}} by minimizing a joint cross-entropy loss,

L(Θ_T, Θ_S) = − Σ_{(u,c) ∈ D_T} [ r_{uc} log r̂_{uc} + (1 − r_{uc}) log(1 − r̂_{uc}) ] − Σ_{(u,c) ∈ D_S} [ r_{uc} log r̂_{uc} + (1 − r_{uc}) log(1 − r̂_{uc}) ],

where the two terms optimize the losses over the user-news examples in the target and source domains, respectively. They are related through the word embedding matrix for the union of the words of the two domains. We generate D_T and D_S as follows, taking the target domain as an example since the procedure is the same for the source domain. Suppose we have the whole news reading history of a user u, say [d_1, d_2, ..., d_{n_u}]. We generate the positive training examples by sliding over the history sequence: each article d_i, with the articles [d_1, ..., d_{i−1}] read before it as the history, forms a positive example, for i ∈ {2, ..., n_u}. We adopt the random negative sampling technique (Pan et al., 2008) to generate the corresponding negative training examples, that is, we randomly sample news articles from the corpus that are not in the user's reading history.
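The sliding-window example generation with random negative sampling can be sketched as follows; the function name and the one-negative-per-positive ratio are illustrative choices, not prescribed by the paper.

```python
import random

def make_training_examples(history, corpus, num_neg=1, seed=0):
    """Slide over a user's reading history [d_1, ..., d_nu]: each article
    d_i (i >= 2) is a positive candidate with the articles read before it
    as the history; negatives are random corpus articles the user never
    read (random negative sampling)."""
    rng = random.Random(seed)
    read = set(history)
    examples = []  # (history_prefix, candidate, label)
    for i in range(1, len(history)):
        prefix = history[:i]
        examples.append((prefix, history[i], 1))
        for _ in range(num_neg):
            neg = rng.choice(corpus)
            while neg in read:  # resample until outside the user's history
                neg = rng.choice(corpus)
            examples.append((prefix, neg, 0))
    return examples

corpus = [f"news_{j}" for j in range(50)]
history = ["news_3", "news_7", "news_1", "news_42"]
data = make_training_examples(history, corpus)
```

Running the same procedure over target and source logs yields D_T and D_S.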
Second, TrNews optimizes the parameters of the translator Θ_F = {θ, θ′} by Eq. (1). Since the aligned data is limited, we increase the number of training pairs by generating them on-the-fly during the training of the two networks, i.e., in an alternating way. The model learning is summarized in Algorithm 1.

Inference For an unseen user in the target domain, we do not have any previous history to rely on when learning her user representation. That is, the shaded area of the target network in Figure 2a is empty for unseen users.
TrNews estimates a new user u*'s target representation by mapping from her source representation with the learned translator F, i.e., φ_T(u*) = F(φ_S(u*)), where we compute φ_S(u*) using u*'s latest reading history in the source domain. We can then predict the user's preference for a candidate news c* by r̂_{u*c*} = f_T([φ_T(u*), ψ_T(c*)]).
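The cold-start inference step is then just translation followed by target-side scoring. In the sketch below, the linear translator and logistic scorer are random stand-ins for the trained F and f_T, used only to show the data flow.

```python
import numpy as np

def infer_unseen_user(phi_s_user, translator, psi_t_candidate, f_t):
    """Cold-start inference in the target domain: translate the unseen
    user's source representation, then score the candidate news with
    the target CF module."""
    phi_t_user = translator(phi_s_user)                   # phi_T(u*) = F(phi_S(u*))
    return f_t(np.concatenate([phi_t_user, psi_t_candidate]))

# Toy stand-ins for the trained translator F and CF module f_T.
rng = np.random.default_rng(2)
H = rng.normal(scale=0.1, size=(16, 16))
w = rng.normal(scale=0.1, size=32)
translator = lambda x: H @ x
f_t = lambda x: 1.0 / (1.0 + np.exp(-(w @ x)))

score = infer_unseen_user(rng.normal(size=16), translator,
                          rng.normal(size=16), f_t)
```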

Experiment
We evaluate the performance of TrNews (Section 4.2) and the effectiveness of the translator (Section 4.3) in this section.

Datasets and experimental setup
Datasets We evaluate on two real-world datasets. The first consists of four subdatasets, NY, FL, TX, and CA, extracted from a large dataset provided by the internet company Cheetah Mobile (Hu et al., 2019). It contains the news reading logs of users in a large geographical area collected in January 2017, divided by user geolocation into New York (NY), Florida (FL), Texas (TX), and California (CA). They are treated as four datasets rather than a single one because their user sets do not overlap. The top two categories of news (political and daily) are used as the cross corpora. The mean length of news articles is around 12 words and the max length is around 50 words; the mean length of user history is around 45 articles and the max length is around 900 articles. The second dataset, MIND, is a benchmark released by Microsoft for news recommendation (Wu et al., 2020). We use the publicly available MIND-small version (https://msnews.github.io/) to investigate knowledge transfer when the number of news reading examples is not so large. The title and abstract of each news article are used as its content, and the clicked historical news articles are the positive examples for a user. The top two categories of news (news and sports) are used as the cross corpora. The mean length of news articles is around 40 words and the max length is around 123 words; the mean length of user history is around 13 articles and the max length is around 246 articles. The word clouds of the two datasets are shown in Figure 1 and the statistics are summarized in Table 1.

Evaluation protocol We randomly split the whole user set into training and test sets with a ratio of 9:1. Given a user in the test set, for each news article in her history, we follow the common strategy of randomly sampling a number of negative news articles, say 99, which are not in her reading history, and then evaluate how well the recommender ranks the positive article against these negatives.
For each user in the training set, we reserve her last read news article as the validation set. We follow the typical metrics for evaluating top-K news recommendation (Peng et al., 2017), which are hit ratio (HR), normalized discounted cumulative gain (NDCG), mean reciprocal rank (MRR), and the area under the ROC curve (AUC). We report results at cut-offs K ∈ {5, 10}.

Implementation We use TensorFlow. The optimizer is Adam (Kingma and Ba, 2015) with learning rate 0.001. The mini-batch size is 256. The neural CF module has two hidden layers of sizes 80 and 40, respectively. The word embedding size is 128. The translator has one hidden layer on the smaller datasets and two on the larger ones. The history is the latest 10 news articles.
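For one test article, the leave-one-out ranking metrics described above reduce to simple functions of the positive item's rank among itself and the sampled negatives. A minimal sketch (ties between scores are ignored in this AUC, an illustrative simplification):

```python
import numpy as np

def rank_of_positive(pos_score, neg_scores):
    """1-based rank of the positive item among itself and the negatives."""
    return 1 + int(np.sum(np.asarray(neg_scores) > pos_score))

def metrics_at_k(pos_score, neg_scores, k=10):
    """HR@K, NDCG@K, MRR, and AUC for one (positive, sampled negatives)
    ranking list, as in the evaluation protocol above."""
    r = rank_of_positive(pos_score, neg_scores)
    hr = 1.0 if r <= k else 0.0                    # did the positive make top-K?
    ndcg = 1.0 / np.log2(r + 1) if r <= k else 0.0 # position-discounted gain
    mrr = 1.0 / r                                  # reciprocal rank
    auc = float(np.mean(pos_score > np.asarray(neg_scores)))
    return hr, ndcg, mrr, auc

# One positive scored 0.9 against four negatives, cut-off K = 2.
hr, ndcg, mrr, auc = metrics_at_k(0.9, [0.95, 0.8, 0.7, 0.1], k=2)
```

Averaging these per-list values over all test lists gives the reported numbers.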

Comparing different recommenders
In this section, we show the recommendation results by comparing TrNews with different state-of-the-art methods.
Baselines We compare with the following recommendation methods, which are trained on the merged source and target datasets aligned by shared users. POP (Park and Chu, 2009) recommends the most popular news. LR (McMahan et al., 2013) is widely used in ads and recommendation; its input is the concatenation of the candidate news and user representations. DeepFM (Guo et al., 2017) is a deep neural network for ads and recommendation based on the wide & deep structure; we use second-order feature interactions of the reading history and candidate news, and the input of the deep component is the same as for LR. DIN (Zhou et al., 2018) is a deep interest network for ads and recommendation based on the attention mechanism; we use the news content for news representations.
TANR (Wu et al., 2019a) is a state-of-the-art deep news recommendation model using an attention network to learn the user representation. We adopt the same news encoder and negative sampling as TrNews.

Results
We make several observations from the results of the different recommendation methods, shown in Table 2. First, considering that breaking and headline news articles are read by almost every user, the POP method achieves competitive performance in terms of NDCG and MRR, since it ranks popular news higher than other news. Second, the neural methods are generally better than the traditional, shallow LR method in terms of NDCG, MRR, and AUC on the four subdatasets. This may be because neural networks can learn the nonlinear, complex relations between the user and the candidate news needed to capture user interests and news semantics. Since the neural representations of the user and candidate news are fed as input to LR, it achieves competitive performance on the MIND data. Finally, the proposed TrNews model achieves the best performance, with a large margin over all baselines in terms of HR, NDCG, and MRR, and an improvement in terms of AUC. This validates the necessity of accounting for the heterogeneity of user interests and word distributions across domains, and shows both that the base network is an effective architecture for news recommendation and that the translator effectively enables knowledge transfer from the source domain to the target domain. In more detail, it is inferior to train a global model on the mixed source and target examples and then use this global model to predict user preferences on the target domain, as the baselines do; it is better to train source and target networks on the source and target domains, respectively, and then learn a mapping between them, as TrNews does.

Table 3: Transfer-learning strategies of compared approaches.
Approach | Transfer strategy | Formulation
CST (Pan et al., 2010) | Identity mapping | φ_T(u) = φ_S(u)
TCB, DDTCDR (Li and Tuzhilin, 2020) | Linear mapping | φ_T(u) = H φ_S(u) (H is orthogonal in DDTCDR)
EMCDR (Man et al., 2017) | Nonlinear mapping | φ_T(u) = MLP(φ_S(u))

Comparing different transfer strategies
In this section, we demonstrate the effectiveness of the translator-based transfer-learning strategy.
Baselines We replace the translator of TrNews with the transfer-learning strategies of the baseline methods, as summarized in Table 3. All baselines are state-of-the-art recommenders capable of recommending news to cold-start users. Note that the compared transfer-learning methods are upgraded from their original versions: we strengthen them by using the neural attention architecture as the base component. In their original versions, CST and TCB use matrix factorization (MF) while DDTCDR and EMCDR use a multilayer perceptron; the neural attention architecture has shown superior performance over MF and MLP in the literature (Zhou et al., 2018; Wu et al., 2019a). As a result, we believe the improvement would be larger if we compared with their original versions, but that would obviously be unfair.
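The three strategy families in Table 3 differ only in the mapping applied to a user's source representation; the sketch below contrasts them side by side, with random matrices and MLP weights standing in for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
phi_s = rng.normal(size=d)  # a user's source representation phi_S(u)

# Identity mapping (CST-style): reuse the source representation as-is.
identity = lambda x: x

# Linear mapping (TCB/DDTCDR-style): a translation matrix H.
H = rng.normal(scale=0.1, size=(d, d))
linear = lambda x: H @ x

# Nonlinear mapping (EMCDR-style): an MLP over the source representation.
W1 = rng.normal(scale=0.1, size=(32, d))
W2 = rng.normal(scale=0.1, size=(d, 32))
nonlinear = lambda x: W2 @ np.tanh(W1 @ x)

mapped = {name: f(phi_s) for name, f in
          [("identity", identity), ("linear", linear), ("nonlinear", nonlinear)]}
```

The TrNews translator generalizes all three: with a linear encoder/decoder and fixed weights it reduces to the identity or linear case, while its default form is a nonlinear encoder-decoder mapping.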

Results
We make several observations from the results of the different transfer-learning strategies, shown in Table 4. First, the identity-mapping strategy (CST) is generally inferior to the linear (TCB and DDTCDR) and nonlinear (EMCDR and TrNews) strategies. CST directly transfers the source knowledge to the target domain without adaptation and hence suffers from the heterogeneity of user interests and word distributions across domains. Second, the nonlinear strategy of EMCDR is inferior to the linear strategy of TCB in terms of MRR and AUC on the two smaller NY and FL datasets. This is probably because EMCDR increases the model complexity by introducing two large fully-connected layers in its MLP component. In contrast, our translator is based on a small-waist, autoencoder-like architecture and hence can resist overfitting to some extent. Finally, compared with the other four transfer methods, our translator achieves the best performance in terms of NDCG, MRR, and AUC on the two smaller NY and FL datasets, competitive performance on the two larger TX and CA datasets, and the best performance in terms of HR and AUC on the MIND dataset. These results validate that our translator is a general and effective transfer-learning strategy that accurately captures diverse user interests during knowledge transfer for unseen users in cross-domain news recommendation.

Analysis
Benefit of knowledge transfer We vary the percentage of shared users used to train the translator (see Eq. (1)) over {90%, 70%, 50%, 30%}. We compare with the naive transfer strategy of CST, i.e., direct transfer without adaptation. The results on the New York dataset are shown in Figure 3. We can see that it is beneficial to learn an adaptive mapping during knowledge transfer even when only limited aligned examples are available to train the translator. TrNews improves over CST by a relative 0.82%, 0.77%, 0.67%, and 0.64% in HR@10 when varying the percentage over {90%, 70%, 50%, 30%}, respectively. Thus the more aligned examples the translator has, the more benefit it achieves.

Impact of sharing word embeddings We investigate the benefit of sharing word embeddings between the source and target domains. There is a word embedding matrix for each domain, and we share the columns whose corresponding words occur in both domains. Taking the New York dataset as an example, the intersection of the two vocabularies contains 11,291 words while the union contains 50,263. From the results in Table 5 we can see that it is beneficial to share the word embeddings even when only 22.5% of the words are shared.

Impact of alternating training We adopt an alternating training strategy between training the two (source & target) networks and training the translator in our experiments. In this section, we compare this alternating strategy with a separate strategy which first trains the two networks and then trains the translator after their training is complete. That is, the training pairs of user representations for the translator are not generated on-the-fly during the training of the source and target networks but generated only once after their training finishes. From the results in Table 6, we see that the alternating strategy works slightly better.
This is probably because the aligned data between domains is limited and the alternating strategy increases the number of training pairs.

Impact of two-stage learning We adopt a two-stage model learning procedure: training the two (source & target) networks, and then training the translator. In this section, we compare this two-stage learning with end-to-end learning, which jointly trains the two networks and the translator; in that case, the parameters of the translator depend on the word embedding matrix and on the parameters of the user encoder. From the results in Table 7, we see that the two-stage learning works slightly better. This is probably because the aligned data between domains is too limited to reliably update parameters outside the translator.

Impact of the length of the history Since we generate training examples by sliding over each user's whole reading history, the length of the reading history is a key parameter influencing the performance of TrNews. We investigate its effect by varying it over {3, 5, 10, 15, 20}. The results on the New York dataset are shown in Figure 4a. We observe that increasing the size of the sliding window is sometimes harmful to performance, and TrNews achieves good results with length 10. This is probably due to the freshness of news and the dynamics of user interests; that is, the latest history matters more in general. Also, increasing the input length rapidly increases the training time, which is 58, 83, 143, 174, and 215 seconds per iteration when varying the length over {3, 5, 10, 15, 20}, respectively.

Impact of the embedding size In this section, we evaluate how different choices of a key hyperparameter affect the performance of TrNews. Except for the parameter being analyzed, all other parameters remain the same.
Since we compute the news and user representations from the content words, the word embedding size is a key parameter influencing the representations of words, users, and news articles, and hence the performance of TrNews. We investigate its effect by varying it over {32, 64, 100, 128, 200}. The results on the New York dataset are shown in Figure 4b. We observe that increasing the embedding size is generally not harmful to performance until 200, and TrNews achieves good results with embedding size 128. Increasing it to 200 harms performance slightly since the model complexity also increases.
Optimization performance and loss We show the optimization performance and loss over iterations on the New York dataset in Figure 5. With more iterations, the training losses gradually decrease and the recommendation performance improves accordingly. The most effective updates occur in the first 15 iterations, and performance gradually improves until 30 iterations; after that, TrNews is relatively stable. For the training time, TrNews spends 143 seconds per iteration; as a reference, DIN takes 134s and TCB takes 139s, which indicates that the training cost of TrNews is comparable to the baselines. The test time is 150s. The experimental environment is TensorFlow 1.5.0 with Python 3.6 on Linux CentOS 7, with an Nvidia TITAN Xp GPU and CUDA V7.0.27.

Examining user profiles One advantage of TrNews is that it can explain which article in a user's history matters most for a candidate article, using the attention weights in the user encoder module. Table 8 shows an example of the interactions between a user's history articles No. 0-9 and a candidate article No. 10, i.e., the user reads the candidate article after reading these ten historical articles. We can see that the latest three articles matter the most, since user interests tend to remain stable over a short period. The oldest two articles, however, also have some impact on the candidate article, reflecting that user interests mix with a long-term component. TrNews can capture these subtle short- and long-term user interests.

Conclusion
We investigate cross-domain news recommendation via transfer learning. Experiments on real-world datasets demonstrate the necessity of tackling the heterogeneity of user interests and word distributions across domains. Our TrNews model and its translator component effectively transfer knowledge from the source network to the target network. We also show that it is beneficial to learn a mapping from the source domain to the target domain even when only a small number of aligned examples is available. In future work, we will focus on preserving the privacy of the source domain when transferring its knowledge.