Denoising Neural Network for News Recommendation with Positive and Negative Implicit Feedback

News recommendation differs from movie or e-commerce recommendation because people usually do not rate news articles. User feedback for news is therefore almost always implicit (click behavior, reading time, etc.), and implicit feedback is inevitably noisy. On one hand, a user may exit immediately after clicking a news article because he dislikes its content, leaving noise in his positive implicit feedback; on the other hand, a user may be shown multiple interesting articles at the same time and click only one of them, leaving noise in his negative implicit feedback. The two kinds of implicit feedback complement each other: together they build a more complete picture of user preferences and can be used to reduce each other's noise. Previous works on news recommendation use only positive implicit feedback and therefore suffer from its noise. In this paper, we propose a denoising neural network for news recommendation with positive and negative implicit feedback, named DRPN. DRPN exploits both kinds of feedback for recommendation, with a module that denoises both positive and negative implicit feedback to further enhance performance. Experiments on a real-world large-scale dataset demonstrate the state-of-the-art performance of DRPN.


Introduction
Online news platforms, such as Google News and Microsoft News, have attracted a large population of users (Wu et al., 2020b). However, massive news articles emerging every day on these platforms make it difficult for users to find appealing content quickly (Wu et al., 2019b). To alleviate the information overload problem, recommender systems have become integral parts of these platforms.
A core problem in news recommendation is how to learn better representations of users and news (Hu et al., 2020b). Early works include collaborative filtering (CF) based methods (Das et al., 2007), content-based methods (IJntema et al., 2010), and hybrid methods that combine the two (De Francisci Morales et al., 2012). These methods usually suffer from the cold-start problem when exposed to sparse user-item interactions (Zhu et al., 2019). Recently, deep learning methods have been proposed to learn better user and news representations. The techniques evolve from recursive neural networks (Okura et al., 2017) and attention mechanisms (Zhu et al., 2019; Wu et al., 2019c) to graph neural networks (Wang et al., 2018a; Hu et al., 2020b,a; Qiu et al., 2022). These methods usually recommend news for users based on their historical feedback.

* This work was done when Yunfan Hu was an intern at Tencent. † Xian Wu is the corresponding author.
Implicit feedback is more commonly collected than explicit feedback for news because users usually do not rate the news. Hence, current news recommendation methods naturally use positive implicit feedback such as click behavior as the historical feedback to model user interests. However, there are gaps between positive implicit feedback and users' real preferences (Wang et al., 2018b). For example, click behaviors do not fully reflect a user's preferences: the user may exit a news article immediately after clicking it, which introduces noise into the positive feedback. Additionally, some news that a user did not click may still attract him later; ignoring such news also hurts recommendation performance. Our observation is that using both positive and negative implicit feedback can better model user interests. Moreover, positive and negative implicit feedback can help denoise each other through inter-comparison and intra-comparison: if a news story in one feedback sequence is more similar to the news in the opposite feedback sequence than to the news in its own sequence, it is very likely noise, and we can discount it when building user interests. This idea is shown in Figure 1.
In this paper, we propose the Denoising neural network for news Recommendation with Positive and Negative implicit feedback, named DRPN. It first introduces a news encoder to represent the news in two implicit feedback sequences. Then two parallel aggregators are used to extract user representations from both positive and negative historical feedback: (1) content-based aggregator, which selects the informative news in the feedback sequences to represent the user; (2) denoising aggregator, which finds and reduces the noises in the feedback sequences. In addition to the semantic information, we introduce a graph neural network to incorporate the collaborative information to further enrich the user representation. Finally, the user and candidate news representations are used to predict the clicking probability. The contributions of this paper are summarized as follows: • We propose a novel neural news recommendation approach DRPN which jointly models both positive and negative implicit feedback sequences to represent the user to improve recommendation performance.
• In DRPN, to minimize the impacts of the noises in the implicit feedback, the denoising aggregators are designed to refine the two feedback sequences and can help to further improve the recommendation performance.
• The experiments on a large-scale real-world dataset demonstrate that DRPN achieves state-of-the-art performance.
Related Works

Recommendation with Multi-type Feedback
Few works notice the noise problem in implicit feedback. Zhao et al. (2018) and Liu et al. (2020) use multiple types of feedback to improve recommendation, but they ignore the noise in the implicit feedback. Wang et al. (2018b) notices the noise problem but fails to use the meaningful semantic information in the news. Wu et al. (2020a), Xie et al. (2020), and Bian et al. (2021) use explicit feedback (such as reading time and like/dislike behaviors) to help denoise the implicit feedback; however, explicit feedback is harder to collect than implicit feedback. Differently, DRPN depends only on implicit feedback (click and non-click behaviors) to perform denoising and better model user preferences.

Graph Neural Network
Recently, graph neural networks (GNN) have received wide attention in many fields (Wu et al., 2020c). Convolutional GNNs can learn powerful node representations by aggregating neighbors' features. Some works have attempted to leverage graph information to enhance representation learning for news recommendation with GNNs. (Wang et al., 2018a) uses entities in news to build a knowledge graph and uses the entity embeddings to improve model performance. (Ge et al., 2020) combines one- and two-hop neighbor news and users to enrich the representations of the candidate news and the user, respectively. However, these methods also depend on positive implicit feedback to model user representations and ignore the noise problem.

Problem Formulation
The news recommendation problem in our paper can be formulated as follows. Let U and R denote the entire user set and news set. The feedback matrix of the users over the news is denoted as Z ∈ R^{l_u × l_r}, where z_{u,r} = 1 means user u gives positive implicit feedback to news r (e.g., u clicks r), z_{u,r} = −1 means user u gives negative implicit feedback to news r (e.g., u sees r but ignores it), and z_{u,r} = 0 means no feedback. l_u and l_r denote the numbers of users and news, respectively. For each specific user, his historical positive feedback sequence [p_1, ..., p_{l_p}] and negative feedback sequence [n_1, ..., n_{l_n}] can be gathered from the feedback matrix Z, where p_i, n_j ∈ R.
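As a concrete toy illustration of how the two feedback sequences are gathered from Z, the numpy sketch below splits one user's row of the matrix; `gather_feedback` is a hypothetical helper, not part of DRPN's released code.

```python
import numpy as np

def gather_feedback(Z, u):
    """Split user u's row of the feedback matrix Z into a positive
    (z = 1) and a negative (z = -1) feedback sequence of news indices."""
    row = Z[u]
    pos = np.flatnonzero(row == 1).tolist()
    neg = np.flatnonzero(row == -1).tolist()
    return pos, neg

# Toy matrix: 2 users x 5 news articles.
Z = np.array([[1, -1, 0, 1, 0],
              [0,  1, -1, 0, -1]])
pos, neg = gather_feedback(Z, 0)  # pos = [0, 3], neg = [1]
```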
Given the feedback matrix Z, the goal is to train a model M (i.e., DRPN). For each new pair of a user and a candidate news (u ∈ U, r ∈ R), we can use M to estimate the probability that u would click r.

Figure 2 shows the architecture of DRPN. It first employs the title encoder and the ID embedding layer to represent all news in the two feedback sequences and the candidate news. Then two separate encoders extract the user's semantic interest and collaborative interest information from both positive and negative implicit feedback sequences. Next, two fusion nets combine the multiple interest representations to represent the user. Finally, we use the user and candidate news representations to estimate the clicking probability. We detail each component in the following subsections.

Input
The inputs of DRPN contain six parts: the titles of the positive feedback sequence [p^t_1, ..., p^t_{l_p}], the titles of the negative feedback sequence [n^t_1, ..., n^t_{l_n}], the candidate news title r^t_c, the IDs of the positive feedback sequence [p^o_1, ..., p^o_{l_p}], the IDs of the negative feedback sequence [n^o_1, ..., n^o_{l_n}], and the candidate news ID r^o_c. For each news title t, we convert each of its words w to a d-dimensional vector w via an embedding matrix E^W ∈ R^{l_w × d}, where l_w is the vocabulary size and d is the dimension of the word embeddings. The title t is thus transformed into a matrix T.
For each news ID o, we also convert it to a d-dimensional embedding vector via a separate ID embedding matrix.

Title Encoder
The title encoder extracts the sentence-level semantic representation of the news title. It contains two sub-layers. We take the title matrix T as an example to detail the encoding process. The first sub-layer is a multi-head self-attention layer, which models the contextual representation of each word. Given three input matrices Q ∈ R^{l_q × d}, K ∈ R^{l_v × d} and V ∈ R^{l_v × d}, the attention function is defined as:

Attention(Q, K, V) = softmax(QK^⊤ / √d) V.   (1)

The multi-head self-attention layer MH(·, ·, ·) further projects the input to multiple semantic subspaces and captures interaction information from multiple views:

MH(Q, K, V) = [head_1; ...; head_{l_h}], head_i = Attention(QW^Q_i, KW^K_i, VW^V_i),   (2)

where W^Q_i, W^K_i, W^V_i ∈ R^{d × d/l_h} are the parameters to learn and l_h is the number of heads.
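The scaled dot-product attention and the multi-head projection can be sketched as follows. This is a minimal numpy illustration with randomly initialized projections, not the trained encoder:

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_self_attention(T, heads, rng):
    # Project T into `heads` subspaces, attend in each, then concatenate.
    l_w, d = T.shape
    d_h = d // heads
    out = []
    for _ in range(heads):
        Wq, Wk, Wv = (rng.standard_normal((d, d_h)) / np.sqrt(d) for _ in range(3))
        out.append(attention(T @ Wq, T @ Wk, T @ Wv))
    return np.concatenate(out, axis=-1)            # back to shape (l_w, d)

rng = np.random.default_rng(0)
T = rng.standard_normal((15, 6))                   # 15 words, d = 6, 3 heads
H = multi_head_self_attention(T, heads=3, rng=rng)
```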
Moreover, we employ a residual connection and the layer normalization function LN defined in (Ba et al., 2016) to fuse the original and contextual representations: T̃ = LN(T + MH(T, T, T)).
The second sub-layer is a gated aggregation layer (Qiu et al., 2020). It selects the important words to generate an informative title representation, using a gating mechanism to decide the weight of each word. Given the contextual word representation matrix T̃, the sentence-level semantic representation t is calculated as:

α = softmax(v^⊤ tanh(W_g T̃^⊤)), t = Aggregate(T̃) = Σ_i α_i T̃_i,   (3)

where W_g ∈ R^{d′ × d} and v ∈ R^{d′} are trainable parameters. Finally, we use the title encoder to model the titles of all news in the two user feedback sequences to obtain P^t = [p^t_1, ..., p^t_{l_p}] and N^t = [n^t_1, ..., n^t_{l_n}]. For the candidate news, we also obtain its title representation r^t_c via the same title encoder.
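A minimal sketch of such a gated aggregation, assuming the softmax-weighted-sum (additive-attention) form; the parameter names `Wg` and `v` are illustrative and the weights are random:

```python
import numpy as np

def gated_aggregate(T, Wg, v):
    """Gated aggregation: score each row of T, softmax the scores,
    and return the weighted sum as a single sentence-level vector."""
    scores = np.tanh(T @ Wg.T) @ v          # one scalar score per word
    scores -= scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ T

rng = np.random.default_rng(1)
T = rng.standard_normal((15, 6))            # 15 contextual word vectors, d = 6
Wg = rng.standard_normal((4, 6))            # hidden size d' = 4
v = rng.standard_normal(4)
t = gated_aggregate(T, Wg, v)               # sentence vector of shape (6,)
```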

Semantic Interest Encoder
The titles of the news with which a user has interacted usually reflect the user's interests. Hence, we can learn user interest representations by encoding the semantic representations of these news. As shown in Figure 3, the semantic interest encoder leverages two aggregators, a content-based aggregator (CA) and a denoising aggregator (DA), to extract user preferences from both positive and negative feedback sequences.

Content-based Aggregator
Different news are not equally informative for representing a user. For example, sports news is more informative than weather news when modeling user personality, since the latter is browsed by most users. The content-based aggregator (CA) first evaluates the importance of each news article in the feedback sequence from the content view and then aggregates the important news to represent the user. It contains two sub-layers. The first is a multi-head self-attention layer, which enhances the news representations by capturing their interactions. For the positive feedback sequence P^t, the multi-head self-attention layer generates P̃^t = LN(P^t + MH(P^t, P^t, P^t)), where MH is defined in Eq.(2) with independent parameters and LN is the layer normalization function.
The second sub-layer is a gated aggregation layer with the same structure as the one defined in Eq.(3). For P̃^t, it selects the more informative news to generate the user representation p^t_s = Aggregate(P̃^t). We also apply the content-based aggregator to the negative feedback sequence N^t to generate another user representation, n^t_s.

Denoising Aggregator
The denoising aggregator performs what we call a refining operation, which mitigates the impact of noise in the feedback when modeling user interests. Intuitively, if a news article clicked by the user is more semantically relevant to the news in the positive feedback sequence, it more likely reflects the user's true preference; if it is more semantically relevant to the news in the negative feedback sequence, it is more likely noise for representing the user's interest. As shown in Figure 3, for each news article in the positive feedback sequence, we conduct intra-comparisons with the news in the positive sequence and inter-comparisons with the news in the negative sequence to decide its weight when representing the user. This module contains three sub-layers. The first sub-layer is an intra-attention layer: for news p^t_j ∈ P^t, this layer uses p^t_j as the query to aggregate all news in P^t except p^t_j via the attention mechanism, obtaining the sequence-level representation p̃^t_j. The second sub-layer is an inter-attention layer: it uses p^t_j as the query to aggregate its relevant news in the negative feedback sequence N^t via the attention mechanism, obtaining ñ^t_j.
The third sub-layer is a gated aggregation layer. The weight of news p^t_j is decided by the semantic similarities between p^t_j and the two sequence-level representations p̃^t_j and ñ^t_j:

α_j = softmax_j( w_p · sim(p^t_j, p̃^t_j) + w_n · sim(p^t_j, ñ^t_j) + γ ),

where w_p, w_n ∈ R and γ are learnable parameters. Then, this layer aggregates all news according to their weights to obtain the denoised representation p^t_h = Σ_{j=1}^{l_p} α_j p^t_j. For the negative feedback sequence N^t, we take a dual denoising process to obtain its final representation n^t_h.
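The intra/inter-comparison weighting can be sketched as follows. This toy version fixes the similarity-difference form (a news item's weight grows with its similarity to the intra-sequence summary and shrinks with its similarity to the inter-sequence summary), which is an illustrative assumption rather than DRPN's exact parameterization:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def denoise_weights(P, N, gamma=1.0):
    """For each positive news vector P[j]: build an intra-sequence summary
    from the other positives and an inter-sequence summary from the
    negatives, then up-weight P[j] when it is closer to the former."""
    intra, inter = [], []
    for j in range(len(P)):
        others = np.delete(P, j, axis=0)
        intra.append(softmax(P[j] @ others.T) @ others)   # p~_j
        inter.append(softmax(P[j] @ N.T) @ N)             # n~_j
    intra, inter = np.array(intra), np.array(inter)
    cos = lambda A, B: (A * B).sum(-1) / (
        np.linalg.norm(A, axis=-1) * np.linalg.norm(B, axis=-1))
    alpha = softmax(gamma * (cos(P, intra) - cos(P, inter)))
    return alpha, alpha @ P         # news weights and denoised user vector

P = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])   # positive sequence
N = np.array([[0.0, 1.0], [0.0, 0.9]])               # negative sequence
alpha, u = denoise_weights(P, N)
# The third "positive" resembles the negatives, so it gets the lowest weight.
```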

Graph Neural Network
If two news articles r_i and r_j are co-clicked by user u_1, and r_i is also clicked by u_2, then u_2 may also prefer r_j, following the idea of collaborative filtering. Hence, we can further enrich the user interest representations by modeling this collaborative information. Similar to a knowledge graph, we build a collaborative graph G = {(r_i, r_j) | r_i, r_j ∈ R} over the news set R based on the co-clicking relationships in the historical feedback matrix Z: (r_i, r_j) ∈ G indicates that the two news are neighbors in the graph, i.e., they have been clicked by the same user. To incorporate the collaborative information, we employ the graph transformer network (Shi et al., 2021) to model the news in the user feedback sequences.
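Building such a co-click graph from per-user click histories can be sketched as below; `build_coclick_graph` is a hypothetical preprocessing helper:

```python
from collections import defaultdict
from itertools import combinations

def build_coclick_graph(click_histories):
    """Connect two news ids with an (undirected) edge whenever some
    user clicked both of them."""
    neighbors = defaultdict(set)
    for history in click_histories:
        for ri, rj in combinations(set(history), 2):
            neighbors[ri].add(rj)
            neighbors[rj].add(ri)
    return neighbors

# u1 co-clicked r1 and r2; u2 co-clicked r1 and r3.
G = build_coclick_graph([["r1", "r2"], ["r1", "r3"]])
```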
First, for each news node r^o in P^o and N^o, we compute the attention weights between it and its neighbors in G, where N(r^o) denotes the neighbor set of node r^o. Taking a neighbor r^o_k (k ∈ N(r^o)) as an example, the attention weight between r^o and r^o_k at the m-th head is calculated as:

a^m_k = softmax_k( (r^o W^g_{m,q})(r^o_k W^g_{m,k})^⊤ / √d̃ ),

where the W^g_{m,*} ∈ R^{d × d/l′_h} are learnable parameters, l′_h is the number of heads, and d̃ = d/l′_h. Next, each news node aggregates the information of its neighbors from multiple heads according to the attention weights. For node r^o, the representation aggregated from its neighbors is:

r̃^o = ∥_{m=1}^{l′_h} Σ_{k ∈ N(r^o)} a^m_k (r^o_k W^g_{m,v}),

where the W^g_{m,v} are trainable parameters and ∥ denotes the concatenation of the l′_h heads.
Finally, we update the representation of each node by fusing its aggregated and original representations with a gate:

g = σ([r^o; r̃^o] W^f_1), r̂^o = g ⊙ r^o + (1 − g) ⊙ [r^o; r̃^o] W^f_2,

where W^f_1, W^f_2 ∈ R^{2d × d} are learnable parameters, ⊙ denotes element-wise multiplication, and σ is the sigmoid function.
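A single-head sketch of one node update (neighbor attention followed by gated fusion of the original and aggregated vectors). All weights are random, and the exact gate form is an illustrative assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_update(r, nbrs, Wq, Wk, Wv, Wf1, Wf2):
    """Attend over neighbor vectors, aggregate them, then fuse the
    aggregated vector with the node's own vector via a learned gate."""
    d = r.shape[0]
    scores = (r @ Wq) @ (nbrs @ Wk).T / np.sqrt(d)   # attention logits
    scores -= scores.max()
    a = np.exp(scores) / np.exp(scores).sum()
    agg = a @ (nbrs @ Wv)                            # neighbor aggregation
    cat = np.concatenate([r, agg])                   # [r ; r_agg], shape (2d,)
    g = sigmoid(cat @ Wf1)                           # element-wise gate, shape (d,)
    return g * r + (1.0 - g) * (cat @ Wf2)

rng = np.random.default_rng(2)
d = 4
r = rng.standard_normal(d)
nbrs = rng.standard_normal((5, d))                   # 5 neighbor nodes
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Wf1, Wf2 = (rng.standard_normal((2 * d, d)) for _ in range(2))
out = graph_update(r, nbrs, Wq, Wk, Wv, Wf1, Wf2)    # updated node, shape (d,)
```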
We can use this graph neural network to encode all news in the user's positive and negative feedback sequences to obtain P^o = [p^o_1, ..., p^o_{l_p}] and N^o = [n^o_1, ..., n^o_{l_n}].

Collaborative Interest Encoder
This module models user interests by aggregating the representations of the two feedback sequences encoded by the graph neural network layer, which incorporate the collaborative information. The structure of the collaborative interest encoder is similar to that of the semantic interest encoder: it also contains a content-based aggregator and a denoising aggregator. The denoising aggregator has the same structure as the one in the semantic interest encoder. The only structural difference between the two content-based aggregators is that the one in the collaborative interest encoder has no multi-head self-attention operation, because the context information has already been propagated by the graph neural network, which has a similar effect to multi-head self-attention.
The inputs of this encoder are the positive sequence representation P o and the negative sequence representation N o . The content-based aggregator will generate two user representations, p o s and n o s , based on two sequence representations, respectively. Similarly, the denoising aggregator will denoise two sequences and generate two user representations p o h and n o h .

Fusion Net
There are two fusion nets as shown in Figure 2. They are used to fuse multiple user interest representations extracted by two interest encoders to form a comprehensive user representation. For different user-candidate news (u, r) pairs, the fusion net dynamically allocates different weights for different interest representations. Two fusion nets have similar structures but different parameters. We take the one for the semantic interest encoder as an example to detail the fusion process.
The fusion net first represents the (u, r) pair. To calculate the weights for the output representations of the two encoders independently of those encoders, it represents (u, r) with the outputs of the title encoder: f^t = [u^t_f; r^t_c], where u^t_f = Aggregate([P^t | N^t]). P^t, N^t and r^t_c are the title representations of the news in the user's positive and negative feedback sequences and of the candidate news, all extracted by the title encoder.
Then, this module leverages four different fully connected layers to calculate the weights for the four representations extracted by the semantic interest encoder (i.e., p^t_s, n^t_s, p^t_h and n^t_h). For example, the weight of p^t_s is calculated by

β^p_s = w^⊤_2 tanh(W_1 f^t + b_1) + b_2,   (9)

where W_1, b_1, w_2 and b_2 ∈ R are learnable parameters. The weights β^n_s, β^p_h and β^n_h of the representations n^t_s, p^t_h and n^t_h are calculated in the same way as Eq.(9). Finally, the user content-view representation is calculated by u^t = β^p_s p^t_s + β^n_s n^t_s + β^p_h p^t_h + β^n_h n^t_h. The other fusion net fuses the four interest representations extracted by the collaborative interest encoder and has a similar structure; the only difference is that it uses the outputs of the news ID embedding layer to represent the (u, r) pair.
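The dynamic weighting performed by a fusion net can be sketched as below. Each interest vector gets a scalar weight from its own tiny two-layer scoring net applied to the pair representation, in the spirit of Eq.(9); the exact nonlinearity and parameter names are assumptions:

```python
import numpy as np

def fuse(interests, f, Ws, bs, ws, cs):
    """Weight each interest vector by a scalar computed from the pair
    representation f with its own tiny two-layer net, then sum."""
    u = np.zeros_like(interests[0])
    for x, W, b, w, c in zip(interests, Ws, bs, ws, cs):
        beta = np.tanh(f @ W + b) @ w + c   # scalar weight for this interest
        u = u + beta * x
    return u

# Degenerate toy setup: zeroed hidden layers make each weight equal its
# bias c, so the fused vector is just 2 * e1 + 3 * e2.
df, h = 6, 4
f = np.ones(df)
interests = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
Ws, bs, ws = [np.zeros((df, h))] * 2, [np.zeros(h)] * 2, [np.zeros(h)] * 2
cs = [2.0, 3.0]
u = fuse(interests, f, Ws, bs, ws, cs)   # -> array([2., 3.])
```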

Prediction
Following (Wu et al., 2019c), the clicking probability score ŷ is computed from the inner products of the user representations and the candidate news representations: ŷ = u^{t⊤} r^t_c + u^{o⊤} r^o_c, where u^{t⊤} r^t_c is the score calculated from the title information and u^{o⊤} r^o_c is the score calculated from the collaborative information.

Training
Following (Wu et al., 2019c), for each positive sample we randomly select l_k negative samples from the same user to construct an (l_k + 1)-way classification task. The output of DRPN for one classification sample is [ŷ^+, ŷ^−_1, ..., ŷ^−_{l_k}], where ŷ^+ denotes the clicking probability score of the positive sample and the rest denote the scores of the l_k negative samples. We define the training loss (to be minimized) as:

L = − Σ_{i ∈ P} log( exp(ŷ^+_i) / (exp(ŷ^+_i) + Σ_{k=1}^{l_k} exp(ŷ^−_{i,k})) ),

where P denotes the set of positive samples.
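This is the standard sampled-softmax (listwise cross-entropy) objective over each group of one positive and l_k negative scores; a numerically stable numpy sketch:

```python
import numpy as np

def listwise_loss(scores):
    """scores: array of shape (batch, 1 + l_k); column 0 holds the positive
    sample's score. Softmax each row, then take the negative log-likelihood
    of the positive, averaged over the batch."""
    s = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_prob_pos = s[:, 0] - np.log(np.exp(s).sum(axis=1))
    return -log_prob_pos.mean()

scores = np.array([[5.0, 0.0, 0.0, 0.0, 0.0]])   # confident, correct ranking
loss = listwise_loss(scores)                      # small loss
```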

Computation Complexity
The time complexity of the title encoder is O(L²d + Ld²), where L is the title length and d is the embedding size. The time complexity of each interest encoder is O((l_p + l_n)d² + (l²_p + l²_n + (l_p + l_n)²)d), where l_p and l_n are the lengths of the positive and negative feedback sequences. The time complexity of the GNN is O(|G|d), where |G| denotes the number of edges in the collaborative graph. Hence, the overall time cost is O((l_p + l_n)(Ld² + L²d) + (l²_p + l²_n + (l_p + l_n)² + |G|)d). During the inference phase, we can compute the news representations in advance, reducing the computation complexity to O((l²_p + l²_n + (l_p + l_n)²)d).

Dataset
There is no off-the-shelf dataset in which the user profile includes both positive and negative historical feedback sequences. Therefore, we rebuild one from the public MIND dataset (whose original user profiles contain only positive feedback) to conduct the experiments. The original MIND dataset contains user impression logs. An impression log records the news displayed to a user when visiting the news website homepage at a specific time, together with the user's click behaviors on the displayed list. We rebuild the dataset from MIND's impression logs as follows: (1) User profiles: from the impression logs of the first 5 days of the original training set, we add the news a user has seen but did not click to his negative feedback sequence and the news he clicked to his positive feedback sequence, so that each user profile includes both sequences; (2) Training set: the impression logs of the 6th day of the original training set; (3) Validation set: the first 10% of the original validation set's impression logs in chronological order; (4) Testing set: the remaining 90% of the original validation set's impression logs. The training, validation, and testing sets share the user profiles built in Step (1). Since the user profiles are built only from logs that precede Steps (2)-(4), there is no label leakage into the validation and testing sets. Moreover, as in the original MIND dataset, 44.6% of validation-set users and 48.7% of test-set users do not appear in the rebuilt training set. Table 1 shows statistics of the rebuilt dataset.
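Step (1) of the rebuild, splitting each impression into positive and negative feedback, can be sketched as follows (the log format and `build_profiles` helper are hypothetical):

```python
from collections import defaultdict

def build_profiles(impressions):
    """impressions: iterable of (user, [(news_id, clicked)]) records.
    Clicked news goes to the user's positive sequence; shown-but-unclicked
    news goes to the negative sequence."""
    pos, neg = defaultdict(list), defaultdict(list)
    for user, shown in impressions:
        for news, clicked in shown:
            (pos if clicked else neg)[user].append(news)
    return pos, neg

logs = [("u1", [("n1", True), ("n2", False), ("n3", False)])]
pos, neg = build_profiles(logs)   # pos["u1"] = ["n1"]; neg["u1"] = ["n2", "n3"]
```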

Baseline Approaches and Metrics
We evaluate the performance of DRPN by comparing it with several baseline methods: (1) LibFM (Rendle, 2012), a factorization machine (FM); (2) DeepFM (Guo et al., 2017), which combines an FM with neural networks; (3) DKN (Wang et al., 2018a), which uses a CNN to fuse entity and word embeddings to learn news representations; (4) LSTUR, which uses a GRU to model short- and long-term interests from the click history; (5) NPA (Wu et al., 2019b), which introduces an attention mechanism to select important words and news; (6) DEERS (Zhao et al., 2018), which uses GRUs to encode positive and negative feedback sequences; (7) DFN (Xie et al., 2020), a factorization-machine based network which uses transformers to encode both positive and negative feedback sequences; (8) GERL (Ge et al., 2020), which constructs a user-news graph to enhance performance; (9) NAML (Wu et al., 2019a), which uses multi-view learning to aggregate different kinds of information to represent news; (10) NRMS (Wu et al., 2019c), which uses multi-head self-attention to learn news and user representations; (11) NAML + TCE, which incorporates the denoising training strategy TCE into NAML; (12) NRMS + TCE, which improves NRMS with TCE.

Implementation Details
For DRPN, the representation dimension d is set to 300. We use GloVe.840B.300d (Pennington et al., 2014) as the pre-trained word embeddings. The maximum title length is set to 15. The lengths of the feedback sequences, l_p and l_n, are set to 30 and 60, respectively. Padding and truncation are used to keep sequence and word numbers fixed. The head number l_h in multi-head self-attention is set to 6. The hidden size d′ in the gated aggregation layer is set to 200. The head number l′_h in the graph neural network is set to 2. The negative sampling ratio l_k is set to 4. When preparing data for the graph neural network, we only input the sub-graph containing the nodes in the user feedback sequences, and we keep at most 5 neighbor nodes for each node r in the user feedback sequences, choosing those most frequently co-clicked with r. We have released the source code at https://github.com/chungdz/DRPN.
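The frequency-based neighbor truncation can be sketched as below; `top_coclick_neighbors` is a hypothetical preprocessing helper:

```python
from collections import Counter, defaultdict
from itertools import combinations

def top_coclick_neighbors(click_histories, max_neighbors=5):
    """Count how often each pair of news is co-clicked across users, then
    keep only the `max_neighbors` most frequent partners of each news."""
    counts = defaultdict(Counter)
    for history in click_histories:
        for ri, rj in combinations(sorted(set(history)), 2):
            counts[ri][rj] += 1
            counts[rj][ri] += 1
    return {r: [n for n, _ in c.most_common(max_neighbors)]
            for r, c in counts.items()}

# "a" is co-clicked with "b" twice but with "c" only once.
nbrs = top_coclick_neighbors([["a", "b"], ["a", "b"], ["a", "c"]],
                             max_neighbors=1)
```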
For NRMS, DKN, LSTUR, NPA, and NAML, we use the official code and settings 2 . For others, we reimplement them and set their parameters based on the experimental setting strategies reported by their papers.
For fair comparisons, all methods only use the news ID, title, category and subcategory as features. The validation set was used for tuning hyperparameters and the final performance comparison was conducted on the test set.

Performance Evaluation
The experimental results of all models are summarized in Table 2. We make the following observations. First, our proposed DRPN achieves the best performance among all compared methods. Second, the deep neural methods perform better than the feature-based methods (e.g., LibFM and DeepFM); this improvement should be attributed to better news representation methods. Among the deep neural baselines, NRMS+TCE achieves the best performance by using two-level multi-head self-attention to learn user representations and using TCE to denoise the negative samples. Third, among the two baselines that use both positive and negative feedback, DFN performs worse than DEERS. The reason may be that the original DFN depends on explicit feedback, while the experimental dataset only contains implicit feedback. Compared with NAML, DEERS achieves competitive performance even though its news encoder is a simple pooling layer, which also demonstrates the value of negative implicit feedback.

[Figure 4: categories, titles, and denoising attention weights of the sampled user's feedback news and recommended candidate news; discussed in the Case Study.]

Ablation Study
To quantify the individual contribution of each module, we run an ablation study with the following variants of DRPN: (1) DRPN-D, which removes the denoising aggregators; (2) DRPN-G, which removes the collaborative graph part; (3) DRPN-DG, which removes both the collaborative graph part and the denoising aggregators; (4) DRPN-N, which only uses the positive feedback; (5) DRPN-P, which only uses the negative feedback.
The results are shown in Table 3. First, DRPN-D and DRPN-G perform worse than DRPN, proving the effectiveness of the designed denoising module and the collaborative graph. Second, the results of DRPN-N and DRPN-P indicate the effectiveness of the negative and positive feedback, respectively. Third, even without the deliberately designed denoising and graph modules, DRPN-DG achieves competitive performance against the strongest baseline NRMS+TCE simply by using both positive and negative implicit feedback. This further proves the effectiveness of the negative feedback.

Case Study
To intuitively illustrate the effectiveness of the denoising aggregator, we sample a user and visualize the attention weights of his historical feedback in the denoising aggregator of the semantic interest encoder. The upper part of Figure 4 shows the attention weights, with the news ranked in descending order of weight. In the positive feedback sequence, the top 4 news are about sports and weather, while the last 4 are about music, movies, finance, and lifestyle. Meanwhile, in the negative feedback sequence, the top 4 news are about finance, music, politics, and lifestyle, and the last 4 are all about sports. This indicates that the denoising aggregator believes the user likes sports and dislikes topics such as finance, music, movies, politics, and lifestyle. As shown in the lower part of Figure 4, based on the predicted user preferences, DRPN prefers to recommend sports news for this user. Moreover, in the validation data, this user clicks the top 2 recommended news and ignores the last 2. This suggests the user preference extracted by the denoising aggregator is consistent with the user's real behaviors. In summary, the visualization indicates that the denoising module can better capture the user's real preferences by conducting inter- and intra-comparisons between the positive and negative implicit feedback sequences.

Conclusion
In this paper, we propose a novel deep neural news recommendation model, DRPN. In DRPN, we design two aggregators to extract user interests from both positive and negative implicit feedback. The content-based aggregator focuses on the content of the news representations, while the denoising aggregator mitigates the impact of the noise that commonly exists in implicit feedback. Besides the title information, DRPN also exploits collaborative information via a graph neural network to further improve recommendation performance. Experimental results on a large-scale public dataset demonstrate the state-of-the-art performance of DRPN, and the ablation and case studies show the effectiveness of the denoising module.

A.1 Limitations
In this paper, to learn better representations, our method refines the user's historical behaviors in a denoising manner. There remain some directions to further improve our approach. First, since the user profiles in the experimental dataset contain only historical behaviors and no demographic information (e.g., gender and age), our current approach does not support such features, although they are widely used in practice. Once such features are available, we can convert them to embeddings and fuse them with the interest representations produced by the two interest encoders to better represent the user. Second, news generally contains many features besides the title (such as the cover image and author information), and future work will explore how to incorporate more features to better represent the news.

A.2 Potential Risks
Our approach is based on collaborative filtering, which may lead to all recommended news being similar to what the user has already seen. This is a common problem faced by the majority of recommender systems: concentrating large amounts of similar information may narrow users' perspectives and unbalance their personal information structure (Li and Wang, 2019). Our method can be combined with rule-based or human-curated strategies (such as popularity-based recommendation) to improve recommendation diversity and alleviate this problem.