Social-aware Sparse Attention Network for Session-based Social Recommendation



Introduction
Session-based Social Recommendation (SSR) is proposed based on Session-based Recommendation (SR). Initially, SR aims to predict the next item for the current anonymous session. An anonymous session is a sequence of items clicked in a transaction, without user IDs or users' social networks. Later, with the boom of modern networks and social media, user IDs can be tracked, and it has become more common for users to have a social network. Therefore, SSR has been proposed, which aims to capture user preferences based on their social networks and historical sessions to provide more personalized recommendations (Song et al., 2019).
SSR was initially proposed in DGRec (Song et al., 2019), which uses graph neural networks (GNNs) to aggregate the preferences of neighbors for each user. Afterward, SERec (Chen and Wong, 2021) proposes to use a heterogeneous graph neural network to learn user and item representations that integrate the knowledge from social networks. Since these methods take advantage of users' social networks and historical interactions, they achieve a significant performance improvement over previous conventional SR methods (Li et al., 2017; Liu et al., 2018).
However, most existing SSR models suffer from two defects: (a) They follow a strong underlying assumption that all of a user's friends and interactions can influence the user's preferences. Therefore, they aggregate the features of all friends and historical interactions to model the user and do not screen out irrelevant items when modeling the current session. However, user preferences are mainly affected by several close friends and key interactions. Moreover, it has been shown in the literature that irrelevant items can interfere with session modeling (Yuan et al., 2021). In other words, aggregating all information without filtering can lead to bias in modeling user preferences. (b) When modeling the current session, they do not make full use of the user's personalized information. For example, SERec (Chen and Wong, 2021) only concatenates user representations at the last stage of model inference, which limits the expressiveness of the user's personalized knowledge.
To tackle these issues, we propose to eliminate low-confidence information and incorporate personalized knowledge into the modeling of the current session. Hence, we put forward a novel Social-aware Sparse Attention Network for SSR, abbreviated as SSAN. It mainly consists of the Heterogeneous Graph Embedding (HGE) module and the Social-aware Encoder-decoder Network (SEN) module. The HGE module aims to model user preferences based on users' social relationships and historical interactions. In the HGE module, we use a heterogeneous graph neural network, which focuses more on close friends and important historical interactions, to enhance user/item representations. It can alleviate the impact of invalid social relationships and useless historical interactions on user modeling. The SEN module aims to model the current session based on user preferences and the interactions in the current session. In the SEN module, we mine latent intents from the interactions and inject user preference information into the modeling process of the current session. It can alleviate the impact of unreliable interactions and take full advantage of the user's personalized knowledge. Extensive experiments on two public benchmark datasets demonstrate the superiority of SSAN. Further ablation experiments demonstrate the effectiveness of the HGE and SEN modules.
To summarize, we mainly make the following contributions:
• Mine latent key information. We construct the HGE module that concentrates on close friends and key historical interactions to enhance user/item representations. Besides, we use the sparse transformation function to mitigate the impact of irrelevant interaction items.
• Integrate personalized knowledge. We devise the SEN module to closely integrate user preference information and make more personalized recommendations for the current session.
• Excellent performance.We perform extensive comparisons with recent SSR and SR methods on two public real-world datasets, demonstrating the superiority of SSAN.

Related Work
In this section, we review the existing work related to our research, which mainly consists of the following three subsections.

Session-based Recommendation
Session-based Recommendation can be mainly divided into Anonymous Session-based Recommendation (ASR) and Personalized Session-based Recommendation (PSR).

Anonymous Session-based Recommendation (ASR)
Let $I = \{i_1, i_2, \ldots, i_N\}$ denote the set of items, where $N$ is the total number of items. A session is represented as a list $S = [i_{s,1}, i_{s,2}, \ldots, i_{s,t}]$ ordered by timestamp, where $i_{s,k} \in I$ $(1 \le k \le t)$ represents an item interacted with by the anonymous user. The task of ASR is to predict the next item $i_{s,t+1}$ for an anonymous session $S$.
Early ASR studies (Rendle et al., 2010) focused on extracting sequence information from session data using Markov chains. Following these works, GRU4Rec (Hidasi et al., 2015) is the first study that formally defines ASR and proposes a multi-layered GRU model. NextItNet (Yuan et al., 2019) applies dilated convolutional layers to model local item dependence. Recently, GNNs have drawn increasing attention in various tasks, including ASR. SR-GNN (Wu et al., 2019) represents sessions as directed subgraphs and applies a GNN to capture item transitions. GCE-GNN (Wang et al., 2020) exploits global-level item transitions over all sessions to learn global-level contextual information. Since these methods are designed for anonymous sessions, they do not leverage the knowledge of users' social networks. Moreover, most of them ignore the randomness of user behavior and do not consider the reliability of user interactions.

Personalized Session-based Recommendation (PSR)
Let the sets of users and items be denoted by $U = \{u_1, u_2, \ldots, u_M\}$ and $I = \{i_1, i_2, \ldots, i_N\}$, respectively. The historical session set $D$ contains all sessions of each user. Let $D^u = \{S^u_1, S^u_2, \ldots, S^u_{|D^u|}\}$ represent the session set associated with user $u \in U$, where $S^u_T \in D^u$ denotes the $T$-th session of user $u$, and $S^u_T[t] \in I$ denotes the $t$-th item in session $S^u_T$. The task of PSR is to predict the next item $S^u_T[t+1]$ for session $S^u_T$. Different from ASR, PSR knows which user the sequence belongs to, so it can model and exploit the user's preferences.
In recent years, various attempts have been made for PSR. Quadrana et al. (2017) use hierarchical recurrent neural networks to capture users' evolving interests. Then, Zhang et al. (2020) explicitly model the effect of users' historical interests on the current session with an attention mechanism. Guo et al. (2019) improve the attention mechanism by applying matrix factorization to users' historical interactions. These methods leverage users' long-term interaction history to provide more personalized recommendations, but they fail to capture the impact of users' social networks.

Social Recommendation
It is a growing trend to leverage social networks to make recommendations more personalized and effective. Ma et al. (2011) regularize the latent user factors to connect users with similar latent factors and make recommendations. Zhao et al. (2014) apply matrix factorization to extract additional training instances from social networks. Wang et al. (2017) propose to distinguish strong and weak relationships and learn personalized preferences from social networks. Xiao et al. (2017) propose to model user-item interactions and recognize the social relationships of the user using transfer learning. Wang et al. (2019) maintain a heterogeneous social graph to extract social knowledge and enhance the user representations. These methods only utilize collaborative information from user-item interactions and users' social networks without considering the sequential information of interactions. Thus, they are not suitable for session-based recommendation.

Session-based Social Recommendation (SSR)
SSR is proposed to predict users' next click in the current short-term session based on social networks and historical sessions. It aims to combine the advantages of session-based recommendation and social recommendation and provide more accurate and personalized recommendations. The first SSR model is DGRec (Song et al., 2019), which uses a graph attention network to model the social influence of the user. Then, SERec (Chen and Wong, 2021) proposes to use a heterogeneous graph to process related users and items when making predictions for the current session. Unfortunately, while these models are laudable attempts to integrate social networks into session-based recommendation, they still fail to take full advantage of the user's personalized knowledge when modeling the current session. Moreover, they ignore the fact that users' preferences are influenced mainly by several close friends and key interactions.

Task Definition
Let $U = \{u_1, u_2, \ldots, u_M\}$ and $I = \{i_1, i_2, \ldots, i_N\}$ denote the sets of users and items, respectively, where $M$ and $N$ are the total numbers of users and items. Let $D$ represent the set of all historical sessions of users, and let $D^u = \{S^u_1, S^u_2, \ldots, S^u_{|D^u|}\}$ denote the set of all sessions of user $u$, where $S^u_T$ is the $T$-th session of user $u$ and $C^u_T$ is the original embedding set corresponding to $S^u_T$. For brevity, the superscript and/or the subscript in $S^u_T$ and $C^u_T$ may be dropped if there is no ambiguity. Different from PSR, SSR has a social network for each user, which is a graph denoted as $G = (U, E)$. The node set is the user set $U$, and the edge set $E$ indicates the users' social relationships. Specifically, an edge $(u, v) \in E$ from user $u$ to user $v$ represents that $u$ is followed by $v$.
The task of SSR is to predict the next item of a new session $S^u \notin D^u$ based on the social network $G$ and the set of all previous sessions $D^u$. It can be formalized as predicting the probability of the user interacting with each item at time step $t+1$:
$$\hat{y}_{\hat{i}} = P\left(i_{t+1} = \hat{i} \mid S^u, D^u, G\right),$$
where $\hat{i} \in I$ represents the candidate item. Since a recommender usually needs to provide multiple recommendations for users, SSR recommends the top-$K$ items according to the scores.

Method
In this section, we introduce the SSAN in detail.

Sparse Transformation Function
In general, the attention mechanism uses softmax (Bridle, 1990) to convert weights into probabilities. Essentially, softmax is a mapping function
$$\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j}\exp(x_j)}.$$
However, it may assign weight to useless data because every probability is nonzero, which hurts the ability to find the relevant items. To address this, a sparse transformation method named sparsemax (Martins and Astudillo, 2016) assigns zero probability to low-scoring items:
$$\mathrm{sparsemax}(x) = \operatorname*{arg\,min}_{p \in \Delta^{N-1}} \|p - x\|^2,$$
where $x$ is the input weight vector, $p$ is the output probability vector, and $\Delta^{N-1}$ is the probability simplex. Recently, a novel transformation function, $\alpha$-entmax (Peters et al., 2019), has been proposed:
$$\alpha\text{-entmax}(x) = \operatorname*{arg\,max}_{p \in \Delta^{N-1}} p^{\top}x + H^{T}_{\alpha}(p),$$
where $H^{T}_{\alpha}$ is the Tsallis $\alpha$-entropy (Tsallis, 1988). In particular, 1-entmax equals the softmax function and 2-entmax equals sparsemax. In this paper, we replace the transformation function in the attention mechanism with $\alpha$-entmax to filter out irrelevant interactions in the current session.
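To make the sparse transformation concrete, the following is a minimal NumPy sketch of sparsemax (the 2-entmax special case), assuming a 1-D score vector; the function name and the example scores are illustrative, not part of SSAN.

```python
import numpy as np

def sparsemax(scores: np.ndarray) -> np.ndarray:
    """Euclidean projection of a score vector onto the probability simplex
    (Martins and Astudillo, 2016). Low-scoring entries get exactly zero."""
    z = np.sort(scores)[::-1]              # scores in descending order
    cumsum = np.cumsum(z)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z > cumsum           # entries that stay in the support
    k_max = k[support][-1]                 # support size
    tau = (cumsum[k_max - 1] - 1) / k_max  # threshold subtracted from scores
    return np.maximum(scores - tau, 0.0)

scores = np.array([1.2, 1.0, 0.1, -1.0])
print(sparsemax(scores))                          # [0.6, 0.4, 0.0, 0.0]
print(np.exp(scores) / np.exp(scores).sum())      # softmax keeps every entry nonzero
```

The contrast in the two printed vectors illustrates the motivation above: softmax spreads probability mass over every item, while the sparse transformation zeroes out the low-scoring ones.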

Architecture
The architecture of SSAN is depicted in Figure 1. SSAN mainly consists of the following parts:
• Heterogeneous Graph Embedding (HGE) module. In this module, we employ social networks and historical sessions to enhance the user/item representations.
• Social-aware Encoder-decoder Network (SEN) module. In this module, we integrate user preference information based on the sparse attention network to model the current session. Besides, we also mine the short-term intent of the user.
• Prediction and Optimization. In this part, we evaluate the probability of each candidate item using the final session representation.
We will introduce the above parts in detail in the rest of this section.

Heterogeneous Graph Embedding (HGE) module
To integrate users' social networks and historical sessions and enhance user/item representations, we construct the HGE module, inspired by SERec (Chen and Wong, 2021). Moreover, it can alleviate the impact of useless social relationships and invalid interactions when modeling user preferences.

Build Heterogeneous Graph
In this layer, we build a heterogeneous graph based on users' social networks and historical sessions. Formally, let $K = \{\mathcal{N}, \mathcal{E}\}$ be the heterogeneous graph. For simplicity, we use the symbols of the users $U$ and items $I$ to indicate the type of the corresponding node. $\mathcal{N} = U \cup I$ denotes the node set of the graph, consisting of all users and items involved in $D$. $\mathcal{E}$ is the edge set containing three types of directed edges, i.e., user-user edges ($UU$), user-item edges ($UI$), and item-item edges ($II$).
Specifically, a user-user edge $(u, v) \in \mathcal{E}$ exists if user $u$ is followed by user $v$; a user-item edge $(u, i) \in \mathcal{E}$ exists if user $u$ clicks item $i$ in any session; and an item-item edge $(i_1, i_2) \in \mathcal{E}$ exists if a transition from $i_1$ to $i_2$ appears in any session.
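As a sketch, the edge construction described above can be collected as follows; the data layout (a follow-edge list and a per-user list of sessions) and the function name are assumptions for illustration only.

```python
from collections import defaultdict

def build_heterogeneous_graph(follow_edges, user_sessions):
    """Collect the three edge types (UU, UI, II) from the social network
    and the historical sessions, as described in the text."""
    edges = defaultdict(set)
    # user-user edges: (u, v) if u is followed by v
    for u, v in follow_edges:
        edges["UU"].add((u, v))
    # user-item and item-item edges extracted from historical sessions
    for user, sessions in user_sessions.items():
        for session in sessions:
            for idx, item in enumerate(session):
                edges["UI"].add((user, item))                   # user clicked this item
                if idx > 0:
                    edges["II"].add((session[idx - 1], item))   # observed item transition
    return {etype: sorted(e) for etype, e in edges.items()}

graph = build_heterogeneous_graph(
    follow_edges=[("u1", "u2"), ("u2", "u3")],
    user_sessions={"u1": [["i1", "i2"], ["i2", "i3"]], "u2": [["i3", "i1"]]},
)
print(graph["II"])   # [('i1', 'i2'), ('i2', 'i3'), ('i3', 'i1')]
```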

Learn Enhanced Representation
To capture the user preference information contained in the user's social relationships and historical interactions, we apply the heterogeneous graph neural network on the graph K.
Let $R^l[v]$ denote the representation of node $v$ at layer $l$, where $v$ is a user or an item, $R^0[v] \in \mathbb{R}^d$ is the initial user/item embedding, and $d$ is the embedding size. To generate a new representation for node $v$ from layer $l-1$ to layer $l$, we first calculate the importance score $a_{uv}$ of each node $u$ connected to $v$ in the graph $K$ (Eq. 4), where $\theta_l \in \mathbb{R}^d$ is a learnable parameter, $e_l \in \mathbb{R}^d$ is the feature vector of edge $(u, v)$ with $u, v \in \mathcal{N}$, $\sigma$ denotes the sigmoid activation function, MLP indicates a multi-layer perceptron, $R^{l-1}[u]$ is the representation of node $u$ at layer $l-1$, and $\circ$ denotes element-wise multiplication. Then, we normalize the score $a_{uv}$ over the neighbors $H_u$ of node $u$ to obtain $\hat{a}_{uv}$. We argue that neighbors with low weights may not benefit the update of the representation, so we set the values $\hat{a}_{uv}$ less than $\beta$ to 0 and then renormalize the remaining weights so that $\sum_{u} \hat{a}_{uv} = 1$. Finally, we aggregate the neighbor nodes with a ReLU activation, where $\|$ denotes the concatenation operation.
In particular, in the HGE module, we use different MLPs for different edge types and different layers. After $L_{gnn}$ layers of the above process, we obtain $R = R^{L_{gnn}}[\mathcal{N}]$, which is the final enhanced user/item representation set.
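A simplified PyTorch sketch of one HGE layer, for a single node and a single edge type, is given below. The exact scoring function of Eq. (4), the edge features, and the per-edge-type MLPs are omitted; the scoring network, the β threshold, the re-normalization, and the ReLU aggregation follow the description above, but the concrete layer shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HGELayer(nn.Module):
    """Score neighbors, zero out weights below beta, re-normalize the rest,
    then aggregate and combine with the node's own representation."""
    def __init__(self, d: int, beta: float = 0.05):
        super().__init__()
        self.score_mlp = nn.Linear(2 * d, 1)   # scores a (target, neighbor) pair
        self.update = nn.Linear(2 * d, d)      # combines self state and neighbor message
        self.beta = beta

    def forward(self, h_v: torch.Tensor, h_neighbors: torch.Tensor) -> torch.Tensor:
        # h_v: (d,) target node; h_neighbors: (num_neighbors, d)
        pairs = torch.cat([h_v.expand_as(h_neighbors), h_neighbors], dim=-1)
        scores = self.score_mlp(pairs).squeeze(-1)            # (num_neighbors,)
        weights = F.softmax(scores, dim=-1)
        weights = torch.where(weights < self.beta, torch.zeros_like(weights), weights)
        weights = weights / weights.sum().clamp(min=1e-12)    # re-normalize kept neighbors
        message = (weights.unsqueeze(-1) * h_neighbors).sum(dim=0)
        return torch.relu(self.update(torch.cat([h_v, message], dim=-1)))

layer = HGELayer(d=8)
out = layer(torch.randn(8), torch.randn(5, 8))   # updated representation for one node
```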

Social-aware Encoder-decoder Network (SEN) module
To make full use of user preference information and alleviate the negative impact of users' unreliable interaction signals, we construct the SEN module.
It is mainly made up of the Encoder and Decoder.

Encoder
The Encoder aims to mine the user's latent intent sequence from the interaction sequence of the current session. First, for a session $S = [i_1, i_2, \ldots, i_t]$ of a user $u$, where $t$ is the length of the current session, we obtain the enhanced item representation sequence $[r_1, r_2, \ldots, r_t]$, $r \in \mathbb{R}^d$, and the enhanced user representation $\hat{u} \in \mathbb{R}^d$ from the set $R$ produced by the HGE module. To capture the sequence information of the session, we employ a learnable positional embedding (Sun et al., 2019). Formally, for each item $i$ of the input session, the hidden representation is
$$x = r + p,$$
where $x \in \mathbb{R}^d$ denotes the hidden representation of item $i$, $r$ is its enhanced representation, and $p \in \mathbb{R}^d$ is the position embedding. Thus, we obtain the hidden representation set $X = [x_1, x_2, \ldots, x_t] \in \mathbb{R}^{t \times d}$ for session $S$.
Latent Intent Modeling. In this layer, we mine the latent intents of the user from the interaction sequence of the current session and eliminate low-confidence items. Specifically, we use the sparse attention network, together with an MLP and ReLU activation, to encode the interaction sequence and obtain the latent intent sequence $L' \in \mathbb{R}^{t \times d}$, where the SparseAttention operation can be formalized as
$$\mathrm{SparseAttention}(Q, K, V) = \alpha\text{-entmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,$$
where $Q$, $K$, and $V$ are the input matrices, and SparseAttention is a multi-head network. It is worth mentioning that the model performance is not sensitive to the number of heads, so we empirically set it to 4. Besides, we use a mask matrix to ensure that the mining for the $t$-th item can depend only on its previous items. Then we endow the model with more non-linearity through another MLP, whose parameters are shared across all sessions.
After that, we add a residual connection and layer normalization to the result to alleviate the instability of model training, and we also add dropout to alleviate overfitting. The Encoder is stackable, and we let $L_{enc}$ denote the number of Encoder layers. The output of the last layer is $L = [l_1, l_2, \ldots, l_t] \in \mathbb{R}^{t \times d}$, which represents the latent intent sequence.
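The Encoder layer described above roughly corresponds to the following PyTorch sketch. For brevity it uses the standard softmax attention of `nn.MultiheadAttention` in place of α-entmax, and the layer sizes are placeholders rather than SSAN's exact configuration.

```python
import torch
import torch.nn as nn

class SparseSelfAttentionEncoder(nn.Module):
    """One Encoder layer: causal multi-head self-attention over the hidden
    representations (item + position embedding), followed by a position-wise
    MLP, with residual connections, LayerNorm, and dropout. SSAN replaces the
    softmax inside the attention with alpha-entmax."""
    def __init__(self, d: int = 128, heads: int = 4, dropout: float = 0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, dropout=dropout, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, t, d) hidden representation sequence X
        t = x.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal)   # each step attends only to the past
        x = self.norm1(x + self.drop(attn_out))
        return self.norm2(x + self.drop(self.mlp(x)))

encoder = SparseSelfAttentionEncoder()
latent_intents = encoder(torch.randn(32, 10, 128))   # (batch, t, d) latent intent sequence L
```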

Decoder
The Decoder aims to achieve the social-aware modeling of the current session based on user preference information and the latent intent sequence.
Short-term Intent Modeling. To mine more sequence information from the interaction sequence, we employ a GRU (Cho et al., 2014) on the enhanced item representation sequence $R \in \mathbb{R}^{t \times d}$ and take its last hidden state $ST \in \mathbb{R}^d$ as the short-term intent representation of the current session.
User Information Fusing. Focusing only on the interactions in the current session while ignoring user preference information would limit the performance of the model. In this layer, we therefore integrate the user preference information into the modeling of the current session.
Specifically, we fuse the short-term intent representation $ST$ with the enhanced user representation $\hat{u}$ from the HGE module to obtain the personalized intent representation $P \in \mathbb{R}^d$, which integrates user preference information and short-term intent.
Session Modeling. In this layer, we implement social-aware decoding on the output $L$ of the Encoder based on the personalized representation. Technically, we feed $P$ and the latent intent sequence $L \in \mathbb{R}^{t \times d}$, the final output of the Encoder, into the sparse attention network for decoding to model the current session. Similarly, we add a residual connection and layer normalization to the result to alleviate the instability of model training, and also add dropout to alleviate overfitting. Moreover, the Decoder is stackable, and we let $L_{dec}$ denote the number of Decoder layers. The output of the last layer is $F \in \mathbb{R}^d$, which is the final session representation.
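A minimal sketch of the Decoder follows, assuming the fusion of $ST$ and $\hat{u}$ is a linear layer over their concatenation and again using standard softmax attention in place of α-entmax; these are illustrative simplifications, not the exact SSAN design.

```python
import torch
import torch.nn as nn

class SENDecoder(nn.Module):
    """A GRU summarizes the enhanced item sequence into the short-term intent ST,
    which is fused with the enhanced user embedding into P and then used as the
    query over the Encoder's latent intent sequence L."""
    def __init__(self, d: int = 128, heads: int = 4, dropout: float = 0.2):
        super().__init__()
        self.gru = nn.GRU(d, d, batch_first=True)
        self.fuse = nn.Linear(2 * d, d)     # fuses ST and the user representation into P
        self.attn = nn.MultiheadAttention(d, heads, dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.drop = nn.Dropout(dropout)

    def forward(self, item_reprs, user_repr, latent_intents):
        # item_reprs: (batch, t, d) enhanced item sequence R
        # user_repr:  (batch, d)    enhanced user representation u_hat
        # latent_intents: (batch, t, d) Encoder output L
        _, h_last = self.gru(item_reprs)                     # (1, batch, d)
        short_term = h_last.squeeze(0)                       # ST: (batch, d)
        personalized = self.fuse(torch.cat([short_term, user_repr], dim=-1))  # P
        query = personalized.unsqueeze(1)                    # (batch, 1, d)
        out, _ = self.attn(query, latent_intents, latent_intents)
        out = self.norm(query + self.drop(out))
        return out.squeeze(1)                                # F: (batch, d) session representation

decoder = SENDecoder()
final = decoder(torch.randn(32, 10, 128), torch.randn(32, 128), torch.randn(32, 10, 128))
```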

Prediction and Optimization
In this part, we complete the prediction for the current session. First, to capture the user's intent at the end of the session and make full use of the user's preference information, we integrate several key pieces of information:
$$O = W_o\left[F \,\|\, r_t \,\|\, \hat{u}\right],$$
where $O \in \mathbb{R}^d$ is the final representation used to make recommendations, $r_t \in \mathbb{R}^d$ is the enhanced representation of the last item, $\hat{u} \in \mathbb{R}^d$ is the enhanced user representation, and $W_o \in \mathbb{R}^{d \times 3d}$ is the projection matrix.
Since the next-item prediction can be converted into a probability distribution over items, we calculate the similarity of every item to the representation $O$:
$$z_i = O^{\top} c_i,$$
where $c_i \in \mathbb{R}^d$ is the initial embedding of candidate item $i \in I$ and $z_i$ is the similarity score. We use the softmax function to normalize the similarity scores:
$$\hat{y}_i = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)},$$
where $\hat{y}_i$ is the probability of item $i$ being the next click in the current session.
For any given session, the loss function is defined as the cross-entropy between the ground truth $y_i$ and the prediction $\hat{y}_i$:
$$\mathcal{L} = -\sum_{i=1}^{N} y_i \log(\hat{y}_i),$$
where $y$ is the ground-truth probability distribution, which is a one-hot vector.
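The prediction and optimization steps can be sketched as follows, assuming $O$ is formed by projecting the concatenation $[F \,\|\, r_t \,\|\, \hat{u}]$ with $W_o$ as reconstructed above; the helper name and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def predict_and_loss(F_repr, last_item_repr, user_repr, item_embeddings, W_o, targets):
    """Score every candidate item against the final representation O and
    compute the softmax cross-entropy loss over all items."""
    O = torch.cat([F_repr, last_item_repr, user_repr], dim=-1) @ W_o.t()   # (batch, d)
    logits = O @ item_embeddings.t()                                       # (batch, N) scores z_i
    loss = F.cross_entropy(logits, targets)          # softmax + cross-entropy in one call
    return logits, loss

batch, d, num_items = 32, 128, 1000
W_o = torch.randn(d, 3 * d) * 0.01
logits, loss = predict_and_loss(
    torch.randn(batch, d), torch.randn(batch, d), torch.randn(batch, d),
    torch.randn(num_items, d), W_o, torch.randint(0, num_items, (batch,)),
)
topk = logits.topk(20, dim=-1).indices               # top-K recommendation list
```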

Experiments
We conduct experiments on two real-world benchmark datasets and mainly aim to answer the following research questions:
RQ1: How does SSAN compare to other state-of-the-art (SOTA) models?
RQ2: Is using the users' social networks and historical sessions conducive to predicting the user's next click?
RQ3: Is the HGE module beneficial to the final performance?
RQ4: Is SSAN efficient?
RQ5: How do the modules and layers of SSAN affect the final performance?

Datasets
We conduct extensive experiments on the following two public datasets.
Gowalla 1 : it comes from a location-based social networking website, where users can share their location by checking in. Following SERec (Chen and Wong, 2021), we split two consecutive check-in records into two sessions if the interval between them is longer than 1 day.
Delicious 2 : it comes from an online bookmarking system, where users can assign various semantic tags to bookmarks. Following SERec, we take a series of tag operations with small timestamp gaps as a session.
We follow the same data-processing procedure as SERec. For each dataset, we take the first 60% as the training set, the next 20% as the validation set, and the remaining 20% as the test set. Then, we filter out short sessions and infrequent items, and apply the data augmentation technique described in SERec to both datasets. The statistics of the datasets after preprocessing are shown in Table 1.
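For illustration, a session-splitting helper in the spirit of the Gowalla preprocessing (1-day gap between consecutive check-ins) might look like the sketch below; the exact SERec preprocessing scripts may differ, and the data layout is assumed.

```python
from datetime import datetime, timedelta

def split_into_sessions(events, gap=timedelta(days=1)):
    """Group one user's (timestamp, item) events into sessions, starting a new
    session whenever the gap between consecutive events exceeds `gap`."""
    sessions, current, prev_time = [], [], None
    for ts, item in sorted(events):
        if prev_time is not None and ts - prev_time > gap:
            sessions.append(current)
            current = []
        current.append(item)
        prev_time = ts
    if current:
        sessions.append(current)
    return sessions

checkins = [(datetime(2010, 5, 1, 9), "i1"), (datetime(2010, 5, 1, 11), "i2"),
            (datetime(2010, 5, 4, 8), "i3")]
print(split_into_sessions(checkins))   # [['i1', 'i2'], ['i3']]
```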

Implementation Details
For a fair comparison, we implement our model on the public pre-processed version of the datasets provided by SERec (Chen and Wong, 2021). In addition, we follow SERec for the following settings: we use the Adam (Kingma and Ba, 2015) optimizer with learning rate 0.001 and weight decay coefficient 0.0001; we use 128-dimensional embeddings for items and users; we apply early stopping if the performance on the validation set does not improve for 2 epochs; and we set the number of epochs to 30 and the mini-batch size to 128.
Besides, we search for the number of GNN layers $L_{gnn}$ in {1, 2, 3} and finally set it to 1. $L_{enc}$ and $L_{dec}$ are tuned amongst {1, ..., 4} and finally set to 3 and 1, respectively. The $\alpha$ of $\alpha$-entmax is tuned amongst {1.1, ...}.
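Collecting the stated hyperparameters, the training setup might be configured as in this sketch; the stand-in model is a placeholder, and values not reported above are not included.

```python
import torch
import torch.nn as nn

# Hyperparameters reported above, gathered in one place (illustrative).
config = {
    "embedding_dim": 128, "batch_size": 128, "epochs": 30,
    "lr": 1e-3, "weight_decay": 1e-4, "early_stopping_patience": 2,
    "num_gnn_layers": 1,      # L_gnn
    "num_encoder_layers": 3,  # L_enc
    "num_decoder_layers": 1,  # L_dec
    "attention_heads": 4,
}

model = nn.Linear(config["embedding_dim"], config["embedding_dim"])  # stand-in for the full SSAN model
optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"],
                             weight_decay=config["weight_decay"])
```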

Baselines
We compare SSAN with the following representative SOTA recommendation methods. They can be categorized into session-based recommendation models and SSR models. Item-KNN (Sarwar et al., 2001) recommends items similar to the last item in the session. FPMC (Rendle et al., 2010) is a traditional sequential method based on Markov chains. NARM (Li et al., 2017) 3 utilizes an RNN and an attention mechanism to capture the main purpose of the session. STAMP (Liu et al., 2018) 4 uses the self-attention mechanism to capture the long-term and short-term preferences of sessions. SR-GNN (Wu et al., 2019) 5 employs gated graph convolutional neural networks to capture complex item transitions and achieves promising results. SSRM (Guo et al., 2019) proposes a matrix-factorization-based attention model. NextItNet (Yuan et al., 2019) 6 is a classic CNN-based method for sequential recommendation. GCE-GNN (Wang et al., 2020) 7 is a widely compared GNN-based model that learns global and local information of sessions. DSAN (Yuan et al., 2021) 8 utilizes a sparse attention mechanism to alleviate the effect of unrelated items clicked by users.
We also compare SSAN with the following SSR models: DGRec (Song et al., 2019) 9 uses an RNN and a graph attention network to model dynamic interests and social influences. SERec (Chen and Wong, 2021) 10 is the state-of-the-art method for SSR, which has an efficient and effective knowledge embedding framework.
For a fair comparison, our implementation provides the user and item representations from the Heterogeneous Graph Embedding (HGE) module to the SR models. In addition, since some models miss some metrics to varying degrees in the publicly reported results, for a fair comparison we report the best results from the original papers (if available) and the results reproduced on the same device as SSAN.

Experimental Results
In this section, we investigate SSAN in detail according to the experimental results.

Overall Performance
The experimental results for overall performance are reported in Table 2, and we can draw the following conclusions:
(RQ2). All social-aware variants (e.g., G_NARM) of the non-social-aware methods significantly outperform the original models (e.g., NARM), which strongly demonstrates the superiority of using social knowledge.
(RQ3). The performance of these variants is close to or even better than that of SERec, which shows the superiority of the HGE module that pays attention to close friends and important interactions.
(RQ1). SSAN is overwhelmingly superior to all baseline models, which indicates the effectiveness of SSAN. We believe the performance improvement of our model mainly comes from the following aspects: (1) We construct the HGE module, which uses an improved heterogeneous graph neural network to inject social knowledge and historical interaction information into the user modeling process. (2) We construct the SEN module based on the sparse transformation function, which can capture the sequence information and fully integrate the user preference information during decoding. In a word, our efforts to use the knowledge of social networks more effectively and to filter out irrelevant items allow us to achieve better performance.

Efficiency of SSAN (RQ4)
To investigate the efficiency of SSAN, we compare the running time of several models during both training and inference on the Delicious dataset. The experimental results are shown in Table 3. We can observe that the social-aware variants (e.g., G_NARM) of the non-social-aware methods run slightly slower than the original models (e.g., NARM) during training and run as fast as their original models during inference. In the real world, a slightly larger training cost is acceptable as long as the model achieves better performance, since the model only needs to be trained once in a period. Although the HGE module, which captures social knowledge, requires more training time, it does not need additional time in the inference process, which demonstrates that using social knowledge is feasible. SSAN achieves a significant performance improvement without requiring more training time than SERec, which shows the efficiency and superiority of SSAN.

Ablation Study (RQ5)
To investigate different modules and explore the effectiveness of some layers in SSAN, we compare four variants of SSAN with the original SSAN. According to the experimental results shown in Table 4, we can draw the following conclusions: (1) SSAN-noHGE denotes SSAN without the HGE module. SSAN-noSEN represents SSAN without the SEN module, where we make predictions using the mean of all item embeddings in the session. These two variants perform considerably worse, which indicates that the HGE and SEN modules are beneficial to the final performance and shows the effectiveness of the two modules.
(2) SSAN-noST means SSAN without the Short-term Intent Modeling layer. We can observe that its performance degrades significantly, which indicates that modeling the short-term intent is essential for SSAN.
(3) SSAN-noU denotes SSAN without the User Information Fusing layer. We can observe that its performance also declines, which indicates that it is important to use user preference information to predict the next item of the current session.

Conclusion
In this paper, we summarize two issues of previous SSR methods. To tackle these issues, we propose a novel Social-aware Sparse Attention Network, abbreviated as SSAN. In this model, we construct the HGE module based on an improved heterogeneous graph neural network. It can inject high-confidence social knowledge and historical interaction information into the modeling process of users and items. Meanwhile, we construct the SEN module based on the sparse attention mechanism to integrate user preference information when modeling the current session. Extensive experimental results on two datasets demonstrate the superiority of SSAN over the state-of-the-art models. In future work, we plan to explore more efficient methods to capture social knowledge and enhance the ability to screen irrelevant items.

Figure 1: The architecture of SSAN. It is mainly composed of the HGE and SEN modules.

Table 1: Statistics of the two datasets.

Table 3: Running time in seconds (s) per 1000 batches.

SSAN-noST 40.07 21.54 25.92 49.62 22.20 28.33
SSAN-noU 42.16 23.21 27.79 51.65 23.75 30.08
SSAN 42.66 23.52 28.07 52.27 24.15 30.39

Table 4: Performance of the variants of SSAN.