TADI: Topic-aware Attention and Powerful Dual-encoder Interaction for Recall in News Recommendation



Introduction
News recommendation is one of the most widely commercialized applications of natural language processing, feeding rich and suitable news to users based on their interests. Currently, news recommendation is widely used on online news websites (such as MSN News), movie review websites, and so on; it has thus become a useful tool for providing masses of customized information in one go. Generally, recall and ranking are the two main steps of news recommendation (Wu et al., 2022). The first recalls candidates from a very large news database, while the second ranks the news candidates for display. News recall determines the upper bound of recommendation performance, and thus this paper focuses on it.
Research on news recall (Wu et al., 2022) (also called candidate generation (Covington et al., 2016) or news retrieval (Wang et al., 2023)) can be categorized into feature-based and content-based models. Feature-based models focus on feature interaction modeling, such as YoutubeNet (Covington et al., 2016) and Pinnersage (Pal et al., 2020). Feature-based models must summarize features by manual text mining, and thus they inevitably lose useful information. With the development of content understanding technology, researchers have combined content understanding with feature interaction, yielding content-based models such as (Okura et al., 2017). Different from the manual text mining of feature-based models, content-based models directly learn representations of users and news by modeling news content. However, content-based models ignore the irrelevant word distraction problem. Every word is encoded equally, which is why irrelevant words can bring side effects to news recommendation. For example, football fans are more interested in the news "Lionel Messi comes back Spain for taking a holiday" than tourists are, but only the two words "Lionel Messi" in the title are relevant.
To recall news candidates, studies mostly rely on the dual-encoder architecture shown in Figure 1. The dual-encoder architecture can serve a large scale of news efficiently in real time, because it encodes the user and the news independently and solves the top-k nearest neighbor search problem in sublinear complexity by converting it to Maximum-Inner-Product Search (MIPS) (Yao et al., 2021). However, the dual-encoder architecture suffers from a classical challenge (Khattab and Zaharia, 2020): weak dual-encoder interaction. Specifically, the click predictor (e.g., dot product) is unfortunately the only interaction between the dual encoders in that architecture. Weak interaction makes it difficult for one encoder to utilize the information of the other. As a result, the model underestimates the actual correlation between the dual encoders, resulting in severe performance degradation.
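To make the efficiency argument concrete, the following toy sketch (ours, not from the paper; the function name `top_k_news` is hypothetical) shows how dual-encoder recall reduces to scoring precomputed embeddings with a dot product. In production the exhaustive scan below is replaced by an approximate MIPS index, but the scores are the same:

```python
def top_k_news(user_emb, news_embs, k=2):
    """Score every candidate news by dot product with the user embedding
    and return the indices of the k highest-scoring candidates."""
    scores = [sum(u * n for u, n in zip(user_emb, emb)) for emb in news_embs]
    return sorted(range(len(news_embs)), key=lambda i: -scores[i])[:k]

# Toy 3-dimensional embeddings: the user leans toward sports.
user = [0.2, 0.9, 0.1]
news = [[0.1, 0.8, 0.0],   # sports news, close to the user's interest
        [0.9, 0.0, 0.1],   # finance news
        [0.3, 0.7, 0.2]]   # another sports-related item
recalled = top_k_news(user, news, k=2)  # -> [0, 2]
```

Because user and news embeddings are produced independently, the news side can be encoded and indexed entirely offline; only the cheap dot products happen at serving time.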
In response to the aforementioned challenges, we propose a news recall model, namely Topic-aware Attention and powerful Dual-encoder Interaction for recall in news recommendation (TADI). First, we design the Topic-aware Attention (TA) mechanism to avoid irrelevant word distraction. News topic is one of the most important interest indicators, directly reflecting the preferences of potential target users, so it is reasonable to weight words with TA. Secondly, TADI involves the Dual-encoder Interaction (DI) module, which helps the dual encoders interact more powerfully than in typical dual-encoder models. In detail, DI provides two auxiliary targets that enhance dual-encoder interaction during training, without changing the mechanism of online news recall in the dual-encoder architecture. Therefore, TADI can leverage the efficiency of the dual-encoder architecture for news recall while simultaneously gaining more powerful interaction. Afterwards, the effectiveness of TADI is verified by conducting a series of experiments on the benchmark dataset MIND (Wu et al., 2020).
In summary, our contributions are four-fold: (1) We propose a news recall model, TADI. (2) We design the topic-aware attention (TA) to avoid the distraction of irrelevant words. (3) We propose the dual-encoder interaction (DI) to enhance dual-encoder interaction. (4) Extensive experiments are conducted on the benchmark datasets, demonstrating the effectiveness of TADI.

Related Works
Research on news recall is rather mature and can mainly be divided into feature-based and content-based models. Furthermore, we introduce research on news ranking because its technologies are applicable to news recall.

News Recall.
Feature-based models focus on feature interaction modeling and are usually applied to product and movie recommendation. YoutubeNet and Pinnersage are well-known feature-based baselines in news recall (Wu et al., 2022). YoutubeNet uses the average of clicked news embeddings for recall. Pinnersage recalls items based on hierarchical clustering. However, the difficulty of effective content mining leads to information loss, which limits the performance of feature-based models.
In contrast to feature-based models, content-based models pay attention to content understanding. Most content-based recommendation models (Wu et al., 2019b) learn the user representation from the sequence of a user's clicked news and learn news representations from news candidates. Besides, to describe diverse and multi-grained user interests, a few researchers have found that a series of user interest representations is more suitable than a single one. The authors of HieRec (Qi et al., 2021) study user interest representations more deeply. They split user interest into category, sub-category and overall levels, learning multiple representations for each user. Compared with single-representation models, multiple-representation models achieve better performance but consume several times more computing resources for click prediction. (Yu et al., 2022) aims to improve both the effectiveness and the efficiency of pre-trained language models for news recommendation, and achieves significant performance improvement. However, it consumes more computing resources by training M+1 teacher models and distilling twice to obtain 2 student models.
We have reviewed studies on the two main branches of news recall: feature-based and content-based models. However, irrelevant word distraction may confuse model learning. TADI avoids this problem by utilizing news topic. Many studies (Qi et al., 2021; Wu et al., 2019c) involve news topic in modeling, but they are seldom aware of the distraction, let alone take action to solve it. Furthermore, weak interaction makes it difficult for one encoder to utilize the information of the other. TADI exploits powerful interaction between the dual encoders for information utilization.

News Ranking.
Research on news ranking can also be categorized into feature-based and content-based models. FM and DeepFM are well-known feature-based baselines in news ranking. FM models second-order feature interactions, while DeepFM models higher-order feature interactions. Recently, researchers (Kang and McAuley, 2018; Sun et al., 2019) have additionally modeled user sequential behaviors for performance improvement; e.g., SDM (Lv et al., 2019) models long-term and short-term user sequential behaviors separately. Turning to content-based models, MINE (Li et al., 2022) makes the number of user interest representations controllable by a tunable hyper-parameter, and the model achieves its best performance when the hyper-parameter is set to 32. (Mao et al., 2022) enriches the semantics of users and news by building user and news graphs.

Problem Formulation
In this section, we give the main notations and define the news recall problem. The features of news and users are the same as in previous works (Qi et al., 2021). First, for the news candidate definition, each news is represented by four types of features: title, category, sub-category and title entity. The news title t_n is a word sequence. Denote the category and sub-category of the news as c_n and s_n, and the title entity sequence, which consists of entities, as d_n. Secondly, we assume a user has N historical clicked news, and the structure of each historical news representation is the same as that of a news candidate; denote their titles as T. The target of news recall is to learn the mapping from users to the most relevant news. Technically, the target is to minimize the gap between the ground truth y and the predicted label ŷ by optimizing model parameters.

The Proposed Model
TADI is divided into four modules, i.e., user encoder, news encoder, predictor, and dual-encoder interaction, as shown in Figure 2. The user encoder and the news encoder generate embeddings of the user and the news, respectively. The predictor calculates the dot product between the user and news embeddings to predict the click probability. The dual-encoder interaction provides a capability that helps the dual encoders interact more powerfully.

Basic Components
Before discussing the main modules of TADI, we first introduce the basic components: the feature encoder, aggregation attention and topic-aware attention.
Feature Encoder. The purpose of the feature encoder is to transform features into dense embeddings. Each news is represented by title, category, sub-category and title entity. First of all, similar to previous works (Qi et al., 2021), we adopt pre-trained models such as word2vec (Mikolov et al., 2013) and BERT to map the word tokens of titles into dense embeddings. In the experiments, we discuss the advantages of word2vec and BERT for title encoding in detail. For clarity, we name this type of feature encoder the title encoder. Secondly, category and sub-category are embedded using GloVe (Pennington et al., 2014). Thirdly, entity embeddings are learnt from the knowledge graphs provided by the datasets. Fourthly, a shallow transformer encoder (2-layer and 4-head) is used to learn the feature correlation of title and title entity. Fifthly, only for user encoding, all types of embeddings are processed by shallow transformer encoders to learn cross-news information.
Aggregation Attention. Aggregation attention is used to integrate embeddings using a query, a set of keys and values. The query and keys are used to calculate attention weights which measure the importance of the values, and then the weighted sum of the values is output. Suppose the input is X = [x_1, x_2, ..., x_M], where x_m ∈ R^{d_x}, d_x is the embedding dimension and M is the number of embeddings to be aggregated. Inspired by the attention used in Poly-encoder (Humeau et al., 2020), aggregation attention is designed in the same manner. The query q^a ∈ R^{d_q} is a trainable vector; the keys K^a = [k^a_1, ..., k^a_M], where k^a_m ∈ R^{d_q}, are the output of a Fully Connected Network (FCN) whose input is X; and the values are the input X itself. The attention weight of the m-th value is α^a_m ∈ R. In summary, aggregation attention is formulated as:

Agg(X) = Σ_{m=1}^{M} α^a_m x_m, where α^a_m = softmax_m(q^a · k^a_m) and K^a = FCN(X).

Topic-aware Attention. Topic-aware attention aims to integrate word embeddings using topics, getting rid of irrelevant word distraction. First, a topic embedding and word embeddings are used to generate a query and a series of key-value pairs. In detail, we map the news topic embedding to a d_t-dimensional query q^t ∈ R^{d_t} with an FCN, and we map the word embeddings to keys K^t and values V^t, where v^t_m ∈ R^{d_t}, with two FCNs. Secondly, we obtain the attention weights α^t_m ∈ R: we scale down the dot product of q^t and K^t by the square root of the dimension d_t, and then normalize it with the softmax function. Thirdly, we aggregate V^t using the attention weights:

TA(X) = Σ_{m=1}^{M} α^t_m v^t_m, where α^t = softmax(q^t K^{t⊤} / √d_t).
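The topic-aware attention above can be sketched in a few lines of plain Python (ours, for illustration only; the FCN projections that produce the query, keys and values from raw embeddings are omitted, and the toy numbers are made up):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def topic_aware_attention(topic_query, keys, values):
    """The topic-derived query attends over word-derived keys; the output
    is the attention-weighted sum of the word-derived values."""
    d = len(topic_query)
    scores = [sum(q * k for q, k in zip(topic_query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return weights, out

# Toy title "Messi returns to Spain" with a sports-topic query:
# the first word is strongly topic-related, the middle words are not.
q = [1.0, 0.0]
K = [[0.9, 0.1],   # "Messi"
     [0.1, 0.9],   # "returns"
     [0.1, 0.8],   # "to"
     [0.6, 0.4]]   # "Spain"
V = K
weights, out = topic_aware_attention(q, K, V)
```

On this toy input the topic-related word "Messi" receives the largest attention weight, which is exactly the behavior TA is designed to produce: topic-irrelevant words contribute less to the aggregated embedding.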

User Encoder
User Encoder is used to learn a user embedding from historical clicked news. The architecture of the user encoder is shown in Figure 3; we introduce the main procedure below. Feature Aggregation. Titles, categories, sub-categories and title entities of the historical clicked news are transformed into dense embeddings E_{u,t}, E_{u,c}, E_{u,s} and E_{u,e} by the feature encoder. These embeddings then need to be aggregated to obtain unified embeddings. The aggregation operation is divided into two types according to the feature type. First, category and sub-category aggregation: the category and sub-category embeddings of the historical clicked news are respectively integrated into two embeddings g_{u,c} and g_{u,s} by the aggregation attention. Secondly, title and title entity aggregation: since each news has several words and entities in its title, the aggregation module uses the aggregation attention to integrate the embeddings of words and entities into G_{u,t} and G_{u,e}. By doing so, we obtain a title embedding and a title entity embedding for each news. Then we apply the aggregation attention twice more to integrate the title embeddings and title entity embeddings across the historical clicked news into unified embeddings g_{u,t} and g_{u,e}.
Topic-aware Encoding. Models can be distracted by irrelevant words because such words are treated equally with relevant words during feature encoding. The distraction is especially serious when only a few related words in a long title are available to predict the target. Therefore, understanding user interest by fair title encoding alone is not enough. As a supplement, the model can utilize the category and sub-category of a news to identify the relevant words in its title. To do this, we use topic-aware attention to pay more attention to the topic-correlated information.
User Aggregation. The target of the user encoder is to learn a unified user embedding from historical clicked news. Therefore, all aggregated embeddings g_{u,t}, e_{u,c}, f_{u,c}, e_{u,s}, f_{u,s} and g_{u,e} are first concatenated, and then we map the concatenation to the user embedding e_u via an FCN.

News Encoder
The target of the news encoder is to represent news by embedding learning. The architecture of the news encoder is shown in Figure 4. In contrast to the user encoder, the input of the news encoder is only one news, which removes a few aggregation operations. The procedure is similar to that of the user encoder, so we give only a brief introduction. First of all, the title, category, sub-category and entity sequence of a news candidate are transformed into dense embeddings E_{n,t}, e_{n,c}, e_{n,s} and E_{n,e} by the feature encoder. Secondly, because the aforementioned embeddings of the title and title entity are token-wise, they are aggregated to obtain g_{n,t} and g_{n,e}. Thirdly, to avoid the distraction of irrelevant words, we use category and sub-category to identify relevant information with topic-aware attention, obtaining category-wise and sub-category-wise embeddings a_{n,c} and a_{n,s}. Finally, we integrate all embeddings into the news embedding e_n with an FCN.

Dual-encoder Interaction
In order to make the dual encoders interact more powerfully, the dual-encoder interaction module provides two auxiliary targets that utilize intermediate layer outputs of the dual encoders. These two auxiliary targets are only used during training, so they do not change the mechanism of online news recall in the dual-encoder architecture. The first is the Powerful Interaction (PI) target. It obtains more powerful interaction than the dot product alone by using the top-level embeddings c_{u,f} and c_{n,f} to predict the target: first, the top-level embeddings are concatenated into c_f; then an FCN and a sigmoid function are used to predict the auxiliary label ŷ^{PI}.
The second is the Earlier Interaction (EI) target, which helps the model interact earlier by predicting the target from the category-wise and sub-category-wise aggregated embeddings. To utilize the hierarchical information between category and sub-category, we design a block. Specifically, the block first uses an FCN to process the sub-category-wise concatenated embedding c_s. Then, the block concatenates the above output with the category-wise embeddings (f_{u,c} and f_{n,c}). After processing by another FCN, we obtain c_c. Finally, c_c is used to predict the auxiliary label ŷ^{EI} via an FCN and a sigmoid function.
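A minimal sketch of the PI head may help (ours, not the authors' code; `pi_target` and the tiny one-layer FCN are illustrative assumptions): concatenate the two top-level embeddings, apply an FCN, and squash the logit to a probability with a sigmoid.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fcn(x, W, b):
    """A single fully connected layer: W x + b."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def pi_target(c_uf, c_nf, W, b):
    """Powerful-interaction head: concatenate the top-level user and news
    embeddings, pass them through an FCN, and squash to a probability."""
    c_f = c_uf + c_nf                  # list concatenation = vector concat
    logit = fcn(c_f, W, b)[0]
    return sigmoid(logit)

random.seed(0)
c_uf, c_nf = [0.5, -0.2], [0.3, 0.8]
W = [[random.uniform(-1, 1) for _ in range(4)]]  # 1 output unit, 4 inputs
b = [0.0]
p = pi_target(c_uf, c_nf, W, b)
```

Because this head mixes user and news features before the prediction, its gradient pushes both encoders to produce mutually useful representations; at serving time the head is simply dropped, leaving the plain dot-product predictor.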

Optimization
The loss function L of TADI is divided into three parts: the main loss L_{main} and the auxiliary losses L_{PI} and L_{EI}. L_{main} is the loss between the predicted label ŷ and the ground truth y, while L_{PI} and L_{EI} are the losses of the two auxiliary targets, measuring the differences between their predicted labels (ŷ^{PI} and ŷ^{EI}) and the ground truth. The total loss is a weighted sum of the three parts, where the weighting hyper-parameters a and b are set to 0.8 and 0.1 in our experiments. Following previous works (Qi et al., 2021), L_{main} uses the Noise Contrastive Estimation (NCE) loss. Given the i-th positive sample (a clicked news) in a batch, we randomly select K negative samples (non-clicked news) for it. The selection is from the same news impression displayed to the user. The NCE loss requires the positive sample to be assigned a higher score than the negative ones, and is formulated as:

L_{main} = -(1/N_b) Σ_{i=1}^{N_b} log( exp(ŷ_i^+) / (exp(ŷ_i^+) + Σ_{k=1}^{K} exp(ŷ_{i,k}^-)) ),

where N_b is the batch size. L_{PI} and L_{EI} use the Binary Cross Entropy (BCE) loss. Taking L_{PI} as an example:

L_{PI} = -(1/N_b) Σ_{i=1}^{N_b} [ y_i log ŷ_i^{PI} + (1 - y_i) log(1 - ŷ_i^{PI}) ].
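The NCE loss above can be computed directly from the scores; the following sketch (ours, with a hypothetical `nce_loss` helper and made-up scores) shows one batch item with its K sampled negatives:

```python
import math

def nce_loss(pos_scores, neg_scores):
    """NCE loss: for each positive, push its score above the scores of
    its K sampled negatives; average the per-sample losses over the batch."""
    total = 0.0
    for s_pos, negs in zip(pos_scores, neg_scores):
        denom = math.exp(s_pos) + sum(math.exp(s) for s in negs)
        total += -math.log(math.exp(s_pos) / denom)
    return total / len(pos_scores)

# One clicked news (score 2.0) against K = 4 non-clicked news from
# the same impression.
loss = nce_loss([2.0], [[0.1, -0.3, 0.5, 0.0]])
```

Note that the loss shrinks as the positive score rises relative to the negatives, which is exactly the ranking pressure the paper describes.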

Experiment
We now empirically test TADI and conduct experimental analyses to verify the effectiveness of our work. Following (Qi et al., 2021), we employ four ranking metrics, i.e., AUC, MRR, nDCG@5, and nDCG@10, for performance evaluation. These evaluation metrics are used for both news ranking models and recall models, as in previous works (Wang et al., 2023; Khattab and Zaharia, 2020; Cen et al., 2020). In order to utilize experimental results from previous works (such as (Qi et al., 2021; Li et al., 2022; Wu et al., 2020)), our experiments apply the same metrics.
The test set of MIND-large does not have labels, so the evaluation is performed on an online website 2. Our experiments are conducted on a 12-vCPU Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz, 43GB of memory and an RTX 3090 GPU. We measured the training time on MIND-small: running one epoch consumes about 28 minutes with GloVe and about 120 minutes with MiniLM as the title encoder.
We utilize each user's most recent 40 clicked news to learn user representations. For each news, we use NLTK to split the title into words, then select the first 30 words. For the title entity, we select the first 10 entities. To explore the influence of pre-trained models on the title encoder, we adopt the 300-dimensional GloVe and MiniLM (MiniLM-12l-384d, a distilled BERT) (Wang et al., 2021) to initialize the title encoder, because MiniLM consumes much less time than BERT. The embeddings of category and sub-category are initialized with GloVe and are not frozen during training. The K of Eq. 6 is set to 4 during training, which means each positive news is paired with 4 negative news. We employ Adam (Kingma and Ba, 2015) as the optimization algorithm.

2 https://codalab.lisn.upsaclay.fr/competitions/420
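The negative-sampling setup described above (K = 4 negatives per clicked news, drawn from the same impression) can be sketched as follows; this is our illustration, and `sample_negatives` with its toy impression is hypothetical:

```python
import random

def sample_negatives(impression, clicked, k=4, seed=0):
    """Pair each clicked news with k non-clicked news drawn from the
    same impression (the list of news shown to the user together)."""
    rng = random.Random(seed)
    pool = [n for n in impression if n not in clicked]
    return {pos: rng.sample(pool, k) for pos in clicked}

impression = [f"news_{i}" for i in range(10)]  # 10 news shown together
clicked = ["news_3"]                           # the user clicked one
pairs = sample_negatives(impression, clicked, k=4)
```

Sampling negatives from the same impression, rather than from the whole corpus, gives harder negatives: every candidate was actually displayed to the user, so a non-click is a more informative signal.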

Compared Models
Considering the characteristics of the title encoder, we categorize models into W2V and BERT types: W2V-based and BERT-based models use a W2V-style model (such as word2vec or GloVe) or a BERT-like model (such as BERT or MiniLM), respectively, to encode titles.

W2V.
(1) DKN (Wang et al., 2018): It uses a CNN to learn news representations, and a target-aware attention network to learn user representations. (2) NPA (Wu et al., 2019b): It learns news and user representations by considering user personality. (3) NAML (Wu et al., 2019a): It learns user and news representations using multi-view learning, and it is the State-Of-The-Art (SOTA) among single-representation models with GloVe. (4) LSTUR (An et al., 2019): It models both short-term and long-term user interests using GRU networks and user ID embeddings. (5) NRMS (Wu et al., 2019d): It employs multi-head self-attention to learn user and news representations. (6) HieRec (Qi et al., 2021): To represent a user, it learns an overall embedding, embeddings for each category and embeddings for each sub-category. HieRec costs about 300 times as much time as a single-representation model for news recall.

BERT.

Experiment Analysis
In this section, we first analyze model performance. Then we conduct an ablation analysis. Finally, extensive analyses illustrate the effect of the embedding dimension and model performance with different title encoders. Baseline performances are provided by (Li et al., 2022; Qi et al., 2021; Wu et al., 2021; Zhang et al., 2021). Bold marks the best performance, while underline marks the best baseline performance. We repeated the TADI experiments 3 times and report the average result with standard deviation.

Performance Analysis
We compare model performances to demonstrate the effectiveness of our work from the perspective of title encoders. Table 2 shows the performance of each model on MIND-small and MIND-large, from which we make the following observations. W2V. First of all, TADI is the best-performing model, which verifies the effectiveness of our work. Secondly, the performance gaps between TADI and single-representation models are large on both MIND-small and MIND-large. From these comparisons, we find that TADI achieves significant improvement over the baseline models, which strongly supports the effectiveness of our work. Thirdly, TADI is better than multiple-representation models in both performance and online speed. The performance gaps between TADI and HieRec are smaller than in the previous comparisons, but TADI is much faster than HieRec for news recall. Efficiency is a key feature when considering TADI. TADI achieves good efficiency because DI exists only during model training, helping the model obtain interactive information and achieve better performance on news recall; it therefore adds no additional computational complexity to embedding inference and news recall. When measuring efficiency, news recall only considers the time consumption of scoring, because user and news embeddings can be inferred offline. TADI recalls news in the same way as basic dual-encoder models, i.e., by calculating the dot product of two embeddings. However, HieRec trains multiple embeddings for each user and each news. For example, on the MIND-small dataset, each user and each news have 289 embeddings (1 overall embedding, 18 category-wise embeddings, and 270 sub-category-wise embeddings) for scoring. Therefore, when recalling news, the time consumption of HieRec is 289 times that of TADI. BERT. First, similar to the previous analyses for W2V, TADI outperforms the baseline models by a large margin. This observation demonstrates the
effectiveness of TADI with BERT. Secondly, compared with using W2V to encode titles, TADI with BERT performs better. From this comparison, we find that it is worthwhile to use BERT for title encoding, even though it brings more computational complexity.
Summary. The proposed TADI outperforms the SOTA whether W2V- or BERT-type models are used to encode titles. Furthermore, the large performance gaps between TADI and single-representation models illustrate that TADI achieves significant improvement. Finally, TADI demonstrates that single-representation models are competitive with multiple-representation models.

Ablation Analysis
To understand the importance of TA and DI, we conduct an ablation analysis, as shown in Table 3. Our observations are as follows. First of all, we verify the importance of DI. Performance degrades after removing DI, which reveals that DI is necessary for TADI. Secondly, we further remove TA to analyze its effect. After removing TA, we observe that the performance of TADI declines further; the effect of TA is therefore substantial. Thirdly, we verify the importance of the PI and EI targets in DI. After comparing their performances, we find that either one of them is enough for TADI. Fourthly, combining Table 2 and Table 3, TADI without TA and DI already outperforms most of the baselines. The reasons are as follows. First, the title encoder uses a strong component (a transformer encoder) to learn cross information among the historical clicked news, while NAML ignores it and LSTUR adopts a weak component (a GRU). Secondly, TADI concatenates shallow and deep information when integrating all information after the feature encoder, while the baselines do not. Therefore, TADI achieves better performance even without TA and DI.
Figure 5: Dimension Analysis on MIND-small.

Embedding Dimension Analysis
We explore the optimal embedding dimension of TADI with GloVe or MiniLM on MIND-small, as shown in Figure 5. We observe the following. First, TADI with GloVe achieves the optimal performance when the dimension is set to 768; the gaps between the optimal performance and the rest are larger than 0.4%. Secondly, different from GloVe, TADI with MiniLM achieves the best performance when the dimension is set to 1024: the performance rises continuously until the dimension reaches 1024, and declines when the dimension is larger than 1024. In summary, it is optimal to set the embedding dimension to 768 when TADI uses GloVe, and to 1024 when TADI uses MiniLM.
Figure 6: Title Encoder Analysis on MIND-small.

Title Encoder Analysis
We conduct an experiment to analyze the influence of the title encoder. Previously, we analyzed TADI's performance using GloVe and MiniLM as the title encoder, but MiniLM is only one of several distilled BERT variants. Besides, the title encoders of a few baselines differ from these. Therefore, we compare the performance and inference speed of TADI with more title encoders: GloVe, MiniLM-6L (MiniLM-6l-384d), MiniLM-12L (MiniLM-12l-384d) and BERT-base-uncased.
From Figure 6, we observe the following. First, TADI with MiniLM-12L achieves the best performance, because MiniLM-12L learns more useful information from pre-training than GloVe and MiniLM-6L, while the data scale of MIND-small might not fine-tune BERT-base-uncased well, which makes the latter perform worse. Secondly, the inference speed of TADI with GloVe is the fastest; as model complexity increases, inference becomes slower and slower. In summary, we prefer GloVe or MiniLM-12L as the title encoder in terms of performance and inference speed.

Topic-aware Attention Analysis
We conduct case studies to verify the effectiveness of TA. Five news are randomly selected and analyzed in Table 4. To quantitatively analyze the effectiveness of TA on a whole dataset, we additionally provide a Part-Of-Speech (POS) rank. We first use NLTK to tag the POS of each title and rank the POS tags by their topic attention weights (category-wise). Secondly, to normalize the rank, we divide it by the word count of the title. Finally, we select POS tags with frequency larger than 50 and compute their average ranks. The top 5 and last 5 POS tags are listed in Table 5. We observe that TA pays more attention to informative words; for example, JJT is obviously more informative than BEZ, so TA is effective.

Conclusion
In this paper, we propose the model TADI for news recommendation. To avoid irrelevant word distraction, TADI introduces the Topic-aware Attention (TA). TA weights words with the news topic, since the news topic is one of the most important interest indicators and directly reflects the preferences of potential target users. To make the dual encoders interact more powerfully, TADI provides the Dual-encoder Interaction (DI), which enhances dual-encoder interaction through two auxiliary targets. A series of experiments verifies the effectiveness of the proposed TADI, and extensive analyses demonstrate its robustness. In the future, we plan to achieve better performance through TADI optimization and data augmentation.

Limitations
Although TADI achieves good performance, it still has limitations. First, DI is inflexible: it cannot be directly applied to other models. Secondly, the lack of data prevents the pre-trained feature encoder from being fully fine-tuned.

A Appendix
We respond to some questions.
Q1: TADI seems to overlook a crucial problem: the training and inference ranking objective is discrepant.
First of all, we agree with the viewpoint that "the training and inference ranking objective is discrepant", but we would not call it a problem. Multi-target losses allow the main target to learn knowledge from the other targets. By weighting the losses of the multiple targets, we can retain knowledge that is beneficial to the main target, so that the main target achieves better performance. Secondly, we agree that multi-task gradient conflict sometimes hurts the main target, but in our experiments the main target performs better. We considered the conflicts among the losses and tried methods to alleviate them, such as GradNorm [1] and PCGrad [2]; these methods consume a lot of extra training time but contribute little performance improvement. Considering the goal of our paper, we believe this deserves more discussion in a separate paper. Finally, the viewpoint that "the interaction performance and efficiency are always in a trade-off relation" is correct when using user and news embeddings to recall news. Nonetheless, researchers keep exploring ways to make dual-encoder models better, such as ColBERT [3]. To achieve better performance, TADI enhances the interaction of dual-encoder models during the training procedure. Therefore, TADI does not reduce the efficiency of the model when recalling news, as it still scores news by calculating the dot product of two embeddings.

Figure 2: The overview of TADI architecture.

Figure 3: The architecture of user encoder.

Figure 4: The architecture of news encoder.

Table 2: Performance Analysis.

Table 3: Ablation Analysis on MIND-small.

Table 4: Case Studies of TA.

Table 5: Quantitative Analysis of Topic-aware Attention. The number within parentheses is the normalized rank. JJT means morphologically superlative adjective. VBD means verb, past tense. VBG means verb, present participle/gerund. OD means ordinal numeral. VBN means verb, past participle. DT means singular determiner. WDT means wh-determiner. BEZ means the word "is". DO means the word "do".