Personalized Entity Resolution with Dynamic Heterogeneous Knowledge Graph Representations

The growing popularity of Virtual Assistants poses new challenges for Entity Resolution, the task of linking mentions in text to their referent entities in a knowledge base. Specifically, in the shopping domain, customers tend to use implicit utterances (e.g., "organic milk") rather than explicit names, leading to a large number of candidate products. Meanwhile, for the same query, different customers may expect different results. For example, with "add milk to my cart", one customer may refer to a certain organic product, while another may want to re-order a product they regularly purchase. To address these issues, we propose a new framework that leverages personalized features to improve the accuracy of product ranking. We first build a cross-source heterogeneous knowledge graph from customer purchase history and the product knowledge graph to jointly learn customer and product embeddings. We then incorporate product, customer, and history representations into a neural reranking model to predict which candidate is most likely to be purchased by a specific customer. Experiments show that our model substantially improves the accuracy of the top ranked candidates by 24.6% compared to the state-of-the-art product search model.


Introduction
Given an entity mention as a query, the goal of entity resolution (or entity linking) (Ji and Grishman, 2011) is to link the mention to its corresponding entry in a target knowledge base (KB). In an academic shared task setting, an entity mention is usually a name string, which can be a person, organization, or geo-political entity in a news context, and the KB is usually a Wikipedia dump with rich structured properties and unstructured text descriptions. State-of-the-art entity resolution methods can achieve higher than 90% accuracy in such settings (Ji and Grishman, 2011; Agarwal and Bikel, 2020), and they have been successfully applied to hundreds of languages (Pan et al., 2017) and various domains such as disaster management (Zhang et al., 2018a) and scientific discovery (Wang et al., 2015). Therefore, we tend to think of entity resolution as a solved problem in academia. In industry, however, with the rise in popularity of Virtual Assistants (VAs) in recent years, an increasing number of consumers now rely on VAs to perform daily tasks involving entities, including shopping, playing music or movies, calling a person, booking a flight, and managing schedules. The scale and complexity of industrial applications present the following new challenges.
Unpopular majority. A massive number of new entities emerges every day. The entity resolver may know very little about them since very few users interact with them. Handling these tail entities effectively requires the use of property linkages between entities and shared user interests.

Large number of ambiguous variants. When interacting with VAs, users tend to use short and less informative utterances with the expectation that the VAs can intelligently infer their actual intentions. This further raises the need to resolve entities with personalization.
In the shopping domain, this problem is even more challenging, as customers typically use implicit entity reference utterances (e.g., "organic milk") instead of explicit names (e.g., "Horizon Organic Shelf-Stable 1% Lowfat Milk"), which usually leads to a large number of candidates due to the ambiguity. However, with VAs' voice user interface (VUI), the number of products that can be shown to the customer is very limited. In this work, we focus on the problem of personalized entity resolution in the shopping domain. Given a query and a list of retrieved candidates, we aim to return the product that is most likely to be purchased by a customer.
We make three assumptions: (H1) customers tend to purchase products they have purchased in the past; (H2) customers tend to purchase a set of products that share some properties; (H3) two customers who purchased products with similar properties may share similar interests. Based on these assumptions, we propose to represent customers and products as low-dimensional distributional vectors learned from a graph of customers and products. However, unlike social media sites with rich interactions among users, customers of most shopping services are isolated, which prevents us from learning user embeddings as distributed representations. To address this issue, we propose to build a cross-source heterogeneous knowledge graph as Figure 1 depicts to establish rich connections among customers from two data sources, users' purchase history (customer-product graph) and product knowledge graph, and further jointly learn the representations of nodes in this graph using a Graph Neural Network (GNN)-based method. We further propose an attentive model to generate a query-aware history representation for each user based on the current query.
Experiments on real data collected from an online shopping service show that our method substantially improves the purchase rate and revenue of the top ranked products.

Methodology
Given a query $q$ from a customer $c$ and a list of candidate products $P = \{p_1, \ldots, p_L\}$, where $L$ is the number of candidates, our goal is to predict the product that the customer will shop for based on the customer's purchase history and the product knowledge graph. Specifically, we use purchase records $\{r_1, \ldots, r_H\}$, where $H$ is the number of historical records. As Figure 2 illustrates, we jointly learn customer and product embeddings from a cross-source customer-product graph using a GNN. To perform personalized ranking, we incorporate the learned customer embedding and the query-aware history representation as additional features when calculating the score of each candidate. We then rank all candidates by score and return the top one.

Candidate Retrieval
We first retrieve candidate products for each query using QUARTS (Nigam et al., 2019; Nguyen et al., 2020), an end-to-end neural model for product search. QUARTS has three major components: (1) an LSTM-based (long short-term memory) classifier that predicts whether a query-product pair is matched; (2) a variational query generator that generates difficult negative examples; and (3) a state combiner that switches between query representations computed by the classifier and the generator.

Joint Customer and Product Embedding
The next step is to obtain the representations of customers and products. Customer embeddings are usually learned from user-generated texts (Preoţiuc-Pietro et al., 2015;Yu et al., 2016;Ribeiro et al., 2018) or social relations (Perozzi et al., 2014a;Grover and Leskovec, 2016;Zhang et al., 2018b), neither of which are available in the shopping dataset we use. Alternatively, we establish indirect connections among customers through their purchased products under hypothesis H3, and form a customer-product graph as shown in Figure 1(a). This graph only contains a single type of relation (i.e., purchase) and ignores product attributes. As a result, it tends to be sparse and less effective for customer representation learning.
In order to learn more informative embeddings, we propose to incorporate richer information from a product knowledge graph (Figure 1(b)) where products are not only connected to different attribute nodes (e.g., brands, flavors), but they may also be associated with textual features (e.g., title) and boolean features (e.g., isOrganic).
By merging the product knowledge graph and the customer-product graph, we obtain a more comprehensive graph (Figure 1(c)) of higher connectivity. For example, in the original customer-product graph, Customer 1 and Customer 2 are disconnected because they do not share any purchase. In the new graph, they have an indirect connection through Product 2 and Product 3, which share the same flavor and ingredient.
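The merge described above can be sketched as a simple edge-list construction. The relation names and toy nodes below are illustrative stand-ins, not drawn from the actual knowledge graph:

```python
from collections import defaultdict

def build_cross_source_graph(purchases, product_attrs):
    """Merge the customer-product purchase graph and the product
    knowledge graph into one edge set keyed by relation type."""
    edges = defaultdict(set)
    for customer, product in purchases:
        edges["purchase"].add((customer, product))
    for product, relation, attr_node in product_attrs:
        edges[relation].add((product, attr_node))
    return edges

# Toy nodes mirroring Figure 1: two customers with no shared purchase
# become indirectly connected through shared attribute nodes.
graph = build_cross_source_graph(
    purchases=[("Customer1", "Product2"), ("Customer2", "Product3")],
    product_attrs=[
        ("Product2", "hasFlavor", "Vanilla"),
        ("Product3", "hasFlavor", "Vanilla"),
        ("Product2", "hasIngredient", "Milk"),
        ("Product3", "hasIngredient", "Milk"),
    ],
)
```

In the merged edge set, Customer1 and Customer2 are now two hops apart via the shared "Vanilla" and "Milk" attribute nodes, which is exactly the extra connectivity the GNN exploits.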
From this heterogeneous graph, we jointly learn customer and product representations using a two-layer Relational Graph Convolutional Network (R-GCN) (Schlichtkrull et al., 2018). The embedding of each node is updated as

$$h_i^{(l+1)} = \sigma\Big(W_0^{(l)} h_i^{(l)} + \sum_{r \in \mathcal{R}} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)}\Big)$$

where $h_i^{(l)}$ is the representation of node $i$ at the $l$-th layer, $N_i^r$ is the set of neighbor indices of node $i$ under relation $r \in \mathcal{R}$, $c_{i,r}$ is a normalization constant (e.g., $|N_i^r|$), $\sigma$ is an activation function, and $W_0^{(l)}$ and $W_r^{(l)}$ are learnable weight matrices.
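A minimal NumPy sketch of one R-GCN propagation step, assuming mean normalization ($c_{i,r} = |N_i^r|$) and a ReLU activation (both assumptions on our part; the paper only cites the R-GCN formulation):

```python
import numpy as np

def rgcn_layer(h, neighbors, W0, Wr):
    """One R-GCN propagation step (after Schlichtkrull et al., 2018).

    h         : (num_nodes, d_in) node representations at layer l
    neighbors : {relation: {node index: [neighbor indices]}}
    W0        : (d_in, d_out) self-connection weight W_0
    Wr        : {relation: (d_in, d_out)} per-relation weights W_r
    """
    out = h @ W0  # self-connection term
    for rel, adj in neighbors.items():
        for i, nbrs in adj.items():
            if nbrs:
                # mean over neighbors implements the 1/c_{i,r} normalization
                out[i] += h[nbrs].mean(axis=0) @ Wr[rel]
    return np.maximum(out, 0.0)  # ReLU activation (an assumption)
```

Stacking two such layers, as in our setup, lets a customer node aggregate information from attribute nodes two hops away.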
In order to capture textual features such as product titles and descriptions, we use a pre-trained RoBERTa (Liu et al., 2019) encoder to generate a fixed-size representation for each product. Specifically, we concatenate textual features using a special separator token [SEP], obtain the RoBERTa representation for each token, and then use the averaged embedding to represent the whole sequence. To reduce runtime, we calculate customer and product embeddings offline and cache the results.
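The [SEP]-concatenation and average pooling can be sketched as follows; `embed_fn` is a hypothetical stand-in for the per-token RoBERTa outputs, which are computed offline and cached:

```python
import numpy as np

SEP = "[SEP]"

def encode_product_text(features, embed_fn):
    """Concatenate textual features (title, description, ...) with [SEP]
    and average the per-token vectors into one fixed-size representation.

    embed_fn maps a token string to a vector; it stands in for the
    per-token outputs of the pre-trained RoBERTa encoder.
    """
    tokens = []
    for i, feature in enumerate(features):
        if i > 0:
            tokens.append(SEP)
        tokens.extend(feature.split())
    return np.stack([embed_fn(t) for t in tokens]).mean(axis=0)
```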

Candidate Representation
In addition to the product embedding, we further incorporate the following features to enrich the representation of each candidate.
Rank: the order of the candidate returned by the product retrieval system.
Relative Price: the difference between the candidate's price and the average price of all retrieved candidates.
Previously Purchased: a binary flag indicating whether a candidate has been purchased by the customer or not.
We concatenate these features with the product embedding and project the vector into a lower dimensional space using a feed forward network.
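Assembling these per-candidate features can be sketched as below; the exact feature order and scaling are assumptions, and the subsequent feed forward projection is omitted:

```python
import numpy as np

def candidate_features(prices, purchased_before, product_embs):
    """Per-candidate feature vector: retrieval rank, relative price
    (price minus the mean price of all retrieved candidates), a
    previously-purchased flag, concatenated with the product embedding."""
    prices = np.asarray(prices, dtype=float)
    rel_price = prices - prices.mean()
    rows = []
    for rank, (rp, flag, emb) in enumerate(
            zip(rel_price, purchased_before, product_embs), start=1):
        rows.append(np.concatenate([[rank, rp, float(flag)], emb]))
    return np.stack(rows)
```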

History Representation
Although customer embeddings can encode purchase history information, they are static and may not effectively provide the most relevant information for each specific query. For example, if the query is "bookshelf", furniture-related purchase records are more likely to help the model predict the product that the customer will purchase, while if the query is "sulfate-free shampoo", purchase records of beauty products are more relevant. To tackle this issue, we propose to generate a dynamic history representation v based on the current query q from all purchase record representations {v 1 , ..., v H } of the customer.
We first represent each purchase record as the concatenation of the product embedding, product price, and purchase timestamp. The query-aware history representation is then calculated as a weighted sum of the customer's purchase record representations using an attention mechanism:

$$\alpha_h = \frac{\exp\big(\mathrm{score}(q, v_h)\big)}{\sum_{h'=1}^{H} \exp\big(\mathrm{score}(q, v_{h'})\big)}, \qquad v = \sum_{h=1}^{H} \alpha_h v_h$$
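A minimal sketch of this query-aware weighted sum, assuming a bilinear scoring function $q^\top W v_h$ (an assumption for illustration; the section above specifies only that an attention mechanism over the records is used):

```python
import numpy as np

def history_representation(q, V, W):
    """Query-aware history vector: attention-weighted sum of the H
    purchase record representations V (H x d_v), scored against query q.

    The bilinear score q^T W v_h with a learnable W is an assumption;
    any query-record compatibility function could be substituted.
    """
    scores = np.array([q @ W @ v for v in V])
    exp = np.exp(scores - scores.max())
    alpha = exp / exp.sum()           # softmax attention weights
    return alpha @ V                  # weighted sum of record vectors
```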

Candidate Ranking
We adopt a feed forward neural network that takes in the candidate, customer, and history representations, and returns a confidence score $\hat{y}_i$. The confidence score is scaled to $(0, 1)$ using a sigmoid function. During training, we optimize the model by minimizing the following binary cross entropy loss:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\Big(y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\Big)$$
where $N$ denotes the total number of candidates, and $y_i \in \{0, 1\}$ is the true label. In the inference phase, we calculate confidence scores for all candidates in each session and return the one with the highest score.
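The loss and the inference step can be sketched as:

```python
import numpy as np

def bce_loss(logits, labels):
    """Binary cross entropy over sigmoid-scaled candidate scores."""
    y_hat = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    y = np.asarray(labels, dtype=float)
    eps = 1e-12  # numerical guard, not part of the formulation above
    return -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

def predict(logits):
    """Inference: return the index of the highest-scoring candidate."""
    return int(np.argmax(logits))
```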

Data
Product Knowledge Graph. In our experiment, we use a knowledge graph of products in five categories (i.e., grocery, beauty, luxury beauty, baby, and health care), which contains 24,287,337 unique product entities. As Figure 1 depicts, products in this knowledge graph are connected through attribute nodes, including brands, scents, flavors, and ingredients. This knowledge graph also provides rich attributes for each product node. We use two types of attributes in this work: textual features (i.e., title, description, and bullet) and binary features (e.g., isOrganic, isNatural).

Evaluation Dataset. We randomly collect 1 million users' purchase sessions from November 2018 to October 2019 on an online shopping service. Each session contains a query, an obfuscated identifier, a timestamp, and a list of retrieved candidate products of which only one is purchased. We split sessions before and after September 1, 2019 into two subsets. The first subset serves only as purchase history and is used to construct the customer-product graph. From the second subset, we randomly sample 22,000 customers with at least one purchase record in the first subset and take their last purchase sessions for training or evaluation. Specifically, we use 20,000 sessions for training, 1,000 for validation, and 1,000 for test. If a customer has multiple purchase sessions in the second subset, the sessions before the last one are also treated as purchase history when we generate history representations, but they are excluded from the customer-product graph, which is constructed from the first subset.
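The date-based split can be sketched as follows, with hypothetical session dictionaries standing in for the real session records:

```python
from datetime import date

def split_sessions(sessions, cutoff=date(2019, 9, 1)):
    """Split sessions at the cutoff date: the earlier set serves as
    purchase history (and builds the customer-product graph); from the
    later set we keep only customers with at least one history record."""
    history = [s for s in sessions if s["date"] < cutoff]
    later = [s for s in sessions if s["date"] >= cutoff]
    known = {s["customer"] for s in history}
    eval_pool = [s for s in later if s["customer"] in known]
    return history, eval_pool
```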

Experimental Setup
We optimize our model with AdamW for 10 epochs with a learning rate of 1e-5 for the RoBERTa encoder, a learning rate of 1e-4 for other parameters, weight decay of 1e-3, a warmup rate of 10%, and a batch size of 100.
To encode textual features, we use the RoBERTa base model with an output dropout rate of 0.5. To represent query words, we use 100-dimensional GloVe embeddings pre-trained on Wikipedia and Gigaword. We set the size of pre-trained customer and product embeddings to 100 and freeze them during training.
We use separate fully connected layers to project candidate and history representations into 100-dimensional feature vectors before concatenating them for ranking. We use a two-layer feed forward neural network with a hidden layer size of 50 as the ranker and apply a dropout layer with a dropout rate of 0.5 to its input.

Quantitative Analysis
We compare our model to the state-of-the-art product search model QUARTS as the baseline. Because our target usage scenarios are VAs where only one result will be returned to the user, we use Accuracy@1 as our evaluation metric. We implement the following baseline ranking methods.

Purchased: We prioritize products previously purchased by the customer. If multiple candidates are previously purchased, we return the one ranked higher by QUARTS.

ComplEx: Customer and product embeddings are learned using ComplEx (Trouillon et al., 2016), a widely used knowledge embedding model.
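The metric and the Purchased baseline can be sketched as:

```python
def accuracy_at_1(predictions, purchased):
    """Fraction of sessions where the single returned product matches
    the purchased one (VAs return only one result)."""
    hits = sum(p == g for p, g in zip(predictions, purchased))
    return hits / len(purchased)

def purchased_first(candidates, history):
    """'Purchased' baseline: return the highest-ranked candidate the
    customer bought before, falling back to the QUARTS top candidate."""
    for c in candidates:  # candidates come in retrieval-rank order
        if c in history:
            return c
    return candidates[0]
```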
In Table 1, we show the relative gains compared to the baseline model QUARTS. With personalized features, our method effectively improves Acc@1 on both development and test sets.
We also conduct ablation studies by removing the following features and show results in Table 2.

Ranking: In this setting, our model ignores the original retrieval ranking returned by QUARTS.

Personalized Features: We remove personalized features (e.g., customer embedding, whether a product was previously purchased) in this setting.

Product Embedding: We remove the pre-trained product embedding but still use textual features and binary features to represent products.

Joint Embedding: Customer and product embeddings are not jointly learned from the merged graph. Instead, customer embeddings are learned from the customer-product graph, and product embeddings are learned from the product knowledge graph.
In Table 2, from the results of Methods 6 and 7, we can see that removing either product or customer embedding degrades the performance of the model. The result of Method 8 shows that embeddings jointly learned from the merged cross-source graph achieve better performance on our downstream task. We also observe that the ranking returned by the product search system is still an important feature, as Method 6 shows.

Qualitative Analysis
In Table 3 and Table 4, we show some positive and negative examples in the test set. Table 4 shows examples where our model fails to return the correct item. In many cases, such as Example #4, the purchased product and the top ranked one only differ in packaging size. We also observe that sometimes customers may not repurchase a product even if it is in the candidate list.
To better understand the remaining errors, we randomly sample 100 examples where our model fails to predict the purchased items. As Figure 3 illustrates, we analyze these examples and classify the possible reasons into the following categories.

Different size. The predicted product and the ground truth are the same product but differ in size. For example, while our model predicts "Lipton Herbal Tea Bags, Peach Mango, 20 ct", the customer purchases "Lipton Tea Herbal Peach Mango (pack of 2)", which is the same product in a 2-pack.
Purchased. The customer has purchased the predicted product but decides not to repurchase it. This usually happens in categories (e.g., toothpaste) where customers are more willing to try new products. Additionally, customers may be less likely to repurchase a product in some categories such as books and electronics.

Uninformative title. The purchased product has an uninformative title and is therefore not promoted. For example, when the customer searches for "masaman curry paste maesri", our model promotes "Maesri Thai Masaman Curry - 4 oz (pack of 4)", while the customer purchases "6 Can (4oz. Each) of Thai Green Red Yellow Curry Pastes Set", which is also a Maesri product, but this key information is missing from its title.

Similar title. The title of the predicted product is similar to the titles of some purchased products in the customer's history in a less important aspect.
For example, the model promotes a "moisturizing" shave gel because the customer has purchased a "moisturizing" body wash, whereas the customer decides to purchase a product for "sensitive skin".
Brand. The customer has purchased one or more products of the same brand.
Attribute. The customer has purchased one or more products with the same attribute (e.g., organic, keto, kosher).

Other. The model may fail to predict the purchased item in other uncategorized cases. For example, when a customer searches for "nail clippers" but has purchased only food in the past, the model is unlikely to utilize the history records to improve the ranking.
Although our framework can improve the accuracy of predicting products that will be purchased, some challenges remain.

Incorporating more informative features. Some important features that affect purchase decisions are still missing from our framework, such as the average rating, customer reviews, and number of ratings. For example, we may promote the highest rated product for a customer who usually buys products with high ratings.

Building a more comprehensive cross-source customer-product graph. In this work, we merge the customer-product graph and product knowledge graph into a single graph, which has been shown to produce better embeddings for our target task. A natural extension is to include records from more sources, such as music or video playing history, and multimedia features.

Modeling the interactions among purchase behaviors. Our current attention-based method for generating history representations is "flat" and ignores the relationships among purchase behaviors. For example, for a customer who previously purchased a pod coffee maker, we should promote coffee capsules in the candidates over coffee beans or grounds.

Neural Entity Linking
A variety of neural models (Gupta et al., 2017; Kolitsas et al., 2018; Cao et al., 2018; Sil et al., 2018; Gillick et al., 2019; Logeswaran et al., 2019; Wu et al., 2019; Agarwal and Bikel, 2020) have been applied to Entity Linking in recent years. Compared to traditional entity linking, our task differs in three aspects: (1) our mentions are typically vague and occur in uninformative contexts, such as "add toothpaste to my cart"; (2) a mention may be reasonably linked to multiple entities, while only one of them is considered "correct" (i.e., purchased by the customer); (3) the ground truth for the same mention can differ across customers.

Personalized Recommendation
A recommender system is an information filtering system that aims to suggest a list of items in which a user may be interested. Content-based filtering (Billsus and Pazzani, 2000; Aciar et al., 2007) and collaborative filtering (Shardanand and Maes, 1995; Konstan et al., 1997; Linden et al., 2003; Zhao and Shang, 2010) are two common approaches used in recommender systems. In recent years, researchers have also applied neural methods to improve the quality of recommendations (Xue et al., 2017; He et al., 2017; Wang et al., 2019a,b). Recommender systems usually rank items based on the user's past behaviors (e.g., purchasing, browsing, rating) and current context (Linden et al., 2003; Smith and Linden, 2017), and the results are not constrained by queries. In contrast, our task requires a specific query and returns only the product that is most likely to be purchased from a list of relevant candidates.

Graph Embedding
Various methods have been proposed to learn low-dimensional vectors for nodes in knowledge graphs. Knowledge graph embedding methods, such as TransE (Bordes et al., 2013), DistMult (Yang et al., 2014), ComplEx (Trouillon et al., 2016), and RotatE (Sun et al., 2018), typically represent the head entity, relation, and tail entity in each triplet in the knowledge graph as vectors and aim to rank true triplets higher than corresponding corrupted triplets. Matrix factorization-based methods (He and Niyogi, 2004; Nickel et al., 2011; Qiu et al., 2018) represent the graph as a matrix and obtain node vectors by factorizing this matrix. Another category of frameworks (Perozzi et al., 2014b; Yang et al., 2015; Grover and Leskovec, 2016) uses random walks to sample paths from the input graph and learns node embeddings from the sampled paths using neural models such as SkipGram and LSTM.
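As an illustration of the triplet-scoring idea, TransE scores a triplet by how closely the relation vector translates the head entity to the tail entity:

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility: -||h + r - t||_2. For true triplets h + r
    should be close to t, giving a score near zero; corrupted triplets
    should score lower."""
    return -np.linalg.norm(h + r - t)
```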

Conclusion and Future Work
We propose a novel framework to jointly learn customer and product representations based on a cross-source heterogeneous graph constructed from customers' purchase history and the product knowledge graph to improve personalized entity resolution. Experiments show that our framework can effectively increase the purchase rate of top ranked products. In the future, we plan to investigate better approaches to integrating personalized features, and to extend the framework to cross-lingual, cross-media settings and generate conversations for more proactive and explainable entity recommendation and summarization.