FolkScope: Intention Knowledge Graph Construction for E-commerce Commonsense Discovery

Understanding users' intentions in e-commerce platforms requires commonsense knowledge. In this paper, we present FolkScope, an intention knowledge graph construction framework to reveal the structure of humans' minds about purchasing items. As commonsense knowledge is usually ineffable and not expressed explicitly, it is challenging to perform information extraction. Thus, we propose a new approach that leverages the generation power of large language models (LLMs) and human-in-the-loop annotation to semi-automatically construct the knowledge graph. LLMs first generate intention assertions via e-commerce-specific prompts to explain shopping behaviors, where the intention can be an open reason or a predicate falling into one of 18 categories aligning with ConceptNet, e.g., IsA, MadeOf, UsedFor, etc. Then we annotate plausibility and typicality labels of sampled intentions as training data in order to populate human judgments to all automatic generations. Last, to structure the assertions, we propose pattern mining and conceptualization to form more condensed and abstract knowledge. Extensive evaluations and studies demonstrate that our constructed knowledge graph can well model e-commerce knowledge and has many potential applications.


Introduction
In e-commerce platforms, understanding users' searching or purchasing intentions can benefit and motivate many recommendation tasks (Dai et al., 2006; Zhang et al., 2016; Hao et al., 2022b). Intentions are mental states in which agents or humans commit themselves to actions. Understanding others' behaviors and mental states requires rationalizing intentional actions (Hutto and Ravenscroft, 2021), for which we need commonsense, or, in other words, good judgments (Liu and Singh, 2004). For example, "at a birthday party, we usually need a birthday cake." Meanwhile, commonsense knowledge can be factoid (Gordon et al., 2010), which is not invariably true, and is usually ineffable and not expressed explicitly. Existing intention-based studies on recommendation are either limited to small numbers of intention categories (Dai et al., 2006; Zhang et al., 2016) or use models to implicitly represent intention memberships (Hao et al., 2022b). Thus, it is very challenging to acquire fine-grained intention knowledge in a scalable way.
Existing related knowledge graphs (KGs) fall into two categories. First, some general situational commonsense KGs deal with everyday social situations (Rashkin et al., 2018; Sap et al., 2019; Zhang et al., 2020b), but they are not directly related to the massive products on e-commerce platforms and thus do not generalize well to users' behavior data, even for generative models, e.g., COMET (Bosselut et al., 2019). Second, most e-commerce KGs leverage existing KGs, such as ConceptNet (Liu and Singh, 2004; Speer et al., 2017) and Freebase (Bollacker et al., 2008), and integrate them into the e-commerce catalog data (Li et al., 2020a; Luo et al., 2020; Zalmout et al., 2021; Luo et al., 2021; Deng et al., 2022). However, such integration is still based on factual knowledge, such as IsA and DirectorOf relations, and does not truly model the commonsense knowledge behind purchase intentions. Although some of these KGs may include information related to space, crowd, time, function, and event, they still fall short of modeling true commonsense knowledge (Luo et al., 2021).
Existing KGs constructed for e-commerce platforms can be evaluated for their factual knowledge in terms of plausibility. However, when it comes to purchasing intentions, a person's beliefs and desires (Kashima et al., 1998) are mediated by their intentions, which can be reflected by the typicality of commonsense (Chalier et al., 2020; Wilhelm, 2022). For example, in Figure 1, a user bought an Apple watch because "Apple watches can be used for telling the time," where the reason is highly plausible (but other watches can also serve similar functions), whereas a more typical reason would be "Apple watches are able to track running," or "the user is simply a fan of Apple products." Thus, no matter what kind of factual knowledge a KG contains, if it is not directly linked to rationalization, it cannot be regarded as typical commonsense. In addition, the task of explaining a user's rating of an item has been proposed as a means of providing recommendations. To achieve this, researchers have suggested using online reviews as a natural source of explanation (Ni et al., 2019; Li et al., 2020b). However, online reviews are often noisy and diverse and may not directly reflect the user's intention behind their purchase or rating. Instead, they may reflect the consequences of the purchase or the reasons behind the user's rating. Existing sources of information, such as question-answering pairs, reviews, or product descriptions, do not explicitly mention the user's intentions behind their purchases, making it a challenge to extract intentional commonsense knowledge for e-commerce. As a result, constructing an intention KG for e-commerce requires sophisticated information extraction techniques and thus remains challenging.
In this paper, we propose a new framework, FolkScope, to acquire intention knowledge in e-commerce. Instead of performing information extraction, we start from enormous user behaviors that entail sustainable intentions, such as co-buy behaviors, and leverage the generation power of large language models (LLMs), e.g., GPT (Radford et al., 2019; Brown et al., 2020; Ouyang et al., 2022), to generate possible intentions of the purchasing behaviors as candidates. LLMs have shown the capability of memorizing factual and commonsense knowledge (Petroni et al., 2019; West et al., 2022), and "sometimes infer approximate, partial representations of the beliefs, desires, and intentions possessed by the agent that produced the context" (Andreas, 2022). As open prompts in the above example can be arbitrary and loosely constrained, we also align our prompts with 18 ConceptNet relations, such as IsA, HasPropertyOf, CapableOf, UsedFor, etc. In addition, as the knowledge generated by LLMs can be noisy and may not reflect humans' rationalization of a purchasing action, we also perform human annotation for plausibility and typicality.
Given generated candidates and annotations to construct the KG, we first perform pattern mining to remove irregular generations. Then we train classifiers to populate the prediction scores to all generated data. Finally, for each generated intention, we perform conceptualization to map the key entities or concepts in the intention to higher-level concepts so that we can build a denser and more abstract KG for future generalization. An illustration of our KG is shown in Figure 1. To assess the overall quality of our KG, we randomly sample populated assertions and estimate their quality. Furthermore, we demonstrate the quality and usefulness of our KG by using it in a downstream task, CF-based (collaborative filtering) recommendation. The contributions of our work can be summarized as follows.
• We propose a new framework, FolkScope, to construct large-scale intention KG for discovering e-commerce commonsense knowledge.
• We leverage LLMs to generate candidates and perform two-step efficient annotation on Amazon data with two popular domains, and the process can be well generalized to other domains.
• We define the schema of the intention KG aligned with the well-known commonsense KG ConceptNet, and populate a large KG based on our generation and annotation with 184,146 items, 217,108 intentions, 857,972 abstract intentions, and 12,755,525 edges (assertions).
• We perform a comprehensive study to verify the validity and usefulness of our KG.

Overview of FolkScope Framework
We call our framework FolkScope as we make the first attempt to reveal the structure of e-commerce intentional commonsense for rationalizing purchasing behaviors. As shown in Figure 2, FolkScope is a human-in-the-loop approach for the semi-automatic construction of the KG. We first leverage LLMs to generate candidate assertions of intentions for purchasing or co-purchasing behaviors, based on co-buy data from the released Amazon dataset. Then we employ a two-step annotation to label the plausibility and typicality of the generated intentions, with the corresponding scores defined as follows.
• Plausibility: how plausible it is that the assertion is valid regarding the items' properties, usages, functions, etc.
• Typicality: how well the assertion reflects a specific feature that causes the user behavior. Typical intentional assertions should satisfy the following criteria. 1) Informativeness: the assertion contains key information about the shopping context rather than a generic statement, e.g., "they are used for Halloween parties" vs. "they are used for the same purpose." 2) Causality: the assertion captures the typical intention behind user behaviors, e.g., "they have a property of water resistance," since specific attributes or features might largely affect users' purchase decisions.
After the annotation, we design classifiers to populate prediction scores over all generated candidates. The high-quality ones are then further structured using pattern mining on their dependency parses to aggregate similar assertions. Then, we also perform conceptualization (Song et al., 2011; Zhang et al., 2022a) to further aggregate assertions into more abstract intentions.

Knowledge Generation
User Behavior Data Sampling. We extract users' behavior data from the open-sourced Amazon Review Data (2018) (Ni et al., 2019) with 15.5M items from Amazon.com. In our work, we mainly consider co-buy pairs, which might indicate stronger shopping-intent signals than co-view pairs. After pre-processing and removing duplicated items, the resulting co-buy graph covers 3.5M nodes and 31.4M edges. The items are organized into 25 top-level categories from the Amazon website, among which we choose two frequent categories, "Clothing, Shoes & Jewelry" and "Electronics," to sample co-buy pairs, because those items substantially appear in situations requiring commonsense knowledge to understand, while other categories such as "Movie" or "Music" are more relevant to factual knowledge between entities. We uniformly sample co-buy pairs from the two categories, and the statistics are shown in Table 1.

Prompted Generation. As shown in Table 2, we verbalize the prompt templates using the titles of co-buy pairs. Besides the general prompt (i.e., "open"), we also align our prompts with 18 relations in ConceptNet that are highly related to commonsense. For example, for the relation HasA, we can design a prompt "A user bought 'item 1' and 'item 2' because they both have [GEN]," where [GEN] is a special token indicating generation. Since long item titles might contain noise besides useful attributes, we use heuristic rules to filter out items whose titles, e.g., with repeated words, potentially affect the conditional generation. We use the OPT model (Zhang et al., 2022b) with 30 billion parameters as the generator. For each of the sampled co-buy pairs, we set the max generation length to 100 and generate 3 assertions using nucleus sampling (p = 0.9) (Holtzman et al., 2020). We post-process the candidates as follows.
(1) We discard the generations without one complete sentence.
(2) We use the sentence segmenter from the Spacy library to extract the first sentence from longer generations.
After removing duplicates, we obtain 16.64M candidate assertions for 293K item pairs and 4.06M unique tails among them.The statistics of the two categories are listed in Table 1.
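For illustration, the prompt verbalization and post-processing steps can be sketched as follows; the templates and the regex-based sentence splitter are simplified stand-ins (the paper uses the templates in Table 2 and the spaCy segmenter), and `verbalize`/`postprocess` are hypothetical helper names.

```python
import re

# Hypothetical relation-to-prompt templates, in the style of Table 2.
TEMPLATES = {
    "HasA":    "A user bought '{a}' and '{b}' because they both have",
    "UsedFor": "A user bought '{a}' and '{b}' because they are both used for",
    "open":    "A user bought '{a}' and '{b}' because",
}

def verbalize(item_a, item_b, relation):
    """Fill a co-buy pair's titles into the prompt for one relation."""
    return TEMPLATES[relation].format(a=item_a, b=item_b)

def postprocess(generation):
    """Keep only the first complete sentence; discard incomplete ones."""
    # Naive regex splitter standing in for the spaCy sentence segmenter.
    sentences = re.split(r"(?<=[.!?])\s+", generation.strip())
    first = sentences[0].strip()
    return first if first and first[-1] in ".!?" else None

prompt = verbalize("Apple Watch Series 7", "Sport Band", "UsedFor")
# A raw LLM continuation would be appended to `prompt`; here we mock one.
raw = "tracking workouts. They also match in color."
print(postprocess(raw))  # keeps only "tracking workouts."
```

The same `postprocess` returns `None` for generations that never complete a sentence, matching post-processing rule (1).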

Two-step Annotation and Population
As the generated candidates can be noisy or irrational, we apply human annotation to obtain high-quality assertions and then populate the labels over all generated assertions. We use Amazon Mechanical Turk (MTurk) to annotate our data. Annotators are provided with a pair of co-buy items, including each item's title, category, shopping URL, and three images from our sampled metadata. Assertions with different relations are presented in natural language form by using the prompts presented in Table 2 (more details are listed in Appendix A).
Annotation. To filter out incorrect candidates, we begin by annotating plausibility in the first step, where annotators make binary judgments and the inter-annotator agreement is satisfiable for such large-scale annotations. Different from the simple binary plausibility judgments, in the second step we use more fine-grained and precise typicality indicators concerning informativeness and causality. Here we choose the candidates automatically labeled as plausible by our classifier trained on the first step's data. We ask the annotators to judge whether it is strongly acceptable (+1), weakly acceptable (0.5), rejected (0), or implausible (-1) that the assertion is informative and causal for a purchasing behavior. Considering that the judgments might be subjective and biased with respect to different annotators, we collect five annotations for each assertion and take the average as the final typicality score. Similar to the first step, we collect around 60K assertions. Empirically, we find that annotating more data does not bring significantly better filtering accuracy. The statistics are presented in Table 3.
Population.For plausibility population, we train binary classifiers based on the majority voting results in the first step, which can produce binary labels of the plausibility of unverified generations.For the typicality score, as we take the average of five annotators as the score, we empirically use scores greater than 0.8 to denote positive examples and less than 0.2 as negative examples.We split the train/dev sets at the ratio of 80%/20% and train binary classifiers using both DeBERTa-large (He et al., 2021(He et al., , 2023) ) and RoBERTa-large (Liu et al., 2019) as base models.The best models are selected to maximize the F1 scores on the validation sets, and results are shown in Table 4 (more results can be found in Appendix B).DeBERTa-large achieves better performance than RoBERTa-large on both they could both be used for his daughter.they could both be used for his daughter and they could both be used for his daughter to they could both be used for " his daughter " … Assertions Pattern Merged Intention they could both be used for his daughter they could both be used for his family-member they could both be used for his offspring they could both be used for his relative Conceptualized Intentions plausibility and typicality evaluation.We populate the inference over the whole generated corpus in Table 1 and only keep the assertions whose predicted plausibility scores are above 0.5 (discarding 32.5% generations and reducing from 16.64M to 11.24M).Note that only plausible assertions are kept in the final KG.Using different confidence cutting-off thresholds leads to trade-offs between the accuracy of generation and the size of the corpus.After the two-step populations, we obtain the plausibility score and typicality score for each assertion.Due to the measurement of different aspects of knowledge, we observe low correlations between the two types of scores (Spearman correlation ρ: 0.319 for clothing and 0.309 for electronics).

Knowledge Aggregation
To acquire a KG with topology structures instead of sparse triplets, we aggregate semantically similar assertions.This is done by (1) pattern mining to align similar generated patterns and (2) conceptualization to produce more abstract knowledge.
Assertions are typically expressed as free-form text phrases, some of which share similar syntax and semantics. By extracting the skeleton and necessary modifiers, such as demonstrative pronouns, adjectives, and adverbs, we can reduce the noise in these phrases. For example, as shown in Figure 3, several generations can be simplified to "they could both be used for his daughter," despite the presence of punctuation and incomplete content. To achieve this, we employ frequent graph-substructure mining over dependency parse trees to discover linguistic patterns (more details in Appendix C).
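A toy sketch of the aggregation idea; the real pipeline mines frequent substructures of dependency parse trees, which we approximate here with simple regex normalization (an illustrative stand-in, not the actual method).

```python
import re
from collections import defaultdict

def normalize(assertion):
    """Toy stand-in for pattern mining over dependency parses: strip
    punctuation and quoting noise, collapse whitespace, and drop trailing
    dangling connectives so near-duplicates share one skeleton."""
    s = assertion.lower()
    s = re.sub(r'["\u201c\u201d]', " ", s)      # drop quotation marks
    s = re.sub(r"[^\w\s']", " ", s)             # drop remaining punctuation
    s = re.sub(r"\s+", " ", s).strip()
    s = re.sub(r"\s+(and|to|for|of)$", "", s)   # dangling connectives
    return s

variants = [
    "they could both be used for his daughter.",
    "they could both be used for his daughter and",
    'they could both be used for " his daughter "',
]
groups = defaultdict(list)
for v in variants:
    groups[normalize(v)].append(v)
print(list(groups))  # one merged skeleton for all three variants
```

All three noisy generations from the Figure 3 example collapse to the same merged intention, which is the effect the pattern-mining step aims for.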
After pattern mining, we can formally construct our knowledge graph, where the head is a pair of items (p1, p2), the relation r is one of the relations shown in Table 2, and the tail is an aggregated assertion e that is originally generated and then mapped to one of 256 patterns. Each knowledge triple is associated with two populated scores, i.e., plausibility and typicality.
To produce abstract knowledge generalizable to new shopping contexts, we also perform conceptualization with the large-scale concept KG Probase (Wu et al., 2012; He et al., 2022; Wang et al., 2023b). The conceptualization process maps one extracted assertion e to multiple conceptualized assertions with concepts c. For example, in Figure 3, "they could be used for his daughter" can be conceptualized as "they could be used for his offspring," "they could be used for his relative," "they could be used for his family-member," etc. The conceptualization weight P(c|e) can be determined by the likelihood of IsA(e, c) in Probase. This process has been employed and evaluated in ASER 2.0 (Zhang et al., 2022a). Finally, we obtain a KG with 184,146 items, 217,108 intentions, 857,972 abstract intentions, and 12,755,525 edges that explain 236,739 co-buy behaviors, where 2,298,011 edges come from the original assertions, 9,297,500 edges come from the conceptualized ones, and 1,160,014 edges model the probabilities of the conceptualization.
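A small sketch of conceptualization weighting, under the assumption that P(c|e) is proportional to Probase-style IsA counts; the `ISA_COUNTS` dictionary is hand-made toy data, not actual Probase statistics.

```python
# Toy Probase-style IsA counts; the real KG uses Probase co-occurrence
# statistics, approximated here with a hand-made dictionary.
ISA_COUNTS = {
    "daughter": {"offspring": 40, "relative": 35, "family-member": 25},
}

def conceptualize(entity, assertion):
    """Map an assertion to weighted conceptualized variants, with
    P(c | e) proportional to the IsA(e, c) count."""
    counts = ISA_COUNTS.get(entity, {})
    total = sum(counts.values())
    return {
        assertion.replace(entity, concept): count / total
        for concept, count in counts.items()
    }

weights = conceptualize("daughter", "they could both be used for his daughter")
for abstract, p in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{p:.2f}  {abstract}")
```

The weights sum to one, so each abstract intention edge carries a proper conditional probability, matching the edges in the KG that "model the probabilities of the conceptualization."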

Intrinsic Evaluations
In this section, we present some examples from our constructed KG and conduct comprehensive intrinsic evaluations of it.

Examples in KG
We show two examples of co-purchased products and their corresponding knowledge (§2.2) as well as populated scores (§2.3) in Table 7. We measure the quality of assertions using both plausibility and typicality scores, which again shows that the two scores are not strongly correlated. For example, "they are SimilarTo the product they bought" for the first pair and "they are DistinctFrom other similar products" for the second pair are plausible assertions but not typical explanations of why a user would buy the items together. Moreover, some of the open-relation generations are very good as well. Take the second pair as an example: the open relation shows "he was worried about his baby's skin," as both products are related to baby skin protection. We also provide more typical knowledge examples in Table 14 of the Appendix.

Human Evaluation
As we populate all generated assertions using classifiers based on the DeBERTa-large model, we conduct human evaluations by sampling a small number of populated assertions across different ranges of predicted scores to evaluate the effectiveness of the knowledge population.

Plausibility Evaluation
We randomly sample 200 plausible assertions from each relation in each of the clothing and electronics domains to test the human acceptance rate. The annotation is conducted in the same way as in the construction step. As we only annotate assertions with predicted plausibility scores greater than 0.5, the IAA is above 85%, even greater than that of the construction step. As shown in Table 5, different cut-off thresholds (based on the plausibility score from our model) lead to trade-offs between accuracy and KG size. Overall, FolkScope achieves an 83.4% acceptance rate with the default threshold (0.5). To understand what is filtered, we manually check the generations with low plausibility scores and find that OPT can generate awkward assertions, such as simply repeating the item titles or making obvious logical errors with respect to the corresponding relations. Our classifier trained on the annotated datasets helps resolve such cases. Using a larger threshold of 0.9, we attain a 95.35% acceptance rate, a nearly 11.96% improvement, while still keeping more than 8M plausible assertions. We also report the accuracy in terms of different relations in Table 6. We observe that assertions concerning relations about human situations, such as Cause, Result, and CauseDesire, have relatively lower plausibility scores and longer lengths than relations about items' properties, functions, etc. This is because there exist clues about item knowledge in the item titles, while it is much harder to generate (or guess) implicit human causal reasons using language generation.

Typicality Evaluation
The goal of the typicality population is to precisely recognize high-quality knowledge, so we evaluate whether assertions with high typicality scores are truly good ones. We randomly sample 200 assertions from each relation whose predicted typicality scores are above 0.8 for human evaluation. Each of the assertions is again annotated by five AMT workers, and the average rating is used. The results are shown in Table 8. It shows that the average annotated scores are lower than the predicted ones due to harder judgments for typicality. Similarly, predicted typicality scores are less accurate than plausibility scores. In particular, the typicality score is further decreased after conceptualization. This is because, first, the conceptualization model may introduce some noise, and second, more abstract knowledge tends to be less typical when humans are asked to annotate it. We also show the typicality scores of each relation in Figure 4. Different from plausibility, SimilarTo, DistinctFrom, DefinedAs, and HasPropertyOf are less typical compared to other relations. They describe items' general features but cannot capture typical purchasing intentions well, even though they have high plausibility scores, whereas CapableOf and MadeOf are the most typical features that can explain purchasing intentions for the two domains we consider. More evaluation on the diversity of implicit generation and fine-grained subcategory knowledge aggregation can be found in Appendix D.
Extrinsic Evaluation

Experimental Setup

Data Preparation. We conduct the extrinsic evaluation via knowledge-augmented recommendation tasks. Specifically, we use the user-item interaction data of the same categories from the Amazon Review dataset (Ni et al., 2019), shown in Table 9. We split the datasets into train/dev/test sets at a ratio of 8:1:1 and report averaged RMSE (root mean square error) scores over five runs.
To fairly evaluate the KG for recommendation, we sample the sub-graph of the original KG in which co-buy pairs are simultaneously purchased by at least one user in the recommendation training set. The detailed statistics of the matched KG are in Table 10. The item coverage computes the percentage of items in the recommendation dataset that are covered by the matched KG. Moreover, we also filter the matched KG with a threshold of 0.5 or 0.9 on the plausibility and typicality scores to evaluate the effectiveness of the knowledge population. From Table 10, we observe that the number of edges decreases substantially when the filters are applied, but the coverage of items does not drastically drop.
Knowledge Representation.As our constructed KG can be represented as the triplet ((p 1 , p 2 ), r, e), where the head (p 1 , p 2 ) is the co-buy pair, the relation r is from relations in Table 2 and e refer to generated tails.To combine both structural and textual information from KG, we modify the original TransE model (Bordes et al., 2013) to the following objective: where γ is a margin parameter, and p 1 , p 2 , p  and Gurevych, 2019) representations.After training the modified TransE model, all the item embeddings p can be used as extra features to enhance recommendations.

Experimental Results
Baselines. We adopt the commonly used NCF (He et al., 2017) and Wide&Deep (Cheng et al., 2016) models as our baselines. As our goal is to evaluate the effectiveness of features derived from the KG, we leave advanced KG fusion methods, such as hyperedge- or meta-path-enhanced models, to future work.
Ablation Study. We conduct two ablation studies to evaluate the effect of the structural information provided by the co-buy pairs and of the semantic information provided by the tails' text only. For the former, we train a standard TransE model solely on co-buy pairs to learn graph embeddings of items. For the latter, for each item in the matched KG, we average-pool its neighbor tails' Sentence-BERT embeddings as its semantic representation. The experimental results are shown in Table 11, and we make the following observations. First, the textual information contained in intentional assertions is useful for product recommendation, as the W&D model performs better even when only the assertion features are provided. Second, our KG, even before annotation and filtering, produces better item embeddings than solely using the co-buy item graph: the performance with our matched KG is better than that with the co-buy pair graph. Third, the two-step annotation and population indeed help improve the item embeddings for recommendation: the higher the scores, the larger the improvement the recommendation system obtains.
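The tail-text ablation can be sketched as follows, with small hand-made vectors standing in for Sentence-BERT embeddings; `item_feature` is a hypothetical helper name.

```python
def item_feature(tail_embeddings):
    """Average-pool the embeddings of an item's neighbor assertion tails
    (stand-ins for Sentence-BERT vectors) into one semantic feature."""
    dim = len(tail_embeddings[0])
    n = len(tail_embeddings)
    return [sum(v[i] for v in tail_embeddings) / n for i in range(dim)]

# Hypothetical 3-d "embeddings" of two assertion tails linked to one item.
tails = [[0.25, 0.5, 0.0], [0.75, 0.0, 0.25]]
print(item_feature(tails))  # [0.5, 0.25, 0.125]
```

The resulting pooled vector is what the ablation feeds to the recommender in place of the TransE-trained embeddings, isolating the contribution of the tails' text.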

Related Work
Knowledge Graph Construction. An early approach to commonsense KG construction is ConceptNet (Liu and Singh, 2004; Speer et al., 2017). Another line of KGs is constructed based on pattern mining (Wu et al., 2012), which can model both the plausibility and typicality of conceptualizations (Song et al., 2011). Recently, situational commonsense knowledge, such as Event2Mind (Rashkin et al., 2018) and ATOMIC (Sap et al., 2019), has attracted more attention in the fields of AI and NLP.
In e-commerce, the Amazon Product Graph (Zalmout et al., 2021) is developed to align Amazon catalog data with external KGs such as Freebase and to automatically extract thousands of attributes in millions of product types (Karamanolakis et al., 2020; Dong et al., 2020; Zhang et al., 2022c). Alibaba also develops a series of KGs including AliCG (Zhang et al., 2021), AliCoCo (Luo et al., 2020, 2021), AliMeKG (Li et al., 2020a), and OpenBG (Deng et al., 2022; Qu et al., 2022). As we have stated in the introduction, there is still a gap between collecting factual knowledge about products and modeling users' purchasing intentions.
Language Models as Knowledge Bases. Researchers have shown that LLMs trained on large corpora encode a significant amount of knowledge in their parameters (AlKhamissi et al., 2022; Ye et al., 2022). LLMs can memorize factual and commonsense knowledge, and one can use prompts (Liu et al., 2023) to probe knowledge from them (Petroni et al., 2019). It has been shown that we can derive factual KGs at scale from LLMs (Wang et al., 2020; Hao et al., 2022a) and distill human-level commonsense knowledge from GPT-3 (West et al., 2022). None of the above KGs are related to products or purchasing intentions. We are the first to propose a complete KG construction pipeline based on LLMs, together with several KG refinement methods, for e-commerce commonsense discovery.

Conclusion
In this paper, we propose a new framework, FolkScope, to acquire intention commonsense knowledge for e-commerce behaviors. We develop a human-in-the-loop, semi-automatic way to construct an intention KG, where the candidate assertions are automatically generated from large language models with carefully designed prompts aligned with ConceptNet commonsense relations. We then annotate both plausibility and typicality scores of sampled assertions and develop models to populate them over all generated candidates. The high-quality assertions are further structured using pattern mining and conceptualization to form more condensed and abstract knowledge. We conduct extensive evaluations to demonstrate the quality and usefulness of our constructed KG. In the future, we plan to extend our framework to multi-domain, multi-behavior-type, multilingual (Huang et al., 2022; Wang et al., 2023a), and temporal (Wang et al., 2022b,a) scenarios to empower more e-commerce applications.

Limitations
We outline two limitations of our work, concerning user behavior sampling and knowledge population. Given the huge volume of user behavior data produced every day on e-commerce platforms, it is crucial to efficiently sample significant behaviors that indicate strong intentions and to avoid random co-purchasing or clicking. Though in this work we adopt the criterion of selecting nodes whose degree is greater than five in the co-buy graph, it is still coarse-grained, and more advanced methods remain to be explored to sample representative co-buy pairs for intention generation. One potential solution is to aggregate frequent co-buy category pairs and then sample product pairs within selected category pairs. Moreover, our proposed framework can be generalized to other types of abundant user behaviors, such as search-click and search-buy, which requires designing corresponding prompts. We leave these designs to future work.
For open text generation with LLMs, it has become common practice to label high-quality data for fine-tuning to improve the quality and controllability of generation, as in LaMDA (Thoppilan et al., 2022), InstructGPT (Ouyang et al., 2022), and ChatGPT. However, computation cost is the major bottleneck in using annotated data as human feedback to fine-tune language models with billions of parameters, such as the OPT-30b model in our work. Hence we adopt a trade-off strategy: we populate human judgments by training effective classifiers and running inference over all the generation candidates. Given the impressive generation performance of ChatGPT, we expect efficient methods to directly optimize LLMs with human feedback in a more scalable way, such as reinforcement learning from human feedback (RLHF), enabling LLMs to generate more typical intention knowledge with less annotation effort.

Ethics Statement
As our proposed framework relies on large language models, and text generation based on LLMs often contains biased or harmful content, we argue that our work largely mitigates the potential risks in the following ways. First, our carefully designed prompting leads to rather narrow generations constrained to small domains, i.e., products in e-commerce. Second, we had a strict data audit process for both annotated data from annotators and populated data from trained classifiers. In a small-scale inspection, we found no significantly harmful content. The only remaining concern is that some generated knowledge is irrelevant to the products themselves. The major reason is imprecise product titles written by sellers for search engine optimization, such as adding popular keywords to attract clicks or purchases. Our human-in-the-loop annotation identified such cases, and the trained classifiers further assisted in detecting bias, as we hope our intention generations are as safe and unbiased as possible.
ITC of Hong Kong and the National Key R&D Program of China (2019YFE0198200) with special thanks to HKMAAC and CUSBLT.We also thank the support from the UGC Research Matching Grants (RMGS20EG01-D, RMGS20CR11, RMGS20CR12, RMGS20EG19, RMGS20EG21, RMGS23CR05, RMGS23EG08).

D More Evaluations

D.1 Implicit Generation Evaluation
Language-model-based generation can capture spurious correlations from the generation conditions (Ji et al., 2022). Hence, we quantify diversity as the novelty ratio: the fraction of generated tails that do not appear in the item titles, i.e., novel generations. Different from explicit attribute extraction (Vilnis et al., 2022; Yang et al., 2022), our generative method is able to extract implicit knowledge behind item titles or descriptions. For example, the title "Diesel Analog Three-Hand - Black and Gold Women's watch" contains specific attributes like "Black and Gold" and type information such as "women's watch." Such knowledge can be easily extracted by off-the-shelf tools, and traditional information-extraction approaches would mostly cover our knowledge if the generation simply copied titles to reflect the attributes. Otherwise, our method provides much more novel and diverse information than traditional approaches. The novelty ratio increases from 96.85% to 97.38% after we apply the trained classifiers for filtering; intuitively, filtering improves the novelty ratio. For assertions whose typicality scores are above 0.9, the novelty ratio reaches 98.01%. These findings suggest that FolkScope is indeed an effective framework for mining high-quality implicit knowledge.
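A rough sketch of how such a novelty ratio could be computed; the title-word heuristic and the `is_novel` helper are illustrative assumptions rather than the paper's exact procedure.

```python
def is_novel(tail, title_a, title_b):
    """A generated tail counts as novel if at least one of its content
    words is absent from the two item titles (a simple word-overlap
    heuristic standing in for the actual novelty check)."""
    titles = set((title_a + " " + title_b).lower().split())
    words = [w for w in tail.lower().split() if len(w) > 2]  # skip short tokens
    return any(w not in titles for w in words)

pairs = [
    ("black and gold",
     "Diesel Analog Black and Gold Women's watch", "Leather Band"),
    ("tracking her running pace",
     "Diesel Analog Black and Gold Women's watch", "Leather Band"),
]
novel = [is_novel(t, a, b) for t, a, b in pairs]
ratio = sum(novel) / len(novel)
print(novel, ratio)  # the first tail copies the title, the second is novel
```

Averaging `is_novel` over all generated tails yields the corpus-level novelty ratio reported above.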

Figure 1 :
Figure 1: An overview of FolkScope. It starts from users' purchasing or co-purchasing behaviors and links them to intentions. Then more abstract intentions are formed to condense the representation of intentions. The intentions can be noun phrases or verb phrases (italics).

Figure 2 :
Figure 2: The overall framework of FolkScope. It includes generation, population, and conceptualization to semi-automatically construct the e-commerce intention commonsense KG with the help of human-in-the-loop annotations and evaluation.

Figure 4 :
Figure 4: Average typicality score of each relation in the populated KG with the cutting-off threshold 0.8.

Table 1 :
Statistics of sampled co-buy pairs and generated candidate assertions. Note that the prompts in the generation are not included in the calculations of assertion lengths.

Table 2 :
Prompts for different commonsense relations.


Table 3 :
Statistics of annotated data.

Table 5 :
Acceptance ratios of plausible assertions and the corresponding sizes of populated assertions with different cutting-off thresholds.

Table 6 :
Evaluation of the plausibility rate and size of the populated KG. The prompts in the generation are not included in the calculations of assertion lengths.


Table 7 :
Two examples from the constructed knowledge graph."P." and "T." stand for the predicted plausibility and typicality scores.Generated tails with high typicality (in green) and low typicality (in red) scores are highlighted.

Table 8 :
Average annotated typicality scores for assertions after pattern mining and conceptualization with different thresholds of predicted typicality scores.

Table 9 :
Statistics of the recommendation datasets.

Table 11 :
Recommendation results in RMSE.

Table 12 :
Frequent linguistic patterns and corresponding coverage on human-annotated knowledge.