GOLD: A Global and Local-aware Denoising Framework for Commonsense Knowledge Graph Noise Detection

Commonsense Knowledge Graphs (CSKGs) are crucial for commonsense reasoning, yet constructing them through human annotation can be costly. As a result, various automatic methods have been proposed to construct CSKGs with larger semantic coverage. However, these unsupervised approaches introduce spurious noise that can lower the quality of the resulting CSKG, and this noise cannot be tackled easily by existing denoising algorithms due to the unique characteristics of nodes and structures in CSKGs. To address this issue, we propose GOLD (Global and Local-aware Denoising), a denoising framework for CSKGs that incorporates entity semantic information, global rules, and local structural information from the CSKG. Experimental results demonstrate that GOLD outperforms all baseline methods in noise detection tasks on synthetic noisy-CSKG benchmarks. Furthermore, we show that denoising a real-world CSKG is effective and even benefits the downstream zero-shot commonsense question-answering task.


Introduction
The emergence of Commonsense Knowledge Graphs (CSKGs) has significantly impacted the field of commonsense reasoning (Liu et al., 2021; Zhang et al., 2020), as CSKGs provide commonsense knowledge that is often not explicitly stated in text and is difficult for machines to capture systematically (Davis and Marcus, 2015). While existing methods bank on expensive and time-consuming crowdsourcing to collect commonsense knowledge (Sap et al., 2019a; Mostafazadeh et al., 2020), it remains infeasible to obtain CSKGs large enough to cover the numerous entities and situations in the world (He et al., 2022; Tandon et al., 2014). To overcome this limitation, various automatic CSKG construction methods have been proposed to acquire commonsense knowledge at scale (Bosselut et al., 2019), including prompting Large Language Models (LLMs) (West et al., 2022; Yu et al., 2022), rule mining from massive corpora (Tandon et al., 2017; Zhang et al., 2022a), and knowledge graph population (Fang et al., 2021a,b, 2023). Although these methods are effective, they still suffer from noise introduced by construction bias and the lack of human supervision. Therefore, how to identify noise in large-scale CSKGs accurately and efficiently becomes a crucial research question.
To tackle this issue, noise detection algorithms have been proposed for conventional entity-based KGs, primarily adopting two approaches: learning-based and rule-based. Learning-based methods like TransE (Bordes et al., 2013) learn representations of entities and relations that adhere to specific relation compositions, such as the translation assumption or relational rotation. To enhance their performance, researchers also incorporate local information around the head and tail entities, such as different paths from head to tail (Lin et al., 2015; Xie et al., 2018; Jia et al., 2019) and neighboring triples (Zhang et al., 2022b). These methods aim to improve the ability to capture the complex relationships between entities in KGs. However, they are not easily adaptable to the unique characteristics of CSKGs. In CSKGs, nodes are non-canonicalized, free-form text, meaning that nodes with different descriptions may have related semantics. As illustrated in Figure 1, "paint door" and "paint house" are two distinct nodes but imply related semantics (Speer et al., 2017). Additionally, when detecting the noise (paint, UsedFor, brush your teeth), "brush your teeth" is an isolated node that cannot be distinguished based on any structural information. Only through the power of a language model can it be learned that "paint" and "brush your teeth" are uncorrelated, thus detecting such noise. The aforementioned methods overlook this semantic information and cannot generalize to semantically similar events with diverse structural information.
On the other hand, rule-based methods utilize logical rules in KGs for noise detection. For instance, as shown in Figure 1, the correct relation between "painter" and "paint house" should be CapableOf. This can be easily detected through the learned logical rule CapableOf(x, y) ← CapableOf(x, z) ∧ HasPrerequisite(y, z). Belth et al. (2020) similarly propose an approach based on information theory that extracts sub-graph patterns to identify noise. However, the sparsity of edges in CSKGs (Malaviya et al., 2020) poses a serious challenge to learning structural information well, as the number of learnable rules decreases significantly. This requires a generalizable rule-learning ability on the noise detector's side to expand the rule bank accordingly, which is currently lacking. Therefore, applying noise detection models for KGs directly to CSKGs can result in incomplete learning of both the semantic and structural information in CSKGs.
In order to detect noise in CSKGs effectively, it is important to jointly consider the semantic information and the global and local structural information. However, these factors have not been given enough importance in existing denoising approaches. To address this gap, we propose GOLD (Global and Local-aware Denoising), a CSKG noise detector that uses a PLM-based triple encoder and two noise detectors that take into account global and local structures, respectively (Section 4). Specifically, the triple encoder extracts the semantic information contained in the free-text-formatted nodes of CSKGs. To identify correct patterns, the global detector uses high-frequency patterns extracted through rule mining and intrinsically uses a rule encoder to generalize the learned rules and guide noise detection. The local detector, inspired by Zhang et al. (2022b), adopts a graph neural network to efficiently measure the similarity of the aggregated semantic information of the neighboring triples of the head and tail nodes. Extensive experiments on two manually synthesized noisy-CSKG benchmarks demonstrate the efficacy and state-of-the-art performance of GOLD. Further experiments and analyses with ATOMIC 10X (West et al., 2022), a large-scale CSKG distilled from GPT-3, demonstrate its proficiency in identifying noise within real-world CSKGs, while also yielding advantages in the downstream zero-shot commonsense question-answering task.
In summary, in this paper, we make the following contributions:
• We introduce a new task: CSKG denoising, which can be applied to various CSKG construction and LLM distillation works.
• We propose a novel framework, GOLD, which outperforms all existing methods (Section 6.1) and LLMs (Section 6.3).
• We show that GOLD successfully detects noise in real-world CSKGs (Section 6.5) and that such denoising extrinsically benefits the downstream zero-shot commonsense question-answering task (Section 6.4).
2 Related Work

Knowledge Graph Noise Detection
Many existing knowledge graph noise detection approaches utilize some local information while simultaneously training embeddings to satisfy a relational assumption. Path information is the most commonly used type of local information, as the reachable paths from the head entity to the tail entity have been proven crucial for noise detection in knowledge graphs (Lin et al., 2015; Xie et al., 2018; Jia et al., 2019). Zhang et al. (2022b) show that contrastive learning between the information of the neighboring triples of the head and tail entities is more effective because the contrast occurs at the triple level rather than the entity or graph level, leading to superior performance compared to all path-based methods. Clustering methods (Ge et al., 2020) are also used to separate noise from clean triples, and an active learning-based classification model has been proposed to detect and repair dirty data. While these methods consider local information, our work also accounts for the semantic information and the global information of the knowledge graph to guide noise detection, better mitigating the impact of noise on local information. Regarding direct noise detection in CSKGs, Romero and Razniewski (2023) study the problem of mapping an open KB into the structured schema of an existing one, while our method only uses the CSKG to be denoised itself, without relying on any other CSKG.

Knowledge Graph Rule Mining
Another related line of work is knowledge graph rule mining, which is essential to our method. This task has received great attention in knowledge graph completion. The first category of methods is Inductive Logic Programming (ILP) (Muggleton and Raedt, 1994), which uses inductive and logical reasoning to learn rules. On the other hand, AMIE (Galárraga et al., 2013) proposes a method of association rule mining, which explores frequently occurring patterns in the knowledge graph to extract rules and counts the number of instances supporting each discovered rule along with its confidence score. AMIE+ (Galárraga et al., 2015) and AMIE 3 (Lajus et al., 2020) further improve upon this method by introducing several pruning optimizations, allowing them to scale well to large knowledge graphs. SWARM (Barati et al., 2017) also introduces a statistical method for rule mining in large-scale knowledge graphs that focuses on both instance-level and schema-level patterns.
However, it requires type information of entities, which is not available in CSKGs, and therefore it cannot be applied to them. Recently, with the success of deep learning, the idea of ILP has been neuralized, resulting in a series of neural-symbolic methods. Neural LP (Yang et al., 2017) and DRUM (Sadeghian et al., 2019) both propose end-to-end differentiable models for learning first-order logical rules for knowledge graph reasoning. Despite the great success achieved by combining Recurrent Neural Networks (RNNs) (Schuster and Paliwal, 1997) with rule mining (Qu et al., 2021; Cheng et al., 2022, 2023), neuralized methods are inherently hard to interpret due to the confidence scores output by neural networks. Furthermore, jointly learning rules and embeddings has been proven effective (Guo et al., 2016), and iteratively learning between them can also promote the effectiveness of both (Guo et al., 2018; Zhang et al., 2019b). For noise detection in knowledge graphs, Belth et al. (2020) learn higher-order patterns based on subgraphs to help refine knowledge graphs, but this requires type information of nodes and hence cannot be applied to CSKGs.

Knowledge Graph Completion with Pretrained Language Models
Aside from specifically designed noise-detection methods, the line of work targeting KG completion can also be transferred to tackle noise-detection tasks. Previous research has shown that PLMs can achieve outstanding performance on KG completion for both conventional KGs (Wang and Li, 2016; An et al., 2018; Yao et al., 2019; Wang et al., 2021b; Markowitz et al., 2022; Shen et al., 2022) and CSKGs (Su et al., 2022; Yasunaga et al., 2022) due to their ability to capture linguistic patterns and semantic information. However, two limitations remain. First, performing edge classification with a PLM requires optimizing a large number of parameters on textual data transformed from the edges of CSKGs. Such fine-tuning is not only computationally expensive but also incapable of learning the structural features of graphs, which are essential for accurately identifying and classifying edges. Second, recent studies (Safavi et al., 2021; Chen et al., 2023) have shown that language models, regardless of their scale, struggle to acquire implicit negative knowledge through costly language modeling. This makes them potentially vulnerable on noise detection tasks, as such noise typically belongs to negative knowledge. Therefore, more sophisticated manipulations of the semantic information extracted by PLMs are needed to leverage them efficiently for noise detection.

Problem Definition
Noises in CSKG Since commonsense knowledge represents not only basic facts, as in traditional knowledge graphs, but also the understanding possessed by most people (Liu and Singh, 2004), we evaluate whether a triple is noise from two perspectives:
• Truthfulness: It should be consistent with objective facts. For example, (London, IsA, city in France) is not true because London is not in France but in England.
• Reasonability: It should align with logical reasoning and be consistent with cultural norms. For example, (read newspaper, MotivatedByGoal, want to eat vegetables) is not logically reasonable: the two nodes are not directly related, and there is no clear relationship between them. Another example is that (hippo, AtLocation, in kitchen) violates our understanding and experience of reality, because hippos are large mammals that are highly unlikely to be found in a kitchen.
If a triple fails to satisfy any of the aspects mentioned above, we define it as noise.
CSKG Denoising A CSKG can be represented as G = (V, R, E), where V is a set of nodes, R is a set of relations, and E ⊆ V × R × V is a set of triples (edges). Given a triple (h, r, t) ∈ E, we concatenate the language descriptions of h, r, and t and determine whether the resulting description conforms to commonsense. Since each triple violates commonsense to a different degree, we define noise detection as a ranking problem to standardize the evaluation: a scoring function f: E → ℝ indicates the likelihood of each triple being noisy, with higher scores meaning more likely to be noise.
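The ranking formulation above can be sketched as follows; the scoring function here is a hypothetical stand-in for a learned f, and the toy triples echo the examples from Figure 1 and this section:

```python
# Minimal sketch of noise detection as ranking. `score` stands in for
# the learned scoring function f: E -> R; higher = more likely noisy.

def rank_by_noise_score(triples, score):
    """Return triples sorted from most to least likely noisy."""
    return sorted(triples, key=score, reverse=True)

def flag_top_k(triples, score, k):
    """Flag the k highest-scoring triples as suspected noise."""
    return set(rank_by_noise_score(triples, score)[:k])

# Toy example with a hypothetical, hand-assigned scorer.
triples = [
    ("painter", "CapableOf", "paint house"),
    ("paint", "UsedFor", "brush your teeth"),   # noisy example from Figure 1
    ("hippo", "AtLocation", "in kitchen"),      # noisy example from Section 3
]
toy_scores = {triples[0]: 0.1, triples[1]: 0.9, triples[2]: 0.8}
flagged = flag_top_k(triples, toy_scores.get, k=2)
```

Treating detection as ranking (rather than hard classification) is what lets Recall@k and AUC serve as the evaluation metrics later on.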

The GOLD Method
Our proposed method GOLD comprises four components: a triple encoder, a global noise detector, a local noise detector, and a comprehensive evaluation scorer. An overview is presented in Figure 2. First, we leverage a PLM to encode the natural language descriptions of nodes and relations in the CSKG to obtain their sentence embeddings, which are further composed into triple encodings. When detecting noise, we evaluate the likelihood of a triple being noise from both a global and a local perspective. From the global perspective, we aim to identify high-frequency patterns in the knowledge graph, as a small amount of noise is unlikely to affect correct high-frequency patterns (Belth et al., 2020). To accomplish this, we employ rule mining to extract high-quality rules from the knowledge graph. From the local perspective, we adopt a graph neural network to aggregate the information of the neighboring triples around both the head and tail nodes of a given edge, allowing us to estimate whether the two nodes are correlated. Finally, based on these two detectors, we obtain a comprehensive score indicating the noise level.

Triple Encoder
As mentioned earlier, the nodes in CSKGs are linguistic descriptions that are not restricted to any specific canonicalized form. If their semantic information is ignored, the accuracy of noise detection will inevitably suffer. Therefore, the Triple Encoder (TE) employs a PLM to encode the semantics of each node and relation. For a triple (h, r, t), the embeddings are defined as:

s_h = LM(h), s_r = LM(r), s_t = LM(t), (1)

where LM is a frozen PLM that maps the input text to an embedding. To strike a balance between capturing the relationship among h, r, and t and maintaining model efficiency, we opt for an efficient RNN as our encoding method for CSKG triples:

(h̃, r̃, t̃) = RNN(s_h, s_r, s_t). (2)

Then, we simply concatenate them to obtain the representation of the triple (h, r, t):

x_(h,r,t) = h̃ ⊕ r̃ ⊕ t̃, (3)

where ⊕ denotes concatenation.
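A toy numeric sketch of this pipeline, with a hash-based stand-in for the frozen PLM and hand-rolled Elman-RNN weights; all names, weights, and dimensions are illustrative, not the paper's actual configuration:

```python
import numpy as np

# Toy sketch of the Triple Encoder (TE): a frozen encoder LM(.) maps
# text to a vector, an RNN runs over (s_h, s_r, s_t), and the three
# hidden states are concatenated into the triple representation.

rng = np.random.default_rng(0)
D = 8  # toy embedding dimension

def lm_embed(text):
    """Stand-in for a frozen PLM: a deterministic pseudo-embedding."""
    g = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return g.standard_normal(D)

W_x = rng.standard_normal((D, D)) * 0.1  # input-to-hidden weights
W_h = rng.standard_normal((D, D)) * 0.1  # hidden-to-hidden weights

def triple_encode(h, r, t):
    """Run an Elman RNN over (s_h, s_r, s_t); concatenate the outputs."""
    hidden = np.zeros(D)
    outputs = []
    for s in (lm_embed(h), lm_embed(r), lm_embed(t)):
        hidden = np.tanh(W_x @ s + W_h @ hidden)
        outputs.append(hidden)
    return np.concatenate(outputs)  # shape (3 * D,)

vec = triple_encode("painter", "CapableOf", "paint house")
```

Because the PLM is frozen, only the RNN (and downstream detectors) would need gradient updates, which is the efficiency trade-off the section describes.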

Global Rule Mining
Scoring (h, r, t) only from a local perspective, such as modeling the neighbors of h and t or analyzing the paths from h to t, may not be sufficient to eliminate the interference of noisy triples, as it is difficult to determine what is noise from local structures alone. In a commonsense knowledge graph, the noise ratio should not be excessively high, so high-frequency patterns learned from a global perspective are likely to cover correct triples. In turn, these patterns can guide us in identifying noisy data by detecting violations.
To incorporate the global information of the entire CSKG when determining the probability of a triple being noise, we use rule mining to extract high-frequency, high-confidence, and interpretable rules from the CSKG. Taking into account both the interpretability and the efficiency of the model, we employ AMIE 3 (Lajus et al., 2020), a rule mining method based on the frequency of each pattern, to automatically generate logical rules of the following format:

r_h(x, y) ← r_1(x, z_1) ∧ r_2(z_1, z_2) ∧ ⋯ ∧ r_k(z_{k−1}, y), (4)

where r_h(x, y) is the rule head. As illustrated in Figure 2, a rule learned from the entire CSKG provides guidance for noise detection, while the neighboring triples of "cook meal" and "buy food" are aggregated as features for local structure learning.
In Equation (4), the rule body r_b consists of k triples. To address the issue of poor generalization of mined rules due to the sparsity of edges in CSKGs, we treat the rule body r_b as a sequence and employ an RNN as a neuralized Rule Encoder (RE) to generalize the rules. Specifically, for each relation serving as a rule head, we retain the top k_rules rules with the highest confidence scores given by AMIE 3 for training the rule encoder. In cases where there is no corresponding instance for a rule body, we fill all triples in the rule body with (x, r_h, y) to align the energy scores with those of the other triples. We believe that a well-generalized rule encoder can learn a representation that explicitly infers the rule head r_h, i.e., (x, r_h, y). Hence, we align the dimensions of the outputs from TE and RE and define the global energy function as the distance between them:

E_g(h, r, t) = ∥RE(r_b) − TE(h, r, t)∥.
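One plausible reading of this design can be sketched as follows: the rule encoder folds the rule-body sequence into a single vector, and the energy is the distance between that summary and the candidate rule-head triple's embedding. All weights, names, and dimensions here are illustrative assumptions:

```python
import numpy as np

# Hypothetical sketch of the global detector: a rule body is a sequence
# of triple embeddings, the Rule Encoder (RE) summarizes it with an RNN,
# and the energy is the distance to the candidate triple's embedding.
# Low energy = the triple is well supported by a mined rule.

rng = np.random.default_rng(1)
D = 8
W_x = rng.standard_normal((D, D)) * 0.1
W_h = rng.standard_normal((D, D)) * 0.1

def rule_encode(body_embeddings):
    """RE: fold the rule-body triple embeddings into one vector."""
    hidden = np.zeros(D)
    for v in body_embeddings:
        hidden = np.tanh(W_x @ v + W_h @ hidden)
    return hidden

def global_energy(head_triple_vec, body_embeddings):
    """Distance between RE(rule body) and the rule-head triple vector."""
    return float(np.linalg.norm(rule_encode(body_embeddings) - head_triple_vec))

# e.g. body = CapableOf(x, z) followed by HasPrerequisite(y, z)
body = [rng.standard_normal(D) for _ in range(2)]
supported = rule_encode(body)            # triple that matches the rule summary
e_low = global_energy(supported, body)
e_high = global_energy(supported + 5.0, body)
```

A triple whose embedding agrees with some mined rule's summary receives low energy, while a triple far from every rule summary is pushed toward the noisy end of the ranking.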

Local Neighboring Triple Learning
Structural information plays a significant role in enhancing performance for KG noise detection tasks.
Most methods require that the relationship between two nodes be equivalent to a translation between their embeddings (Xie et al., 2018; Zhang et al., 2022b). We relax this restriction and instead aim to determine some level of contextual correlation between two related nodes; for the specific relation, our global rule mining component learns its corresponding representation. To capture the contextual semantic information of the triples around nodes, we adopt a Graph Attention Network (GAT) (Velickovic et al., 2018) to aggregate the information of the neighboring triples.
We use a transformation matrix W ∈ ℝ^{F×d} to map the embedding x_i of the i-th triple (h_i, r_i, t_i) into the latent space as v_i = W x_i, where F is the dimension of the latent space and d is the embedding dimension of the triple, and perform the self-attention function a: ℝ^F × ℝ^F → ℝ on the triples to get w_ij = a(v_i, v_j), which indicates the importance of the j-th triple to the i-th triple. To compute the attention of the neighboring triples on the head and tail nodes, respectively, we define the neighboring triples of a node e as N_e = {(h̃, r̃, t̃) | h̃ = e ∨ t̃ = e}, and then use the softmax function to normalize the coefficients:

α_{ij(h)} = exp(w_{ij(h)}) / Σ_{j′∈N_{h_i}} exp(w_{ij′}),  β_{ij(t)} = exp(w_{ij(t)}) / Σ_{j′∈N_{t_i}} exp(w_{ij′}),

where α_{ij(h)} represents the attention of the j(h)-th triple on node h_i, while β_{ij(t)} represents the attention of the j(t)-th triple on node t_i. Note that the j(h)-th triple is required to be a neighbor of node h_i, and similarly, the j(t)-th triple must be a neighbor of node t_i. We use the normalized attention coefficients to calculate a linear combination of the corresponding embeddings, which then serves as the final output:

p_i = σ(Σ_{j(h)∈N_{h_i}} α_{ij(h)} v_{j(h)}),  q_i = σ(Σ_{j(t)∈N_{t_i}} β_{ij(t)} v_{j(t)}),

where p_i is obtained from the perspective of the neighbors of node h_i, q_i is obtained from the perspective of the neighbors of node t_i, and σ represents a nonlinearity.
We simply employ the Euclidean distance between them to measure the correlation between h_i and t_i, and obtain the energy function of the triple (h_i, r_i, t_i) under local perception as follows:

E_l(h_i, r_i, t_i) = ∥p_i − q_i∥_2. (11)
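A simplified numeric sketch of the local detector, using a plain dot product in place of GAT's learned attention function a(·, ·) (an assumption for brevity; names and dimensions are illustrative):

```python
import numpy as np

# Illustrative sketch of the local detector: attention-weighted
# aggregation of neighboring-triple embeddings around the head and tail
# nodes, then Euclidean distance between the two aggregates.

rng = np.random.default_rng(2)
F = 6  # latent-space dimension

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def aggregate(anchor, neighbor_vecs):
    """Attention over a node's neighboring triples, then a combination."""
    V = np.stack(neighbor_vecs)        # (n, F) neighbor embeddings
    weights = softmax(V @ anchor)      # toy dot-product attention scores
    return np.tanh(weights @ V)        # nonlinearity of the weighted sum

def local_energy(v_i, head_neighbors, tail_neighbors):
    """||p_i - q_i||: small when head/tail contexts agree."""
    p = aggregate(v_i, head_neighbors)  # view from the head node
    q = aggregate(v_i, tail_neighbors)  # view from the tail node
    return float(np.linalg.norm(p - q))

v = rng.standard_normal(F)
shared = [rng.standard_normal(F) for _ in range(3)]
e_same = local_energy(v, shared, shared)  # identical neighborhood contexts
e_diff = local_energy(v, shared, [rng.standard_normal(F) * 3 for _ in range(3)])
```

When the head's and tail's neighborhoods carry agreeing semantics the two aggregates coincide and the energy is near zero; unrelated contexts yield a larger distance, flagging the edge as suspect.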

Jointly Learning and Optimization
The overall energy function of each triple (h, r, t) is obtained by combining the global and local energy functions:

E(h, r, t) = E_g(h, r, t) + λ E_l(h, r, t),

where λ is a hyperparameter. We use negative sampling to minimize the margin-based ranking loss

L = Σ_{i⁺∈E} Σ_{i⁻∈E_{i⁺}} max(0, γ + E(i⁺) − E(i⁻)),

where i⁺ represents a positive triple (h, r, t), and i⁻ represents a negative triple. We follow the setting of DistMult (Yang et al., 2015): a set of negative examples E_{i⁺} is constructed based on i⁺ by replacing either h or t with a random node ẽ ∈ V:

E_{i⁺} = {(ẽ, r, t) | ẽ ∈ V} ∪ {(h, r, ẽ) | ẽ ∈ V}.

5 Experimental Setup

Datasets
To evaluate the detection capability of denoising models, we follow the method introduced by Xie et al. (2018) to construct benchmark datasets for evaluation, which involves generating noise with manually defined sampling rules and injecting it back into the original CSKG. We select ConceptNet (Speer et al., 2017) and ATOMIC (Sap et al., 2019a) as the two source CSKGs due to their manageable scale and diverse coverage of edge semantics, including various entities, events, and commonsense relations. Since these manually curated CSKGs do not contain noise naturally, we synthesize noise for each CSKG separately using meticulously designed rules, as done by Jia et al. (2019), that incorporate modifications of existing edges and random negative sampling. This approach, as demonstrated by Jia et al. (2019), ensures that the resulting noise not only remains highly informative, and thus more challenging for the model to detect, but also simulates several types of noise that may appear in real-world CSKGs. More details on noise synthesis are provided in Appendix A.1.
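The edge-modification strategy can be illustrated with a minimal sketch; the actual sampling rules are in Appendix A.1, and this version only swaps one endpoint per edge:

```python
import random

# Simplified sketch of noise synthesis: corrupt an existing edge by
# swapping its head or tail for a random node, keeping the relation
# intact. This is one strategy alongside random negative sampling.

def corrupt(triple, nodes, rng):
    """Return a noisy variant of `triple` with one endpoint replaced."""
    h, r, t = triple
    replacement = rng.choice([n for n in nodes if n not in (h, t)])
    if rng.random() < 0.5:
        return (replacement, r, t)  # corrupt the head
    return (h, r, replacement)      # corrupt the tail

rng = random.Random(0)
nodes = ["painter", "paint house", "brush", "kitchen", "newspaper"]
clean = ("painter", "CapableOf", "paint house")
noisy = corrupt(clean, nodes, rng)
```

Because the corrupted edge reuses a real relation and a real node, it is far more informative (and harder to detect) than a uniformly random triple, which is the point Jia et al. (2019) make about this protocol.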

Evaluation Metrics
We use two common metrics to evaluate the performance of all methods.
Recall@k. Given that there are k noisy triples in the dataset, we sort all triples by their scores in descending order, where a higher score indicates a higher probability of being a noisy triple. We then select the top k triples and calculate the recall rate, i.e., the fraction of the true noisy triples that appear among them.

AUC. The Area Under the ROC Curve (AUC) measures the probability that a model assigns a higher score to a randomly chosen noisy triple than to a randomly chosen positive triple. A higher AUC score indicates better performance.
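Both metrics can be implemented directly from these definitions; the reference sketch below assumes scores where higher means more likely noisy and binary labels with 1 marking true noise:

```python
# Reference implementations of the two evaluation metrics.

def recall_at_k(scores, labels):
    """Fraction of true noisy triples in the top-k, with k = #noisy."""
    k = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    return hits / k

def auc(scores, labels):
    """P(random noisy triple scores higher than random clean triple),
    counting ties as 0.5 — the pairwise-comparison form of AUC."""
    noisy = [s for s, l in zip(scores, labels) if l == 1]
    clean = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((n > c) + 0.5 * (n == c) for n in noisy for c in clean)
    return wins / (len(noisy) * len(clean))

scores = [0.9, 0.2, 0.7, 0.1]   # model outputs for four triples
labels = [1,   0,   0,   0]     # one injected noisy triple
```

With these toy values the single noisy triple is ranked first, so both metrics reach 1.0; a model that ranked it below the clean triples would be penalized by both.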

Competing Methods
We compare our model with state-of-the-art models, which can be mainly divided into three categories: (i) structure embedding-based methods that are unaware of noise, including TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), and RotatE (Sun et al., 2019); (ii) embedding-based methods that are aware of noise, including CKRL (Xie et al., 2018) and CAGED (Zhang et al., 2022b); and (iii) language model-based methods that encode both semantic and structural information and are unaware of noise, including KG-BERT (Yao et al., 2019) and LASS (Shen et al., 2022). KGist (Belth et al., 2020), a rule-based method, requires node type information, which is unavailable in CSKGs, making it infeasible as a baseline. More detailed descriptions are in Appendix A.2.

Implementation Details
We leverage three families of PLMs from the Huggingface Library (Wolf et al., 2020) to build our GOLD framework: RoBERTa (Liu et al., 2019), DeBERTa-v3 (He et al., 2023), and Sentence-T5 (Ni et al., 2022). The specific variants of these PLMs are listed in Table 1. We train GOLD with the Adam (Kingma and Ba, 2015) optimizer, with the learning rate set to 1e-3. The default number of training epochs is 10, with a margin γ of 5 and a rule length of 3. Additionally, we conduct a grid search for λ from 0 to 1 and for k_rules from 0 to 500 to find the best hyperparameters. Further implementation details are discussed in Appendix A.3.
6 Experiments and Analyses

Main Results
The performance of all models on the six datasets for the noise detection task is shown in Table 1. In general, GOLD detects noise in CSKGs more accurately, outperforming all baseline methods by a large margin. Unlike language model-based baselines, whose performance increases significantly with the size of the language model, GOLD consistently surpasses the baselines across different language model backbones with small performance variation. Specifically, when using the RoBERTa family of language models, GOLD achieves average accuracy improvements of 8.64% and 8.50% over LASS on the ConceptNet and ATOMIC dataset series, respectively. Among the language models we use, Sentence-T5-xxl exhibits the best overall performance, with accuracy improvements of 10.14% and 9.17% over the baseline on the ConceptNet and ATOMIC dataset series, respectively. Additionally, the AUC score improves by 1.02% and 0.62%.

Ablation Study
In this section, we conduct an ablation study on the ConceptNet-N10 dataset to evaluate the contribution of each component of our proposed model. The results of this study are presented in Table 2.
Overall, we observe that removing any of the components results in varying degrees of performance degradation, emphasizing the essentiality of each component in our GOLD model.

Influence of Language Model
We remove the PLM from the triple encoder and use random embeddings to encode the information of nodes and relations, obtaining the embeddings s_h, s_r, s_t in Equation (1). This results in a 5.7% decrease in the model's accuracy and a 1.3% decrease in AUC, indicating that the PLM indeed contributes to understanding the semantic information of nodes. It is worth noting that even after removing the language model, the accuracy and AUC still outperform all competing methods.

Influence of Global Rule Mining
We remove the global rule encoder, which results in a 3.8% decrease in accuracy and a 1.0% decrease in AUC, implying the important role of the rule encoder in guiding noise detection. Furthermore, as we train the rule encoder using the top k_rules rules with the highest confidence scores for each relation mined by AMIE 3, we test the impact of different values of k_rules on accuracy using three datasets from the ConceptNet series, varying k_rules among {100, 200, 300, 400, 500}. The results are shown in Figure 3. We observe that when the noise level is relatively low, i.e., on the N5 dataset, k_rules = 200 achieves the best performance, and adding more rules degrades the model's performance. However, increasing the number of rules improves the model's performance to some extent when the noise level is high, such as on the N10 and N20 datasets. We attribute this to the fact that as the noise level increases, learning local information becomes more prone to being misled; hence, more rules are needed to provide global guidance.

Influence of Local Neighbor Learning

Moreover, we remove the local neighbor information learning component, resulting in a significant decrease of 30.1% in accuracy and 5.7% in AUC, demonstrating the crucial role of neighboring triple information in noise detection. More comprehensive ablation studies are in Appendix C.

Comparison with ChatGPT
Recent breakthroughs in Large Language Models (LLMs), such as GPT-3.5 (Brown et al., 2020; Ouyang et al., 2022) and ChatGPT (OpenAI, 2022), have demonstrated remarkable performance across a diverse range of NLP tasks (Chan et al., 2023; Qin et al., 2023). In light of this, we benchmark these LLMs on our noise detection task to establish another competitive baseline. To accomplish this, we randomly select 1,000 triples from our poisoned ConceptNet-N10 CSKG and ask the LLMs to rank them by iteratively comparing two triples and merge-sorting them (more details in Appendix B). This evaluation setting ensures that the LLMs follow an objective that is mostly identical to GOLD's. The results, shown in Table 3, indicate that both LLMs perform poorly on our task, leaving a substantial gap to GOLD. One possible explanation is that these LLMs operate in a zero-shot setting and lack prior knowledge of the noisy knowledge contained in CSKGs. This highlights the significance of GOLD, which acquires a keen sensitivity to noise in CSKGs through fine-tuning.
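The evaluation protocol can be sketched as follows, with a lookup table standing in for the LLM's pairwise judgment (the actual prompting details are in Appendix B; the noisiness values here are hypothetical):

```python
from functools import cmp_to_key

# Sketch of the LLM ranking protocol: repeatedly ask which of two
# triples is more likely to be noise, and sort on those judgments.
# `compare` is a stub for the LLM call.

def compare(a, b):
    """Stand-in for an LLM judgment; negative means `a` is noisier."""
    noisiness = {  # hypothetical judgments for three toy triples
        ("paint", "UsedFor", "brush your teeth"): 2,
        ("hippo", "AtLocation", "in kitchen"): 1,
        ("painter", "CapableOf", "paint house"): 0,
    }
    return noisiness[b] - noisiness[a]

def rank_triples(triples):
    """Sort noisiest-first; Python's sorted() is a mergesort variant
    (Timsort), so it needs only O(n log n) pairwise LLM comparisons."""
    return sorted(triples, key=cmp_to_key(compare))

ranked = rank_triples([
    ("painter", "CapableOf", "paint house"),
    ("hippo", "AtLocation", "in kitchen"),
    ("paint", "UsedFor", "brush your teeth"),
])
```

Ranking via pairwise comparisons keeps the LLM's task simple (a two-way choice) while still producing the full ordering that Recall@k and AUC require.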

Downstream Benefits of Denoising CSKG
We finally validate the effectiveness of our proposed noise detection framework by investigating whether eliminating noise from ATOMIC 10X yields extrinsic benefits for downstream tasks, specifically zero-shot commonsense Question Answering (QA) (Ma et al., 2021). This task involves performing QA on commonsense benchmarks, such as Abductive NLI (aNLI; Bhagavatula et al., 2020) (Wang et al., 2023b). Specifically, the head node and relation of an edge are transformed into a question using natural language templates, and the tail node serves as the ground-truth answer. Distractors are tails of other edges sampled from the same CSKG whose head node does not share common keywords with the question. A PLM is then fine-tuned on such synthetic QA entries using a marginal ranking loss to serve as a general QA model. To this end, we keep the QA synthesis protocol and model training process fixed and ablatively study the role of leveraging different CSKGs, in our case raw ATOMIC 10X versus noise-cleaned ATOMIC 10X. We use accuracy as the evaluation metric and train three QA models separately on (1) the original ATOMIC 10X, (2) ATOMIC 10X denoised with LASS, and (3) ATOMIC 10X denoised with GOLD, with the former two serving as baselines. The results are reported in Table 4. We observe that cleaning ATOMIC 10X with GOLD outperforms both baselines on average, indicating that denoising is potentially useful for automatically generated CSKGs and that GOLD is superior to other noise detection frameworks on real-world CSKGs.
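The distractor filter described above can be sketched as follows; whitespace tokenization is a simplifying assumption, and the edges are toy examples:

```python
# Sketch of the QA-synthesis distractor filter: a candidate tail from
# another edge qualifies only if its head node shares no keyword with
# the question's head node.

def keywords(text):
    """Naive whitespace keyword extraction (a simplification)."""
    return set(text.lower().split())

def sample_distractors(question_head, other_edges, n):
    """Pick tails of edges whose heads share no keyword with the question."""
    pool = [t for (h, _, t) in other_edges
            if not (keywords(h) & keywords(question_head))]
    return pool[:n]

edges = [
    ("paint house", "HasPrerequisite", "buy paint"),   # shares "paint": excluded
    ("read newspaper", "MotivatedByGoal", "relax"),
    ("cook meal", "HasPrerequisite", "buy food"),
]
distractors = sample_distractors("paint door", edges, n=2)
```

The keyword filter keeps distractors plausible yet wrong: a tail whose head overlaps with the question might accidentally be a second correct answer, which would corrupt the synthetic QA supervision.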

Case Study
We present specific case studies on the mined logical rules and detected noises in the real large-scale CSKG in Appendix D. Those cases directly show the effectiveness of our proposed method.

Conclusions
In this paper, we propose GOLD, a noise detection framework leveraging the power of language models, global rules, and local structural information. The method is motivated by the facts that nodes in CSKGs are in free-text format and that correct patterns are unlikely to be drowned out by noise. Experimental results indicate that our method achieves state-of-the-art performance on CSKG noise detection tasks. It shows a promising direction for automatically obtaining large-scale CSKGs with minimal noise, as well as for effectively representing knowledge for downstream tasks.

Limitations
In our experiments, we follow the approach of previous noise detection literature (Xie et al., 2018; Jia et al., 2019) and inject synthesized noise back into the original CSKGs. Although this noise injection technique has been deemed reliable in previous works, further investigation is necessary to verify its rigor in the field of commonsense reasoning. This is because such noise can typically be classified as negative commonsense knowledge, which, as suggested by Chen et al. (2023), should be verified by whether it can be grounded as negative knowledge. Alternatively, noise could be injected from the perspective of graph attacks (Zhang et al., 2019a) to increase the difficulty of noise detection and improve the model's robustness.

Ethics Statement

The datasets we use are curated and further processed to ensure that they are anonymized and desensitized, and the experiments align with their intended usage, which is for research purposes. Additionally, while ATOMIC 10X is generated using language models, its prompt engineering ensures that no harmful content is generated, which has been verified by manual inspection (West et al., 2022). Therefore, to the best of the authors' knowledge, we believe that GOLD introduces no additional risk.
The results of the ablation experiments conducted on the six datasets are listed in Table 8.
Influence of Language Model

By removing the PLM from the triple encoder, we observe an average decrease of 6.1% in accuracy on the ConceptNet series datasets and an average decrease of 9.7% on the ATOMIC series datasets. This indicates that the PLM has a greater impact on the accuracy of the ATOMIC datasets, as the average number of words per node in ATOMIC is much higher than that in ConceptNet. Therefore, the PLM plays a more crucial role in capturing semantic information there.
Influence of Global Rule Mining After eliminating the global rule encoder, the accuracy on the ConceptNet series and ATOMIC series datasets decreases by 3.9% and 1.6%, respectively. Our analysis suggests that the smaller number of relations in the ATOMIC datasets (only 9, compared to 34 in the ConceptNet datasets) results in significantly fewer learnable rules. As a result, the global rule encoder provides limited assistance on the ATOMIC datasets, and its contribution is not as significant as on the ConceptNet datasets.
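Rule miners such as AMIE score candidate rules by confidence: among entity pairs satisfying the rule body, the fraction that also satisfy the head. The sketch below computes this for a simple rule of the form body_rel(x, y) => head_rel(x, y); the toy knowledge graph and relation names are illustrative assumptions, not data from the paper.

```python
from collections import defaultdict

def rule_confidence(triples, body_rel, head_rel):
    """Standard confidence of the rule body_rel(x, y) => head_rel(x, y):
    among (x, y) pairs connected by body_rel, the fraction also
    connected by head_rel."""
    pairs = defaultdict(set)  # relation -> set of (head, tail) pairs
    for h, r, t in triples:
        pairs[r].add((h, t))
    body = pairs[body_rel]
    if not body:
        return 0.0
    return len(body & pairs[head_rel]) / len(body)

kg = [
    ("book", "AtLocation", "library"),
    ("book", "LocatedNear", "library"),
    ("desk", "AtLocation", "office"),
    ("desk", "LocatedNear", "office"),
    ("sand", "AtLocation", "beach"),
]
conf = rule_confidence(kg, "AtLocation", "LocatedNear")  # 2 of 3 body pairs
```

With only 9 relations, as in ATOMIC, far fewer such relation combinations exist, so fewer high-confidence rules can be mined, matching the smaller ablation effect.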

Influence of Local Neighbor Learning
The local neighbor learning component exhibits the highest contribution across all datasets, as evidenced by average drops of 33.1% and 21.0% in accuracy, and of 6.7% and 4.2% in AUC, after its removal on the ConceptNet series and ATOMIC series datasets, respectively. We believe this component has a smaller impact on the ATOMIC datasets again because of the limited number of relations, which leads to a less diverse set of information being learned from the neighboring triples.
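The intuition behind local neighbor learning can be sketched as aggregating the representations of triples incident to an entity. The mean-pooling below is a simplified, hypothetical stand-in for GOLD's actual neighbor encoder; the triples and dimensions are illustrative only.

```python
import numpy as np

triples = [
    ("paint", "UsedFor", "cover walls"),
    ("paint", "AtLocation", "hardware store"),
    ("painter", "CapableOf", "paint house"),
]
rng = np.random.default_rng(0)
triple_vecs = rng.standard_normal((len(triples), 16))  # one vector per triple

def aggregate_neighbors(entity, triples, triple_vecs):
    """Mean-pool the vectors of triples incident to `entity`,
    a simplified stand-in for local neighbor learning."""
    neigh = [i for i, (h, r, t) in enumerate(triples) if entity in (h, t)]
    if not neigh:
        return np.zeros(triple_vecs.shape[1])
    return triple_vecs[neigh].mean(axis=0)

ctx = aggregate_neighbors("paint", triples, triple_vecs)
```

When relations are few, the incident triples of different entities look more alike, so this aggregated context is less discriminative, which is one plausible reading of the smaller drop on ATOMIC.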

Influence of Translation Assumption
We attempt to investigate whether the model would benefit from the incorporation of a translation assumption, such as the h + r ≈ t relation in TransE (Bordes et al., 2013), where h, r, and t represent the embeddings of the head entity, relation, and tail entity, respectively. Inspired by this, we also integrate an energy function based on the translation assumption into our approach. We design the energy function for the translation part as follows:

E_translation(h, r, t) = ||e_h + e_r − e_t||_2,

and combine it with the global and local energies as

E(h, r, t) = E_global(h, r, t) + λ E_local(h, r, t) + λ^(t) E_translation(h, r, t),

where λ and λ^(t) are both hyperparameters. We perform a grid search for them between 0.001 and 1 and report the best results in Table 8. The experimental results indicate that the energy function based on the translation assumption in the form of Equation (16) cannot provide significant assistance to our model. The overall impact on precision is negative, with an average decrease of 0.4%. This suggests that our GOLD method does not need to rely on such translation assumption constraints when performing the noise detection task: it can implicitly learn the relationship between nodes using the energy functions of the global and local parts.
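The combined energy above can be sketched directly. In this minimal illustration, `e_global` and `e_local` stand in for the outputs of the global and local components (which the paper defines elsewhere), and the default λ values are arbitrary placeholders within the paper's grid-search range, not the tuned settings.

```python
import numpy as np

def e_translation(e_h, e_r, e_t):
    """TransE-style energy ||e_h + e_r - e_t||_2: low when the
    translation assumption h + r ≈ t holds for the triple."""
    return np.linalg.norm(e_h + e_r - e_t, ord=2)

def total_energy(e_global, e_local, e_h, e_r, e_t, lam=0.01, lam_t=0.01):
    """E = E_global + λ * E_local + λ^(t) * E_translation."""
    return e_global + lam * e_local + lam_t * e_translation(e_h, e_r, e_t)
```

A triple that satisfies the translation assumption exactly (e_t = e_h + e_r) contributes zero translation energy, so the combined score reduces to the global and local terms.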

D Case Studies
Mined Logical Rules We list the most frequent rules mined from the ConceptNet-N10 dataset using AMIE 3 in Table 9. We can observe that these rules are highly interpretable and not affected by the mixed-in noise. Therefore, they can be treated as ground truth to validate the entire knowledge graph (Bai et al., 2023).
Detected Noise We run our proposed GOLD method on the ATOMIC 10X dataset and examine the triples whose noise levels fall in the top 1%. We list ten specific examples that violate reasonability (see Section 3) in Table 10. The results show that our method can effectively extract noisy triples from a large-scale CSKG.

Figure 1: A subgraph of a CSKG with two instances of noise. One noise is (paint, UsedFor, brush your teeth), as there should be no relation between these two nodes. The other is (painter, IsA, paint house), because the correct relation should be CapableOf.

Figure 3: Accuracy of noise detection vs. k_rules, the number of selected logical rules, on the ConceptNet series.

Table 1: Comparison of the effectiveness of different methods. Our proposed GOLD model outperforms all baselines across six datasets and both metrics. We denote the best results in bold, while the best result among the competing methods is marked with an underline. Further analysis is provided in Section 6.1.

Table 2: Ablation study results comparing the performance of GOLD with and without each component on the ConceptNet-N10 dataset.

Table 8: Comprehensive ablation study results comparing the impact of each component on all six datasets. Additionally, we verify the effect of adding the translation-based energy function.
Input: A triple list L. Output: A triple list sorted from high to low according to the noise level. Function: MERGESORT.
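The sorting routine specified above can be sketched as a standard merge sort over (triple, noise score) pairs, ordered from highest to lowest noise. This is an illustrative reconstruction of the pseudocode's behavior; the scores below are made up for the example.

```python
def merge_sort_by_noise(scored):
    """Merge sort of (triple, noise_score) pairs, descending by score,
    matching the Input/Output specification above."""
    if len(scored) <= 1:
        return list(scored)
    mid = len(scored) // 2
    left = merge_sort_by_noise(scored[:mid])
    right = merge_sort_by_noise(scored[mid:])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i][1] >= right[j][1]:  # higher noise score first
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

scored = [
    (("paint", "UsedFor", "brush your teeth"), 0.9),
    (("cat", "IsA", "animal"), 0.1),
    (("painter", "IsA", "paint house"), 0.7),
]
ranked = merge_sort_by_noise(scored)  # noisiest triples come first
```

Taking a prefix of the sorted list (e.g., the top 1%) then yields the candidate noise set examined in the case studies.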

Table 9: Examples of the most frequent rules mined from the ConceptNet-N10 dataset.

Table 10: Examples of noise detected in the ATOMIC 10X CSKG.