Multilingual Knowledge Graph Completion from Pretrained Language Models with Knowledge Constraints

Multilingual Knowledge Graph Completion (mKGC) aims to solve queries such as (h, r, ?) in different languages by reasoning over a tail entity t, thereby improving multilingual knowledge graphs. Previous studies leverage multilingual pretrained language models (PLMs) and the generative paradigm to achieve mKGC. Although multilingual PLMs contain extensive knowledge of different languages, their pretraining tasks cannot be directly aligned with the mKGC task. Moreover, most KGs and PLMs currently available exhibit a pronounced English-centric bias, which makes it difficult for mKGC to achieve good results, particularly for low-resource languages. To overcome these problems, this paper introduces global and local knowledge constraints for mKGC. The former constrains the reasoning of answer entities, while the latter enhances the representation of query contexts. The proposed method makes the pretrained model better adapted to the mKGC task. Experimental results on public datasets demonstrate that our method outperforms the previous SOTA on Hits@1 and Hits@10 by an average of 12.32% and 16.03%, indicating that our proposed method yields significant improvements on mKGC.


Introduction
Knowledge graphs are collections of entities and facts, and are utilized as a valuable resource in a variety of natural language processing (NLP) tasks, such as question answering and recommender systems (Shah et al., 2019; Du et al., 2021; Wang et al., 2019). The language-specific nature of many NLP tasks necessitates considering the knowledge expressed in a particular language. For example, multilingual question answering needs multilingual knowledge graphs (Zhou et al., 2021). The utilization of multilingual knowledge graphs (mKGs) with a vast amount of knowledge in multiple languages, such as DBpedia (Lehmann et al., 2015) and Wikidata (Vrandečić and Krötzsch, 2014), can be advantageous in plenty of NLP tasks (Zhou et al., 2021; Fang et al., 2022).

Figure 1: The top part shows the unbalanced language distribution of DBpedia. The lower part shows sampled comparison results of the Prix-LM model and our method. The type of each predicted entity and the correct answer are shown in brackets and red font, respectively. Our approach exhibits superior consistency and accuracy in generating answers.
There is a significant number of potential facts that have not been captured in current knowledge graphs, resulting in their incompleteness (Chen et al., 2020). To address this issue, various studies have been proposed for Knowledge Graph Completion (KGC), which automatically discovers potential facts from observed facts (Bordes et al., 2013), rules (Meilicke et al., 2019), and language models (Lv et al., 2022).
In fact, as shown in Figure 1, there is far more English-centric knowledge than knowledge in other languages, which makes it difficult to leverage knowledge graphs in non-English tasks. For example, English-centric commonsense reasoning tasks have seen better development and performance than those in other languages (Lin et al., 2021a). The knowledge coverage of non-English knowledge graphs is even worse, which poses challenges for traditional KGC methods to achieve superior performance.
Nowadays, pretrained language models (PLMs) learn various knowledge modeling capabilities (Petroni et al., 2019; Jiang et al., 2020) from massive unlabeled data, and many studies have demonstrated that the knowledge contained within PLMs can significantly improve the performance of downstream tasks (Li et al., 2021; Lin et al., 2021b). Most recently, Prix-LM (Zhou et al., 2022) approached mKGC as an end-to-end generative task using multilingual PLMs. For example, to predict the missing entity of the query (86th All Japan Football Championship, stadium, ?) (see Figure 1), Prix-LM converts the query into a sequence with a pre-defined template, which is then processed by an encoder to produce a query representation. The decoder then uses this representation to generate the final answer, Axis Bird Stadium.
Despite the successes achieved through the combination of PLMs and the generative paradigm, limitations remain for mKGC. On the one hand, the gap between the pretraining task and the KGC task may contribute to these limitations: the answers generated by Prix-LM are ambiguous in type. On the other hand, languages and tokens that occur more frequently in the pretraining data have richer representations. The linguistic bias of KGs and PLMs means that entities in low-resource languages are difficult to represent, resulting in incorrect answers. As illustrated in Figure 1, the query (86th All Japan Football Championship, stadium, ?) expects a response of the type stadium, but the top-ranked answers from Prix-LM are diverse in type, and the top answer is incorrect.
We argue that incorporating knowledge constraints into the generation process can make PLMs more suitable for mKGC tasks. We categorize the knowledge effective for mKGC into global and local knowledge. Global knowledge limits the types of answers by modeling the relationship between entity and relation representations. This helps to ensure that the generated answers are semantically and logically consistent with the intent of the query. Local knowledge, on the other hand, enhances the PLM's ability to comprehend the interconnections between the sub-tokens within the query. This helps the model to better understand the context of the query and generate more accurate answers. Incorporating knowledge constraints into the generative process brings two advantages for mKGC: 1) it makes PLMs better adapted to the mKGC task; 2) it enables PLMs to learn more effective representations from low-resource data.
In this paper, we propose to incorporate global and local knowledge into the answer generation process through two knowledgeable tasks. To learn global knowledge, special tokens ([H], [R], [T]) are introduced as semantic representations of the head entity, relation, and tail entity in a triple, and a scoring function measures the plausibility of the resulting facts. Since the same special tokens are used for triples in different languages, the trained model is able to learn knowledge reasoning abilities beyond language boundaries. To capture local knowledge, we treat the representations of the answer and of each word of the query as two distributions P(H_q) and P(H_[T]), and then use an estimator to estimate and maximize the mutual information I(H_q; H_[T]) between them. The local knowledge serves to augment the query representations of the trained model using minimal amounts of data. Experimental results on seven language-specific knowledge graphs from DBpedia show that our proposed method achieves significant improvements compared to Prix-LM and translation-based methods. We release the dataset and code of our work at https://github.com/Maxpa1n/gcplm-kgc.
In short, our main contributions are as follows: • We utilize diverse knowledge constraints to enhance the performance of PLM-based mKGC, introducing global knowledge constraints for entity placeholders and mutual information constraints for the other contextual symbols. This effectively addresses the inconsistency between the PLM pretraining and mKGC tasks, and alleviates the language and data bias of PLMs and KGs.
• Our proposed method outperforms Prix-LM (Zhou et al., 2022) in both mKGC and cross-lingual entity alignment, as shown by experiments on a public dataset. The performance of our method on Hits@1, Hits@3, and Hits@10 shows an average improvement of 12.32%, 11.39%, and 16.03%, respectively.

Basic Model
A knowledge graph G = (R, E) is a collection of connected information about entities, often represented using triples (h, r, t), where r ∈ R is a relation and h, t ∈ E are entities. Prix-LM is an important work on mKGC and is also used as the basic model in this paper. Prix-LM transfers link prediction from a discriminative task to a generative task for mKGC. The goal of mKGC is to generate the missing tail entity, which may contain multiple tokens, for a query (h, r, ?) in different languages. A template is employed to transform queries into textual sequences that can be encoded by PLMs. The template includes special tokens, which identify the specific role of each element within the query triple:

S_(h,r,t) = <s>[H] X_h </s></s>[R] X_r </s></s>[T] X_t [E]

where <s> is the beginning-of-sentence token and </s> is the separator; both are standard PLM tokens, also known as [CLS] and [SEP].
[H], [R], and [T] are additional special tokens for the representations of the head, relation, and tail.
[E] is the end-of-sequence token. X_h = {x_1^h, x_2^h, x_3^h, ..., x_n^h} denotes the text tokens of the head entity; X_r and X_t are defined in the same way.
The training goal is to generate the tail entity X_t given the sequence containing the head entity X_h and relation X_r. For example, for the query (LeBron James, team member of, ?), the constructed sequence is <s>[H] LeBron James</s></s>[R] team member of</s></s>[T], and the target of mKGC is to generate Los Angeles Lakers [E]. The process is as follows:

p(X_t | X_h, X_r; θ) = ∏_i p(x_i^t | X_h, X_r, x_{<i}^t; θ)    (1)

where θ denotes the pretrained model parameters. Following the mechanism of a causal language model, the probability of the i-th token depends on the previous token representation h_{i-1}:

p(x_i | x_{<i}) = softmax(W h_{i-1})    (2)

where W is the causal language model decoder from the PLM.
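The serialization step above can be sketched as follows; the helper name `build_sequence` and the exact whitespace handling are our own illustration, not taken from the Prix-LM implementation:

```python
def build_sequence(head, relation, tail=None):
    """Serialize a (h, r, t) triple with the paper's template.

    [H], [R], [T] mark the role of each element; [E] ends the
    sequence; <s> and </s> follow the XLM-R token convention.
    The exact spacing here is illustrative.
    """
    seq = f"<s>[H] {head}</s></s>[R] {relation}</s></s>[T]"
    if tail is not None:
        seq += f" {tail} [E]"  # training target appended after [T]
    return seq
```

During training the full sequence (with the tail) is encoded; at inference only the query side up to [T] is given, and the model continues it token by token.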
Generating answers directly with PLMs can be subject to language bias, resulting in ambiguous and incorrect answers. The representation of the special token [T] is a crucial factor in determining the quality of the generated answers. To improve the representation of the [T] token, we implement two supplementary strategies that incorporate additional knowledge into it.

The Proposed Model
In this section, we describe the components of our proposed approach. The architecture of the model is depicted in Figure 2. Our approach comprises four key components: a triple encoder, a global knowledge constraint, a local knowledge constraint, and an answer generation module. These components operate in tandem to generate accurate and coherent answers for given queries.

Triple Encoder
We leverage the PLM to encode the triple and use an attention mask to control access to each subtoken in the sequence during training. We use the previous template to convert a triple (h, r, t) into a sequence S_(h,r,t) composed of {X_h, X_r, X_t, X_a}, where X_a denotes the special tokens. The attention mask mechanism allows the query sequence to be treated as the source text and the answer entity as the target text. The process is as follows:

H = PLM(S_(h,r,t))    (3)

where the hidden representation of the triple is H = {h_[H], h_1^h, ..., h_[R], h_1^r, ..., h_[T], h_1^t, ..., h_[E]}. The attention mask is a matrix that specifies whether each subtoken should be attended to or ignored, as illustrated in Figure 3. By making special tokens visible only to their own subtokens, the model can effectively separate each role in a triple. The mask matrix M is added to the attention scores calculated from the query Q, key K, and value V:

M_ij = 0 (allow to attend), −∞ (prevent from attending)    (4)

Attention(Q, K, V) = softmax(QK^T / √d + M) V

where Q, K, V ∈ R^{l×d}, l is the length of the input sequence, and d is the hidden size.
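A minimal single-head NumPy sketch of the additive mask in scaled dot-product attention (the function name and shapes are our own):

```python
import numpy as np

def masked_attention(Q, K, V, M):
    """Scaled dot-product attention with an additive mask M.

    M[i, j] = 0 lets position i attend to position j; -inf blocks
    it, because softmax assigns zero weight to -inf scores.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + M
    # numerically stable softmax over the last axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

With M built so that answer subtokens see the query but query subtokens never see the answer, one forward pass can supervise all tail subtokens at once.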

Global Knowledge Constraint
To bridge the gap between the pretraining task and the KGC task, we introduce global knowledge to build logical relationships between entities. Unlike previous approaches such as Prix-LM, our method does not rely on cross-lingual links between equivalent entities to learn shared knowledge across languages. Instead, shared knowledge between languages is learned through the global knowledge constraint, which is inspired by embedding-based methods. We leverage the TransE framework in our model; methods such as ComplEx and RotatE are also applicable. The goal of the global knowledge constraint is to represent entities and relations in a semantic space and enforce the translational principle h + r ≈ t, where h, r, and t are the representations of the special tokens [H], [R], and [T], and ∥·∥ is the L1 norm. A triple's global knowledge score is described by:

f(h, r, t) = ∥h + r − t∥

We use the same special tokens for all languages. The following loss function is used to optimize the model:

L_gk = Σ_{(h,r,t)∈G} max(f(h, r, t) − γ, 0)

where G is the set of knowledge graphs of all languages, and γ is a correction factor.
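Under these definitions, the global score and loss can be sketched as follows (NumPy; treating γ as a margin in a hinge loss is one plausible reading of the "correction factor", not the authors' exact formula):

```python
import numpy as np

def global_score(h, r, t):
    """TransE-style plausibility: L1 distance between h + r and t."""
    return np.abs(h + r - t).sum()

def global_loss(triples, gamma=1.0):
    """Push every observed triple's score below the margin gamma.

    `triples` holds (h, r, t) special-token representations; a
    well-modeled triple (score below gamma) contributes zero loss.
    """
    return sum(max(global_score(h, r, t) - gamma, 0.0)
               for h, r, t in triples)
```

Because the same [H], [R], [T] tokens are shared across languages, this constraint ties the translational structure together regardless of the surface language of the triple.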

Local Knowledge Constraint
Local knowledge enables the model to generate answers more accurately from low-resource data. We therefore establish a connection between the query and the answer within a triple. Specifically, we view the representations of the query words H_q and of the tail entity H_[T] as two distributions and maximize the mutual information I(H_q; H_[T]) between them. The theoretical foundation for this idea is provided by MINE (Belghazi et al., 2018), which demonstrates that mutual information admits a parametric lower bound:

I(H_q; H_[T]) ≥ I_θ(H_q; H_[T])

Inspired by previous Mutual Information Maximization (MIM) methods in unsupervised learning (Tschannen et al., 2019; Zhang et al., 2020), we take the local features, represented by H_q, and the global features, represented by H_[T], as the inputs for MIM. Benefiting from the mask mechanism and the PLM's powerful learning capability, we do not strictly distinguish the parameters of the encoder and decoder, unlike previous works. In this work, we select a Jensen-Shannon MI estimator to parameterize the mutual information:

I_θ^JSD(H_q; H_[T]) = E_P[−sp(−T_θ(H_q, H_[T]))] − E_P̃[sp(T_θ(H'_q, H_[T]))]

where H_[T] is the tail entity representation, T_θ is a discriminator function supported by the PLM parameters, and H'_q is a query representation sampled from another query in the same mini-batch. Setting P̃ = P keeps the expectations easy to calculate, and sp(x) = log(1 + e^x) is the softplus activation function. The learning objective is to make the PLM estimate and maximize the mutual information over mini-batches b_j from the training dataset. To optimize the model by gradient descent, we set the loss function as follows:

L_lk = −I_θ^JSD(H_q; H_[T])

where the expectations are taken over queries and tail entities. The local knowledge constraint within the PLM enhances its capacity to obtain rich representations of queries and tail entities, particularly when training data is limited.
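The Jensen-Shannon estimator can be sketched numerically as follows; the score arrays stand in for the discriminator outputs T_θ on matched and mismatched pairs:

```python
import numpy as np

def softplus(x):
    """sp(x) = log(1 + e^x), computed stably for large |x|."""
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def jsd_mi_lower_bound(scores_pos, scores_neg):
    """Jensen-Shannon MI estimate from discriminator scores.

    scores_pos: T_theta on matched (query, tail) pairs drawn from P;
    scores_neg: T_theta on pairs whose queries are shuffled within
    the mini-batch (the H'_q negatives). Higher values mean the
    discriminator separates matched from mismatched pairs better.
    """
    return (-softplus(-scores_pos)).mean() - softplus(scores_neg).mean()
```

The training loss is the negation of this bound, so gradient descent pushes the discriminator (and thus the query and tail representations) toward higher mutual information.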

Answer Generation Module
Following the paradigm of generating answer tokens given a serialized query, we use the causal language model head of the PLM. The generation loss is the cross-entropy loss:

L_gen = −Σ_j log p(x_j | x_{<j}; θ)

where p(·) is computed as in Formula 2 and x_j is a subtoken of the tail entity.
In the training process, the model generates answers under the global and local knowledge constraints; we define the total loss as:

L = L_gen + α L_gk + β L_lk

where α and β are hyperparameters. The mask mechanism allows all subtokens of the tail entity to be trained in one pass.
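The combined objective is a weighted sum; the helper below is a sketch using the α and β values reported in the implementation details (0.001 and 0.005):

```python
def total_loss(gen_loss, gk_loss, lk_loss, alpha=0.001, beta=0.005):
    """Combine the generation loss with the two knowledge constraints.

    L = L_gen + alpha * L_gk + beta * L_lk, where L_gk is the global
    (TransE-style) constraint and L_lk the local (MI) constraint.
    """
    return gen_loss + alpha * gk_loss + beta * lk_loss
```

The small weights keep the auxiliary constraints from dominating the cross-entropy term while still shaping the special-token representations.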

Inference
During the inference phase, we utilize an autoregressive approach to generate the tokens of the tail entity for a given query, predicting each token based on the previous ones. The query (h, r, ?) is transferred into a sequence X_q, and the answer entity is generated by the trained model as follows:

x_i = argmax_x p(x | X_q, x_{<i}; θ)

where x_i ∈ X_t. Additionally, we assume a closed-world setting and use constrained beam search to restrict the final output to a predefined set of possibilities, ensuring the validity of the generated answer.
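Constrained decoding can be sketched as greedy search over a closed candidate set; `next_token_probs` is a hypothetical stand-in for the PLM's per-step distribution, and a real constrained beam search would keep several hypotheses instead of one:

```python
def constrained_greedy_decode(next_token_probs, candidates, end_token="[E]"):
    """Greedy decoding restricted to a closed set of candidate entities.

    next_token_probs(prefix) -> {token: prob} stands in for the PLM.
    At each step, only tokens that keep the prefix consistent with at
    least one candidate token sequence are allowed.
    """
    prefix = []
    while True:
        # collect tokens that extend the prefix toward some candidate
        allowed = set()
        for cand in candidates:
            if cand[:len(prefix)] == prefix and len(cand) > len(prefix):
                allowed.add(cand[len(prefix)])
        if not allowed:
            break
        probs = next_token_probs(prefix)
        token = max(allowed, key=lambda tok: probs.get(tok, 0.0))
        prefix.append(token)
        if token == end_token:
            break
    return prefix
```

Filtering the vocabulary at every step guarantees the output is a known entity, which matters under the closed-world assumption.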

Experiments
In this section, we evaluate the effectiveness of our approach on mKGC and entity alignment tasks for mKGs. To further understand the role of the various knowledge constraints in our method, we also conduct ablation experiments. Additionally, we provide case studies to demonstrate the superior performance of our method on specific examples. These experiments and analyses provide insight into the strengths and limitations of our approach for addressing the challenges of mKGC on sparse knowledge graphs.

Datasets and Evaluation Metrics
To evaluate our method, we utilize the link prediction dataset provided by Prix-LM (Zhou et al., 2022) and split it under the closed-world setting. The dataset consists of data from DBpedia, a large multilingual knowledge graph; its statistics are shown in Table 2. We ensure that entities and relations appearing in the validation and test sets are included in the training set. We introduce the ratio of triples to entities as a measure of the knowledge density of the dataset. This ratio has a lower bound of 0.5, which is reached when there are no cross-links between triples. The ratio of our dataset is much lower than those of publicly available datasets. The evaluation metrics we use are the standard Hits@1, Hits@3, and Hits@10, commonly used in evaluating KGC methods.
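The density measure and its lower bound can be checked with a small sketch (the function name is ours):

```python
def knowledge_density(triples):
    """T/E ratio: number of triples divided by number of distinct entities.

    Each triple mentions at most two distinct entities, so a graph
    whose triples share no entities attains the lower bound 0.5;
    shared entities push the ratio up.
    """
    entities = {e for h, _, t in triples for e in (h, t)}
    return len(triples) / len(entities)
```

A ratio close to 0.5 signals a sparse graph with few cross-links, which is the regime where embedding-based methods struggle.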

Implementation Details
In our experiments, we used XLM-R (Base) as the base pretrained language model and did not introduce any parameters beyond those provided by XLM-R. The model was implemented with the Huggingface Transformers library (Wolf et al., 2020), and the hyperparameters α and β were set to 0.001 and 0.005, respectively. The learning rate and batch size were selected from {4e-5, 5e-5} and {128, 256}, and the maximum length of a triple sequence was 35. The model was trained on a single Nvidia RTX 3090 GPU.

Multilingual Knowledge Graph Completion
Our method for mKGC was compared to various embedding-based methods and to Prix-LM on seven language-specific KGs, as shown in Table 1. The results show that our method outperformed Prix-LM on Hits@1, Hits@3, and Hits@10, with average improvements of 12.32%, 11.39%, and 16.03%, respectively. These improvements suggest that integrating both global and local knowledge significantly enhances the effectiveness of the mKGC task, leading to a higher ability to accurately predict missing triples in KGs.

Table 3: This table presents the results of entity alignment tasks for low-resource languages. The parallel entity pairs were obtained from Wikidata (Vrandečić and Krötzsch, 2014). We transformed entity alignment into a KGC task by augmenting the knowledge graph with additional edges representing the linguistic relations between the entity pairs.

In contrast, the use of PLMs, as employed in our method, can effectively address the issue of data sparsity and still achieve a notable impact on performance.
Overall, these results demonstrate the effectiveness of our approach in comparison to the use of PLMs alone for mKGC.

Cross-lingual Entity Alignment
To assess the generalizability of the proposed method, we conducted a comparison on the entity alignment task. As shown in Table 3, we compared the proposed method with Prix-LM. This comparison allowed us to assess the performance of the proposed method on a different task and to determine its potential for a wider range of applications. The results indicate that our proposed method outperforms Prix-LM on most of the evaluation indicators. This suggests that our method generalizes well to different tasks and achieves improved performance on entity alignment. Counterintuitively, the results show that languages with fewer resources tend to yield better performance. This may be because the relationships between low-resource entity pairs are relatively simple and easier for the model to learn.

Ablation Experiment
Our proposed method introduces a novel approach for extracting both global and local knowledge through a scoring function and the maximization of mutual information. As shown in Table 4, we conducted an extensive comparison with various alternative scoring functions and mutual information estimators, showcasing the superior performance of the proposed method. We also verified the effect of each module: our findings indicate that using global features is better than using local features, and that the difference between the local and global results on the Hits@10 metric is minimal. This supports our expectation that local features improve the ranking of entity types. Using the mask matrix reduces some of the noise and allows faster convergence, which is key to improving performance.

Answer Length Comparison
As shown in Figure 5, we compared the performance of the proposed method on answers of different lengths to assess its robustness. The results demonstrate that the proposed method performs strongly across a range of answer lengths, indicating its ability to handle diverse inputs effectively. Our method outperforms the baseline in terms of Hits values for answers of various lengths, with particularly strong performance on short answers.

Case Study
As shown in Figure 4, we compare the performance of our method with Prix-LM on a set of real examples.The predicted answers generated by both methods are presented and analyzed in order to evaluate the effectiveness of each approach for mKGC.The results of these case studies provide additional evidence for the effectiveness of our approach in comparison to the baseline model.
Our analysis of the top three cases reveals that our method produces more predictions of the same type as the correct answer compared to the baseline model. This finding suggests that our approach effectively addresses the task bias and demonstrates the adaptability of the PLM to the KGC task. Although the predicted answer types in the bottom three examples are all the same, our method is still able to accurately identify the correct answer. This demonstrates the robustness and effectiveness of our approach in generating accurate results even when the predicted answer types are similar.
Related Work

Embedding-based Methods for KGC
There has been a substantial amount of research on embedding-based methods for finding potential knowledge within a knowledge graph (Wang et al., 2017; Dai et al., 2020). These methods typically represent entities and relations within the graph as low-dimensional vector embeddings. For example, TransE (Bordes et al., 2013) makes entity and relation vectors follow the translational principle h + r ≈ t. The choice of scoring function and the specific vector space used can have a significant impact on performance, as in RotatE (Sun et al., 2019), TransH (Wang et al., 2014), HolE (Nickel et al., 2016), and ComplEx (Trouillon et al., 2016). However, embedding-based methods may not fully consider the background knowledge implicit in the text associated with entities and relations.

Pretrained Language Models for KGC
Recently, some research has leveraged pretrained language models for the KGC task. These methods represent entities and relations with PLMs and assign high scores to positive triples (Lv et al., 2022; Kim et al., 2020). This manner enables the introduction of knowledge that has already been learned by PLMs.

Figure 5: The figure presents the results of the Hits@k evaluation metric for the mKGC task, focusing on answers of varying lengths. To facilitate a more straightforward analysis, the results are limited to answer-length sets with more than 100 occurrences.
To fully utilize PLMs, some research focuses on the generative paradigm for knowledge graph construction (Ye et al., 2022). GenKGC (Xie et al., 2022) transforms knowledge graph completion into a sequence-to-sequence generation task based on a pretrained language model and proposes relation-guided demonstration and entity-aware hierarchical decoding. COMET (Bosselut et al., 2019) proposes the Commonsense Transformer to generate commonsense knowledge automatically. KGT5 (Saxena et al., 2022) treats KG link prediction as a sequence-to-sequence task based on a single encoder-decoder Transformer, reducing model size compared with embedding-based methods. While previous efforts to utilize PLMs for KGC have demonstrated effectiveness, they have not fully considered the inherently knowledge-based nature of KGC tasks. This oversight may hinder the full potential of such models in addressing the unique challenges and requirements of KGC.

Conclusion
Our work improves multilingual knowledge graph completion performance based on PLMs and generative paradigms. We propose two knowledgeable tasks to integrate global and local knowledge into answer generation for a given query. The global knowledge improves the type consistency of the generated answers, while the local knowledge enhances the accuracy of answer generation. We conducted experiments, and the results show that the proposed method outperforms the previous model.

Limitations
While our approach effectively predicts the relationships between entities in a knowledge graph, there are limitations in the scope of knowledge graph resources that can be modeled.The knowledge graph contains a vast array of resources, including attributes, descriptions, and images, which are not easily captured by embedding-based methods, but can be effectively modeled using PLMs.To improve the compatibility of KGC with actual needs, it is necessary to consider a broader range of data types in the knowledge graph and develop complementary methods to effectively incorporate them.

Ethics Statement
This paper proposes a method for multilingual knowledge graph completion, and the experiments are conducted on publicly available datasets. As a result, there is no data privacy concern. Meanwhile, this paper does not involve human annotation, and there are no related ethical concerns.

Figure 3: The operation mechanism of the mask matrix during training. The darker squares indicate that attention is allowed, while the lighter squares indicate that attention is suppressed.

Figure 4: This figure presents a comparison of the performance of our method and the baseline model on a set of case studies. Blue font indicates that the predicted answer aligns with the golden answer type. Bold font in the predicted answers signifies the correct answer.


Table 1: Results of the seven language-specific knowledge graph completion (KGC) tasks. The embedding-based methods, including TransE, ComplEx, and RotatE, were implemented using the OpenKE framework (Han et al., 2018); their results were obtained by training a separate knowledge graph for each language. Single denotes the monolingual version, trained independently for each language. Bold numbers represent the best results among the methods and languages considered.

Table 2: The table shows the statistics of the multilingual knowledge graph dataset. The T/E ratio is equal to the number of triples divided by the number of entities.
It is worth noting that the low knowledge density of the training set can hinder the performance of traditional embedding-based methods, which rely on sufficient training data to learn meaningful relationships between entities.

Table 4: This table presents the average results across seven languages of the ablation experiment, which investigates the impact of different methods for acquiring global and local knowledge and of the different modules. The results provide insights into the relative importance of these methods for improving mKGC performance.