MixTEA: Semi-supervised Entity Alignment with Mixture Teaching

Semi-supervised entity alignment (EA) is a practical and challenging task because of the lack of adequate labeled mappings as training data. Most works address this problem by generating pseudo mappings for unlabeled entities. However, they either suffer from erroneous (noisy) pseudo mappings or largely ignore the uncertainty of pseudo mappings. In this paper, we propose a novel semi-supervised EA method, termed MixTEA, which guides model learning with an end-to-end mixture teaching of manually labeled mappings and probabilistic pseudo mappings. We first train a student model using the few labeled mappings as standard supervision. More importantly, in pseudo mapping learning, we propose a bi-directional voting (BDV) strategy that fuses the alignment decisions in different directions to estimate uncertainty via the joint matching confidence score. Meanwhile, we also design a matching diversity-based rectification (MDR) module to adjust the pseudo mapping learning, thus reducing the negative influence of noisy mappings. Extensive results on benchmark datasets as well as further analyses demonstrate the superiority and effectiveness of our proposed method.


Introduction
Entity alignment (EA) is a task at the heart of integrating heterogeneous knowledge graphs (KGs) and facilitating knowledge-driven applications, such as question answering, recommender systems, and semantic search (Gao et al., 2018). Embedding-based EA methods (Chen et al., 2017; Wang et al., 2018; Sun et al., 2020a; Yu et al., 2021; Xin et al., 2022a) dominate current EA research and achieve promising alignment performance. Their general pipeline is to first encode the entities from different KGs as embeddings (latent representations) in a unified space, and then find the most likely counterpart for each entity by performing pairwise comparisons. However, the pre-aligned mappings (i.e., training data) are oftentimes insufficient, which makes it challenging for supervised embedding-based EA methods to learn informative entity embeddings. This happens because it is time-consuming and labour-intensive for technicians to manually annotate entity mappings in large-scale KGs.
To remedy the lack of enough training data, some existing efforts explore alignment signals from the cheap and valuable unlabeled data in a semi-supervised manner. The most common semi-supervised EA solution is the self-training strategy, i.e., iteratively generating pseudo mappings and combining them with labeled mappings to augment the training data. For example, Zhu et al. (2017) propose IPTransE, which involves an iterative process of predicting on unlabeled data and then treats predictions above a carefully chosen threshold (confident predictions) as pseudo mappings for retraining. To further improve the accuracy of pseudo mappings, Sun et al. (2018) design a heuristic editing method to remove wrong alignments by considering the one-to-one alignment constraint, while Mao et al. (2020) and Cai et al. (2022) utilize a bi-directional iterative strategy that accepts a pseudo mapping if and only if the two entities are mutually nearest neighbors of each other. Despite the encouraging results, existing semi-supervised EA methods still face the following problems: (1) Uncertainty of pseudo mappings. Prior works have largely overlooked the uncertainty of pseudo mappings during semi-supervised training. Revisiting the self-training process, the generation of pseudo mappings is either black or white, i.e., an entity pair is either determined as a pseudo mapping or not. In fact, different pseudo mappings have different uncertainties and contribute differently to model learning (Zheng and Yang, 2021). (2) Noisy pseudo mapping learning. The performance of semi-supervised EA methods depends heavily on the quality of pseudo mappings, yet these pseudo mappings inevitably contain much noise (i.e., false positive mappings). Even worse, adding them to the training data would misguide the subsequent training process, thus causing error accumulation and further hurting the alignment performance.
To tackle the aforementioned limitations, in this paper, we propose a simple yet effective semi-supervised EA solution, termed MixTEA. To be specific, our method is based on a Teacher-Student architecture (Tarvainen and Valpola, 2017), which generates pseudo mappings from a gradually evolving teacher model and guides the learning of a student model with a mixture teaching of labeled mappings and pseudo mappings. We explore the uncertainty of pseudo mappings via probabilistic pseudo mapping learning rather than directly adding "reliable" pseudo mappings into the training data, which allows the model to flexibly learn from pseudo mappings with different uncertainties. To achieve that, we propose a bi-directional voting (BDV) strategy that utilizes the consistency and confidence of alignment decisions in different directions to estimate the uncertainty via the joint matching confidence score (converted to a matching probability after a softmax). Meanwhile, a matching diversity-based rectification (MDR) module is designed to adjust the pseudo mapping learning, thus reducing the influence of noisy mappings. Our contributions are summarized as follows: (I) We propose a novel semi-supervised EA framework, termed MixTEA, which guides the model's alignment learning with an end-to-end mixture teaching of manually labeled mappings and probabilistic pseudo mappings.
(II) We introduce a bi-directional voting (BDV) strategy which utilizes the alignment decisions in different directions to estimate the uncertainty of pseudo mappings, and design a matching diversity-based rectification (MDR) module to adjust the pseudo mapping learning, thus reducing the negative impact of noisy mappings.
(III) We conduct extensive experiments and thorough analyses on the benchmark datasets of OpenEA (Sun et al., 2020b). The results demonstrate the superiority and effectiveness of our proposed method.

Embedding-based Entity Alignment
With the rapid development of deep learning techniques in recent years, embedding-based EA approaches have obtained promising results. Among them, some early studies (Chen et al., 2017; Sun et al., 2017) are based on knowledge embedding methods, in which entities are embedded by exploring fine-grained relational semantics. For example, MTransE (Chen et al., 2017) applies TransE (Bordes et al., 2013) as the KG encoder to embed different KGs into independent vector spaces and then conducts transitions via designed alignment modules. However, such methods need to carefully balance the weight between the encoder and the alignment module in one unified optimization problem. Due to their powerful structure learning capability, Graph Neural Networks (GNNs) like GCN (Kipf and Welling, 2017) and GAT (Veličković et al., 2018) have been employed as encoders with a Siamese architecture (i.e., shared parameters) in many embedding-based models. GCN-Align (Wang et al., 2018) applies the Graph Convolutional Network (GCN) for the first time to capture neighborhood information and embed entities into a unified vector space, but it suffers from the structural heterogeneity of different KGs. To mitigate this issue and improve structure learning, AliNet (Sun et al., 2020a) adopts multi-hop aggregation with a gating mechanism to expand neighborhood ranges for better structure modeling, and KE-GCN (Yu et al., 2021) combines GCN and knowledge embedding methods to jointly capture the rich structural features and relation semantics of entities. More recently, IMEA (Xin et al., 2022a) designs a Transformer-like architecture to encode multiple structural contexts in a KG while capturing alignment interactions across different KGs.
In addition, some works further improve EA performance by introducing side information about entities, such as entity names (Zhang et al., 2019), attributes (Liu et al., 2020), and literal descriptions (Yang et al., 2019). Afterward, a series of methods were proposed to integrate knowledge from different modalities (e.g., relational, visual, and numerical) to obtain joint entity representations for EA (Chen et al., 2020; Liu et al., 2021; Lin et al., 2022). However, these discriminative features are usually hard to collect, noise-polluted, and privacy-sensitive (Pei et al., 2022).

Semi-supervised Entity Alignment
Since the manually labeled mappings used for training are usually insufficient, many semi-supervised EA methods have been proposed to take advantage of both labeled mappings and the large amount of unlabeled data for alignment, which provides a more practical solution in real scenarios. The mainstream solutions focus on iteratively generating pseudo mappings to compensate for the lack of training data. IPTransE (Zhu et al., 2017) applies threshold filtering-based self-training to yield pseudo mappings, but it fails to obtain satisfactory performance since it brings in much noisy data, which misguides the subsequent training. Besides, it is also hard to determine an appropriate threshold for selecting "confident" pseudo mappings. KDCoE (Chen et al., 2018) performs co-training of a KG embedding model and a literal description embedding model to gradually propose new pseudo mappings, so that each model enhances the supervision of alignment learning for the other. To further improve the quality of pseudo mappings, BootEA (Sun et al., 2018) designs an editable strategy based on the one-to-one matching rule to resolve matching conflicts, and MRAEA (Mao et al., 2020) proposes a bi-directional iterative strategy which imposes a mutually-nearest-neighbor constraint. Inspired by the success of self-training, RANM (Cai et al., 2022) proposes a relation-based adaptive neighborhood matching method for entity alignment and combines it with a bi-directional iterative co-training strategy, making it a natural semi-supervised model. Moreover, CycTEA (Xin et al., 2022b) devises an effective ensemble framework that enables multiple alignment models (called aligners) to exchange their reliable entity mappings for more robust semi-supervised training, but it requires high complementarity among the different aligners.

Problem Statement
A knowledge graph (KG) is formalized as G = (E, R, T), where E and R refer to the set of entities and the set of relations, respectively, and T ⊆ E × R × E is the set of triples (h, r, t), where h, r, and t denote the head entity (subject), relation, and tail entity (object), respectively.

Proposed Method
In this section, we present our proposed semi-supervised EA method, called MixTEA, illustrated in Figure 1. MixTEA follows the teacher-student training scheme. The teacher model generates probabilistic pseudo mappings for unlabeled entities, and the student model is trained with an end-to-end mixture teaching of manually labeled mappings and probabilistic pseudo mappings. Compared to previous methods that require filtering pseudo mappings via thresholds or constraints, the end-to-end training gradually improves the quality of pseudo mappings, and the increasingly accurate pseudo mappings in turn benefit EA training.

KG Encoder
We first introduce the KG encoder (denoted as f(·; θ)), which utilizes neighborhood structures and relation semantics to embed entities from different KGs into a unified vector space. We randomly initialize the trainable entity embeddings H^ent ∈ R^((|E_s|+|E_t|)×d_e) and relation embeddings H^rel ∈ R^(|R_s∪R_t|×d_r), where d_e and d_r are the embedding dimensions of entities and relations, respectively.

Structure modeling. Structural features are crucial since equivalent entities tend to have similar neighborhood contexts. Besides, leveraging multi-range neighborhood structures provides more alignment evidence and mitigates the structural heterogeneity issue. In this work, we apply the Graph Attention Network (GAT) (Veličković et al., 2018) to allow an entity to selectively aggregate its surrounding information via an attentive mechanism, and we recursively capture multi-range structural features by stacking L layers:

α_ij = exp(LeakyReLU(a^⊤ [W_g h_i^(l−1) ⊕ W_g h_j^(l−1)])) / Σ_{e_k ∈ N_{e_i}} exp(LeakyReLU(a^⊤ [W_g h_i^(l−1) ⊕ W_g h_k^(l−1)])) (2)

h_i^(l) = σ(Σ_{e_j ∈ N_{e_i}} α_ij W_g h_j^(l−1)) (3)

where ^⊤ represents transposition, ⊕ means concatenation, W_g and a are the layer-specific transformation parameter and the attention transformation vector, respectively. N_{e_i} means the neighbor set of e_i (including e_i itself by adding a self-connection), and α_ij indicates the learned importance of entity e_j to entity e_i. H^(l) denotes the entity embedding matrix at the l-th layer with H^(0) = H^ent. σ(·) is the nonlinear activation function and we use ELU here.
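To make the attentive aggregation concrete, the following is a minimal numpy sketch of one such attention layer (Eqs. (2)–(3)); the function name, shapes, and the dense adjacency representation are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def gat_layer(H, adj, W, a, negative_slope=0.2):
    """One attention layer: each entity aggregates its neighbors
    (self-loops are assumed to already be present in `adj`)."""
    Z = H @ W                                    # linear transform: (n, d_out)
    d = Z.shape[1]
    src = Z @ a[:d]                              # contribution of z_i to logit
    dst = Z @ a[d:]                              # contribution of z_j to logit
    e = src[:, None] + dst[None, :]              # attention logits e_ij, (n, n)
    e = np.where(e > 0, e, negative_slope * e)   # LeakyReLU
    e = np.where(adj > 0, e, -1e9)               # mask out non-neighbors
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)    # row-wise softmax -> α_ij
    out = alpha @ Z                              # weighted aggregation
    return np.where(out > 0, out, np.expm1(out)) # ELU activation σ

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))                      # 4 entities, d_in = 8
adj = np.eye(4); adj[0, 1] = adj[1, 0] = 1       # tiny graph with self-loops
W = rng.normal(size=(8, 6))
a = rng.normal(size=(12,))                       # a^T [z_i ⊕ z_j], so len 2*d_out
H1 = gat_layer(H, adj, W, a)   # stack L such layers for multi-range features
```

Stacking L calls of this function (feeding `H1` back in with fresh parameters) yields the H^(1), ..., H^(L) used in the weighted concatenation below.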
Relation modeling. Relation-level information, which carries rich semantics, is vital for aligning entities in KGs because two equivalent entities may share overlapping relations. Considering that relation directions, i.e., outward (e_i → e_j) and inward (e_i ← e_j), have delicate impacts on characterizing the given target entity e_i, we use two mean aggregators to gather outward and inward relation semantics separately to provide supplementary features for heterogeneous KGs:

h_i^(r+) = (1/|N^(r+)_{e_i}|) Σ_{r ∈ N^(r+)_{e_i}} h_r^rel,   h_i^(r−) = (1/|N^(r−)_{e_i}|) Σ_{r ∈ N^(r−)_{e_i}} h_r^rel

where N^(r+)_{e_i} and N^(r−)_{e_i} are the sets of outward and inward relations of entity e_i, respectively, and h_r^rel denotes the embedding of relation r.
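A sketch of the two mean aggregators, assuming a simple per-entity list-of-relation-ids input format (hypothetical; the actual data structures are not specified in the text):

```python
import numpy as np

def relation_features(H_rel, out_rels, in_rels):
    """Mean-pool the embeddings of each entity's outward / inward relations;
    entities with no relation in a direction get a zero vector."""
    def pool(rel_lists):
        return np.stack([H_rel[ids].mean(axis=0) if ids
                         else np.zeros(H_rel.shape[1])
                         for ids in rel_lists])
    return pool(out_rels), pool(in_rels)

H_rel = np.arange(8.0).reshape(4, 2)   # 4 relations, d_r = 2
h_out, h_in = relation_features(H_rel,
                                out_rels=[[0, 1], [2]],   # per-entity relation ids
                                in_rels=[[3], []])
# h_out[0] = mean of relations 0 and 1 = [1, 2]; h_in[1] = zero vector
```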
Weighted concatenation. After capturing the contextual information of entities in terms of neighborhood structures and relation semantics, we concatenate the intermediate features of entity e_i to obtain its final representation:

h_i = ⊕_{k ∈ K} [softmax(w)]_k · h_i^(k)

where K = {(1), ..., (L), r+, r−} and w ∈ R^|K| is a trainable attention vector that adaptively controls the flow of each feature. We feed w to a softmax before multiplication to ensure that the normalized weights sum to 1.
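The weighted concatenation can be sketched as follows (a minimal numpy version; the two-view example is illustrative):

```python
import numpy as np

def weighted_concat(features, w):
    """Concatenate per-view features for a batch of entities, each view
    scaled by a softmax-normalized weight so the weights sum to 1."""
    w = np.exp(w - w.max())
    w = w / w.sum()                    # softmax over the attention vector
    return np.concatenate([wk * f for wk, f in zip(w, features)], axis=-1)

views = [np.ones((2, 3)), np.ones((2, 3))]        # e.g. structural + relational
h = weighted_concat(views, np.array([0.0, 0.0]))  # equal weights of 0.5 each
```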

Alignment Learning with Mixture Teaching
In the following, we introduce mixture teaching, which is achieved by combining supervised alignment learning and probabilistic pseudo mapping learning in an end-to-end training manner.
Teacher-student architecture. Following Mean Teacher (Tarvainen and Valpola, 2017), we build our method upon two KG encoders with identical structure, called the student model f(·; θ_stu) and the teacher model f(·; θ_tea), respectively. The student model constantly updates its parameters supervised by the manually labeled mappings as standard, while the teacher model is updated via the exponential moving average (EMA) (Tarvainen and Valpola, 2017) of the student model's weights. Moreover, the student model also learns from the pseudo mappings generated by the teacher model to further improve its performance, where the uncertainty of pseudo mappings is formalized as calculated matching probabilities. Specifically, we update the teacher model as follows:

θ_tea ← m · θ_tea + (1 − m) · θ_stu

where θ denotes model weights, and m is a preset momentum hyperparameter that controls how smoothly the teacher model updates and evolves.
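The EMA update is a one-line, parameter-wise interpolation; a minimal sketch (using a plain dict of parameters for illustration):

```python
def ema_update(theta_tea, theta_stu, m=0.9):
    """θ_tea ← m·θ_tea + (1−m)·θ_stu, applied parameter-wise."""
    return {k: m * theta_tea[k] + (1 - m) * theta_stu[k] for k in theta_tea}

teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student, m=0.9)   # teacher["w"] -> 0.9
```

With m close to 1, the teacher changes slowly, which is what makes its predictions an ensemble-like, stable source of pseudo mappings.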
Supervised alignment learning. In order to pull equivalent entities close to each other and push unmatched entities away from each other in the unified space, we apply a margin-based alignment loss (Wang et al., 2018; Mao et al., 2020; Yu et al., 2021) supervised by the pre-aligned mappings:

L_a = Σ_{(e_i, e_j) ∈ S} Σ_{(e_i', e_j') ∈ S^−} [ ||h_i − h_j||_2 + ρ − ||h_i' − h_j'||_2 ]_+

where ρ is a margin hyperparameter, [x]_+ = max{0, x} ensures a non-negative output, S^− denotes the set of negative entity mappings, and ||·||_2 means the L2 distance (Euclidean distance). Negative mappings are sampled according to the cosine similarity of two entities (Sun et al., 2018).
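A minimal sketch of the margin-based loss (pairing each positive mapping with one sampled negative for simplicity; the real sampling scheme follows Sun et al. (2018)):

```python
import numpy as np

def margin_alignment_loss(H, pos_pairs, neg_pairs, rho=2.0):
    """Sum of [ d(h_i, h_j) + rho - d(h_i', h_j') ]_+ over positive
    mappings and their sampled negative counterparts."""
    loss = 0.0
    for (i, j), (i2, j2) in zip(pos_pairs, neg_pairs):
        d_pos = np.linalg.norm(H[i] - H[j])     # L2 distance of aligned pair
        d_neg = np.linalg.norm(H[i2] - H[j2])   # L2 distance of negative pair
        loss += max(0.0, d_pos + rho - d_neg)
    return loss

H = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]])
loss = margin_alignment_loss(H, pos_pairs=[(0, 1)], neg_pairs=[(0, 2)], rho=2.0)
# d_pos = 0, d_neg = 1  ->  loss = max(0, 0 + 2 - 1) = 1.0
```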
Probabilistic pseudo mapping learning. As mentioned above, the teacher model is responsible for generating probabilistic pseudo mappings that provide the student model with more alignment signals and thus enhance the alignment performance. Benefiting from the EMA update, the predictions of the teacher model can be seen as an ensemble of the successive student models' predictions, and are therefore more robust and stable for pseudo mapping generation. Moreover, the bi-directional iterative strategy (Mao et al., 2020) reveals the asymmetric nature of alignment directions (i.e., source-to-target and target-to-source), producing pseudo mappings based on the mutually-nearest-neighbor constraint. Inspired by this, we propose a bi-directional voting (BDV) strategy that fuses the alignment decisions in each direction to yield more comprehensive pseudo mappings and models their uncertainty via the joint matching confidence score. Concretely, after encoding, we first obtain the similarity matrices by performing pairwise similarity calculation between the unlabeled source and target entities:

M^tea_{s→t}[i, j] = sim(ê_i, ê_j; θ_tea)

where sim(·) denotes the cosine similarity function.
M^tea_{s→t} and M^tea_{t→s} represent the similarity matrices in the two directions between source and target entities, and M^tea_{t→s} is the transposition of M^tea_{s→t} (i.e., M^tea_{t→s} = (M^tea_{s→t})^⊤). Next, for each matrix, we pick the entity pair with the maximum
predicted similarity in each row as the pseudo mapping, and then combine the pseudo mapping decisions in the two directions, weighted by their last Hit@1 scores on the validation data, to obtain the final pseudo mapping matrix:

P^tea = w_{s→t} · g(M^tea_{s→t}) + w_{t→s} · g(M^tea_{t→s})^⊤

where g(·) is the function that converts a similarity matrix to a one-hot matrix (i.e., the only position with value 1 in each row of the matrix indicates the pseudo mapping), and w_{s→t} and w_{t→s} are the two directions' validation Hit@1 scores normalized to sum to 1. In this manner, we arrive at the final pseudo mapping matrix P^tea generated by the teacher model, in which each pseudo mapping is associated with a joint matching confidence score (the higher the joint matching confidence, the lower the uncertainty). Different from the bi-directional iterative strategy, we use the voting consistency and matching confidence of alignment decisions in different directions to facilitate uncertainty estimation. Specifically, given an entity pair (ê_i, ê_j), its confidence P^tea_{i,j} is 1 if and only if both directions unanimously vote this entity pair as a pseudo mapping; it lies in the interval (0, 1) when only one direction votes for it; and it is 0 when neither direction votes for it (i.e., this entity pair will not be regarded as a pseudo mapping).
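The BDV fusion described above can be sketched as follows (a minimal numpy version; equal Hit@1 weights are assumed for the toy example):

```python
import numpy as np

def bdv(M_s2t, hit1_s2t=0.5, hit1_t2s=0.5):
    """Bi-directional voting: take the argmax decision per row in each
    direction, then fuse the two one-hot matrices with weights proportional
    to each direction's validation Hit@1 (normalized to sum to 1)."""
    def one_hot_rows(M):
        P = np.zeros_like(M)
        P[np.arange(M.shape[0]), M.argmax(axis=1)] = 1.0
        return P
    w1 = hit1_s2t / (hit1_s2t + hit1_t2s)
    # transpose the t->s votes back into source-to-target coordinates
    return w1 * one_hot_rows(M_s2t) + (1 - w1) * one_hot_rows(M_s2t.T).T

M = np.array([[0.9, 0.1],
              [0.2, 0.8]])
P = bdv(M)   # both directions agree on the diagonal -> confidences of 1
```

When the two directions disagree on a pair, only one weighted vote lands on it, so its confidence falls strictly between 0 and 1, exactly as described in the text.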
In addition, ideal EA predictions need to satisfy the one-to-one matching constraint (Suchanek et al., 2011; Sun et al., 2018), i.e., a source entity can be matched with at most one target entity, and vice versa. However, the joint decision voting process inevitably yields matching conflicts due to the existence of erroneous (noisy) mappings. Inspired by Gal et al. (2016), we further propose a matching diversity-based rectification (MDR) module to adjust the pseudo mapping learning, thus mitigating the influence of noisy mappings dynamically. We denote M^stu (i.e., M^stu_{i,j} = sim(ê_i, ê_j; θ_stu)) as the similarity matrix calculated based on the student model and define a Cross-Entropy (CE) loss between M^stu and P^tea rectified by matching diversity:

L_u = CE(softmax(M^stu), softmax(P̃^tea))

where P̃^tea denotes the rectified pseudo mapping matrix. To be specific, the designed rectification term (Eq. (13)) measures how much a potential pseudo mapping deviates (in terms of joint matching confidence score) from the other competing pseudo mappings in P^tea_{i,:} and P^tea_{:,j}. The larger the deviation, the greater the penalty for this pseudo mapping, and vice versa. Notably, both M^stu and P̃^tea are fed into a softmax to be converted to probability distributions before the CE computation, implementing probabilistic pseudo mapping learning. Besides, an illustrative example of generating the probabilistic pseudo mapping matrix is provided in Figure 2.
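The probabilistic (row-wise) cross-entropy between the softmaxed student similarities and the softmaxed teacher targets can be sketched as follows (the rectification of P^tea into P̃^tea is assumed to have already been applied; this sketch shows only the CE step):

```python
import numpy as np

def softmax(X):
    Z = np.exp(X - X.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def pseudo_ce(M_stu, P_tea_rect):
    """Row-wise cross-entropy between the student's softmaxed similarities
    and the teacher's (rectified) pseudo-mapping distribution."""
    Q = softmax(P_tea_rect)                       # teacher targets
    logP = np.log(softmax(M_stu))                 # student log-probabilities
    return float(-(Q * logP).sum(axis=1).mean())

agree = pseudo_ce(10 * np.eye(2), 10 * np.eye(2))       # student matches teacher
clash = pseudo_ce(10 * (1 - np.eye(2)), 10 * np.eye(2)) # student contradicts teacher
```

The loss is near zero when the student's peaked rows coincide with the teacher's votes and grows large when they contradict, which is what pushes the student toward the teacher's confident pseudo mappings.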
Optimization. Finally, we minimize the following combined loss function (final objective) to optimize the student model in an end-to-end training manner:

L = L_a + λ · L_u

where λ is a ramp-up weighting coefficient that balances the supervised alignment learning (i.e., L_a) and the pseudo mapping learning (i.e., L_u). In the beginning, the optimization is dominated by L_a; during the ramp-up period, L_u gradually participates in the training to provide more alignment signals. The overall optimization process is outlined in Algorithm 1 (Appendix A), where the student model and the teacher model are updated alternately, and the final student model is used for EA inference (Eq. (1)) on the test data.
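The exact ramp-up schedule for λ is not given in this section; a common choice in Mean Teacher-style training is a sigmoid-shaped ramp-up, sketched here under that assumption (the ramp-up length and maximum weight are hypothetical):

```python
import math

def rampup_lambda(epoch, rampup_length=100, lam_max=1.0):
    """Sigmoid-shaped ramp-up: λ starts near 0 so L_a dominates early
    training, then rises smoothly to lam_max."""
    if epoch >= rampup_length:
        return lam_max
    t = 1.0 - epoch / rampup_length
    return lam_max * math.exp(-5.0 * t * t)
```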
Experimental Setup

Data and Evaluation Metrics
We evaluate our method on the 15K benchmark dataset (V1) in OpenEA (Sun et al., 2020b), since the entities therein follow the degree distribution of real-world KGs. Brief statistics of the experimental data are shown in Table 3 (Appendix B). The benchmark contains two cross-lingual settings, i.e., EN-FR-15K (English-to-French) and EN-DE-15K (English-to-German), and two monolingual settings, i.e., D-W-15K (DBpedia-to-Wikidata) and D-Y-15K (DBpedia-to-YAGO). Following the data splits in OpenEA, we use 20%, 10%, and 70% of the pre-aligned mappings for training, validation, and testing, respectively. Entity alignment is a typical ranking problem, where we obtain a target entity ranking list for each source entity by sorting the similarity scores in descending order. We use Hits@k (k = 1, 5) and Mean Reciprocal Rank (MRR) as the evaluation metrics (Sun et al., 2020b; Xin et al., 2022a). Hits@k measures the alignment accuracy, while MRR measures the average ranking performance over all test samples. The higher the Hits@k and MRR, the better the alignment performance.
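The two metrics can be computed as follows (a minimal numpy sketch over a toy similarity matrix):

```python
import numpy as np

def hits_and_mrr(sim, gold, ks=(1, 5)):
    """`sim[i]` holds the similarity scores of source entity i to all
    target entities; `gold[i]` is the index of its true counterpart."""
    order = np.argsort(-sim, axis=1)             # rank targets by descending score
    ranks = np.array([int(np.where(order[i] == gold[i])[0][0]) + 1
                      for i in range(len(gold))])
    hits = {k: float((ranks <= k).mean()) for k in ks}
    mrr = float((1.0 / ranks).mean())
    return hits, mrr

sim = np.array([[0.1, 0.9],      # source 0: true target 0 is ranked 2nd
                [0.2, 0.8]])     # source 1: true target 1 is ranked 1st
hits, mrr = hits_and_mrr(sim, gold=[0, 1], ks=(1, 2))
# hits = {1: 0.5, 2: 1.0}, mrr = (1/2 + 1/1) / 2 = 0.75
```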
As our method and the above baselines each contain a single model and rely mainly on structural information, for a fair comparison we do not compare with ensemble-based frameworks (e.g., CycTEA (Xin et al., 2022b)) or models that fuse side information from multiple modalities (e.g., EVA (Liu et al., 2021), RoadEA (Sun et al., 2022)). For the baseline RANM, we remove the name channel to guarantee a fair comparison.
Following OpenEA (Sun et al., 2020b), we report the average results of five-fold cross-validation. The embedding dimensions of entities d_e and relations d_r are set to 256 and 128, respectively, the number of GAT layers L is 2, the margin ρ is 2.0, and the momentum m is 0.9. In the EA inference phase, we use cosine distance as the distance metric and apply Faiss to perform the NN search efficiently. The default alignment direction is from left to right, e.g., in D-W-15K, we regard DBpedia as the source KG and seek to find the counterparts of source entities in the target KG Wikidata. Table 1 shows that our method consistently achieves the best performance on all tasks with a small standard deviation (std.). More precisely, our model surpasses state-of-the-art baselines on average by 3.1%, 3.3%, and 3.5% in terms of Hit@1, Hit@5, and MRR, respectively.

Ablation Study
To verify the effectiveness of our method, we perform an ablation study with the following variant settings:

Auxiliary Experiments
Training visualization. To inspect our method comprehensively, we also plot the test Hit@1 curve over the training epochs in Figure 3. KG Encoder (th=0.9) denotes the KG encoder described in Sec. 4.1 applying self-training with a threshold of 0.9 to generate pseudo mappings every 20 epochs. We keep the experimental settings identical to remove performance perturbations induced by different parameters. From Figure 3, we observe that our method converges quickly and achieves the best and most stable alignment performance. The performance of the KG Encoder gradually decreases in the later stages since it gets stuck overfitting to the limited training data. Although self-training brings some performance gains from data augmentation in the early stages, the performance drops dramatically in the later stages. This is because it introduces many noisy pseudo mappings and causes error accumulation as the self-training continues. In the later stages, self-training has difficulty generating new mappings while the existing erroneous mappings constantly misguide the model training, thus hurting the performance. Hyperparameter analysis. We design hyperparameter experiments to investigate how the performance varies with certain hyperparameters. Due to space limitations, these experimental results and analyses are given in Appendix D.

Conclusion
In this paper, we propose a novel semi-supervised EA framework, termed MixTEA, which guides model learning with an end-to-end mixture teaching of manually labeled mappings and probabilistic pseudo mappings. Meanwhile, we propose a bi-directional voting (BDV) strategy and a matching diversity-based rectification (MDR) module to assist the probabilistic pseudo mapping learning. Experimental results on benchmark datasets show the effectiveness of our proposed method.

Limitations
Although we have demonstrated the effectiveness of MixTEA, there are still some limitations that should be addressed in the future: (1) Currently, we only utilize structural contexts, which are abundant and always available in KGs, to embed entities. However, when side information (e.g., visual contexts, literal contexts) is available, MixTEA needs to be extended into a more comprehensive EA framework while ensuring that the teacher-student architecture does not become over-complex. Therefore, how to incorporate this side information is our future work. (2) Vanilla self-training iteratively generates pseudo mappings and adds them to the training data, so technicians can perform spot checks during model training to monitor the quality of pseudo mappings. MixTEA, by contrast, computes a probabilistic pseudo mapping matrix and performs end-to-end training, making it hard to provide explicit entity mappings for technicians to check for correctness. Therefore, it is imperative to design a strategy that combines self-training and probabilistic pseudo mapping learning to enhance interpretability and operability.

Ethics Statement
This work does not involve any discrimination, social bias, or private data. Therefore, we believe that our study complies with the ACL Ethics Policy.

Figure 1: The overall framework of our proposed MixTEA, which consists of two KG encoders, called the student model and the teacher model. We obtain entity embeddings via the KG encoder. Both labeled mappings and probabilistic pseudo mappings are used to train the student model. The final student model is used for alignment inference.

Figure 2: The illustration of the process of generating the probabilistic pseudo mapping matrix from two alignment directions. We assume that entity pairs on the diagonal are correct mappings and that the default alignment direction for inference is from source to target.
(1) w/o rel. removes the relation modeling. (2) w/o L_u removes the probabilistic pseudo mapping learning. (3) w/o BDV considers only the EA decisions in the default alignment direction to generate pseudo mappings instead of applying the bi-directional voting strategy (i.e., P^tea = g(M^tea_{s→t})). (4) w/o MDR removes the matching diversity-based rectification module from pseudo mapping learning. (5) w/o B&M denotes the complete model without both the BDV and the MDR modules. The ablation results are shown in
Given a source KG G_s = (E_s, R_s, T_s), a target KG G_t = (E_t, R_t, T_t), and a small set of pre-aligned mappings (called training data) S = {(e_s, e_t) | e_s ∈ E_s ∧ e_t ∈ E_t ∧ e_s ≡ e_t}, where ≡ means the equivalence relationship, the entity alignment (EA) task pairs each source entity e_i ∈ Ê_s via nearest neighbor (NN) search to identify its corresponding target entity e_j ∈ Ê_t:

e_j = arg min_{e ∈ Ê_t} ||h_{e_i} − h_e|| (1)

where Ê_s and Ê_t denote the unlabeled entity sets of the source KG and the target KG, respectively.

Table 1: Entity alignment performance of different methods in the cross-lingual and monolingual settings of OpenEA. Results marked with † are retrieved from Sun et al. (2020b) and those with ‡ from Xin et al. (2022a). Results labeled by * are reproduced using the released source code, and those labeled by • are reported in the corresponding references. Boldface indicates the best result in each column and underlining the second-best. std. means standard deviation.
The details of hyperparameter settings are shown in Appendix C. ... cause error accumulation. Although GAEA learns representations of vast unseen entities via contrastive learning, its performance is unstable. The bottom part of Table