From Alignment to Assignment: Frustratingly Simple Unsupervised Entity Alignment

Cross-lingual entity alignment (EA) aims to find equivalent entities between cross-lingual knowledge graphs (KGs), which is a crucial step for integrating KGs. Recently, many GNN-based EA methods have been proposed and show decent performance improvements on several public datasets. However, existing GNN-based EA methods inevitably inherit poor interpretability and low efficiency from neural networks. Motivated by the isomorphic assumption of GNN-based methods, we successfully transform the cross-lingual EA problem into an assignment problem. Based on this re-definition, we propose a frustratingly Simple but Effective Unsupervised entity alignment method (SEU) without neural networks. Extensive experiments show that our unsupervised approach even beats advanced supervised methods across all public datasets while offering high efficiency, interpretability, and stability.


Introduction
A knowledge graph (KG) represents a collection of interlinked descriptions of real-world objects, events, or abstract concepts (e.g., documents), and has facilitated many downstream applications, such as recommendation systems (Cao et al., 2019b) and question answering (Qiu et al., 2020). Over recent years, a large number of KGs have been constructed in different domains and languages by different organizations. These cross-lingual KGs usually hold unique information individually but also share some overlap. Integrating these cross-lingual KGs could provide a broader view for users, especially for minority-language users, who often suffer from a lack of language resources. Therefore, how to fuse knowledge from cross-lingual KGs has attracted increasing attention.
As shown in Figure 1, cross-lingual entity alignment (EA) aims to find equivalent entities across multi-lingual KGs, which is a crucial step for integrating KGs. Conventional methods (Suchanek et al., 2011; Jiménez-Ruiz and Grau, 2011) usually rely solely on lexical matching and probability reasoning, which requires machine translation systems to solve cross-lingual tasks. However, existing machine translation systems cannot achieve high accuracy with limited contextual information, especially for language pairs that are not alike, such as Chinese-English and Japanese-English.
Recently, Graph Convolutional Networks (GCN) (Kipf and Welling, 2017) and subsequent Graph Neural Network (GNN) variants have achieved state-of-the-art results in various graph applications. Intuitively, GNNs are better at capturing the structural information of KGs, compensating for the shortcomings of conventional methods. Indeed, several GNN-based EA methods (e.g., Wu et al., 2019a) demonstrate decent performance improvements on public datasets. All these GNN-based EA methods are built upon a core premise, i.e., entities and their counterparts have similar neighborhood structures. However, better performance is not the only outcome of using GNNs. Existing GNN-based methods inevitably inherit the following inborn defects from neural networks: (1) Poor Interpretability: Many researchers treat GNNs (e.g., Wu et al., 2019a) as a black box, focusing on improving performance metrics. The tight coupling between non-linear operations and massive parameters makes GNNs hard to interpret thoroughly. As a result, it is hard to judge whether new designs are universal or merely over-fit a specific dataset. A recent survey notes that several "advanced" EA methods are even beaten by conventional methods on several public datasets.
(2) Low Efficiency: To further increase performance, newly proposed EA methods stack novel techniques, e.g., Graph Attention Networks (Wu et al., 2019a), Graph Matching Networks, and Joint Learning (Cao et al., 2019a). Consequently, the overall architectures become more and more unnecessarily complex, and their time-space complexities also dramatically increase. It has been reported that the running time of complex methods (e.g., RDGCN (Wu et al., 2019a)) is more than 10× that of vanilla GCN (Wang et al., 2018).
In this paper, we note that existing GNN-based EA methods inherit considerable complexity from their neural network lineage. Naturally, we consider eliminating the redundant designs from existing EA methods to enhance interpretability and efficiency without losing accuracy. Leveraging the core premise of GNN-based EA methods, we restate the assumption that both the structures and the textual features of the source and target KGs are isomorphic. With this assumption, we are able to transform the cross-lingual EA problem into an assignment problem, which is a fundamental and well-studied combinatorial optimization problem. Afterward, the assignment problem can be solved easily by the Hungarian algorithm (Kuhn, 1955) or the Sinkhorn operation (Cuturi, 2013).
Based on the above findings, we propose a frustratingly Simple but Effective Unsupervised EA method (SEU) without neural networks. Compared to existing GNN-based EA methods, SEU only retains the basic graph convolution operation for feature propagation while abandoning the complex neural networks, significantly improving efficiency and interpretability. Experimental results on public datasets show that SEU completes in several seconds on a GPU or tens of seconds on a CPU. More surprisingly, our unsupervised method even outperforms the state-of-the-art supervised approaches across all public datasets. Furthermore, we discuss the possible reasons behind the unsatisfactory performance of existing complex EA methods and the necessity of neural networks in cross-lingual EA. The main contributions are summarized as follows:
• By assuming that both the structures and the textual features of the source and target KGs are isomorphic, we transform the cross-lingual EA problem into an assignment problem. Based on this finding, we propose a frustratingly Simple but Effective Unsupervised entity alignment method (SEU).
• Extensive experiments on public datasets indicate that our unsupervised method outperforms all advanced supervised competitors while preserving high efficiency, interpretability, and stability.

Task Definition
A KG stores real-world knowledge in the form of triples (h, r, t). A KG can be defined as G = (E, R, T), where E, R, and T represent the entity set, relation set, and triple set, respectively. Given a source graph G_s = (E_s, R_s, T_s) and a target graph G_t = (E_t, R_t, T_t), EA aims to find the entity correspondences P between the two KGs.
Related Work

Cross-lingual Entity Alignment
Existing cross-lingual EA methods are based on the premise that equivalent entities in different KGs have similar neighboring structures. Following this idea, most of them can be summarized into two steps (as shown in Figure 2): (1) using KG embedding methods (e.g., TransE (Bordes et al., 2013) and GCN (Kipf and Welling, 2017)) to generate low-dimensional embeddings for the entities and relations in each KG;
(2) mapping these embeddings into a unified vector space through contrastive losses (Hadsell et al., 2006; Schroff et al., 2015) and pre-aligned entity pairs.
Based on the vanilla GCN, many EA methods design task-specific modules to improve EA performance. Cao et al. (2019a) propose a multi-channel GCN to learn multi-aspect information from KGs. Wu et al. (2019a) use a relation-aware dual-graph network to incorporate relation information with structural information. Moreover, due to the lack of labeled data, some methods (Sun et al., 2018; Mao et al., 2020) apply iterative strategies to generate semi-supervised data. To provide a multi-aspect view from both structure and semantics, some methods (Wu et al., 2019b; Yang et al., 2019) use word vectors of translated entity names as the input features of GNNs.

Assignment Problem
The assignment problem is a fundamental and well-studied combinatorial optimization problem. An intuitive instance is to assign N jobs to N workers. Assume that each worker can do any job, though with varying degrees of efficiency, and let x_ij be the profit if the i-th worker is assigned to the j-th job. The problem is then to find the best assignment plan (which job should be assigned to which worker on a one-to-one basis) so that the total profit of performing all jobs is maximized. Formally, it is equivalent to maximizing the following equation:

$$P^{\star} = \underset{P \in \mathbb{P}_N}{\arg\max}\; \langle X, P \rangle_F \quad (1)$$

where $X \in \mathbb{R}^{N \times N}$ is the profit matrix and $P$ is a permutation matrix denoting the assignment plan. There is exactly one entry of 1 in each row and each column of $P$, and 0s elsewhere. $\mathbb{P}_N$ represents the set of all N-dimensional permutation matrices, and $\langle \cdot, \cdot \rangle_F$ represents the Frobenius inner product.
In this paper, we adopt the Hungarian algorithm (Kuhn, 1955) and the Sinkhorn operation (Cuturi, 2013) to solve the assignment problem.
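As a concrete illustration, the assignment problem above can be solved off the shelf; the sketch below uses a hypothetical 3×3 profit matrix and SciPy's `linear_sum_assignment` (a modern implementation of the Hungarian-algorithm family):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy profit matrix: X[i][j] = profit of assigning worker i to job j.
X = np.array([
    [4.0, 1.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 2.0, 2.0],
])

# linear_sum_assignment minimizes cost by default; maximize=True finds
# the permutation with the largest total profit.
rows, cols = linear_sum_assignment(X, maximize=True)
total_profit = X[rows, cols].sum()
print(list(zip(rows.tolist(), cols.tolist())), total_profit)
```

Here the optimal plan assigns worker 0 to job 0, worker 1 to job 2, and worker 2 to job 1, for a total profit of 11.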

From Alignment to Assignment
The inputs of our proposed SEU are four matrices: $A_s \in \mathbb{R}^{|E_s| \times |E_s|}$ and $A_t \in \mathbb{R}^{|E_t| \times |E_t|}$ are the adjacency matrices of the source graph $G_s$ and the target graph $G_t$; $H_s \in \mathbb{R}^{|E_s| \times d}$ and $H_t \in \mathbb{R}^{|E_t| \times d}$ are the textual features of entities that have been pre-mapped into a unified semantic space through machine translation systems or cross-lingual word embeddings. Similar to the assignment plan, aligned entity pairs in EA also need to satisfy the one-to-one constraint. Let a permutation matrix $P \in \mathbb{P}_{|E|}$ represent the entity correspondences between $G_s$ and $G_t$, where $P_{ij} = 1$ indicates that $e_i \in G_s$ and $e_j \in G_t$ are an equivalent entity pair. The goal of SEU is to solve for $P$ given $\{A_s, A_t, H_s, H_t\}$.
Consider the following ideal situation: (1) $A_s$ and $A_t$ are isomorphic, i.e., $A_s$ can be transformed into $A_t$ by reordering the entity node indices according to $P$ (as shown in Figure 3):

$$A_s = P A_t P^{\top} \quad (2)$$

(2) The textual features of equivalent entity pairs are mapped perfectly by the translation system, so $H_s$ and $H_t$ can also be aligned according to the entity correspondences $P$:

$$H_s = P H_t \quad (3)$$

By combining Equations (2) and (3), and noting that $P^{\top} P = I$, the connection between the 5-tuple $\{A_s, A_t, H_s, H_t, P\}$ can be described as follows:

$$A_s^{l} H_s = (P A_t P^{\top})^{l} P H_t = P A_t^{l} H_t \quad (4)$$

Based on Equation (4), $P$ can be solved by minimizing $\|A_s^{l} H_s - P A_t^{l} H_t\|_F$ under the one-to-one constraint $P \in \mathbb{P}_{|E|}$. Theoretically, for an arbitrary depth $l \in \mathbb{N}$, the solution for $P$ should be the same. However, the above inference assumes the ideal isomorphic situation. In practice, $G_s$ and $G_t$ are not strictly isomorphic, and the translation system cannot perfectly map the textual features into a unified semantic space either. To reduce the impact of the noise that exists in practice, $P$ should fit various depths $l$.
Therefore, we propose the following equation to solve the cross-lingual EA problem:

$$P = \underset{P \in \mathbb{P}_{|E|}}{\arg\min} \sum_{l=0}^{L} \left\| A_s^{l} H_s - P A_t^{l} H_t \right\|_F \quad (5)$$

Theorem 1. Equation (5) is equivalent to solving the following assignment problem:

$$P = \underset{P \in \mathbb{P}_{|E|}}{\arg\max} \left\langle \sum_{l=0}^{L} A_s^{l} H_s \left(A_t^{l} H_t\right)^{\top},\; P \right\rangle_F \quad (6)$$

Proof: According to the properties of the Frobenius norm,

$$\left\| A_s^{l} H_s - P A_t^{l} H_t \right\|_F^2 = \left\| A_s^{l} H_s \right\|_F^2 - 2\left\langle A_s^{l} H_s,\; P A_t^{l} H_t \right\rangle_F + \left\| P A_t^{l} H_t \right\|_F^2 \quad (7)$$

Since the permutation matrix $P$ is orthogonal, both $\|A_s^{l} H_s\|_F^2$ and $\|P A_t^{l} H_t\|_F^2 = \|A_t^{l} H_t\|_F^2$ are constants. Then, Equation (7) is equivalent to maximizing as below:

$$\underset{P \in \mathbb{P}_{|E|}}{\arg\max} \sum_{l=0}^{L} \left\langle A_s^{l} H_s,\; P A_t^{l} H_t \right\rangle_F \quad (8)$$

For arbitrary real matrices $A$ and $B$, these two equations always hold, where $\mathrm{Tr}(X)$ represents the trace of matrix $X$:

$$\langle A, B \rangle_F = \mathrm{Tr}(A^{\top} B), \qquad \mathrm{Tr}(AB) = \mathrm{Tr}(BA) \quad (9)$$

Therefore, Theorem 1 is proved:

$$\left\langle A_s^{l} H_s,\; P A_t^{l} H_t \right\rangle_F = \mathrm{Tr}\!\left( (A_s^{l} H_s)^{\top} P A_t^{l} H_t \right) = \mathrm{Tr}\!\left( A_t^{l} H_t (A_s^{l} H_s)^{\top} P \right) = \left\langle A_s^{l} H_s (A_t^{l} H_t)^{\top},\; P \right\rangle_F \quad (10)$$

By Theorem 1, we transform the EA problem into the assignment problem. Compared to GNN-based EA methods, our proposed method retains the basic graph convolution operation for feature propagation but replaces the complex neural networks with the well-studied assignment problem. Note that the entity scales $|E_s|$ and $|E_t|$ are usually inconsistent in practice, so the profit matrix is not square. This kind of unbalanced assignment problem can easily be reduced to the balanced assignment problem. Assuming that $|E_s| > |E_t|$, a naive reduction is to pad the profit matrix with zeros so that its shape becomes $\mathbb{R}^{|E_s| \times |E_s|}$. This naive reduction suits datasets with a small gap between $|E_s|$ and $|E_t|$; for datasets with a large entity-scale gap, a more efficient reduction algorithm is available (Ramshaw and Tarjan, 2012).
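The reduction in Theorem 1 can be sketched in a few lines of NumPy. The graphs, features, and ground-truth permutation below are toy assumptions, and SciPy's `linear_sum_assignment` stands in for an assignment-problem solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def seu_profit_matrix(A_s, A_t, H_s, H_t, L=2):
    # Profit matrix X = sum_{l=0..L} (A_s^l H_s)(A_t^l H_t)^T, per Theorem 1.
    Xs, Xt = H_s, H_t
    profit = Xs @ Xt.T                     # l = 0 term
    for _ in range(L):
        Xs, Xt = A_s @ Xs, A_t @ Xt        # one propagation step per graph
        profit = profit + Xs @ Xt.T
    return profit

# Toy isomorphic pair: the target graph is the source graph with its
# nodes reindexed by a hypothetical ground-truth permutation P_true.
A_s = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
perm = [2, 0, 1]                           # source entity i <-> target entity perm[i]
P_true = np.eye(3)[perm]                   # P_true[i, perm[i]] = 1
A_t = P_true.T @ A_s @ P_true              # so that A_s = P A_t P^T
H_s = np.array([[1., 0.], [0., 1.], [1., 1.]])
H_t = P_true.T @ H_s                       # so that H_s = P H_t

X = seu_profit_matrix(A_s, A_t, H_s, H_t, L=2)
rows, cols = linear_sum_assignment(X, maximize=True)
print(cols.tolist())  # recovers perm: [2, 0, 1]
```

In this ideal isomorphic toy case, maximizing the profit matrix recovers the ground-truth correspondences exactly.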

Two Algorithms for Solving the Assignment Problem
The first polynomial-time algorithm for the assignment problem is the Hungarian algorithm (Kuhn, 1955), which is based on improving a matching along augmenting paths. The time complexity of the original Hungarian algorithm is $O(n^4)$. Later, Jonker and Volgenant (1987) improved the algorithm to achieve an $O(n^3)$ running time, which is one of the most popular variants.
Besides the Hungarian algorithm, the assignment problem can also be regarded as a special case of the optimal transport problem, in which the assignment plan $P$ may be any doubly stochastic matrix instead of a permutation matrix. Based on the Sinkhorn operation (Sinkhorn, 1964; Adams and Zemel, 2011), Cuturi (2013) proposes a fast and completely parallelizable algorithm for the optimal transport problem:

$$S^{0}(X) = \exp(X), \qquad S^{k}(X) = \mathcal{N}_c\!\left(\mathcal{N}_r\!\left(S^{k-1}(X)\right)\right)$$

where $\mathcal{N}_r(X) = X \oslash (X \mathbf{1}_N \mathbf{1}_N^{\top})$ and $\mathcal{N}_c(X) = X \oslash (\mathbf{1}_N \mathbf{1}_N^{\top} X)$ are the row- and column-wise normalization operators of a matrix, $\oslash$ represents element-wise division, and $\mathbf{1}_N$ is a column vector of ones. Mena et al. (2018) further prove that the assignment problem can also be solved by the Sinkhorn operation as a special case of the optimal transport problem:

$$P = \lim_{\tau \to 0^{+}} \lim_{k \to \infty} S^{k}\!\left(X / \tau\right)$$

In general, the time complexity of the Sinkhorn operation is $O(kn^2)$. Because the number of iterations $k$ is limited, the Sinkhorn operation can only obtain an approximate solution in practice. However, according to our experimental results, a very small $k$ is enough to achieve decent performance in entity alignment. Therefore, with $k$ treated as a constant, the Sinkhorn operation runs in $O(n^2)$, roughly $n$ times more efficient than the Hungarian algorithm.
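A minimal NumPy sketch of the Sinkhorn operation described above; the profit matrix and hyper-parameter values are illustrative, and a numerically hardened implementation would subtract the row maximum before exponentiating:

```python
import numpy as np

def sinkhorn(X, tau=0.02, k=10):
    # S^0 = exp(X / tau); S^k = N_c(N_r(S^{k-1})), i.e. alternately
    # normalize rows and columns of the scaled profit matrix.
    S = np.exp(X / tau)
    for _ in range(k):
        S = S / S.sum(axis=1, keepdims=True)   # N_r: rows sum to 1
        S = S / S.sum(axis=0, keepdims=True)   # N_c: columns sum to 1
    return S

# Illustrative 3x3 profit matrix; as tau shrinks and k grows, the
# result approaches the optimal permutation matrix.
X = np.array([[4.0, 1.0, 3.0],
              [2.0, 0.0, 5.0],
              [3.0, 2.0, 2.0]])
P = sinkhorn(X, tau=0.1, k=50)
print(P.argmax(axis=1))  # decoded alignment: row i matched to column P.argmax(axis=1)[i]
```

Decoding with a row-wise argmax yields the same assignment the Hungarian algorithm would return for this matrix, while every step is a dense matrix operation that parallelizes trivially on a GPU.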

Implementation Details
The above two sections introduce how to transform the cross-lingual EA problem into the assignment problem and how to solve the assignment problem. This section will clarify two important implementation details of our proposed method SEU.

Textual Features H
The input features of SEU cover two aspects:
Word-Level. In previous cross-lingual EA methods (e.g., Wu et al., 2019a), the most commonly used textual features are word-level entity name vectors. Specifically, these methods first use machine translation systems or cross-lingual word embeddings to map entity names into a unified semantic space and then average the pre-trained word vectors of each entity name to construct the initial features. To make fair comparisons, we adopt the same entity name translations and word vectors as previous work.
Char-Level. Because of the contradiction between the extensive presence of proper nouns (e.g., person and city names) and the limited size of the word vocabulary, word-level EA methods suffer from a serious out-of-vocabulary (OOV) issue. Therefore, many EA methods explore char-level features, using a char-CNN or a name-BERT to extract char/sub-word features of entities. To keep our proposed method simple and consistent, we adopt the character bigrams of translated entity names as the char-level input textual features instead of complex neural networks.
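A possible sketch of such char-level features; the bag-of-bigrams encoding below is an illustrative choice, not necessarily the paper's exact preprocessing:

```python
import numpy as np

def char_bigram_features(names):
    # Build a bigram vocabulary over all names, then encode each name as
    # an L2-normalized bag-of-bigrams vector (robust to OOV whole words).
    vocab = {}
    for name in names:
        for i in range(len(name) - 1):
            vocab.setdefault(name[i:i + 2], len(vocab))
    H = np.zeros((len(names), len(vocab)))
    for row, name in enumerate(names):
        for i in range(len(name) - 1):
            H[row, vocab[name[i:i + 2]]] += 1.0
    H /= np.linalg.norm(H, axis=1, keepdims=True)  # dot product = cosine
    return H, vocab

H, vocab = char_bigram_features(["tokyo", "kyoto"])
print(H.shape, round(float(H[0] @ H[1]), 2))  # shared bigrams give high similarity
```

Even when a full name is absent from any word vocabulary, its character bigrams still produce a non-trivial feature vector, which is exactly what the word-level representation fails to provide for OOV proper nouns.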
In addition to these text-based methods, we note that some structure-based EA methods (e.g., Wang et al., 2018) do not require any textual information at all; their entity features are randomly initialized. Section 5.6 will discuss the connection between text-based and structure-based methods and challenge the necessity of neural networks in cross-lingual EA.

Adjacency Matrix A
In Section 4.1, all deductions are built upon the assertion that the adjacency matrices $A_s$ and $A_t$ are isomorphic. Let $D$ be the degree matrix of the adjacency matrix $A_{s/t}$; obviously, the equal-probability random walk matrix $A_r = D^{-1} A_{s/t}$ and the symmetric normalized Laplacian matrix $A_L = I - D^{-1/2} A_{s/t} D^{-1/2}$ of $A_s$ and $A_t$ are also isomorphic. Therefore, if $A_{s/t}$ is replaced by $A_r$ or $A_L$, our method still holds.
However, the above matrices ignore the relation types in the KGs and treat all types of relations as equally important. We believe that less frequent relations should have higher weight because they carry more unique information. Following this intuition, we apply a simple strategy to generate the relational adjacency matrix $A_{rel}$; for $a_{ij} \in A_{rel}$:

$$a_{ij} = \sum_{r \in R_{i,j}} \frac{\ln\left(|T| / |T_r| + 1\right)}{\sum_{e_k \in N_i} \sum_{r' \in R_{i,k}} \ln\left(|T| / |T_{r'}| + 1\right)}$$

where $N_i$ represents the neighboring set of entity $e_i$, $R_{i,j}$ is the set of relations between $e_i$ and $e_j$, and $|T|$ and $|T_r|$ represent the total number of triples and the number of triples containing relation $r$, respectively.
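One way to implement a rarity-weighted relational adjacency in this spirit is sketched below; the specific weighting ln(|T| / |T_r| + 1) and the row normalization are assumptions of this sketch, and the triples are made up for illustration:

```python
import math
from collections import Counter, defaultdict

def relational_adjacency(triples):
    # Weight each relation r by ln(|T| / |T_r| + 1): the rarer the
    # relation, the larger its contribution to the edge weight.
    total = len(triples)
    rel_count = Counter(r for _, r, _ in triples)
    A = defaultdict(float)
    for h, r, t in triples:
        w = math.log(total / rel_count[r] + 1)
        A[(h, t)] += w
        A[(t, h)] += w          # treat the KG as undirected for propagation
    # Row-normalize so A acts like a random-walk propagation operator.
    row_sum = defaultdict(float)
    for (i, _), w in A.items():
        row_sum[i] += w
    return {(i, j): w / row_sum[i] for (i, j), w in A.items()}

A = relational_adjacency([(0, "capitalOf", 1), (0, "rareRelation", 2)])
print(A)
```

A sparse-matrix version of the same construction is what one would feed into the feature-propagation step in place of the plain adjacency matrix.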

Experiments
Our experiments are conducted on a workstation with a GeForce GTX Titan X GPU and a Ryzen ThreadRipper 3970X CPU. The code and datasets are available at github.com/MaoXinn/SEU.

Datasets
To make fair comparisons with previous EA methods, we experiment on two widely used public datasets: (1) DBP15K, whose cross-lingual subsets each contain 15,000 pre-aligned entity pairs; (2) SRPRS, whose subsets also contain 15,000 entity pairs each but with much fewer triples than DBP15K. The statistics of these datasets are summarized in Table 1. Most previous studies (Wang et al., 2018; Cao et al., 2019a) randomly split 30% of the entity pairs for training and development and use the remaining 70% for testing. Because our proposed method is unsupervised, all of the entity pairs can be used for testing.
Table 2: Main experimental results on DBP15K and SRPRS. Baselines are separated according to the three groups described in Section 5.2. Most results are taken from the original papers; some recent methods fail to run on certain datasets or have not released their source code yet. We will fill in these blanks after contacting their authors.

Settings
Metrics. Following convention, we use Hits@k and Mean Reciprocal Rank (MRR) as our evaluation metrics. The Hits@k score is the proportion of source entities whose correct counterpart ranks in the top k. In particular, Hits@1 is equivalent to accuracy.
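For clarity, a small sketch of how Hits@k and MRR can be computed from a similarity matrix; it assumes, purely for illustration, that the gold counterpart of source entity i is target entity i:

```python
import numpy as np

def hits_and_mrr(scores, k=1):
    # scores[i, j]: similarity between source entity i and target entity j.
    # Gold pairs are assumed to be (i, i) in this toy setup.
    n = scores.shape[0]
    order = np.argsort(-scores, axis=1)                # candidates, best first
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(n)])
    hits_k = float(np.mean(ranks <= k))                # fraction ranked in top-k
    mrr = float(np.mean(1.0 / ranks))                  # mean reciprocal rank
    return hits_k, mrr

scores = np.array([[0.9, 0.1, 0.0],
                   [0.2, 0.8, 0.1],
                   [0.7, 0.1, 0.3]])
print(hits_and_mrr(scores, k=1))
```

In this toy matrix, two of three gold targets rank first and one ranks second, giving Hits@1 = 2/3 and MRR = (1 + 1 + 1/2)/3.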
Hyper-parameters. In the main experiments, we use the Sinkhorn operation to solve the assignment problem. For all datasets, we use the same default setting: depth L = 2, iterations k = 10, and temperature τ = 0.02. Table 2 shows the main experimental results of all EA methods; numbers in bold denote the best results among all methods.

Main Experiments
SEU vs. Baselines. According to the results, our method consistently achieves the best performance across all datasets. Compared with the previous SOTA methods, SEU (w+c) improves Hits@1 and MRR by at least 1.5% and 1.3%, respectively. More importantly, SEU outperforms its supervised competitors as an unsupervised method, which is critical in practical applications.
In addition to its better performance, SEU also has better interpretability and stability: (1) When solving with the Hungarian algorithm, we can trace the reason for each decision through the augmenting paths, which brings better interpretability.
(2) As is well known, neural networks optimized by SGD usually exhibit some performance fluctuations. Since both the Hungarian algorithm and the Sinkhorn operation are deterministic, multiple runs with the same hyper-parameters produce identical results, which means better stability.
Word vs. Char. From Table 2, we observe that the char-level SEU greatly outperforms the word-level SEU. Especially on SRPRS FR-EN, the performance gap on Hits@1 is more than 16%. As mentioned in Section 4.3.1, the main reason is that these datasets contain extensive OOV proper nouns. For example, in DBP15K, 4-6% of the words are OOV, while in SRPRS DE-EN and SRPRS FR-EN, more than 12% and 16% of the entity names are OOV, respectively.
Note that although the performance difference between SEU (word) and SEU (char) is vast, these two features still complement each other, so their combination improves performance further (especially on DBP ZH-EN). We believe the hidden reason is synonyms. For example, "soccer" and "football" refer to the same Chinese phrase, but the two English words have almost no char-level overlap. The word-level features, however, can bridge such semantic gaps via pre-trained cross-lingual word vectors.
SEU vs. PARIS. As mentioned in Section 1, a recent survey notes that several "advanced" EA methods are even beaten by conventional methods. To make this study more comprehensive, we also compare SEU against a representative conventional method, PARIS (Suchanek et al., 2011), in Figure 4; PARIS is a holistic unsupervised solution for aligning KGs based on probability estimates. Since PARIS may not output a target entity for every source entity, we use the F1-score as the evaluation metric to handle entities without a match; for our method, the F1-score is equivalent to Hits@1. Consistent with the survey, PARIS is better than most GNN-based EA methods. On the other hand, SEU outperforms PARIS significantly on all these public datasets except DBP ZH-EN.
Hungarian vs. Sinkhorn. Table 3 reports the performance of SEU (w+c) with the Hungarian algorithm and the Sinkhorn operation, respectively.¹ Theoretically, the Hungarian algorithm generates the exact optimal solution, while the Sinkhorn operation only generates an approximate one. Therefore, the Hungarian algorithm is always slightly better, but the performance gap is relatively small. Furthermore, we list the time costs of these two algorithms in Table 4. We observe that the time costs of the Hungarian algorithm are unstable and depend on the dataset, while the time costs of the Sinkhorn operation are much more stable. Because the Sinkhorn operation is completely parallelizable, its time costs can be further reduced by the GPU. In general, the Sinkhorn operation is more suitable for large-scale EA because of its higher efficiency.

Table 5: Overall time costs in seconds.

Method                        | DBP15K | SRPRS
GCN-Align (Wang et al., 2018) |    103 |     87
MuGNN (Cao et al., 2019a)     |  3,156 |  2,215
BootEA (Sun et al., 2018)     |  4,661 |  2,659
MRAEA (Mao et al., 2020)      |  3,894 |  1,248
GM-Align                      | 26,328 | 13,032
RDGCN (Wu et al., 2019a)      |  6,711 |    886
HGCN (Wu et al., 2019b)       | 11,275 |  2,504
SEU (CPU)                     |   22.1 |   13.8
SEU (GPU)                     |   16.2 |    9.6

¹ Since the Hungarian algorithm only outputs the assigned entity pairs, instead of a probability matrix P, we can only report its Hits@1 performance.
Overall Time Efficiency. We evaluate the overall time costs of several EA methods and report the results in Table 5. The efficiency of SEU far exceeds all advanced competitors. Typically, existing GNN-based methods require a forward propagation for every batch, and model convergence usually requires hundreds of batches. Since SEU has no trainable parameters, it requires only a single forward propagation, which enables this acceleration.

Auxiliary Experiments
To explore the behavior of SEU in different situations, we design the following experiments.
Temperature τ. Similar to the temperature τ in the softmax operation, τ in the Sinkhorn operation is used to make the distribution closer to one-hot. With the remaining configuration unchanged, we set τ to different values and report the corresponding performance of SEU (w+c) on DBP ZH-EN in Figure 5. With an appropriate τ, the Sinkhorn algorithm converges quickly to the optimal solution, but if τ is set too large, the algorithm fails to converge.
Depth L. We report the experimental results for depth L in Figure 6. In particular, L = 0 is equivalent to aligning entities only according to their own features, without neighborhood information. SEU (w+c) with L = 2 achieves the best performance on all subsets of DBP15K, which indicates the necessity of introducing neighborhood information. Similar to GNN-based EA methods, SEU is also affected by the over-smoothing problem: when stacking more layers, performance begins to decrease slightly.
Adjacency matrix A. To distinguish different relation types in KGs, we adopt a simple strategy to generate the relational adjacency matrix $A_{rel}$. Table 6 reports the performance of SEU (w+c) with different types of adjacency matrices: the standard adjacency matrix $A$, the equal-probability random walk matrix $A_r = D^{-1} A$, and the symmetric normalized Laplacian matrix $A_L = I - D^{-1/2} A D^{-1/2}$. The experimental results show that $A_{rel}$ achieves the best performance across all three subsets.

Discussion
From the experimental results, we observe that supervised EA methods are even beaten by unsupervised ones. In this section, we hypothesize that the reason behind this counter-intuitive phenomenon is potential over-fitting. As mentioned in Section 5.2, existing EA methods can be divided into structure-based and text-based according to their input features. The only difference between them is that structure-based methods use randomly initialized vectors as the entity features, while text-based methods use pre-mapped textual features as inputs. Let us consider the vanilla GCN as an example:

$$H^{l+1} = \sigma\!\left(A_L H^{l} W^{l}\right)$$

where $\sigma$ represents the activation function and $W^{l}$ is the transformation matrix. For structure-based methods, since the input features $H$ and the transformation matrix $W$ are both randomly initialized, they can be merged into one matrix, i.e., $H^{l+1} = \sigma(A_L H^{l})$. This idea has been validated by many structure-based EA methods (Cao et al., 2019a; Mao et al., 2020), which propose to diagonalize or remove the transformation matrix $W$. In this situation, GCN reduces to a simple fully connected neural network with adjacency matrices as its input features. The essence of structure-based EA methods is to map the features of adjacency matrices into a unified vector space, which is why they require supervised data to learn the parameters. As for text-based EA methods, the textual features of entities have already been pre-mapped into a unified semantic space by machine translation or cross-lingual word vectors. Therefore, text-based EA methods are equivalent to further fitting these pre-mapped features on a few aligned entity-pair seeds, which can cause potential over-fitting. Considering that we can directly align entities by solving an assignment problem, it is unnecessary to further fit entity features via neural networks.
As a simple unsupervised method, our proposed SEU achieves excellent performance on several EA datasets, which supports the above analysis empirically. Note that this section only proposes a possible explanation, not a rigorous proof. We will continue to explore this direction.

Conclusion
In this paper, we successfully transform the cross-lingual EA problem into the assignment problem. Based on this finding, we propose a frustratingly Simple but Effective Unsupervised EA method (SEU) without neural networks. Experiments on widely used public datasets indicate that SEU outperforms all advanced competitors while offering high efficiency, interpretability, and stability.