Interpretable and Low-Resource Entity Matching via Decoupling Feature Learning from Decision Making

Entity Matching (EM) aims at recognizing entity records that denote the same real-world object. Neural EM models learn vector representations of entity descriptions and match entities end-to-end. Though robust, these methods require many annotated resources for training and lack interpretability. In this paper, we propose a novel EM framework that consists of Heterogeneous Information Fusion (HIF) and Key Attribute Tree (KAT) Induction to decouple feature representation from matching decision. Using self-supervised learning and the mask mechanism of pre-trained language modeling, HIF learns embeddings of noisy attribute values by inter-attribute attention from unlabeled data. Using a set of comparison features and a limited amount of annotated data, KAT Induction learns an efficient decision tree that can be interpreted by generating entity matching rules whose structure is advocated by domain experts. Experiments on 6 public datasets and 3 industrial datasets show that our method is highly efficient and outperforms SOTA EM models in most cases. We will release the code upon acceptance.


Introduction
Entity Matching (EM) aims at identifying whether two records from different sources refer to the same real-world entity. It is a fundamental research task in knowledge graph integration (Dong et al., 2014; Daniel et al., 2020; Christophides et al., 2015; Christen, 2012) and text mining (Zhao et al., 2014). In real applications, it is not easy to decide whether two records with ad hoc linguistic descriptions refer to the same entity. In Figure 1, e 2 and e 3 refer to the same publication, while e 1 refers to a different one. The Venues of e 2 and e 3 have different expressions, and the Authors of e 3 is misplaced in its Title field. Early works rely on feature engineering (Wang et al., 2011) and rule matching (Singh et al., 2017; Fan et al., 2009). Recently, the robustness of entity matching has been improved by deep learning models, such as distributed-representation-based models (Ebraheem et al., 2018), attention-based models (Mudgal et al., 2018; Fu et al., 2019, 2020), and pre-trained language model based models. Nevertheless, these modern neural EM models suffer from two limitations.

Low-Resource Training. Supervised deep learning EM relies on large amounts of labeled training data, which is extremely costly in reality. Attempts have been made to leverage external data via transfer learning (Zhao and He, 2019; Kasai et al., 2019; Loster et al., 2021) and pre-trained language model based methods. Other attempts improve labeling efficiency via active learning (Nafa et al., 2020) and crowdsourcing techniques (Gokhale et al., 2014; Wang et al., 2012). However, external information may introduce noise, and active learning and crowdsourcing still require additional labeling work.

Lack of Interpretability. It is important to know why two entity records are equivalent; however, deep learning EM lacks interpretability.
Though some neural EM models analyze model behavior from the perspective of attention (Nie et al., 2019), attention is not a safe indicator for interpretability (Serrano and Smith, 2019). Deep learning EM also fails to generate interpretable EM rules in the sense that they meet the criteria set by domain experts (Fan et al., 2009).
To address the two limitations, we propose a novel EM framework to decouple feature representation from matching decision. Our framework consists of Heterogeneous Information Fusion (HIF) and Key Attribute Tree (KAT) Matching Decision for low-resource settings. HIF is robust for feature representation from noisy inputs, and KAT carries out interpretable decisions for entity matching.
In particular, HIF learns from unlabeled data a mapping function that converts each noisy attribute value of an entity into a vector representation. This is achieved by a novel self-supervised attention training schema that leverages the redundancy within attribute values and propagates information across attributes.
KAT Matching Decision learns KAT via decision tree classification. For each entity pair, it first computes multiple similarity scores for each attribute using a family of metrics and concatenates them into a comparison feature vector; the trained KAT then carries out entity matching as tree classification over this vector. The classification tree can be directly interpreted as EM rules that share a similar structure with EM rules derived by domain experts.
Our EM method matches or exceeds SOTA performance on 9 datasets (3 structured datasets, 3 dirty datasets, and 3 industrial datasets) under various extremely low-resource settings. Moreover, when the proportion of labeled training data decreases from 60% to 10%, our method maintains almost the same performance, whereas the performance of other methods decreases greatly.
The rest of the paper is structured as follows. Section 2 defines the EM task; Section 3 presents HIF and KAT Induction in detail; Section 4 reports a series of comparative experiments that show the robustness and the interpretability of our methods in low-resource settings; Section 5 lists related works; Section 6 concludes the paper.

Task Definitions
Entity Matching. Let T 1 and T 2 be two collections of entity records with m aligned attributes {A 1 , · · · , A m }. We denote the i th attribute value of entity record e as e[A i ]. Entity matching aims to determine whether e 1 and e 2 refer to the same real-world object or not. Formally, entity matching is viewed as a binary classification function T 1 × T 2 → {True, False} that takes (e 1 , e 2 ) ∈ T 1 × T 2 as input, and outputs True (False) if e 1 and e 2 are matched (not matched).
Current neural EM approaches simultaneously embed entities in low-dimensional vector spaces and obtain entity matching by computations on their vector representations. Supervised deep learning EM relies on large amounts of labeled training data, which is time-consuming and needs costly manual effort. Large unlabeled data also contain entity feature information useful for EM, yet it has not been fully exploited by existing neural EM methods. In this paper, we aim at decoupling feature representation from matching decision. Our novel EM model consists of two sub-tasks: learning feature representation from unlabeled data, and EM decision making.
Feature Representation from Noisy Inputs. Entity records are gathered from different sources with three typical kinds of noise in attribute values: misplacing, missing, and synonyms. Misplacing means that the attribute value of A i drifts to A j (i ≠ j); missing means that attribute values are empty; synonym means that attribute values with the same meaning have different literal forms. Our first task is to fuse noisy heterogeneous information in a self-supervised manner with unlabeled data.
Interpretable EM. Domain experts have some valuable specifications on EM rules as follows: (1) an EM rule is an if-then rule of feature comparison; (2) it only selects a subset of key attributes from all entity attributes for decision making; (3) feature comparison is limited to a number of similarity constraints, such as =, ≈ (Fan et al., 2009; Singh et al., 2017). Our second task is to realize an interpretable EM decision process by comparing feature representations per attribute with a fixed number of quantitative similarity metrics, and then training a decision tree with a limited amount of labeled data. Our interpretable EM decision making eases the collaboration with domain experts.

Methodology
In this section, we introduce (1) a neural model, Heterogeneous Information Fusion (HIF), for the task of feature representation, and (2) a decision tree, Key Attribute Tree (KAT), for the task of interpretable EM. Figure 2 illustrates the overall workflow of our method. The following subsections dive into the details of the two tasks and propose a novel training scheme for low-resource settings by exploiting unlabeled entity records.

HIF for Entity Attribute Embedding
HIF : T → R m×d is a function that maps entity records into vector representations. An attribute value e[A i ] of a record e is mapped to a d dimensional vector, written as HIF(e)[A i ] ∈ R d . HIF treats attribute values as strings of words and performs word embedding (EMB), word information aggregation (AGG), and attribute information propagation (PROP) successively.
Word Embedding (EMB). Word embeddings are pre-trained on a large corpus and encode distributional features of words. We convert numerical and encoded attribute values into strings of digits or alphabets. For Chinese attribute values, we perform word segmentation using pkuseg (Luo et al., 2019). Then, we mark the beginning and the end of an attribute value with two special tokens, BEG and END. Finally, we pad each attribute value with PAD so that all attribute values have the same length l. Let W be the set of words; each word w ∈ W is mapped into a vector, and each attribute value is mapped into a matrix. Formally, EMB : W N → R N ×de maps a sequence of N words into an N × d e matrix by executing a look-up-table operation, where d e is the dimension of the word embedding vectors. In particular, we have EMB(e)[A i ] ∈ R l×de . It is worth noting that PAD is embedded as the zero vector to ensure that it does not interfere with non-padding words in the following step.
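The padding and lookup step above can be sketched as follows; the token spellings, toy vocabulary, and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def pad_tokens(tokens, l, beg="<BEG>", end="<END>", pad="<PAD>"):
    """Wrap an attribute value with BEG/END markers and pad it to length l."""
    seq = [beg] + tokens[: l - 2] + [end]
    return seq + [pad] * (l - len(seq))

def embed(seq, vocab, table):
    """Look up a d_e-dimensional vector for each token (EMB)."""
    return np.stack([table[vocab[t]] for t in seq])

# toy vocabulary; row 0 is reserved for <PAD> and kept at zero
vocab = {"<PAD>": 0, "<BEG>": 1, "<END>": 2, "vldb": 3, "2001": 4}
d_e = 8
rng = np.random.default_rng(0)
table = rng.normal(size=(len(vocab), d_e))
table[0] = 0.0  # zero vector for <PAD>, so padding cannot leak into pooling

seq = pad_tokens(["vldb", "2001"], l=6)
E = embed(seq, vocab, table)  # shape (6, 8): one row per token
```

In a real run the lookup table would come from fastText or Tencent Embedding (Appendix B), with only the PAD row forced to zero.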
Word Information Aggregation (AGG). Summing up the l word embeddings as the embedding of an attribute value would neglect the importance weights among the l words. We adopt a more flexible framework, which aggregates word information by weighted pooling. The weighting coefficient α i for each word is computed by multiplying its embedding vector with a learnable, attribute-specific vector a i ∈ R de×1 ; the subscript i indicates that α i and a i are associated with the i th attribute A i . The weighting coefficients are normalized with the Softmax function over words. Finally, we apply a non-linear transformation (e.g., ReLU) during information aggregation with parameters W ai ∈ R de×da . Formally, AGG maps each attribute value of entity record e into a d a -dimensional vector AGG(EMB(e)[A i ]) ∈ R da .

Attribute Information Propagation (PROP). The mechanism of attribute information propagation is the key component for noise reduction and representation unification. It is inspired by the observation that missing attribute values often appear in other attributes (e.g., Venue and Conference in Figure 1; Mudgal et al. (2018) also reported the misplacing issue).
We use "Scaled Dot-Product Attention" (Vaswani et al., 2017) to propagate information among different attribute values. Parameters Q, K, and V i convert AGG(EMB(e)[A i ]) into query, key, and value vectors q i , k i , v i , respectively (note that only V i is attribute-specific). A ∈ R m×m is the attention matrix, where A ij denotes the attention coefficient from the i th attribute to the j th attribute, obtained by a Softmax over the scaled dot products q i · k j / √ d k . The record notation e is omitted in q, k, v for brevity. To keep the identity information, each attribute value after attribute information propagation is represented by the concatenation of the context vector (the attention-weighted sum of value vectors) and its own value vector. The whole process can be summarized as HIF(e)[A i ] = PROP(AGG(EMB(e)))[A i ]. After HIF, each attribute A i of an entity record e has a feature embedding HIF(e)[A i ].
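The AGG and PROP steps can be sketched as below. This is a minimal NumPy sketch under our own reading of the description (the paper's exact formulas were lost in this copy); all dimensions and parameters are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def agg(E, a, W):
    """AGG: weighted pooling over the l word vectors of one attribute value.
    E: (l, d_e) word embeddings; a: (d_e,) attribute-specific scoring vector;
    W: (d_e, d_a) projection. Padded rows of E are zero vectors, so they
    contribute nothing to the weighted sum."""
    alpha = softmax(E @ a)                      # (l,) word importance weights
    return np.maximum((alpha @ E) @ W, 0.0)     # ReLU non-linearity, (d_a,)

def prop(H, Q, K, Vs):
    """PROP: scaled dot-product attention across the m attribute vectors.
    H: (m, d_a) AGG outputs; Q, K: shared (d_a, d_k) projections;
    Vs: m attribute-specific (d_a, d_v) value projections. Each output row
    concatenates the attention context with the attribute's own value vector."""
    q, k = H @ Q, H @ K
    v = np.stack([H[i] @ Vs[i] for i in range(len(Vs))])  # (m, d_v)
    A = softmax(q @ k.T / np.sqrt(Q.shape[1]), axis=1)    # (m, m) attention
    return np.concatenate([A @ v, v], axis=1)             # (m, 2*d_v)

# toy record with m=4 attributes, l=6 words each
m, l, d_e, d_a, d_k, d_v = 4, 6, 8, 7, 3, 5
rng = np.random.default_rng(0)
E = rng.normal(size=(m, l, d_e))
H = np.stack([agg(E[i], rng.normal(size=d_e), rng.normal(size=(d_e, d_a)))
              for i in range(m)])                          # (4, 7)
out = prop(H, rng.normal(size=(d_a, d_k)), rng.normal(size=(d_a, d_k)),
           [rng.normal(size=(d_a, d_v)) for _ in range(m)])  # (4, 10)
```

The actual model additionally uses multi-head attention and a 2-layer MLP on top of PROP (Appendix B), which this sketch omits.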

KAT for Matching Decision
KAT Matching Decision consists of two steps: comparison feature computation (CFC) and decision making with KAT. CFC computes similarity scores for each pair of attribute features using a family of well-selected metrics, and concatenates these similarity scores into a vector (the comparison feature). KAT takes the comparison feature as input and performs entity matching with a decision tree.
Comparison Feature Computing (CFC). Given a record pair (e 1 , e 2 ), CFC implements a function that maps (e 1 , e 2 ) to a vector of similarity scores CFC(e 1 , e 2 ).
The similarity score CFC(e 1 , e 2 ) is a concatenation of a similarity vector between paired attribute values (i.e., e 1 [A i ] and e 2 [A i ]) and a similarity vector between their embeddings (i.e., HIF(e 1 )[A i ] and HIF(e 2 )[A i ]). To compare paired attribute values, we follow Konda et al. (2016) and classify attribute values into 6 categories according to their type and length, each with a set of comparison metrics for similarity measurement, such as Jaccard similarity, Levenshtein similarity, and Monge-Elkan similarity. More details are presented in Table 1.
For attribute value embeddings, we choose three metrics: cosine similarity, L 2 distance, and the Pearson correlation coefficient. In this way, we convert each entity record pair into a similarity score vector over attributes. Each dimension indicates the similarity degree of one attribute from a certain perspective.
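The CFC step can be sketched as follows. This toy version uses only token-set Jaccard on the value level plus the three embedding-level metrics named above; the paper's full metric family (Levenshtein, Monge-Elkan, etc., per Table 1) is omitted here.

```python
import numpy as np

def jaccard(a, b):
    """Token-set Jaccard similarity between two attribute value strings."""
    s, t = set(a.split()), set(b.split())
    return len(s & t) / len(s | t) if s | t else 0.0

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def cfc(vals1, vals2, emb1, emb2):
    """Concatenate value-level and embedding-level similarity scores
    over all m attributes into one comparison feature vector."""
    feats = []
    for a, b, u, v in zip(vals1, vals2, emb1, emb2):
        feats += [jaccard(a, b),                   # value-level metric
                  cosine(u, v),                    # embedding: cosine
                  float(np.linalg.norm(u - v)),    # embedding: L2 distance
                  float(np.corrcoef(u, v)[0, 1])]  # embedding: Pearson
    return np.array(feats)

rng = np.random.default_rng(3)
f = cfc(["deep learning", "vldb"], ["deep learning em", "vldb 2001"],
        rng.normal(size=(2, 8)), rng.normal(size=(2, 8)))  # 2 attrs x 4 metrics
```

Each group of four entries in `f` describes one attribute from the four perspectives, which is exactly the shape KAT consumes.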
KAT Induction. In the matching decision, we take CFC(e 1 , e 2 ) as input, and output binary classification results. We propose the Key Attribute Tree, a decision tree, to make the matching decision based on the key attribute heuristic, i.e., the idea that some attributes are more important than others for EM. For example, we can decide whether two records of research articles are the same by only checking their Title and Venue, without examining their Conference. Focusing only on key attributes not only saves computation, but also introduces interpretability in two senses: (1) each dimension of CFC(e 1 , e 2 ) is a candidate matching feature that can be interpreted as a component of an EM rule; (2) the decision tree learned by KAT can be converted into EM rules that follow the same heuristics as the EM rules made by domain experts (Fan et al., 2009).
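The induction step can be sketched with a depth-limited tree over synthetic comparison features. The data, labels, and scikit-learn's entropy-criterion tree (standing in for ID3) are assumptions for illustration; the xgboost variant would analogously use `XGBClassifier(n_estimators=1, max_depth=3)` so that exactly one interpretable tree is learned.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.random((200, 12))           # toy CFC comparison feature vectors
y = (X[:, 0] > 0.5).astype(int)     # toy label driven by one "key attribute"

# Depth-limited, entropy-criterion tree as a stand-in for ID3.
kat = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
acc = kat.score(X, y)
```

Because the toy label depends on a single feature, the shallow tree recovers it exactly, and its feature importances point at that "key attribute", mirroring the heuristic above.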

Model Training
HIF and KAT Induction are trained separately.
HIF Training. We design a self-supervised training method for HIF to learn from unlabeled data: following the mask mechanism of pre-trained language modeling, attribute values are randomly masked and HIF is trained to reconstruct them from the remaining attributes.

KAT Induction Training. KAT is trained with a standard decision tree algorithm. We constrain its depth, in part to maintain the interpretability of the transformed EM rules. We use xgboost (Chen and Guestrin, 2016) and the ID3 algorithm (Quinlan, 1986) in the experiments. To preserve interpretability, the booster number of xgboost is set to 1, so that it learns a single decision tree. For (e 1 , e 2 , True) ∈ D, KAT takes CFC(e 1 , e 2 ) as input and True as the target classification output.

Experiments

Datasets

The Structured and Dirty datasets are benchmark datasets released in (Mudgal et al., 2018). The Real datasets are sampled from Taobao, one of the biggest E-commerce platforms in China; a portion of the record pairs are manually labeled to indicate whether they refer to the same entity or not. The Real datasets have notably more attributes than the Structured or Dirty datasets. Statistics of these datasets are listed in Table 2. We focus on the low-resource EM setting and use Rate% of the labeled data as the training set. The validation set uses the last 20% of labeled pairs, and the remaining pairs in the middle form the test set. This split differs from the sufficient-resource setting (Mudgal et al., 2018; Konda et al., 2016), where up to 60% of pairs are used for training. For I-A 1 , I-A 2 , and Phone, we use 10% of labeled pairs as training data, because some of the baselines crash if the training data is too small.
We remove trivial entity pairs from the Real datasets (the Structured and Dirty datasets are released already preprocessed). For the Real datasets, we remove matching pairs with large Jaccard similarity (above 0.32 for Phone, 0.36 for the others) and non-matching pairs with small Jaccard similarity (below 0.3 for Phone, 0.332 for the others).
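The filtering rule can be sketched as below; the function name and the string-level record representation are our own illustrative assumptions (the non-Phone thresholds from the text are used as defaults).

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two record strings."""
    s, t = set(a.split()), set(b.split())
    return len(s & t) / len(s | t) if s | t else 0.0

def keep_pair(rec1, rec2, is_match, hi=0.36, lo=0.332):
    """Drop trivially easy pairs: matches that are near-duplicates and
    non-matches that share almost no tokens."""
    sim = jaccard(rec1, rec2)
    if is_match and sim > hi:        # trivially matching pair: drop
        return False
    if not is_match and sim < lo:    # trivially non-matching pair: drop
        return False
    return True
```

Only the hard pairs survive, so the reported scores measure genuinely ambiguous cases rather than near-duplicates.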

Baselines
We implement 3 variants of our method with different KAT induction algorithms. HIF+KAT ID3 and HIF+KAT XGB induce KAT with the ID3 algorithm and xgboost, respectively, constraining the maximum tree depth to 3. HIF+DT induces KAT with the ID3 algorithm without any constraint on tree depth. We include reproducibility details in Appendix B.
We compare our methods with three SOTA EM methods: two publicly available end-to-end neural methods and one feature-engineering-based method. 1. DeepMatcher (Mudgal et al., 2018) is an end-to-end neural EM method 2 . 2. HierMatcher (Fu et al., 2020) is also an end-to-end neural EM method, one that compares entity records at the word level 3 .
3. Magellan (Konda et al., 2016) integrates both automatic feature engineering for EM and classifiers. Decision tree is used as the classifier of Magellan in our experiments.
For ablation analysis, we replace a single component of our model at a time as follows: HIF+LN replaces KAT with a linear classifier; HIF+LR replaces KAT with a logistic regression classifier; HIF-ALONE removes the comparison metrics on raw attribute values (the yellow segment of comparison features in Figure 2). We also perform ablation analysis on HIF-ALONE as follows: HIF-WBOW replaces the outputs of HIF with d-dimensional WBOW vectors obtained via PCA; HIF-EMB replaces the outputs of HIF with the mean pooling of word embeddings. (Footnotes: 2 https://github.com/anhaidgroup/deepmatcher 3 https://github.com/cipnlu/EntityMatcher)

Evaluation Metrics
We use F 1 score as the evaluation metric. Experiment results are listed in Table 3 and Table 5. All the reported results are averaged over 10 runs with different random seeds.

Experimental Results
General Results. We evaluate the performance of our model against 3 SOTA models under low resource settings, where only 1% or 10% of the total amount of labeled pairs are used for training (See Table 2). Comparative experiment results on the 9 datasets are listed in Table 3.
Our decoupled framework achieves SOTA EM results on all nine datasets, and shows particularly strong performance on the Dirty datasets, with gains of 4.3%, 14.7%, and 8.4% in F 1 score on I-A 2 , D-A 2 , and D-S 2 , compared to the best baseline performance on the corresponding datasets. Our methods also outperform all baselines on the Structured datasets and two of the Real datasets (tying with Magellan on Toner). The improvement on the Real datasets is marginal because attribute values in the Real datasets are quite standard, so our model has few chances to fix noisy attribute values. Still, our methods achieve a high F 1 score (≥ 94.9%) on the Real datasets. These results indicate that our methods are both effective under low-resource settings and robust to noisy data.

Effectiveness under Low-Resource Settings
We reduce the training rate from 60% to 10% to see whether our method is sensitive to the number of labeled record pairs used as training resources. Experimental results are shown in Figure 3 (x-axis: rate of labeled data used in training; y-axis: F 1 score). HIF+KAT (red line) achieves stable performance as the number of labeled record pairs decreases, while the F 1 scores of DeepMatcher and HierMatcher both decrease. Moreover, our methods consistently outperform DeepMatcher and HierMatcher, from the low-resource setting all the way to the sufficient-resource setting. These results indicate that by exploiting unlabeled data, HIF alleviates the reliance on labeled record pairs.

Effectiveness on Noisy Heterogeneous Data
We manually degrade the quality of the datasets by randomly dropping p% of attribute values (p ranging from 0 to 40), and observe to what degree the feature representations delivered by HIF affect the EM matching decision. From left to right, the columns of subgraphs in Figure 3 show results with increasing dropping rate. On the I-A 1 dataset, the influence of the dropping rate on HIF+KAT is marginal: its F 1 score fluctuates around 95%. In contrast, the F 1 scores of both DeepMatcher and HierMatcher decrease as more attribute values are dropped. On the Phone dataset, the dropping rate's influence on HIF+KAT is also mild, especially when the training rate is low. These results show that HIF is effective at recovering noisy heterogeneous inputs.

Case Study for Interpretablity
The interpretability of our model means that the decision-making process of KAT can be easily transformed into EM rules whose structure is recommended by domain experts. Figure 4 illustrates a decision process of KAT that determines whether two records denote the same publication in the D-A 1 (DBLP and ACM) dataset. Each path from the root to a leaf node of the tree can be converted into an EM rule, which can be read as a descriptive rule:
Rule 1: if two records have different authors, they are different publications.
Rule 2: if two records have similar authors and similar titles, they are the same publication.
Rule 3: if two records have similar authors and dissimilar titles, they are not the same publication.
The soundness of such rules can be checked against our everyday experience.
Figure 4: The Key Attribute Tree generated by HIF+KAT XGB for the D-A 1 dataset.
Important features of KAT are as follows: (1) KAT is conditioned on attribute comparison; (2) KAT only selects a few key attributes for feature comparison. In our example, there are 4 attributes (Author, Title, Venue, and Conference) in the D-A 1 dataset;
KAT only selects Title and Author for EM decision making. The transformed rules meet the specifications of manually designed EM rules of domain experts (Fan et al., 2009;Singh et al., 2017). This kind of interpretability will ease the collaboration with domain experts, and increase the trustworthiness, compared with uninterpretable end-to-end Deep learning EM models.

Discussions
Ablation Analysis. Experiment results for the ablation models are listed in Table 3. On the one hand, HIF+LN and HIF+LR generally outperform DeepMatcher and HierMatcher on 7 datasets, with on-par performance on the 2 remaining Real datasets. This indicates that HIF and CFC together extract better comparison features than end-to-end neural methods under low-resource settings. On the other hand, HIF+LN and HIF+LR are weaker than the tree induction classifier, suggesting that KAT is more reliable. Among HIF+KAT ID3 , Magellan, and HIF-ALONE, HIF+KAT ID3 achieves the highest performance, indicating that comparisons on both the attribute value embeddings and the original attribute values are important. Among HIF-ALONE, HIF-WBOW, and HIF-EMB, HIF-ALONE outperforms the other two on the Dirty datasets, showing the positive effect of its information reconstruction.
Finally, comparing HIF+KAT with HIF+DT, we find that HIF+KAT performs better than HIF+DT on most of the datasets (except I-A 2 and Phone). This suggests that splitting on non-key attributes tends to hurt generalization.

Efficiency. Table 4 shows the running times of our methods and of the two neural baselines. Our methods are highly efficient for inference, because they are highly parallel and memory-saving. For example, on the Phone dataset our methods can run inference in a single batch, while HierMatcher can only run with a batch size of 4 on 24 GiB of RAM. The training efficiency of our method is comparable with the baselines, because when the training data is small enough, baseline models may finish one training epoch in only a few batches.

Sufficient Resource EM. Table 5 shows the results with sufficient training data, following the split method of Mudgal et al. (2018) and Fu et al. (2020). Our method outperforms the other methods on 4 datasets, and falls slightly behind on 5 datasets.
Related Work

The interpretability of neural models contributes to trust and safety, and has become one of the central issues in machine learning. Prior work examines interpretability in EM risk analysis. There are also attempts to explain EM models from the perspective of attention coefficients (Mudgal et al., 2018; Nie et al., 2019).

Conclusion
We present a decoupled framework for interpretable entity matching. It is robust to both noisy heterogeneous inputs and the scale of training resources. Experiments show that our method's decisions can be converted into interpretable rules, which can be inspected by domain experts and make the EM process more reliable.
In the future, it is intriguing to explore more efficient ways to exploit unlabeled data, such as leveraging connections among entities, or combining our framework with pre-trained language models. It is also valuable to explore how our heterogeneous information fusion module can boost other EM methods, such as injecting HIF representations as supplementary information into end-to-end models.

Ethical Considerations
Intended Use. The reported technique is intended for reliable entity matching in large-scale E-commerce products, where attribute values are mostly heterogeneous descriptive sentences. The 'low resource' feature is intended to reduce heavy labeling labor. The 'interpretability' is intended for risk control in entity matching.
Misuse Potential. As a matching/alignment technique, our method may be misused for matching private information.
Failure Modes. Our method provides a promising way to have domain experts check the generated rules, thus reducing the failure risk.
Energy and Carbon Costs. The efficiency test in Section 4.4 shows that our method costs less computation and is more energy-saving than existing methods.

A Detailed Experimental Results

Table 6 reports the experimental results under the low-resource setting with precision, recall, and F 1 measure (%); a dash (-) indicates that a method fails to converge on the dataset. Table 3 in the main text only shows the F 1 measure of all the methods. Here, we supplement the experimental results with precision (P = TP/(TP+FP)) and recall (R = TP/(TP+FN)) on the 9 datasets for a more comprehensive analysis. Our methods achieve the highest precision and recall on most of the datasets.

B Reproducibility Details
Each epoch of HIF training is evenly divided into 3 batches. The Title attribute values are padded to l = 64, and all other attribute values are padded to l = 32. We reduce the padding size on large datasets so that the experiments can be conducted on a single GPU. Chinese datasets are embedded with Tencent Embedding (Song et al., 2018), and English datasets use fastText embeddings (Bojanowski et al., 2017). The multi-head mechanism is used in the attention module. The embedding size d e is 300 for Chinese and 200 for English. AGG converts embeddings into d a -dimensional vectors, where d a = 100. PROP outputs through a 2-layer MLP with dimension size d = 64. The query and key vectors in the attention layer of PROP are 16-dimensional. During training, attribute values are masked with probability p = 0.4. The Adam optimizer (Kingma and Ba,