COMBO: A Complete Benchmark for Open KG Canonicalization

An open knowledge graph (KG) consists of (subject, relation, object) triples automatically extracted from millions of raw text sentences. The subject and object noun phrases and the relation phrases in an open KG suffer from severe redundancy and ambiguity and need to be canonicalized. Existing datasets for open KG canonicalization only provide gold entity-level canonicalization for noun phrases. In this paper, we present COMBO, a Complete Benchmark for Open KG canonicalization. Compared with existing datasets, we additionally provide gold canonicalization for relation phrases, gold ontology-level canonicalization for noun phrases, and the source sentences from which the triples are extracted. We also propose metrics for evaluating each type of canonicalization. On the COMBO dataset, we empirically compare previously proposed canonicalization methods as well as several simple baseline methods based on pretrained language models. We find that properly encoding the phrases in a triple using pretrained language models results in better relation canonicalization and ontology-level canonicalization of noun phrases. We release our dataset, baselines, and evaluation scripts at path/to/url.


Introduction
Large ontological knowledge graphs (KGs) such as Wikidata (Vrandečić and Krötzsch, 2014), DBpedia (Bizer et al., 2009), and Freebase (Bollacker et al., 2008) use a complex ontology to formalize and organize all entities and relations. Figure 1(a) shows an example ontological knowledge graph (Wikidata): "Joe Biden (Q6279)" is categorized as "Human (Q5)" in Wikidata and linked to "Scranton (Q271395)" with the relation "birth place (P19)", where the prefixes Q and P denote unique identifiers for entities and relations respectively in Wikidata. As ontological KGs are well organized and canonicalized, one can efficiently query information and extract knowledge from them to assist NLP models in various tasks (Rao et al., 2013; Luo et al., 2015; Cui et al., 2019; Murty et al., 2018; Wang et al., 2021; Liu et al., 2023; Gao et al., 2022; Liu et al., 2022). However, building and maintaining an accurate ontological KG requires large human effort (Färber et al., 2015).

† This work was done during Chengyue Jiang's internship at DAMO Academy, Alibaba Group. * Yong Jiang and Kewei Tu are corresponding authors.

Figure 1: Example of an ontological KG (a) and open KG triples (b). The differently colored bounding boxes and the tags on the open KG triples illustrate three types of gold canonicalization. Yellow (e.g., Q5 Human) shows the gold ontology-level NP cluster, salvia blue (e.g., Q6279) indicates the gold entity-level NP cluster, and purple (e.g., P19) indicates the gold RP cluster.

In contrast, open knowledge graphs such as ReVerb (Fader et al., 2011) and OLLIE (Mausam et al., 2012) are built from (subject, relation, object) triples automatically extracted from millions of raw text sentences by OpenIE systems (Angeli et al., 2015; Fader et al., 2011; Mausam et al., 2012). They are frequently used to assist in building ontological KGs (Martinez-Rodriguez et al., 2018; Dessì et al., 2021) and in slot filling (Broscheit et al., 2017). As OpenIE systems do not rely on pre-defined ontologies or human supervision, the extracted triples contain noun phrases (NPs) and relation phrases (RPs) that are not canonicalized. Take the open KG triples shown in Figure 1(b) as an example. The NPs "Joseph Biden" and "Biden" both refer to the US president Joe Biden, but the open KG regards them as two different nodes because of their different surface forms. On the other hand, the RP "was born in" means "birth time of" in the first triple but "birth place of" in the second and third, yet the open KG treats all three occurrences identically because of their identical surface form.

Existing open KG canonicalization datasets such as ReVerb-base, ReVerb-ambiguous (Galárraga et al., 2014), ReVerb45K (Vashishth et al., 2018) and CanonicNELL (Dash et al., 2021) mainly focus on entity-level canonicalization of NPs, providing the gold Entity-level NP Canonicalization (NPC-E). The blue tags and dashed boxes in Figure 1(b) show examples of NPC-E, e.g., "Biden" and "Joseph Biden" should be canonicalized as the same entity Q6279. However, these datasets do not provide gold RP Canonicalization (RPC) and do not consider Ontology-level Canonicalization of NPs (NPC-O). RPC canonicalizes RPs that denote the same relation together; for example, the second and third "was born in" in Figure 1(b) should be canonicalized into the same cluster, birth place (P19), separately from the first one, which means "birth time (P569)".
Similarly, NPC-O canonicalizes NPs that have the same type together; for example, "Scranton" should be canonicalized into the class "CountySeat" and, together with "Atlantic County", into the class "Local Government". NPC-O can be viewed as canonicalizing the special ontological relations such as "instance of" and "subclass of" represented by dotted arrows in Fig. 1(a). We formally define these tasks in Sec. 3.
RPC and NPC-O are important as parts of a canonicalization benchmark: (1) relations and an ontology are necessary for an expressive KG (Klyne and Carroll, 2004); (2) most KG queries involve relations and the ontology (e.g., the query "actress that was born in California" involves the relational constraint "X, birth place, California" and the ontological constraint "X, instance of, Actress").
In this paper, we present COMBO, a complete benchmark for open KG canonicalization consisting of three subtasks: besides NPC-E, which has been adequately studied in previous work, we additionally provide gold RPC and NPC-O along with their evaluation metrics. Gold NPC-O is obtained by querying Wikidata using SPARQL, and gold RPC is obtained by running Stanford OpenIE on sentences from Wiki20 (a distantly labeled relation extraction dataset) followed by a per-instance human revision process to ensure the quality of the extracted RPs. We introduce the data construction process in detail in Sec. 4. Our new benchmark makes it possible for the first time to quantitatively evaluate the full range of open KG canonicalization. We conduct comprehensive experiments to compare existing canonicalization methods as well as a few simple baseline methods proposed by us. Somewhat surprisingly, none of the existing methods utilizes pretrained contextualized word embeddings, probably because previous work only focuses on NPC-E, where NPs such as "Joe Biden" and "Joseph Biden" are often not very ambiguous, making contextualization not so helpful. However, contexts are more helpful in RPC and NPC-O. For RPC, relations are more ambiguous and diverse in surface form (e.g., "was born in" in Figure 1) and contexts are needed for disambiguation. For NPC-O, the RP and the other NP in the triple help in understanding the type of an NP. Therefore, our proposed baseline methods are based on pretrained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019b; Sun et al., 2019), which produce contextualized embeddings and have been shown to contain a certain amount of factual knowledge (Petroni et al., 2019; Lauscher et al., 2020). We find that, after properly encoding triples and contexts, our baseline methods perform well on all three subtasks compared with previous state-of-the-art methods, especially on RPC and NPC-O.
We also propose a triple-based pretraining method and find that it further boosts the performance on all subtasks. Therefore, our work provides strong baselines for future research on open KG canonicalization.
In summary, our contributions are threefold. First, we propose a complete definition of the open KG canonicalization problem along with its evaluation metrics. Second, we construct a complete benchmark for open KG canonicalization consisting of entity-level and ontology-level NP canonicalization and RP canonicalization. Third, we propose a stronger baseline based on autoencoding PLMs and conduct a comprehensive empirical comparison of canonicalization methods on our benchmark.

Open KG Canonicalization Datasets
We introduce existing open KG canonicalization datasets and COMBO; the statistics of all datasets are shown in Table 1.

CanonicNELL (Dash et al., 2021) is constructed from NELL (Mitchell and Fredkin, 2014) using the entity linking information for NPs (Pujara et al., 2013). Triples containing NPs without aliases are removed, and CanonicNELL does not provide source sentences.

COMBO (Ours) As shown in Table 1, the main difference between our dataset and the others is that we additionally provide gold RP canonicalization and gold ontology-level NP canonicalization. COMBO is constructed based on the large ontological KG Wikidata, the Stanford OpenIE system, the relation extraction dataset Wiki20m, and human revision, as detailed in the next section. Our dataset contains 18K triples with their source sentences, and we provide gold NPC-E, RPC and NPC-O annotations. Although our dataset is middle-sized, it has the longest average triple length and the largest number of unique NPs, indicating the diversity of the surface forms of NPs and RPs. Providing the source sentences of extracted OpenIE triples is natural but important, since additional contextual information can be helpful in understanding and disambiguating NPs and RPs. We ensure all triples come with rich context; the average length of a source sentence is 21. We show data samples in Appendix A and analyze our data in Sec. 4.

Task Definition and Evaluation Metrics
Task Definition The goal of open KG canonicalization is to assign the NPs and RPs in triples to clusters, such that NPs that refer to the same entity (NPC-E) or have the same type (NPC-O) are clustered together, and similarly, RPs that refer to the same relation are clustered together. Note that the task is unsupervised: the canonicalizer does not have access to gold annotations. We have N samples, each containing a triple and its corresponding source sentence. A cluster assignment must satisfy two conditions: (1) every phrase is assigned to at least one cluster; (2) the clusters are pairwise disjoint. NPC-E (Subj), NPC-E (Obj) and RPC satisfy both conditions. NPC-O (Subj) and NPC-O (Obj) are overlapping cluster assignments, i.e., we allow an NP to belong to multiple clusters, so they only need to satisfy the first condition. The task is to predict the cluster assignments of NPs and RPs given their source triples and sentences. Following previous work, we assume the number of clusters is unknown beforehand, and we split our data into dev (20%) and test (80%) sets.
Task Evaluation Most clustering algorithms, such as K-means (Lloyd, 1982) and Hierarchical Agglomerative Clustering (HAC) (Maimon and Rokach, 2005), produce non-overlapping cluster assignments, while several algorithms (e.g., HAC) can also produce hierarchical and overlapping cluster assignments. For the NPC-E subtask, we adopt the classic macro, micro and pairwise metrics to compare the gold and predicted NPC-E cluster assignments (please refer to App. C for details). For RPC, the macro metrics, which calculate the fraction of pure clusters, are too strict because gold RP clusters are large and hence unlikely to be pure. We therefore only use the micro and pairwise metrics to evaluate RPC.
For NPC-O, the gold cluster assignments are overlapping. If the predicted clusters are non-overlapping, we can apply the macro and pairwise metrics and a modified micro metric (Appendix D). If the predicted clusters are overlapping, say P = {C^p_1, . . . , C^p_M}, we propose the evaluation metrics J_{g→p} and J_{p→g} based on the Jaccard index (Jaccard, 1908; Tanimoto, 1958). J_{g→p} (Eq. 1) calculates the average Jaccard index between each gold cluster and its best-matched predicted cluster:

J_{g→p}(G, P) = (1/|G|) Σ_{g∈G} max_{p∈P} |g ∩ p| / |g ∪ p|    (1)

J_{p→g} is defined symmetrically, with the roles of G and P switched. Table 2 summarizes the evaluation metrics of each subtask.
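As a minimal illustration (our own sketch, not the released evaluation script), J_{g→p} can be computed over clusters represented as Python sets of phrase identifiers; `j_g2p` is a hypothetical name:

```python
def jaccard(a, b):
    # Jaccard index |a ∩ b| / |a ∪ b| of two phrase sets.
    return len(a & b) / len(a | b)

def j_g2p(gold, pred):
    # Average, over gold clusters, of the Jaccard index with the
    # best-matching predicted cluster; J_{p->g} is the same function
    # with the two arguments swapped.
    return sum(max(jaccard(g, p) for p in pred) for g in gold) / len(gold)
```

J_{p→g} is then simply `j_g2p(pred, gold)`.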

Construction of Our Dataset
We illustrate the construction process of COMBO in Figure 2. We rely on the Wiki20 dataset (Han et al., 2020) to obtain the source sentences and the gold NPC-E. Wiki20 is a large multi-domain relation extraction dataset constructed by aligning the Wikipedia corpus with Wikidata using distant supervision. As shown at the bottom of Figure 2, each sample of Wiki20 contains a sentence with the subject and object NP spans labeled and linked to entities in Wikidata, and the relation between them is also labeled. To ensure data quality, we use the recently revised version of Wiki20 (Gao et al., 2021), which aligns the Wiki20 relation labels with those of the supervised Wiki80 dataset (Han et al., 2019) and provides 56K human-annotated data samples. The subject and object NP spans and their entity linking information (e.g., Q6275) come from Wikipedia and have high precision, so we use them directly for the NPC-E task.
Extracting Relational Phrases Wiki20 only provides the relation label between two NPs for each instance. We further extract an RP for each Wiki20 instance to obtain full open KG triples. We first discard samples with the relation label "NA" and then run the Stanford OpenIE system on the Wiki20 sentences to extract triples. We choose Stanford OpenIE because, compared with older OpenIE systems such as ReVerb and NELL that were used in constructing previous datasets, it can leverage the linguistic structure of a sentence and generalizes better to out-of-domain and longer utterances (Angeli et al., 2015). We empirically find that Stanford OpenIE yields better recall and extracts more triples per sentence than ReVerb. We use the default model configuration of Stanford OpenIE. After obtaining the OpenIE triples of each non-NA Wiki20 instance, we select the triples whose subject NP and object NP are consistent with the NP spans provided by Wiki20. This triple selection step ensures that the NPs in the extracted triples have gold NPC-E annotations and removes wrong relation spans caused by wrongly extracted head and tail entities. We filter out 88% of the original triples in this step. Although this step reduces the noise caused by OpenIE, the extracted relation spans can still be wrong in two ways:

1. Invalid RP between correct NPs. For example, for the sentence ". . . the Althing, the ruling legislative body of Iceland . . . ", OpenIE wrongly extracts (the Althing, body of, Iceland), while the true triple should be (the Althing, ruling legislative body of, Iceland).
2. Correct NPs and a valid RP, but the RP does not imply the relation given by Wiki20. For the given relation mother of and the sentence ". . . bart and lisa got sent out of the house by marge simpson . . . ", the extracted triple (lisa, got sent out of the house by, marge simpson) is valid but does not imply the mother of relation.
Therefore, we manually check all the extracted triples for these two types of errors, correcting invalid relational phrase spans and removing triples whose RP cannot imply the given relation. We also standardize the form of RPs (e.g., OpenIE sometimes includes "a" and "the" and sometimes does not). The detailed guidelines for the check and revision process are shown in Appendix B. The error analysis is shown in Table 3.
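The automatic triple selection step above can be sketched as follows; the function name, the normalization, and the data format are our own assumptions for illustration, not the authors' code:

```python
def select_triples(openie_triples, gold_subj, gold_obj):
    # Keep only OpenIE triples whose subject and object NPs match the
    # gold NP spans provided by Wiki20 (after a simple normalization),
    # so every retained NP carries a gold NPC-E annotation.
    norm = lambda s: " ".join(s.lower().split())
    return [
        (s, r, o)
        for (s, r, o) in openie_triples
        if norm(s) == norm(gold_subj) and norm(o) == norm(gold_obj)
    ]
```

Triples surviving this filter then go through the manual RP revision described above.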
After all these steps, we obtain an open KG consisting of 18K triples. Similar to NPC-E, we use the relation labels given by the Wiki20 annotations (e.g., P19) as the gold RPC. As shown in Figure 3, the constructed open KG contains 79 relations in various domains, such as relations between geopolitical entities (mouth of the watercourse (7.3%), mountain range (3.8%), etc.), relations between people (spouse of (1.7%), child of (1.6%), etc.), and various relations between people and other objects (citizenship (2.4%), work location (3.7%), etc.). The extracted RPs are diverse in surface form; the number of distinct RPs is 3.2K. We show RP examples in Table 4. Some RPs represent multiple relations; one representative example is "in".

3 https://stanfordnlp.github.io/CoreNLP/

Figure 2: Steps of dataset construction.
Extracting Ontology To obtain ontology-level NP clusters for the NPC-O subtask, we query Wikidata for the classes of each entity. For example, to obtain the classes of "Joe Biden (Q6279)", we run the SPARQL (RDF query language) query "Q6279 P31 ?", where P31 represents the "instance of" relation in Wikidata. This query returns all the classes of an NP. If an NP does not have a class, its NPC-O annotation is the same as its NPC-E annotation. If an NP has more than one class, we include all of them in the NPC-O annotation (e.g., city and big city for "New York"). We query Wikidata using the third-party client Wikidata Integrator. As the ontology information in Wikidata is crowdsourced and contains errors, we apply pattern-based corrections to the extracted ontological NP clusters; for example, if an NP belongs to the cluster million cities, it should also belong to the cluster city. The resulting 2.9K ontological NP clusters form a 6-level overlapping hierarchy that allows a node to have more than one parent. We illustrate part of the hierarchy in Figure 5 and show the statistics of the top 12 ontological NP clusters in Figure 4.

Table 4: Example RPs for several relations.
spouse of: was married twice to, was married to, lover, consort of, second husband, widow of, 's wife, 's second wife, arranged a wedding with
mountain range: peak, large nunatak, summits of, the only crossing of, the most prominent feature of, small glacier, summits in, valley in, only crossing of, northernmost subrange of, in
location: is headquartered in, moved to, is carved on, took place at, ironworks in, was again held at, in
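The class query and the pattern-based correction can be sketched as below. This is our own illustration: the paper uses the Wikidata Integrator client rather than raw queries, the `wd:`/`wdt:` prefixes follow the conventions of the public Wikidata SPARQL endpoint, and `IMPLIED_BY` shows only the one correction pattern mentioned above:

```python
# Build a SPARQL query retrieving all classes ("instance of", P31)
# of a Wikidata entity such as Q6279 (Joe Biden).
def class_query(entity_id):
    return f"SELECT ?class WHERE {{ wd:{entity_id} wdt:P31 ?class . }}"

# Pattern-based correction: membership in a specific cluster implies
# membership in a more general one (illustrative subset of patterns).
IMPLIED_BY = {"million cities": "city"}

def correct_clusters(clusters):
    out = set(clusters)
    for c in clusters:
        if c in IMPLIED_BY:
            out.add(IMPLIED_BY[c])
    return out
```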

Comprehensive Evaluation of Methods
Our benchmark makes it possible to conduct a comprehensive empirical comparison of different methods on the full range of open KG canonicalization. Below we first give an overview of existing methods and propose a few new baseline methods. Then we present our experimental settings and results.

Previous Methods
Non-neural Methods Galárraga et al. (2014) and other non-neural methods assign the same representation to phrases with the same surface form and therefore cannot deal with ambiguity. None of these methods utilizes the original sentences to provide additional context. HAC is a popular choice of clustering algorithm because it does not require knowing the number of clusters; instead, it requires a distance threshold indicating when to stop merging. Unlike the number of clusters, this threshold can be tuned on a validation set and directly applied to the test set.

Figure 6: Pipeline of the proposed PLM-based method.

PLM-Based Baseline Methods
We propose a set of new baseline methods based on PLMs that produce contextualized embeddings. We use a pipeline similar to CESI, as shown in Figure 6. We encode NPs and RPs using different PLMs, PLM layers, and span representation methods, and apply HAC clustering over their representations. We use cosine similarity as the distance function and apply the complete-linkage variant of HAC because we prefer compact clusters and the single-linkage variant suffers from the chaining phenomenon. Before encoding, an optional triple-level continued pretraining step can be applied for better canonicalization. Token similarity and other side information are not used in our PLM-based method, but we generate them for our data using the code provided by Vashishth et al. (2018) to facilitate running other methods.
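The clustering step can be sketched with SciPy (a minimal illustration under our own naming, not the released code): complete-linkage HAC over cosine distances, cut at a distance threshold rather than a fixed number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def hac_clusters(embeddings, threshold):
    # Complete-linkage HAC over pairwise cosine distances; merging
    # stops at the distance threshold, so the number of clusters
    # need not be known in advance.
    dists = pdist(embeddings, metric="cosine")
    tree = linkage(dists, method="complete")
    return fcluster(tree, t=threshold, criterion="distance")
```

The threshold passed to `fcluster` is the quantity tuned on the dev set.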

Encoding
PLM Input Given a triple t_i = (s_i, r_i, o_i) and its corresponding source sentence c_i, we formulate the input to the PLM in four ways (sentence, triple, triple-sep, and sep) to obtain the contextualized embeddings of the words in the NPs and RP. Note that the fourth form, sep, independently encodes each phrase in the triple.

5 Huggingface models https://huggingface.co/
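The four input forms compared in Appendix G (sentence, triple, triple-sep, sep) can be sketched as plain strings before tokenization. The exact templates are not specified in the text, so the rendering below is a hypothetical one:

```python
def plm_inputs(subj, rel, obj, sentence, sep="[SEP]"):
    # One hypothetical rendering of the four input forms; "sep"
    # encodes each phrase of the triple independently.
    return {
        "sentence": sentence,                      # full source sentence
        "triple": f"{subj} {rel} {obj}",           # triple as one sequence
        "triple-sep": f"{subj} {sep} {rel} {sep} {obj}",
        "sep": [subj, rel, obj],                   # three separate inputs
    }
```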

Triple-level Pretraining
Inspired by the HolE algorithm used in previous work (Vashishth et al., 2018; Dash et al., 2021), we optionally perform a triple-level continued pretraining step before encoding to mimic the link prediction objective in knowledge graph embedding learning. For each sentence in our dataset, we randomly mask one phrase of the triple and then train the PLM to predict the whole masked span. We pretrain for 10 epochs using the AdamW optimizer (Loshchilov and Hutter, 2019) with a linear scheduler and an initial learning rate of 5e-5. We then use the continually pretrained version of the PLM as the phrase encoder. For comparison, we also use the subword-level MLM strategy of BERT (Devlin et al., 2019).
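The phrase-masking step can be sketched at the token level as below; `mask_phrase` is a hypothetical helper, and the actual training additionally involves a PLM tokenizer and MLM head:

```python
import random

def mask_phrase(tokens, phrase_spans, mask_token="[MASK]", rng=random):
    # Pick one phrase span of the triple (subject, relation, or
    # object) at random and mask every token inside it; the PLM is
    # then trained to predict the whole masked span.
    start, end = rng.choice(phrase_spans)
    masked = list(tokens)
    for i in range(start, end):
        masked[i] = mask_token
    return masked, tokens[start:end]
```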

Experimental Setup
For each subtask, we use grid search to tune the HAC distance threshold on the dev set to obtain non-overlapping clusters for all the methods. We select the best threshold based on the average of the metrics shown in Table 2. We obtain overlapping clusters for NPC-O from the full HAC hierarchy. As HAC is deterministic, we run experiments once for methods without randomness and four times for methods involving random initialization (CUVA, Random+HAC). As Token Sim+SI and VAEGMM based methods cannot provide overlapping cluster assignments, we do not evaluate them by metrics based on the Jaccard index. For our PLM-based  methods, we select the best input form and span representation strategy based on the dev set performance. We also compare different encoding strategies in Appendix G, and layer-wise performances in Appendix H.

Overall Results
We report averaged metrics for each subtask in Table 5 due to limited space; the full results are shown in Appendix F. The results show that our PLM-based baseline methods outperform previous methods in most cases, especially on RPC and NPC-O, indicating the importance of contextual information. Trivial baselines such as Token Sim+SI, Random+HAC and GloVe+HAC already perform well (around 80%) on NPC-E, because NPs referring to the same entity usually have similar surface forms, so correct prediction does not have to rely on contexts. However, they perform badly on RPC and NPC-O, because surface forms alone are no longer adequate for these two subtasks due to their higher ambiguity. CESI has bad RPC performance but is very competitive on NPC-E (Obj), where it is better than SpanBERT and RoBERTa without triple-level pretraining but still worse than the other PLM-based methods. CUVA performs badly in general, probably because it is sensitive to the VAEGMM initialization and relies heavily on side information. As our dataset has the longest average triple length and consists of texts from various domains, it could be more challenging for methods that do not use contextualized embeddings. Among the PLM-based methods, BERT leads to the best overall performance on NPC-E (Obj), RPC and NPC-O (Subj); ERNIE2.0 performs best on NPC-E (Subj) and NPC-O (Obj) and is comparable to BERT on NPC-E (Obj) and RPC; RoBERTa and SpanBERT fall behind but are still better than most non-PLM methods on NPC-O and RPC. Large PLMs are better than base PLMs on NPC-E, comparable on NPC-O, but worse on RPC. We also find triple-level pretraining effective, with a positive influence in most cases, especially on RPC (e.g., +9.73 for RoBERTa). In contrast, subword-level pretraining for BERT improves the object NPC but harms the subject NPC and RPC (-1.07 points). A detailed comparison between triple-level and subword-level pretraining is shown in Appendix I.

Conclusion
We present COMBO, a complete benchmark for open KG canonicalization. COMBO consists of three subtasks: entity-level NP canonicalization, ontology-level NP canonicalization, and RP canonicalization. We construct the data and propose evaluation metrics for RPC and NPC-O, which have not been adequately studied before. We also propose a stronger canonicalization method based on autoencoding PLMs and conduct a comprehensive comparison of different canonicalization methods on our dataset.
For future study, NPC-O and RPC still leave much room for improvement, and the efficiency of canonicalization methods is also worth studying. We also note that COMBO can additionally be used as a probing benchmark for PLMs and as a phrase-level relation classification dataset.

Acknowledgement
This work was supported by the National Natural Science Foundation of China (61976139) and by Alibaba Group through Alibaba Innovative Research Program.

Limitations
One limitation of our work is that the size of our dataset (18K) is relatively small compared to previous datasets (Table 1). Another limitation is that, like previous work, we perform clustering for the three subtasks and evaluate the canonicalization results independently, even though canonicalization of the head NP, tail NP and RP is in fact closely correlated. For example, the NPC-O clusters of the head NP and tail NP reveal the domain and range of the relation given by RPC. We leave joint canonicalization and evaluation as future work. Our proposed baseline is straightforward; we encourage future studies to investigate better canonicalization methods based on pretrained language models.

Ethics Statement
Our dataset is constructed based on Wiki20 and Wikidata, both of which are publicly available. Wiki20 is under the MIT License and Wikidata is under the Creative Commons CC0 License; both allow modification and distribution. Regarding the human revision during dataset construction, the annotators were properly paid. The annotation procedure lasted 12 days and the daily workload was relatively light: around 2.5 hours per day. During human inspection, we did not identify any unethical instances in our dataset. Regarding the baseline models, we use PLMs as our text encoders and our task is inherently unsupervised. As PLMs are learned from large corpora, our method can potentially create biased clustering results. How to de-bias PLM embeddings is worth further investigation.

A Dataset Examples
We show examples of our dataset in Figure 7.

B Guidelines for Revising Relational Phrases
B.1 RP Annotating Procedure

1. We first split all triples by relation, forming 79 JSON files for two major paid annotators; each annotator is responsible for around 40 relations.
2. Annotators should check one relation file at a time for annotating consistency, and start the next one after the former one is finished.
3. For each relation, annotators are given: (a) The original sentences of the relation with markers indicating the head NP, tail NP and RP extracted by OpenIE. (b) The name (e.g., composer), and the Wikidata ID (e.g., P86) of the gold relation.
4. Annotators should first understand the relation by querying Wikidata. Taking the relation "composer (P86)" as an example, annotators should first query Wikidata through the link https://www.wikidata.org/wiki/Property:P86 to obtain the definition of the relation and skim through its example RPs. Figure 8 shows the Wikidata page containing the definition and examples of "composer".

Figure 7: Examples of our dataset. "h" means head or subject NP, "r" means relation, "t" means tail or object NP, and "instance" stands for the gold ontology-level clusters.

We also provide several revision examples of wrong OpenIE triples; part of them are shown in Table 6 below.

B.3 Guideline for justifying if RP implies the given relation
As stated in the fourth step of the overall annotating process in Appendix B.1, we require annotators to fully understand the meaning of the given relation. For each triple, annotators should ask themselves whether the relational phrase can express the relation between the head and tail NP in the given context sentence. Note that we do not require that the relation be solely implied by the RP. For example, given the triple and its context "[mount elbert] in the [sawatch range] is the highest summit of the rocky mountains", it is impossible to infer the relation from the RP "in" alone, but the RP is a reasonable textual representation of the relation mountain range in this context. We find that the extracted RPs can imply the relation in most cases. We also show some concrete bad cases to the annotators to help them identify bad RPs; part of these examples are shown in Table 7.

C Classic Metrics
Gold cluster assignment: G = {C^g_1, . . . , C^g_K}; predicted cluster assignment: P = {C^p_1, . . . , C^p_M}, where C^g_i and C^p_j are a gold and a predicted cluster respectively.

Micro Metrics
P_micro(G, P) = (1/N) Σ_{g∈G} max_{p∈P} |g ∩ p|

R_micro(G, P) = P_micro(P, G)

where N is the total number of distinct phrases that appear in G (or P).
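These formulas translate directly to code; below is a minimal sketch with clusters as Python sets (our own illustration, not the released script):

```python
def micro_precision(gold, pred):
    # P_micro(G, P): each gold cluster contributes its largest
    # overlap with any predicted cluster; N is the total number of
    # distinct phrases.
    n = len(set().union(*gold))
    return sum(max(len(g & p) for p in pred) for g in gold) / n

def micro_recall(gold, pred):
    # R_micro(G, P) = P_micro(P, G).
    return micro_precision(pred, gold)
```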

Pairwise Metrics
For more details about the classic metrics, please refer to Sec. 7.2 of the CESI paper (Vashishth et al., 2018).

E Standardization
Following Timkey and van Schijndel (2021), we standardize the phrase embeddings to remove rogue dimensions. Denote by E_R ∈ R^{N×D} the embedding matrix of all RPs, where N is the number of triples and D is the dimension of the contextual embedding. The standardized RP embedding matrix is

E'_R = (E_R − μ) / σ

where μ, σ ∈ R^D are the per-dimension mean and standard deviation of E_R, applied to each row with broadcasting. We empirically find that the standardized phrase embeddings are better than the original ones in most cases.
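Per-dimension standardization is a one-liner with NumPy; the sketch below (our own illustration) adds a small epsilon against zero-variance dimensions:

```python
import numpy as np

def standardize(E, eps=1e-12):
    # Per-dimension standardization: subtract each dimension's mean
    # and divide by its standard deviation, muting "rogue"
    # high-magnitude dimensions.
    mu = E.mean(axis=0, keepdims=True)
    sigma = E.std(axis=0, keepdims=True)
    return (E - mu) / (sigma + eps)
```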

F Full Result
We show the full results in Table 8 and Table 9.

G Encoding Strategies

We compare the performance of the encoding strategies averaged over all subtask metrics and PLM models in Figure 9; task-specific and model-specific comparisons are shown in Figure 10. sentence is the best input form in general, probably because it is easier for a PLM to encode a valid sentence and the source sentence contains more context. sep is the worst on RPC because it separately encodes the RPs and NPs; however, it is comparable to triple-sep and triple on NPC-E because NPC-E requires less context. mean is the best strategy for phrase representation, which is consistent with the results obtained by Toshniwal et al. (2020), and diffsum is a bad choice for phrase canonicalization.

H Layerwise PLM Performance
We show the layer-wise performance of all PLMs (base) on all subtasks in Figure 11 and find that different layers perform differently on the three subtasks. We empirically find that lower layers [1, 2, 3] perform well for NPC-E, upper layers [10, 11, 12] perform best for RPC and NPC-O (Subj), while middle layers [3, 4, 5, 6, 7] perform relatively better on NPC-O (Obj). As context-specificity increases in upper layers (Ethayarajh, 2019), these results make sense: NPC-E requires less context, while RPC and NPC-O need more context.

I Triple Pretraining
We show the average performance difference after triple-level or subword-level pretraining for different PLMs and subtasks in Figure 12.

Figure 12: Average performance difference after triple-level or subword-level pretraining for different PLMs and subtasks.