Constructing Taxonomies from Pretrained Language Models

We present a method for constructing taxonomic trees (e.g., WordNet) using pretrained language models. Our approach is composed of two modules, one that predicts parenthood relations and another that reconciles those pairwise predictions into trees. The parenthood prediction module produces likelihood scores for each potential parent-child pair, creating a graph of parent-child relation scores. The tree reconciliation module treats the task as a graph optimization problem and outputs the maximum spanning tree of this graph. We train our model on subtrees sampled from WordNet, and test on non-overlapping WordNet subtrees. We show that incorporating web-retrieved glosses can further improve performance. On the task of constructing subtrees of English WordNet, the model achieves 66.7 ancestor F1, a 20.0% relative increase over the previous best published result on this task. In addition, we convert the original English dataset into nine other languages using Open Multilingual WordNet and extend our results across these languages.


Introduction
A variety of NLP tasks use taxonomic information, including question answering (Miller, 1998) and information retrieval (Yang and Wu, 2012). Taxonomies are also used as a resource for building knowledge and systematicity into neural models (Peters et al., 2019; Geiger et al., 2020; Talmor et al., 2020). NLP systems often retrieve taxonomic information from lexical databases such as WORDNET (Miller, 1998), which consists of taxonomies that contain semantic relations across many domains. While manually curated taxonomies provide useful information, they are incomplete and expensive to maintain (Hovy et al., 2009).

(* indicates equal contribution)

Traditionally, methods for automatic taxonomy construction have relied on statistics of web-scale corpora. These models generally apply lexico-syntactic patterns (Hearst, 1992) to large corpora, and use corpus statistics to construct taxonomic trees (e.g., Snow et al., 2005; Kozareva and Hovy, 2010; Bansal et al., 2014; Mao et al., 2018; Shang et al., 2020). In this work, we propose an approach that constructs taxonomic trees using pretrained language models (CTP). Our results show that direct access to corpus statistics at test time is not necessary; indeed, the re-representation latent in large-scale models of such corpora can be beneficial in constructing taxonomies. We focus on the task proposed by Bansal et al. (2014), in which a set of input terms must be organized into a taxonomic tree. We convert this dataset into nine other languages using synset alignments collected in OPEN MULTILINGUAL WORDNET and evaluate our approach in these languages.
CTP first fine-tunes pretrained language models to predict the likelihood of pairwise parent-child relations, producing a graph of parenthood scores. Then it reconciles these predictions with a maximum spanning tree algorithm, creating a tree-structured taxonomy. We further test CTP in a setting where models have access to web-retrieved glosses: we reorder the glosses and fine-tune the model on the reordered glosses in the parenthood prediction module.
We compare model performance on subtrees across semantic categories and subtree depth, provide examples of taxonomic ambiguities, describe conditions under which retrieved glosses produce greater increases in tree construction F1 score, and evaluate generalization to large taxonomic trees (Bordea et al., 2016a). These analyses suggest specific avenues of future improvements to automatic taxonomy construction.
Even without glosses, CTP achieves a 7.9-point absolute improvement in F1 score on the task of constructing WORDNET subtrees, compared to previous work. When given access to the glosses, CTP obtains an additional 3.2-point absolute improvement in F1 score. Overall, the best model achieves an 11.1-point absolute increase (a 20.0% relative increase) in F1 score over the previous best published results on this task.
Our paper is structured as follows. In Section 2 we describe CTP, our approach for taxonomy construction. In Section 3 we describe the experimental setup, and in Section 4 we present the results for various languages, pretrained models, and glosses. In Section 5 we analyze our approach and suggest specific avenues for future improvement. We discuss related work and conclude in Sections 6 and 7.

Constructing Taxonomies from Pretrained Models

Taxonomy Construction
We define taxonomy construction as the task of creating a tree-structured hierarchy T = (V, E), where V is a set of terms and E is a set of directed edges representing hypernym relations. In this task, the model receives a set of terms V , where each term can be a single word or a short phrase, and it must construct the tree T given these terms. CTP performs taxonomy construction in two steps: parenthood prediction (Section 2.2) followed by graph reconciliation (Section 2.3). We provide a schematic description of CTP in Figure 2 and provide details in the remainder of this section.

Parenthood Prediction
We use pretrained models (e.g., BERT) to predict the edge indicators I[parent(v_i, v_j)], which denote whether v_i is a parent of v_j, for all pairs (v_i, v_j) in the set of terms V = {v_1, ..., v_n} of each subtree T.
To generate training data from a tree T with n nodes, we create a positive training example for each of the n − 1 parenthood edges and a negative training example for each of the n(n−1) 2 − (n − 1) pairs of nodes that are not connected by a parenthood edge.
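The example generation above can be sketched as follows (a minimal illustration; function and variable names are ours, not from the released code):

```python
from itertools import combinations

def make_training_examples(edges):
    """Build parenthood classification examples from one gold subtree.

    `edges` is a list of (parent, child) pairs. Each edge yields one
    positive example; each unordered pair of nodes not connected by a
    parenthood edge yields one negative example, matching the
    n-1 positive / n(n-1)/2 - (n-1) negative counts above.
    """
    parent_pairs = set(edges)
    nodes = sorted({v for edge in edges for v in edge})
    examples = [(p, c, 1) for (p, c) in parent_pairs]
    for a, b in combinations(nodes, 2):
        if (a, b) not in parent_pairs and (b, a) not in parent_pairs:
            examples.append((a, b, 0))
    return examples
```

For a 4-node tree this yields 3 positive and 3 negative examples, as the counts above predict.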
We construct an input for each example using the template "v_i is a v_j", e.g., "A dog is a mammal." Different templates (e.g., "[TERM_A] is an example of [TERM_B]" or "[TERM_A] is a type of [TERM_B]") did not substantially affect model performance in initial experiments, so we use a single template. The inputs and outputs are modeled in the standard format (Devlin et al., 2019).
We fine-tune pretrained models to predict I[parent(v_i, v_j)], which indicates whether v_i is the parent of v_j, for each pair of terms using a sentence-level classification task on the input sequence.

Tree Reconciliation
We then reconcile the parenthood graph into a valid tree-structured taxonomy. We apply the Chu-Liu-Edmonds algorithm to the graph of pairwise parenthood predictions. This algorithm finds the maximum-weight spanning arborescence of a directed graph: it is the analog of the maximum spanning tree for directed graphs, and finds the highest-scoring arborescence in O(n^2) time (Chu and Liu, 1965).
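A compact recursive sketch of Chu-Liu/Edmonds (not the asymptotically optimal O(n^2) variant) is shown below. It assumes every non-root node has at least one scored incoming edge, which holds for a complete parenthood-score graph; all names are ours:

```python
def max_arborescence(nodes, weights, root):
    """Chu-Liu/Edmonds: maximum-weight spanning arborescence.

    `weights` maps directed edges (u, v) to scores. Returns a
    {child: parent} dict covering every non-root node.
    """
    nodes = set(nodes)
    # 1. Every non-root node greedily keeps its best-scoring incoming edge.
    parent = {}
    for v in nodes - {root}:
        inc = {u: w for (u, x), w in weights.items() if x == v and u != v}
        parent[v] = max(inc, key=inc.get)

    # 2. If the greedy choices form no cycle, they are the optimum.
    cycle = None
    for start in parent:
        seen, v = [], start
        while v in parent and v not in seen:
            seen.append(v)
            v = parent[v]
        if v in seen:
            cycle = seen[seen.index(v):]
            break
    if cycle is None:
        return parent

    # 3. Contract the cycle into a super-node; edges entering the cycle
    #    are re-weighted by the gain over the cycle edge they replace.
    cyc = set(cycle)
    super_node = ("contracted",) + tuple(sorted(map(str, cyc)))
    new_weights, origin = {}, {}
    for (u, v), w in weights.items():
        if u == v or (u in cyc and v in cyc):
            continue
        if u in cyc:                      # edge leaving the cycle
            key, adj = (super_node, v), w
        elif v in cyc:                    # edge entering the cycle
            key, adj = (u, super_node), w - weights[(parent[v], v)]
        else:
            key, adj = (u, v), w
        if key not in new_weights or adj > new_weights[key]:
            new_weights[key], origin[key] = adj, (u, v)

    # 4. Solve the smaller problem, then expand the super-node: the
    #    chosen edge into it breaks exactly one cycle edge.
    contracted = max_arborescence((nodes - cyc) | {super_node}, new_weights, root)
    result, broken = {}, None
    for v, u in contracted.items():
        ou, ov = origin[(u, v)]
        result[ov] = ou
        if v == super_node:
            broken = ov
    for c in cyc:
        if c != broken:
            result[c] = parent[c]
    return result
```

In practice an off-the-shelf implementation (e.g., networkx's maximum spanning arborescence) can be used instead.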

Web-Retrieved Glosses
We perform experiments in two settings: with and without web-retrieved glosses. In the setting without glosses, the model performs taxonomy construction using only the set of terms V. In the setting with glosses, the model is additionally provided with a list of glosses d_v^1, ..., d_v^n retrieved from the web for each term v ∈ V. Many of the terms in our dataset are polysemous, and the glosses contain multiple senses of the word. For example, the term dish appears in the subtree we show in Figure 1. The glosses for dish include (1) (telecommunications) A type of antenna with a similar shape to a plate or bowl, (2) (metonymically) A specific type of prepared food, and (3) (mining) A trough in which ore is measured.

Figure 2: A schematic depiction of CTP. We start with a set of terms (A). We fine-tune a pretrained language model to predict pairwise parenthood relations between pairs of terms (B), creating a graph of parenthood predictions (C) (Section 2.2). We then reconcile the edges of this graph into a taxonomic tree (E) (Section 2.3). Optionally, we provide the model ranked web-retrieved glosses, re-ordered based on relevance to the current subtree (D) (Section 2.4).
We reorder the glosses based on their relevance to the current subtree. We define the relevance of a given gloss d_v^i to subtree T as the cosine similarity between the average of the GloVe embeddings (Pennington et al., 2014) of the words in d_v^i (with stopwords removed) and the average of the GloVe embeddings of all terms v_1, ..., v_n in the subtree. This produces a reordered list of glosses for each term. We then fine-tune the pretrained models on pairs of terms (v_i, v_j), using an input sequence that contains each term together with its reordered glosses.
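The reranking step can be illustrated with a small self-contained sketch; a toy embedding table stands in for GloVe, the stopword list is an illustrative subset, and the helper names are ours:

```python
import math

STOPWORDS = {"a", "an", "the", "of", "in", "with"}  # illustrative subset

def rerank_glosses(glosses, subtree_terms, emb):
    """Order glosses by cosine similarity between the mean embedding of a
    gloss's content words and the mean embedding of the subtree's terms.
    `emb` maps a token to a vector (GloVe in the paper)."""
    dim = len(next(iter(emb.values())))

    def mean_vec(tokens):
        vecs = [emb[t] for t in tokens if t in emb and t not in STOPWORDS]
        if not vecs:
            return [0.0] * dim
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    target = mean_vec([t.lower() for t in subtree_terms])
    return sorted(glosses, reverse=True,
                  key=lambda g: cosine(mean_vec(g.lower().split()), target))
```

With a food-related subtree, a food-sense gloss of a polysemous term like dish would rank above its antenna sense.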

Experiments
In this section we describe the details of our datasets (Section 3.1) and our evaluation metrics (Section 3.2). We ran our experiments on a cluster with 10 Quadro RTX 6000 GPUs. Each training run finishes within one day on a single GPU.

Datasets
We evaluate CTP using the dataset of medium-sized WORDNET subtrees created by Bansal et al. (2014). This dataset consists of bottomed-out full subtrees of height 3 (i.e., trees with 4 nodes in the longest path from the root to any leaf) that contain between 10 and 50 terms. It comprises 761 English trees, with 533/114/114 train/dev/test trees, respectively.

Multilingual WORDNET
WORDNET was originally constructed in English, and has since been extended to many other languages such as Finnish (Lindén and Niemi, 2014), Italian (Magnini et al., 1994), and Chinese (Wang and Bond, 2013). Researchers have provided alignments from synsets in English WORDNET to terms in other languages, using a mix of automatic and manual methods (e.g., Magnini et al., 1994; Lindén and Niemi, 2014). These multilingual wordnets are collected in the OPEN MULTILINGUAL WORDNET project (Bond and Paik, 2012). The coverage of synset alignments varies widely. For instance, the alignment of ALBANET (Albanian) to English WORDNET covers 3.6% of the synsets in the Bansal et al. (2014) dataset, while the FINNWORDNET (Finnish) alignment covers 99.6% of the synsets in the dataset.
Since these wordnets do not include alignments to all of the synsets in the English dataset, we convert the English dataset to each target language using alignments specified in WORDNET as follows. We first exclude all subtrees whose roots are not included in the alignment between the WORDNET of the target language and English WORDNET. For each remaining subtree, we remove any node that is not included in the alignment. Then we remove all remaining nodes that are no longer connected to the root of the corresponding subtrees. We describe the resulting dataset statistics in Table 8 in the Appendix.
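The conversion procedure can be sketched as follows (our own minimal rendering of the pruning steps; the `aligned` mapping stands in for the Open Multilingual Wordnet alignment):

```python
def convert_subtree(root, edges, aligned):
    """Project an English subtree into a target language.

    `aligned` maps an English synset to its target-language term. The
    whole tree is dropped if the root is unaligned; otherwise unaligned
    nodes are removed, which also disconnects (and thus removes) their
    descendants, per the procedure described above.
    """
    if root not in aligned:
        return None
    children = {}
    for p, c in edges:
        children.setdefault(p, []).append(c)
    kept, stack = [], [root]
    while stack:
        p = stack.pop()
        for c in children.get(p, []):
            if c in aligned:          # skipping an unaligned node prunes
                kept.append((aligned[p], aligned[c]))
                stack.append(c)       # everything below it as well
    return kept
```

Note that removing an unaligned interior node implicitly removes its entire subtree, since those nodes are no longer connected to the root.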

Evaluation Metrics
As with previous work (Bansal et al., 2014; Mao et al., 2018), we report the ancestor F1 score, F1 = 2PR / (P + R), where P = |IS_A_PREDICTED ∩ IS_A_GOLD| / |IS_A_PREDICTED| and R = |IS_A_PREDICTED ∩ IS_A_GOLD| / |IS_A_GOLD|, and IS_A_PREDICTED and IS_A_GOLD denote the sets of predicted and gold ancestor relations, respectively. We report the mean precision (P), recall (R), and F1 score, averaged across the subtrees in the test set.
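A direct implementation of this metric, assuming each tree is given as a set of parent→child edges (helper names are ours):

```python
def ancestor_f1(pred_edges, gold_edges):
    """Ancestor precision/recall/F1 between two trees of (parent, child) edges."""
    def ancestor_pairs(edges):
        children = {}
        for p, c in edges:
            children.setdefault(p, []).append(c)

        pairs = set()
        def descend(node, ancestors):
            for a in ancestors:
                pairs.add((a, node))
            for c in children.get(node, []):
                descend(c, ancestors + [node])

        parents = {p for p, _ in edges}
        child_set = {c for _, c in edges}
        for root in parents - child_set:
            descend(root, [])
        return pairs

    pred, gold = ancestor_pairs(pred_edges), ancestor_pairs(gold_edges)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Note that a flattened prediction can still score perfect precision, because every predicted ancestor pair may be a true ancestor pair; it loses recall on the skipped intermediate relations.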

Models
In our experiments, we use pretrained models from the Huggingface library (Wolf et al., 2019). For the English dataset we experiment with BERT, BERT-Large, and ROBERTA-Large in the parenthood prediction module. For the other languages, we experiment with multilingual BERT and language-specific pretrained models (detailed in Section 9 in the Appendix). We fine-tuned each model using three learning rates {1e-5, 1e-6, 1e-7}. For each model, we ran three trials using the learning rate that achieved the highest dev F1 score. In Section 4, we report the average scores over three trials. We include full results in Tables 13 and 15 in the Appendix. The code and datasets are available at https://github.com/cchen23/ctp.

Main Results
Our approach, CTP, outperforms existing state-of-the-art models on the WORDNET subtree construction task. In Table 1 we provide a comparison of our results to previous work. Even without retrieved glosses, CTP with ROBERTA-Large in the parenthood prediction module achieves higher F1 than previously published work. CTP achieves additional improvements when provided with the web-retrieved glosses described in Section 2.4.
We compare different pretrained models for the parenthood prediction module, and provide these comparisons in Section 4.3.

Web-Retrieved Glosses
In Table 2 we show the improvement in taxonomy construction with two types of glosses: glosses retrieved from the web (as described in Section 2.4) and glosses obtained directly from WORDNET. We consider the WORDNET glosses an oracle setting, since these glosses are directly generated from the gold taxonomies; thus, we focus on the web-retrieved glosses as the main setting. Models produce additional improvements when given WORDNET glosses. These improvements suggest that reducing the noise in web-retrieved glosses could improve automated taxonomy construction.

Comparison of Pretrained Models
For both settings (with and without web-retrieved glosses), CTP attains the highest F1 score when ROBERTA-Large is used in the parenthood prediction step. As we show in Table 3, the average F1 score improves both with increased model size and with switching from BERT to ROBERTA.

Aligned Wordnets
We extend our results to the nine non-English alignments of the Bansal et al. (2014) dataset that we created. In Table 4 we compare our best model in each language to a random baseline. We detail the random baseline in Section 9 in the Appendix and provide results from all tested models in Section 17 in the Appendix. CTP's F1 score on non-English languages is substantially worse than its F1 score on English trees. Lower F1 scores in non-English languages are likely due to multiple factors. First, English pretrained language models generally perform better than models in other languages because of the additional resources devoted to the development of English models (see, e.g., Bender, 2011; Mielke, 2016; Joshi et al., 2020). Second, OPEN MULTILINGUAL WORDNET aligns wordnets to English WORDNET, but the subtrees contained in English WORDNET might not be the natural taxonomy in other languages. However, we note that scores across languages are not directly comparable, as dataset size and coverage vary across languages (as we show in Table 8).
These results highlight the importance of evaluating on non-English languages and the difference in available lexical resources between languages. Furthermore, they provide strong baselines for future work in constructing wordnets in different languages.

Analysis
In this section we analyze the models both quantitatively and qualitatively. Unless stated otherwise, we analyze our model on the dev set and use ROBERTA-Large in the parenthood prediction step.

Models Predict Flatter Trees
In many error cases, CTP predicts a tree with edges that connect terms to their non-parent ancestors, skipping the direct parents. We show an example of this error in Figure 3. In this fragment (taken from one of the subtrees in the dev set), the model predicts a tree in which botfly and horsefly are direct children of fly, bypassing the correct parent gadfly. On the dev set, 38.8% of incorrect parenthood edges were cases of this type of error. Missing edges result in predicted trees that are generally flatter than the gold trees. While all the gold trees have a height of 3 (4 nodes in the longest path from the root to any leaf), the predicted dev trees have a mean height of 2.61. Our approach scores the edges independently, without considering the structure of the tree beyond local parenthood edges. One potential way to address the bias towards flat trees is to also model the global structure of the tree (e.g., ancestor and sibling relations).

CTP generally makes more errors in predicting edges involving nodes that are farther from the root of each subtree. In Table 5 we show the recall of ancestor edges, categorized by the number of parent edges d between the subtree root and the descendant of each edge, and the number of parent edges l between the ancestor and descendant of each edge. The model has lower recall for edges involving descendants that are farther from the root (higher d). In permutation tests of the correlation between edge recall and d conditioned on l, 0 out of 100,000 permutations yielded a correlation at least as extreme as the observed correlation.
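The tree-height statistic above (gold height 3 vs. mean predicted height 2.61) can be computed with a small helper like the following (ours, for illustration):

```python
def tree_height(edges, root):
    """Number of edges on the longest path from `root` to any leaf."""
    children = {}
    for p, c in edges:
        children.setdefault(p, []).append(c)

    def height(v):
        kids = children.get(v, [])
        return 0 if not kids else 1 + max(height(c) for c in kids)

    return height(root)
```

A flattened prediction that attaches all descendants directly to the root has height 1 regardless of the gold tree's depth.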

Subtrees Higher Up in WORDNET are Harder, and Physical Entities are Easier than Abstractions

Subtree performance also corresponds to the depth of the subtree in the entire WORDNET hierarchy. The F1 score is positively correlated with the depth of the subtree in the full WORDNET hierarchy, with a correlation of 0.27 (significant at p=0.004 using a permutation test with 100,000 permutations).

The subtrees included in this task span many different domains, and can be broadly categorized into subtrees representing concrete entities (such as telephone) and those representing abstractions (such as sympathy). WORDNET provides this categorization using the top-level synsets physical_entity.n.01 and abstraction.n.06. These categories are direct children of the root of the full WORDNET hierarchy (entity.n.01), and split almost all WORDNET terms into two subsets. The model produces a mean F1 score of 60.5 on subtrees in the abstraction subsection of WORDNET, and a mean F1 score of 68.9 on subtrees in the physical_entity subsection. A one-sided Mann-Whitney rank test shows that the model performs systematically worse on abstraction subtrees than on physical entity subtrees (p=0.01).

With models pretrained on large web corpora, the distinction between the settings with and without access to the web at test time is less clear, since large pretrained models can be viewed as a compressed version of the web. To quantify the extent to which the evaluation setting measures model capability to generalize to taxonomies consisting of unseen words, we count the number of times each term in the WORDNET dataset occurs in the pretraining corpus. We note that the WORDNET glosses do not directly appear in the pretraining corpus. In Figure 4 we show the distribution of the frequency with which the terms in the Bansal et al. (2014) dataset occur in the BERT pretraining corpus. We find that over 97% of the terms occur at least once in the pretraining corpus. However, the majority of the terms are not very common words, with over 80% of terms occurring fewer than 50k times. While this shows that the current setting does not measure model ability to generalize to completely unseen terms, we find that the model does not perform substantially worse on edges that contain terms that do not appear in the pretraining corpus. Furthermore, the model is able to do well on rare terms. Future work can investigate model ability to construct taxonomies from terms that are not covered in pretraining corpora.

WORDNET Contains Ambiguous Subtrees

Some trees in the gold WORDNET hierarchy contain ambiguous edges. Figure 5 shows one example. In this subtree, the model predicts arteriography as a sibling of arthrography rather than as its child. The definitions of these two terms suggest why the model may have considered these terms as siblings: arteriograms produce images of arteries while arthrograms produce images of the inside of joints. In Figure 6 we show a second example of an ambiguous tree. The model predicts good faith as a child of sincerity rather than as a child of honesty, but the correct hypernymy relation between these terms is unclear to the authors, even after referencing multiple dictionaries.
These examples point to the potential of augmenting or improving the relations listed in WORD-NET using semi-automatic methods.

Web-Retrieved Glosses Are Beneficial When They Contain Lexical Overlap
We compare the predictions of ROBERTA-Large, with and without web glosses, to understand what kind of glosses help. We split the parenthood edges in the gold trees into two groups based on the glosses: (1) lexical overlap (the parent term appears in the child gloss and/or the child term appears in the parent gloss) and (2) no lexical overlap (neither the parent term nor the child term appears in the other term's gloss). We find that for edges in the "lexical overlap" group, glosses increase the recall of the gold edges from 60.9 to 67.7. For edges in the "no lexical overlap" group, retrieval decreases the recall (edge recall changes from 32.1 to 27.3).

Pretraining and Tree Reconciliation Both Contribute to Taxonomy Construction
We performed an ablation study in which we ablated either the pretrained language model in the parenthood prediction step or the tree reconciliation step. We ablated the pretrained language models in two ways. First, we used a one-layer LSTM on top of GloVe vectors instead of a pretrained language model as the input to the fine-tuning step, and then performed tree reconciliation as before. Second, we used a randomly initialized ROBERTA-Large model in place of a pretrained network, and then performed tree reconciliation as before. We ablated the tree reconciliation step by substituting the graph-based reconciliation with a simpler thresholding step: using the parenthood prediction scores from the fine-tuned ROBERTA-Large model, we output a parenthood relation between all pairs of words with softmax score greater than 0.5.
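The thresholding ablation is straightforward to state in code (a sketch; the score dictionary and threshold variable are illustrative):

```python
def threshold_parents(scores, tau=0.5):
    """Reconciliation ablation: keep every (parent, child) pair whose
    parenthood softmax score exceeds `tau`, with no tree constraint."""
    return {pair for pair, score in scores.items() if score > tau}
```

Unlike the spanning-arborescence step, this can emit multiple parents per node or no tree at all, which is exactly what the ablation measures.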
In Table 6, we show the results of our ablation experiments. These results show that both steps (using pretrained language models for parenthood prediction and performing tree reconciliation) are important for taxonomy construction. Moreover, they show that the incorporation of a new information source (knowledge learned by pretrained language models) produces the majority of the performance gains.

Models Struggle to Generalize to Large Taxonomies
To test generalization to large subtrees, we tested our models on the English environment and science taxonomies from SemEval-2016 Task 13 (Bordea et al., 2016a). Each of these taxonomies consists of a single large taxonomic tree with between 125 and 452 terms. Following Mao et al. (2018) and Shang et al. (2020), we used the medium-sized trees from Bansal et al. (2014) to train our models. During training, we excluded all medium-sized trees from the Bansal et al. (2014) dataset that overlapped with the terms in the SemEval-2016 Task 13 environment and science taxonomies. In Table 7 we show the performance of the ROBERTA-Large CTP model. We show the Edge-F1 score rather than the Ancestor-F1 score in order to compare to previous work. Although the CTP model outperforms previous work in constructing medium-sized taxonomies, this model is limited in its ability to generalize to large taxonomies. Future work can incorporate modeling of the global tree structure into CTP.

Related Work
Taxonomy induction has been studied extensively, with both pattern-based and distributional approaches. Typically, taxonomy induction involves hypernym detection, the task of extracting candidate terms from corpora, and hypernym organization, the task of organizing the terms into a hierarchy.
While we focus on hypernym organization, many systems have studied the related task of hypernym detection. Traditionally, systems have used pattern-based features such as Hearst patterns to infer hypernym relations from large corpora (e.g., Hearst, 1992; Snow et al., 2005; Kozareva and Hovy, 2010). For example, Snow et al. (2005) propose a system that extracts pattern-based features from a corpus to predict hypernymy relations between terms. Kozareva and Hovy (2010) propose a system that similarly uses pattern-based features to predict hypernymy relations, in addition to harvesting relevant terms and using a graph-based longest-path approach to construct a legal taxonomic tree.
Later work suggests that, for hypernymy detection tasks, pattern-based approaches outperform those based on distributional models (Roller et al., 2018). Subsequent work pointed out the sparsity that exists in pattern-based features derived from corpora, and showed that combining distributional and pattern-based approaches can improve hypernymy detection by addressing this problem (Yu et al., 2020).
In this work we consider the task of organizing a set of terms into a medium-sized taxonomic tree. Bansal et al. (2014) introduce this task and propose a model that incorporates siblinghood information. Mao et al. (2018) propose a reinforcement learning based approach that combines the stages of hypernym detection and hypernym organization. In addition to the task of constructing medium-sized WORDNET subtrees, they show that their approach can leverage global structure to construct much larger taxonomies from the SemEval-2016 Task 13 benchmark dataset, which contain hundreds of terms (Bordea et al., 2016b). Shang et al. (2020) apply graph neural networks and show that they improve performance in constructing large taxonomies in the SemEval-2016 Task 13 dataset.

Table 7: Generalization to large taxonomic trees. Models trained on medium-sized taxonomies generalize poorly to large taxonomies. Future work can improve the usage of global tree structure with CTP.
Another relevant line of work involves extracting structured declarative knowledge from pretrained language models. For instance, Bouraoui et al. (2019) showed that a wide range of relations can be extracted from pretrained language models such as BERT. Our work differs in that we consider tree structures and incorporate web glosses. Bosselut et al. (2019) use pretrained models to generate explicit open-text descriptions of commonsense knowledge. Other work has focused on extracting knowledge of relations between entities (Petroni et al., 2019; Jiang et al., 2020). Blevins and Zettlemoyer (2020) use a similar approach to ours for word sense disambiguation, and encode glosses with pretrained models.

Discussion
Our experiments show that pretrained language models can be used to construct taxonomic trees. Importantly, the knowledge encoded in these pretrained language models can be used to construct taxonomies without additional web-based information. This approach produces subtrees with higher mean F1 scores than previous approaches, which used information from web queries.
When given web-retrieved glosses, pretrained language models can produce improved taxonomic trees. The gain from accessing web glosses shows that incorporating both implicit knowledge of input terms and explicit textual descriptions of knowledge is a promising way to extract relational knowledge from pretrained models. Error analyses suggest specific avenues of future work, such as improving predictions for subtrees corresponding to abstractions, or explicitly modeling the global structure of the subtrees.

Experiments on aligned multilingual WORDNET datasets emphasize that more work is needed in investigating the differences between taxonomic relations in different languages, and in improving pretrained language models in non-English languages. Our results provide strong baselines for future work on constructing taxonomies for different languages.

Ethical Considerations
While taxonomies (e.g., WORDNET) are often used as ground-truth data, they have been shown to contain offensive and discriminatory content (e.g., Broughton, 2019). Automatic systems built with pretrained language models can reflect and exacerbate the biases contained in their training corpora. More work is needed to detect and combat biases that arise when constructing and evaluating taxonomies.
Furthermore, we used previously constructed alignments to extend our results to wordnets in multiple languages. While considering English WORDNET as the basis for the alignments allows for convenient comparisons between languages and is the standard method for aligning wordnets across languages, continued use of these alignments to evaluate taxonomy construction imparts undue bias towards conceptual relations found in English.

Table 9 shows the results for the learning rate trials for the ablation experiment. Table 11 shows the results for the test trials for the SemEval experiment; these results all use the ROBERTA-Large model in the parenthood prediction step.

Random Baseline for Multilingual WORDNET Datasets
To compute the random baseline in each language, we randomly construct a tree containing the nodes of each test tree and compute the ancestor precision, recall, and F1 score on the randomly constructed trees. We include the F1 scores for three trials in Table 12.
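One simple way to construct such a random tree (the exact sampling scheme is not specified here; this version attaches each node to a uniformly chosen earlier node in a shuffled order) is:

```python
import random

def random_tree(nodes, seed=0):
    """Construct a random rooted tree over `nodes`: shuffle, take the
    first node as root, and attach each later node to a uniformly
    chosen earlier node. Returns (root, edges)."""
    rng = random.Random(seed)
    order = list(nodes)
    rng.shuffle(order)
    edges = [(order[rng.randrange(i)], order[i]) for i in range(1, len(order))]
    return order[0], edges
```

By construction every non-root node has exactly one parent, so the output is always a valid tree over the input terms.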
