COCO-EX: A Tool for Linking Concepts from Texts to ConceptNet

In this paper we present COCO-EX, a tool for Extracting Concepts from texts and linking them to the ConceptNet knowledge graph. COCO-EX extracts meaningful concepts from natural language texts and maps them to conjunct concept nodes in ConceptNet, utilizing the maximum of relational information stored in the ConceptNet knowledge graph. COCO-EX takes into account the challenging characteristics of ConceptNet, namely that, unlike in conventional knowledge graphs, nodes are represented as non-canonicalized, free-form text. This means that i) concepts are not normalized; ii) they often consist of several different, nested phrase types; and iii) many of them are uninformative, over-specific, or misspelled. A commonly used shortcut to circumvent these problems is to apply string matching. We compare COCO-EX to this method and show that COCO-EX enables the extraction of meaningful, important concepts rather than over-specific or uninformative ones, and gives access to more of the relational information stored in the knowledge graph.


Introduction
ConceptNet is a semantic network which contains general commonsense facts about the world, e.g. Birds can fly or Computers are used for sending e-mails (Liebermann, 2008). It originates from the crowdsourcing project Open Mind Common Sense (Speer et al., 2008) that acquired commonsense knowledge from contributions over the web. The current version also includes expert-created resources such as WordNet (Fellbaum, 1998) and JMDict (Breen, 2004), other crowdsourced resources such as Wiktionary, knowledge obtained through games with a purpose such as Verbosity, and automatically extracted knowledge (cf. Speer et al. (2008)). Knowledge facts in ConceptNet are represented as triples, e.g.
As opposed to conventional knowledge bases such as NELL (Carlson et al., 2010), Freebase (Bollacker et al., 2008), or YAGO (Nickel et al., 2012), the nodes in ConceptNet are represented as non-canonicalized, free-form text. This means that (I) concept nodes are not normalized: e.g. bake cake, bake cakes, baking cake, and baking cakes are represented as distinct nodes; likewise bin bag, binbag, bin bags, and bin-bag are separate nodes in ConceptNet. (II) concept nodes often consist of multi-word expressions, which can be very long and complex. Often they consist of several nested phrase types, e.g., buying the ingredients of the recipe, or a friend was celebrating a birthday. (III) Since large parts of ConceptNet have been crowdsourced, it contains noise (e.g., typos), uninformative concepts (e.g., there, it's), or very specific concepts (e.g., the second concept in the triple: [compute,HASPROP,more complex than pencil]).
These specific properties lead to a larger number of nodes and a substantially sparser graph compared to conventional knowledge bases. This in turn is challenging for tasks such as knowledge base completion (cf. Li et al. (2016); Saito et al. (2018); Bosselut et al. (2019); Malaviya et al. (2020)); the semantic representation of nodes and edges (Speer and Lowry-Duda, 2017); or the learning of new relations (dos Santos et al., 2015; Becker et al., 2019; Trisedya et al., 2019).
Moreover, non-canonicalized nodes become challenging when merging knowledge bases, as in Faralli et al. (2020), who introduce a graph database merging multiple hypernymy graphs extracted from ConceptNet, DBpedia, WebIsAGraph, WordNet, and Wikipedia. They find that only 25% of the edges connect nodes from ConceptNet to other databases, which can be traced back to the fact that ConceptNet nodes are non-canonicalized, as opposed to common knowledge bases.
Finally, free-form concept nodes become problematic when we aim to project a ConceptNet subgraph from natural language texts by mapping phrases from natural language text to nodes in ConceptNet. In recent approaches, simple string matching has been applied to perform such a mapping (e.g. Lin et al. (2019); Wang et al. (2020)). Given the non-normalized nature of the concepts in ConceptNet, this can, however, result in an incomplete and noisy mapping: e.g., if the word "brains" occurs in a text, it can be mapped to the ConceptNet node brains (which is connected by 131 edges within ConceptNet), but not to brain (which is connected by 1799 edges). Therefore, a lot of relational knowledge stored in ConceptNet gets lost when mapping natural language text to concepts in ConceptNet via string matching. Moreover, since ConceptNet contains many nodes that don't represent meaningful concepts (e.g. yes, there, it's, the), simple string matching can lead to the extraction of concepts that will most likely be useless for downstream applications.
Motivated by these observations, we built a Concept Extraction Tool for ConceptNet, CoCo-Ex, which we present in this paper. COCO-EX is a tool written in Python 3.6 that selects meaningful concepts, possibly consisting of multiple tokens, from natural language texts; it maps them to a collection of concept nodes in ConceptNet, utilizing the maximum of relational information stored in the knowledge graph. It is thus well suited for identifying and extracting concepts from natural language texts and mapping them to ConceptNet, e.g., to project knowledge subgraphs from texts (Paul and Frank, 2019), or for detecting and classifying knowledge relations instantiated within texts (Becker et al., 2019).
We describe our Concept Extraction Tool COCO-EX in Section 2. In Section 3 we evaluate the benefits of COCO-EX in a practical application scenario, comparing it to simple string matching, by evaluating the retrieved concepts and their connectivity both automatically and manually. We conclude with a summary and results in Section 4.

COCO-EX: Extracting Concepts from Text and Mapping them to ConceptNet
COCO-EX is a pipeline implementation comprising several stages as shown in Figure 1. In Step 1, we extract candidate phrases from a given text, which we preprocess in Step 2. In Step 3, we map the preprocessed phrases to ConceptNet concepts, which we preprocess in the same manner: We first create a dictionary based on ConceptNet, where we gather all concepts that are conceptually related (that is, referring to a similar or the same entity or event), but represented as distinct nodes. In this dictionary we then look up the preprocessed candidate phrases and get all ConceptNet nodes which contain them. In order to avoid obtaining conceptually unrelated nodes, in Step 4 we establish a method that allows us to filter out nodes that are not similar enough to the candidate phrase using similarity metrics and vector space representations.
Step 1: Extracting Candidate Phrase Types. We start by extracting candidate phrases from a given text using the Stanford Constituency parser (Mi and Huang, 2015). We extract noun phrases, verb phrases and adjective phrases. We find that some verb phrases are very long and specific and therefore unlikely to find exact matches in ConceptNet (e.g., "be sorted into different wheelie bins"). Yet, ConceptNet concepts often consist of general verb-object phrases, such as walk the dog; cook dinner; bake a cake. To accommodate this, we create, for every verbal phrase we extract from the text, additional versions (i.e., chunks) that exclude subordinated prepositional phrases and/or noun phrases (e.g., for "be sorted into different wheelie bins" we additionally extract "be sorted into" and "be sorted"). Addressing the fact that nodes in ConceptNet are of different lengths and often consist of several nested phrases, we keep all the original complex verbal phrases; the reduced chunks; and the split-off nested, subordinated phrases, which we again split into chunks (here: "different wheelie bins", "wheelie bins", and "bins").
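The chunking described in Step 1 can be illustrated with a minimal sketch. It is not the tool's implementation: the nested-tuple tree encoding, the function names, and the choice to chunk nested noun phrases by dropping leading tokens are all assumptions made for illustration. The example tree is written by hand for the phrase discussed above rather than produced by a parser.

```python
# Toy constituency tree as nested tuples: (label, child, ...); a leaf is (tag, word).
# This encoding is an illustrative assumption, not the parser's output format.
VP = ("VP", ("VB", "be"),
      ("VP", ("VBN", "sorted"),
       ("PP", ("IN", "into"),
        ("NP", ("JJ", "different"), ("NN", "wheelie"), ("NNS", "bins")))))


def is_leaf(tree):
    """A leaf pairs a POS tag with a word string."""
    return len(tree) == 2 and isinstance(tree[1], str)


def leaves(tree):
    """Return the words covered by a (sub)tree, left to right."""
    if is_leaf(tree):
        return [tree[1]]
    return [word for child in tree[1:] for word in leaves(child)]


def strip(tree, drop):
    """Copy the tree, removing every subtree whose label is in `drop`."""
    if is_leaf(tree):
        return tree
    kept = [strip(child, drop) for child in tree[1:] if child[0] not in drop]
    return (tree[0], *kept)


def subtrees(tree, labels):
    """Yield all phrase-level subtrees whose label is in `labels`."""
    if is_leaf(tree):
        return
    if tree[0] in labels:
        yield tree
    for child in tree[1:]:
        yield from subtrees(child, labels)


def candidate_chunks(vp):
    """Full phrase, reduced versions without NPs/PPs, and nested NP chunks."""
    chunks = [" ".join(leaves(vp))]
    for drop in ({"NP"}, {"PP"}):
        chunks.append(" ".join(leaves(strip(vp, drop))))
    for np in subtrees(vp, {"NP"}):
        words = leaves(np)
        # Assumption: nested phrases are chunked by dropping leading tokens.
        chunks.extend(" ".join(words[i:]) for i in range(len(words)))
    # Remove duplicates and empty strings while keeping order.
    return list(dict.fromkeys(c for c in chunks if c))


print(candidate_chunks(VP))
```

For the hand-written tree, the sketch yields the full phrase, "be sorted into", "be sorted", and the three noun phrase chunks listed above.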
Step 2: Preprocessing Candidate Phrase Types and ConceptNet Nodes. Next, we preprocess the candidate phrases we extracted from the text to prepare the mapping in Step 3. We apply spaCy (Honnibal and Montani, 2017) to lemmatize the candidate phrases extracted from the texts, and remove articles, pronouns, adverbs, conjunctions, interjections and punctuation. In Step 3 we apply the very same process to the nodes in ConceptNet, which are not normalized, in order to build a dictionary from ConceptNet.
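As a rough illustration of this preprocessing, the sketch below filters and lemmatizes tokens that have already been tagged. The input triples are hand-written; with spaCy they could be read from each token's `text`, `lemma_`, and `pos_` attributes. The function name and the mapping of the removed word classes to Universal POS tags are assumptions, not taken from the tool.

```python
# Universal POS tags dropped during preprocessing. Mapping "articles" to DET and
# "conjunctions" to CCONJ/SCONJ is an assumption about the tag set in use.
DROPPED_POS = {"DET", "PRON", "ADV", "CCONJ", "SCONJ", "INTJ", "PUNCT"}


def preprocess(tagged_tokens):
    """Turn (text, lemma, pos) triples into a normalized, lemmatized phrase."""
    kept = [lemma.lower() for _text, lemma, pos in tagged_tokens
            if pos not in DROPPED_POS]
    return " ".join(kept)


# Hand-tagged examples (hypothetical tagger output, for illustration only).
walk_the_dog = [("walk", "walk", "VERB"), ("the", "the", "DET"), ("dog", "dog", "NOUN")]
baking_cakes = [("baking", "bake", "VERB"), ("cakes", "cake", "NOUN")]

print(preprocess(walk_the_dog))
print(preprocess(baking_cakes))
```

Applying the same function to text phrases and to ConceptNet node labels puts both sides of the later lookup into the same normalized form.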
Step 3: Matching Candidate Phrase Types to a Dictionary Based on ConceptNet. We then map the preprocessed phrases to the preprocessed ConceptNet concepts as follows: We create a dictionary based on ConceptNet where we collect all concepts that are conceptually related (in the sense that they involve at least one common content word) but are represented as distinct nodes in ConceptNet. That is, we aim to subsume, e.g., dog, dogs, nice dog, and my neighbor's dog under one entry in the dictionary (cf. Figure 2). In our dictionary, keys are lemmatized words contained in concept node phrases (e.g. dog for the concept my dog), and the corresponding value assigned to a key is a list of all ConceptNet nodes that contain this lemma (e.g. dog, dogs, my dog, my neighbor's dog), as determined by the lemmatization of the nodes (see Step 2 for the applied process). Therefore, in our dictionary all ConceptNet nodes that contain the same lemma, the lemma of the key, are clustered together in one entry. Note that we lemmatize the ConceptNet nodes only for the purpose of mapping and clustering, while they remain unchanged (in their original form and inflection) as values in the dictionary. That is, we compare a key (lemma) to the lemmatized version of the concepts, and include all nodes, or concept phrases in their original, inflected form, that contain this lemma.
An example of how we create an entry in the dictionary is given in Figure 2 (cf. Figure 3). In case the lemmatized candidate phrase from the text contains further lemmas, we apply the same procedure for each of these, and construct additional entries, if they have not yet been created and stored.
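The dictionary construction can be sketched as an inverted index from lemmas to original node labels. The toy lemma table and stopword set below stand in for the spaCy-based preprocessing of Step 2 and are purely hypothetical, as are the function names; only the key/value layout follows the description above.

```python
from collections import defaultdict

# Toy stand-ins for the Step 2 preprocessing (hypothetical, illustration only).
TOY_LEMMAS = {"dogs": "dog", "walking": "walk", "walks": "walk"}
TOY_STOPWORDS = {"a", "the", "my"}


def content_lemmas(phrase):
    """Lemmatize a phrase and drop stopwords."""
    return [TOY_LEMMAS.get(tok, tok) for tok in phrase.lower().split()
            if tok not in TOY_STOPWORDS]


def build_dictionary(nodes):
    """Map each lemma (key) to the original, uninflected-or-inflected nodes containing it."""
    index = defaultdict(list)
    for node in nodes:
        # dict.fromkeys removes repeated lemmas within one node.
        for lemma in dict.fromkeys(content_lemmas(node)):
            index[lemma].append(node)  # the value keeps the node's original form
    return dict(index)


NODES = ["dog", "dogs", "nice dog", "my dog", "walking a dog", "walk the dog", "long walk"]
DICTIONARY = build_dictionary(NODES)
print(DICTIONARY["dog"])
print(DICTIONARY["walk"])
```

Note that lemmatization is used only to decide which key a node belongs to; the stored values are the unchanged node labels, as described above.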
Using this dictionary we are now able to access the maximum of relational information stored in the ConceptNet knowledge graph for a given candidate phrase from a text, since it allows us to jointly look up the in- and outgoing edges of all values (nodes) assigned to the same key, e.g., [dogs,ISA,domestic animal]; [dog,HASPROPERTY,nice]; ... (Figure 3, right-hand side). In case a candidate phrase contains multiple lemmas, we collect the union of ConceptNet nodes defined for the respective lemmas (keys) as their values, and apply a filtering step, which we describe below, to select the concept nodes that best correspond to the complex phrase.
Specifically, when looking up extracted candidate phrases that contain a single lemma (e.g. dog), we consider the complete list of nodes stored in the dictionary for that lemma (key), that is, all concepts containing (inflected versions of) dog, including also multiword phrases which are linked with other keys. When looking up extracted candidate phrases that contain more than one lemma (e.g. "walk the dog"), we obtain sets of ConceptNet nodes (values) that are defined for each (non-stopword) lemma (key), here: dog and walk, and retrieve all ConceptNet nodes from their respective list of values. From these sets, instead of building their union, we construct their intersection, which yields the set of phrases from all keys' values that contain the maximum of lemmas contained in the candidate phrase.
Figure 2: Collecting conceptually related nodes in ConceptNet, here: for the phrase "the dog".
For our example "walk the dog", we would obtain the two lemmas walk and dog, together with their values: walk → walk, walks, walking, walking home, walking a dog, long walk, walk the dog, ...; and dog → dog, dogs, nice dog, my neighbor's dog, walking a dog, walk the dog, ...; and extract walking a dog and walk the dog, which are contained as values in both keys.
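The lookup for multi-lemma phrases amounts to a set intersection over the dictionary values. The following sketch assumes a dictionary shaped like the one described above; the function name is hypothetical, and the two entries are abbreviated versions of the value lists from the example.

```python
def lookup(phrase_lemmas, dictionary):
    """Return the nodes that appear under every lemma of the candidate phrase."""
    value_sets = [set(dictionary.get(lemma, ())) for lemma in phrase_lemmas]
    if not value_sets:
        return set()
    return set.intersection(*value_sets)


# Abbreviated dictionary entries taken from the running example.
DICTIONARY = {
    "walk": ["walk", "walks", "walking", "walking home", "walking a dog",
             "long walk", "walk the dog"],
    "dog": ["dog", "dogs", "nice dog", "my neighbor's dog", "walking a dog",
            "walk the dog"],
}

print(sorted(lookup(["walk", "dog"], DICTIONARY)))
```

A single-lemma phrase such as "dog" simply returns the complete value list for that key, which is why the additional filtering in Step 4 is needed.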
During the mapping process that collects values (ConceptNet concepts) for the lemmatized keys of candidate phrases, we also resolve ambiguities. E.g., the forms fly or flies can be either a noun or a verb. We resolve this ambiguity by comparing the POS tags obtained during preprocessing the extracted candidate phrases to the POS tags that are associated with concepts in ConceptNet. Specifically, we retrieve POS information for the extracted candidate phrases by applying the POS tagger implemented in spaCy (Honnibal and Montani, 2017) on the sentence level, while for ConceptNet nodes we use the POS labels available as metadata. In case we find several concepts with the same surface form but different POS tags in ConceptNet (e.g. fly/noun and fly/verb), we use the POS annotations from the extracted candidate phrases and from ConceptNet tags to restrict the mapping to matching POS, hence we do not include any concepts with conflicting POS information in the list of values for the phrase's keys.
Figure 3: Example of the ConceptNet Dictionary entry for dog. Left: lemmatized ConceptNet nodes (grey) that contain dog (underlined); middle: CN dictionary entry (containing the original CN nodes); right: relational knowledge (in- and outgoing edges for each value (CN node) assigned to the key) which can be retrieved from ConceptNet based on the dictionary entry.
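A minimal sketch of such a POS restriction is given below. It assumes that nodes carry a one-letter POS label (or none) and that coarse tagger output is mapped onto those letters; both the mapping table and the decision to keep nodes without POS metadata are assumptions, since the text only states that conflicting POS information is excluded.

```python
# Assumed mapping from coarse tagger output to one-letter node POS labels.
UPOS_TO_NODE_POS = {"NOUN": "n", "VERB": "v", "ADJ": "a", "ADV": "r"}


def filter_by_pos(candidate_pos, nodes):
    """Keep (label, pos) nodes whose POS does not conflict with the candidate's POS."""
    wanted = UPOS_TO_NODE_POS.get(candidate_pos)
    if wanted is None:
        return list(nodes)  # no usable POS for the candidate: nothing can conflict
    # Assumption: nodes without POS metadata (None) are kept.
    return [(label, pos) for label, pos in nodes if pos is None or pos == wanted]


NODES = [("fly", "n"), ("fly", "v"), ("fly away", None)]
print(filter_by_pos("VERB", NODES))
```

Whether unlabeled nodes should be kept or dropped is a design choice that trades recall against precision; the sketch keeps them.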
To summarize, the dictionary we obtain from Step 3 allows us to look up concepts for any preprocessed candidate phrases, and obtain from it all ConceptNet nodes which contain them or inflected versions of them. In case of multiple lemmas contained in a candidate phrase, we retrieve all nodes that contain all lemmas included in the given phrase, by computing an intersection over the values associated with all keys (lemmas) evoked by the phrase. Since we lemmatize both the ConceptNet nodes and the extracted candidate phrases as described above, we maximize the number of matches, and hence, the associated ConceptNet relation tuples, while selecting maximally specific nodes. At the same time, since we construct chunked phrases from the extracted concepts, we also allow for more constrained matches (limited, e.g., to single lemmata) with equally constrained ConceptNet concepts, preventing over-specific phrases and an ensuing loss of recall. Finally, we apply POS filtering, and hence avoid the retrieval of ConceptNet concepts that do not match the POS category of the concepts mentioned in the candidate phrase, relying on the sentential context of the phrase candidate for disambiguation.
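To make the link between dictionary entries and relation tuples concrete, the sketch below collects every triple in which any node of an entry takes part. The first two triples come from the example in this section; the third is a toy distractor added for illustration, and the function name is hypothetical.

```python
DICTIONARY = {"dog": ["dog", "dogs", "my dog"]}

TRIPLES = [
    ("dogs", "ISA", "domestic animal"),
    ("dog", "HASPROPERTY", "nice"),
    ("cat", "ISA", "pet"),  # toy distractor, not from the paper
]


def edges_for_key(key, dictionary, triples):
    """Collect in- and outgoing edges of all nodes stored under one key."""
    nodes = set(dictionary.get(key, ()))
    return [t for t in triples if t[0] in nodes or t[2] in nodes]


print(edges_for_key("dog", DICTIONARY, TRIPLES))
```

Exact string matching on "dogs" would reach only the first triple, whereas the lookup through the key dog reaches both.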
Step 4: Constraining the Mapping to ConceptNet Concepts. While in Step 3 we constrain the selected concept nodes by intersection in case the phrase candidate contains multiple lemmata, we still obtain many ConceptNet nodes when mapping short phrases containing a single content word to ConceptNet, since we retrieve all nodes that include the lemma of the candidate phrase. In practice, this yields a huge set of concepts that contain not only this lemma, but many other content words not present in the candidate phrase, possibly conceptually unrelated nodes that we want to omit. For example, if the candidate phrase is "dog", we map it to the ConceptNet nodes dog and dogs, but also conceptually not strictly related nodes such as feeding my dogs, dogs are my favourite animals, it's raining cats and dogs, etc. We therefore establish a method that allows us to filter out nodes that are not similar enough to the candidate phrase, and hence are assumed to be conceptually unrelated, which we describe in the following.
We filter the nodes (values) for each lemma (key) by calculating the similarity between the ConceptNet concepts and the extracted candidate phrase. We calculate similarity in terms of length (by token or char length) and in terms of semantic similarity (using word embeddings and similarity metrics). We experimented with different similarity metrics: we tried Dice Coefficient (Sørensen, 1948), Jaccard Coefficient (Jaccard, 1902), Minimum Edit Distance, Word Mover's Distance (Kusner et al., 2015), and Cosine Distance, with different similarity thresholds. For the metrics that require word representations in vector space (Word Mover's Distance and Cosine Distance), we tried different embeddings (Numberbatch, Word2Vec trained on GoogleNews (Mikolov et al., 2013), and GloVe (Pennington et al., 2014)), where we compute representations for multiword terms by averaging their embeddings. We also consider differences in phrase lengths: here we compare the length of the ConceptNet nodes' concept phrases to the length of the candidate phrase, by number of tokens and of characters. E.g. when comparing the candidate phrase "my dog" to the nodes (a) dogs and (b) many dogs, we obtain for (a) a difference in the number of tokens by 1 and of characters by 1, and for (b) in the number of tokens by 0 and of characters by 3.
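The length differences and the two set-overlap coefficients can be sketched as follows. Two points are assumptions: characters are counted without whitespace, which is the reading that reproduces the numbers in the example above, and the coefficients are defined over generic sets because the unit of comparison (tokens, lemmas, or characters) is not specified here.

```python
def length_differences(candidate, node):
    """Absolute difference in token count and in non-whitespace character count."""
    token_diff = abs(len(candidate.split()) - len(node.split()))
    char_diff = abs(len("".join(candidate.split())) - len("".join(node.split())))
    return token_diff, char_diff


def dice(a, b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) over two collections of items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over two collections of items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(length_differences("my dog", "dogs"))
print(length_differences("my dog", "many dogs"))
print(dice(["walk", "dog"], ["dog"]), jaccard(["walk", "dog"], ["dog"]))
```

For the lemma sets {walk, dog} and {dog}, Dice is 2/3 and Jaccard is 1/2, which shows that Dice is the more lenient of the two for partially overlapping phrases.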
We evaluated the output of several configurations manually in terms of how well the filtered nodes fit the extracted candidate phrase, and found the following configurations to yield the highest coverage and lowest noise: we allow for a maximum token length difference of 1 and/or a maximum character difference of 10, and a minimum Dice coefficient of 0.85. The other configurations are implemented as well (as command line parameters), so users can experiment with different settings easily.
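The reported configuration can be pictured as a simple threshold filter. The sketch below is one possible reading, not the tool's code: the parameter names are hypothetical and do not correspond to the actual command line parameters, the "and/or" is modeled by letting either length limit be disabled with None, character counts ignore whitespace, and Dice is computed over lemma sets, which is an assumption.

```python
def dice(a, b):
    """Dice coefficient over two collections of items."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0


def keep_node(candidate, cand_lemmas, node, node_lemmas,
              max_token_diff=1, max_char_diff=10, min_dice=0.85):
    """Return True if the node passes all enabled length and similarity limits."""
    token_diff = abs(len(candidate.split()) - len(node.split()))
    char_diff = abs(len("".join(candidate.split())) - len("".join(node.split())))
    if max_token_diff is not None and token_diff > max_token_diff:
        return False
    if max_char_diff is not None and char_diff > max_char_diff:
        return False
    # Assumption: Dice is computed over the lemma sets produced by preprocessing.
    return dice(cand_lemmas, node_lemmas) >= min_dice


# Nodes paired with hand-written lemma lists (illustration only).
NODES = [
    ("dog", ["dog"]),
    ("dogs", ["dog"]),
    ("nice dog", ["nice", "dog"]),
    ("feeding my dogs", ["feed", "dog"]),
]
KEPT = [node for node, lemmas in NODES if keep_node("dog", ["dog"], node, lemmas)]
print(KEPT)
```

Under these assumptions the candidate "dog" keeps dog and dogs, while longer phrases are removed either by the token limit or by the Dice threshold.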

Applications
Recent approaches that map natural language text to nodes in ConceptNet apply simple string matching. Wang et al. (2020) for example use ConceptNet in order to retrieve multi-hop knowledge paths as background information for improving the task of question answering. They map concepts that appear in questions and answers from the two benchmark datasets, CommonsenseQA (Talmor et al., 2019) and OpenBookQA, to ConceptNet using plain string matching. Irrespective of the question answering task, we want to evaluate the two methods of linking concepts from texts to ConceptNet (plain string matching vs. COCO-EX) by comparing the number of concepts that could be retrieved from ConceptNet by both methods, respectively; and by evaluating the quality of the retrieved concepts, with regard to their coverage and informativity, as well as the amount of utilized relational knowledge from the ConceptNet knowledge graph.
We reimplement the string matching method and make it comparable to COCO-EX by retrieving all noun phrases, verb phrases and adjective phrases and their nested phrases (as we do for COCO-EX). Additionally, as in COCO-EX, we filter these phrases by removing articles, pronouns, adverbs, conjunctions, interjections and punctuation, and keep the original phrases and the chunked versions.
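For comparison, the string matching baseline reduces to an exact membership test against the set of node labels. The sketch below is a simplified illustration with a hypothetical function name; it assumes the candidate phrases have already been extracted and filtered as described above.

```python
def string_match(phrases, node_labels):
    """Return the candidate phrases that exactly equal some node label."""
    nodes = set(node_labels)
    return [phrase for phrase in phrases if phrase in nodes]


NODE_LABELS = ["brain", "brains", "dog", "dogs", "walk the dog"]
PHRASES = ["brains", "walk dog", "dog"]

print(string_match(PHRASES, NODE_LABELS))
```

Here "brains" is matched only to the node brains, so the edges attached to brain are never reached, and "walk dog" finds no match at all even though walk the dog exists as a node.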
The counts of concepts retrieved by simple string matching vs. using COCO-EX are displayed in Table 1. We find that for the CommonsenseQA dataset, more concepts are linked to ConceptNet from the questions when using string matching, while with COCO-EX we can link more concepts from the answers. For OpenBookQA, the number of extracted concepts for the questions is similar for both methods, while again we can link more concepts from the answers with COCO-EX.
For evaluating concept quality, we set up a small annotation experiment where we provided our annotators with 50 questions randomly sampled from CommonsenseQA and OpenBookQA. For each question, our annotators evaluated whether all meaningful concepts were extracted (coverage, in a binary setting (yes/no)); and if/how many informative (and thus, wanted) concepts are among the extracted concepts (which can be interpreted as reverse precision). (Our annotation manual can be found at https://github.com/Heidelberg-NLP/CoCo-Ex/blob/master/CoCo-Ex_Annotation_Manual.pdf) For each dataset, two annotators with linguistic background performed annotations. We measure annotator agreement in terms of Cohen's Kappa and achieve an agreement of 78%. Remaining conflicts were resolved by an expert annotator (one of the authors). We automatically evaluate the number of concepts that can be accessed in ConceptNet, in terms of the number of in- and outgoing edges connecting the node(s) which have been annotated as informative (wanted), identified by simple string matching, vs. all nodes obtained by COCO-EX through keys and values. The results of our manual evaluation experiment are displayed in Table 2. We find that the coverage (whether all meaningful concepts were extracted, evaluated in a binary setting: yes/no) is higher for CommonsenseQA when using COCO-EX and higher for OpenBookQA when applying string matching.
Next, we evaluate the informativeness of the extracted concepts. We find that the ratio between informative (wanted) and uninformative (unwanted) concepts is much better when using COCO-EX as opposed to simple string matching on both datasets (cf. Table 2). Finally, we also evaluate the amount of relational information stored in the ConceptNet knowledge graph which can be retrieved by looking up the in- and outgoing edges of the nodes rated as informative. Here we find that with COCO-EX, much more relational information of ConceptNet can be accessed, indicating again the superiority of this method compared to simple string matching.
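For readers unfamiliar with the agreement measure used above, Cohen's Kappa relates the observed agreement p_o to the agreement p_e expected by chance, as (p_o - p_e) / (1 - p_e). The sketch below computes it for two label sequences; the labels are invented toy data for illustration and are unrelated to the annotations reported in this section.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same, non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the annotators' label proportions, summed over labels.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented toy annotations (not the paper's data).
ANNOTATOR_1 = ["yes", "yes", "no", "no"]
ANNOTATOR_2 = ["yes", "no", "no", "no"]
print(cohens_kappa(ANNOTATOR_1, ANNOTATOR_2))
```

In the toy case the annotators agree on three of four items (p_o = 0.75) while chance agreement is 0.5, giving a Kappa of 0.5.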

Conclusion
In this paper we presented COCO-EX, a tool for Extracting Concepts from texts and linking them to the ConceptNet knowledge graph. As opposed to the common shortcut method of simply matching strings from texts to ConceptNet nodes, COCO-EX extracts meaningful concepts from texts and maps them to collections of concept nodes in ConceptNet, which enables us to access the maximum of relational information stored in the ConceptNet knowledge graph. COCO-EX takes into account that concepts in ConceptNet are represented as non-canonicalized, free-form text and are often complex, noisy, uninformative, and/or over-specific. We evaluated COCO-EX against the method of simple string matching, which confirmed our hypotheses that COCO-EX (i) improves the precision of mapping by enabling the extraction of meaningful, important concepts rather than over-specific or uninformative ones, and (ii) makes it possible to utilize the maximum of relational information stored in the knowledge graph, a step towards overcoming the well-known sparsity issue of commonsense knowledge graphs such as ConceptNet.