Graphine: A Dataset for Graph-aware Terminology Definition Generation

Precisely defining terminology is the first step in scientific communication. Developing neural text generation models for definition generation can circumvent labor-intensive curation, further accelerating scientific discovery. Unfortunately, the lack of large-scale terminology definition datasets hinders progress toward definition generation. In this paper, we present a large-scale terminology definition dataset, Graphine, covering 2,010,648 terminology definition pairs and spanning 227 biomedical subdisciplines. Terminologies in each subdiscipline further form a directed acyclic graph, opening up new avenues for developing graph-aware text generation models. We then propose a novel graph-aware definition generation model, Graphex, that integrates a transformer with a graph neural network. Our model outperforms existing text generation models by exploiting the graph structure of terminologies. We further demonstrate how Graphine can be used to evaluate pretrained language models, compare graph representation learning methods and predict sentence granularity. We envision Graphine to be a unique resource for definition generation and many other NLP tasks in biomedicine.


Introduction
Obtaining the definition is the first step toward understanding a new terminology. The lack of precise terminology definition poses great challenges in scientific communication and collaboration (Oke, 2006; Cimino et al., 1994), which further hinders new discovery. This problem becomes even more severe in emerging research topics (Baig, 2020; Baines et al., 2020), such as COVID-19, where curated definitions could be imprecise and do not scale to rapidly proposed terminologies.
Figure 1: Graphine dataset contains 2,010,648 terminology definition pairs organized in 227 directed acyclic graphs. Each node in the graph is associated with a terminology and its definition. Terminologies are organized from coarse-grained ones to fine-grained ones in each graph.
Neural text generation (Bowman et al., 2016; Vaswani et al., 2017; Sutskever et al., 2014; Song et al., 2020b) could be a plausible solution to this problem by generating definition text based on the terminology text. Encouraging results by neural text generation have been observed on related tasks, such as paraphrase generation (Li et al., 2020), description generation (Cheng et al., 2020), synonym generation (Gupta et al., 2015) and data augmentation (Malandrakis et al., 2019). However, it remains unclear how to generate a definition, which pairs concise text in the input space (i.e., the terminology) with longer text in the output space (i.e., the definition). Moreover, the absence of large-scale terminology definition datasets impedes progress towards developing definition generation models.
Despite these challenges, scientific terminologies often form a directed acyclic graph (DAG), which could be helpful in definition generation. Each DAG organizes related terminologies from general ones to specific ones with different granularity levels (Figure 1). These DAGs have proved to be useful in assisting disease, cell type and function classification (Wang et al., 2020b; Song et al., 2020a; Wang et al., 2015) by exploiting the principle that nearby terms on the graph are semantically similar (Altshuler et al., 2000). Likewise, terminologies that are closer on this DAG should acquire similar definitions. Moreover, placing a new terminology in an existing DAG requires considerably less expert effort than curating the definition, further motivating us to generate the definition using the DAG.
In this paper, we collectively advance definition generation in the biomedical domain through introducing a terminology definition dataset Graphine and a novel graph-aware text generation model Graphex. Graphine encompasses 2,010,648 terminology definition pairs encapsulated in 227 DAGs. These DAGs are collected from three major biomedical ontology databases (Smith et al., 2007;Noy et al., 2009;Jupp et al., 2015). All definitions are curated by domain experts. Our graph-aware text generation model Graphex utilizes the graph structure to assist definition generation based on the observation that nearby terminologies exhibit semantically similar definitions.
Our human and automatic evaluations demonstrate the substantial improvement of our method on definition generation in comparison to existing text generation methods that do not consider the graph structure. In addition to definition generation, we illustrate how Graphine opens up new avenues for investigating other tasks, including domain-specific language model pretraining, graph representation learning and a novel task of sentence granularity prediction. Finally, we present case studies of a failed generation by our method, pinpointing directions for future improvement. To the best of our knowledge, Graphine and Graphex build up the first large-scale benchmark for terminology definition generation, and can be broadly applied to a variety of tasks.

Data collection and statistics
We collect 2,010,648 biomedical terminology definition pairs from three major biomedical ontology databases, including Open Biological and Biomedical Ontology Foundry (OBO) (Smith et al., 2007), BioPortal (Noy et al., 2009) and EMBL-EBI Ontology Lookup Service (OLS) (Jupp et al., 2015), spanning diverse biomedical subdisciplines such as cellular biology, molecular biology and drug development. For definitions that span multiple sentences, we only consider the first sentence.
Even though these large-scale terminology definition pairs already present a novel resource for definition generation, one unique feature of our dataset is the graphs among terminologies. In particular, we construct a DAG for each biomedical subdiscipline using the 'is a' relationships from the original data. As a result, each terminology belongs to one DAG, where each node is associated with a terminology and its definition and each edge links from a general terminology to a specific one. We reduce the number of DAGs from 499 to 227 by merging DAGs that appear in more than one database.
We notice a substantial amount of missing definitions in the original collection, confirming the importance of computationally generating definitions. In 81 out of 499 DAGs, more than 50% of terminologies do not have any definition. We thus exclude terminologies that do not have a curated definition. We further observed a substantial discrepancy between the number of words in the terminology and the number of words in the definition. The average number of words in the terminology is 4.55, which is much lower than the 15.58 average number of words in the definition (Figure 2). This discrepancy could pose great challenges to text generation models. We seek to alleviate it using the terminologies and definitions of graph neighbors.
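The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used to build Graphine: the record keys (`id`, `term`, `definition`, `is_a`) and both function names are hypothetical, and the first-sentence split is deliberately naive.

```python
from collections import defaultdict


def first_sentence(text):
    """Keep only the first sentence, using a naive split on '. '."""
    head, sep, _ = text.partition(". ")
    return head + "." if sep else text


def build_dag(records):
    """Build (nodes, children) from a list of ontology records.

    Each record is a dict with hypothetical keys:
    'id', 'term', 'definition' (may be None or empty) and 'is_a' (parent ids).
    """
    nodes = {}
    for rec in records:
        definition = (rec.get("definition") or "").strip()
        if not definition:
            continue  # exclude terms without a curated definition
        nodes[rec["id"]] = {
            "term": rec["term"],
            "definition": first_sentence(definition),
        }

    # Edges point from the general term (parent) to the specific term (child).
    children = defaultdict(list)
    for rec in records:
        if rec["id"] not in nodes:
            continue
        for parent in rec.get("is_a", []):
            if parent in nodes:
                children[parent].append(rec["id"])
    return nodes, dict(children)
```

Note that in this sketch, dropping a term without a definition also drops the edges that pass through it; whether descendants should be reattached to a remaining ancestor is a design choice the text does not specify.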

Data analysis
All definitions in our dataset are curated by domain experts, assuring high quality. To verify this, we investigate the consistency between expert curations by comparing the definitions of the same terminology from different DAGs (e.g., material maintenance appears in both obi and chmo). Different DAGs are curated by different domain experts in our dataset. We observed a remarkable cosine similarity of 0.96 between definitions of the same terminology (Figure 3a). We next examine the definitions of 67,257 terminology synonym pairs that are present in different DAGs. Synonyms are also curated by domain experts in the original databases. We again observed a prominent cosine similarity of 0.97, assuring the consistency between expert curations.
To examine the quality of the graph structure, we study the consistency between graph-based terminology similarity and text-based terminology similarity. Graph-based terminology similarity is calculated using the shortest distance on the graph. Text-based similarity is calculated using the BLEU score (Papineni et al., 2002) between two terminologies. We observed strong agreement between these two similarity scores (Figure 3b). This agreement is even more substantial between graph-based terminology similarity and text-based definition similarity (Figure 3c). Collectively, these results indicate that nearby nodes exhibit similar terminologies and definitions, suggesting the opportunity to improve definition generation using the graph structure.
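As a rough illustration of the two similarity notions compared above, the sketch below computes hop distances with breadth-first search and a token-overlap score between two strings. It assumes that edges are treated as undirected when measuring distance, and it uses Jaccard overlap only as a simple stand-in for BLEU; both choices and all function names are assumptions, not details taken from the analysis above.

```python
from collections import deque


def shortest_distances(edges, source):
    """Hop distance from `source` to every reachable node.

    Assumption: edges are treated as undirected for distance purposes.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def token_overlap(a, b):
    """Jaccard overlap of lowercase tokens; a crude stand-in for BLEU."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0
```

Pairing `shortest_distances` with `token_overlap` over sampled node pairs would give the two quantities whose agreement is examined in this section.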
Graph-aware Definition Generation: Task and Model

Problem Definition
Our goal is to generate the definition text according to the terminology text. Meanwhile, terminologies form a DAG, which could be used to assist definition generation. More precisely, the input is a graph $G$ with node set $V$, where each node $v_i \in V$ is associated with a terminology $t_i = (t_i^1, \dots, t_i^{n_i^t})$ and a definition $d_i = (d_i^1, \dots, d_i^{n_i^d})$, where $t_i^j \in C$, $d_i^j \in C$ and $C$ is the vocabulary. In practice, the terminology is often a phrase and the definition is a sentence. Therefore, $n_i^d$ is much larger than $n_i^t$.
We consider a transductive learning setting where $V$ is composed of $V_{\text{train}}$ and $V_{\text{test}}$. $V_{\text{train}}$ is the set of nodes that have both terminologies and definitions. $V_{\text{test}}$ is the set of nodes that only have terminologies. The goal of graph-aware definition generation is to generate $d_i$ for $v_i \in V_{\text{test}}$ according to both the terminology $t_i$ and the graph $G$. Although each graph $G$ in Graphine is a DAG, our method can be applied to any kind of graph. The proposed definition generation task is distinct from conditional text generation and machine translation due to the presence of this graph $G$, which makes it possible to transfer knowledge between terminologies based on our previous observation that nearby nodes on the graph have similar definitions. We thus aim at propagating terminologies and definitions using the graph structure to enhance definition generation.

Model
We propose a graph-aware definition generation approach, Graphex, that generates a definition based on the global semantic embedding and the local semantic embedding using a two-stage approach (Fig. 4). At the first stage, global semantic embeddings are calculated through propagating terminologies and definitions on the graph. At the second stage, the local semantic embedding is obtained by embedding the specific terminology. Finally, Graphex generates the definition $d_i$ by using the concatenation of global and local semantic embeddings as the input to a Transformer (Vaswani et al., 2017).

Encoding global semantic via graph propagation
At the first stage, we obtain two global semantic embeddings $g_i^t$ and $g_i^d$ of each node $v_i$ through propagating the terminology and the definition on the graph, respectively. In particular, we follow previous work (Kotitsas et al., 2019) to calculate $g_i^t$ and $g_i^d$ using a bidirectional GRU-based neural network, which aggregates the embeddings of individual words in $t_i$ as the node features of the node $v_i$ and then smooths node features based on random walks.
To encode the network structure, we sample $m$ random walk paths of fixed length $k$ starting from each node (Grover and Leskovec, 2016). The $r$-th random walk starting from the node $v_i$ is denoted as $P_{v_i,r} = p_{1,r}, p_{2,r}, \dots, p_{k,r}$ ($r = 1, \dots, m$), where $p_{1,r} = v_i$. We then learn two embeddings $w_i$ and $u_i$ for each node $v_i$ based on the arriving probability calculated from these sampled random walk paths. In particular, the predicted probability of arriving at the node $v_j$ through the walk $P_{v_i,r}$ is defined in terms of these two embeddings (eq. 1), where $w_i$ is the feature embedding and $u_i$ is the context embedding for node $v_i$.
Instead of training $w_i$ and $u_i$ solely based on the network structure, we use the text features from $t_i$ to regularize them. We define $q(c)$ and $h(c)$ to be the two separate trainable word embeddings for each token $c$ in the vocabulary $C$. Then $q(t_i^k)$ and $h(t_i^k)$ are the trainable word embeddings of the $k$-th token in the terminology $t_i$. We use a shared bidirectional GRU network to encode $t_i$ into $u_i$ and $w_i$.
After minimizing the first-stage loss function, $g_i^t$ is obtained by concatenating $w_i$ and $u_i$, which represents the global semantic of node $v_i$ using the terminology. Likewise, we can obtain $g_i^d$ by first encoding $d_i$ into the feature embedding $w_i$ and the context embedding $u_i$, and then concatenating them. For a node that does not have a definition (i.e., $v_i \in V_{\text{test}}$), we generate a $d_i$ as a replacement by using $t_i$ as input to a Transformer trained on other terminology definition pairs.
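The walk-sampling step can be illustrated with a short sketch. It assumes uniform random walks over an adjacency-list graph, whereas Grover and Leskovec (2016) describe biased second-order walks, so this is a simplification and not the procedure used in Graphex. The function name, the `adj` format and the `seed` parameter are hypothetical.

```python
import random


def sample_walks(adj, m, k, seed=0):
    """Sample m walks of length k (counted in nodes) from every node.

    adj maps each node to a list of neighbors. Walks are uniform here,
    which is a simplifying assumption. A walk that reaches a node with
    no neighbors stops early and is shorter than k.
    """
    rng = random.Random(seed)
    walks = {}
    for start in adj:
        paths = []
        for _ in range(m):
            path = [start]  # p_1 is always the start node
            while len(path) < k:
                neighbors = adj.get(path[-1], [])
                if not neighbors:
                    break
                path.append(rng.choice(neighbors))
            paths.append(path)
        walks[start] = paths
    return walks
```

Each returned path plays the role of one $P_{v_i,r}$; node pairs that co-occur on a path would then supply the positive examples for learning $w_i$ and $u_i$.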

Fusing local and global semantic for definition generation
At the second stage, we generate the definition $d_i$ for node $v_i$ conditioned on both the local semantic $l_i$ and the global semantics $g_i^t$ and $g_i^d$. The local semantic $l_i$ is obtained by embedding $t_i$ using BioBERT. We also examined other BERT-based models in the experiments. Let $P(d_i \mid l_i, g_i^t, g_i^d; \theta)$ be the transformer model parameterized by $\theta$. The second-stage loss function is defined in terms of this model.

Experimental Results

Experimental setup
We conduct experiments using DAGs included in the OBO database. To study the effect of graph structures, we only consider graphs that show a high correlation between the graph-based similarity and the text-based definition similarity.
We compare our method with three conventional conditional text generation models: Seq2Seq (Bahdanau et al., 2014), CVAE (Yan et al., 2016) and Transformer (Vaswani et al., 2017). All of them take the terminology as the input and the definition as the output. Since none of them considers the graph structure, our comparison could reveal the importance of considering graph structures. We further implement two variants of our model to investigate the impact of propagating definitions on the graph and propagating terminologies on the graph. In particular, Our Model w/o TG is the Graphex framework without the terminology-derived global semantic embedding $g_i^t$ in eq. 9, and Our Model w/o DG is the Graphex framework without the definition-derived global semantic embedding $g_i^d$ in eq. 9.
We used the same pretrained language model for all the competing methods. We chose BioBERT as it achieved the best performance among different pretrained language models. An LSTM is used as the encoder and the decoder of Seq2Seq and CVAE, and the dimensions of the word embedding and the hidden state are set to 768. The dimensions of the word embedding and the hidden state of Transformer are also set to 768. In our model, we used the default hyperparameters of Kotitsas et al. (2019) in the first stage and the same structure as the Transformer baseline in the second stage. The dimensions of $g_i^d$ and $g_i^t$ are 768. All the models were trained using the same data splits.
We used Graphex as a benchmark to compare pretrained language models on Graphine. We compare the three graph neural network methods with the Euclidean and Poincaré embedding methods, Euclidean and PoincareBall (Nickel and Kiela, 2017), and with the feature-based methods, HNN (Chami et al., 2019b) and MLP. We follow the default hyperparameter settings in (Chami et al., 2019a).
We perform both automatic evaluation and human evaluation. For automatic evaluation, we used six standard metrics including BLEU1-4 (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and NIST (Doddington, 2002). BLEU1-4 measures the n-gram overlap between the generated sentence and the target sentence. METEOR improves on BLEU by considering synonyms when comparing unigrams and by combining precision and recall instead of using precision alone. NIST weights n-gram matches by how informative they are, which reduces the contribution of common words like "is". For human evaluation, we recruited 3 annotators to score the generated sentences of each method for 50 terminologies. Annotators are requested to grade each generated definition as 0 (bad), 1 (fair) or 2 (good).
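To make the n-gram overlap behind BLEU concrete, the sketch below computes the clipped (modified) n-gram precision for a single candidate and reference. It is only the core ingredient: full BLEU additionally combines several n-gram orders and applies a brevity penalty, both omitted here. Whitespace tokenization and the function names are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the 'clipping' used by BLEU).
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total
```

Clipping matters for generated definitions that repeat a frequent token: a candidate such as "the the cat" is credited for "the" only once against the reference "the cat sat".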

Graphex improves definition generation by considering the graph structure
We first evaluated the performance of definition generation by Graphex. We compared Graphex with baselines that do not consider the graph structure (Table 1). We found that Graphex, which uses both the definition graph and the terminology graph, obtained the best performance on all six metrics. The improvement is most prominent against baselines that do not use the graph structure. For example, Graphex obtained a BLEU1 score of 34.35, which is 7.65% and 59.62% higher than Transformer and Seq2seq, respectively. Moreover, we observed decreased performance when only the terminology graph (Our Method w/o DG) or the definition graph (Our Method w/o TG) is considered. Despite their lower performance, these two variants are still consistently better than baselines that do not use graphs, confirming the importance of modeling graph structures in definition generation. We show two examples of how the graph can help Graphex generate better definitions (Table 2).
In both examples, the true definition of the nearby node is included in the training set, and can thus be used to capture the global semantic. We found that Graphex selectively copied tokens in the true definition of the parent node, leading to a more accurate generation. For example, in the first case, Graphex successfully generated "estuarine water" and "estuary bed below the water column". In the second case, Graphex propagated "in a given population in a given time period" from the child node, resulting in the correct generation of "in a given population in a given time". In contrast, the Transformer baseline is not able to generate such detailed information in either example because it ignores the graph structure.
Table 2:
True definition: an estuarine open water pycnocline which is composed primarily of fresh tidal water
Parent definition: a pycnocline which is part of an estuarine water body, spanning from a fiat boundary where the estuary bed below the water column reaches a depth of 4 meters until the end of the estuary most distal from the coast
Graphex: an estuarine water which extends from an estuarine pycnocline or mid -depth to the estuary bed and from a fiat boundary where the estuary bed below the water column
Transformer: an area of a planet's surface which is primarily covered by UNK herbaceous vegetation and where the underlying soil or
Terminology: increased eye tumor incidence
True definition: greater than the expected number of tumors originating in the eye in a given population in a given time period
Child definition: greater than the expected number of neoplasms in the retina, usually in the form of a distinct mass, in a specific population in a given time period
Graphex: greater than the expected number of neoplasms in the gastric tissue usually in the form of a distinct mass , in a specific population in a given time period
Transformer: greater than the expected number of UNK in the lung , usually in the form of a distinct mass
Since the pretrained language model and the graph representation method are two important model selections in Graphex, we next leverage Graphex to compare different pretrained language models and graph representation methods, shedding light on future directions in definition generation.

Comparing Pretrained Language Models
Domain-specific pretrained language models have achieved impressive performance on tasks such as named entity recognition, information extraction and relation extraction in biomedicine (Beltagy et al., 2019). One barrier to more thoroughly comparing these pretrained language models is the lack of domain-specific benchmarks. Graphine could be used as a novel domain-specific benchmark in biomedicine. As a proof of concept, we compared five pretrained language models, including three biomedical domain-specific models, by using each of them to generate the local semantic $l_i$ in eq. 8 (Table 3). We found that domain-specific pretrained language models have consistently better performance than general pretrained language models, which agrees with previous findings on the value of domain-specific language models in biomedicine (Gu et al., 2020; Beltagy et al., 2019). Among the three domain-specific pretrained language models, BioBERT and SciBERT obtained the best performance. This might be due to the corpora these two models were trained on, suggesting the possibility of using Graphine to further compare different biomedical corpora (Wang et al., 2020a).
Table 4: Comparison of link prediction performance using different graph representation learning methods.

Comparing graph representation methods
We next sought to compare graph representation methods using a link prediction task based on our dataset. Graphs in our dataset present a hierarchical structure, which poses a unique challenge for graph representation methods. The results are summarized in Table 4. We found that methods that consider the graph structure have overall superior performance, confirming the importance of the graph structure in definition generation. Among all approaches, GCN obtained the best performance. We did not observe improved performance from embedding graphs into the hyperbolic space, which contradicts prior work showing that hyperbolic embedding can better model hierarchical structures (Nickel and Kiela, 2017; Chami et al., 2019a). We attribute this to the more complicated node features in our setting compared with previous work. In our dataset, node features are text of arbitrary length with a large vocabulary, introducing new challenges to hyperbolic embedding-based methods.

Sentence granularity prediction
Finally, we exploited Graphine for a novel task of sentence granularity prediction. Measuring sentence semantic similarity is crucial for many NLP tasks. Existing sentence similarity benchmarks only provide binary labels indicating similar or dissimilar (Li et al., 2006; Mueller and Thyagarajan, 2016). In contrast, our dataset is able to characterize the specific granularity of sentences beyond similarity. We define the ground truth granularity of a definition sentence as its depth in the DAG, where a smaller (larger) depth indicates a more coarse-grained (fine-grained) sentence. Based on this granularity benchmark, we define two specific tasks: relative granularity prediction and absolute granularity prediction. Relative granularity prediction aims at predicting which of two given sentences is more fine-grained. Absolute granularity prediction aims at predicting the specific granularity of a given sentence. Because granularity levels from different graphs are not directly comparable, comparing sentences from different graphs could introduce systematic bias. To tackle this problem, we first performed a graph alignment among all DAGs using terminologies that appeared in multiple DAGs as anchors. After the alignment, all sentences are associated with a granularity level between 1 and 17, where 1 indicates the most coarse-grained sentence.
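The granularity labels can be illustrated with a small sketch for a single DAG. It assumes that depth means one plus the fewest edges from any root, so that roots sit at level 1; the text does not say how depth is resolved when a node is reachable by paths of different lengths, and the cross-graph alignment step is not covered here. Function names and the `children` format are hypothetical.

```python
from collections import deque


def node_depths(children, roots):
    """Granularity level of each node reachable from the roots.

    Assumption: level = 1 + fewest edges from any root, so roots are 1.
    children maps a node to the list of its more specific terms.
    """
    depth = {root: 1 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


def relative_label(depth, a, b):
    """Return 1 if a is more fine-grained than b, 0 if b is, None if tied."""
    if depth[a] == depth[b]:
        return None
    return 1 if depth[a] > depth[b] else 0
```

Under these assumptions, `node_depths` yields the targets for absolute granularity prediction and `relative_label` yields the targets for the pairwise task, with tied pairs left out.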
To predict the relative granularity, we used the concatenation of the BERT embeddings of two sentences as features to train a multi-layer perceptron (MLP). When comparing sentences within the same DAG, 76% of graphs obtained an accuracy larger than 0.80 (Figure 5a). We next examined the accuracy of classifying a pair of sentences from two different DAGs and also observed a good accuracy of 0.81. To predict the absolute granularity, we used the BERT embedding of each sentence as features to train an MLP-based multi-class classifier. We again observed desirable accuracies of 0.71 and 0.81 and Spearman correlations of 0.60 and 0.69 within each graph and across all graphs, respectively (Figure 5b,c). In addition to predicting sentence granularity, we envision that this new benchmark of sentence granularity could provide deeper insight into evaluating existing sentence similarity models by transforming the evaluation from a binary classification task to a ranking task.
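For reference, the Spearman correlation reported for absolute granularity prediction can be computed as the Pearson correlation of ranks. The sketch below is a plain-Python version that assigns average ranks to ties, which matters here because many sentences share the same granularity level. It is an illustration of the metric, not the evaluation code behind the reported numbers, and it assumes neither input is constant.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Assumes x and y have equal length and neither is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

A rank-based measure complements accuracy because it rewards predictions that are off by one level more than predictions that are off by many.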

Future work motivated by an opposite generation
Despite the overall improved performance of Graphex, we found that some definitions generated by Graphex present the opposite meaning to the true definition. We show one such example in Figure 6. Although the definition generated by Graphex for hyperplasia matches the true definition well, it has the opposite semantic meaning (e.g., reduction, reduced) to the true definition (e.g., increase, increased). Notably, such failed generations cannot be captured by existing n-gram based metrics, leading to artificially inflated scores. After a closer examination, we found that this opposite generation is caused by using the definition from a cousin node, hypoplasia, in the graph. Moreover, existing BERT-based models are not able to effectively associate the subword hypo (hyper) in the terminology with reduce (increase) in the definition. We plan to explore the possibility of developing faithful generation models (Wang et al., 2020d) to address this problem and leave it as important future work.
Figure 6: A failed generation that cannot be captured by existing evaluation metrics. Graphex generated a sentence that has the opposite meaning to the true definition.
Terminology (test): Hyperplasia
True definition: phenotype that is an increase in size of a tissue or organ due to increased numbers of cells, where the affected tissue or organ maintains its normal form.
Generated definition: phenotype that is a reduction in size of an organ or tissue compared to wild -type due to reduced
Terminology (training): Hypoplasia
True definition: Phenotype that is a reduction in size of an organ or tissue compared to wild-type due to reduced numbers of cells being produced during its development or growth.

Related Work
Existing works related to terminology definition mainly focus on definition extraction (Westerhout, 2009; Anke and Schockaert, 2018; Veyseh et al., 2020; Li et al., 2016) and technology entity recognition (Fahmi and Bouma, 2006; Gao et al., 2018). Definitions are extracted from different sources, such as Wikipedia (Espinosa-Anke and Saggion, 2014; Li et al., 2016) and scholarly articles (Jin et al., 2013; Spala et al., 2019). In contrast to previous work, we study the novel problem of terminology definition generation. Notably, the proposed dataset Graphine can also be used as a new benchmark to evaluate existing approaches for extracting definitions from free text.
Many scientific literature datasets have been curated for a variety of tasks, such as hypothesis generation (Spangler et al., 2014), scientific claim verification (Wadden et al., 2020), paraphrase identification (Dong et al., 2021) and citation recommendation (Saier and Färber, 2019). Paraphrase identification datasets, such as MSCOCO, Quora, MSR and ParaSCI, are most related to our work (Dong et al., 2021). Distinct from these datasets, we focus on a different task (i.e., definition generation) and a different domain (i.e., the biomedical domain).
Graph2text and data2text, which aim at generating text from structured data, have attracted increasing attention (Marcheggiani and Perez-Beltrachini, 2018; Cai and Lam, 2020; Yao et al., 2020; Guo et al., 2020; Wang et al., 2019). Among them, AMR-to-text generation and knowledge graph to text generation also consider graph structures. The Abstract Meaning Representation (AMR) represents the semantic information of each sentence using a rooted directed graph, where each edge is a semantic relation and each node is a concept (Song et al., 2018; Zhu et al., 2019; Mager et al., 2020; Wang et al., 2020c). Knowledge graph to text generation has advanced tasks such as entity description generation and medical image report by generating text from a subgraph in the knowledge graph (Cheng et al., 2020). Although all of these approaches consider graph structures, our method generates one sentence for each node on a large directed acyclic graph, whereas AMR-to-text and knowledge graph to text generation produce sentences for a subgraph or the entire graph.

Conclusion
We have introduced a novel dataset, Graphine, for studying definition generation. Graphine includes 2,010,648 terminology definition pairs from three major biomedical databases. Terminologies in Graphine form 227 directed acyclic graphs, which make Graphine a unique resource for exploring graph-aware text generation. We have proposed a graph-aware definition generation method, Graphex, which takes the graph structure into consideration. Graphex has obtained substantial improvement over methods that do not consider graph structures. Moreover, we have illustrated how Graphine can be used for other tasks, including comparing pretrained language models, comparing graph representation learning methods and predicting sentence granularity. Finally, we have analyzed the definitions generated by our method and proposed future directions for improvement. Collectively, we envision our dataset to be a unique resource for definition generation that could be broadly utilized by other natural language processing applications.