Heterogeneous Graph Neural Networks for Keyphrase Generation

The encoder–decoder framework achieves state-of-the-art results in keyphrase generation (KG) tasks by predicting both present keyphrases that appear in the source document and absent keyphrases that do not. However, relying solely on the source document can result in generating uncontrollable and inaccurate absent keyphrases. To address these problems, we propose a novel graph-based method that can capture explicit knowledge from related references. Our model first retrieves document–keyphrase pairs similar to the source document from a pre-defined index to serve as references. Then a heterogeneous graph is constructed to capture relations of different granularity between the source document and its retrieved references. To guide the decoding process, a hierarchical attention and copy mechanism is introduced, which directly copies appropriate words from both the source document and its references based on their relevance and significance. The experimental results on multiple KG benchmarks show that the proposed model achieves significant improvements over other baseline models, especially with regard to absent keyphrase prediction.


Introduction
Keyphrase generation (KG), a fundamental task in the field of natural language processing (NLP), refers to the generation of a set of keyphrases that expresses the crucial semantic meaning of a document. These keyphrases can be further categorized into present keyphrases that appear in the document and absent keyphrases that do not. Current KG approaches generally adopt an encoder–decoder framework (Sutskever et al., 2014) with attention mechanism (Bahdanau et al., 2015; Luong et al., 2015) and copy mechanism (Gu et al., 2016; See et al., 2017) to simultaneously predict present and absent keyphrases (Meng et al., 2017; Chen et al., 2018; Chan et al., 2019; Chen et al., 2019b,a; Yuan et al., 2020).

* Equal contribution.

[Figure 1: Proportion of present and absent keyphrases among four datasets. Although the previous methods for keyphrase generation have shown promising results on present keyphrase predictions, they are not yet satisfactory on the absent keyphrase predictions, which also occupy a large proportion.]

Although the proposed methods for keyphrase generation have shown promising results on present keyphrase predictions, they often generate uncontrollable and inaccurate predictions on the absent ones. The main reason is that there are numerous absent keyphrase candidates that have implicit relationships (e.g., technology hypernyms or task hypernyms) with the concepts in the document. For instance, for a document discussing "LSTM", all the technology hypernyms like "Neural Network", "RNN" and "Recurrent Neural Network" can be its absent keyphrase candidates. When dealing with scarce training data or limited model size, it is nontrivial for the model to summarize and memorize all the candidates accurately. Thus, one can expect that the generated absent keyphrases are often suboptimal when the candidate set in the model's mind is relatively small or inaccurate. This problem is crucial because absent keyphrases account for a large proportion of all the ground-truth keyphrases. As shown in Figure 1, in some datasets, up to 50% of the keyphrases are absent.
To address this problem, we propose a novel graph-based method to capture explicit knowledge from related references. Each reference is a retrieved document–keyphrase pair from a predefined index (e.g., the training set) that is similar to the source document. This is motivated by the fact that related references often contain candidate or even ground-truth absent keyphrases of the source document. Empirically, we find that three retrieved references cover up to 27% of the ground-truth absent keyphrases on average (see Section 4.3 for details).

[Figure 2: Graphical illustration of our proposed GATER. We first retrieve references using the source document, where each reference is the concatenation of a document–keyphrase pair from the training set. Then we construct a heterogeneous graph and perform iterative updating. Finally, the source document node is extracted to decode the keyphrase sequence with a hierarchical attention and copy mechanism.]
Our heterogeneous graph is designed to incorporate knowledge from the related references. It contains source document, reference and keyword nodes, and has the following advantages: (a) different reference nodes can interact with the source document through the explicitly shared keyword information, which can enrich the semantic representation of the source document; (b) a powerful structural prior is introduced, as the keywords overlap heavily with the ground-truth keyphrases. Statistically, we collect the top five keywords from each document on the validation set, and we find that these keywords contain 68% of the tokens in the ground-truth keyphrases. On the decoder side, as a portion of absent keyphrases directly appear in the references, we propose a hierarchical attention and copy mechanism for copying appropriate words from both the source document and its references based on their relevance and significance.
The main contributions of this paper can be summarized as follows: (1) we design a heterogeneous graph network for keyphrase generation, which can enrich the source document node through keyword nodes and retrieved reference nodes; (2) we propose a hierarchical attention and copy mechanism to facilitate the decoding process, which can copy appropriate words from both the source document and retrieved references; and (3) our proposed method outperforms other state-of-the-art methods on multiple benchmarks, and especially excels in absent keyphrase prediction. Our code is publicly available on GitHub 1.

Methodology
In this work, we propose a heterogeneous Graph ATtention network basEd on References (GATER) for keyphrase generation, as shown in Figure 2. Given a source document, we first retrieve related documents from a predefined index 2 and concatenate each retrieved document with its keyphrases to serve as a reference. Then we construct a heterogeneous graph that contains document nodes 3 and keyword nodes based on the source document and its references. The graph is updated iteratively to enhance the representation of the source document node. Finally, the source document node is extracted to decode the keyphrase sequence. To facilitate the decoding process, we also introduce a hierarchical attention and copy mechanism, with which the model directly attends to and copies from both the source document and its references. The hierarchical arrangement ensures that more semantically relevant words, and words in more relevant references, are given larger weights for the current decision.

1 https://github.com/jiacheng-ye/kg_gater
2 We use the training set as our reference index in our experiments; it can also be easily extended to an open corpus.
3 Note that the source document and the references are the two specific kinds of content represented by document nodes.

Reference Retriever
Given a source document x, we first use a reference retriever to output several related references from the training set. To make full use of both the retrieved document and retrieved keyphrases, we denote a reference as the concatenation of the two.
We find that a term frequency-inverse document frequency (TF-IDF) based retriever provides a simple but efficient means to accomplish the retrieval task. Specifically, we first represent the source document and all the reference candidates as TF-IDF weighted uni/bigram vectors. Then, the most similar K references X^r = {x^r_i}_{i=1,...,K} are retrieved by comparing the cosine similarities between the vector of the source document and those of all the reference candidates.
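To make the retrieval step concrete, the following is a minimal self-contained sketch of a TF-IDF retriever over uni/bigram vectors. The function names and the sparse dict-vector representation are our own illustrative choices; the tokenization and weighting details are simplified relative to a production implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    # Uni- and bigram features, as used by the TF-IDF retriever.
    feats = list(tokens)
    feats += [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    return feats

def tfidf_vectors(docs):
    # docs: list of token lists; returns one sparse dict vector per document.
    tf = [Counter(ngrams(d)) for d in docs]
    df = Counter()
    for c in tf:
        df.update(set(c))
    n = len(docs)
    return [{t: f * math.log(n / df[t]) for t, f in c.items()} for c in tf]

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(source, index_docs, k=3):
    # Rank all candidate references by cosine similarity to the source
    # and return the indices of the top-k most similar ones.
    vecs = tfidf_vectors([source] + index_docs)
    src, cands = vecs[0], vecs[1:]
    scored = sorted(range(len(cands)), key=lambda i: -cosine(src, cands[i]))
    return scored[:k]
```

In practice the index holds the whole training set, so the candidate vectors would be precomputed once rather than rebuilt per query as above.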

Graph Construction
Given the source document x and its references X^r, we select the top-k unique words as keywords based on their TF-IDF weights from the source document and each reference. The additional keyword nodes can enrich the semantic representation of the source document through message passing, and introduce prior knowledge for keyphrase generation, given the high overlap between keywords and keyphrases. We then build a heterogeneous graph based on the source document, references, and keywords.
Formally, our undirected heterogeneous graph can be defined as G = (V, E), where V = V_w ∪ V_d and E = E_w2d ∪ E_d2d. Here, V_w = {w_i} (i ∈ {1, . . . , m}) denotes the m unique keyword nodes of the source document and K references, V_d = x ∪ X^r corresponds to the source document node and K reference nodes, E_d2d = {e_k} (k ∈ {1, . . . , K}) where e_k represents the edge weight between the k-th reference and the source document, and E_w2d = {e_{i,j}} (i ∈ {1, . . . , m}, j ∈ {1, . . . , K + 1}) where e_{i,j} indicates the edge weight between the i-th keyword and the j-th document.
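The construction above can be sketched as follows. This is an illustrative simplification: `top_keywords` and `build_graph` are hypothetical helper names, and the d2d edge weight is left as a placeholder constant where the paper uses a TF-IDF similarity between the two documents.

```python
import math
from collections import Counter

def top_keywords(doc_tokens, idf, k=3):
    # Score each unique word by TF-IDF and keep the top-k.
    tf = Counter(doc_tokens)
    scored = sorted(tf, key=lambda w: -tf[w] * idf.get(w, 0.0))
    return scored[:k]

def build_graph(source, references, k=3):
    docs = [source] + references            # document node 0 is the source
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))
    idf = {w: math.log(n / df[w]) for w in df}

    keywords = []                           # V_w: unique keyword nodes
    e_w2d = {}                              # (keyword_idx, doc_idx) -> weight
    for j, d in enumerate(docs):
        for w in top_keywords(d, idf, k):
            if w not in keywords:
                keywords.append(w)
            tf = d.count(w) / len(d)
            e_w2d[(keywords.index(w), j)] = tf * idf[w]

    # E_d2d: one edge between the source and each reference. In the paper
    # this weight is the TF-IDF similarity; here it is a placeholder.
    e_d2d = {ref_j: 1.0 for ref_j in range(1, n)}
    return keywords, e_w2d, e_d2d
```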

Graph Initializers
Node Initializers There are two types of nodes in our heterogeneous graph (i.e., document nodes V_d and keyword nodes V_w). For each document node, following previous works (Meng et al., 2017; Chen et al., 2019a), an embedding lookup table e_w is first applied to each word, and then a bidirectional Gated Recurrent Unit (GRU) (Cho et al., 2014) is used to obtain the context-aware representation of each word. The representations of document x and of each word are defined as the concatenation of the forward and backward hidden states (the final hidden states and the per-word hidden states, respectively). For each keyword node, since the same keyword may appear in multiple documents, we simply use the word embedding as its initial node representation w_i = e_w(w_i).
Edge Initializers There are two types of edges in our heterogeneous graph (i.e., the document-to-document edges E_d2d and the keyword-to-document edges E_w2d). To encode the significance of the relationships between keyword and document nodes, we infuse TF-IDF values into the edge weights of E_w2d. Similarly, we also infuse TF-IDF values into the edge weights of E_d2d as a prior statistical n-gram similarity between documents. The two types of floating-point TF-IDF weights are then discretized into integers and mapped to dense vectors using the embedding matrices e_d2d and e_w2d.
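The discretization step can be illustrated as a simple bucketing scheme. The boundary values, the number of buckets, and the embedding dimension below are illustrative assumptions, not the values used in the paper.

```python
import random
random.seed(0)

def bucketize(weight, boundaries=(0.02, 0.05, 0.1, 0.2, 0.4)):
    # Map a floating-point TF-IDF weight to an integer bucket id; each id
    # then indexes a learned embedding row (e_w2d / e_d2d in the paper).
    for i, b in enumerate(boundaries):
        if weight < b:
            return i
    return len(boundaries)

# Toy embedding matrix standing in for a learned one.
EMB_DIM, N_BUCKETS = 4, 6
edge_emb = [[random.uniform(-0.1, 0.1) for _ in range(EMB_DIM)]
            for _ in range(N_BUCKETS)]

def edge_feature(weight):
    # Look up the dense edge feature for a raw TF-IDF weight.
    return edge_emb[bucketize(weight)]
```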

Graph Aggregating and Updating
Aggregator Graph attention networks (GAT) (Velickovic et al., 2018) are used to aggregate information for each node. We denote the hidden states of input nodes as h_i ∈ R^{d_h}, where i ∈ {1, . . . , N}. With the additional edge feature, the aggregator is defined as follows:

z_{ij} = LeakyReLU(W_a [W_q h_i ; W_k h_j ; e_{ij}]),
α_{ij} = exp(z_{ij}) / Σ_{l ∈ N_i} exp(z_{il}),
u_i = Σ_{j ∈ N_i} α_{ij} W_v h_j,

where e_{ij} is the embedding of the edge feature, α_{ij} is the attention weight between h_i and h_j, and u_i is the aggregated feature. For simplicity, we will use GAT(H, H, H, E) to denote the GAT aggregating layer, where H is used for query, key, and value, and E is used as edge features.
Updater To update the node states, similar to the approach used in the Transformer (Vaswani et al., 2017), we introduce a residual connection and a position-wise feed-forward (FFN) layer consisting of two linear transformations. Given an undirected heterogeneous graph G with node features H_w ∪ H_d and edge features E_w2d ∪ E_d2d, we update each type of node separately as follows:

U_w = GAT(H_w, H_d, H_d, E_w2d),  H_w = FFN(U_w + H_w),
U_d = GAT(H_d, H_w, H_w, E_w2d),  H_d = FFN(U_d + H_d),
U_d = GAT(H_d, H_d, H_d, E_d2d),  H_d = FFN(U_d + H_d),

i.e., word nodes are updated first by aggregating document-level information from document nodes, then document nodes are updated by the updated word nodes, and finally document nodes are updated again by the other updated document nodes. The above process is executed iteratively for I steps to realize better document representations.
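As a deliberately simplified illustration of one aggregation step, the sketch below uses scalar node states and scalar weights (w_q, w_k, w_e) standing in for the projection matrices W_q, W_k, W_v and W_a; it is a toy sketch of the attention-with-edge-features mechanism, not our implementation.

```python
import math

def leaky_relu(x, a=0.2):
    return x if x > 0 else a * x

def gat(H_q, H_kv, E, W):
    # One single-head GAT aggregation step over scalar node states.
    # H_q: query node states; H_kv: key/value node states;
    # E[i][j]: scalar edge feature between query i and key j;
    # W: scalar weights (w_q, w_k, w_e) replacing the projection matrices.
    w_q, w_k, w_e = W
    out = []
    for i, hq in enumerate(H_q):
        scores = [leaky_relu(w_q * hq + w_k * hk + w_e * E[i][j])
                  for j, hk in enumerate(H_kv)]
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        z = sum(exps)
        alpha = [e / z for e in exps]       # attention weights sum to 1
        out.append(sum(a * hk for a, hk in zip(alpha, H_kv)))
    return out

def ffn(u, h):
    # Residual connection; the position-wise FFN is reduced to identity here.
    return u + h
```

Stacking three such calls (word-from-document, document-from-word, document-from-document) and repeating for I iterations gives the update schedule described above.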
After the heterogeneous graph encoder finishes, we obtain the reference-aware representation d^s of the source document, the representations D^r = {d^r_i}_{i=1,...,K} of the references, the hidden states M^s = {m^s_j}_{j=1,...,L_s} of each word in the source document, and the hidden states M^r_i = {m^r_{i,j}}_{j=1,...,L^r_i} of each word in the i-th reference, where m^r_{i,j} denotes the encoder hidden state of the j-th word of the i-th reference. All the features described above (i.e., d^s, D^r, M^s and M^r) will be used in the reference-aware decoder.

Reference-aware Decoder
After encoding the document into a reference-aware representation d^s, we propose a hierarchical attention and copy mechanism to further incorporate the reference information by attending to and copying words from both the source document and the references.
We use d^s as the initial hidden state of a GRU decoder, and the decoding process at time step t is described as follows:

h_t = GRU(h_{t-1}, [e_w(y_{t-1}) ; c_{t-1}]),
c_t = hier_attn(h_t, M^s, M^r, D^r),

where c_t is the context vector and the hierarchical attention mechanism hier_attn is defined as follows:

a^s_t = attn(h_t, M^s),          c^s_t = Σ_j a^s_{t,j} m^s_j,
a^r_{t,i} = attn(h_t, M^r_i),    c^r_{t,i} = Σ_j a^r_{t,i,j} m^r_{i,j},
β_t = attn(h_t, D^r),            c^r_t = Σ_i β_{t,i} c^r_{t,i},
c_t = W_c [c^s_t ; c^r_t],

where c^s_t and c^r_t are the context vectors from the source document and the references, respectively. All the attention distributions described above are computed as in Bahdanau et al. (2015).
To alleviate the out-of-vocabulary (OOV) problem, a copy mechanism (See et al., 2017) is generally adopted. To further guide the decoding process by copying appropriate words from references based on their relevance and significance, we propose a hierarchical copy mechanism. Specifically, a dynamic vocabulary V' is constructed by merging the predefined vocabulary V, the words in the source document V_x and all the words in the references V_{X^r}. Thus, the probability of predicting a word y_t is computed as follows:

P(y_t) = p_1 P_V(y_t) + p_2 P_{V_x}(y_t) + p_3 P_{V_{X^r}}(y_t),

where P_V(y_t) is the generative probability over the predefined vocabulary V, P_{V_x}(y_t) = Σ_{i: x_i = y_t} a^s_{t,i} is the copy probability from the source document, P_{V_{X^r}}(y_t) = Σ_{i,j: x^r_{i,j} = y_t} β_{t,i} a^r_{t,i,j} is the copy probability from all the references, and p = [p_1; p_2; p_3] = softmax(W_p [h_t ; c_t ; e_w(y_{t-1})]) ∈ R^3 serves as a soft switcher that determines the preference for selecting the word from the predefined vocabulary, the source document, or the references.
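The mixing of the generation and copy distributions can be sketched as below. The function `final_distribution` and its flat list arguments are illustrative simplifications: real attention scores come from the decoder, and the reference attention here is already flattened over references and positions.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def final_distribution(p_vocab, src_attn, src_words, ref_attn, ref_words,
                       switch_logits, vocab):
    # Mix the generative distribution with the two copy distributions
    # via a 3-way soft switcher p = softmax(switch_logits).
    p_gen, p_src, p_ref = softmax(switch_logits)
    dyn = {w: p_gen * p for w, p in zip(vocab, p_vocab)}
    for a, w in zip(src_attn, src_words):       # copy from the source document
        dyn[w] = dyn.get(w, 0.0) + p_src * a
    for a, w in zip(ref_attn, ref_words):       # copy from the references
        dyn[w] = dyn.get(w, 0.0) + p_ref * a
    return dyn                                   # distribution over dynamic vocab
```

Because each component is itself a distribution and the switcher weights sum to one, the mixture is a valid probability distribution over the dynamic vocabulary.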

Training
The proposed GATER model is independent of any specific training method, so we can use either the ONE2ONE training paradigm (Meng et al., 2017), where the target keyphrase set Y = {y_i}_{i=1,...,|Y|} is split into multiple training targets for a source document x:

L = − Σ_{i=1}^{|Y|} log P(y_i | x),

or the ONE2SEQ training paradigm (Ye and Wang, 2018; Yuan et al., 2020), where all the keyphrases are concatenated into one training target:

L = − log P(y | x),

where y is the concatenation of the keyphrases in Y by a delimiter.
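The difference between the two paradigms amounts to how the training targets are formatted from the same keyphrase set. The sketch below uses illustrative `<sep>`/`<eos>` token strings; the actual delimiter tokens are implementation details.

```python
def one2one_targets(keyphrases):
    # ONE2ONE: each keyphrase becomes an independent target sequence
    # paired with the same source document.
    return [kp + " <eos>" for kp in keyphrases]

def one2seq_target(keyphrases, sep="<sep>"):
    # ONE2SEQ: all keyphrases concatenated into a single target sequence,
    # terminated by an end-of-sequence token.
    return f" {sep} ".join(keyphrases) + " <eos>"
```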

Datasets
We conduct our experiments on four scientific article datasets, including NUS (Nguyen and Kan, 2007), Krapivin (Krapivin et al., 2009), SemEval (Kim et al., 2010) and KP20k (Meng et al., 2017). Each sample from these datasets consists of a title, an abstract, and some keyphrases given by the authors of the papers. Following previous works (Meng et al., 2017; Chen et al., 2019b,a; Yuan et al., 2020), we concatenate the title and abstract as a source document. We use the largest dataset (i.e., KP20k) for model training, and the testing sets of all four datasets for evaluation. After preprocessing (i.e., lowercasing, replacing all the digits with the symbol "digit" and removing the duplicated data), the final KP20k dataset contains 509,818 samples for training, 20,000 for validation and 20,000 for testing. The numbers of test samples in NUS, Krapivin and SemEval are 211, 400 and 100, respectively.

Baselines
For a comprehensive evaluation, we verify our method under both training paradigms (i.e., ONE2ONE and ONE2SEQ) and compare with the following methods:

• catSeq (Yuan et al., 2020). The RNN-based seq2seq model with copy mechanism under the ONE2SEQ training paradigm. CopyRNN (Meng et al., 2017) is the same model under the ONE2ONE training paradigm.

• catSeqD (Yuan et al., 2020). An extension of catSeq with orthogonal regularization (Bousmalis et al., 2016).

We keep all the parameters the same as those reported in Chan et al. (2019); hence, we only report the parameters of the additional graph module. We retrieve 3 references and extract the top 20 keywords from the source document and each reference to construct the graph. We set the number of attention heads to 5 and the number of iterations to 2, based on the validation set. During training, we use a dropout rate of 0.3 for the graph layer and a batch size of 12 and 64 for the ONE2SEQ and ONE2ONE training paradigms, respectively. During testing, we use greedy search for ONE2SEQ, and beam search with a maximum depth of 6 and a beam size of 200 for ONE2ONE. We repeat the experiments of our model three times using different random seeds and report the averaged results.

Evaluation Metrics
For the model trained under the ONE2ONE paradigm, as in previous works (Meng et al., 2017; Chen et al., 2018, 2019b), we use macro-averaged F_1@5 and F_1@10 for present keyphrase predictions, and R@10 and R@50 for absent keyphrase predictions. For the model trained under the ONE2SEQ paradigm, we follow Chan et al. (2019) and use F_1@M, which compares all the predicted keyphrases with the ground-truth keyphrases, which means it considers the number of predictions. We apply the Porter Stemmer before determining whether two keyphrases are identical and remove all the duplicated keyphrases after stemming.
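For concreteness, a minimal sketch of the F_1@M and F_1@k computations follows. It assumes exact string matching; the Porter stemming step described above is omitted here for brevity.

```python
def f1_at_m(predictions, targets):
    # F1@M compares all M (de-duplicated) predictions against the ground
    # truth, so the number of predictions directly affects the score.
    preds = list(dict.fromkeys(predictions))    # de-duplicate, keep order
    if not preds or not targets:
        return 0.0
    correct = sum(1 for p in preds if p in targets)
    precision = correct / len(preds)
    recall = correct / len(targets)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f1_at_k(predictions, targets, k=5):
    # F1@k truncates the ranked prediction list to its top-k before scoring.
    return f1_at_m(predictions[:k], targets)
```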

Ablation Study
To examine the contribution of each component in GATER, we conduct ablation experiments on the largest dataset KP20k, the results of which are presented in Table 3. For the input references, the model's performance is degraded if either the retrieved documents or retrieved keyphrases are removed, which indicates that both are useful for keyphrase prediction. For the heterogeneous graph encoder, the graph becomes a heterogeneous bipartite graph when the d2d edges are removed, and a homogeneous graph when the w2d edges are removed. We can see that both result in degraded performance due to the lack of interaction. Removing both the d2d edges and the w2d edges means that the reference information is only used on the decoder side with the reference-aware decoder, which further degrades the results. For the reference-aware decoder, we find the hierarchical attention and copy mechanism to be essential to the performance of GATER. This indicates the importance of integrating knowledge from references on the decoder side.

Quality and Influence of References
As our graph is based on the retrieved references, we also investigated the quality and influence of the references. We define the quality of the retrieved references as the transforming rate of absent keyphrases (i.e., the proportion of absent keyphrases that appear in the retrieved references). Intuitively, references that contain more absent keyphrases provide more explicit knowledge for the model generation.

[Figure 3: Transforming rate and ΔF_1@M for absent keyphrases under different types of retrievers on the KP20k dataset for catSeq-GATER. We study a random retriever, a sparse retriever based on TF-IDF and a dense retriever based on SPECTER, with the number of references varied over 0, 1, 3, 5 and 10.]

As shown on the left part of Figure 3, the simple sparse retriever based on TF-IDF outperforms the random retriever by a large margin regarding reference quality. We also use the dense retriever SPECTER (Cohan et al., 2020; https://github.com/allenai/specter), a BERT-based model pretrained on scientific documents. We find that using a dense retriever further improves the transforming rate of absent keyphrases. The right part of Figure 3 shows the influence of the references: random references degrade the model performance as they contain a lot of noise. Surprisingly, we can obtain a 2.6% performance boost in absent keyphrase prediction by considering only the most similar references with a sparse or dense retriever, and introducing more than three references does not further improve the performance. One possible explanation is that although more references lead to a higher transforming rate of absent keyphrases, they also introduce more irrelevant information, which interferes with the judgment of the model.
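The transforming-rate metric itself is straightforward to compute. The sketch below uses naive substring matching over raw reference strings as a simplification; the actual evaluation matches keyphrases after tokenization and stemming.

```python
def transforming_rate(absent_keyphrases, references):
    # Fraction of ground-truth absent keyphrases that appear verbatim
    # in at least one retrieved reference (document + keyphrases text).
    if not absent_keyphrases:
        return 0.0
    hits = sum(1 for kp in absent_keyphrases
               if any(kp in ref for ref in references))
    return hits / len(absent_keyphrases)
```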

Incorporating Baselines with GATER
Our proposed GATER can be considered as an extra plugin for incorporating knowledge from references on both the encoder and decoder sides, which can also be easily applied to other models. We investigate the effects of adding GATER to other baseline models in Table 4. We note that GATER enhances the performance of all the baseline models in predicting both present and absent keyphrases. This further demonstrates the effectiveness and portability of the proposed method.

Case Study
We display a prediction example by baseline models and CopyRNN-GATER in Figure 4. Our model generates more accurate present and absent keyphrases compared with the baselines. For instance, we observe that CopyRNN-GATER successfully predicts the absent keyphrase "porous medium" as it appears in the retrieved documents, while both CopyRNN and KG-KE-KR-M fail. This demonstrates that using both the retrieved documents and keyphrases as references provides more knowledge (e.g., candidates of the ground-truth absent keyphrases) compared with using keyphrases alone as in KG-KE-KR-M.

Keyphrase Extraction and Generation
Existing approaches for keyphrase prediction can be broadly divided into extraction and generation methods. Early work mostly uses a two-step approach for keyphrase extraction. First, a large set of candidate phrases is extracted by handcrafted rules (Mihalcea and Tarau, 2004; Medelyan et al., 2009; Liu et al., 2011). Then, these candidates are scored and reranked based on unsupervised methods (Mihalcea and Tarau, 2004; Wan and Xiao, 2008) or supervised methods (Hulth, 2003; Nguyen and Kan, 2007). Other extractive approaches utilize neural-based sequence labeling methods (Gollapalli et al., 2017).
Keyphrase generation is an extension of keyphrase extraction which also considers absent keyphrase prediction. Meng et al. (2017) proposed a generative model, CopyRNN, based on the encoder–decoder framework (Sutskever et al., 2014). They employed the ONE2ONE paradigm, which uses a single keyphrase as the target sequence. Since CopyRNN uses beam search to make each prediction independently, it lacks dependency among the generated keyphrases, which results in many duplicates. CorrRNN (Chen et al., 2018) introduced a review mechanism to consider the hidden states of the previously generated keyphrases. Ye and Wang (2018) proposed to use a separator token sep to concatenate all keyphrases into one sequence for training. With this setup, the seq2seq model is capable of generating all possible keyphrases in one sequence as well as capturing the contextual information between the keyphrases. However, it still uses beam search to generate multiple keyphrase sequences with a fixed beam depth, and then performs keyphrase ranking to select the top-k keyphrases as output. Yuan et al. (2020) proposed catSeq under the ONE2SEQ paradigm by adding a special token eos at the end to terminate the decoding process. They further introduced catSeqD, which maximizes the mutual information between all the keyphrases and the source text and uses orthogonal constraints (Bousmalis et al., 2016) to ensure the coverage and diversity of the generated keyphrases. Many works have been conducted based on the ONE2SEQ paradigm (Chen et al., 2019a; Chan et al., 2019; Chen et al., 2020; Meng et al., 2021; Luo et al., 2020). Chen et al. (2019a) proposed to use the keyphrases of retrieved documents as an external input. However, the keyphrases alone lack semantic information, and the potential knowledge in the retrieved documents is also ignored. In contrast, our method makes full use of both retrieved documents and keyphrases as references.
Since catSeq tends to generate shorter sequences, Chan et al. (2019) introduced a reinforcement learning approach to encourage the model to generate the correct number of keyphrases with an adaptive reward (i.e., F_1 and Recall). More recently, a two-stage reinforcement learning-based fine-tuning approach with a fine-grained reward score was introduced, which also considers the semantic similarities between predictions and targets. A ONE2SET paradigm was also proposed to predict the keyphrases as a set, which eliminates the bias caused by the predefined order in the ONE2SEQ paradigm. Our method can also be integrated into these methods to further improve performance, as shown in Section 4.4.

Heterogeneous Graph for NLP
Different from a homogeneous graph, which considers only a single type of node or link, a heterogeneous graph can deal with multiple types of nodes and links (Shi et al., 2016). Linmei et al. (2019) constructed a topic-entity heterogeneous neural graph for semi-supervised short text classification. Tu et al. (2019) introduced a heterogeneous graph neural network to encode documents, entities, and candidates together for multi-hop reading comprehension. Wang et al. (2020) presented a heterogeneous graph neural network with word, sentence, and document nodes for extractive summarization. In our paper, we study the keyword-document heterogeneous graph network for keyphrase generation, which has not been explored before.

Conclusions
In this paper, we propose a graph-based method that can capture explicit knowledge from related references. Our model consists of a heterogeneous graph encoder to model different granularity of relations among the source document and its references, and a hierarchical attention and copy mechanism to guide the decoding process. Extensive experiments demonstrate the effectiveness and portability of our method on both the present and absent keyphrase predictions.