Structure-Augmented Keyphrase Generation

This paper studies the keyphrase generation (KG) task for scenarios where structure plays an important role. For example, a scientific publication consists of a short title and a long body, where the title can be used to de-emphasize unimportant details in the body. Similarly, short social media posts (e.g., tweets) suffer from scarce context, which can be augmented from titles, though titles are often missing. Our contribution is to generate/augment structure, using existing keyphrases of other documents to complement missing or incomplete titles, and then inject this information into the encoding. We propose novel structure-augmented document encoding approaches that consist of the following two phases: The first phase, generating structure, extends the given document with related but absent keyphrases, augmenting missing context. The second phase, encoding structure, builds a graph of keyphrases and the given document to obtain a structure-aware representation of the augmented text. Our empirical results validate that our proposed structure augmentation and augmentation-aware encoding/decoding can improve KG in both scenarios, outperforming the state of the art.


Introduction
Keyphrases not only help human readers gain immediate insights about a given document, but also make documents queryable, e.g., by hashtagging social posts with predicted keyphrases. Among keyphrase tasks, we focus on those that allow words not present in the given document, often called absent words, to be keyphrases, since the input document can often be context-scarce: many potentially relevant keyphrases are missing from the given document, as a keyphrase can be expressed in terms different from those present in the document (i.e., vocabulary mismatch).
Our goal is to generate keyphrases that are likely to be words absent from the document, especially in the following scenarios: (a) scientific publications with diverse vocabularies, or (b) short social posts. In the public, real-life datasets used in our evaluation, we found that, on average, 37% of keyphrases in scientific publications and 67% of hashtags in social media posts are absent from the given document. Thus, we formulate our problem as keyphrase generation (KG) (Meng et al., 2017): predicting keyphrases, including absent words as well as present words, adopting an encoder-decoder architecture.

Table 1: An example of a social Q&A post consisting of a title and the main body question. We present gold keyphrases labeled by the user, and existing keyphrases labeled for other related posts. Bold-face words can indicate the topic of the post.
A recent trend is leveraging document structure for KG (Chen et al., 2019b), where metadata in documents, e.g., the document title, clarifies the meaning of documents (Kim et al., 2021). For illustration, Table 1 shows a social Q&A post consisting of a title (the first row) and a main body question (the second row), for which we need to predict the gold keyphrases "question answering" and "natural language processing" (the last row). The concise title enables readers to focus on the important parts of the main question (bold-face words) while ignoring details. However, titles often exclude meaningful keyphrases (e.g., "natural language processing" in Table 1) due to their inherent length limitation, and, making matters worse, titles may not exist at all in some social media (e.g., tweets).
Our contribution is to construct a structured document X+, even when the structure is not given, by leveraging observed keyphrases from other documents in the training dataset, to improve both encoding and decoding. For encoding, keyphrases can be used to emphasize relevant parts in X, similarly to the title. For decoding, the keyphrases can augment the vocabulary, e.g., with "natural language processing", from which the decoder can copy words. We stress that our work is designed for both closed- and open-set scenarios, represented by the social and scientific document scenarios respectively, with the following distinctions:

• Closed set: Hashtags in social media posts are frequently reused (i.e., keyphrases can be copied for decoding from keyphrases observed in the training set).
• Open set: A significant fraction of keyphrases of scientific publications (about 20% in our target scenario) have never been observed in the training dataset (i.e., the keyphrase candidate set is open-ended).

Our proposed solution is two-phased: The first phase, constructing a structured document X+, augments the given document X by retrieving relevant keyphrases R from the existing keyphrases in the training dataset. The second phase then encodes structure-aware representations of X+. We use a graph representation to effectively integrate X and R, where the graph can be flexibly designed depending on the closed/open-set scenario.
Our empirical results validate that generating and encoding document structures significantly improves performance, outperforming the state of the art, for both social and scientific documents.

Related Work
We briefly explain our target task of KG and introduce our distinction of leveraging structures.
Observing words in the document is a crucial signal for keyphrase tasks, especially for keyphrase extraction (KE), which requires keyphrases to appear in the given document. In contrast, we focus on KG, as our target scenarios cannot afford such a restriction. KG, which can generate absent words, is often modeled with a neural encoder-decoder architecture (Sutskever et al., 2014) that generates the keyphrase sequence given the input document (Meng et al., 2017). KG approaches can be further categorized into two settings: one-to-one (O2O) and one-to-seq (O2S) (Yuan et al., 2020; Ye et al., 2021). In O2O, a model is trained to generate a single keyphrase for each document; for evaluation, the model then generates multiple keyphrases using beam search decoding with a large beam size (e.g., 200). In O2S, a model is trained to generate multiple keyphrases, where the keyphrases for each document are concatenated into a single sequence with a predefined delimiter. Our model follows O2O, as O2O is known to be better for predicting absent keyphrases (Meng et al., 2021), and absent keyphrases are frequently observed in our context-scarce target scenarios. The distinction of our proposal for KG is to leverage structures in documents, to improve both the encoder and the decoder.
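The two training settings above can be sketched as follows; the delimiter token `<sep>` is an illustrative assumption, not a value specified in the paper.

```python
def make_training_examples(doc, keyphrases, mode="O2O", delimiter="<sep>"):
    """Build (document, target) pairs for the two KG training settings:
    O2O yields one pair per gold keyphrase, while O2S joins all gold
    keyphrases into a single target sequence with a predefined delimiter."""
    if mode == "O2O":
        return [(doc, kp) for kp in keyphrases]
    return [(doc, f" {delimiter} ".join(keyphrases))]

kps = ["question answering", "natural language processing"]
o2o = make_training_examples("doc text", kps, mode="O2O")
o2s = make_training_examples("doc text", kps, mode="O2S")
```

Under O2O, the model above would be trained on two separate pairs, and multiple keyphrases are produced only at inference time via beam search.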
Structures in documents are essential sources of prediction signals (Kim et al., 2021). For example, for a scientific publication consisting of a title and a main body, the structured document enjoys the complementary strengths of the two fields: while the main body contains many keywords (i.e., high recall), the title, though much shorter, concisely describes the main focus of the paper (i.e., high precision). To leverage such structure, TGNet (Chen et al., 2019b) uses the title to guide the encoder to accurately capture core contents. However, the given title is often insufficient due to its length limitation (Li et al., 2010), which we consistently observed in our evaluations.
As a further extreme, short social media posts may not have titles. Furthermore, because of length limits (e.g., 140-character tweets), most keywords may not appear in the given post (i.e., low recall). In such scenarios, one can construct structured posts, augmenting posts with the missing keywords. For example, TAKG (Wang et al., 2019) utilizes topic modeling, where topics shared across other documents enable the encoder to leverage contexts observed in other related posts. However, a small, fixed number of topics (e.g., 15 or 30) is limited in differentiating diverse documents with similar topics.
Our distinction is overcoming incomplete or missing structure by generating a virtual structure from existing keyphrases (keyphrases in the train set), from which we "retrieve" terms that can serve as titles or topics for encoding. KG-KE-KR (Chen et al., 2019a) similarly leverages keyphrase retrieval, but uses it only for decoding, not for structure-aware encoding. In contrast, we jointly contextualize the given document and the retrieved keyphrases, allowing both fields to exchange contexts with each other. We empirically validate that our proposed structure-augmented encoding significantly boosts performance.

Approach
Given an input document X with $N_X$ unique words, KG models aim to output a set of target keyphrases, which can be implemented using an encoder-decoder architecture. We train a model to predict a single keyphrase; then, for inference, we generate multiple keyphrases using beam search decoding. We denote a single keyphrase of length $T$ by $Y=[y_1,\dots,y_T]$.
In the following sections, we first describe a baseline encoding targeting plain text X ( §3.1), and then, we explain our distinction of generating structure ( §3.2.1) and encoding structure ( §3.2.2).
In this section, we do not consider a pre-existing title T, but our framework straightforwardly extends to such titles, as we discuss later in §4.4.1.
For graph construction, a fully connected graph is used, where nodes are the unique words in X. For edges, two adjacency matrices $\overleftarrow{A}^X_{ij}$ and $\overrightarrow{A}^X_{ij}$, using position-based proximity, are obtained for the forward and backward directions respectively, as defined in Eq (1). Given the graph, a Graph Convolutional Network (GCN) (Kipf and Welling, 2017) is adopted to contextualize node representations. We denote the number of stacked graph convolution layers by $L$, the number of node features by $D$, and the contextualized node representations at the $l$-th layer by $H^X_l \in \mathbb{R}^{N_X \times D}$, where $H^X_1$ is obtained from a word embedding matrix. As general notation, we denote learnable parameters by $v$ for vectors and $W$ for matrices, with different super/subscripts.
Starting from $H^X_1$, context vectors $C^X_l \in \mathbb{R}^{N_X \times D}$ are gathered from neighbor nodes using graph convolution (Eq (2)), then combined with $H^X_l$ to produce $H^X_{l+1}$ (Eq (3)) (Dauphin et al., 2017), where $\sigma$ denotes the sigmoid function, and $\overleftarrow{A}^X$ and $\overrightarrow{A}^X$ are normalized matrices with eigenvalues close to 1 for stable training (Kipf and Welling, 2017). In Eq (3), residual addition is employed with $G^X_l \in \mathbb{R}^{N_X \times D}$, which helps gradient back-propagation through deep layers (He et al., 2016; Dauphin et al., 2017). $G^X_l$ is obtained in the same way as $C^X_l$ in Eq (2), but with different learnable parameters.
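As one concrete reading of Eqs (2)-(3), the sketch below implements a single gated graph-convolution layer in NumPy, assuming a GLU-style sigmoid gate with residual addition; the parameter names `W_c`, `W_g` and the use of a single (already normalized) adjacency matrix are simplifying assumptions, not the paper's exact parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_gcn_layer(H, A_hat, W_c, W_g):
    """One gated graph-convolution layer (a sketch of Eqs (2)-(3)).

    H:      (N, D) node features at layer l
    A_hat:  (N, N) normalized adjacency matrix
    W_c, W_g: (D, D) projections for contexts C and gates G
              (hypothetical names; forward/backward matrices are merged here)
    """
    C = A_hat @ H @ W_c          # gather contexts from neighbors (Eq (2))
    G = A_hat @ H @ W_g          # gates, computed like C with separate weights
    return H + C * sigmoid(G)    # gated combination with residual addition (Eq (3))

# toy example: 3 nodes, 4 features
rng = np.random.default_rng(0)
H1 = rng.normal(size=(3, 4))
A_hat = np.full((3, 3), 1.0 / 3.0)  # fully connected, uniformly normalized
H2 = gated_gcn_layer(H1, A_hat, rng.normal(size=(4, 4)), rng.normal(size=(4, 4)))
```

Stacking $L$ such layers yields the contextualized representations $H^X_L$ fed to the decoder.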
We adopt this graph construction and GCN contextualization for encoding X. Built upon the GCN encoder, our distinction is to generate and encode structure for X.

Proposed: Structure-Augmented KG
Beyond plain text, we aim to leverage structures in documents. Though leveraging the title-body structure in scientific publications has been shown to be effective (Chen et al., 2019b), the given titles are often found to be incomplete because of their limited length (Li et al., 2010), or unavailable altogether (e.g., tweets).
Our goal is, given incomplete or missing structure, to generate and encode structures, to replace missing titles or complement incomplete titles.

Generating Structure
To generate a structured document X+, we leverage existing keyphrases of other documents, specifically documents similar to X, adopting the assumption that similar documents tend to have similar keyphrases. Specifically, for each keyphrase r in the training dataset, we first collect its supporting documents, i.e., documents having r as one of the ground-truth keyphrases, concatenate them into a single document, denoted by $S_r$, and use $S_r$ to index r. We then use BM25 search (Robertson and Walker, 1994) with X as the query, to retrieve the top-$K$ relevant keyphrases $R = [r_1, \dots, r_K]$. Finally, we extend X with R, to construct a structured document X+.
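The retrieval phase can be sketched as follows, with a minimal from-scratch BM25 scorer over the supporting-document index; the toy index contents and whitespace tokenization are illustrative assumptions.

```python
import math
from collections import Counter

def build_index(keyphrase_to_docs):
    """Index each training keyphrase r by its concatenated supporting
    documents S_r (token lists of documents labeled with r)."""
    return {r: Counter(tok for doc in docs for tok in doc)
            for r, docs in keyphrase_to_docs.items()}

def bm25_retrieve(query, index, K=2, k1=1.2, b=0.75):
    """Score each keyphrase's S_r against the query document X with BM25
    and return the top-K keyphrases R."""
    N = len(index)
    avgdl = sum(sum(tf.values()) for tf in index.values()) / N
    df = Counter()
    for tf in index.values():
        df.update(tf.keys())
    def score(tf):
        dl = sum(tf.values())
        s = 0.0
        for w in set(query):
            if w not in tf:
                continue
            idf = math.log(1.0 + (N - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * dl / avgdl))
        return s
    return sorted(index, key=lambda r: score(index[r]), reverse=True)[:K]

index = build_index({
    "natural language processing": [["question", "answering", "language", "model"]],
    "boundary element method":     [["wave", "propagation", "boundary", "integral"]],
})
R = bm25_retrieve(["language", "question", "paragraph"], index, K=1)
```

In practice, any off-the-shelf BM25 implementation over the $S_r$ index serves the same purpose.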
Figure 1: Overall approach, with (a) a social post (closed set) and (b) a scientific article (open set). Red-colored words or phrases are texts included in decoded keyphrases. Black edges and blue edges denote proximity between nodes, obtained from X and R respectively, where the thickness indicates the degree of relatedness between nodes. To construct an integrated graph for social posts, we connect the two graphs (from X and R) using green edges; for scientific articles, we construct a multigraph with two types of edges (for X and R), merging nodes having the same words, depicted by green nodes.
Having relevant keyphrases R enhances both encoding and decoding phases, as described next.

Encoding Structure
Once we augment X with R, our next step is to integrate X and R into X+, and jointly contextualize the contents of X+ by exchanging contexts between the two fields.
For effective integration, we represent X+ as an integrated graph that can be flexibly designed, based on the following principle: a pair of highly related nodes should be connected by an edge or merged into a single node, while unrelated nodes should be kept separate.
For encoding X, we adopt the GCN contextualization described in §3.1. For R, on the other hand, graphs are encoded differently for open- and closed-set scenarios, as below.
• Closed set Y: Keyphrases in R are likely to be reused as target keyphrases Y, so each keyphrase in R is assigned its own node. A keyphrase can then be copied into Y based on its node representation.

• Open set Y: Keyphrases may not be reused as-is; words in R should instead be combined with other words to generate novel keyphrases, so we assign a node to each keyword in R.
In the following two sections, we introduce our Graph-based Structured Document Encoder, or GSEnc, specifically for closed ( §3.2.3) and open set Y ( §3.2.4) respectively.

GSEnc for closed set Y
Targeting closed-set keyphrases, we jointly contextualize word nodes from X and keyphrase nodes from R, by building one graph for each, then generating connecting edges between the two to propagate contexts across graphs.
Constructing and encoding the graph for X follows §3.1. For R, on the other hand, we create a single phrase node for each keyphrase, instead of multiple word nodes, to enable the decoder to copy the keyphrase as-is based on the node features. For edge construction, instead of the position-based proximity in Eq (1), which is not available for keyphrase nodes, we adopt co-occurrence-based proximity, as frequently co-occurring keyphrases tend to have similar topics. Specifically, we compute the adjacency matrix $A^R$ as conditional probabilities based on co-occurrence between keyphrases; contexts are then gathered using graph convolution, similar to Eq (2),
where $H^R_l \in \mathbb{R}^{K \times D}$ are the node features for R at the $l$-th layer, $p(r_k \mid r_j)$ is the probability that $r_k$ co-occurs given $r_j$, computed from the training dataset, and $\hat{A}^R$ is the normalized matrix from $A^R$, as $\hat{A}^X$ is from $A^X$.
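Under our reading, $p(r_k \mid r_j)$ can be estimated by counting how often two keyphrases are labeled together in the training set; a minimal sketch (the toy keyphrase labels are illustrative):

```python
from collections import Counter
from itertools import permutations

def cooccurrence_adjacency(train_keyphrase_sets, retrieved):
    """A^R[j][k] = p(r_k | r_j): among training documents labeled with r_j,
    the fraction that are also labeled with r_k."""
    single, pair = Counter(), Counter()
    for kps in train_keyphrase_sets:
        single.update(kps)                    # count each keyphrase
        pair.update(permutations(kps, 2))     # count ordered co-occurrences
    K = len(retrieved)
    A = [[0.0] * K for _ in range(K)]
    for j, rj in enumerate(retrieved):
        for k, rk in enumerate(retrieved):
            if j != k and single[rj] > 0:
                A[j][k] = pair[(rj, rk)] / single[rj]
    return A

# "a" co-occurs with "b" in 2 of its 3 documents; "b" always co-occurs with "a"
train = [["a", "b"], ["a", "b"], ["a", "c"]]
A = cooccurrence_adjacency(train, ["a", "b"])
```

Note that the resulting matrix is asymmetric, since $p(r_k \mid r_j) \neq p(r_j \mid r_k)$ in general.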
We now discuss how to connect the two graphs for X and R (green edges in Figure 1). For this connection, we take inspiration from connecting a query (a keyphrase, in our case) and a document using relevance feedback, such as clicks on a matching query-document pair. As we have no such feedback between each keyphrase $r_j \in R$ and X, we adopt zero-shot query log synthesis, by using pseudo-relevance feedback (Rocchio, 1971): we treat the supporting documents $S_{r_j}$ as feedback documents with pseudo-relevance to $r_j$, and connect $r_j$ with X using the top-$M$ overlapping words between $S_{r_j}$ and X, denoted by $S^*_{r_j}$. For selecting the top-$M$ words, a frequently used unsupervised signal is tf-idf, which favors words appearing frequently in $S_{r_j}$ (i.e., representative words) but infrequently in other documents (i.e., discriminative words) (Xu and Croft, 2017). Alternatively, query generators can be supervised as a separate task, which requires additional training data, so we adopt the former.
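A minimal sketch of selecting $S^*_{r_j}$, assuming a standard smoothed tf-idf weighting (the paper's exact tf-idf variant is not specified, so this form is an assumption):

```python
import math
from collections import Counter

def top_m_overlap(S_rj, X_tokens, corpus, M=2):
    """S*_rj: the top-M words shared by supporting document S_rj and the
    input X, ranked by tf-idf computed over `corpus` (a list of token lists).
    The smoothed idf form log((1+N)/(1+df)) is an assumption."""
    tf = Counter(S_rj)
    N = len(corpus)
    def tfidf(w):
        df = sum(1 for doc in corpus if w in doc)
        return tf[w] * math.log((1 + N) / (1 + df))
    overlap = set(S_rj) & set(X_tokens)   # only words shared with X qualify
    return sorted(overlap, key=tfidf, reverse=True)[:M]

S_rj = ["the", "boundary", "element", "the"]
X = ["the", "boundary", "conditions"]
corpus = [S_rj, ["the", "cat"], ["the", "dog"]]
S_star = top_m_overlap(S_rj, X, corpus, M=1)
```

Here the ubiquitous word "the" gets zero idf weight, so the discriminative word "boundary" is selected to connect $r_j$ with X.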
Given $S^*_{r_j}$, we construct an edge between $x_i$ and $r_j$ if and only if $x_i$ is included in $S^*_{r_j}$; we then let the encoder estimate a proper edge weight $\overleftrightarrow{A}_{ij}$ using a graph attention network (Veličković et al., 2018), such that only relevant contexts are exchanged between X and R:

$$\overleftrightarrow{A}_{ij} = f\big(v^\top [\, W H^X_{l,i} \,\|\, W H^R_{l,j} \,]\big),$$

where $f(\cdot)$ is the LeakyReLU nonlinearity (Maas et al., 2013), and $\|$ denotes concatenation. In addition, we further leverage pseudo-relevance feedback to augment the contexts of $r_j$, by using $S_{r_j}$ to initialize the node features for $r_j$: $H^R_{1,j} = E\,f(S_{r_j}) \in \mathbb{R}^D$, where $E$ is a trainable word embedding matrix and $f$ weighs each word in $S_{r_j}$. For $f$, we adopt tf-idf.
Finally, we add all gathered contexts and combine these with $H^{[X/R]}_l$, similar to Eq (3). (We use single-head attention; in our experiments, performance with single- and multi-head attention was comparable. Since the number of documents in $S_{r_j}$ can vary among keyphrases, we normalize the tf-idf weights to have unit norm.)
where the Gs are obtained in the same way as the Cs, but with different learnable parameters.

GSEnc for open set Y
Targeting open set keyphrases, we jointly contextualize word nodes from both X and R.
Since nodes for X and R have the same granularity (i.e., word-level nodes), instead of connecting the two graphs (for X and R) with edges, we construct a single integrated graph by merging a node for X and a node for R whenever the two nodes correspond to the same word (green nodes in Figure 1). Such integration enables contexts scattered across the two fields to be effectively gathered into the merged node, through which other connected nodes can also exchange their contexts with each other. Though sharing nodes, we use separate edges for X and R, for structure-aware encoding. That is, the merged graph becomes a multigraph: we connect a pair of nodes with two types of edges (black edges for X and blue edges for R in Figure 1). We denote the contextualized features for nodes on the merged graph, including nodes from both X and R, by $H^{X^+}_{\mathrm{merged},l}$. As in Eq (2), using graph convolution and the two types of edges, we aggregate neighbor contexts from X and R into $C^{X}_{\mathrm{merged},l}$ and $C^{R}_{\mathrm{merged},l}$ respectively, while obtaining $G^{X}_{\mathrm{merged},l}$ and $G^{R}_{\mathrm{merged},l}$ similarly. For the adjacency matrix over keywords in R, we use position-based proximity between two keywords within each keyphrase in R, as in Eq (1). We then combine both contexts to update $H^{X^+}_{\mathrm{merged},l}$. Given the contextualized node features at the last GCN layer, $(H^X_L, H^R_L)$ for closed set Y or $H^{X^+}_{\mathrm{merged},L}$ for open set Y, we feed the features into the decoder to generate keyphrases.
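The merged-multigraph construction above can be sketched as follows; since Eq (1) is not reproduced here, the inverse-token-distance proximity is a stand-in assumption, and tokens are assumed unique within each field.

```python
def build_merged_multigraph(X_tokens, R_tokens):
    """Merge word nodes from X and R into one node set: nodes with the same
    surface word collapse into a single node, while X-edges and R-edges are
    kept as two separate adjacency matrices (a multigraph, per §3.2.4)."""
    node_id, vocab = {}, []
    for w in X_tokens + R_tokens:
        if w not in node_id:          # same word in X and R -> merged node
            node_id[w] = len(vocab)
            vocab.append(w)
    n = len(vocab)
    A_X = [[0.0] * n for _ in range(n)]   # edge type 1: from X
    A_R = [[0.0] * n for _ in range(n)]   # edge type 2: from R
    def add_proximity(A, tokens):
        # stand-in for position-based proximity (Eq (1), exact form assumed):
        # inverse token distance; assumes tokens are unique within the field
        for i, wi in enumerate(tokens):
            for j, wj in enumerate(tokens):
                if i != j:
                    a, b = node_id[wi], node_id[wj]
                    A[a][b] = max(A[a][b], 1.0 / abs(i - j))
    add_proximity(A_X, X_tokens)
    add_proximity(A_R, R_tokens)
    return vocab, A_X, A_R

X = ["neural", "network", "question"]
R = ["question", "answering"]
vocab, A_X, A_R = build_merged_multigraph(X, R)
```

The shared word "question" becomes a single node carrying both edge types, through which contexts from X and R can mix.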

Decoder
The goal of the decoder is to generate a target keyphrase $Y=[y_1,\dots,y_T]$ of length $T$, based on the contextualized features on X+ from the encoder. While standard KG models either copy words from X or generate words from the predefined vocabulary V, we have R as an additional source, which provides valid keyphrase candidates already used for similar documents. To leverage R for decoding, depending on the application, we allow the decoder to copy either (a) keyphrases (closed set Y) or (b) words (open set Y) from R.
(a) Closed set Y: copying keyphrases. In social media posts, a few keyphrases (e.g., trending hashtags) cover most of the potential target keyphrases. Thus, when Y is included in R, copying a keyphrase (e.g., "natural language processing" in Figure 1) from R enables efficient decoding.
To copy relevant keyphrases, we compute a relevance score $\phi_j$ of each $r_j \in R$ to X, using the inner product between $H^R_{L,j}$ and the summarized features for X, denoted by $\bar{h}^X$:

$$\phi_j = (H^R_{L,j})^\top \bar{h}^X.$$

Given $\{\phi_j\}_{j=1}^K$, we copy the top-ranked keyphrases among R. For training, we use the mean squared error (MSE) as the objective function. (We also tried cross entropy, but found MSE to be better empirically.)

(b) Open set Y: copying keywords. In scientific publications, on the other hand, copying a word from R (e.g., "element", "method" in Figure 1) enables the decoder to generate novel keyphrases (e.g., "energetic Galerkin boundary element method"), by combining the word with words from the other two sources (e.g., "energetic Galerkin" from X). We adopt a single-layer GRU decoder equipped with a copy mechanism (See et al., 2017). For simplicity, we denote $H^{X^+}_{\mathrm{merged},L}$ by $H$ and the number of nodes in the merged graph by $N$. For decoding the $t$-th word $y_t$, copy scores $p^{\mathrm{copy}}_t$ are computed using attention: $p^{\mathrm{copy}}_{t,k} = \mathrm{softmax}(v_1^\top W_1 [H_k \,\|\, o_t])$,
where $o_t$ denotes the decoder hidden state. However, $p^{\mathrm{copy}}_t$ only covers $y_t$ present in X+. To predict an arbitrary $y_t$ from V, generation scores $p^{\mathrm{gen}}_t$ are computed, where $E_V \in \mathbb{R}^{|V| \times D}$ is a learnable word embedding matrix for V, and $\hat{h}_t = \sum_{k=1}^{N} p^{\mathrm{copy}}_{t,k} H_k$ summarizes the relevant contexts from X+, using $p^{\mathrm{copy}}_t$ as relevance scores. The final score $p^{\mathrm{final}}_t$ is computed as a combination of the two scores using a gate value $z_t$:

$$p^{\mathrm{final}}_t = z_t \, p^{\mathrm{copy}}_t + (1 - z_t) \, p^{\mathrm{gen}}_t.$$

Once $y_t$ is decoded according to $p^{\mathrm{final}}_t$, we update the decoder hidden state: $o_{t+1} = \mathrm{GRU}(\mathbf{y}_t, o_t)$, where $\mathbf{y}_t \in \mathbb{R}^D$ is the feature vector for the decoded token $y_t$.
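A sketch of the gated mixture of the two score distributions; the direction of the gate (z weighting the copy side) is our reading, following See et al. (2017), and copy mass is scatter-added onto the corresponding surface words.

```python
def final_distribution(p_copy, node_words, p_gen, vocab, z):
    """p_final(w) = z * p_copy(w) + (1 - z) * p_gen(w): mix the copy
    distribution over the N graph nodes with the generation distribution
    over vocabulary V, using gate z in [0, 1]."""
    p_final = {w: (1.0 - z) * p for w, p in zip(vocab, p_gen)}
    for w, p in zip(node_words, p_copy):   # scatter-add copy mass onto words
        p_final[w] = p_final.get(w, 0.0) + z * p
    return p_final

p = final_distribution(
    p_copy=[0.7, 0.3], node_words=["galerkin", "boundary"],
    p_gen=[0.5, 0.5], vocab=["the", "boundary"], z=0.4)
```

Because both input distributions sum to one and the gate is a convex combination, the output also sums to one; words like "galerkin" that appear only in the graph (not in V) still receive probability through the copy path.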
For $\mathbf{y}_t$, previous work uses a word embedding matrix. However, when $y_t$ is a rare word, often replaced by the "[UNK]" symbol, $\mathbf{y}_t$ from the embedding matrix contains little information about the word. We thus leverage the structure-aware representation $H$ to capture the meaning of $y_t$ based on its contexts within X+. Specifically, when $y_t$ is copied from X+, similar to $p^{\mathrm{final}}_t$, we compute $\mathbf{y}_t$ as a gated combination of $H_{y_t \in X^+}$ and $E_{y_t \in V}$, the corresponding vectors for $y_t$ from $H$ and $E_V$ respectively.
Following convention (Meng et al., 2017), to predict multiple keyphrases, we use beam search with beam size 200, and use the top-ranked keyphrases as final predictions. For training, we use the cross-entropy loss on $y_t$ as the objective function.

Table 2: Statistics for three social datasets and a scientific publication dataset. "D" and "KP" are short for document and keyphrases respectively. "abs KP" denotes absent keyphrases. SE denotes StackExchange.
We conducted experiments on social media posts and scientific publications with missing or incomplete structures. We present statistics on the datasets in Table 2.
For social media posts, we used three public datasets, covering not only microblog posts from Twitter and Weibo, but also a Q&A platform, StackExchange (Wang et al., 2019). In the Twitter and Weibo datasets, user-assigned hashtags are treated as keyphrases of the corresponding post. Hashtags in the middle of a post were treated as present keyphrases; hashtags before or after a post were treated as absent keyphrases and removed from the post. For the StackExchange dataset, a given document is a question, and keyphrases are manually annotated by users. Different from microblogs, each question has a title and a description, so the given document is the concatenation of the title and the description. For training and evaluation, document-keyphrase pairs are partitioned into train, validation, and test splits consisting of 80%, 10%, and 10% of the entire data respectively.
As in (Wang et al., 2019), we adopted macro-averaged F1@k and mean average precision (mAP) as evaluation metrics. To compute F1@k, we count the number of correct keyphrases among the top-k predictions (denoted hit@k); precision and recall are then computed as hit@k divided by the number of predictions (i.e., k) and by the number of gold keyphrases respectively. We report F1@1/3 for Twitter/Weibo and F1@3/5 for StackExchange, considering the average number of keyphrases in the datasets (1.13, 1.06, and 2.43 respectively). For all datasets, we report mAP over the top-5 predictions.
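The metrics can be sketched as follows; F1@k matches the definition above, while normalizing AP by min(|gold|, k) is one common convention and an assumption here (the paper's exact normalization may differ).

```python
def f1_at_k(predicted, gold, k):
    """F1@k: precision = hit@k / k, recall = hit@k / |gold|."""
    hits = sum(1 for p in predicted[:k] if p in gold)
    precision, recall = hits / k, hits / len(gold)
    return 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)

def average_precision(predicted, gold, k=5):
    """AP over the top-k predictions; mAP averages this across documents."""
    hits, total = 0, 0.0
    for i, p in enumerate(predicted[:k], start=1):
        if p in gold:
            hits += 1
            total += hits / i   # precision at each hit position
    return total / min(len(gold), k)

f1 = f1_at_k(["a", "b", "c"], {"a", "c"}, k=3)
ap = average_precision(["a", "b", "c"], {"a", "c"}, k=5)
```

For the toy prediction list, two of three predictions are correct, giving precision 2/3 and recall 1, hence F1@3 = 0.8.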
For scientific publications, we use the KP20k dataset (Meng et al., 2017). Since the original dataset includes duplicates between train/test documents, we use a preprocessed dataset without duplicates, released by Chen et al. (2019a). The dataset has 510K, 20K, and 20K documents for the train, validation, and test sets respectively. As in the baselines, we use F1@5/10 as evaluation metrics.

Baselines
We compared our models with previous state-of-the-art KG baselines as well as KE baselines.
For KE baselines, we use TextRank (Mihalcea and Tarau, 2004) and TF-IDF, the most popular unsupervised keyword ranking models, and a neural sequence tagging model (denoted Seq-Tag) that predicts keyphrase spans within the given document.
For KG baselines, we use CopyRNN (Meng et al., 2017) and CorrRNN (Chen et al., 2018) for both social media posts and scientific publications. CopyRNN adopts a standard encoder-decoder architecture with a copy mechanism. CorrRNN exploits correlation between keyphrases to predict diverse keyphrases. We also compare our proposed model with the previous state-of-the-art baselines: TAKG (Wang et al., 2019) for the social media post datasets, which augments the contexts of the given posts using topic modeling, and KG-KE-KR-M (Chen et al., 2019a) for KP20k, which uses the retrieved keyphrases R to provide the decoder with additional contexts. Note that, different from ours, KG-KE-KR-M separately encodes X and R, and thus does not enjoy structure-aware representations.
To validate the effectiveness of jointly contextualizing X and R on the graph, we also conduct an ablation study, comparing our proposed model to an ablation model that separately encodes the graph for X and the graph for R (denoted w/o integration). For social media posts, we exclude all edges between X and R (green edges in Figure 1). For scientific publications, we separately encode the two graphs (one each for X and R) instead of a merged multigraph. The ablation model is similar to KG-KE-KR-M, which also separately encodes X and R, while differing in the encoder (a graph encoder instead of a sequence encoder).

Implementation Details
For fair comparison, we use the same hyperparameters and training strategy as the baselines, such as the size of the predefined vocabulary V, a batch size of 64, the Adam optimizer (Kingma and Ba, 2015) with a 0.001 initial learning rate, and gradient clipping (Pascanu et al., 2013) with a 1.0 threshold.

Evaluation
In this section, we confirm the effectiveness of structures in documents for KG, and validate the superiority of the proposed structure over other structures. Results are shown in Table 3.
Since a large portion of ground-truth keyphrases are absent keyphrases (i.e., context-scarce X), KE models show significantly worse performance than KG models on both datasets.
Among KG models, structures in documents significantly improve performance on both datasets: TAKG with latent topics and TGNet with titles significantly outperform the other KG models without structures, except KG-KE-KR-M on KP20k. However, both a small number of latent topics and short titles carry limited information. In contrast, by retrieving relevant keyphrases from existing keyphrases, we augment the given document with sufficient topical information. As a result, GSEnc outperforms all baselines, except TAKG on Twitter, with which it is comparable.
For the social datasets, in addition to having superior or comparable performance to the baselines, our proposed model, by avoiding sequential decoding, requires much lower computational cost than other KG models. Regarding computational efficiency, our model can process each text in about 22 ms (21 ms for retrieving R and less than 1 ms for the rest), while other KG approaches consume about 90 ms.
Meanwhile, we stress that it is important to jointly contextualize X and R, to enjoy their mutual benefits: we found that performance significantly decreases on all datasets when we encode them separately (w/o integration in Table 3).

When titles are available, we can use R to complement concise but incomplete titles (high precision, low recall), by providing terms missing from the title (increasing recall). In this section, we explain how to integrate R and the title (denoted T) to obtain better structured documents, and evaluate the effectiveness of such integration. We use the KP20k and StackExchange datasets, where the titles of documents are written by the documents' authors.
Since the title words are already represented as nodes in X, we can simply add another edge type, as similarly done in §3.2.4, to obtain three types of edges for the same node pair: edges from X, R, and the given title respectively. As with the edges for X, we use position-based proximity between title words for edge weights. Given the title edges, we use the contexts gathered from the title to update the contextualized representations, similar to Eqs (3) and (9). The results are shown in Table 4.
We observe that R is more representative in KP20k than in StackExchange: the F1 accuracy of R against the gold keyphrases was 22.3 and 11.5 for KP20k and StackExchange respectively. This can be explained by the difference in document length: when R is less relevant, as for the shorter documents in StackExchange, high-precision titles capture relevant keyphrases, such that X+R+T outperforms X+R; otherwise, X+R is sufficiently accurate on its own. Our proposed approach, by leveraging both fields, works well in both cases, outperforming TGNet, which uses only the title structure.
Leveraging existing keyphrases in the train set is especially effective when the test distribution is similar to the train distribution, i.e., on an in-distribution test set, as we validated in our experiments. On the other hand, on an out-of-distribution test set, where the train/test distributions differ, existing keyphrases may not be helpful. We empirically tested these different effects of leveraging R by training models on the KP20k training dataset and evaluating them on the KP20k test set (in-distribution) and the Inspec (Hulth, 2003) test set (out-of-distribution). While keyphrases in KP20k are labeled by the authors of the documents, keyphrases in Inspec are labeled by third-party annotators. In Table 5, we observe that R does not improve performance on the out-of-distribution test set, while showing significant improvements on the in-distribution test set.
To overcome this limitation, as future work, our retriever can be extended to use external sources such as open-domain knowledge graphs (Shi et al., 2017) or Wikipedia texts (Yu and Ng, 2018), which generalize well to out-of-distribution data. However, such external sources will be less effective than internal sources (e.g., existing keyphrases) when most target documents are in-distribution. We will explore effective integration strategies between internal and external sources, to enjoy their complementary strengths, for better performance on both in-distribution and out-of-distribution documents.

Conclusion
We studied the problem of KG for scientific text and social posts, representing context-scarce scenarios with open and closed keyphrase set, respectively. Our work is two-phased, for augmenting and encoding missing/incomplete structure. Empirical evaluation results validate that our proposed model outperforms the state-of-the-art in both problems.