UniKeyphrase: A Unified Extraction and Generation Framework for Keyphrase Prediction

The Keyphrase Prediction (KP) task aims at predicting a set of keyphrases that summarize the main ideas of a given document. Mainstream KP methods can be categorized into purely generative approaches and integrated models that combine extraction and generation. However, these methods either ignore the diversity among keyphrases or capture the relation across tasks only weakly and implicitly. In this paper, we propose UniKeyphrase, a novel end-to-end learning framework that jointly learns to extract and generate keyphrases. In UniKeyphrase, a stacked relation layer and a bag-of-words constraint are proposed to fully exploit the latent semantic relation between extraction and generation, from the perspectives of model structure and training process, respectively. Experiments on KP benchmarks demonstrate that our joint approach outperforms mainstream methods by a large margin.


Introduction
Keyphrases are phrases that highlight the core topics or key information of a document. Given a document, the KP task focuses on automatically obtaining a set of keyphrases. As a basic NLP task, keyphrase prediction is useful for numerous downstream NLP tasks such as summarization (Wang and Cardie, 2013; Pasunuru and Bansal, 2018), document clustering (Hulth and Megyesi, 2006), and information retrieval (Kim et al., 2013).
Keyphrases of a document fall into two categories: present keyphrases, which appear as contiguous spans in the document, and absent keyphrases, which do not. Figure 1 shows an example of a document and its keyphrases.

[Figure 1: An example document on selecting an optimal wavelet for detecting singularities in traffic and vehicular data, with its present and absent keyphrases.]

Traditional KP methods are mainly extractive and have been extensively researched in past decades (Witten et al., 2005; Nguyen and Kan, 2007; Medelyan et al., 2009; Lopez and Romary, 2010; Zhang et al., 2016; Alzaidy et al., 2019; Sun et al., 2020). These methods aim to select text spans or phrases directly from the document and show promising results on present keyphrase prediction. However, extractive methods cannot handle absent keyphrases, which are also significant and require a comprehensive understanding of the document.
To mitigate this issue, several generative methods (Meng et al., 2017; Chen et al., 2018; Ye and Wang, 2018; Chen et al., 2019b; Zhao and Zhang, 2019; Chen et al., 2020; Yuan et al., 2020) have been proposed. Generative methods mainly adopt a sequence-to-sequence (seq2seq) model with a copy mechanism to predict a target sequence formed by concatenating present and absent keyphrases. The generative approach can therefore predict both kinds of keyphrases. But these methods treat present and absent keyphrases equally, even though the two kinds actually have different semantic properties. As illustrated in Figure 1, all the present keyphrases are specific techniques, while the absent keyphrases are tasks or research areas.
Thus, several integrated methods (Chen et al., 2019a; Ahmad et al., 2021) perform multi-task learning on present keyphrase extraction (PKE) and absent keyphrase generation (AKG). By treating present and absent keyphrase prediction as different tasks, integrated methods clearly distinguish the semantic properties of the two kinds of keyphrases. But integrated models suffer from two limitations. Firstly, these approaches are not trained in an end-to-end fashion, which causes error accumulation in the pipeline. Secondly, integrated methods only adopt a shared bottom encoder to implicitly capture the latent semantic relation between PKE and AKG, while this relation is essential for the KP task. As illustrated in Figure 1, the ground-truth present keyphrases are specific techniques, all used for the "singularity detection" task in the "traffic data analysis" area. Such semantic relations between PKE and AKG can benefit KP. In fact, semantic relations like "technique-task-area" between the two tasks are common in KP, but these integrated methods are weak at modeling them.
To address these issues, we propose a novel end-to-end joint model, UniKeyphrase, which adopts a unified pre-trained language model as the backbone and is fine-tuned on both PKE and AKG. Moreover, UniKeyphrase explicitly captures the mutual relation between these two tasks, which benefits keyphrase prediction: present keyphrases can provide AKG with an overall sense of the salient parts of the document, and absent keyphrases, viewed as high-level latent topics of the document, can supply PKE with global semantic information.
Specifically, UniKeyphrase employs two mechanisms to capture this relation, from the model structure and the training process, respectively. Firstly, a stacked relation layer is applied to repeatedly fuse the PKE and AKG task representations, explicitly modeling the relation between the two sub-tasks; in detail, we adopt a co-attention based relation network to model their co-influence. Secondly, a bag-of-words constraint is designed for UniKeyphrase, which provides auxiliary global information about the whole keyphrase set during training.
Experiments conducted on widely used public datasets show that our method significantly outperforms mainstream generative and integrated models (code available at https://github.com/thinkwee/UniKeyphrase). The contributions of this paper can be summarized as follows:
• We introduce UniKeyphrase, a novel end-to-end framework for unified PKE and AKG.
• We design stacked relation layer (SRL) to explicitly capture the relation between PKE and AKG.
• We propose bag-of-words constraint (BWC) to explicitly feed global information about present and absent keyphrases to the model.

Related Work

Keyphrase Extraction
Most existing extraction approaches can be categorized into two-step extraction methods and sequence labeling approaches. Two-step extraction methods first identify a set of candidate phrases in the document by heuristics, such as essential n-grams or noun phrases (Hulth, 2003). Then, the candidate keyphrases are scored and ranked to obtain the predicted results. The scores can be learned by either supervised algorithms (Nguyen and Kan, 2007; Medelyan et al., 2009; Lopez and Romary, 2010) or unsupervised graph ranking methods (Mihalcea and Tarau, 2004; Wan and Xiao, 2008). In sequence labeling approaches, documents are fed to an encoder, and the model learns to predict the likelihood of each word being part of a keyphrase (Zhang et al., 2016; Alzaidy et al., 2019; Sun et al., 2020).
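As an illustration, the two-step pipeline can be sketched with a trivial scorer (the function names are hypothetical, and a simple frequency score stands in for the supervised or graph-based rankers cited above):

```python
import re
from collections import Counter

def extract_candidates(text, max_len=3):
    """Step 1: heuristic candidates -- all word n-grams up to max_len,
    built from the lowercased alphabetic tokens of the document."""
    words = re.findall(r"[a-z]+", text.lower())
    cands = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            cands.add(" ".join(words[i:i + n]))
    return words, cands

def rank_candidates(text, top_k=3):
    """Step 2: score and rank candidates. Here the mean word frequency
    stands in for a learned or graph-based ranking score."""
    words, cands = extract_candidates(text)
    freq = Counter(words)
    scored = {c: sum(freq[w] for w in c.split()) / len(c.split())
              for c in cands}
    return [c for c, _ in sorted(scored.items(), key=lambda x: -x[1])[:top_k]]
```

Real systems replace the frequency score with supervised classifiers or graph ranking such as TextRank; the candidate-then-rank structure is the same.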

Keyphrase Generation
Keyphrase generation focuses on predicting both present and absent keyphrases. Meng et al. (2017) first propose CopyRNN, a seq2seq framework with attention and copy mechanisms. Ye and Wang (2018) then investigate a semi-supervised method for exploiting unlabeled data.

Integrated Methods
To explicitly distinguish present and absent keyphrases, integrated extraction and generation approaches have been applied to the KP task. Chen et al. (2019a) aim at improving the performance of a generative model by using an extractive model. Ahmad et al. (2021) propose SEG-Net, a neural keyphrase generation model composed of a selector, which selects the salient sentences in a document, and an extractor-generator, which extracts and generates keyphrases from the selected sentences. In contrast to these methods, our joint approach explicitly captures the relation between extraction and generation in an end-to-end framework.

Approach
In this section, we describe the architecture of UniKeyphrase. Figure 2 gives an overview of UniKeyphrase, which consists of three components: an extractor-generator backbone based on UNILM, a stacked relation layer that captures the relation between PKE and AKG, and a bag-of-words constraint that provides a global view of the two tasks during training. The details of UniKeyphrase are given in the following sections.

Extractor-Generator Backbone
Given a document $X = \{x_1, \dots, x_m\}$, KP aims at obtaining a keyphrase set $K = \{k_1, \dots, k_{|K|}\}$. Naturally, $K$ can be divided into a present keyphrase set $K_p = \{k^p_1, \dots, k^p_{|K_p|}\}$ and an absent keyphrase set $K_a = \{k^a_1, \dots, k^a_{|K_a|}\}$ by judging whether a keyphrase appears exactly in the source document. UniKeyphrase decomposes KP into PKE and AKG, and jointly learns the two tasks in an end-to-end framework.
UniKeyphrase treats PKE as a sequence labeling task and AKG as a text generation task. To jointly learn in an end-to-end framework, UniKeyphrase adopts UNILM (Dong et al., 2019) as the backbone network. UNILM is a pre-trained language model, which can perform sequence-to-sequence prediction by employing a shared transformer network and utilizing specific self-attention masks to control what context the prediction conditions on.
As shown in Figure 2, with a pre-trained UNILM layer, the contextualized representations of the source document can attend to each other from both directions, which is convenient for PKE. In contrast, the representation of a target token can only attend to the left context, as well as all the tokens in the source document, which is easily adapted to AKG.
Specifically, for a document $X$, all absent keyphrases are concatenated into a sequence. We then randomly choose tokens in this sequence and replace them with the special token [MASK]; the masked sequence is denoted $K^m_a$. We further concatenate the document $X$ and $K^m_a$ with [CLS] and [SEP] tokens as the input sequence:

$$\{[\mathrm{CLS}], x_1, \dots, x_m, [\mathrm{SEP}], K^m_a, [\mathrm{SEP}]\}$$

Afterwards, we feed the input sequence into UNILM and obtain the output hidden states $H = \{h_1, \dots, h_T\}$ ($T$ is the number of input tokens), which serve as the input of the stacked relation layer for jointly modeling PKE and AKG.
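The UNILM self-attention mask described above can be sketched as follows (a minimal numpy illustration; special tokens are folded into the source and target segments for simplicity):

```python
import numpy as np

def seq2seq_attention_mask(src_len, tgt_len):
    """Build a UNILM-style seq2seq self-attention mask (1 = may attend,
    0 = blocked). Rows are query positions, columns are key positions;
    the first src_len positions are the source document, the rest the
    target (absent keyphrase) sequence."""
    total = src_len + tgt_len
    mask = np.zeros((total, total), dtype=int)
    # Source tokens attend bidirectionally, but only within the source.
    mask[:src_len, :src_len] = 1
    # Target tokens attend to the whole source...
    mask[src_len:, :src_len] = 1
    # ...and to target positions up to and including themselves.
    mask[src_len:, src_len:] = np.tril(np.ones((tgt_len, tgt_len), dtype=int))
    return mask
```

With this single mask, the same transformer serves the bidirectional encoding needed for PKE and the left-to-right decoding needed for AKG.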

Stacked Relation Layer
Based on the UNILM, we obtain the output hidden states $H$. Instead of directly using them for PKE and AKG, we use the SRL to explicitly model the relation between the two tasks. Modeling the cross-impact and interaction between different tasks in a joint model is a common problem (Qin et al., 2019, 2020a,b). Specifically, SRL takes the initial shared representations $P^0 = A^0 = \{h_1, \dots, h_T\}$ as input and aims to obtain the final task representations $P^L$ and $A^L$ ($L$ is the number of stacked layers), which take the cross-impact between PKE and AKG into account. Moreover, SRL can be stacked to repeatedly fuse the PKE and AKG task representations for better capturing their mutual relation.
Formally, given the $l$-th layer inputs $P^l = \{p^l_1, \dots, p^l_T\}$ and $A^l = \{a^l_1, \dots, a^l_T\}$, the stacked relation layer first applies two linear transformations with a ReLU activation over the inputs to make them more task-specific, which can be written as follows:

$$\hat{P}^l = \mathrm{LN}\big(P^l + \mathrm{ReLU}(U_p P^l + u_p)\big), \qquad \hat{A}^l = \mathrm{LN}\big(A^l + \mathrm{ReLU}(U_a A^l + u_a)\big)$$

where LN denotes layer normalization (Ba et al., 2016), and $U_p$, $U_a$, $u_p$, $u_a$ are task-specific parameters. Then the relation between the two tasks is integrated based on these task-specific representations. In this paper, we adopt a co-attention relation network.

[Figure 2: The architecture of our model.]

Co-attention is an effective approach to model important information across correlated tasks. We extend the basic co-attention mechanism from the token level to the task-representation level, producing PKE and AKG task representations that take each other into account and thus transferring useful mutual information between the two tasks. The process can be formulated as follows:

$$P^{l+1} = \mathrm{LN}\big(\hat{P}^{l} + \mathrm{softmax}(\hat{P}^{l} (\hat{A}^{l})^{\top})\, \hat{A}^{l}\big)$$
$$A^{l+1} = \mathrm{LN}\big(\hat{A}^{l} + \mathrm{softmax}(\hat{A}^{l} (\hat{P}^{l})^{\top})\, \hat{P}^{l}\big)$$

where $\hat{P}^{l}$ and $\hat{A}^{l}$ are the task-specific representations of the $l$-th layer, and $P^{l+1} = \{p^{l+1}_1, \dots, p^{l+1}_T\}$ and $A^{l+1} = \{a^{l+1}_1, \dots, a^{l+1}_T\}$ are the updated representations. After the stacked relation layers, we obtain the outputs $P^L = \{p^L_1, \dots, p^L_m\}$ and $A^L = \{a^L_1, \dots, a^L_n\}$. We then adopt separate decoders to perform PKE and AKG using the task representations at the corresponding positions:

$$y^p_i = \mathrm{softmax}(W_p\, p^L_i + b_p), \qquad y^a_j = \mathrm{softmax}(W_a\, a^L_j + b_a)$$

where $y^p_i$ and $y^a_j$ are the predicted distributions for present and absent keyphrases, respectively; $W_p$ and $W_a$ are transformation matrices; $b_p$ and $b_a$ are bias vectors.
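As an illustration, a single stacked relation layer with residual Add&Norm and task-level co-attention can be sketched as follows (a minimal numpy sketch; the function and parameter names are hypothetical, and the real model uses learned projections inside a transformer-style layer):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def srl_layer(P, A, Wp, bp, Wa, ba):
    """One stacked relation layer: task-specific projection with a
    residual Add&Norm, then co-attention letting each task attend to
    the other's representations."""
    # Task-specific transformations (Linear -> ReLU -> Add&Norm).
    P_hat = layer_norm(P + np.maximum(P @ Wp + bp, 0.0))
    A_hat = layer_norm(A + np.maximum(A @ Wa + ba, 0.0))
    # Co-attention: PKE positions attend over AKG positions, and vice versa.
    P_new = layer_norm(P_hat + softmax(P_hat @ A_hat.T) @ A_hat)
    A_new = layer_norm(A_hat + softmax(A_hat @ P_hat.T) @ P_hat)
    return P_new, A_new
```

Stacking this layer L times repeatedly fuses the two task representations, which is the mechanism SRL relies on.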

Bag-of-Words Constraint
UniKeyphrase divides the KP task into two sub-tasks, PKE and AKG. These two sub-tasks are optimized separately, which lacks awareness of global information about the total keyphrase set, such as the number of all keyphrases or the words shared between present and absent keyphrases. A bag of words (BoW) is a suitable medium for describing this information. In this paper, we feed global information to UniKeyphrase by constructing constraints based on the BoW of keyphrases: the word counts in the BoW provide guidance about the task relation for PKE and AKG training from a global view.

Specifically, we calculate the gap between the model's predicted keyphrase BoW and the ground-truth keyphrase BoW, then add it to the loss. Hence UniKeyphrase gets a global view of the keyphrase allocation and can adjust the two tasks during training.
We first collect the present and absent keyphrase BoW from the model. For present keyphrases, since PKE is a sequence labeling task, we collect all words labeled as keyphrases and construct the predicted present BoW $V_p$. We use the sum of the corresponding label probabilities as the count of word $w$ in $V_p$:

$$V_p(w) = \sum_{i \in I_w} \max(y^p_i)$$

where $y^p_i$ denotes the predicted label probabilities at time step $i$, $I_w$ is the set of positions of word $w$ in the document, and the maximum operation selects the probability of the predicted label. For absent keyphrases, the generation probabilities of all steps are accumulated as the predicted absent BoW $V_a(w)$.
After acquiring the predicted present and absent keyphrase BoW, we concatenate the two parts as the total predicted BoW $V$, then calculate the error against the ground-truth BoW $\bar{V}$. To preserve the word-count information, we use the Mean Square Error (MSE) function:

$$L_{\mathrm{bow}} = \frac{1}{|\mathcal{W}|} \sum_{w \in \mathcal{W}} \big(V(w) - \bar{V}(w)\big)^2$$

where $\mathcal{W}$ is the collection of words that make up the ground-truth and predicted keyphrases. The BWC therefore only affects a small subset of the whole vocabulary for each sample, which helps reduce noise and stabilize training. In practice, we increase the weight of the BWC logarithmically from zero to a defined maximum value $w_m$; the weight at step $t$ can be denoted as:

$$w_t = w_m \cdot \frac{\log(1 + t)}{\log(1 + t_{\mathrm{total}})}$$

where $t_{\mathrm{total}}$ is the total number of training steps. The reason for adjusting the weight is the same as in Ma et al. (2018): the BWC should take effect once the predicted results are good enough. Therefore we assign a small weight to the BWC at the beginning and gradually increase it during training.
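A minimal sketch of the BWC computation, assuming the predicted and ground-truth BoW are held as word-to-count dictionaries (function names are hypothetical, and the logarithmic schedule is one plausible reading of the description above):

```python
import math

def bow_constraint_loss(pred_bow, gold_bow):
    """MSE between predicted and ground-truth keyphrase bag-of-words
    counts, computed only over words appearing in either bag -- a small
    subset of the vocabulary for each sample."""
    vocab = set(pred_bow) | set(gold_bow)
    errs = [(pred_bow.get(w, 0.0) - gold_bow.get(w, 0.0)) ** 2 for w in vocab]
    return sum(errs) / len(errs)

def bwc_weight(t, t_total, w_max=1.0):
    """Logarithmic ramp of the BWC weight from 0 at step 0 up to w_max
    at the final training step."""
    return w_max * math.log(1 + t) / math.log(1 + t_total)
```

In training, `bwc_weight(t, t_total)` would scale `bow_constraint_loss` before it is added to the labeling and generation losses.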

Training
For the PKE task, the training objective is formulated as:

$$L_{\mathrm{PKE}} = -\frac{1}{M} \sum_{i=1}^{M} \sum_{c=1}^{C} w_c\, \hat{y}^p_{i,c} \log y^p_{i,c}$$

where $M$ is the length of the document, $C$ is the number of labels, $w_c$ is the loss weight for the positive label, and $\hat{y}^p_i$ is the gold label. For the AKG task, the training objective is to maximize the likelihood of the masked tokens, formulated as:

$$L_{\mathrm{AKG}} = -\frac{1}{N} \sum_{j=1}^{N} \sum_{v=1}^{V_s} \hat{y}^a_{j,v} \log y^a_{j,v}$$

where $N$ is the number of masked tokens, $V_s$ is the size of the vocabulary, and $\hat{y}^a_j$ is the ground-truth word.
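The class-weighted PKE objective can be sketched as follows (a pure-Python illustration assuming per-token label distributions as dictionaries; the BIXO label set and the positive weight of 5 follow the experimental setup, while the helper name is hypothetical):

```python
import math

def pke_loss(label_probs, gold_labels, pos_weight=5.0,
             positive=("B", "I", "X")):
    """Class-weighted negative log-likelihood for sequence labeling:
    tokens whose gold label marks a keyphrase (B/I/X) are up-weighted
    relative to the O label."""
    total = 0.0
    for probs, gold in zip(label_probs, gold_labels):
        w = pos_weight if gold in positive else 1.0
        total += -w * math.log(probs[gold])
    return total / len(gold_labels)
```

Up-weighting the positive labels counteracts the label imbalance: most tokens in a document are outside any keyphrase.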
Considering the BWC, the overall loss of UniKeyphrase is formulated as:

$$L = L_{\mathrm{PKE}} + L_{\mathrm{AKG}} + w_t \cdot L_{\mathrm{bow}}$$

Experiments

Datasets and Evaluation
We follow the widely used setup of the deep KP task: train, validate, and test on the KP20k (Meng et al., 2017) dataset, and evaluate on three more benchmark datasets: NUS (Nguyen and Kan, 2007), INSPEC (Hulth, 2003), and SEMEVAL (Kim et al., 2010). We follow the preprocessing, post-processing, and evaluation settings of previous work. Specifically, we use the partition of present and absent keyphrases provided by Meng et al. (2017) and calculate F1@5 and F1@M (using all predicted keyphrases for F1 calculation) after stemming and removing duplicates.

Experimental Setup
Setting: We reuse most hyper-parameters from the pre-trained UNILM. The number of SRL layers is set to 2. We use $w_m = 1.0$ when adjusting the weight of the BWC. The PKE loss weight $w_c$ for the positive label is set to 5.
Baselines: We compare against two kinds of strong baselines (generative and integrated) to give a comprehensive evaluation of the performance of UniKeyphrase.
• Generative: Generative models can predict both present and absent keyphrases under the seq2seq framework. CatSeq (Yuan et al., 2020) is the classic setting of the keyphrase seq2seq model. We report the performance of CatSeq and of various models that improve on it, including CatSeqTG (Chen et al., 2019b), CatSeq (TRM) (Ahmad et al., 2021), and CatSeqD (Yuan et al., 2020). A recently released model, ExHiRD-h (Chen et al., 2020), is also included for comparison.
• Integrated: Integrated models often combine multiple modules to perform extractive and abstractive tasks, but they are not end-to-end. Two recent integrated models are included for comparison: KG-KE-KR-M (Chen et al., 2019a) and SEG-Net (Ahmad et al., 2021).

Main Results
In this section, we show the experimental results of the baseline methods and our model on present keyphrase extraction and absent keyphrase generation. Besides, we also study the average number of unique predicted keyphrases per document to further show the advantages of our model.

Present and Absent Keyphrase Prediction
The present and absent keyphrase prediction performance of all methods is shown in Table 1 and Table 2. From the results, we find that our joint framework outperforms most state-of-the-art generative baselines by a significant margin, especially on absent keyphrase generation, which demonstrates the effectiveness of UniKeyphrase. We notice that UniKeyphrase does not perform well on F1@M for present keyphrase extraction. One potential reason is that UniKeyphrase predicts more keyphrases than the other baselines, which gives it the potential to predict more reasonable but non-ground-truth keyphrases.

Number of Predicted Keyphrases
The number of predicted keyphrases indicates the model's understanding of the input documents. From previous work (Chen et al., 2020), we find that the average number of unique predicted keyphrases per document is much lower than the gold average keyphrase number on most datasets. The numbers of unique keyphrases predicted by UniKeyphrase and the baselines are compared in Table 3. We find that UniKeyphrase predicts more keyphrases (especially absent ones) than the baseline methods, which is closer to the ground truth. Meanwhile, UniKeyphrase tends to predict more keyphrases than the ground truth (especially on KP20k). We leave solving this over-prediction problem as future work.

Ablation Study
In this section, we examine the improvement brought by SRL and BWC. Several ablation experiments are conducted to analyze the effect of different components; the results on three datasets are shown in Table 4.

[Table 3: Results of average numbers of predicted unique keyphrases. "#PK" and "#AK" are the numbers of present and absent keyphrases, respectively. Bold denotes the prediction closest to the ground truth.]

Effectiveness of bag-of-words constraint: In this setting, we remove the bag-of-words constraint, so there is no global constraint on the two tasks. The results show a drop in KP performance, indicating that capturing the global constraint on the result through BWC is effective and important for our method.

SRL Analysis
To better understand the SRL module, we analyze the impact of stacked layers and give a visualization of the inner state of SRL.
Analysis of Stacked Layers: We analyze the impact of the number of stacked relation layers. The comparison of total keyphrase prediction results, regardless of whether keyphrases are present or absent, is shown in Table 5. We find that deeper stacks generally result in better performance when the number of stacked layers is less than three, which proves the effectiveness of stacking. It is worth noting that when the number of stacked layers is larger than two, KP performance drops. We suppose that when the relation network becomes deeper, over-interaction loses the diversity of the two task representations.
Visualization Analysis for SRL: To better understand what the SRL network has learned, we compare the distance between the PKE and AKG representations in different settings. In detail, we randomly sample 2000 pairs of PKE and AKG representation vectors at different positions from the test data and compute the Euclidean distance of each pair. As shown in Figure 3, the blue points denote the Euclidean distance between the PKE and AKG representation vectors without the SRL layer, while the yellow points denote the distance with the SRL layer.
From Figure 3, we find that the blue points lie under the yellow points, which means the PKE and AKG representation vectors without SRL are more similar; in other words, SRL has learned task-specific representations. Also, the blue points are denser than the yellow points, which means the PKE and AKG representations with SRL are more diverse across samples than those without SRL.

BWC Analysis
Loss Comparison: From Figure 4 we can see that the original total loss (labeling and generation) drops more with the help of BWC than in the vanilla model. BWC is effectively an enhancement of the original supervised signal from a global view: it guides the model to learn how many keyphrases to predict and how to allocate present and absent keyphrases, while the original loss only teaches what to predict at each position.
Bag-of-words Error: We also calculate the bag-of-words error between the ground truth and the model's predicted keyphrases, i.e., how many tokens are incorrectly predicted. As shown in Figure 5, UniKeyphrase with BWC achieves a lower BoW error than the vanilla model, which proves that BWC successfully guides the model to learn a better BoW allocation.

Joint Framework Analysis
In UniKeyphrase, we adopt the pre-trained model UNILM for KP. It is therefore necessary to check that the gain of our proposed joint framework does not simply come from the pre-trained model. In this section, we compare UniKeyphrase with directly using the pre-trained UNILM to perform generative KP.
Specifically, we train a sequence-to-sequence model for KP based on UNILM. Results are shown in Table 5. From the results, we find that all of the joint models with SRL outperform the generative method based on UNILM, demonstrating that the improvement on KP mainly comes from our joint framework instead of the pre-trained UNILM. We notice that UniKeyphrase without SRL does not outperform the generative method based on UNILM, which shows the significance of modeling the relation between the two sub-tasks in our joint framework.

Conclusion and Future Work
This paper focuses on explicitly establishing an end-to-end unified model for PKE and AKG. Specifically, we propose UniKeyphrase, which contains a stacked relation layer to model the interaction and relation between the two sub-tasks. In addition, we design a novel bag-of-words constraint for jointly training these two tasks. Experiments on benchmarks show the effectiveness of the proposed model, and more extensive analysis further confirms the correlation between the two tasks and reveals that modeling the relation explicitly can boost their performance.
Our UniKeyphrase can be formalized as a unified framework for NLU and NLG tasks, and it is easy to transfer to other extraction-generation NLP tasks. In the future, we will explore adopting our framework in more scenarios.

A Datasets

Relevant statistics about the datasets used in this paper are shown in Table 6.

B Experimental Details
The BWC does not introduce extra parameters, hence the trainable parameters of UniKeyphrase come from UNILM and SRL. We use the base version of UNILM, which contains about 110M parameters. Following UNILM, our model is implemented in PyTorch. The learning rate is 1e-5 and the proportion of warmup steps is 0.1. The masking probability of the absent keyphrase sequence is 0.7. For the SRL module, dropout with a rate of 0.5 is applied to the output of each layer for regularization. We try setting the number of SRL layers to 2, 3, and 4, and choose the best based on validation. For all experiments in this paper, we choose the model that performs best on the KP20k validation set.

C Preprocess
The input of UniKeyphrase is the same as BERT's, which applies a wordpiece tokenizer to raw sentences. We therefore use the "BIXO" labeling scheme, where B and I stand for the Beginning and Inside of a word in a keyphrase, and O denotes any token that is outside of any keyphrase. Any sub-word token inside a keyphrase (which starts with '##' in the processed input) is labeled X. For example, "voip conferencing system" is tokenized into "v ##oi ##p con ##fer ##encing system" and labeled as "B X X I X X I". We concatenate all the tokenized absent keyphrases into one sequence using the special delimiter " ; ". An example absent keyphrase sequence looks like "peer to peer ; content delivery ; t ##f ##rc ; ran ##su ##b".

[Figure 6: Case study document "fast image recovery using variable splitting and constrained optimization", an abstract on image restoration and reconstruction via variable splitting, an augmented Lagrangian method, and the alternating direction method of multipliers.]
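The BIXO scheme above can be sketched as a small labeling function (a minimal illustration; the helper name and the boolean-mask input are hypothetical, since the real pipeline derives keyphrase spans by matching against the document):

```python
def bixo_labels(pieces, in_keyphrase):
    """Assign BIXO labels to wordpiece tokens.

    pieces: wordpiece tokens ('##'-prefixed for sub-words);
    in_keyphrase: parallel booleans, True if the piece belongs to a
    present keyphrase. Sub-words get X; the first full word of a
    keyphrase span gets B, later full words get I; all else gets O."""
    labels, started = [], False
    for piece, inside in zip(pieces, in_keyphrase):
        if not inside:
            labels.append("O")
            started = False
        elif piece.startswith("##"):
            labels.append("X")
        elif not started:
            labels.append("B")
            started = True
        else:
            labels.append("I")
    return labels
```

Applied to the example above, the seven wordpieces of "voip conferencing system" receive the labels B X X I X X I.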

D Case Study
We give a case from the KP20k test set in Figure 6. We compare with the original UNILM since our joint models are based on its implementation. Blue and red denote correct present and absent keyphrases, respectively. As shown in Figure 6, UniKeyphrase successfully catches the deep semantic relation, similar to the case in the introduction, and gives more accurate results (predicting applications such as "image restoration" and "image reconstruction").

E Evaluation Details
We use F1@5 and F1@M as evaluation metrics. Following previous work, when calculating F1@5, we pad the result when the number of predicted keyphrases is less than 5. Since there is no explicit rank score for each predicted keyphrase, we calculate the rank score as follows. Present: we use the average predicted label probability of all tokens in a keyphrase as its score. We tried several other scoring strategies; the results show no significant difference (less than 0.1%).
Absent: following previous work, we pick the top 5 keyphrases in sequence order; that is, the 5 leftmost keyphrases in the predicted sequence are selected as the result.
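The padded F1@5 computation described above can be sketched as follows (a minimal illustration with a hypothetical helper name; stemming and duplicate removal are assumed to have been applied upstream):

```python
def f1_at_k(predicted, gold, k=5):
    """F1@k: take the k highest-ranked predictions, padding with dummy
    (never-matching) slots when fewer than k are predicted, so that
    precision is always computed against exactly k slots."""
    topk = predicted[:k]
    num_correct = len(set(topk) & set(gold))
    precision = num_correct / k          # padded slots count as wrong
    recall = num_correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Padding makes the metric comparable across models that output different numbers of keyphrases: predicting fewer than k keyphrases cannot inflate precision.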