One2Set: Generating Diverse Keyphrases as a Set

Recently, sequence-to-sequence models have made remarkable progress on the task of keyphrase generation (KG) by concatenating multiple keyphrases in a predefined order as a target sequence during training. However, the keyphrases are inherently an unordered set rather than an ordered sequence. Imposing a predefined order introduces a wrong bias during training, which can heavily penalize shifts in the order between keyphrases. In this work, we propose a new training paradigm, One2Set, which does not predefine an order to concatenate the keyphrases. To fit this paradigm, we propose a novel model that utilizes a fixed set of learned control codes as conditions to generate a set of keyphrases in parallel. To solve the problem that there is no correspondence between each prediction and target during training, we propose a K-step label assignment mechanism via bipartite matching, which greatly increases the diversity and reduces the repetition rate of generated keyphrases. The experimental results on multiple benchmarks demonstrate that our approach significantly outperforms the state-of-the-art methods.


Introduction
Keyphrase generation (KG) aims to generate a set of keyphrases that expresses the high-level semantic meaning of a document. These keyphrases can be further categorized into present keyphrases that appear in the document and absent keyphrases that do not. Meng et al. (2017) proposed a sequence-to-sequence (Seq2Seq) model with a copy mechanism (Gu et al., 2016) to predict both present and absent keyphrases. However, the model needs beam search during inference to overgenerate multiple keyphrases, and therefore cannot determine the number of keyphrases dynamically. To address this, Yuan et al. (2020) proposed the ONE2SEQ training paradigm where each source text corresponds to a sequence of keyphrases that are concatenated with a delimiter <sep> and a terminator <eos>. As keyphrases must be ordered before being concatenated, Yuan et al. (2020) sorted the present keyphrases by their order of the first occurrence in the source text and appended the absent keyphrases to the end. During inference, the decoding process terminates when generating <eos>, and the final keyphrase predictions are obtained after splitting the sequence by <sep>. Thus, a model trained with the ONE2SEQ paradigm can generate a variable number of keyphrases while also considering the dependency between keyphrases.
Figure 1: An example of ground-truth keyphrases (upper) and predictions (lower) under the ONE2SEQ and ONE2SET training paradigms. For the ONE2SEQ training paradigm, although each predicted keyphrase is correct, the predictions are still considered wrong due to the shift in keyphrase order, and the model receives a large penalty.
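As an illustrative sketch (not the code of Yuan et al. (2020)), the ONE2SEQ target construction and the splitting of a decoded sequence described above could look as follows. The function names are hypothetical, and plain substring search stands in for the tokenized and stemmed matching a real pipeline would use to decide whether a keyphrase is present.

```python
SEP, EOS = "<sep>", "<eos>"

def build_one2seq_target(source: str, keyphrases: list[str]) -> str:
    """Order the keyphrases and join them into a single target string."""
    # Assumption: a keyphrase is "present" if it occurs verbatim in the source.
    present = [k for k in keyphrases if k in source]
    absent = [k for k in keyphrases if k not in source]
    # Present keyphrases are sorted by first occurrence; absent ones keep their given order.
    present.sort(key=source.find)
    return f" {SEP} ".join(present + absent) + f" {EOS}"

def split_one2seq_output(decoded: str) -> list[str]:
    """Cut the decoded string at <eos>, then split it on <sep>."""
    body = decoded.split(EOS, 1)[0]
    return [p.strip() for p in body.split(SEP) if p.strip()]

source = "a neural model for keyphrase generation with a copy mechanism"
gold = ["copy mechanism", "sequence labeling", "neural model"]
target = build_one2seq_target(source, gold)
print(target)
print(split_one2seq_output(target))
```

Here the target becomes `neural model <sep> copy mechanism <sep> sequence labeling <eos>`: the two present keyphrases are reordered by position and the absent one is appended, so the gold order is an artifact of the preprocessing rather than of the data.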
However, as the keyphrases are inherently an unordered set rather than an ordered sequence, imposing a predefined order usually leads to the following intractable problems. First, the predefined order introduces a wrong bias during training, which can heavily penalize shifts in the order between keyphrases. As shown in Figure 1 (a), the model predicts each keyphrase correctly but can still receive a large loss during training. Second, this increases the difficulty of model training. For example, the absent keyphrases are appended to the end in an author-defined order in Yuan et al. (2020). However, different authors can sort keyphrases on different bases, which makes it difficult for the model to learn a unified pattern. Third, the model is highly sensitive to the predefined order, as shown in Meng et al. (2019), and can suffer from error propagation during inference when previously generated keyphrases are in an incorrect order. Lately, Chan et al. (2019) proposed a reinforcement learning-based fine-tuning method, which fine-tunes the pre-trained models with metric-based rewards (i.e., recall and F1) for generating more sufficient and accurate keyphrases. Although this method can alleviate the impact of the order problems during fine-tuning, the model still needs to be pre-trained under the ONE2SEQ paradigm for initialization, which can still introduce wrong biases.
To address this problem, we propose a new training paradigm ONE2SET where the ground-truth target is a set rather than a keyphrase-concatenated sequence. However, the vanilla Seq2Seq model can generate a sequence but not a set. Hence, we introduce a set prediction model that adopts Transformer (Vaswani et al., 2017) as the main architecture together with a fixed set of learned control codes as additional decoder inputs to perform controllable generation. For each code, the model generates a corresponding keyphrase for the source document or a special ∅ token that represents the meaning of "no corresponding keyphrase". During training, the cross-entropy loss cannot be directly used since we do not know the correspondence between each prediction and target. Hence, we introduce a K-step target assignment mechanism, where we first auto-regressively generate K words for each code and then assign targets via bipartite matching based on the predicted words. After that, we can train each code using teacher forcing as before. Compared with the previous models, the proposed method has the following advantages: (a) there is no need to predefine an order to concatenate the keyphrases, thus the model will not be affected by the wrong biases in the whole training stage; and (b) the bipartite matching forces unique predictions for each code, which greatly reduces the duplication ratio and increases the diversity of predictions.
We summarize our main contributions as follows: (1) we propose a new training paradigm ONE2SET without predefining an order to concatenate the keyphrases; (2) we propose a novel set prediction model that can generate a set of diverse keyphrases in parallel and a dynamic target assignment mechanism to solve the intractable training problem under the ONE2SET paradigm; (3) our method consistently outperforms all the state-of-the-art methods and greatly reduces the duplication ratio. Our code is publicly available on GitHub.

Keyphrase Generation
Compared to extractive approaches, generative ones are able to predict absent keyphrases. Meng et al. (2017) proposed a generative model CopyRNN, which employs an encoder-decoder framework (Sutskever et al., 2014) with attention (Bahdanau et al., 2015) and copy mechanisms (Gu et al., 2016). Many works have been proposed based on the CopyRNN architecture (Chen et al., 2018; Zhao and Zhang, 2019; Chen et al., 2019b,a).
In previous CopyRNN-based works, each source text corresponds to a single target keyphrase. Thus, the model needs beam search during inference to overgenerate multiple keyphrases, and can neither determine the number of keyphrases dynamically nor consider the inter-relation among keyphrases.
Figure 2: Architecture of the SETTRANS model with N learned control codes as input conditions. A K-step target assignment mechanism is used during training, where we first predict K words for each code, and then find an optimal allocation among the predictions and targets. In the figure, N = 8 and K = 2 are used.
To this end, Yuan et al. (2020) proposed the ONE2SEQ training paradigm, in which each source text corresponds to a sequence of concatenated keyphrases. Chan et al. (2019) proposed an RL-based fine-tuning method using F1 and Recall metrics as rewards. Swaminathan et al. (2020) proposed an RL-based fine-tuning method using a discriminator to produce rewards. All the above models need to be trained or pre-trained under the ONE2SEQ paradigm. As keyphrases must be ordered before concatenating, while keyphrases are inherently an unordered set, the model can be trained with wrong signals. Our ONE2SET training paradigm aims to solve this problem.

Methodology
This paper proposes a new training paradigm ONE2SET for keyphrase generation. A set prediction model based on Transformer (SETTRANS) is proposed to fit this paradigm, as shown in Figure 2. Given a fixed set of learned control codes as input conditions, the model generates a keyphrase or a special ∅ token for each code in parallel. During training, a K-step target assignment mechanism is proposed to dynamically determine the target corresponding to each code. The main idea is that the model first freely predicts K steps without any supervision to see what keyphrase each code can roughly generate, and then uses bipartite matching to find the optimal allocation based on the model's conjecture and the targets. Given the correspondence between each code and target, a separate set loss is then used to correct the model's conjecture, where half of the codes are trained to predict the present keyphrase set and the others are trained to predict the absent keyphrase set.

The ONE2SET Training Paradigm
We first formally describe the keyphrase generation task as follows. Given a document $x$, the goal is to predict a set of keyphrases $Y = \{y^i\}_{i=1,\dots,|Y|}$, where $|Y|$ is the number of keyphrases. To solve the KG task, previous works typically adopted the ONE2ONE training paradigm (Meng et al., 2017) or the ONE2SEQ training paradigm (Yuan et al., 2020). The two paradigms differ in the form of the training samples. Specifically, in the ONE2ONE training paradigm, each original sample pair $(x, Y)$ is divided into multiple pairs $\{(x, y^i)\}_{i=1,\dots,|Y|}$ to perform training independently. In the ONE2SEQ training paradigm, each original sample pair $(x, Y)$ is processed as $(x, f(Y))$, where $f(Y)$ is a sequence of keyphrases after the reordering and concatenating operation.
To solve the wrong bias problem caused by the ONE2SEQ training paradigm, we propose the ONE2SET training paradigm, where each original sample pair is kept still as (x, Y). Hence, the sample used in training is consistent with the original sample, which avoids the intractable problem introduced by the additional processing (i.e., dividing or concatenating).
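To make the difference between the three sample forms concrete, the following minimal sketch builds each of them from the same document and keyphrases. The function names and the `<sep>`/`<eos>` strings are illustrative assumptions, and the ONE2SEQ helper simply takes the keyphrases in the order given.

```python
def one2one(x: str, Y: list[str]) -> list[tuple[str, str]]:
    # ONE2ONE: one independent training pair per keyphrase.
    return [(x, y) for y in Y]

def one2seq(x: str, Y: list[str]) -> tuple[str, str]:
    # ONE2SEQ: Y is assumed to be already ordered; the target depends on that order.
    return (x, " <sep> ".join(Y) + " <eos>")

def one2set(x: str, Y: list[str]) -> tuple[str, frozenset[str]]:
    # ONE2SET: the target is kept as an unordered set.
    return (x, frozenset(Y))

x = "some document"
order_a = ["topic model", "keyphrase generation"]
order_b = ["keyphrase generation", "topic model"]

print(one2one(x, order_a))
print(one2seq(x, order_a) == one2seq(x, order_b))  # False: the order changes the target
print(one2set(x, order_a) == one2set(x, order_b))  # True: the order is irrelevant
```

Two orderings of the same keyphrases yield different ONE2SEQ targets but an identical ONE2SET target, which is the property the paradigm relies on.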

The SETTRANS Model
We adopt the Transformer (Vaswani et al., 2017) as the backbone encoder-decoder framework. However, the vanilla Transformer can only generate a sequence but not a set. To predict a set of keyphrases, we propose the SETTRANS model, which utilizes a set of learned control codes as additional decoder inputs. By performing generation conditioned on each control code, we can generate a set of keyphrases in parallel. To decide a suitable number of keyphrases for each given document, we fix the total number of control codes to a sufficiently large number N, and introduce a special ∅ token that represents the meaning of "no corresponding keyphrase". Hence, we can determine the appropriate number of keyphrases for an input document after removing all the ∅ tokens from the N predictions.
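The final filtering step can be sketched as below. This is only an illustration: the string `"∅"` stands in for the special token, and treating a slot as empty when its first decoded token is ∅ is an assumption rather than a detail stated in the text.

```python
NULL = "∅"  # placeholder string standing in for the special "no keyphrase" token

def collect_keyphrases(slot_outputs: list[list[str]]) -> list[str]:
    """Keep one keyphrase per control code, skipping slots that decode to ∅."""
    keyphrases = []
    for tokens in slot_outputs:
        # Assumption: a slot counts as empty if it is blank or starts with ∅.
        if not tokens or tokens[0] == NULL:
            continue
        keyphrases.append(" ".join(tokens))
    return keyphrases

# N = 4 slots, two of which decode to ∅.
outputs = [["neural", "model"], [NULL], ["topic", "model"], [NULL]]
print(collect_keyphrases(outputs))  # ['neural model', 'topic model']
```

The number of returned keyphrases varies per document even though N is fixed.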
Formally, the decoder input at time step $t$ for control code $n$ is defined as follows:
$$d^n_t = e^w_{y^n_{t-1}} + e^p_t + c^n \tag{1}$$
where $e^w_{y^n_{t-1}}$ is the embedding of word $y^n_{t-1}$, $e^p_t$ is the $t$-th sinusoid positional embedding as in Vaswani et al. (2017), and $c^n$ is the $n$-th learned control code embedding. The decoder outputs the predictive distribution $p^n_t$, which is used to get the next word $y^n_t$. As some keyphrases contain words that do not exist in the predefined vocabulary but appear in the input document, we also employ a copy mechanism (See et al., 2017), which is generally adopted in many previous KG works (Meng et al., 2017; Chan et al., 2019; Chen et al., 2020; Yuan et al., 2020).
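A minimal sketch of this decoder input is given below. It assumes that the word, positional and control code embeddings are combined by element-wise summation, and it uses random vectors as stand-ins for the learned embeddings. All names and the dimension `D_MODEL` are illustrative.

```python
import math
import random

D_MODEL = 8  # illustrative embedding size

def sinusoid_position(t: int, d_model: int = D_MODEL) -> list[float]:
    """t-th sinusoidal positional embedding in the style of Vaswani et al. (2017)."""
    pe = []
    for i in range(d_model):
        angle = t / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def decoder_input(word_emb: list[float], t: int, code_emb: list[float]) -> list[float]:
    """Element-wise sum of word, position and control code embeddings (assumed form)."""
    pos = sinusoid_position(t, len(word_emb))
    return [w + p + c for w, p, c in zip(word_emb, pos, code_emb)]

rng = random.Random(0)
N = 4
# Random stand-ins for the learned control codes c^n and the previous-word embedding.
control_codes = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N)]
prev_word = [rng.gauss(0, 1) for _ in range(D_MODEL)]

# The same previous word and time step give a different input for every code.
inputs = [decoder_input(prev_word, t=1, code_emb=c) for c in control_codes]
print(len(inputs), len(inputs[0]))
```

Because only the control code differs between the N inputs, the codes are what allow the parallel decoders to diverge from one another.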

Training
The main difficulty of training under the ONE2SET paradigm is that the correspondence between each prediction and ground-truth keyphrase is unknown, so that the cross-entropy loss cannot be directly used. Hence, we introduce a K-step target assignment mechanism to assign the ground-truth keyphrase for each prediction, and a separate set loss to train the model in an end-to-end way.

K-step Target Assignment
We first generate K words for each control code and collect the corresponding predictive probability distributions of each step. Formally, we denote $\mathbf{P} = \{P^n\}_{n=1,\dots,N}$, where $P^n = \{p^n_t\}_{t=1,\dots,K}$ and $p^n_t$ is the predictive distribution at time step $t$ for control code $n$.
Then, we find a bipartite matching between the ground-truth keyphrases and predictions. Assuming the predefined number of control codes N is larger than the number of ground-truth keyphrases, we also treat the ground-truth keyphrases as a set of size N padded with ∅. Note that the bipartite matching enforces permutation-invariance and guarantees that each target element has a unique match. Thus, it reduces the duplication ratio of predictions. Specifically, as shown in Figure 2, both the fifth and eighth control codes predict the same keyphrase "neural model", but one of them is assigned ∅. The eighth code can thus perceive that this keyphrase has been generated by another code. Hence, the control codes can learn their mutual dependency during training and avoid generating duplicated keyphrases.
Formally, to find a bipartite matching between the sets of ground-truth keyphrases and predictions, we search for a permutation $\hat{\pi}$ with the lowest cost:
$$\hat{\pi} = \arg\min_{\pi \in \Pi(N)} \sum_{n=1}^{N} \mathcal{C}_{\mathrm{match}}\left(y^n, P^{\pi(n)}\right) \tag{2}$$
where $\Pi(N)$ is the space of all $N$-length permutations, and $\mathcal{C}_{\mathrm{match}}\left(y^n, P^{\pi(n)}\right)$ is a pair-wise matching cost between the ground truth $y^n$ and the distributions of a prediction sequence with index $\pi(n)$. This optimal assignment is computed efficiently with the Hungarian algorithm (Kuhn, 1955). The matching cost takes into account the class predictions, and can be defined as follows:
$$\mathcal{C}_{\mathrm{match}}\left(y^n, P^{\pi(n)}\right) = -\sum_{t=1}^{s} \mathbb{1}_{\{y^n_t \neq \varnothing\}}\, p^{\pi(n)}_t\left(y^n_t\right) \tag{3}$$
where $s = \min(|y^n|, K)$ is the minimum shared length between the target and predicted sequence, and $p^{\pi(n)}_t(y^n_t)$ denotes the probability of word $y^n_t$ in $p^{\pi(n)}_t$. We ignore the score from matching predictions with ∅, which ensures that valid targets (i.e., non-∅ targets) are allocated to predictions with as high a predictive probability as possible.
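The matching step can be illustrated with the following toy sketch. To stay dependency-free it enumerates all permutations instead of running the Hungarian algorithm, which is only feasible for very small N; in practice a solver such as `scipy.optimize.linear_sum_assignment` would normally be applied to the same cost matrix. The data structures (a list of K word-to-probability dictionaries per code) and function names are assumptions made for illustration, and the number of targets is assumed not to exceed N.

```python
from itertools import permutations

NULL = "∅"

def match_cost(target: list[str], step_dists: list[dict]) -> float:
    """Negative sum of the probabilities of the first min(|y|, K) target words.

    ∅ targets contribute a cost of 0, i.e. their score is ignored.
    """
    if target == [NULL]:
        return 0.0
    s = min(len(target), len(step_dists))
    return -sum(step_dists[t].get(target[t], 0.0) for t in range(s))

def best_assignment(targets: list[list[str]], predictions: list[list[dict]]) -> dict:
    """Exhaustive search over permutations (a stand-in for the Hungarian algorithm)."""
    n = len(predictions)
    padded = targets + [[NULL]] * (n - len(targets))  # pad targets with ∅ up to N
    best = min(
        permutations(range(n)),
        key=lambda pi: sum(match_cost(padded[i], predictions[pi[i]]) for i in range(n)),
    )
    return {best[i]: padded[i] for i in range(n)}  # code index -> assigned target

# N = 3 codes, K = 2 steps; codes 0 and 2 both lean towards "neural model".
predictions = [
    [{"neural": 0.9, "topic": 0.1}, {"model": 0.8, NULL: 0.2}],
    [{"topic": 0.7, "neural": 0.3}, {"model": 0.9, NULL: 0.1}],
    [{"neural": 0.6, "topic": 0.4}, {"model": 0.7, NULL: 0.3}],
]
targets = [["neural", "model"], ["topic", "model"]]
assignment = best_assignment(targets, predictions)
print(assignment)
```

In this example code 0 receives "neural model", code 1 receives "topic model", and code 2, whose prediction duplicates that of code 0, is assigned ∅.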

Separate Set Loss
Given the correspondence between each code and target, we can train the model to predict a single target set with the following loss:
$$\mathcal{L}(\theta) = -\sum_{n=1}^{N} \sum_{t=1}^{|y^n|} \log \hat{p}^{\hat{\pi}(n)}_t\left(y^n_t\right) \tag{4}$$
where $\hat{p}^{\hat{\pi}(n)}_t$ is the predictive probability distribution using teacher forcing. However, since predicting present and absent keyphrases requires the model to have different capabilities, we propose a separate set loss to flexibly take this bias into account in a unified model. Specifically, we first separate the control codes into two fixed sets of equal size $N/2$, denoted as $C_1$ and $C_2$, and the target keyphrase set $Y$ into the present target keyphrase set $Y^{pre}$ and the absent target keyphrase set $Y^{abs}$. Then, the bipartite matching is performed on the two sets separately, namely, we find a permutation $\hat{\pi}^{pre}$ using $Y^{pre}$ and predictions from $C_1$, and $\hat{\pi}^{abs}$ using $Y^{abs}$ and predictions from $C_2$. Thus, we can modify the final loss in Equation 4 as follows:
$$\mathcal{L}(\theta) = -\left[ \sum_{n=1}^{N/2} \sum_{t=1}^{|y^n|} \log \hat{p}^{\hat{\pi}^{pre}(n)}_t\left(y^n_t\right) + \sum_{n=N/2+1}^{N} \sum_{t=1}^{|y^n|} \log \hat{p}^{\hat{\pi}^{abs}(n)}_t\left(y^n_t\right) \right] \tag{5}$$
In practice, we down-weight the log-probability term when $y^n_t = \varnothing$ by scale factors $\lambda_{pre}$ and $\lambda_{abs}$ for the present keyphrase set and the absent keyphrase set, respectively, to account for the class imbalance.
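As a rough sketch of the separate set loss, the snippet below assumes the matching has already been done and that each code comes with its assigned target and its teacher-forced per-step distributions. The data layout, the function names and the scale factor values 0.2 and 0.1 are illustrative assumptions, not settings confirmed by the text.

```python
import math

NULL = "∅"

def set_loss(assigned: list[tuple[list[str], list[dict]]], null_weight: float) -> float:
    """Negative log-likelihood over (target tokens, per-step distributions) pairs.

    Terms whose target word is ∅ are multiplied by null_weight.
    """
    loss = 0.0
    for target, dists in assigned:
        for word, dist in zip(target, dists):
            weight = null_weight if word == NULL else 1.0
            loss -= weight * math.log(dist[word])
    return loss

def separate_set_loss(present_half, absent_half, lam_pre: float, lam_abs: float) -> float:
    """Sum of the losses of the two code halves, each with its own ∅ scale factor."""
    return set_loss(present_half, lam_pre) + set_loss(absent_half, lam_abs)

# Toy example with N = 4: two codes for present and two for absent keyphrases.
present_half = [
    (["neural", "model"], [{"neural": 0.5}, {"model": 0.5}]),
    ([NULL], [{NULL: 0.5}]),
]
absent_half = [
    (["keyphrase"], [{"keyphrase": 0.25}]),
    ([NULL], [{NULL: 0.5}]),
]
loss = separate_set_loss(present_half, absent_half, lam_pre=0.2, lam_abs=0.1)
print(round(loss, 4))  # 4.3 * ln(2), about 2.9805
```

With unit weights on ∅ the two empty slots would contribute as much as a real word each; the scale factors shrink that contribution so that the many ∅ targets do not dominate training.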

Datasets
We conduct our experiments on five scientific article datasets, including Inspec (Hulth, 2003), NUS (Nguyen and Kan, 2007), Krapivin (Krapivin et al., 2009), SemEval (Kim et al., 2010) and KP20k (Meng et al., 2017). Each sample from these datasets consists of a title, an abstract, and some keyphrases. Following previous works (Meng et al., 2017; Chen et al., 2019b,a; Yuan et al., 2020), we concatenate the title and abstract as a source document. We use the largest dataset (i.e., KP20k) to train all the models. After preprocessing (i.e., lowercasing, replacing all the digits with the symbol <digit> and removing the duplicated data), the final KP20k dataset contains 509,818 samples for training, 20,000 for validation, and 20,000 for testing. The dataset statistics are shown in Table 1.

Baselines
We focus on the comparisons with the following state-of-the-art methods as our baselines:
• catSeq (Yuan et al., 2020). The RNN-based seq2seq model with copy mechanism trained under the ONE2SEQ paradigm.
• catSeqTG (Chen et al., 2019b). An extension of catSeq with additional title encoding and cross-attention.
• catSeqTG-2RF1 (Chan et al., 2019). An extension of catSeqTG with RL-based fine-tuning using F1 and Recall metrics as rewards.
• GAN_MR (Swaminathan et al., 2020). An extension of catSeq with RL-based fine-tuning using a discriminator to produce rewards.
• ExHiRD-h (Chen et al., 2020). An extension of catSeq with a hierarchical decoding method and an exclusion mechanism to avoid generating duplicated keyphrases.
In this paper, we propose two Transformer-based models that are denoted as follows:
• Transformer. A Transformer-based model with copy mechanism trained under the ONE2SEQ paradigm.
• SETTRANS. An extension of Transformer with additional control codes trained under the ONE2SET paradigm.
The loss scale factors λ_pre and λ_abs are set based on the validation set. We conduct the experiments on a GeForce RTX 2080Ti GPU, repeat each experiment three times using different random seeds, and report the averaged results.

Evaluation Metrics
We follow previous works (Chan et al., 2019; Chen et al., 2020) and use macro-averaged F1@5 and F1@M for both present and absent keyphrase predictions. F1@M compares all the keyphrases predicted by the model with the ground-truth keyphrases, which means it takes the number of predictions into account. For F1@5, when the number of predictions is less than five, we randomly append incorrect keyphrases until there are five predictions. Without such an appending operation, F1@5 would become the same as F1@M when the number of predictions is less than five, as shown in Chan et al. (2019). We apply the Porter Stemmer before determining whether two keyphrases are identical and remove all the duplicated keyphrases after stemming.
Table 2 and Table 3 show the performance evaluations of present and absent keyphrase predictions, respectively. We observe that the proposed SETTRANS model outperforms almost all the previous state-of-the-art models on both F1@5 and F1@M metrics by a large margin, which demonstrates the effectiveness of our methods. As noted by previous works (Chan et al., 2019; Yuan et al., 2020), predicting absent keyphrases for a document is an extremely challenging task, and thus the performance is much lower than that of present keyphrase prediction. Comparing our Transformer model trained under the ONE2SEQ paradigm with the SETTRANS model trained under the ONE2SET paradigm, we find that SETTRANS improves both keyphrase extractive and generative ability by a large margin on almost all the datasets, and maintains the performance of present keyphrase prediction on the Inspec and Krapivin datasets, which demonstrates the advantages of the ONE2SET training paradigm.
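For a single document, the F1@5 and F1@M computations described in this section can be sketched as follows. This is a simplified illustration with a hypothetical function name: stemming is omitted, keyphrases are compared as exact strings, and the appended incorrect keyphrases are represented by placeholder strings that can never match. The macro-averaged score would be the mean of this value over all documents.

```python
from typing import Optional

def f1_at_k(predictions: list[str], gold: list[str], k: Optional[int] = None) -> float:
    """F1 between predicted and gold keyphrases for one document.

    k=None gives F1@M (all predictions). With an integer k the list is cut to k
    and, if shorter, padded with placeholders that count as wrong predictions.
    """
    preds = list(dict.fromkeys(predictions))  # drop duplicates, keep order
    if k is not None:
        preds = preds[:k]
        preds += [f"<wrong-{i}>" for i in range(k - len(preds))]
    if not preds or not gold:
        return 0.0
    correct = len(set(preds) & set(gold))
    if correct == 0:
        return 0.0
    precision = correct / len(preds)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = ["neural model", "topic model", "keyphrase generation"]
preds = ["neural model", "topic model"]
print(f1_at_k(preds, gold))       # F1@M: precision 1, recall 2/3
print(f1_at_k(preds, gold, k=5))  # F1@5: three padded wrong predictions lower precision to 2/5
```

With two correct predictions out of three gold keyphrases, F1@M is 0.8 while F1@5 drops to 0.5, which shows why the padding makes F1@5 penalize models that predict too few keyphrases.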

Diversity of Predicted Keyphrases
To investigate the model's ability to generate diverse keyphrases, we measure the average numbers of unique present and absent keyphrases, and the average duplication ratio of all the predicted keyphrases. The results are reported in Table 4. Based on the results, we observe that our SETTRANS model generates more unique keyphrases than other baselines by a large margin, and also achieves a significantly lower duplication ratio. Note that ExHiRD-h specifically designed a deduplication mechanism to remove duplication in the inference stage. In contrast, our model achieves a lower duplication ratio without any deduplication mechanism, which demonstrates its effectiveness. However, we also observe that our model tends to generate more present keyphrases than the ground truth on the Krapivin and KP20k datasets. We attribute this to the fact that different datasets have different preferences for the number of keyphrases, and we leave this issue to future work.
Table 4: Number and duplication ratio of predicted keyphrases on three datasets. "#PK" and "#AK" are the average numbers of unique present and absent keyphrases, respectively. "Dup" refers to the average duplication ratio of predicted keyphrases. "Oracle" refers to the gold average keyphrase number.
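The diversity statistics for one document can be sketched as below. The exact definition of the duplication ratio is not spelled out in the text, so the sketch assumes it is the fraction of predictions that repeat an earlier prediction. Plain substring search is again used as a stand-in for stemmed matching when deciding whether a keyphrase is present, and the function name is hypothetical.

```python
def diversity_stats(predictions: list[str], source: str) -> dict:
    """Unique present/absent counts and duplication ratio for one document."""
    unique = list(dict.fromkeys(predictions))  # drop duplicates, keep order
    present = [p for p in unique if p in source]
    absent = [p for p in unique if p not in source]
    # Assumed definition: share of predictions that duplicate an earlier one.
    dup = 1 - len(unique) / len(predictions) if predictions else 0.0
    return {"#PK": len(present), "#AK": len(absent), "Dup": dup}

source = "a neural model and a topic model"
predictions = ["neural model", "neural model", "topic model", "set prediction"]
stats = diversity_stats(predictions, source)
print(stats)  # {'#PK': 2, '#AK': 1, 'Dup': 0.25}
```

Averaging these per-document values over a test set would give numbers comparable in spirit to the "#PK", "#AK" and "Dup" columns, under the stated assumption about the definition.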

Ablation Study
To understand the effects of each component of the SETTRANS model, we conduct an ablation study on it and report the results on the KP20k dataset in Table 5.

Effects of Model Architecture
To verify the effectiveness of the model architecture of SETTRANS, we remove the control codes and find that the model is completely broken. The duplication ratio increases to 0.95, which means that all 20 control codes predict the same keyphrase. This occurs because when the control codes are removed, all the predictions depend on the same condition (i.e., the source document) without any distinction. This demonstrates that the control codes play an extremely important role in the SETTRANS model.

Effects of Target Assignment
The major difficulty for successfully training under ONE2SET paradigm is the target assignment between predictions and targets. An attempt is first made to remove the K-step target assignment mechanism, which means that we employ a fixed sequential matching strategy as in the ONE2SEQ paradigm.
From the results, we observe that both the present and absent keyphrase performances degrade, the number of predicted keyphrases drops dramatically, and the duplication ratio increases greatly by 18%. We analyze the reasons as follows: (1) The dynamic characteristics of the K-step target assignment remove the unnecessary position constraint during training, which encourages the model to generate more keyphrases. Specifically, the model can generate a keyphrase in any location rather than only in the given position. Thus, the model does not need to consider the position constraint during generation, and all the control codes are encouraged to predict keyphrases rather than only the first few codes, which will be verified in Section 5.6. (2) The bipartite characteristics of the K-step target assignment force the model to predict unique keyphrases, which reduces the duplication ratio of predictions. When predictions from two codes are similar, only one code may be assigned a target keyphrase, and the other is assigned a ∅ token. Thus, the model learns to be careful about each prediction to prevent duplication. We further experiment with replacing the K-step target assignment with a random assignment, and we find that the results are similar to those obtained when removing the control codes. This is because the random assignment misleads the learning of the control codes and causes them to become invalid.
Table 5: Ablation study of SETTRANS on the KP20k dataset. "-teacher forcing" refers to directly calculating the loss after target assignment in a student forcing schema. "-separate set loss" refers to using a single set loss.

Effects of Set Loss
As discussed in Section 3.3.2, teacher forcing and a separate set loss are used to train the model after assigning a target for each prediction. We investigate their effects in detail. The results show the following.
(1) Teacher forcing can alleviate the cold start problem. After removing teacher forcing, the model faces a cold start problem: the lack of supervision information leads to poor predictions, so the target assignment is not ideal, which causes the model to fail at the early stage of training. (2) A separate set loss helps in both present and absent keyphrase predictions but also increases the duplication ratio slightly compared with a single set loss. As producing correct present keyphrases is an easier task, the model tends to generate only present keyphrases when using a single set loss. Our separate set loss can infuse different inductive biases into the two sets of control codes, which makes them more focused on generating one type of keyphrase (i.e., the present one or absent one). Thus, it increases the accuracy of the predictions and encourages more absent keyphrase predictions. However, because bipartite matching is performed separately, the constraint of unique prediction does not exist between the two sets, which leads to a slight increase in the duplication ratio.

Performance over Scale Factors
In this section, we conduct experiments on the KP20k dataset to evaluate performance under different loss scale factors λ for the ∅ token. The results are shown in Figure 3. The left part of the figure shows that when λ = 0.2, the performances on both present and absent keyphrases are consistently better than the results when λ = 0.1. However, a scale factor larger than 0.1 improves the present keyphrase performance, but also harms the absent keyphrase performance. As we can see from the right part of the figure, the number of predictions decreases consistently for both the present and absent keyphrases when the scale factor becomes larger. This is because a larger scale factor causes the model to predict more ∅ tokens to reduce the loss penalty during training. Moreover, we also find that the precision metric P@M increases when the number of predictions decreases. However, when the number of predictions becomes too small, the decrease in the recall metric R@M is even greater, which leads to a degradation in the overall metric F1@M.

Efficiency over Assignment Steps
In this section, we study the influence of the number of target assignment steps K on the prediction performance and efficiency compared with Transformer. As shown in the left part of Figure 4, when K is equal to 1, the improvement of SETTRANS over Transformer is smaller than when K is equal to 2 (i.e., the average length of a keyphrase). This is mainly because keyphrases that share the same first word cannot be distinguished during training, which could interfere with the learning of control codes. The right part of Figure 4 shows the training and inference speedup with various K compared with the Transformer. We note that SETTRANS could be slower than Transformer at the training stage, and a smaller K could alleviate this problem. Considering both performance and efficiency, we regard 2 as an appropriate value for K. Moreover, as K is only used in the training stage, SETTRANS is invariably 6.44 times faster than Transformer at the inference stage. This is because, with different control codes as input conditions, all the keyphrases can be generated in parallel on the GPU. Hence, in addition to better performance than Transformer, SETTRANS also has great advantages in inference efficiency.

Analysis of Learned Control Codes
Our analysis here is driven by two questions from Section 5.3: (1) Does the K-step target assignment mechanism encourage all the control codes to predict keyphrases rather than only the first few codes?
(2) Does the separate set loss make the control codes more focused on generating one type of keyphrase (i.e., present or absent) compared to the single set loss?
To investigate these two questions, we measure the ratio of present and absent keyphrase predictions for all the control codes on the KP20k dataset, which is shown in Figure 5. As shown in the top and middle subfigures, we observe that without the target assignment mechanism, many control codes are invalid (i.e., only predicting ∅), and only the first few codes perform valid predictions. Moreover, even though there are already very few valid predictions, the model still has a duplication ratio of up to 26%, as shown in Table 5, resulting in an even smaller number of final predictions. After the introduction of the target assignment mechanism, most of the codes can generate valid keyphrases, which increases the number of predictions.
Figure 5: Ratios of present keyphrase, absent keyphrase, and ∅ predictions for all the control codes on the KP20k dataset. The subgraphs from top to bottom are for the "w/o K-step target assignment", "a single set loss", and "a separate set loss" cases, respectively. The ratios of the present keyphrases, absent keyphrases and ∅ sum to 100% for each code.
However, as shown in the middle subfigure, most of the control codes tend to generate more present keyphrases than absent keyphrases when using a single set loss. When using a separate set loss, as in the bottom subfigure, the two parts are more inclined to predict only present and only absent keyphrases, respectively, which also increases the number of absent keyphrase predictions.

Conclusions
In this paper, we propose a new training paradigm ONE2SET without predefining an order to concatenate the keyphrases, and a novel model SETTRANS that predicts a set of keyphrases in parallel. To successfully train under the ONE2SET paradigm, we propose a K-step target assignment mechanism and a separate set loss, which greatly increase the number and diversity of the generated keyphrases. Experiments show that our method achieves significant performance improvements over existing state-of-the-art models. We also show that SETTRANS has great advantages in inference efficiency compared with the Transformer under the ONE2SEQ paradigm.