SimCKP: Simple Contrastive Learning of Keyphrase Representations

Keyphrase generation (KG) aims to generate a set of summarizing words or phrases given a source document, while keyphrase extraction (KE) aims to identify them from the text. Because the search space is much smaller in KE, it is often combined with KG to predict keyphrases that may or may not exist in the corresponding document. However, current unified approaches adopt sequence labeling and maximization-based generation that primarily operate at a token level, falling short in observing and scoring keyphrases as a whole. In this work, we propose SimCKP, a simple contrastive learning framework that consists of two stages: 1) An extractor-generator that extracts keyphrases by learning context-aware phrase-level representations in a contrastive manner while also generating keyphrases that do not appear in the document; 2) A reranker that adapts scores for each generated phrase by likewise aligning their representations with the corresponding document. Experimental results on multiple benchmark datasets demonstrate the effectiveness of our proposed approach, which outperforms the state-of-the-art models by a significant margin.


Introduction
Keyphrase prediction (KP) is the task of identifying a set of relevant words or phrases that capture the main ideas or topics discussed in a given document. Prior studies define keyphrases that appear in the document as present keyphrases and those that do not as absent keyphrases. High-quality keyphrases are beneficial for various applications such as information retrieval (Kim et al., 2013), text summarization (Pasunuru and Bansal, 2018), and translation (Tang et al., 2016). KP methods are generally divided into keyphrase extraction (KE) (Witten et al., 1999; Hulth, 2003; Nguyen and Kan, 2007; Medelyan et al., 2009; Caragea et al., 2014; Zhang et al., 2016; Alzaidy et al., 2019) and keyphrase generation (KG) models (Meng et al., 2017; Ye and Wang, 2018; Chan et al., 2019; Chen et al., 2020b; Yuan et al., 2020; Ye et al., 2021b; Zhao et al., 2022), where the former only extracts present keyphrases from the text and the latter generates both present and absent keyphrases.

Figure 1: An example of keyphrase prediction. Present and absent keyphrases are colored blue and red, respectively. Overlapping keyphrases are in bold.
Recently, several methods integrating KE and KG have been proposed (Chen et al., 2019a; Liu et al., 2021; Ahmad et al., 2021; Wu et al., 2021, 2022b). These models predict present keyphrases using an extractor and absent keyphrases using a generator, thereby exploiting the relatively small search space of extraction. However, current integrated models suffer from two limitations. First, they employ sequence labeling models that predict the probability of each token being a constituent of a present keyphrase; such token-level predictions become problematic when the target keyphrase is fairly long or overlaps with another. As shown in Figure 1, the sequence labeling model makes an incomplete prediction for the term "multiply connected problem" because only the tokens for "multiply connected" yield a high probability. We also observe that the model consistently misses the keyphrase "integral equations" because it overlaps with another keyphrase, "boundary integral equation", in the text. Second, integrated and even purely generative models are usually based on maximum likelihood estimation (MLE), which predicts the probability of each token given the previously seen tokens. This approach scores the most probable text sequence the highest, but as pointed out by Zhao et al. (2022), keyphrases from the maximum-probability sequence are not necessarily aligned with target keyphrases. In Figure 1, the MLE-based model predicts "magnetostatic energy analysis", which is semantically similar to but not aligned with the target keyphrase "nonlinear magnetostatic analysis". This may be a consequence of greedy search, which can be remedied by finding the target keyphrases across many beams during beam search, but doing so would also generate a large number of noisy keyphrases among the top-k predictions.
Existing KE approaches based on representation learning may address the above limitations (Bennani-Smires et al., 2018; Sun et al., 2020; Liang et al., 2021; Zhang et al., 2022; Sun et al., 2021; Song et al., 2021, 2023). These methods first mine candidates that are likely to be keyphrases in the document and then rank them based on the relevance between the document and keyphrase embeddings, showing promising results. Nevertheless, these techniques only tackle present keyphrases: they may mitigate the overlapping keyphrase problem of sequence labeling, but they do not address the issues that MLE introduces for generated keyphrases.
In this work, we propose a two-stage contrastive learning framework that leverages context-aware phrase-level representations for both extraction and generation. First, we train an encoder-decoder network that extracts present keyphrases on top of the encoder and generates absent keyphrases through the decoder. The model learns to extract present keyphrases by maximizing the agreement between the document and present keyphrase representations. Specifically, we consider the document and its corresponding present keyphrases as positive pairs and the rest of the candidate phrases as negative pairs. Note that these negative candidate phrases are mined from the document using a heuristic algorithm (see Section 4.1). The model pulls keyphrase embeddings toward the document embedding and pushes away the rest of the candidates in a contrastive manner. Then, during inference, the top-k keyphrases that are semantically close to the document are predicted. After the model has finished training, it generates candidates for absent keyphrases. These candidates are constructed simply by overgenerating with a large beam size during beam search decoding. To reduce the noise introduced by beam search, we train a reranker that allocates new scores to the generated phrases via another round of contrastive learning, where this time the agreement between the document and absent keyphrase representations is maximized. Overall, the major contributions of our work can be summarized as follows:

• We present a contrastive learning framework that learns to extract and generate keyphrases by building context-aware phrase-level representations.
• We develop a reranker based on the semantic alignment with the document to improve the absent keyphrase prediction performance.
• To the best of our knowledge, we introduce contrastive learning to a unified keyphrase extraction and generation task for the first time and empirically show its effectiveness across multiple KP benchmarks.
2 Related Work

Keyphrase Extraction
Keyphrase extraction focuses on predicting salient phrases that are present in the source document. Existing approaches can be broadly divided into two-step extraction methods and sequence labeling models. Two-step methods first determine a set of candidate phrases from the text using various heuristic rules (Hulth, 2003; Medelyan et al., 2008; Liu et al., 2011; Wang et al., 2016). These candidate phrases are then sorted and ranked by either supervised algorithms (Witten et al., 1999; Hulth, 2003; Nguyen and Kan, 2007; Medelyan et al., 2009) or unsupervised learning (Mihalcea and Tarau, 2004; Wan and Xiao, 2008; Bougouin et al., 2013; Bennani-Smires et al., 2018). Another line of work is sequence labeling, where a model learns to predict the likelihood of each word being a keyphrase word (Zhang et al., 2016; Luan et al., 2017; Gollapalli et al., 2017; Alzaidy et al., 2019).

Figure 2: A contrastive framework for keyphrase prediction. In the first stage (left), the model learns to maximize the relevance between present keyphrases and their corresponding document while generating absent keyphrases. After training, the model generates candidates for absent keyphrases and sends them to the second stage (right), where the candidates are reranked after their relevance with the document has been maximized/minimized.

Keyphrase Generation
The task of keyphrase generation is introduced to predict both present and absent keyphrases. Subsequent studies leverage techniques such as GANs (Swaminathan et al., 2020), hierarchical decoding (Chen et al., 2020b), graphs (Ye et al., 2021a), dropout (Ray Chowdhury et al., 2022), and pretraining (Kulkarni et al., 2022; Wu et al., 2022a) to improve keyphrase generation. Furthermore, there have been several attempts to unify KE and KG into a single learning framework. These methods not only generate the absent keyphrases but also perform a presumably easier task by extracting present keyphrases from the document instead of generating them from a large vocabulary. Current methodologies utilize external sources (Chen et al., 2019a), selection guidance (Zhao et al., 2021), salient sentence detection (Ahmad et al., 2021), relation networks (Wu et al., 2021), and prompt-based learning (Wu et al., 2022b).

Contrastive Learning
Methods to extract rich feature representations based on contrastive learning (Chopra et al., 2005; Hadsell et al., 2006) have been widely studied in the literature. The primary goal of the learning process is to pull semantically similar data close together while pushing dissimilar data far apart in the representation space. Contrastive learning has shown great success in various computer vision tasks, especially in self-supervised training (Chen et al., 2020a), while Gao et al. (2021) devise a contrastive framework to learn universal sentence embeddings for natural language processing. Furthermore, Liu and Liu (2021) formulate a seq2seq framework employing contrastive learning for abstractive summarization. Similarly, contrastive frameworks for autoregressive language modeling (Su et al., 2022) and open-ended text generation (Krishna et al., 2022) have been presented.
There have also been endeavors to incorporate contrastive learning in the context of keyphrase extraction. These methods generally utilize a pairwise ranking loss to rank phrases with respect to the document and extract present keyphrases (Sun et al., 2021; Song et al., 2021, 2023). In this paper, we devise a contrastive learning framework for keyphrase embeddings on both extraction and generation to improve keyphrase prediction performance.

4 SimCKP
In this section, we elaborate on our approach to building a contrastive framework for keyphrase prediction. In Section 4.1, we delineate our heuristic algorithm for constructing a set of candidates for present keyphrase extraction; in Section 4.2, we describe the multi-task learning process for extracting and generating keyphrases; and lastly, we explain our method for reranking the generated keyphrases in Section 4.3. Figure 2 illustrates the overall architecture of our framework.

Hard Negative Phrase Mining
To obtain the candidates for present keyphrases, we employ a heuristic approach similar to existing extractive methods (Hulth, 2003; Mihalcea and Tarau, 2004; Wan and Xiao, 2008; Bennani-Smires et al., 2018). A notable difference between prior work and ours is that we keep not only noun phrases but also verb, adjective, and adverb phrases, as well as phrases containing prepositions and conjunctions. We observe that keyphrases are made up of diverse parts of speech, and extracting only noun phrases can lead to missing a significant number of keyphrases. Following common practice, we assign part-of-speech (POS) tags to each word using the Stanford POS Tagger and chunk the phrase structure tree into valid phrases using the NLTK RegexpParser.
As shown in Algorithm 1, each document is converted to a phrase structure tree where each word w is tagged with a POS tag t. The tagged document is then split into possible phrase chunks based on our predefined regular expression rules, which must include one or more valid tags such as nouns, verbs, adjectives, etc. Nevertheless, such valid tag sequences are sometimes nongrammatical, which cannot form a proper phrase and thus may introduce noise during training. In response, we filter out such nongrammatical phrases by first categorizing tags as independent or dependent. Phrases generally do not start or end with a preposition or conjunction; therefore, preposition and conjunction tags belong to a dependent tag set T_dep. On the other hand, noun, verb, adjective, and adverb tags can stand alone by themselves, making them belong to an independent tag set T_indep. There are also tag sets T_start_dep and T_end_dep, which include tags that cannot start but can end a phrase and tags that can start but cannot end a phrase, respectively. Lastly, each candidate phrase is iterated over to acquire all n-grams that make up the phrase. For example, if the phrase is "applications of machine learning", we select the n-grams "applications", "machine", "learning", "applications of machine", "machine learning", and "applications of machine learning" as candidates. Note that phrases such as "applications of", "of", "of machine", and "of machine learning" are not chosen as candidates because they are not proper phrases. As noted by Gillick et al. (2019), hard negatives are important for learning a high-quality encoder, and we claim that our mining accomplishes this objective.
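The n-gram enumeration step above can be sketched as follows. This is a simplified illustration, not the paper's exact implementation: the tag sets are reduced to an independent set and a dependent (preposition/conjunction) set, and the helper name is ours.

```python
# Illustrative sketch of the n-gram enumeration in hard negative phrase mining.
# Assumption: independent tags may start or end a phrase, while dependent tags
# (prepositions/conjunctions) may only appear in a phrase's interior.
T_INDEP = {"NN", "NNS", "VB", "JJ", "RB"}
T_DEP = {"IN", "CC"}

def valid_ngrams(tagged_phrase, max_n=6):
    """Enumerate n-grams of a chunked phrase that start and end with an
    independent tag, so fragments like "applications of" are excluded."""
    words = [w for w, _ in tagged_phrase]
    tags = [t for _, t in tagged_phrase]
    candidates = []
    for n in range(1, min(max_n, len(words)) + 1):
        for i in range(len(words) - n + 1):
            if tags[i] in T_INDEP and tags[i + n - 1] in T_INDEP:
                candidates.append(" ".join(words[i:i + n]))
    return candidates
```

Running this on the tagged phrase "applications of machine learning" reproduces the candidate set in the example above while rejecting the improper fragments.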

Extractor-Generator
In order to jointly train for extraction and generation, we adopt a pretrained encoder-decoder network. Given a document x, it is tokenized and fed as input to the encoder, where we take the last hidden states to obtain the contextual embeddings of the document:

$\mathbf{H} = \mathrm{Encoder}(x) = [\mathbf{h}_0, \mathbf{h}_1, \dots, \mathbf{h}_T],$ (1)

where T is the token sequence length of the document and $\mathbf{h}_0$ is the start token (e.g., <s>) representation used as the corresponding document embedding. For each candidate phrase, we construct its embedding by taking the sum pooling of the token span representations: $\mathbf{h}_p = \mathrm{SumPooling}([\mathbf{h}_s, \dots, \mathbf{h}_e])$, where s and e denote the start and end indices of the span. The document and candidate phrase embeddings are then passed through a linear layer followed by a nonlinear activation to obtain the hidden representations:

$\mathbf{z}_d^p = \tanh(\mathbf{W}_d \mathbf{h}_0 + \mathbf{b}_d), \quad \mathbf{z}_p = \tanh(\mathbf{W}_p \mathbf{h}_p + \mathbf{b}_p),$ (2)

where $\mathbf{W}_d$, $\mathbf{W}_p$, $\mathbf{b}_d$, and $\mathbf{b}_p$ are learnable parameters and θ denotes the parameters of the full model. The MLE objective used to train the model to generate absent keyphrases is defined as

$\mathcal{L}_{\mathrm{MLE}} = -\sum_{t=1}^{|y_a|} \log p_\theta(y_{a,t} \mid y_{a,<t}, x).$ (4)

Lastly, we combine the contrastive loss with the negative log-likelihood loss to train the model to both extract and generate keyphrases:

$\mathcal{L} = \mathcal{L}_{\mathrm{MLE}} + \lambda\, \mathcal{L}_{\mathrm{CL}},$ (5)

where λ is a hyperparameter balancing the losses in the objective.
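The representation pipeline above (sum-pool a token span, project it through a linear layer with tanh, then compare candidates to the document by cosine similarity) can be sketched schematically in numpy. The shapes, random weights, and span indices below are illustrative assumptions only.

```python
import numpy as np

# Schematic sketch of span embedding and scoring, not the trained model:
# sum-pool token states over a span, project with tanh(W h + b), and score
# candidates against the document embedding via cosine similarity.
rng = np.random.default_rng(0)
d_model, d_proj = 8, 4
H = rng.normal(size=(10, d_model))   # token states h_1..h_T (stand-in values)
h_doc = rng.normal(size=d_model)     # start-token (<s>) representation h_0
W = rng.normal(size=(d_proj, d_model))
b = np.zeros(d_proj)

def project(h):
    return np.tanh(W @ h + b)        # z = tanh(W h + b), Equation (2)

def span_embedding(H, s, e):
    return H[s:e + 1].sum(axis=0)    # SumPooling([h_s, ..., h_e])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

z_doc = project(h_doc)
candidate_spans = [(0, 2), (3, 3), (4, 7)]   # hypothetical (start, end) pairs
scores = [cosine(z_doc, project(span_embedding(H, s, e)))
          for (s, e) in candidate_spans]
```

In the actual model the projection weights are learned jointly with the encoder; here they are random stand-ins to show the data flow.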

Reranker
As stated by Zhao et al. (2022), MLE-driven models predict candidates with the highest probability, disregarding the possibility that target keyphrases may appear in suboptimal candidates.This problem can be resolved by setting a large beam size for beam search; however, this approach would also result in a substantial increase in the generation of noisy keyphrases among the top-k predictions.
Inspired by Liu and Liu (2021), we aim to reduce this noise by assigning new scores to the generated keyphrases.

Candidate Generation
We employ the finetuned model from Section 4.2 to generate candidate phrases that are highly likely to be absent keyphrases for the corresponding document. We perform beam search decoding with a large beam size on each training document, resulting in the overgeneration of absent keyphrase candidates. The model generates in a ONE2SEQ fashion, where the outputs are sequences of phrases, which means that many duplicate phrases appear across the beams. We remove the duplicates and arrange the phrases such that each unique phrase is independently fed to the encoder. The generator sometimes fails to produce even a single target keyphrase, in which case we filter out such documents for the second-stage training.
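The dedup-and-flatten step described above can be sketched as follows; the ";" separator token is an assumption for illustration, since ONE2SEQ outputs are phrase sequences joined by some delimiter.

```python
# Sketch of collecting unique phrase candidates from overgenerated beams.
# Each beam is a delimiter-joined sequence of phrases; duplicates across
# beams are dropped while preserving first-seen order.
def unique_candidates(beams, sep=";"):
    seen, out = set(), []
    for beam in beams:
        for phrase in (p.strip() for p in beam.split(sep)):
            if phrase and phrase not in seen:
                seen.add(phrase)
                out.append(phrase)
    return out
```

Each surviving phrase is then fed independently to the reranker's encoder. (The paper additionally deduplicates after stemming; stemming is omitted here.)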
Dual Encoder. We adopt two pretrained encoder-only networks and obtain the contextual embeddings of the document and of each candidate phrase c:

$\mathbf{H}_d = [\mathbf{h}_0^d, \dots, \mathbf{h}_T^d], \quad \mathbf{H}_c = [\mathbf{h}_0^c, \dots, \mathbf{h}_{T_c}^c],$

where $T_c$ is the token sequence length of the candidate phrase and $\mathbf{h}_0^d$ and $\mathbf{h}_0^c$ are the start token representations used as the document and candidate phrase embeddings, respectively. Their hidden representations are then obtained by $\mathbf{z}_d^a = \tanh(\mathbf{W}_d \mathbf{h}_0^d + \mathbf{b}_d)$ and $\mathbf{z}_a = \tanh(\mathbf{W}_c \mathbf{h}_0^c + \mathbf{b}_c)$.

Contrastive Learning for Generation. To rank relevant keyphrases high given a document, we train the dual-encoder framework via contrastive learning. Following a similar process as before, we train our model to learn absent keyphrase representations by semantically aligning them with the corresponding document. Specifically, we set the correctly generated keyphrases and their corresponding document as positive pairs, whereas the rest of the generated candidates and the document become negative pairs. The training objective for a positive pair $(\mathbf{z}_d^a, \mathbf{z}_{a,i}^{+})$ (i.e., the document and absent keyphrase $y_a^i$) with $N_a$ candidate pairs then follows Equation 3, where the cross-entropy objective maximizes the similarity of positive pairs and minimizes the rest. The final loss is computed by summing over all positive pairs for the corresponding document.

Datasets
We evaluate our framework on five scientific article datasets: Inspec (Hulth, 2003), Krapivin (Krapivin et al., 2009), NUS (Nguyen and Kan, 2007), SemEval (Kim et al., 2010), and KP20k (Meng et al., 2017). Following previous work (Meng et al., 2017; Chan et al., 2019; Yuan et al., 2020), we concatenate the title and abstract of each sample as a source document and use the training set of KP20k to train all the models. Data statistics are shown in Table 1.

Baselines
We compare our framework with two kinds of KP models: Generative and Unified.

Evaluation Metrics
Following Chan et al. (2019), all models are evaluated on macro-averaged F1@5 and F1@M. F1@M compares all the predicted keyphrases with the ground truth, taking the number of predictions into account. F1@5 measures only the top five predictions, but if the model predicts fewer than five keyphrases, we randomly append incorrect keyphrases until it reaches five. The motivation is to prevent F1@5 and F1@M from reaching similar results when the number of predictions is less than five. We stem all phrases using the Porter Stemmer and remove all duplicates after stemming.
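The F1@5 padding protocol can be made concrete with a short sketch; stemming and macro-averaging over the dataset are omitted here, and the padding tokens are illustrative placeholders guaranteed not to match any target.

```python
# Sketch of F1@5 with padding: when fewer than five keyphrases are predicted,
# incorrect placeholders are appended so precision is always computed over
# exactly five predictions, keeping F1@5 distinct from F1@M.
def f1_at_5(predictions, targets):
    preds = list(predictions[:5])
    while len(preds) < 5:
        preds.append(f"<pad-{len(preds)}>")   # guaranteed-incorrect filler
    correct = len(set(preds) & set(targets))
    p = correct / 5
    r = correct / len(targets) if targets else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

Without the padding, a model predicting two keyphrases and getting one right would score the same precision at k=5 as at k=M, which is exactly the degeneracy the protocol avoids.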

Implementation Details
Our framework is built on PyTorch and Hugging Face's Transformers library (Wolf et al., 2020). We use BART (Lewis et al., 2020) for the encoder-decoder model and uncased BERT (Devlin et al., 2019) for the reranking model. We optimize their weights with AdamW (Loshchilov and Hutter, 2019) and tune our hyperparameters to maximize F1@M on the validation set, incorporating techniques such as early stopping and linear warmup followed by linear decay to 0. We set the maximum n-gram length of candidate phrases to 6 during mining and fix λ to 0.3 for scaling the contrastive loss.
When generating candidates for absent keyphrases, we use beam search with a beam size of 50. During inference, we take as predictions the candidate phrases whose cosine similarity with the corresponding document is higher than a threshold found on the validation set. The threshold is calculated by taking the average of the F1@M-maximizing thresholds for each document. If the number of predictions is less than five, we retrieve the most similar phrases until we obtain five. We conduct our experiments with three different random seeds and report the averaged results.
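The inference-time selection rule just described can be sketched as follows; similarity scores are assumed precomputed, and the helper name is ours.

```python
# Sketch of threshold-based keyphrase selection: keep candidates whose cosine
# similarity with the document clears the global threshold, then top up to
# five predictions by similarity rank if too few survive.
def select_keyphrases(scored, threshold, min_k=5):
    """scored: list of (phrase, similarity) pairs."""
    ranked = sorted(scored, key=lambda x: x[1], reverse=True)
    picked = [p for p, s in ranked if s >= threshold]
    for p, _ in ranked:
        if len(picked) >= min_k:
            break
        if p not in picked:
            picked.append(p)
    return picked
```

A low threshold admits everything (high recall, noisy), while a high one falls back to the top-5 retrieval behavior; the validation-set threshold trades these off for F1@M.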
6 Results and Analyses

Present and Absent Keyphrase Prediction
The present and absent keyphrase prediction results are demonstrated in Table 2 and Table 3, respectively. The performance of our model mostly exceeds that of previous state-of-the-art methods by a large margin, showing that our method is effective in predicting both present and absent keyphrases.
In particular, there is a notable improvement in F1@5 performance, indicating the effectiveness of our approach in retrieving the top-k predictions.
On the other hand, we observe that the F1@M values are not much different from F1@5, and we believe this is due to a critical limitation of a global threshold. The number of keyphrases varies significantly for each document, and finding optimal per-document thresholds seems necessary for improving F1@M performance. Nonetheless, real-world applications are often focused on identifying the top-k keywords, which we believe our model accomplishes effectively.

Ablation Study
We investigate each component of our model to understand its effect on the overall performance and report the effectiveness of each building block in Table 4. Following Xie et al. (2022), we report on two kinds of test sets: 1) KP20k, which we refer to as in-domain, and 2) the combination of Inspec, Krapivin, NUS, and SemEval, which is out-of-domain.
Effect of CL We notice a significant drop in both present and absent keyphrase prediction performance after decoupling contrastive learning (CL).
For a fair comparison, we set the beam size to 50, but our model still outperforms the purely generative model, demonstrating the effectiveness of CL.
We also compare our model with two extractive methods: sequence labeling and binary classification. For sequence labeling, we follow previous work (Tokala et al., 2020; Liu et al., 2021) and employ a BiLSTM-CRF, a strong sequence labeling baseline, on top of the encoder to predict a BIO tag for each token, while for binary classification, a model takes each phrase embedding to predict whether each phrase is a keyphrase or not. CL outperforms both approaches, showing that learning phrase representations is more efficacious.
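The BIO convention used by the sequence labeling baseline can be sketched briefly: for a wordpiece-tokenized keyphrase span, the first piece is tagged B and every subsequent piece I. The helper below is illustrative, not the baseline's actual code.

```python
# Sketch of BIO labeling over a wordpiece-tokenized keyphrase span:
# the first wordpiece gets "B" (begin), all following pieces get "I" (inside).
def bio_labels(wordpieces):
    return ["B" if i == 0 else "I" for i in range(len(wordpieces))]
```

For example, the phrase "voip conferencing system" tokenized into "v ##oi ##p con ##fer ##encing system" is labeled "B I I I I I I", so a single missed token probability breaks the whole span, illustrating why token-level labeling struggles with long keyphrases.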

Effect of Reranking
We remove the reranker and observe a degradation in absent keyphrase prediction performance. Note that the vanilla BART (i.e., w/o CL) is trained to generate both present and absent keyphrases, while the other model (i.e., w/o RERANKER) is trained to generate only the absent keyphrases. The former performs slightly better in out-of-domain scenarios, as it is trained to generate more diverse keyphrases, while the latter excels in-domain, since the absent keyphrases resemble those encountered during training. Nevertheless, the reranker outperforms both, indicating that it plays a vital role in the KG part of our method.

Performance over Max N-Gram Length
We conduct experiments on various maximum lengths of n-grams for extraction and compare the present keyphrase prediction performance from unigrams to 6-grams, as shown in Figure 3. For all datasets, the performance steadily increases up to a length of 3 and then plateaus for the remaining lengths. This indicates that the testing datasets are mostly composed of unigrams, bigrams, and trigrams. The performance increases slightly at a length of 6 for some datasets, such as Inspec and SemEval, suggesting that there is a non-negligible number of 6-gram keyphrases. Therefore, a maximum length of 6 seems feasible for maximum performance in all experiments.

(Footnote: we use the BIO format for our sequence labeling baseline. For example, if the phrase "voip conferencing system" is tokenized into "v ##oi ##p con ##fer ##encing system", it is labeled as "B I I I I I I".)

Impact of Hard Negative Phrase Mining
To assess the effectiveness of our hard negative phrase mining method, we compare it with other negative mining methods and report the results in Table 5. First, utilizing in-batch document embeddings as negatives yields the poorest performance, likely due to ineffective differentiation between keyphrases and other phrase embeddings. Additionally, we experiment with using random text spans as negatives and observe that although this aids representation learning to some degree, the performance improvement is limited. The outcomes of these two baselines demonstrate that our approach successfully mines hard negatives, enabling our encoder to acquire high-quality representations of keyphrases.

Visualization of Semantic Space
To verify that our model works as intended, we visualize the representation space of our model with t-SNE (van der Maaten and Hinton, 2008) plots, as depicted in Figure 4. From the visualizations, we find that our model successfully pulls keyphrase embeddings close to their corresponding document in both the extractor and generator space. Note that the generator space displays fewer phrases than the beam size of 50 because duplicates have been removed after stemming.

Upper Bound Performance
Following previous work (Meng et al., 2021; Ray Chowdhury et al., 2022), we measure the upper bound performance after overgeneration by calculating the recall score of the generated phrases and report the results in Table 6. The high recall demonstrates the potential for reranking to increase precision, and we observe that there is still room for improvement through better reranking, opening up an opportunity for future research.

Conclusion
This paper presents a contrastive framework that aims to improve the keyphrase prediction performance by learning phrase-level representations, rectifying the shortcomings of existing unified models that score and predict keyphrases at a token level.
To effectively identify keyphrases, we divide our framework into two stages: a joint model for extracting and generating keyphrases and a reranking model that scores the generated outputs based on the semantic relation with the corresponding document.We empirically show that our method significantly improves the performance of both present and absent keyphrase prediction against existing state-of-the-art models.

Limitations
Despite the promising prediction performance of the framework proposed in this paper, there is still room for improvement. A fixed global threshold limits the potential performance of the framework, especially when evaluating F1@M. We expect that adaptively selecting a threshold value for each data sample via an auxiliary module might overcome this challenge. Moreover, the result of the second stage highly depends on the performance of the first-stage model, directing the next step of research toward an end-to-end framework.
3 Problem Definition

Given a document x, the task of keyphrase prediction is to identify a set of keyphrases $\mathcal{Y} = \{y^i\}_{i=1,\dots,|\mathcal{Y}|}$, where $|\mathcal{Y}|$ is the number of keyphrases. In the ONE2ONE training paradigm, each sample pair $(x, \mathcal{Y})$ is split into multiple pairs $\{(x, y^i)\}_{i=1,\dots,|\mathcal{Y}|}$ to train the model to generate one keyphrase per document. In ONE2SEQ, each sample pair is processed as $(x, f(\mathcal{Y}))$, where $f(\mathcal{Y})$ is a concatenated sequence of keyphrases. In this work, we train for extraction and generation simultaneously; therefore, we decompose $\mathcal{Y}$ into a present keyphrase set $\mathcal{Y}_p = \{y_p^i\}_{i=1,\dots,|\mathcal{Y}_p|}$ and an absent keyphrase set $\mathcal{Y}_a = \{y_a^i\}_{i=1,\dots,|\mathcal{Y}_a|}$.
The contrastive objective for a positive pair takes the form

$\mathcal{L}_{\mathrm{CL}}^{i} = -\log \frac{\exp(\mathrm{sim}(\mathbf{z}_d^p, \mathbf{z}_{p,i}^{+})/\tau)}{\sum_{j=1}^{N_p} \exp(\mathrm{sim}(\mathbf{z}_d^p, \mathbf{z}_{p,j})/\tau)},$ (3)

where τ is a temperature hyperparameter and sim(u, v) is the cosine similarity between vectors u and v. The final loss is then computed across all positive pairs for the corresponding document, i.e., $\mathcal{L}_{\mathrm{CL}} = \sum_{i=1}^{|\mathcal{Y}_p|} \mathcal{L}_{\mathrm{CL}}^{i}$.

Joint Learning. Our model generates keyphrases by learning a probability distribution $p_\theta(y_a)$ over an absent keyphrase text sequence $y_a = \{y_{a,1}, \dots, y_{a,|y_a|}\}$ (i.e., in a ONE2SEQ fashion).
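The per-pair objective in Equation 3 can be written out in plain Python as a concreteness check. The function below takes precomputed cosine similarities (the positive pair's similarity must be included in the candidate list); this is an illustrative simplification of the batched implementation.

```python
import math

# Sketch of the InfoNCE-style contrastive objective of Equation 3:
# -log( exp(pos/tau) / sum_j exp(sim_j/tau) ), computed with the
# log-sum-exp trick for numerical stability.
def contrastive_loss(pos_sim, all_sims, tau=0.1):
    """pos_sim: similarity of the positive pair; all_sims: similarities of
    every candidate pair for the document, including the positive one."""
    logits = [s / tau for s in all_sims]
    m = max(logits)                               # stabilize the softmax
    denom = sum(math.exp(l - m) for l in logits)
    return -(pos_sim / tau - m - math.log(denom))
```

When the positive pair is the only candidate with similarity 1.0, the loss is exactly 0; as negative similarities approach the positive one, or as τ grows, the loss increases, which is what drives negatives away from the document embedding.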

Figure 3: Comparison of present keyphrase prediction performance w.r.t. max n-gram length during extraction.

Figure 4: Visualization of the semantic space using t-SNE. The left shows the extractor space, while the right depicts the generator space after reranking.

Algorithm 1: Hard Negative Phrase Mining. Input: source document x, maximum n-gram length n, regular expression pattern p, POS tagging function tag(·), phrase parsing function parse(·), stemming function stem(·). Output: present keyphrase candidate set C_pre.

Contrastive Learning for Extraction. To extract relevant keyphrases given a document, we train our model to learn representations by pulling keyphrase embeddings toward the corresponding document while pushing away the rest of the candidate phrase embeddings in the latent space. Specifically, we follow the contrastive framework in Chen et al. (2020a) and take a cross-entropy objective between the document and each candidate phrase embedding. We set keyphrases and their corresponding document as positive pairs, while the rest of the phrases and the document are set as negative pairs. The training objective for a positive pair $(\mathbf{z}_d^p, \mathbf{z}_{p,i}^{+})$ (i.e., the document and present keyphrase $y_p^i$) with $N_p$ candidate pairs then follows Equation 3.

Table 2: Present keyphrase prediction results. The best results are in bold, while the second best are underlined. The subscript denotes the corresponding standard deviation (e.g., 0.427_1 indicates 0.427 ± 0.001).

Table 4: Ablation study. "w/o CL" is the vanilla BART model using beam search for predictions. "w/o RERANKER" extracts with CL but generates using only beam search.

Table 5: Comparison of negative mining methods for present keyphrase prediction.