AttentionRank: Unsupervised Keyphrase Extraction using Self and Cross Attentions

Keyword or keyphrase extraction aims to identify words or phrases that present the main topics of a document. This paper proposes AttentionRank, a hybrid attention model, to identify keyphrases from a document in an unsupervised manner. AttentionRank calculates self-attention and cross-attention using a pre-trained language model. The self-attention determines the importance of a candidate within the context of a sentence. The cross-attention identifies the semantic relevance between a candidate and the sentences within a document. We evaluate AttentionRank against seven baselines on three publicly available datasets. The results show that AttentionRank is an effective and robust unsupervised keyphrase extraction model on both long and short documents. Source code is available on GitHub (https://github.com/hd10-iupui/AttentionRank).


Introduction
A vast number of scientific and non-scientific articles are published online every year. Although some of them have associated keyphrases, which makes them easy to index and search, a considerable number have no keyphrases defined, making indexing and information retrieval challenging. With so many articles available, it is not feasible to extract keyphrases from each manually, so automating keyphrase extraction becomes crucial. Keyphrase extraction has immense value for downstream text mining tasks such as text segmentation, text summarization, query expansion, and indexing.
Keyphrase extraction methods can be supervised or unsupervised. Traditional supervised methods use decision trees (Turney, 2000) or naive Bayes (Witten et al., 2005) to identify whether an input word/phrase is a keyphrase. With the advance of neural networks, various supervised deep learning models (Meng et al., 2017; Alzaidy et al., 2019; Sun et al., 2019) have been developed to extract keyphrases. Supervised methods require a large labeled training dataset and are often domain-specific, whereas unsupervised methods need no labeled data. Traditional unsupervised methods use statistical and graph-based approaches. Statistical methods (Beliga et al., 2016; Rose et al., 2010; Campos et al., 2018) utilize candidate position, frequency, length, and capitalization to determine the importance of a word. Graph-based approaches (Wan and Xiao, 2008; Gollapalli and Caragea, 2014) construct a graph with the candidates as nodes, where edges indicate the similarity or co-occurrence of candidates; graph algorithms are then applied to calculate the importance of the nodes (candidates). Driven by recent deep language models for natural language processing and text analysis, embedding-based and mixed statistical/embedding-based unsupervised keyphrase extraction methods have emerged, such as EmbedRank (Bennani-Smires et al., 2018), SIFRank (Sun et al., 2020), and KeyGames (Saxena et al., 2020).
In this research, we propose an attention-based unsupervised model, AttentionRank, for keyphrase extraction. AttentionRank is motivated by the self-attention mechanism of the BERT model (Devlin et al., 2019) and the hierarchical attention retrieval (HAR) mechanism (Zhu et al., 2019). The AttentionRank model calculates the accumulated self-attention and the cross-attention of a candidate to rank its importance. The accumulated self-attention value for each candidate is extracted from a pre-trained BERT model: the attention a candidate receives from the other words within a sentence is added up, and these sentence-level values are then summed over all sentences in the document. The accumulated self-attention captures the importance of a candidate within the sentences where it occurs. The cross-attention identifies the semantic relevance between a candidate and the document. It calculates word-level bidirectional attention between the embeddings of a candidate and the sentences within a document, then generates an enhanced document embedding for the candidate. The final ranking of a candidate is determined by a linear combination of the accumulated self-attention value and the cross-attention relevance value. A postprocessing step based on document frequency removes terms that are generic to a specific corpus.
AttentionRank is the first model to use the attention values extracted from a pre-trained BERT model for keyphrase extraction. No additional information or domain knowledge is needed, so it generalizes to documents of any domain. This research focuses on investigating whether a pre-trained, non-domain-specific language model can be used for keyphrase extraction.
AttentionRank is compared against seven state-of-the-art unsupervised keyphrase extraction methods on three benchmark datasets: two contain short documents and one contains long documents. The results show that AttentionRank performs better than, or is competitive with, the baselines.
The main contributions of this paper are summarized as follows:
• We propose a novel attention-based unsupervised keyphrase extraction model, AttentionRank;
• We demonstrate that a pre-trained language model can be utilized to identify keyphrases via self-attention and cross-attention;
• We show that AttentionRank outperforms the compared baselines and robustly identifies keyphrases from both short and long documents of different domains.

Methodology
In this section, we introduce the AttentionRank model in detail. The overall architecture of our model is shown in Figure 1. AttentionRank integrates the accumulated self-attention component with the cross-attention component to calculate the final score of a candidate. The proposed model has four main steps: (1) generate a candidate set C from a document; (2) calculate the accumulated self-attention value a_c for each candidate c ∈ C; (3) calculate the cross-attention relevance value r_c between a candidate c and the document d; (4) calculate the final score s_c for each candidate through a linear combination of a_c and r_c.

Candidates Generation
We use the candidate extraction module implemented in EmbedRank (Bennani-Smires et al., 2018). This module first uses part-of-speech (PoS) tagging to identify words tagged as NN, NNS, NNP, NNPS, JJ, etc. Then, the Python package NLTK is used to chunk these tags into noun phrases, which serve as candidates. Given the sentence 'Most parameters of the program are controllable by experimenter-edited text files or simple switches in the program code, thereby minimizing the need for programming to create new experiments', the extracted candidates are 'program code', 'experimenter-edited text files', 'need', 'parameters', 'simple switches', 'program', and 'new experiments'.
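The core chunking rule (optional adjectives followed by nouns) can be sketched in pure Python. The actual module relies on NLTK's chunker, so this is an illustrative simplification, and the JJ*/NN+ pattern shown is an assumption about the exact grammar used:

```python
def extract_candidates(tagged):
    """Collect maximal JJ* NN+ spans as noun-phrase candidates
    (a simplified stand-in for the EmbedRank chunking pattern)."""
    candidates, adjs, nouns = [], [], []
    for word, tag in tagged + [("", "")]:   # sentinel flushes the last span
        if tag.startswith("NN"):            # NN, NNS, NNP, NNPS
            nouns.append(word)
        elif tag == "JJ":
            if nouns:                       # a noun run just ended
                candidates.append(" ".join(adjs + nouns))
                adjs, nouns = [], []
            adjs.append(word)
        else:                               # any other tag breaks the span
            if nouns:
                candidates.append(" ".join(adjs + nouns))
            adjs, nouns = [], []
    return candidates

tagged = [("Most", "JJS"), ("parameters", "NNS"), ("of", "IN"),
          ("the", "DT"), ("program", "NN"), ("are", "VBP"),
          ("controllable", "JJ"), ("by", "IN"), ("simple", "JJ"),
          ("switches", "NNS"), ("in", "IN"), ("the", "DT"),
          ("program", "NN"), ("code", "NN")]
print(extract_candidates(tagged))
# ['parameters', 'program', 'simple switches', 'program code']
```

On this fragment of the example sentence, the sketch recovers the same candidates the paper lists ('parameters', 'program', 'simple switches', 'program code').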

Accumulated Self-Attention Calculation
We use the method introduced by Clark et al. (Clark et al., 2019) to extract the self-attention weights of the words from the pre-trained BERT. We sum the attentions a_{w,w'} that a word w receives from the other words w' within the same sentence s to obtain the attention value a_w of the word within that sentence, as shown in Equation 1. This attention value a_w represents the importance of the word within the context of the sentence.
As shown in Figure 2, all highlighted spans are noun chunks. Intuitively, the darker a noun chunk is, the higher the self-attention it receives, and the higher its probability of being selected as a keyphrase.
To calculate the self-attention of a candidate c in sentence i, we add up the attention values of the words in c, as shown in Equation 2.
The document level self-attention value of candidate c is computed as the sum of all self-attention values of c in each sentence of document d:
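Given an attention matrix already extracted from the pre-trained BERT (with per-head weights combined, as in Clark et al.'s tooling), the accumulation in Equations 1-3 reduces to simple sums. A minimal NumPy sketch, where `A[i, j]` is the attention word i pays to word j:

```python
import numpy as np

def word_attention(A):
    """Eq. 1: attention each word receives from the other words in its
    sentence, i.e., the column sums of the sentence's attention matrix."""
    return A.sum(axis=0)

def candidate_attention(A, span):
    """Eq. 2: received attention of a candidate occupying the token
    positions listed in `span` within the sentence."""
    return float(word_attention(A)[list(span)].sum())

def accumulated_attention(occurrences):
    """Eq. 3: document-level value, summed over every sentence where the
    candidate occurs; `occurrences` is a list of (A, span) pairs."""
    return sum(candidate_attention(A, span) for A, span in occurrences)

A = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.2, 0.8, 0.0]])
print(accumulated_attention([(A, (0, 1)), (A, (2,))]))  # 2.5 + 0.5 = 3.0
```

Candidates that recur across many sentences accumulate attention from each occurrence, which is why this term dominates on long documents (see the integration-ratio analysis below).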

Cross-Attention Calculation
The cross-attention model is inspired by the hierarchical attention retrieval (HAR) model (Zhu et al., 2019) and the bi-direction attentional model (Seo et al., 2016). Based on their network architectures, we develop the cross-attention component to measure the correlation between a candidate and the document based on the context.
A pre-trained BERT model can generate a candidate representation E^c = {e^c_1, ..., e^c_m}, where e^c_j ∈ R^H is the embedding of the j-th word and m is the number of words in the candidate. Similarly, a pre-trained BERT model can generate a representation E^i = {e^i_1, ..., e^i_n} for a sentence i containing n words.
Cross-attention calculates a new document embedding to better measure the contextual correlation between a candidate and the sentences within a document. Given a sentence i represented as E^i ∈ R^{n×H} and a candidate c represented as E^c ∈ R^{m×H}, a similarity matrix S ∈ R^{n×m} between i and c can be calculated as in Equation 4. Then, the word-based sentence-to-candidate and candidate-to-sentence similarities can be measured as in Equations 5 and 6.
The word-based cross-attention weights from sentence to candidate and from candidate to sentence are calculated as in Equations 7 and 8. The new sentence representation V^i is built upon these cross-attention weights and computed by averaging the sum of four terms, as shown in Equation 9: E^i is the original context of the sentence, while A^{i2c}, E^i ⊙ A^{i2c}, and E^i ⊙ A^{c2i} measure the contextual correlation between the sentence and the candidate, where ⊙ denotes element-wise multiplication.
The new sentence representation V^i is still a set of embeddings that captures the word-level relations between the candidate and the sentence. To generate a standardized sentence representation from V^i, self-attention is performed on V^i to highlight the importance of the words after the cross-attention is applied. Given a new sentence representation V^i = {v^i_1, ..., v^i_n} with n words, the self-attention of sentence i is calculated as in Equation 10. Then, the column-wise average is taken to obtain the final representation of the sentence, α_i ∈ R^H.
Once the sentence embeddings are generated, we perform a similar process on them to generate the document embedding. Given a document d comprising a set of sentence embeddings E^d = {α_1, ..., α_k}, we first compute the self-attention of the document to emphasize sentences with higher correlation to the candidate (Equation 12), then take the column-wise average to obtain the final document embedding p_d ∈ R^H.
Since the candidate is also originally represented as a word embedding set E^c = {e^c_1, ..., e^c_m}, the same self-attention calculation is applied (Equation 14), and the column-wise average is taken afterwards to obtain the final candidate embedding p_c ∈ R^H, as in Equation 15.
Finally, the relevance between a candidate c and a document d is determined by the cosine similarity of p_c and p_d, as shown in Equation 16.
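Since the equations themselves are not reproduced here, the following NumPy sketch is one plausible reading of Equations 4-16, assuming dot-product similarity, BiDAF-style candidate-to-sentence attention, and parameter-free scaled dot-product self-attention; all of these are illustrative assumptions rather than the exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_and_pool(E):
    """Parameter-free self-attention followed by a column-wise average
    (the pattern of Eqs. 10-15); E has shape (num_items, H)."""
    V = softmax(E @ E.T / np.sqrt(E.shape[-1])) @ E
    return V.mean(axis=0)

def enhanced_sentence(E_i, E_c):
    """Eqs. 4-9: word-level bidirectional attention between a sentence
    E_i (n x H) and a candidate E_c (m x H), BiDAF-style."""
    S = E_i @ E_c.T                            # Eq. 4: similarity matrix
    A_i2c = softmax(S, axis=1) @ E_c           # sentence-to-candidate
    b = softmax(S.max(axis=1))                 # candidate-to-sentence weights
    A_c2i = np.tile(b @ E_i, (E_i.shape[0], 1))
    V = (E_i + A_i2c + E_i * A_i2c + E_i * A_c2i) / 4.0  # Eq. 9
    return attend_and_pool(V)                  # alpha_i in R^H

def relevance(sentence_embs, E_c):
    """Eq. 16: cosine similarity between the enhanced document embedding
    p_d and the candidate embedding p_c."""
    alphas = np.stack([enhanced_sentence(E_i, E_c) for E_i in sentence_embs])
    p_d, p_c = attend_and_pool(alphas), attend_and_pool(E_c)
    return float(p_c @ p_d / (np.linalg.norm(p_c) * np.linalg.norm(p_d)))

rng = np.random.default_rng(0)
sents = [rng.standard_normal((4, 8)), rng.standard_normal((5, 8))]
cand = rng.standard_normal((2, 8))
print(-1.0 <= relevance(sents, cand) <= 1.0)  # True
```

The key design point survives any variation in the details: the document embedding is rebuilt per candidate, so sentences that attend strongly to that candidate dominate p_d, and the cosine score r_c reflects candidate-specific context rather than a fixed document vector.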
A corpus is often domain-specific, which means some words with high document frequency may be generic to that corpus. In this research, to limit generic words or phrases from becoming keyphrases, we remove candidates whose document frequency is higher than a threshold df_θ.
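This post-processing step amounts to a document-frequency filter; a minimal sketch (function and variable names are our own):

```python
def remove_generic(candidates, doc_freq, df_theta):
    """Drop candidates whose corpus-level document frequency exceeds
    the threshold df_theta (i.e., terms generic to this corpus)."""
    return [c for c in candidates if doc_freq.get(c, 0) <= df_theta]

doc_freq = {"method": 120, "judgment aggregation": 3}
print(remove_generic(["method", "judgment aggregation"], doc_freq, 5))
# ['judgment aggregation']
```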

Datasets and Evaluation Metrics
To fully evaluate the performance of AttentionRank, we test it on three benchmark datasets. The percentages of the n-grams, the average number of words (AveWords), and the average number of sentences (AveSentences) per document are provided in Table 1. Over 50% of the keyphrases are either unigrams or bigrams. The SemEval2017 (Augenstein et al., 2017) and Inspec (Hulth, 2003) datasets contain short documents with an average of 6 to 7 sentences, whereas SemEval2010 (Kim et al., 2010) contains long documents with hundreds of sentences.
The performance of the keyphrase extraction is evaluated using Precision (P), Recall (R), and F1 measure (F1) on the top 5, 10, and 15 ranked keyphrases. The PR curve based on the top-ranked keyphrases is also generated for comparison.
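For concreteness, the @K metrics can be computed as below (exact string matching against the gold keyphrases is an assumption here; evaluation details such as stemming vary across papers):

```python
def prf_at_k(ranked, gold, k):
    """Precision, recall, and F1 over the top-k ranked keyphrases."""
    top, gold = ranked[:k], set(gold)
    tp = sum(1 for c in top if c in gold)       # correctly ranked keyphrases
    p = tp / len(top) if top else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = prf_at_k(["a", "b", "c", "d"], ["a", "c", "e"], 2)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.5 0.333 0.4
```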

Baselines
We compared our model against seven different unsupervised models: SingleRank (Wan and Xiao, 2008), RAKE (Rose et al., 2010), TopicRank (Bougouin et al., 2013), PositionRank (Florescu and Caragea, 2017b), YAKE! (Campos et al., 2018), EmbedRank (Bennani-Smires et al., 2018), and SIFRank (Sun et al., 2020). These baselines all generate candidates using noun phrases without any additional steps. Since KeyGames (Saxena et al., 2020) includes purpose-built candidate generation steps, such as stop word removal and a threshold on token length, it is not included. We used PKE to run SingleRank, RAKE, and TopicRank. The published GitHub code of YAKE!, EmbedRank, PositionRank, and SIFRank was used to produce the results on the selected datasets. It is worth noting that our produced results for the baselines are slightly higher or lower than the results presented in the original papers.

Hyperparameter Setting and Computational Cost
The BERT-Base model (Devlin et al., 2019) is used. For different corpora, the document frequency threshold df_θ and the integration ratio d are fine-tuned to achieve the best performance. The document frequency threshold df_θ for the Inspec, SemEval2017, and SemEval2010 datasets is set to 5, 5, and 44, respectively. For all datasets, we set the linear combination ratio d to 0.8. For the baseline methods, the parameters published in the corresponding GitHub repositories were used. Based on our experiments, the computational cost of extracting the attention between words is low since we can utilize a pre-trained BERT. It takes an average of 74 seconds to generate an attention matrix for a long document on our computer with an Intel i7 9700K CPU and 32 GB of memory.
Table 2 shows the recall, precision, and F1 @5, 10, and 15 of AttentionRank and the baseline models on all datasets. A paired statistical significance test (a t-test when both samples are normally distributed, otherwise a Kolmogorov-Smirnov test) is used to evaluate whether the F1 values of the AttentionRank model are statistically significantly better (p-value < 0.01 or p-value < 0.05). If the p-value is > 0.05, the difference is not statistically significant. It is worth noting that in the cases where AttentionRank performs slightly lower than the SIFRank model, the difference is not statistically significant.

Results
The results show that the embedding-based algorithms, including AttentionRank, perform better than the statistical-based (RAKE and YAKE!) and graph-based algorithms (SingleRank, TopicRank, and PositionRank) on the short document sets (Inspec and SemEval2017). SIFRank performs slightly better than AttentionRank on Inspec when K is set to 5 or 10, where AttentionRank still beats the other baselines. Nevertheless, AttentionRank has a slightly better F1 than SIFRank and the other baselines when the top 15 candidates are used for evaluation, showing that AttentionRank performs competitively on Inspec. AttentionRank outperforms all baselines on the other two datasets with statistical significance. In particular, AttentionRank shows an advantage on the long document set, SemEval2010, where its F1 value is at least 3% better than the highest baseline. Table 3 compares AttentionRank with other unsupervised keyphrase extraction methods using the results reported in their original papers; these methods reported results on Inspec or SemEval2010. The comparison shows that AttentionRank works better than KeyGames (Saxena et al., 2020) on SemEval2010 by 0.5%, although KeyGames used a designed candidate selection approach to remove noise. AttentionRank also works better than MultipartiteRank (Boudin, 2018) and TeKET (Rabby et al., 2020) based on their originally reported results on SemEval2010. Because the accumulated self-attention model considers the accumulation of candidates' self-attention values over the document, AttentionRank works better on the long document set. The performance of AttentionRank on Inspec is not as good as KeyGames, but better than Salience Rank (Teneva and Cheng, 2017).
The PR curves (shown in Figure 3) are drawn using the top 60 ranked candidates generated by each method for an overall comparison. The PR curves show consistent results: SIFRank works slightly better than AttentionRank on the Inspec dataset, whereas AttentionRank outperforms all the baselines on both SemEval datasets. AttentionRank performs much better than YAKE! on SemEval2010, and both outperform the other baselines.

Integration Ratio
The AttentionRank model linearly integrates the accumulated self-attention with the cross-attention relevance to determine the importance of a candidate. We study the influence of the two components by adjusting d (in Equation 17) from 0 to 1 with a step size of 0.1. Figure 4 shows that, in general, the contribution of the accumulated self-attention value is higher than that of the cross-attention relevance. However, the best ratio differs across datasets. On Inspec, the F1 value is highest when d is around 0.3, 0.5, and 0.9 for K equal to 5, 10, and 15, respectively. For SemEval2017, the best performance is achieved when d is set to 0.7, 0.9, and 0.9. For the long document set, SemEval2010, the performance is optimized when d is 1, which means only the accumulated self-attention value is needed to find the keyphrases. We think this is because the accumulated self-attention implicitly captures the repetition of keyphrases through the accumulation of self-attention weights over the document. For short document sets such as Inspec, however, the cross-attention relevance has more impact: since there are only a few sentences in a document, word repetition is low, and the context-wise relevance between keyphrases and the sentences and document needs to be emphasized.
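Assuming Equation 17 takes the form s_c = d · a_c + (1 − d) · r_c (consistent with d = 1 using only the accumulated self-attention), the ablation simply sweeps d. Both scores would need to be put on a common scale first, which we mock here with a min-max step; the normalization choice is our assumption:

```python
def final_scores(a, r, d=0.8):
    """Hypothetical Eq. 17: blend normalized accumulated self-attention
    scores `a` with cross-attention relevance scores `r` per candidate."""
    def minmax(xs):
        lo, hi = min(xs.values()), max(xs.values())
        return {k: (v - lo) / (hi - lo) if hi > lo else 0.0
                for k, v in xs.items()}
    a, r = minmax(a), minmax(r)
    return {c: d * a[c] + (1 - d) * r[c] for c in a}

a = {"judgment aggregation": 9.0, "method": 3.0, "rules": 1.0}
r = {"judgment aggregation": 0.9, "method": 0.2, "rules": 0.8}
scores = final_scores(a, r, d=0.8)
print(max(scores, key=scores.get))  # judgment aggregation
```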

Analysis on Document Frequency
In the AttentionRank model, document frequency is used to remove terms that are generic to a specific corpus. Hence, we also study the impact of the document frequency threshold df_θ. For the short document sets, SemEval2017 and Inspec, df_θ is adjusted from 5 to 10, whereas for the long document set, SemEval2010, df_θ is adjusted from 10 to 110. Figure 5 shows that different datasets reach their optimal performance at different df_θ values. For the short document datasets, the best df_θ is small, whereas the best df_θ for the long document dataset is relatively larger. Once df_θ grows beyond a certain value, performance drops as df_θ increases, which suggests that terms with a large document frequency are unlikely to be keyphrases within a specific corpus.

Case Study
AttentionRank performs closely to SIFRank on short documents. To observe the difference between the two models, we randomly select a short document; Figure 6 presents the importance scores of the candidates calculated by the two models. We normalize their original scores to draw the heatmap: the warmer the color, the higher the score. The phrases in bold italics with underline in the text are the labeled keyphrases. The candidate scores generated by AttentionRank fluctuate more than those generated by SIFRank. Using accumulated self-attention, AttentionRank generates slightly lower scores than SIFRank on some candidates, such as 'multiple decision points', which appears only once. The labeled keyphrase 'decision support system' is not identified by either model.
Although AttentionRank performs better than the other baselines on long documents, YAKE! also achieves good performance. Figure 7 shows a snippet of a long document with extracted keyphrases using these two models. The document is selected from the dataset SemEval2010.
Based on the observed heatmap, the two models assign scores differently. The accumulated self-attention ranks the bigram candidate 'judgment aggregation' high because of its high frequency, whereas YAKE! weighs frequency differently and instead emphasizes a candidate by its length: it assigns a higher score to the trigram candidate 'judgment aggregation rules' than to 'judgment aggregation'.

Related Work
There are supervised and unsupervised keyphrase extraction approaches. The unsupervised approaches can be grouped into two categories: traditional unsupervised methods and deep learning-based methods. Rose et al. (Rose et al., 2010) identified keywords based on word frequency, the number of co-occurring neighbors, and the ratio between co-occurrence and frequency. Campos et al. (Campos et al., 2018) calculated the importance of each candidate using frequency, offsets, and co-occurrence. Alrehamy and Walker (Alrehamy and Walker, 2018) first clustered the candidates based on semantic similarity; candidates with high semantic relevance to the centroids were selected as keywords. Rabby et al. (Rabby et al., 2020) took a binary tree approach in which the statistical relevance of a subtree to its root determines its importance; they then extracted all the paths from the roots to the leaves to find the keyphrases. Mihalcea and Tarau (Mihalcea and Tarau, 2004) converted the candidates into nodes on a graph, then used the PageRank algorithm to calculate the importance of the candidates. Wan and Xiao (Wan and Xiao, 2008) enriched the graph of candidates by collecting information from k-nearest-neighbor documents. Bougouin et al. presented the TopicRank (Bougouin et al., 2013) model, which clusters candidate keyphrases into topics, scores the topics using the TextRank ranking model, and extracts the most representative candidate from each top-ranked topic as a keyphrase. Boudin (Boudin, 2018) proposed a multipartite graph model that encodes topic information within a multipartite graph and utilizes the mutual relationships among keyphrases to improve ranking. Florescu and Caragea (Florescu and Caragea, 2017a) proposed PositionRank, which incorporates the position information of word occurrences into TextRank. Sung and Kim (yeon Sung and Kim, 2020) considered the candidate hierarchy based on conditional probability (general or specific) on a directed weighted graph.
Wang et al. (Wang et al., 2014) made use of pre-trained word embeddings and the frequency of each word to generate weighted edges between words in a document; a weighted PageRank algorithm was then used to compute the final scores of the words. Mahata et al. (Mahata et al., 2018) took a similar approach, using phrase embeddings to represent the candidates and ranking the importance of the phrases by calculating their semantic similarity and co-occurrences. Papagiannopoulou and Tsoumakas (Papagiannopoulou and Tsoumakas, 2018) also used word embeddings for unsupervised keyphrase extraction; different from the previous two methods, the embeddings were trained on a single document (local embeddings). Bennani-Smires et al. (Bennani-Smires et al., 2018) used a document embedding model, EmbedRank, to measure the similarity between documents and candidates and select more representative keyphrases. Sun et al. (Sun et al., 2020) proposed SIFRank, an integration of a statistical model and a pre-trained language model, to calculate the relevance between candidates and the document topic. Saxena et al. (Saxena et al., 2020) employed concepts from evolutionary game theory and utilized the embeddings, position, and frequency of a candidate to calculate a confidence score that determines whether the candidate is a keyphrase.
Different from the previous methods, this study focuses on integrating the self-attention weights extracted from a pre-trained deep language model with a calculated cross-attention relevance value to identify keyphrases that are important in their local sentence context and also strongly relevant to all sentences within the whole document.

Conclusions
This research investigated an accumulated self-attention mechanism integrated with a cross-attention model for unsupervised keyphrase extraction. A pre-trained BERT model is utilized to calculate the self-attention and cross-attention values. Each candidate is processed through a self-attention calculation and a cross-attention relevance calculation to obtain a final score for ranking. We compared the proposed AttentionRank model with seven different baselines on three benchmark datasets, including two short document datasets and one long document dataset. AttentionRank achieves better or competitive F1@5, 10, and 15 on all datasets. The ablation study shows that the accumulated self-attention has a higher contribution than the cross-attention relevance score on the long document set, while for short document sets the linear integration of both attention mechanisms performs well. Future work includes fine-tuning BERT on a target domain and comparing against more baselines on domain-specific datasets.