Exploiting Position and Contextual Word Embeddings for Keyphrase Extraction from Scientific Papers

Keyphrases associated with research papers provide an effective way to find useful information in large and growing scholarly digital collections. In this paper, we present KPRank, an unsupervised graph-based algorithm for keyphrase extraction that incorporates both positional information and contextual word embeddings into a biased PageRank. Our experimental results on five benchmark datasets show that KPRank, which uses contextual word embeddings with an additional position signal, outperforms previous approaches and strong baselines for this task.


Introduction
Keyphrase extraction is the task of automatically extracting a small set of descriptive words or phrases that can accurately summarize the topics discussed in a document (Hasan and Ng, 2010, 2014). Keyphrases are useful in many applications such as document indexing and summarization (Abu-Jbara and Radev, 2011; Qazvinian et al., 2010; Turney, 2003), topic tracking (Augenstein et al., 2017), contextual advertising (Yih et al., 2006), and opinion mining (Berend, 2011).
Most of the previous approaches to keyphrase extraction are either supervised or unsupervised. While supervised approaches generally perform better (Kim et al., 2013), the unsupervised ones have the advantage that they do not require large human-annotated corpora for training reliable models. Unsupervised keyphrase extraction methods usually use graph-based ranking algorithms such as PageRank that work on the word graph constructed from the target document (Mihalcea and Tarau, 2004). Various PageRank extensions have been proposed that incorporate different types of information (Wan and Xiao, 2008; Gollapalli and Caragea, 2014). For example, Wan and Xiao (2008) proposed to incorporate a local neighborhood of the target document into the graph construction, with the neighborhood being determined based on the textual similarity between documents. Liu et al. (2010) exploited topical information to select keyphrases from all major topics. More recently, Mahata et al. (2018) proposed a theme-weighted biased PageRank, called Key2Vec, for keyphrase extraction. In Key2Vec, a theme vector is computed by averaging the embeddings of words and phrases from the title of a scientific document to capture its theme, and the PageRank is biased based on the similarity of candidate words or phrases to the computed theme vector. However, this model is oblivious to the position of words in a scientific document, in which more important words appear not only frequently, but also close to the beginning of the document (Florescu and Caragea, 2017).
Inspired by Transformer models (Vaswani et al., 2017), which infuse positional information into word embeddings to produce embeddings with a time signal, we propose an extension of Key2Vec that incorporates words' positions into a biased PageRank. Moreover, unlike Mahata et al. (2018), who used non-contextual FastText embeddings (Mikolov et al., 2018), we integrate SciBERT contextual embeddings (Beltagy et al., 2019) into our biased PageRank extension. Our contributions are as follows: (1) We propose KPRank, an unsupervised graph-based algorithm that exploits both the position of words in a document and contextual word embeddings for computing a biased PageRank score for ranking candidate phrases; (2) We show empirically that infusing position information into our biased KPRank model yields better performance compared with its counterpart that does not use position information. In addition, KPRank with contextual SciBERT embeddings performs better than FastText-based KPRank; (3) Finally, we show that KPRank outperforms many previous unsupervised models.

Proposed Approach
In this section, we describe our unsupervised graph-based algorithm, called KPRank, which exploits both the position information of words in a document and contextual word embeddings for computing a biased PageRank score for each candidate word. Our approach consists of three steps: (1) candidate word selection and word graph construction; (2) word scoring by biased PageRank; and (3) candidate phrase formation.

Candidate Word Selection and Graph Construction
For a target document D, we first apply a part-of-speech filter and select only nouns and adjectives as candidate words, consistent with previous works (Gollapalli and Caragea, 2014; Mihalcea and Tarau, 2004; Wan and Xiao, 2008). We build a word graph G = (N, E) for D using the candidate words as nodes in G, where N and E are the sets of nodes and edges, respectively. We consider an edge (n_i, n_j) ∈ E between two nodes n_i and n_j in N if the words corresponding to these nodes appear within a window of k consecutive words in the content of D. We experimented with values of k from 1 to 10 and obtained best results with k = 10, which is consistent with Wan and Xiao (2008). The weight of an edge (n_i, n_j), denoted w_ij, is computed as the co-occurrence count of the two words within k consecutive words in D (k = 10). Here, we build undirected graphs because prior work (Mihalcea and Tarau, 2004; Liu et al., 2010) observed that the type of graph (directed or undirected) used to represent the text does not significantly influence the performance of the keyphrase extraction task.
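The graph construction above can be sketched as follows. This is a simplified illustration with hypothetical function and variable names, not the authors' implementation: candidate words become nodes, and each undirected edge weight counts co-occurrences within a window of k consecutive words.

```python
from collections import defaultdict

def build_word_graph(tokens, candidates, k=10):
    """tokens: the full token sequence of the document;
    candidates: the set of POS-filtered candidate words (nouns/adjectives)."""
    weights = defaultdict(int)  # undirected edge weights, keyed by a sorted word pair
    for i, w in enumerate(tokens):
        if w not in candidates:
            continue
        # look ahead within the window of k consecutive words starting at i
        for j in range(i + 1, min(i + k, len(tokens))):
            v = tokens[j]
            if v in candidates and v != w:
                weights[tuple(sorted((w, v)))] += 1
    nodes = {w for w in tokens if w in candidates}
    return nodes, dict(weights)
```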

Biased PageRank
Preliminaries. PageRank (Page et al., 1998) is a graph-based ranking algorithm that iteratively calculates the importance of each node in a graph through endorsements from its neighbors. For document D, we construct an undirected graph G as explained above. Initially, the score of each node in G is set to 1/|N|. This score is then iteratively updated using PageRank. That is, the score s for node n_i is obtained by applying the equation:

s(n_i) = (1 − α) · p_i + α · Σ_{n_j ∈ Adj(n_i)} (w_ij / O(n_j)) · s(n_j)    (1)
where O(n_j) = Σ_{n_k ∈ Adj(n_j)} w_jk and Adj(n_j) is the set of all adjacent nodes of node n_j ∈ N; p_i is defined below. To prevent PageRank from getting stuck in cycles or dead ends, the damping factor α in Eq. (1) allows a random jump to any node in the graph (α = 0.85). Let p = [p_1, ..., p_i, ..., p_|N|] be the probability distribution of randomly jumping to any node in the graph. For an unbiased PageRank, this is a uniform distribution, with p_i = 1/|N| for all i from 1 to |N|. For a biased PageRank, this distribution is not uniform; rather, the nodes in the graph are visited preferentially, with some nodes being visited more often than others depending on the p_i value for node n_i (Haveliwala, 2003). Key2Vec is an example of a (topic-)biased PageRank for keyphrase extraction: it computes p_i for node n_i as the cosine similarity between the embedding of the word/phrase corresponding to n_i and a theme vector for the entire document, obtained by aggregating the word/phrase embeddings from the document's title (Mahata et al., 2018). That is, p_i is higher for words/phrases that are topically (semantically) more similar to the overall theme vector of the document. Next, we describe KPRank, our extension of Key2Vec.
KPRank. In our proposed approach, we calculate p i for node n i using two types of scores: theme (or topic) score and positional score. We multiply both scores to assign a final weight to node n i before running the biased PageRank algorithm. Both scores and their calculation are explained below.
To calculate the theme score (ts_i) for node n_i ∈ N, we first calculate a theme vector (T_D) for document D. The theme vector is obtained by averaging the SciBERT (Beltagy et al., 2019) word embeddings of adjectives and nouns from D's title. The theme score for node n_i is then the cosine similarity between the SciBERT word embedding corresponding to node n_i and T_D. The idea is to assign a higher score to a word if that word is closer to the theme (topic) of the given document. For obtaining word embeddings, for all words with the same stemmed version (obtained with the Porter stemmer), we averaged the contextualized word embeddings produced by SciBERT, using the title and abstract of a document as input to the SciBERT model. We also experimented with pretrained BERT (Devlin et al., 2018) and found that the performance of BERT-based KPRank and SciBERT-based KPRank is very similar. To calculate the positional score (ps_i) for node n_i, we consider the set P_i that contains all positions at which the word corresponding to node n_i occurs in the text. Then, ps_i is calculated as ps_i = Σ_{j ∈ P_i} 1/j. For example, for a word occurring at positions 1 and 10 in the text, its ps_i score is 1/1 + 1/10 = 1.1, whereas for a word occurring at position 100, its ps_i score is 1/100 = 0.01. The intuition behind this weighting scheme is to give higher weight to words appearing at the beginning of a document, since in scientific writing authors tend to use keyphrases very early in the document (even in the title) (Florescu and Caragea, 2017). Based on these considerations, the first position of a phrase/word and its relative position are also used as powerful features in many supervised approaches (Patel and Caragea, 2019; Hulth, 2003; Wu et al., 2005). To calculate the weight w_i for n_i, we multiply the theme score and the positional score, i.e., w_i = ts_i · ps_i.
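The two node scores can be sketched as follows. This is an illustrative sketch with hypothetical function names; the placeholder vectors stand in for the SciBERT embeddings used in the paper.

```python
import numpy as np

def theme_score(word_vec, theme_vec):
    # cosine similarity between a word embedding and the theme vector T_D
    return float(np.dot(word_vec, theme_vec) /
                 (np.linalg.norm(word_vec) * np.linalg.norm(theme_vec)))

def positional_score(positions):
    # ps_i = sum of reciprocals of the word's 1-indexed occurrence positions
    return sum(1.0 / j for j in positions)

def node_weight(word_vec, theme_vec, positions):
    # final (unnormalized) node weight: w_i = ts_i * ps_i
    return theme_score(word_vec, theme_vec) * positional_score(positions)
```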
The intuition is that we give preference to words that appear near the beginning of the document and are more frequent, as compared with less frequent words appearing later in the document, even when both words are equally close to the theme of the document (i.e., have similar theme scores). The vector p is then set to the normalized weights for each node as follows:

p_i = w_i / Σ_{j=1}^{|N|} w_j    (2)

The biased PageRank scores for each node n_i are finally calculated by iteratively applying Eq. (1) with p as in Eq. (2). Figure 1 illustrates our approach. As can be seen, even though nodes n_1 and n_4 have similar theme scores, their final weights differ based on their different positional scores.
In our experiments, the PageRank scores are updated until the difference between two consecutive iterations is ≤ 0.001, or for at most 100 iterations.
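The biased PageRank update in Eq. (1) can be sketched as follows. This is a minimal sketch under the stated stopping criterion; node and edge representations are illustrative, and `p` is assumed to already hold the normalized node weights.

```python
def biased_pagerank(nodes, weights, p, alpha=0.85, tol=0.001, max_iter=100):
    """nodes: list of node ids; weights[(u, v)]: undirected edge weight;
    p[u]: normalized bias (random-jump probability) for node u."""
    adj = {u: {} for u in nodes}
    for (u, v), w in weights.items():
        adj[u][v] = w
        adj[v][u] = w
    out = {u: sum(adj[u].values()) for u in nodes}  # O(n_j) in the text
    s = {u: 1.0 / len(nodes) for u in nodes}        # uniform initialization
    for _ in range(max_iter):
        new = {}
        for u in nodes:
            # s(n_i) = (1 - alpha) * p_i + alpha * sum_j (w_ij / O(n_j)) * s(n_j)
            new[u] = (1 - alpha) * p[u] + alpha * sum(
                w / out[v] * s[v] for v, w in adj[u].items() if out[v] > 0)
        converged = sum(abs(new[u] - s[u]) for u in nodes) <= tol
        s = new
        if converged:
            break
    return s
```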

Candidate Phrase Formation
Candidate words appearing contiguously in the document are concatenated to generate candidate phrases. We consider phrases matching the regular expression (adjective)*(noun)+, of length up to four words. We use the stemmed version of each word, obtained with the Porter stemmer, and the POS tagger from Python's NLTK toolkit. The score for each candidate phrase is calculated by summing up the scores of its individual words (Wan and Xiao, 2008). The top-scoring phrases are output as the predicted keyphrases for a given document.
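The phrase-formation step can be sketched as follows. This is a simplified sketch with hypothetical names that assumes tokens have already been POS-tagged with coarse `ADJ`/`NOUN` labels; the paper uses NLTK's tagger and Porter-stemmed words.

```python
import re

def form_phrases(tagged_tokens, word_scores, max_len=4):
    """tagged_tokens: list of (word, tag) pairs with coarse tags;
    word_scores: biased PageRank score per word."""
    phrases = {}
    n = len(tagged_tokens)
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            span = tagged_tokens[i:j]
            tags = "".join("a" if t == "ADJ" else "n" if t == "NOUN" else "x"
                           for _, t in span)
            if re.fullmatch(r"a*n+", tags):  # matches (adjective)*(noun)+
                words = [w for w, _ in span]
                # phrase score = sum of its constituent word scores
                score = sum(word_scores.get(w, 0.0) for w in words)
                phrases[" ".join(words)] = max(phrases.get(" ".join(words), 0.0), score)
    # return candidate phrases ranked by score, highest first
    return sorted(phrases, key=phrases.get, reverse=True)
```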

Data
For evaluation, we use five datasets, which we describe below. We use the combination of controlled (author assigned) and uncontrolled (reader assigned) keyphrases as gold-standard phrases. We used uncontrolled keyphrases when available. Table 1 shows the summary of the datasets. SemEval (Kim et al., 2010) contains 288 research papers with a train and test split consisting of 188 and 100 papers, respectively.
Krapivin (Krapivin et al., 2009) contains 2,304 ACM research papers with full text and author-assigned keyphrases. Since the dataset does not have a train-test split, similar to Meng et al. (2017), we sampled 400 papers as the test set.
NUS (Nguyen and Kan, 2007) contains 211 research papers. This dataset does not have a train and test split and is relatively small. Hence, consistent with Meng et al. (2017), we used the entire dataset as the test set.
ACM (Patel and Caragea, 2019) contains 30,000 papers published in ACM conferences, with a train and test split consisting of 10,000 and 20,000 papers, respectively. For each dataset, we use its test set for evaluation.

Experimental Setup and Results
Evaluation metrics. To evaluate the performance of different methods, we use the micro-averaged F1-score. We report the performance for the top 5 and top 10 candidate phrases returned by each method, as in (Meng et al., 2017). To create the word graph for a given document, we use its title and abstract. To match the predicted keyphrases with the gold-standard keyphrases, we perform exact matching between their stemmed versions.
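The metric can be sketched as follows. This is an illustrative sketch: it assumes phrases have already been stemmed, and the function name is hypothetical.

```python
def micro_f1_at_k(predictions, golds, k):
    """predictions: ranked lists of stemmed predicted phrases, one per document;
    golds: sets of stemmed gold-standard phrases, one per document."""
    tp = pred_total = gold_total = 0
    for pred, gold in zip(predictions, golds):
        top_k = pred[:k]  # only the top-k predictions are evaluated
        tp += sum(1 for p in top_k if p in gold)       # exact matches
        pred_total += len(top_k)
        gold_total += len(gold)
    precision = tp / pred_total if pred_total else 0.0
    recall = tp / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```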
The effect of position, contextual embeddings, and the comparison with previous works. To see the effect of positional information, we compare the performance of KPRank using contextual SciBERT (SB) embeddings along with positional information (denoted KPRank(SB)) with that of its counterpart that does not use positional information (denoted KPRank(SB−POS)). Moreover, to see the effect of contextual embeddings, we compare the performance of SciBERT-based KPRank (KPRank(SB)) with that of KPRank using FastText non-contextual word embeddings (Mikolov et al., 2018) (denoted KPRank(FastText)). For FastText, we used pretrained 300-dimensional embeddings trained with subword information on Common Crawl. Note that both KPRank(SB) and KPRank(FastText) use positional information along with the theme score. Lastly, we compare the performance of KPRank with Tf-Idf and six PageRank-based unsupervised methods as baselines: PositionRank (Florescu and Caragea, 2017), Key2Vec (Mahata et al., 2018), TextRank (Mihalcea and Tarau, 2004), SingleRank (Wan and Xiao, 2008), ExpandRank (Wan and Xiao, 2008), and TopicRank (Bougouin et al., 2013).
Table 2 shows these comparisons on SemEval, Inspec, Krapivin, NUS, and ACM. It can be seen from the table that adding position information substantially improves the performance of KPRank, i.e., KPRank(SB) substantially outperforms KPRank(SB−POS). Moreover, KPRank(SB) outperforms KPRank(FastText) on all datasets except Krapivin. Importantly, KPRank(SB) outperforms most baseline methods, including Key2Vec (by a large margin); e.g., on SemEval, KPRank(SB) achieves an F1@5 of 22.51% as compared with 17.54% achieved by Key2Vec. We can also notice from Table 2 that KPRank(SB) achieves comparable performance whenever a baseline method achieves the best performance. Figure 2 shows the confusion matrices of KPRank(SB) using @5 predictions on all five datasets. Each matrix is represented as a heat map, i.e., the darker the blue color, the higher the value at that position, and the darker the blue on the main diagonal, the more accurate the model.

Figure 3: A sample paper from the Inspec dataset, with its gold-standard and predicted keyphrases.

Title: Organization design: The continuing influence of information technology
Abstract: Drawing from an information processing perspective, this paper examines how information technology (IT) has been a catalyst in the development of new forms of organizational structures. [...] to the present environmental instability that now characterizes many industries. Specifically, the authors suggest that advances in IT have enabled managers to adapt existing forms and create new models for organizational design that better fit the requirements of an unstable environment. [...] IT has gone from a support mechanism to a substitute for organizational structures in the form of the shadow structure. [...]
Gold-standard keyphrases: Organization design, Information processing perspective, Organizational structures, Environmental instability, Information technology
Predicted keyphrases: Organization design, Information technology, Information processing perspective, Organizational structures, Organizational design, Organization, Information processing, Shadow structure, New forms, Bureaucratic structure

Comparison with a supervised approach. Usually, the performance of supervised keyphrase extraction models is better than that of unsupervised models (Kim et al., 2013). We compare the performance of KPRank(SB) with a CRF-based sequence classification model for keyphrase extraction (Patel and Caragea, 2019) that uses word embeddings as features along with document-specific features. The CRF model outperforms KPRank(SB) on all five datasets; e.g., the CRF model achieves an F1 of 45.73% as compared with 25.76% achieved by KPRank(SB) on SemEval.
Anecdotal example. To assess the quality of the phrases predicted by KPRank(SB), we randomly selected a paper from the Inspec dataset and ran KPRank(SB) on it. We manually inspected the top-10 predictions and contrasted them with the gold-standard keyphrases. The title, abstract, gold-standard keyphrases, and top-10 predicted keyphrases for this paper are shown in Figure 3; the cyan italic phrases in the text at the top of the figure mark gold-standard keyphrases, and the predicted keyphrases are listed in the order of their prediction. It can be seen from the figure that four out of five gold-standard keyphrases are present in the top-5 predicted keyphrases.
We can also see that KPRank(SB) did not predict the gold-standard phrase "environmental instability." A closer inspection of the document and of the two scores (theme score and positional score) assigned by KPRank(SB) revealed that both constituent words of this phrase have low theme scores and each appears only once in the document; hence, the PageRank algorithm does not boost these words. Inspecting other errors, we found that KPRank can fail to predict phrases containing words that are infrequent in the document and whose word embeddings are far from the theme vector.

Conclusion and Future Work
In this paper, we proposed a novel unsupervised graph-based algorithm, named KPRank, which incorporates the positions at which words appear along with contextual word embeddings for computing a biased PageRank score for each candidate word. Our experimental results on five datasets show that incorporating position information into our biased KPRank model yields better performance compared with a KPRank that does not use position information, and that SciBERT-based KPRank usually outperforms FastText-based KPRank on this task. Moreover, KPRank outperforms strong baseline methods. In the future, it would be interesting to explore KPRank in other domains, such as Biology and Social Science.