Back to the Basics: A Quantitative Analysis of Statistical and Graph-Based Term Weighting Schemes for Keyword Extraction

Term weighting schemes are widely used in Natural Language Processing and Information Retrieval. In particular, term weighting is the basis for keyword extraction. However, there are relatively few evaluation studies that shed light on the strengths and shortcomings of each weighting scheme. In fact, in most cases researchers and practitioners resort to the well-known tf-idf as a default, despite the existence of other suitable alternatives, including graph-based models. In this paper, we perform an exhaustive and large-scale empirical comparison of both statistical and graph-based term weighting methods in the context of keyword extraction. Our analysis reveals some interesting findings, such as the advantages of the lesser-known lexical specificity with respect to tf-idf, or the qualitative differences between statistical and graph-based methods. Finally, based on our findings, we discuss and devise some suggestions for practitioners. Source code to reproduce our experimental results, including a keyword extraction library, is available in the following repository: https://github.com/asahi417/kex


Introduction
Keyword extraction has been an essential task in many scientific fields as a first step to extract relevant terms from text corpora. Despite the simplicity of the task, it still poses practical problems, and researchers often resort to simple but reliable techniques such as tf-idf (Jones, 1972). In turn, term weighting schemes such as tf-idf paved the way for developing large-scale Information Retrieval (IR) systems (Ramos et al., 2003; Wu et al., 2008). Its simple formulation is still widely used nowadays, not only for keyword extraction but also as an important component in IR (Jabri et al., 2018; Marcos-Pablos and García-Peñalvo, 2020) and Natural Language Processing (NLP) tasks (Arroyo-Fernández et al., 2019).
While there exist supervised and neural techniques (Lahiri et al., 2017; Xiong et al., 2019; Sun et al., 2020) as well as ensembles of unsupervised methods (Campos et al., 2020; Tang et al., 2020) that can provide competitive performance, in this paper we go back to the basics and analyze in detail the single components of unsupervised methods for keyword extraction. In fact, it is still common to rely on unsupervised methods for keyword extraction given their versatility and the lack of training sets in specialized domains.
Unsupervised keyword extraction methods can be split into statistical and graph-based (see Section 2). However, despite their wide adoption in NLP and IR (even outside keyword extraction), there has not been a large-scale comparison of existing techniques to date. The advantages and disadvantages of statistical methods in contrast to graph-based methods, for example, have remained largely unexplored.
In order to fill this gap, in this paper we perform an extensive analysis of single unsupervised keyword extraction techniques in a wide range of settings and datasets. To the best of our knowledge, this is the first large-scale empirical evaluation performed across base statistical and graph-based keyword extraction methods. Our analysis sheds light on some largely unknown properties of statistical methods. For instance, our experiments show that a statistical weighting scheme based on the hypergeometric distribution such as lexical specificity (Lafon, 1980) can perform at least as well as, or better than, tf-idf (Jones, 1972), while having additional advantages with respect to flexibility and efficiency. As for the graph-based methods, they can be more reliable than statistical methods without being considerably slower in practice. In fact, a graph-based method initialized with tf-idf or lexical specificity performs best overall.
The remainder of the paper is organized as follows. First, we provide an overview of standard statistical and graph-based keyword extraction techniques (Section 2). Then, we present our experimental setting and overall results (Section 3). To better understand the results, we provide an analysis on some relevant features (Section 4) and a focused discussion about advantages and disadvantages of each kind of model (Section 5). Finally, in Section 6 we present our concluding remarks and possible avenues for future work.

Keyword Extraction
Given a document with $m$ words $[w_1 \cdots w_m]$, keyword extraction is the task of finding $n$ noun phrases that comprehensively represent the document. As each such phrase consists of contiguous words in the document, the task can be seen as an ordinary ranking problem over all candidate phrases appearing in the document. A typical keyword extraction pipeline is thus implemented by first constructing a set of candidate phrases $P_d$ for a target document $d$, and second computing importance scores for all individual words in $d$ (in the case of multi-token candidate phrases, the phrase score is the average over its tokens). Finally, the top-$n$ phrases $\{y_j \mid j = 1 \dots n\} \subset P_d$ in terms of the aggregated score are selected as the prediction (Mihalcea and Tarau, 2004). Figure 1 shows an overview of the overarching methodology for unsupervised keyword extraction.
To compute word-level scores, there are mainly two lines of research: statistical and graph-based models. Although there are a few contributions that focus on training supervised models for keyword extraction (Witten et al., 2005), due to the absence of large labeled data and domain-specificity, most efforts are still unsupervised, which is the focus of this paper.

Statistical Models
A statistical model attains an importance score based on word-level statistics or surface features, such as word frequency or word length. (We are aware that statistical may not be strictly accurate to refer to tf-idf or purely frequency-based models, but we follow previous conventions by grouping all methods based on word-level frequency statistics as statistical (Aizawa, 2003).) A simple keyword extraction method could be to use term frequency (tf) as a scoring function for each word, which tends to work reasonably well. However, this simple measure may miss important information such as the relative importance of a given word in a corpus. For instance, prepositions such as in or articles such as the tend to be highly frequent in a text corpus, yet they rarely represent a keyword in a given text document. To this end, different variants have been proposed, which we summarize in two main alternatives: tf-idf (Section 2.1.1) and lexical specificity (Section 2.1.2).

TF-IDF
As an extension of tf, term frequency-inverse document frequency (tf-idf) (Jones, 1972) is one of the most popular and effective methods used for statistical keyword extraction (El-Beltagy and Rafea, 2009), and it remains an important component in modern information retrieval applications (Marcos-Pablos and García-Peñalvo, 2020; Guu et al., 2020). Given a set of documents $D$ and a word $w$ from a document $d \in D$, tf-idf is defined as the product of its word frequency and its inverse document frequency:

$$\mathrm{tfidf}(w \mid d, D) = tf(w \mid d) \cdot \log \frac{|D|}{df(w \mid D)} \qquad (1)$$

where we define $|\cdot|$ as the number of elements in a set, $tf(w \mid d)$ as the frequency of $w$ in $d$, and $df(w \mid D)$ as the document frequency of $w$ over a dataset $D$. In practice, $tf(w \mid d)$ is often computed by counting the number of times that $w$ occurs in $d$, while $df(w \mid D)$ is the number of documents in $D$ that contain $w$.
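To make the formulation concrete, the following is a minimal Python sketch of Eq. (1); the toy corpus and the function name are our own illustration (in the experiments of Section 3 these statistics are computed with Gensim):

```python
import math
from collections import Counter

def tfidf(word, document, corpus):
    """Score `word` in `document` following Eq. (1):
    tf(w|d) * log(|D| / df(w|D))."""
    tf = Counter(document)[word]                  # frequency of w in d
    df = sum(1 for doc in corpus if word in doc)  # documents containing w
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

# toy usage: documents are pre-tokenized lists of words
corpus = [["keyword", "extraction", "with", "term", "weighting"],
          ["keyword", "extraction", "with", "graphs"],
          ["term", "weighting", "schemes"]]
print(tfidf("weighting", corpus[0], corpus))  # tf=1, df=2 -> log(3/2)
```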
To give a few examples of statistical models based on tf-idf and its derivatives in a keyword extraction context, KP-Miner (El-Beltagy and Rafea, 2009) utilizes tf-idf, word length, and the absolute position of a word in a document to determine the importance score, while RAKE (Rose et al., 2010) uses the term degree, i.e., the number of different words a term co-occurs with, divided by tf. More recently, YAKE (Campos et al., 2020) established strong baselines on public datasets by combining various statistical features including casing, sentence position, term/sentence frequency, and term dispersion. In this paper, however, we focus on the vanilla implementations of term frequency and tf-idf.

Lexical specificity
In this section we describe lexical specificity (Lafon, 1980), a statistical metric to extract relevant words from a subcorpus using a larger corpus as reference. In short, lexical specificity extracts a set of the most representative words for a given text based on the hypergeometric distribution. The hypergeometric distribution represents the discrete probability of k successes in n draws without replacement. In the case of lexical specificity, k represents word frequency and n the size of a corpus. While not as widely adopted as tf-idf, lexical specificity has been used in similar term extraction tasks (Drouin, 2003), but also in textual data analysis (Lebart et al., 1998), domain-based disambiguation (Billami et al., 2014), and as a weighting scheme for building vector representations for concepts and entities (Camacho-Collados et al., 2016) or sense embeddings (Scarlini et al., 2020) in NLP.
Formally, the lexical specificity of a word $w$ in a document $d$ is defined as

$$\mathrm{spec}(w \mid d, D) = -\log_{10} \sum_{l \geq f} P_{hg}(x = l;\ m_d, M, f, F) \qquad (2)$$

where $m_d$ is the total number of words in $d$, $M$ is the total number of words in the reference corpus $D$, $f$ is the frequency of $w$ in $d$, $F$ is the frequency of $w$ in $D$, and $P_{hg}(x = l;\ m_d, M, f, F)$ represents the probability of a given word appearing exactly $l$ times in $d$ according to the hypergeometric distribution parameterised with $m_d$, $M$, $f$, and $F$.
Note also that, unlike tf-idf, lexical specificity does not require a perfect partition of the reference corpus $D$ into documents. This also opens up other possibilities, such as using larger corpora as reference. Moreover, (2) can be computed efficiently as described in Camacho-Collados et al. (2016), and is in fact faster than tf-idf according to our empirical speed test in Section 3.5.
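As an illustration, the following is a minimal sketch of Eq. (2) using SciPy's hypergeometric distribution; the direct tail-probability computation shown here is for clarity and omits the efficiency optimisations of Camacho-Collados et al. (2016):

```python
import math
from scipy.stats import hypergeom

def lexical_specificity(f, F, m, M):
    """Specificity per Eq. (2): negative log10 probability of the word
    appearing f times or more in d under the hypergeometric distribution.
      f: frequency of the word in the document d
      F: frequency of the word in the reference corpus D
      m: total number of words in d
      M: total number of words in D
    """
    # P(X >= f) where X ~ Hypergeometric(M, F, m);
    # SciPy's sf(k) returns P(X > k), hence sf(f - 1)
    p = hypergeom.sf(f - 1, M, F, m)
    if p <= 0.0:  # guard against numerical underflow for very rare words
        return float("inf")
    return -math.log10(p)

# toy usage: a word seen 5 times in a 200-word document and
# 50 times in a 100,000-word reference corpus
print(lexical_specificity(f=5, F=50, m=200, M=100_000))
```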

Graph-based Methods
The basic idea behind graph-based keyword extraction is to identify the most relevant words in a graph constructed from a text document, where words are nodes and their connections are measured in different ways. For this, PageRank (Page et al., 1999) and its derivatives have proved to be the most successful graph-based algorithms for extracting relevant keywords (Mihalcea and Tarau, 2004; Wan and Xiao, 2008a; Florescu and Caragea, 2017; Sterckx et al., 2015; Bougouin et al., 2013).
Formally, let $G = (V, E)$ be a graph where $V$ and $E$ are its associated sets of vertices and edges. In a typical word graph construction on a document $d$ (Mihalcea and Tarau, 2004), $V$ is defined as the set of all unique words in $d$ and each edge $e_{w_i,w_j} \in E$ represents the strength of the connection between two words $w_i, w_j \in V$. Then, a Markov chain from $w_j$ to $w_i$ on a word graph can be defined as

$$p(w_i \mid w_j) = \lambda \frac{e_{w_i,w_j}}{\sum_{w_k \in V} e_{w_k,w_j}} + (1 - \lambda)\, p_b(w_i) \qquad (3)$$

where $p_b(\cdot)$ is a prior probability distribution over $V$, and $0 \leq \lambda \leq 1$ is a parameter to control the effect of $p_b(\cdot)$. The prior term $p_b(\cdot)$, which is originally a uniform distribution, is introduced to enable transitions between any two words even if there is no direct connection between them, and this probabilistic model (3) is called the random surfer model (Page et al., 1999). Once we have a word graph, PageRank is applied to it to estimate a probability $\hat{p}(w)$ for every word $w \in V$, which is used as the importance score for the word.
TextRank (Mihalcea and Tarau, 2004) uses an undirected graph and defines the edge weight as $e_{w_i,w_j} = 1$ if $w_i$ and $w_j$ co-occur within a window of $l$ contiguous words in $d$, and $e_{w_i,w_j} = 0$ otherwise. SingleRank (Wan and Xiao, 2008a) extends TextRank by setting the edge weight to the number of co-occurrences of $w_i$ and $w_j$ within the $l$-length sliding window, and ExpandRank (Wan and Xiao, 2008b) multiplies the weight by the cosine similarity of tf-idf vectors of neighbouring documents. To incorporate statistical prior knowledge into the estimation, recent works have proposed non-uniform distributions for $p_b(\cdot)$. Florescu and Caragea (2017) observe that keywords in academic papers are likely to occur in the first few sentences of a document and propose PositionRank, in which $p_b(\cdot)$ is defined as the inverse of the absolute position of each word in a document. TopicalPageRank (TPR) (Jardine and Teufel, 2014; Sterckx et al., 2015) introduces a topic distribution inferred by Latent Dirichlet Allocation (LDA) as $p_b(\cdot)$, so that the estimation contains more semantic diversity across topics. TopicRank (Bougouin et al., 2013) clusters the candidates before running PageRank to group similar words together, and MultipartiteRank (Boudin, 2018) extends it by employing a multipartite graph for a better candidate selection within a cluster.
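To make the graph construction concrete, the following is a minimal SingleRank-style sketch with NetworkX (the library used in our experiments, see Section 3.3); the window size and damping factor shown are illustrative defaults rather than our tuned settings:

```python
import networkx as nx

def singlerank_scores(words, window=5, damping=0.85):
    """Build an undirected co-occurrence graph over a document's words
    (edge weights count co-occurrences within a sliding window, as in
    SingleRank) and score every node with PageRank."""
    graph = nx.Graph()
    for i, w in enumerate(words):
        for v in words[i + 1 : i + window]:
            if w == v:
                continue
            old = graph[w][v]["weight"] if graph.has_edge(w, v) else 0
            graph.add_edge(w, v, weight=old + 1)
    # a uniform prior p_b corresponds to vanilla (non-personalised) PageRank
    return nx.pagerank(graph, alpha=damping, weight="weight")

words = "term weighting schemes for keyword extraction use word graphs".split()
print(singlerank_scores(words))
```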
Finally, there are a few other works that directly run graph clustering (Liu et al., 2009; Grineva et al., 2009), but they use edges to connect clusters instead of words, with semantic relatedness as a weight. Although these techniques do a good job at capturing high-level semantics, the relatedness-based weights rely on external resources such as Wikipedia (Grineva et al., 2009), and thus add another layer of complexity in terms of generalization. For these reasons, they are excluded from this study.

Experimental setting and Results
In this section, we explain our experimental setting, including datasets considered (Section 3.1), preprocessing operations (Section 3.2), comparison systems (Section 3.3), evaluation metrics (Section 3.4), and present the main results (Section 3.5).
As mentioned throughout the paper, our experiments focus on keyword extraction. In all experiments, we use Python and a 16-core Ubuntu machine equipped with a 3.8 GHz i7 CPU and 64 GiB of memory. Upon acceptance, we will release the code and data to reproduce all our experiments.

Datasets
To evaluate keyword extraction methods, we consider 15 different public datasets in English. Table 1 provides high-level statistics of each dataset, including length and number of keyphrases (both average and standard deviation). Most datasets are comprised of full-length academic articles, but there are also datasets of abstracts or paragraphs from academic papers, as well as other formats such as news and technical reports. This is also reflected in their diverse sizes. While some datasets have a relatively small average document size, with fewer than 40 tokens on average, such as Inspec, kdd and www (based on abstracts), and SemEval2017 (paragraph-based), other datasets such as SemEval2020, citeulike180, wiki20 and Krapivin2009 have an average document size of over 800 tokens.
Each entry in a dataset consists of a source document and a set of gold keyphrases, where the source document is processed through the pipeline described in Section 3.2 and the gold keyphrase set is filtered to include only phrases which appear in its candidate set.

Preprocessing
Before running keyword extraction on each dataset, we apply standard text preprocessing operations. The documents are first tokenized into words by segtok, a Python library for tokenization and sentence splitting. Then, each word is stemmed to reduce it to its base form for comparison purposes by the Porter Stemmer from NLTK (Bird et al., 2009), a widely used Python library for text processing. Part-of-speech annotation is carried out using the NLTK tagger.

Table 1: Dataset statistics, where size refers to the number of documents; diversity refers to a measure of vocabulary variety computed as the total number of words divided by the number of unique words; number of noun phrases (NPs) refers to candidate phrases extracted by our pipeline; number of tokens is the size of the dataset; vocab size is the number of unique tokens; and number of keyphrases shows the statistics of gold keyphrases, for which we report the total number of keyphrases as well as the number of keyphrases composed of more than one token (multi-token). For the per-document statistics, we show the average (avg) and the standard deviation (std).
To select the candidate phrase set P_d, following the literature (Wan and Xiao, 2008b), we consider contiguous nouns in the document d that form a noun phrase satisfying the regular expression (ADJECTIVE)*(NOUN)+. We then filter the candidates with a stopword list taken from the official YAKE implementation (Campos et al., 2020). Finally, for the statistical methods and the graph-based methods based on them (i.e., LexRank and TFIDFRank), we compute prior statistics, including term frequency (tf), tf-idf, and LDA, with Gensim (Řehůřek and Sojka, 2010) within each dataset.
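The candidate selection step can be sketched with NLTK's regular-expression chunker as follows; note that this illustration uses NLTK's tokenizer for brevity, whereas our pipeline tokenizes with segtok, and the Penn Treebank tag patterns (JJ.* for adjectives, NN.* for nouns) are our rendering of the (ADJECTIVE)*(NOUN)+ expression:

```python
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' resources

# chunk grammar mirroring (ADJECTIVE)*(NOUN)+ over Penn Treebank tags
chunker = nltk.RegexpParser("NP: {<JJ.*>*<NN.*>+}")

def candidate_phrases(text):
    """Return the noun-phrase candidate set P_d for a document."""
    tokens = nltk.word_tokenize(text)
    tree = chunker.parse(nltk.pos_tag(tokens))
    return [" ".join(token for token, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == "NP"]

print(candidate_phrases("Graph-based keyword extraction builds a word graph."))
```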

Comparison Models
In the following, we list the keyword extraction models which are considered in our experiments.

Statistical models
As statistical models, we include keyword extraction methods based on tf, tf-idf, and lexical specificity, referred to as TF, TFIDF, and LexSpec respectively. Each model uses its statistic as a score for the individual words and then aggregates them to score the candidate phrases (see Section 2.1). We also add a heuristic baseline which takes the first n phrases as its prediction (FirstN).

Graph-based models
As graph-based models, we compare five distinct methods: TextRank (Mihalcea and Tarau, 2004), SingleRank (Wan and Xiao, 2008a), PositionRank (Florescu and Caragea, 2017), SingleTPR (Sterckx et al., 2015), and TopicRank (Bougouin et al., 2013). Additionally, we propose two extensions of SingleRank, which we call TFIDFRank and LexRank, where a word distribution computed by tf-idf or lexical specificity is used for $p_b(\cdot)$. Supposing that we have computed tf-idf (1) for a given dataset $D$, the prior distribution for TFIDFRank is defined as

$$p_b(w) = \frac{\mathrm{tfidf}(w \mid d, D)}{\sum_{v \in V} \mathrm{tfidf}(v \mid d, D)} \qquad (4)$$

for $w \in V$. Likewise, LexRank relies on the precomputed lexical specificity prior (2), which defines the prior distribution for $d$ as

$$p_b(w) = \frac{\mathrm{spec}(w \mid d, D)}{\sum_{v \in V} \mathrm{spec}(v \mid d, D)} \qquad (5)$$

All remaining specifications follow SingleRank's graph construction procedure. For implementations of graph operations such as PageRank and word graph construction, we use NetworkX (Hagberg et al., 2008), a graph analysis library in Python.
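A sketch of the TFIDFRank prior is shown below: the per-word tf-idf scores are normalised into a distribution over the graph's vocabulary and passed as the personalization vector of NetworkX's PageRank, which implements the non-uniform random surfer of Eq. (3). The function and argument names are illustrative; replacing the tf-idf scores with lexical specificity scores yields the LexRank prior of Eq. (5).

```python
import networkx as nx

def tfidfrank_scores(graph, tfidf_scores, damping=0.85):
    """Personalised PageRank where the prior p_b of Eq. (4) is the word's
    tf-idf score normalised over the graph's vocabulary.
      graph:        SingleRank-style co-occurrence graph (nx.Graph)
      tfidf_scores: dict mapping each word to tfidf(w|d, D)
    """
    total = sum(tfidf_scores.get(w, 0.0) for w in graph.nodes)
    if total == 0.0:  # no statistics available: fall back to a uniform prior
        return nx.pagerank(graph, alpha=damping, weight="weight")
    prior = {w: tfidf_scores.get(w, 0.0) / total for w in graph.nodes}
    return nx.pagerank(graph, alpha=damping,
                       personalization=prior, weight="weight")
```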

Evaluation Metrics
To evaluate the keyword extraction models, we employ three standard metrics from the keyword extraction and information retrieval literature: precision at 5 (P@5), precision at 10 (P@10), and mean reciprocal rank (MRR). In general, precision at k is computed as

$$P@k = \frac{1}{|D|} \sum_{d \in D} \frac{|y_d \cap \hat{y}_d^k|}{\min(k, |y_d|)} \qquad (6)$$

where $y_d$ is the set of gold keyphrases provided with a document $d$ in the dataset $D$ and $\hat{y}_d^k$ is the set of the top-$k$ keyphrases estimated by a model for the document. The minimum operation in the denominator of Eq. 6 is included so as to provide a measure between 0 and 1, given the varying number of gold labels; this has also been considered in other retrieval tasks with similar settings (Camacho-Collados et al., 2018). MRR measures the ranking quality given by a model as follows:

$$MRR = \frac{1}{|D|} \sum_{d \in D} \frac{1}{\mathrm{rank}_d} \qquad (7)$$

where $\mathrm{rank}_d$ is the rank of the first correct keyphrase in the ranked list of predictions for $d$. In this way, MRR takes into account the position of the first correct keyword in the ranked list of predictions.
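Both metrics are straightforward to compute per document and then averaged over the dataset; the following is a minimal sketch of Eqs. (6) and (7) with toy data of our own:

```python
def precision_at_k(gold, predicted, k):
    """P@k for one document, per Eq. (6): overlap between gold keyphrases
    and the top-k predictions, normalised by min(k, |y_d|)."""
    return len(set(gold) & set(predicted[:k])) / min(k, len(gold))

def reciprocal_rank(gold, predicted):
    """1/rank of the first correct keyphrase, per Eq. (7); 0 if none."""
    for rank, phrase in enumerate(predicted, start=1):
        if phrase in gold:
            return 1.0 / rank
    return 0.0

# toy usage: dataset-level scores are the averages over all documents
gold = {"term weighting", "keyword extraction"}
predicted = ["word graph", "keyword extraction", "tf-idf"]
print(precision_at_k(gold, predicted, k=5))  # 1 match / min(5, 2) = 0.5
print(reciprocal_rank(gold, predicted))      # first match at rank 2 = 0.5
```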

Results
In this section, we report our main experimental results comparing unsupervised keyword extraction methods. Table 4 shows the results obtained by all comparison systems. As can be observed, performance depends highly on the dataset and the metric.
The algorithms that most often achieve the best accuracy across datasets are TFIDFRank for P@5, SingleRank and TopicRank for P@10, and LexSpec and TFIDF for MRR. In the metrics averaged over all datasets, the lexical specificity and tf-idf based models (TFIDF, LexSpec, TFIDFRank, and LexRank) perform quite well on all metrics. In particular, the hybrid models LexRank and TFIDFRank achieve the best accuracy on all the metrics, with LexSpec and TFIDF being competitive in MRR. Overall, despite their simplicity, both lexical specificity and tf-idf appear to be able to exploit effective features for keyword extraction from a variety of datasets, and perform robustly under domain shifts in document size, format, and source domain. In the following sections we perform a more in-depth analysis of these results and the overall performance of each type of model.
Moreover, TF yields a remarkably low accuracy on every metric, and the large gap between TF and TFIDF can be interpreted as the improvement contributed by the inverse document frequency. Document frequency relies on a partition of the corpus, and hence corpora with an unreliable document partition, such as noisy social media texts or web-crawled content, can degrade TFIDF performance, as heuristics to build a meaningful partition would be required. On the other hand, lexical specificity only needs term frequencies, so we can expect it to be more robust than TFIDF in such practical situations.

Analysis
Following the main results presented in the previous section, we perform an analysis of different aspects of the evaluation. In particular, we focus on the agreement among methods (Section 4.1), overall performance (Section 4.2), and the dataset features that drive each method's performance (Section 4.3).

Agreement Analysis
For visualization purposes, we compute agreement scores over all possible pairs of models as the percentage of predicted keywords the two models have in common in their top-5 predictions, as displayed in Table 3. Interestingly, the most similar models in terms of the agreement score are TFIDFRank and LexRank. Not surprisingly, TFIDF and LexSpec also exhibit a very high similarity, which implies that those two statistical measures capture quite similar features. However, as we will discuss in Section 5.1, they also have a few marked differences. Moreover, we can see that graph-based models show fairly high agreement scores, except for TopicRank, which can be attributed to its different word graph construction procedure. In fact, TopicRank unifies similar words before building a word graph, which results in a distinct behaviour among graph-based models (see Section 2.2). In the discussion section, we investigate the relations among the models in more detail.
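For reference, the agreement score between two models can be computed as a simple top-5 overlap; this is a sketch under the assumption that each model's output is a ranked list of keyphrases per document:

```python
def agreement(preds_a, preds_b, k=5):
    """Mean fraction of shared keyphrases in the top-k predictions of two
    models, where `preds_a` and `preds_b` hold one ranked prediction list
    per document (aligned by document)."""
    shared = [len(set(a[:k]) & set(b[:k])) / k
              for a, b in zip(preds_a, preds_b)]
    return sum(shared) / len(shared)
```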

Mean Precision Analysis
The objective of the following statistical analysis is to compare the overall performance of the keyword extraction methods in terms of their mean performance (i.e., P@k, for k = 5, 10, and MRR). For this analysis all instances from all datasets are considered together, without splitting the performance by dataset as in our main experiments. Therefore, this analysis considers observations that correspond to a pair in the Cartesian product (document × method), that is, the application of a keyword extraction method to a document. Overall, 96,280 observations are included. Table 5 reports the mean P@k and MRR for each keyword extraction method. The best P@5, P@10, and MRR are obtained by TFIDFRank, TopicRank, and TFIDFRank, respectively. A method dominates another in terms of performance if it is no worse in all the metrics and strictly better in at least one metric. Following this rule, it is possible to rank the methods according to their dominance order, i.e., the Pareto ranking. According to this ranking, the top methods are those that are non-dominated, followed by those that are dominated only by methods of the first group, et cetera. The resulting ranking is the following: 1. TFIDFRank and TopicRank; 2. LexRank, LexSpec, and PositionRank; 3. SingleRank and TFIDF; 4. FirstN and TextRank; 5. SingleTPR; 6. TF. As can be observed in this ranking and in the results of Table 5, LexSpec slightly but consistently outperforms TFIDF, which is an interesting result on its own given the predominance of TFIDF in the literature and in practical applications.

Regression Analysis on Methods and Datasets
The objective of this analysis is to understand which dataset characteristics make one method better than another at extracting keywords. For this purpose, a regression model is built for every performance metric (P@k, for k = 5, 10, and MRR) and pair of keyword extraction methods (m1 and m2).
In the regression models, each observation is a pair in the Cartesian product (dataset × method).
The following independent variables are considered: avg_word and sd_word (i.e., average and standard deviation of the number of tokens in the dataset, representing the length of the documents), avg_vocab and sd_vocab (i.e., average and standard deviation of the number of unique tokens in the dataset, representing the lexical richness of the documents), avg_phrase and sd_phrase (i.e., average and standard deviation of the number of noun phrases in the dataset, representing the number of candidate keywords in the documents), and avg_keyword and sd_keyword (i.e., average and standard deviation of the number of gold keyphrases associated with the dataset). The regression models estimate the dependent variable

$$\Delta \mathrm{avg\_score} = \mathrm{avg\_score}_{m1} - \mathrm{avg\_score}_{m2}$$

where avg_score_m1 and avg_score_m2 are the average performance metrics obtained by the methods m1 and m2 on the dataset's documents, respectively. Feature selection is carried out by forward stepwise selection using BIC penalization to remove non-significant variables. Each model considers 15 observations and, overall, 165 regression models are fitted. The adjusted coefficient of determination (adjR²) is used as a measure of goodness of fit; its distribution is illustrated in Figure 2, which shows overall good explanatory capabilities for the regression models. In more detail, the 0%, 25%, 50%, 75%, and 100% quantiles are -0.0481, 0.5779, 0.7092, 0.8263, and 0.9679, respectively (note that adjR² allows for negative values). Thus, more than 75% of the models have an adjR² > 0.58, and more than 50% have an adjR² > 0.71, suggesting that, in general, the considered dataset characteristics satisfactorily explain the differences in the results obtained by the keyword extraction methods. Therefore, the variables can be used to determine which method is more suitable for a given dataset.

Table 4: Mean precision at top 5 (P@5) and top 10 (P@10), and mean reciprocal rank (MRR). The best score in each dataset is highlighted in bold.

Table 5: Keyword extraction methods' mean precision at k, for k = 5, 10, and MRR. Each column is independently colour-coded according to a gradient that goes from green (best/highest value) to red (worst/lowest value).
In addition to this overall interpretation, the coefficients of the regression models can be used to understand under what circumstances each model is preferable. In fact, a positive coefficient identifies a variable that positively correlates with a greater precision for m1. On the other hand, a negative coefficient corresponds to a variable that positively correlates with a greater precision for m2. Only the models having an adjR² > 0.50 and their statistically significant variables (i.e., p-value < 0.05) should be considered for interpretation. Table 6 illustrates a summary of the goodness of fit for a selection of models that are analysed in the following section. Most models, except for LexSpec vs. SingleRank using MRR, achieve an adjR² > 0.50, which indicates that the considered variables explain most of the differences in performance between the models. Finally, Table 7 illustrates the significant variables in the regression models. These are used in the following section to draw insights on the methods' preferences in terms of dataset characteristics.

Table 6: Goodness of fit of selected regression models. The columns present the keyword extraction methods compared (m1 and m2), the metric chosen (metric), and the corresponding adjusted coefficient of determination (adjR²).

Table 7: Significant variables in the regression models comparing keyword extraction methods' performance. Only models having adjR² > 0.5 are represented. Columns m1 and m2 report the compared methods; column 'metric' shows the performance metric considered; the central columns illustrate the statistically significant variables that positively affect the performance of each model. The significance of the variables is indicated in parentheses, according to the following scale: 0 '***' 0.001 '**' 0.01 '*' 0.05.
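For completeness, the following is a sketch of the forward stepwise selection with BIC penalization described above, written with statsmodels; the variable names follow Section 4.3, but this is an illustration of the procedure rather than the exact fitting script used in our analysis:

```python
import numpy as np
import statsmodels.api as sm

def forward_stepwise_bic(X, y):
    """Greedy forward selection: at each step, add the candidate variable
    that most reduces the BIC of an OLS fit; stop when no addition helps.
      X: pandas DataFrame of dataset characteristics (avg_word, sd_word, ...)
      y: dependent variable, e.g. the Delta avg_score between m1 and m2
    """
    selected, remaining = [], list(X.columns)
    best_bic = sm.OLS(y, np.ones(len(y))).fit().bic  # intercept-only model
    while remaining:
        bics = {v: sm.OLS(y, sm.add_constant(X[selected + [v]])).fit().bic
                for v in remaining}
        best_var = min(bics, key=bics.get)
        if bics[best_var] >= best_bic:
            break  # no remaining variable reduces the BIC
        best_bic = bics[best_var]
        selected.append(best_var)
        remaining.remove(best_var)
    return sm.OLS(y, sm.add_constant(X[selected])).fit()
```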

LexSpec vs. TFIDF
According to Table 5, LexSpec and TFIDF have similar average performance, although LexSpec obtains slightly better scores at all precision levels.
Both measures also attain a high agreement score, as explained in Section 4.1. However, in the Pareto ranking, LexSpec ranks second, while TFIDF ranks third. Therefore, in general, the former should be preferred over the latter. Still, TFIDF might perform well in certain datasets. According to Table 7, the choice of the keyword extraction method strongly depends on the metric used. For P@5, TFIDF performs better in datasets having a higher variability in the number of words (sd_word), while LexSpec prefers datasets with longer documents (avg_word), more variability in terms of lexical richness (sd_vocab), and more gold keyphrases (avg_keyword). For P@10, LexSpec exhibits a very different behaviour. In fact, it now performs significantly better in datasets with a high average and high variability in the number of noun phrases (avg_phrase and sd_phrase), as well as a variable number of gold keywords (sd_keyword). On the other hand, TFIDF prefers datasets with high variability in the documents' length (sd_word) and high lexical richness (avg_vocab). Finally, for MRR, the only characteristic that concerns LexSpec is the number of candidate keyphrases (avg_phrase), while TFIDF prefers datasets with longer and lexically richer documents (avg_word and avg_vocab).
Broadly speaking, LexSpec and TFIDF also have qualitative differences. LexSpec, being based on the hypergeometric distribution, has a statistical nature, and probabilities can be directly inferred from it. In this respect, TFIDF is heuristic-based, although it can also be integrated within a probabilistic framework (Joachims, 1996) or interpreted from an information-theoretic perspective (Aizawa, 2003). In practical terms, LexSpec has the advantage of not requiring a partition into documents, unlike the traditional formulation of TFIDF. On the flip side, TFIDF has generally been found relatively simple to tune for specific settings (Cui et al., 2014).

SingleRank vs. TopicRank
In this analysis we compare two qualitatively different graph-based methods, namely SingleRank (a representative of vanilla graph-based methods) and TopicRank, which leverages topic models. TopicRank outperforms SingleRank in P@5 and P@10, while SingleRank achieves a better average MRR score (see Table 5). However, in terms of ranking, TopicRank ranks first (due to obtaining the best average P@10), while SingleRank ranks third. Therefore, in general, TopicRank should be preferred, unless MRR is the metric of choice.
The insights drawn from the regression models are summarised in the following. Table 7 shows that the performance of SingleRank depends on the metric used. On the other hand, TopicRank has a more stable set of preferences, and it is possible to identify a pattern: TopicRank is positively influenced by the number of words and the lexical richness of the documents in a dataset (avg_word, sd_word, avg_vocab, and sd_vocab), while SingleRank is affected by the number of noun phrases and keyphrases associated with the documents (avg_phrase, sd_phrase, and avg_keyword).

Statistical vs. Graph-based
When comparing SingleRank with TFIDF and LexSpec in terms of average performance (see Table 5), it can be seen that SingleRank completely dominates TFIDF. On the other hand, SingleRank and LexSpec both rank second, as the latter achieves a better MRR. Therefore, it is advisable to use SingleRank when the metric of choice is precision at k, and LexSpec when MRR is used instead. This result seems to suggest that while statistical methods can be reliably used to extract relevant terms when precision is required (recall that MRR rewards systems that rank the first correct candidate highly), graph-based methods can extract a more coherent set of keywords overall thanks to their graph-connectivity measures. This idea should be investigated in more detail in future research.
In terms of dataset features, Table 7 shows that the behaviour of SingleRank is very stable. In fact, across all metrics, SingleRank performs better on datasets with a high average number of noun phrases and keyphrases (avg_phrase and avg_keyword). On the other hand, the statistical methods (i.e., TFIDF and LexSpec) achieve better performance on datasets with a high standard deviation in the number of words and keyphrases, and a high average number of unique tokens (sd_word, sd_keyword, and avg_vocab). In conclusion, SingleRank performs better on datasets having a high number of candidate and gold keyphrases, while its performance is hindered on datasets with more lexical richness.
As far as efficiency is concerned, statistical methods are faster overall in terms of computation time (see Table 2). However, all methods are reasonably efficient, and this factor should not be especially relevant unless computations need to be done on the fly or at a very large scale. An advantage of graph-based models is that they do not require a prior computation over the whole dataset. Therefore, graph-based models could potentially reduce the gap in overall execution time in online learning settings, where new documents are added after the initial computations.

Conclusion
In this paper, we have presented a large-scale empirical comparison of unsupervised keyword extraction measures. Our study focused on two different types of keyword extraction methods, namely statistical methods relying on frequency-based features, and graph-based methods exploiting the interconnectivity of words in a corpus. Our analysis on fifteen diverse keyword extraction datasets reveals various insights with respect to each type of method and provides practical suggestions.
In addition to well-known term weighting schemes such as tf-idf, our comparison includes statistical methods such as lexical specificity, which shows better performance than tf-idf while being significantly less used in the literature. We have also explored various types of graph-based methods based on PageRank and on topic models, with varying conclusions with respect to performance and execution time. Finally, all the results and analyses can be used for future research as reference to understand in detail the advantages and disadvantages of each approach in different settings, both qualitatively and quantitatively.
As future work, we plan to extend this analysis to fathom the extent and characteristics of the interactions of different methods and their complementarity. Moreover, we are planning to extend this empirical comparison to other settings where the methods are used as weighting schemes for NLP and IR applications, and for languages other than English.

A.1 Regression models
In the following tables, the regression models for the comparisons considered in Section 5 are presented. For each variable, the tables show the estimated coefficient value, the standard error, the t-value, and the p-value. The last column identifies the significance of the coefficient, according to the following scale: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1. The adjusted coefficient of determination (adjR²) is provided in the caption. Note that only the models having adjR² > 0.5 are shown.