Key Concept Identification: A Comprehensive Analysis of Frequency and Topical Graph-Based Approaches



Introduction
The key concepts in an ontology of a specific domain represent a set of important entity classes or objects [1,2]. Extracting these key concepts automatically is a fundamental and challenging step in Ontology Learning. Many existing approaches for extracting key concepts have focused on keyphrase extraction from text documents [1,3-7]. Keyphrases are terms or groups of terms (phrases) within a document that describe the document and convey its key information [1,5,8]. Because the terms keyphrase and key concept are closely related, we use them interchangeably.
Keyphrase or key concept extraction plays a basic role in many application areas. It is not limited to Ontology Learning [9]; it is also considered the core step in text and document summarization, indexing, clustering [10], categorization [11], and, more recently, in improving search results [8].
While key concepts provide an excellent means to describe a document or represent the knowledge of a specific domain, extracting them is non-trivial, as recent studies suggest [12]. Researchers have devised several approaches to address this problem. Broadly, these approaches can be categorized into supervised and unsupervised methods. Supervised approaches typically recast concept identification as a binary classification task [13]. In these methods, a classifier is trained on annotated documents to classify a given phrase as a key concept or a non-key concept [10,14-17]. However, the effectiveness of these methods relies strongly on a large set of training documents, which biases them towards a specific domain and undermines their ability to generalize to other domains. A viable alternative is an unsupervised approach.
Within the unsupervised category, statistical frequency-based and topical graph-based methods are two leading and potentially powerful kinds of approaches. To utilize their potential for improving key concept identification, we need to thoroughly analyze the performance of methods based on these approaches on datasets from different domains, and to investigate the underlying reasons and error sources when results are poor. To gain a better understanding of the approaches by identifying their shortcomings, and to provide future research directions, we examine three state-of-the-art methods and evaluate their performance on three different datasets. We describe these datasets later in the analysis section.
For our experiments, the first statistical frequency-based method we choose is TF-IDF, because it serves as the baseline for this approach in SemEval-2010 Task 5 [18,19]. The second method is KP-Miner [8], a representative statistical frequency-based method that outperformed all other unsupervised methods in SemEval-2010 Task 5. Finally, we choose TopicRank [21], a popular and representative topical clustering-based method that has beaten previous methods. Another reason for selecting these methods is that they are data-driven and independent of auxiliary sources. We use the publicly available re-implementation of these methods [29].
This study provides a firm basis for future research work and contributes by:
• Providing a brief survey of various kinds of keyphrase extraction methods, along with the necessary details and limitations of the different approaches.
• Identifying, through performance analysis, the factors that can contribute to precision and recall errors in frequency and topical graph-based keyphrase extraction approaches.
• Identifying the three major sources of errors in the selected approaches by conducting a quantitative error source analysis.
The rest of the paper is organized as follows: Section 2 presents a brief survey of various supervised and unsupervised methods used for keyphrase extraction. Section 3 briefly explains how unsupervised methods work and describes the algorithms selected for comparative analysis. Section 4 provides a detailed comparative analysis of the selected algorithms, together with an error analysis. Section 5 concludes the paper with recommendations for future work.

Related Work
As mentioned in the introduction, approaches to key concept or keyphrase extraction can be broadly categorized into supervised and unsupervised methods. Various keyphrase extraction algorithms have been proposed in the literature under each class. A typical classification of supervised and unsupervised methods for keyphrase extraction is given in Figure 1.
To overcome the limitations of supervised methods, numerous unsupervised methods have been proposed that do not rely on training sets. However, developing key concept extraction methods that are language independent and portable across different domains is quite challenging. The methods can be classified into several categories based on the approach and techniques used.
Methods based on statistical and structural information, for example tf-idf (term frequency-inverse document frequency), phrase position, and topic proportion, are language independent [8,27,35-38]. However, their main drawbacks are that they weight single terms more heavily than multiword terms and overlook semantics. Despite these limitations, the statistical frequency-based approach is still preferred, and many algorithms are built on the frequency-based tf-idf model, because it is a data-driven approach that is independent of auxiliary sources.
To address the issues of statistical information-based algorithms, an alternative approach exploits linguistic information and auxiliary structures. Such methods use part-of-speech tags, linguistic patterns, glossaries, WordNet, Wikipedia, or manually created, semantically hierarchical databases. Auxiliary structures and linguistic information contribute to the comprehensiveness and efficiency of keyphrase extraction. However, linguistic techniques are language dependent and may require domain knowledge and language expertise, while glossaries and auxiliary structures require extensive human effort for updating, defining terms, and standardizing terminology [39,40].
CFinder [1] adopts a hybrid approach that combines statistical, structural, and linguistic information about candidate phrases with domain knowledge. However, it is not completely domain independent, so an optimal solution for key concept identification is still needed.
The graph-based approach is popular among unsupervised methods. Graph-based approaches try to overcome the limitations of the aforementioned approaches by constructing a graph in which the nodes represent the candidate phrases and the edges show their relatedness. A ranking algorithm, e.g., PageRank, is then used to rank the keyphrases according to their weights. Researchers have proposed several popular graph-based systems, for example TextRank [22], SingleRank [23], ExpandRank [24], and SGRank [41]. Some other graph-based methods have been introduced recently [42-45]. Most graph-based keyphrase extraction methods prefer single words as nodes, which may result in missing multiword phrases [1]; this is one of the drawbacks of graph-based methods. Another drawback is that they do not guarantee coverage of all the topics of the document [13].
Several popular topical graph-based methods exploit topic models and clustering techniques for keyphrase identification, for example [46], Topic-biased PageRank [47], and TopicRank [21]. This approach has recently attracted the attention of many researchers and is considered a potentially powerful approach.
A summary of the related work is given in Table A1 in Appendix A. It categorizes the different keyphrase extraction methods, presenting the approaches and techniques used by each method and their limitations. We hope this categorization helps in identifying future research directions.

Common Extraction Steps
Before describing the unsupervised methods selected for experimental analysis, we briefly explain how unsupervised methods work. Generic unsupervised key concept extraction commonly follows three steps.
Candidate Phrase Selection: In this preprocessing step, the input text is passed through a filtering process that removes unnecessary words and produces a list of potential candidate phrases. The process relies on commonly used heuristics, including (1) filtering out non-keywords using a stop-word list [27], and (2) considering only words with certain part-of-speech tags, e.g., nouns and adjectives [21-23,37]. Another approach is to use n-grams as candidate words, as reported in [8,18,19].
Candidate Weighting: The second step is to rank the candidate phrases. To accomplish this task, various approaches, as discussed earlier, have been proposed to represent the input text, to model the relatedness between the candidate words, and to rank them.
Keyconcept Formation: The last step is to form key concepts from the ranked list of candidate phrases. A phrase, typically a sequence of nouns, verbs, and adjectives, is considered a key concept if one or more of its constituents are top-ranked candidate terms [22,27], or if the sum of their scores results in a top score for the phrase [23].
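As a rough illustration of the candidate phrase selection step, the following minimal sketch generates n-gram candidates and drops any n-gram that contains a stop-word, a punctuation-only token, or a very short token. The function name `ngram_candidates`, the tiny stop-word list, and the whitespace tokenization are assumptions made for this example only; they are not taken from any of the cited systems, and the sketch ignores sentence boundaries.

```python
import string

# Tiny illustrative stop-word list (an assumption; real systems use a fuller list).
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "for", "on"}


def ngram_candidates(text, max_n=3, min_len=3):
    """Return candidate n-grams (1..max_n words) in order of n, then position.

    An n-gram is kept only if none of its tokens is a stop-word, empty after
    stripping punctuation, or shorter than min_len characters.
    """
    # Naive whitespace tokenization; punctuation-only tokens become "".
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if all(len(t) >= min_len and t not in STOP_WORDS for t in gram):
                candidates.append(" ".join(gram))
    return candidates


print(ngram_candidates("Extraction of key concepts"))
```

The weighting and formation steps would then operate on this candidate list; a part-of-speech filter could replace the n-gram generator without changing the rest of the pipeline.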

TF-IDF
The TF-IDF method [18,19] uses an n-gram approach for candidate selection. It selects 1-, 2-, and 3-grams as candidate phrases and filters out stop-words, tokens consisting only of punctuation marks, and words shorter than three characters. It assigns a weight to each word w in a document d using the word's frequency in d, referred to as tf (term frequency), and the idf (inverse document frequency). It can be defined as follows:

$$w_{ij} = tf_{ij} \times idf, \qquad idf = \log_2 \frac{N}{n}$$

where $w_{ij}$ represents the weight and $tf_{ij}$ is the frequency of word $t_j$ in document $D_i$, N represents the total number of documents in the corpus, and n is the document frequency. The tf weighting is based on Hans Peter Luhn's assumption [48]: "The weight of term that occurs in a document is simply proportional to the term frequency" whereas the idf weighting is a statistical interpretation of the specificity of a term, described as [49]: "The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs".
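The weighting described above can be sketched in a few lines. This is a minimal illustration rather than the re-implementation used in the experiments; the function name `tf_idf_weights` and the toy corpus are assumptions, and the sketch assumes the scored document is itself part of the corpus so that the document frequency is never zero.

```python
import math
from collections import Counter


def tf_idf_weights(doc_terms, corpus):
    """Weight each candidate term of one document as tf * log2(N / n).

    doc_terms: list of candidate terms of the document being scored.
    corpus: list of term lists, one per document (assumed to include doc_terms).
    """
    n_docs = len(corpus)                      # N in the formula
    tf = Counter(doc_terms)                   # term frequency per candidate
    weights = {}
    for term, freq in tf.items():
        df = sum(1 for d in corpus if term in d)   # document frequency n
        weights[term] = freq * math.log2(n_docs / df)
    return weights


# Toy corpus of four "documents" (illustrative data only).
corpus = [
    ["graph", "graph", "rank"],
    ["rank", "topic"],
    ["topic", "model"],
    ["model", "rank"],
]
print(tf_idf_weights(corpus[0], corpus))
```

In this toy corpus "graph" occurs twice in the first document and nowhere else, so it receives a much higher weight than "rank", which occurs in three of the four documents.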

KP-Miner
KP-Miner [8] is a non-learning key concept extraction method, meaning that it does not require any training. This system is also based on a frequency-based statistical measure, i.e., tf-idf. KP-Miner emphasizes both candidate word selection and the weighting process. Along with TF and IDF, two other attributes are used in calculating a candidate's score: the boosting factor and the first occurrence position. KP-Miner uses an n-gram approach for candidate selection. It selects 1- to 5-grams as candidate phrases and filters out stop-words, tokens consisting only of punctuation marks, and words shorter than three characters. KP-Miner then uses two parameters to further filter the candidate list. The first is Lasf (least allowable seen frequency), which represents the minimum frequency for a candidate to be considered a key concept. The second is the CutOff constant, which represents the number of words in a long document after which a candidate appearing for the first time is rarely a keyphrase. The values are set to 3 and 400, respectively, in the original method. KP-Miner assumes that compound keyphrases are less frequent than single keywords. Based on this argument, it assigns high scores to multiword keyphrases in two ways: (1) by setting the document frequency to 1 for compound keyphrases, which results in the maximum IDF value for such phrases, and (2) by multiplying the score by a boosting factor ("related to a ratio of single to compound terms") [8]. To calculate the weights of single or multiword candidate key concepts, the following equation is devised:

$$w_{ij} = tf_{ij} \times idf \times B_i \times P_f$$

where $w_{ij}$ represents the weight and $tf_{ij}$ is the frequency of word $t_j$ in document $D_i$. The inverse document frequency idf is equal to $\log_2 (N/n)$, where N represents the total number of documents in the corpus and n is the document frequency. In the case of a multiword candidate phrase, n is set to 1.
$P_f$ is the factor associated with the term position; it is set to 1 if position rules are not applied. $B_i$ denotes the boosting factor, introduced in KP-Miner, associated with document $D_i$, and can be defined by the following equation:

$$B_i = \frac{|N_i|}{|P_i| \times \alpha}, \qquad \text{and if } B_i > \sigma \text{ then } B_i = \sigma$$

where $|N_i|$ represents the number of all candidate words in document i, and $|P_i|$ is the number of all candidate terms whose length exceeds one word in document i. α and σ are weight adjustment constants. The constant α controls the value of the boosting factor; without it, the boosting factor would be too large, which may bias the results towards compound words.
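The following minimal sketch shows how the boosting factor and the final weight could be computed from the quantities defined above. The function names and the numeric values in the usage lines are illustrative assumptions, not values from Table 2. The sketch also assumes that the boost is applied only to compound candidates (single words keep a factor of 1), and that a document without compound candidates simply receives the cap σ; both choices are one possible reading of the description and should be checked against the original KP-Miner paper.

```python
import math


def boosting_factor(n_candidates, n_compound, alpha, sigma):
    """B_i = |N_i| / (|P_i| * alpha), capped at sigma."""
    if n_compound == 0:
        # Assumption: with no compound candidates the ratio is undefined,
        # so fall back to the cap.
        return sigma
    return min(n_candidates / (n_compound * alpha), sigma)


def kp_miner_weight(tf, n_docs, doc_freq, is_compound, boost, position_factor=1.0):
    """w_ij = tf_ij * idf * B_i * P_f for one candidate."""
    n = 1 if is_compound else doc_freq        # n is set to 1 for compound phrases
    idf = math.log2(n_docs / n)
    b = boost if is_compound else 1.0         # assumption: boost only compounds
    return tf * idf * b * position_factor


# Illustrative numbers only.
b = boosting_factor(n_candidates=100, n_compound=20, alpha=2.0, sigma=3.0)
print(b)
print(kp_miner_weight(tf=2, n_docs=8, doc_freq=4, is_compound=True, boost=b))
print(kp_miner_weight(tf=2, n_docs=8, doc_freq=4, is_compound=False, boost=b))
```

With equal term frequency, the compound candidate ends up with a much larger weight than the single word, which is the bias towards multiword phrases that the method intends.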

TopicRank
TopicRank [21] is a graph-based approach that improves on SingleRank [23]. The intuition behind TopicRank is [21]: "ranking topics instead of words is a more straightforward way to identify the set of keyphrases that cover the main topics of a document". Therefore, TopicRank groups lexically similar noun phrase candidates into clusters that represent topics. A complete graph is then constructed in which the topics are represented by vertices and the semantic relatedness between them is denoted by edges. The weight of an edge is related to the strength of the semantic relatedness between the corresponding vertices. The weight $w_{i,j}$ of each edge in the graph is defined as follows:

$$w_{i,j} = \sum_{c_i \in t_i} \sum_{c_j \in t_j} dist(c_i, c_j)$$

where

$$dist(c_i, c_j) = \sum_{p_i \in pos(c_i)} \sum_{p_j \in pos(c_j)} \frac{1}{|p_i - p_j|}$$

Here $t_i$ represents the topic at a particular vertex of the graph G = (V, E), $dist(c_i, c_j)$ is the sum of the reciprocal distances between the offset positions of the candidate concepts $c_i$ and $c_j$ in the given document, and pos(c) refers to all the offset positions of the candidate key concept c. TextRank's ranking algorithm [22] is then applied to rank the topics based on the weights of their edges. Finally, the first occurring phrase from each of the top-ranked topics is extracted to form the key concepts.
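The edge weight above is simply a double sum over candidates and their positions, as the following minimal sketch shows. The names `dist` and `edge_weight`, the dictionary layout, and the toy offsets are assumptions for illustration; the sketch also skips identical offsets to avoid division by zero, which is an added safeguard rather than part of the published definition.

```python
def dist(positions_i, positions_j):
    """Sum of reciprocal distances between all pairs of offset positions."""
    return sum(
        1.0 / abs(p - q)
        for p in positions_i
        for q in positions_j
        if p != q  # safeguard (assumption): identical offsets are skipped
    )


def edge_weight(topic_i, topic_j, pos):
    """Weight between two topics, each a list of candidate strings.

    pos maps every candidate to the list of its offset positions.
    """
    return sum(dist(pos[ci], pos[cj]) for ci in topic_i for cj in topic_j)


# Toy offsets (illustrative only): candidate "a" occurs at offsets 0 and 10.
pos = {"a": [0, 10], "b": [2], "c": [5]}
print(edge_weight(["a"], ["b", "c"], pos))
```

Candidates that occur close together, and often, contribute large reciprocal distances, so topics whose members co-occur nearby are connected by heavier edges.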

Comparative Analysis
In this section, we first describe the experimental setup and then discuss the performance of each individual method in detail, with the aim of highlighting the major weaknesses of tf-idf and topical clustering-based data-driven approaches. Finally, we present an error source analysis that provides evidence for the arguments made in the performance analysis.

Experimental Setup
Data Sets: We choose the following three evaluation corpora from different domains: (1) the SemEval-2010 dataset, from the scientific domain; (2) a Quran translation, from the religious domain; and (3) the 500N-KPCrowd dataset [52], composed of news stories. SemEval-2010 was selected because it was created in a systematic way to provide a common basis for evaluating current and future keyphrase extraction systems, while the Quran translation was selected because its contents differ from those of commonly used datasets, which are composed of either scientific documents or news articles.
For the SemEval-2010 and 500N-KPCrowd datasets, the ground truth or gold standard is provided for each document. Unfortunately, unlike the datasets for other domains, no gold standard or benchmark dataset is available for the Quranic domain because, to the best of our knowledge, this is the first attempt to use it for analyzing key concept identification algorithms. Creating a proper dataset in this domain was also beyond the scope of this study. Therefore, the simplest option was to rely on domain experts, who have knowledge of the field. Because we depended on the domain experts for validation against the ground truth on the Quranic dataset, we selected 5 chapters, as allowed by the experts. A simple procedure was followed to evaluate the results on the Quranic dataset. To compute precision and recall, we asked the domain experts to verify the output of the algorithms and identify the true positives, false positives, and false negatives. They were instructed to mark true positives as 1 and false positives as 0, and to provide the full list of gold-standard key concepts for each selected document, so that the number of false negatives (gold-standard key concepts that an algorithm missed) could be determined. We did not apply a strict passing criterion for a phrase to be considered a key concept, i.e., the intersection of all lists. Instead, we set a 40% passing criterion, meaning that a phrase is considered a key concept if at least two of the five experts agree on it. The statistics of the selected datasets are given in Table 1.
Pre-processing: Each of the selected algorithms has a pre-processing step that converts the data into a processable form for key concept extraction. Various tasks are performed in this step. For example, the given document is split into sentences and then into words. Part-of-speech tagging and stemming are applied to obtain the part-of-speech tags and stemmed forms of the words. The data is filtered to remove unnecessary words, e.g., stop-words and punctuation marks. All these steps are common among the selected algorithms except part-of-speech tagging, which is only part of TopicRank.
Parameters setting: Table 2 shows the best parameter values for each of the systems. N is the number of key concepts extracted by each system. In KP-Miner [8], Lasf is the least allowable seen frequency, the cutoff constant is the number of words after which candidates are filtered out, and sigma (σ) and alpha (α) are used to compute the boosting factor for candidate phrases. In TopicRank [21], the similarity threshold is used to compute the similarity between candidate concepts for clustering. Among the commonly used linkage methods for clustering, we select average linkage. Note that because the cutoff constant depends on the length of the documents in the dataset, we found it to work best at a value higher than its original value of 400.
Execution details: The selected algorithms are first fine-tuned for optimal parameter settings on each dataset, and the results for each of the three selected methods are then obtained with the best settings. The values for sigma (σ) and alpha (α) are set as reported by the authors of KP-Miner, because they experimentally found the same values to be best for all the datasets they used. Similarly, for TopicRank we use the parameter values reported by its authors, i.e., the similarity threshold and the clustering linkage. However, because the parameters Lasf, cutoff constant, and N depend on the length of the documents in a given dataset, we experimentally determined their best values on all the selected datasets by trying different combinations from the ranges described in Table 3. The F-measure metric (described in Section 4.2.1) is used to determine the values, because the F-measure is the harmonic mean of precision and recall and is high only when both precision and recall are reasonably high.
We used the following measures to analyze the performance of the selected methods.
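The parameter tuning described above amounts to an exhaustive search over combinations of Lasf, cutoff constant, and N, keeping the combination with the highest F-measure. The sketch below is a hedged illustration of that idea only: `best_setting` and `toy_evaluate` are hypothetical names, the value ranges are placeholders rather than the ranges of Table 3, and `toy_evaluate` stands in for running an extraction method on a dataset and computing its F-measure.

```python
from itertools import product


def best_setting(evaluate, lasf_values, cutoff_values, n_values):
    """Return (best_score, (lasf, cutoff, n)) over all parameter combinations.

    evaluate is any callable returning the F-measure for one combination.
    """
    best = None
    for lasf, cutoff, n in product(lasf_values, cutoff_values, n_values):
        score = evaluate(lasf, cutoff, n)
        if best is None or score > best[0]:
            best = (score, (lasf, cutoff, n))
    return best


def toy_evaluate(lasf, cutoff, n):
    """Placeholder score with an arbitrary peak at lasf=3, cutoff=800, n=10."""
    return -abs(lasf - 3) - abs(cutoff - 800) / 100 - abs(n - 10)


# Placeholder ranges (not those of Table 3).
score, setting = best_setting(toy_evaluate, [2, 3, 4], [400, 800], [5, 10, 15])
print(setting)
```

In an actual experiment, `toy_evaluate` would be replaced by a function that runs the extractor with the given parameters and returns the F-measure against the gold standard.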

1. Precision measures the probability that a phrase selected as a key concept by an algorithm is actually a key concept. It is the proportion of correctly identified key concepts among all retrieved phrases. In keyphrase extraction, one is usually interested in retrieving the top K concepts, so we use Precision at K (P@K).

2. Recall measures the probability that the algorithm correctly retrieves a phrase that is a key concept. It is the proportion of correctly identified key concepts among all gold-standard key concepts.

3. F-measure: There is a tradeoff between precision and recall. If the goal is to extract all key concepts, recall might be 100% but precision (P@K) will tend to 0%. Conversely, if the goal is that every extracted phrase is really a key concept, P@K might be 100% but the chance of extracting all keyphrases will be close to 0%. Therefore, another measure called the F-measure is widely used in information extraction; it yields its maximum value when precision and recall are balanced. A high F-measure implies a reasonably high value of both precision and recall [53-55]. The F-measure is the harmonic mean of precision and recall:

$$F = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

4. Average Precision (AP): Precision, Recall, and F-measure are single-value metrics computed over the whole list of retrieved concepts. However, because keyphrase extraction algorithms retrieve a ranked list of key concepts, it is desirable to consider the order in which the key concepts are extracted. Therefore, we use Average Precision in our analysis, which is a preferred measure for evaluating key concept extraction algorithms that aim at ranking. Average Precision (AP) is defined as the area under a precision-recall curve. AP is a single-figure quality measure across the recall scores. More specifically, it is the average of the precision values computed after each retrieved key concept in the ranked list that is matched in the gold standard. In our case, the following equation is used to calculate the AP of the methods [1,56]:

$$AP = \frac{1}{R} \sum_{i=1}^{N} r_i \left( \frac{1}{i} \sum_{j=1}^{i} r_j \right)$$

where N is the number of extracted key concepts, R represents the total number of relevant key concepts extracted by the method, and $r_i$ is set to 1 if the ith extracted key concept is relevant and to 0 otherwise. Key concepts at the top of the ranked list contribute more to the AP than lower-ranked concepts.

5. Average Multiword Phrases: As mentioned by Nakagawa and Mori [1,57], 85% of keyphrases normally consist of multiple words. Therefore, we are interested in analyzing the performance in terms of the multiword phrases extracted by each system. To the best of our knowledge, this is the first attempt to compare keyphrase extraction algorithms on this metric. To compute Average Multiword Phrases, we count the average number of multiword key concepts that match the gold standard.
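The five measures listed above can be summarized in a short sketch. It is an illustration under simple assumptions: matching between extracted and gold-standard phrases is exact string matching, the function names are chosen for this example, and the sample lists are invented. Average Precision divides by the number of relevant concepts the method extracted, following the definition of R given above.

```python
def precision_recall_f(extracted, gold, k):
    """P@K, recall, and F-measure for the top-k entries of a ranked list."""
    top_k = extracted[:k]
    tp = sum(1 for c in top_k if c in gold)
    precision = tp / len(top_k) if top_k else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f


def average_precision(extracted, gold):
    """Mean of the precision values taken at each rank holding a relevant concept."""
    hits, total = 0, 0.0
    for i, concept in enumerate(extracted, start=1):
        if concept in gold:
            hits += 1
            total += hits / i          # precision at rank i
    return total / hits if hits else 0.0


def average_multiword(matched_per_doc):
    """Mean number of matched multiword key concepts per document."""
    counts = [sum(1 for c in m if len(c.split()) > 1) for m in matched_per_doc]
    return sum(counts) / len(counts)


# Invented example: two of four extracted phrases are in the gold standard.
extracted = ["a", "x", "b", "y"]
gold = {"a", "b", "c"}
print(precision_recall_f(extracted, gold, k=2))
print(average_precision(extracted, gold))
print(average_multiword([["key concept", "graph"], ["topic model", "page rank"]]))
```

Because the relevant concept "a" is ranked first, it contributes a precision of 1 to the AP, whereas "b" at rank 3 contributes only 2/3; this is how AP rewards methods that place relevant concepts early.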
Having introduced the performance measures, we first analyze the performance of the selected methods individually on each of the three datasets. We explain the performance in terms of precision-recall curves (see Figure 2). We also plot multiword graphs for each system on each dataset, showing the performance in terms of the average number of multiword key concepts extracted by each system during the series of experiments (see Figure 3). The curves are generated by varying K (1 to 20), the number of key concepts extracted by each system, and plotting the best values obtained. In addition, we explain how the factors included in the ranking formula of each method affect the weights assigned to candidate concepts, by changing the formula parameters. For instance, for TF-IDF and KP-Miner we vary the number of documents in the corpus, which in turn changes the IDF (inverse document frequency) factor. Next, we discuss the overall performance in terms of Average Precision and F-measure.

Individual Performance
Table 4 shows the detailed results of the selected algorithms. Significant F-measure results at various cut-off points are shown in boldface. The performance at each cut-off point N is computed using the following equations:

$$P_N = \frac{TP_N}{TP_N + FP_N}, \qquad R_N = \frac{TP_N}{TP_N + FN_N}, \qquad F_N = \frac{2 \times P_N \times R_N}{P_N + R_N}$$

where $TP_N$ is the total number of true positives, $FP_N$ is the total number of false positives, and $FN_N$ is the total number of false negatives (gold-standard key concepts that were not retrieved) at cut-off point N, and n is the number of documents over which the totals are accumulated. At the cut-off point N = 2, the total true positives $TP_N$ are zero on the SemEval-2010 dataset, which results in zero values for the corresponding metrics, while on the Quranic dataset the total false positives $FP_N$ are zero at that point for TF-IDF and KP-Miner, which results in maximum values for the corresponding metrics. In the following paragraphs we discuss and analyze the performance of each algorithm individually.

1. TF-IDF
A common observation for most key concept extraction methods is that as K, the number of key concepts predicted by each system, increases, recall increases while precision decreases. The precision-recall curves (Figure 2) show that TF-IDF is consistent with this intuition. The overall performance of TF-IDF on the SemEval-2010 benchmark dataset is low compared to KP-Miner but comparable to TopicRank, with slightly higher values, as shown in Figure 2a. The curve in Figure 3a also indicates that the average number of multiword concepts extracted by the system remains stable at a low value of 1.25. In contrast, on the Quranic and 500N-KPCrowd datasets, the precision-recall curve of TF-IDF shows a somewhat overlapping progression with KP-Miner (Figure 2b,c). However, the average number of multiword key concepts extracted is still not more than 1.25.
The reason for the low performance could be that the tf-idf model can potentially miss multiword concepts. This makes sense because the tf (term frequency) factor dominates the idf (inverse document frequency) factor. The tf measures how frequent a word is in a document, and nothing else affects this value, whereas idf measures how rare a word is across the documents in the corpus and depends on the number of documents, N, in the corpus and on the document frequency. Thus, idf is effective only if the corpus contains many documents and the document frequency of a word or phrase is low. Based on this argument, despite the fact that 85% of keyphrases normally consist of multiple words, single terms will gain more weight than multiword phrases, because single-word terms have been found to occur more frequently than multiword phrases [8]. Therefore, this weakness of the tf-idf based data-driven approach may result in missing important multiword key concepts, which in turn affects performance.

2. KP-Miner
The KP-Miner precision-recall curves (Figure 2) show a progression similar to that of TF-IDF: precision falls when recall rises. The overall performance of KP-Miner on the SemEval-2010 dataset is better than that of both TF-IDF and TopicRank. For all variations of the top K key concepts, the highest scores are achieved by KP-Miner (Figure 2a). We may attribute this to the fact that KP-Miner gives more weight to multiword concepts, as can be seen in Figure 3a. KP-Miner is based on the tf-idf model and, as discussed earlier, idf, which measures the rareness of a phrase, is effective only if the corpus contains many documents and the document frequency of a word or phrase is low. Because multiword phrases are less frequent and rarer across the document corpus, on the SemEval-2010 dataset, where the number of documents in the corpus is higher than in the Quranic dataset, multiword concepts may obtain an effective score. Investigating the other factors that contribute to the higher number of multiword keyphrases extracted by KP-Miner, we found that the authors of KP-Miner assume that compound keyphrases do not occur as frequently as single words within a document set. Based on this assumption, the document frequency for multiword key concepts is set to 1, which results in the maximum IDF value and thus gives the maximum score to multiword key concepts. We speculate that KP-Miner is biased towards multiword key concepts here. The performance of KP-Miner on the Quranic and 500N-KPCrowd datasets supports our argument, because in that case the idf values of both methods are close to each other for both single and multiword concepts, which results in patterns that somewhat overlap with TF-IDF (Figure 2b,c).
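A small numeric sketch makes the role of corpus size in this argument concrete. The corpus sizes and document frequencies below are invented for illustration and are not taken from the datasets used in this study; the helper name `idf` is likewise an assumption.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency, log2(N / n)."""
    return math.log2(n_docs / doc_freq)


# Illustrative corpus sizes: a small corpus (4 documents) and a larger one (128).
for n_docs, single_df in [(4, 2), (128, 64)]:
    single = idf(n_docs, single_df)   # a single word seen in half of the corpus
    compound = idf(n_docs, 1)         # KP-Miner sets n = 1 for compound phrases
    print(n_docs, single, compound, compound - single)
# prints:
# 4 1.0 2.0 1.0
# 128 1.0 7.0 6.0
```

With only 4 documents, setting n to 1 raises the idf of a compound phrase by a single unit over a common single word, whereas with 128 documents the gap grows to 6. This is consistent with the observation that KP-Miner separates from TF-IDF on the larger corpus but overlaps with it on the smaller ones.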

3. TopicRank
This method exhibits different patterns. While on the SemEval-2010 dataset the performance of TopicRank in terms of precision-recall is close to TF-IDF and lower than KP-Miner, on the Quranic dataset its results show unstable behavior (Figure 2b,c). First, the precision does not fall as recall rises; then it suddenly falls while recall remains stable at 5.88. After that, a gradual increase in precision can be seen. Investigating why TopicRank performs poorly and behaves unstably on the SemEval-2010 and Quranic datasets, we found that the main cause lies in the way topics are generated and weighted. In the first step, identifying candidate concepts, it relies on noun phrases. However, noun phrases may contain overly common and general terms or noisy ones [1]. Moreover, not all concepts are necessarily noun phrases. Verb phrases may also contain important key concepts. For example, in the keyphrase "extracting concepts", "extracting" is a verb of type VBG (verb, gerund), not NN (noun), but it is potentially similar to "concept extraction". Similarly, when the key concept "distributed computing" is analyzed, the word "distributed" is tagged as a verb of type VBN (verb, past participle). Therefore, relying only on noun phrases is not enough for key concept extraction and may result in missing many valuable key concepts. In the next step, forming clusters from candidate phrases, the similarity between candidates is not computed semantically but checked lexically, with a minimum overlap threshold of 25%. This may generate topics that group candidates which are lexically similar but semantically opposite. For instance, "supervised machine learning" and "unsupervised machine learning" are lexically similar but semantically opposite concepts. The effect becomes apparent in the subsequent steps of building the graph from the topics and ranking them. Semantically similar key concepts may be assigned to the wrong topics, and their co-occurrence weight will be assigned to the wrong edges in the graph; thus the graph may relate the wrong topics, and ultimately the wrong topics may gain higher weights. Therefore, comparing TopicRank with TF-IDF and KP-Miner, we conclude that the co-occurrence based relatedness weighting scheme of TopicRank is less reliable than the frequency-based weighting scheme of TF-IDF and KP-Miner, and the same uncertainty can be seen in the unstable results of TopicRank. However, on the 500N-KPCrowd dataset it outperforms its competitors in terms of the precision-recall curve (Figure 2c). The reason could be that the average number of words per document in the 500N-KPCrowd dataset is very low compared to the other datasets, in which case lexical similarity may be fruitful and result in improved precision. Similarly, a gradual increase in performance can be seen across all three datasets in terms of Average Multiword Phrases (Figure 3). This can be attributed to the fact that TopicRank does not depend on the frequency-based tf-idf model, which is hard to optimize for multiword phrases.
The pattern of the precision-recall curves depends on the distribution of key concepts in the dataset. This distribution may vary across datasets from different domains, so a variation in precision-recall curves across datasets can be seen. However, the common intuition that precision decreases as recall increases can be observed in all the precision-recall curves.
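The lexical clustering issue discussed for TopicRank can be reproduced with a minimal sketch. The overlap measure below (shared words divided by the size of the larger word set) is an assumption made for this example, and no stemming is applied; the exact similarity computation of TopicRank may differ. Only the 25% threshold is taken from the description above.

```python
def lexical_overlap(a, b):
    """Share of words two phrases have in common (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    # Assumption: normalize by the larger word set.
    return len(words_a & words_b) / max(len(words_a), len(words_b))


def same_topic(a, b, threshold=0.25):
    """True if two candidates would fall into one topic under the threshold."""
    return lexical_overlap(a, b) >= threshold


print(same_topic("supervised machine learning", "unsupervised machine learning"))
print(same_topic("supervised machine learning", "topic model"))
```

The first pair shares two of three words and is therefore merged into a single topic even though the two concepts are opposites, while the second pair shares no words and stays separate. A purely lexical criterion cannot distinguish these cases from genuine paraphrases.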

Overall Performance
We now present the overall performance of the above methods in terms of Average Precision (AP), which measures how early in the ranked list an algorithm places the relevant key concepts. Table 5 shows that KP-Miner performs best in terms of AP on the SemEval-2010 and Quranic datasets, whereas TopicRank achieves the highest score on the 500N-KPCrowd dataset. Although we have already discussed the performance of KP-Miner and TopicRank and the possible reasons in detail, the fact remains that "a good method ranks actual relevant key concepts near the top of the ranking list, while a poor method takes a higher score for precision to reach a higher score for recall" [1]. We now evaluate the overall performance of the selected methods in terms of F-measure on the selected datasets. Figure 4 shows the F-measure curves for the methods on each of the three datasets, connecting the F-measure scores at various positions. A common observation on all datasets is that as the number of top N concepts increases, the F-measure score also increases until it reaches a maximum value. On the SemEval-2010 dataset, the KP-Miner score grows with the ranking position more than those of TF-IDF and TopicRank (Figure 4a). However, on the Quranic and 500N-KPCrowd datasets, KP-Miner shows a pattern that overlaps with that of TF-IDF (Figure 4b,c). The reason, as discussed earlier, is that KP-Miner is also based on the tf-idf model. Therefore, when the idf scores of both methods are close to each other, only a slight difference remains between them.
In the precision-recall curves, the precision at the top N ranking positions (P@N) is computed. Technically, the cut-off value for N should be selected at the point after which the F-measure value drops significantly for all the compared algorithms on the dataset. Because the datasets differ from one another, the cut-off value for N may differ from one dataset to another (see Figure 4). We observed this difference during our experiments and set the cut-off values accordingly.
The maximum F-measure score obtained for each method is shown in Table 6. In terms of F-measure, KP-Miner outperforms the other methods on SemEval-2010 when the number of extracted concepts is 12, while on the Quranic and 500N-KPCrowd datasets TopicRank achieves the maximum score when the number of extracted concepts is 18 and 20, respectively. A good method reaches a high F-measure score near the top of the ranking list; conversely, a poor method reaches a high score only near the end of the list.
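The relationship between the number of extracted concepts N and the F-measure can be sketched as follows. This is a minimal, single-document illustration with hypothetical phrase labels; the function names `prf_at_n` and `best_cutoff` are our own, and averaging over the documents of a dataset is omitted.

```python
def prf_at_n(ranked, gold, n):
    """Precision, recall and F-measure of the top-n candidates."""
    gold = set(gold)
    top = ranked[:n]
    tp = sum(1 for phrase in top if phrase in gold)
    precision = tp / len(top) if top else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def best_cutoff(ranked, gold, max_n):
    """Return (n, F) for the smallest n that maximizes the F-measure."""
    scores = [(prf_at_n(ranked, gold, n)[2], n) for n in range(1, max_n + 1)]
    best_f, best_n = max(scores, key=lambda s: (s[0], -s[1]))
    return best_n, best_f


# Hypothetical ranking and gold standard.
ranked = ["a", "x", "b", "y", "z"]
gold = {"a", "b", "c"}
print(best_cutoff(ranked, gold, max_n=5))
```

In this toy ranking the F-measure peaks at N = 3 and declines afterwards, which mirrors the rise-then-drop pattern used above to choose the cut-off.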

Error Source Analysis
Based on the above performance analysis, we now present an error analysis whose objective is to quantitatively describe the major error sources that contribute to the precision or recall errors of the algorithms, and which also suggests directions for future work. For this purpose, we manually analyzed the systems' output on 30 randomly selected documents from the SemEval-2010 dataset, 5 chapters from the Quranic dataset, and 40 documents from the 500N-KPCrowd dataset. The number of selected documents differs for two reasons. First, the number was chosen according to the size of the documents in each dataset, i.e., for datasets with larger documents we selected fewer documents. Second, for SemEval-2010 and 500N-KPCrowd the ground truth (gold standard) is provided for each document, whereas for the Quranic dataset we depended on domain experts for validation against the ground truth and therefore selected 5 chapters, as allowed by the domain experts. For each algorithm, we determined the proportion of errors of a particular type relative to the accumulated number of false positives on the selected data. Hasan and Ng [13] described four kinds of errors commonly made by keyphrase extraction systems, namely overgeneration errors, infrequency errors, redundancy errors, and evaluation errors. Adding to these, we identified three more major categories of errors made by the selected methods: syntactical errors, frequency errors, and semantic errors. Table 7 summarizes the results of the error source analysis. In line with the objective of this preliminary study, the table does not aim to compare the algorithms but rather to report the proportions of the different kinds of errors found among the total false positives. Therefore, the statistical treatment best suited to our case is to attach a confidence interval to each proportion instead of performing a significance test. The results are presented with 95% confidence intervals. In a future study, these results will help us to develop a robust solution for key concept identification that overcomes the different kinds of errors. In the following paragraphs, we describe each of the three categories of errors.
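A 95% confidence interval for an error proportion can be sketched as below. The text does not state which interval was used, so the normal-approximation (Wald) interval is an assumption here, and the counts in the example are hypothetical rather than taken from Table 7.

```python
import math


def proportion_ci(errors_of_type, total_errors, z=1.96):
    """Proportion of one error type and the half-width of its confidence interval.

    Assumption: normal-approximation (Wald) interval; z = 1.96 gives roughly
    95% coverage when the sample is reasonably large.
    """
    p = errors_of_type / total_errors
    half_width = z * math.sqrt(p * (1 - p) / total_errors)
    return p, half_width


# Hypothetical counts: 450 errors of one type among 1000 false positives.
p, hw = proportion_ci(450, 1000)
print(f"{100 * p:.0f} ± {100 * hw:.2f}%")
```

The half-width shrinks as the number of analyzed errors grows, so the reported intervals reflect both the observed proportion and the size of the manually inspected sample.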
Syntactical Errors: These are precision errors that occur when a system extracts keyphrases that are syntactically incorrect. In statistical n-gram-based methods, this kind of error accounts for 18 to 25% of errors, as shown in Table 7, because such methods can select grammatically wrong combinations of words. For example, the keyphrase "querying multiple", extracted from SemEval-2010, is syntactically incorrect; the correct one is "querying multiple registries".
Frequency Errors: These are major precision errors that occur when the extraction system returns more single-word terms than multiword keyphrases, because single-term concepts occur more frequently than multiword concepts. As analyzed in the previous section, in statistical frequency-based ranking algorithms single-word keyphrases achieve higher scores than multiword phrases, even though some algorithms deliberately bias the weights in favor of multiword concepts. Our error analysis supports this argument: by manually analyzing the output files, we found that on average about 85% of the extracted terms are single words, of which about 60% are non-key concepts. This high false positive rate of single words contributes 40 to 45% of overall errors.
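The effect can be illustrated with a toy calculation. The sketch below assumes the plain tf × log(N/df) variant of tf-idf (the methods evaluated here use their own variants, and KP-Miner's boosting factor is not modelled), and all counts are hypothetical.

```python
import math


def tf_idf(term_count, docs_with_term, total_docs):
    """Raw term frequency times log inverse document frequency (assumed variant)."""
    return term_count * math.log(total_docs / docs_with_term)


# Hypothetical counts in a 100-document corpus.
# A frequent single word: 30 occurrences in the document, found in 20 documents.
single = tf_idf(term_count=30, docs_with_term=20, total_docs=100)
# A rarer multiword phrase: 4 occurrences in the document, found in 2 documents.
multi = tf_idf(term_count=4, docs_with_term=2, total_docs=100)

print(f"single word: {single:.2f}, multiword phrase: {multi:.2f}")
```

Although the multiword phrase has a much higher idf, the single word still receives the larger score because its term frequency dominates the product.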
Semantic Errors: This major kind of recall error is found in the results of TopicRank, contributing almost 28% of overall errors. It occurs when the system fails to retrieve concepts that are lexically similar to extracted concepts but semantically different. For example, the key concepts "UDDI registries" and "proxy registries" are lexically similar but semantically different. However, TopicRank clusters them under the same topic based on lexical similarity.
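The following minimal sketch shows how a purely lexical similarity test can group these two concepts together. It is not TopicRank's implementation: the Jaccard word overlap, the absence of stemming, and the 0.25 threshold are assumptions chosen for illustration.

```python
def word_overlap(phrase_a, phrase_b):
    """Jaccard overlap between the word sets of two phrases (no stemming)."""
    a = set(phrase_a.lower().split())
    b = set(phrase_b.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_topic(phrase_a, phrase_b, threshold=0.25):
    """Assumed rule: phrases fall under one topic if their overlap reaches the threshold."""
    return word_overlap(phrase_a, phrase_b) >= threshold


# The two phrases share only the word "registries".
print(word_overlap("UDDI registries", "proxy registries"))
print(same_topic("UDDI registries", "proxy registries"))
```

Because the shared head word is enough to pass the lexical test, both phrases end up in one topic, and only one representative of that topic can be returned even though the two concepts mean different things.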
The errors identified in this study could be addressed at different levels of key concept identification. For example, syntactical errors are best handled in the candidate selection step, using parsing techniques and extracting meaningful structures. Semantic errors can be overcome at the topic identification level, using semantic-based clustering techniques or topic models, e.g., Latent Dirichlet Allocation (LDA) or the topical N-gram model (TNG) [58]. Similarly, frequency errors are related to syntactical errors and can be reduced if the algorithm produces a comprehensive and meaningful list of candidate phrases.
Another important aspect to discuss is how the error sources identified in this study relate to the previously identified error sources [13]. Semantic errors occur when lexically related candidate phrases are clustered under the same topic and both of them are key concepts, but the system retrieves only one. Redundancy errors, in contrast, occur when two semantically related candidate phrases are grouped under the same topic and only one of them is a key concept, but the system retrieves both. Frequency and infrequency errors are closely related, with a slight difference. Infrequency errors are a general category of recall errors that occur when the system fails to retrieve a key concept because it is infrequent; conversely, frequency errors are precision errors that occur when the system retrieves more single words because they are frequent, although they are not key concepts. The third kind of error source identified in this study, syntactical errors, are precision errors that occur when the extracted candidate phrases are syntactically incorrect.

Conclusions
In this study, we first conducted a brief survey of keyphrase extraction algorithms and categorized them, describing the necessary details and limitations of the different approaches. We then conducted an empirical analysis of three state-of-the-art unsupervised, data-driven key concept extraction methods on three datasets from different domains. We draw several conclusions from our analysis. (1) When a statistical frequency-based approach is used for ranking key concepts, single-word concepts achieve higher scores than multiword concepts, which results in the major precision errors called Frequency errors, ranging from 40 ± 2.88 to 45 ± 2.85% of overall errors as shown in Table 7. Three factors could contribute to the higher scores of single terms. First, single-term concepts occur more frequently than multiword concepts. Second, the term frequency factor tf in the frequency-based measure (tf-idf) dominates the idf factor. Third, multiword key concepts are highly dependent on the idf factor, which is sensitive to the total number of documents in the corpus. (2) Statistical n-gram-based approaches for candidate selection may select grammatically wrong combinations of words, which may result in precision errors called Syntactical errors; this kind of error ranges from 18 ± 2.26 to 25 ± 2.48%.
(3) Using lexical similarity to cluster candidates under different topics may result in recall errors called Semantic errors, which contribute 28 ± 2.62% of overall errors. Finally, as discussed earlier, the way key concept candidates are selected and ranked may have a strong impact on the overall key concept extraction process. Therefore, investigating alternative methods that give appropriate weight to multiword key concepts and that consider semantic similarity when grouping words under different topics may be worthwhile in the future.
To overcome the shortcomings of existing systems, an integrated solution is needed in the future. Parsing techniques can be used in the pre-processing step of such a solution to produce a comprehensive list of candidate phrases from the input text documents, which may reduce syntactical and frequency errors. Various topic models or clustering techniques can be used to find topics based on semantic relatedness, which may address semantic errors.

Figure 2. Precision and Recall curves for selected datasets.

Table 1. Statistics of the datasets used.

Table 2. Best Parameter values setting on the selected datasets.

Table 3. Various parameters and their range values.

Table 4. Performance of the selected algorithms at chosen cut-off values.

Table 5. Comparison in terms of Average Precision (AP).

Table 6. Comparison in terms of F-Measure.

Table 7. Summary of the Error Source Analysis.