BibRank: Automatic Keyphrase Extraction Platform Using Metadata

Automatic Keyphrase Extraction involves identifying essential phrases in a document. These keyphrases are crucial in various tasks, such as document classification, clustering, recommendation, indexing, searching, summarization, and text simplification. This paper introduces a platform that integrates keyphrase datasets and facilitates the evaluation of keyphrase extraction algorithms. The platform includes BibRank, an automatic keyphrase extraction algorithm that leverages a rich dataset obtained by parsing bibliographic data in BibTeX format. BibRank combines innovative weighting techniques with positional, statistical, and word co-occurrence information to extract keyphrases from documents. The platform proves valuable for researchers and developers seeking to enhance their keyphrase extraction algorithms and advance the field of natural language processing.


Introduction
The internet hosts an extensive collection of scientific documents, numbering in the tens of millions. Google Scholar, a web-based search engine dedicated to academic research, strives to provide comprehensive access to scholarly literature across various disciplines. Gusenbauer [2019] reported that by the end of 2018, Google Scholar had indexed approximately 400 million articles. Keyphrases, which serve as concise summaries of documents, aid information retrieval, indexing, and collection browsing. Automatic keyphrase extraction is the process of automatically identifying essential phrases within a document. Keyphrases find application in document clustering, classification, summarization, recommendation systems, and question answering. Automatic keyphrase extraction methods have been developed in domains such as social media, medicine, law, and agriculture, where they support specialized systems for organizing and retrieving information Merrouni et al. [2016], Merrouni et al. [2019].
Automatic keyphrase extraction methods can be categorized into unsupervised, supervised, and semi-supervised. Unsupervised techniques, which are domain-independent, do not require labeled training data. Supervised methods, on the other hand, rely on manually annotated data, while semi-supervised ones strike a balance by requiring less annotated data than supervised methods. This paper introduces a downloadable platform that integrates keyphrase datasets in BibTeX format and facilitates the evaluation of keyphrase extraction algorithms. The platform currently encompasses 19 algorithms for automatic keyphrase extraction and methods for evaluating their performance against a diverse gold standard dataset. Among the 19 algorithms is a keyphrase extraction method called BibRank. BibRank exploits an information-rich dataset created by parsing bibliographic data in BibTeX format. It combines a new weighting technique applied to the bibliographic data with positional, statistical, and word co-occurrence information.
The main contributions of this paper are as follows:
1. BibRank dataset: Construction of an information-rich dataset by parsing publicly available bibliographic data, which includes manually assigned keywords.
2. BibRank algorithm: Introduction of the BibRank algorithm, a novel method for keyphrase extraction that utilizes the bibliographic information within the BibRank dataset and statistical information.
3. BibRank platform: Provision of a downloadable platform that integrates the BibRank dataset, BibRank algorithm, and other state-of-the-art keyphrase extraction algorithms. The platform includes evaluation metrics and allows for the integration of keyphrase extraction algorithms and datasets.
4. Manual evaluation of keyphrases: Keyphrase extraction algorithms are evaluated using gold standard datasets as a benchmark. In our evaluation process, we rely on expert human evaluators to assess the quality and effectiveness of the gold standard itself.
arXiv:2310.09151v1 [cs.CL] 13 Oct 2023
The remaining sections of the paper closely align with the contributions presented earlier. The next section briefly overviews notable keyphrase extraction algorithms and datasets. Section 3 introduces the heterogeneous BibRank dataset and presents the BibRank algorithm. Section 4 concentrates on the automatic evaluation of the BibRank algorithm and other state-of-the-art algorithms. This section also includes an assessment of the quality of the gold standard, guided by expert human evaluators. The paper concludes by summarizing our findings.

Related work
This section provides an overview of the essential stages in the pipeline of automatic keyword extraction algorithms, highlighting the algorithms that influenced BibRank. Li et al. [2010] to quantify the association strengths between keyphrases. An example of a supervised keyphrase extraction algorithm that utilizes external features is CeKE Caragea et al. [2014]. CeKE employs citation-based features created from the references used in a publication.
In the keyphrase ranking and selection step, each candidate phrase is assigned a weight based on the calculated features. Subsequently, the candidate phrases are sorted, and the most relevant ones are selected using an experimental threshold.
In the context of unsupervised methods, graph-based ranking algorithms like TextRank Mihalcea and Tarau [2004] deserve to be mentioned. These algorithms draw inspiration from the Google PageRank algorithm Page et al. [1999] and have demonstrated success in text summarization and keyword extraction. The text document is represented as a graph, where candidate phrases are nodes, and their relationships are edges. These relationships can be co-occurrence relations Beliga et al. [2015], syntactic dependencies Mihalcea and Tarau [2004], or semantic relations Li et al. [2010].
We utilize BibTeX entries from the web to construct a new and information-rich keyphrase extraction dataset. Unlike existing datasets that often include only the abstract, full article text, title, and keywords of a document, our dataset incorporates additional metadata such as the publication year, journal title, and author name. An example of a BibTeX record for a publication is illustrated in Figure 1, where the entry type (e.g., "Article") is indicated after the "@," followed by various attributes (e.g., author, title, journal, and paper keywords) and their respective values.
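As a minimal sketch of how such a record can be read programmatically, the Python function below extracts the entry type, the citation key, and the quoted attribute values from a single BibTeX entry using regular expressions only. The sample record, its values, and the semicolon used to separate keywords are hypothetical, and the parser assumes every attribute is written as name = "value"; it is not the platform's own parsing routine.

```python
import re


def parse_bibtex_entry(entry: str) -> dict:
    """Parse one BibTeX entry whose attribute values are double-quoted."""
    header = re.match(r'\s*@(\w+)\s*\{\s*([^,\s]+)\s*,', entry)
    if header is None:
        raise ValueError("not a BibTeX entry")
    record = {"entry_type": header.group(1).lower(), "key": header.group(2)}
    # Each attribute is assumed to look like: name = "value"
    body = entry[header.end():]
    for name, value in re.findall(r'(\w+)\s*=\s*"([^"]*)"', body):
        record[name.lower()] = " ".join(value.split())  # collapse whitespace
    return record


# Hypothetical record, for illustration only.
SAMPLE = '''
@Article{Doe:2012:EX,
  author =   "Jane Doe and John Roe",
  title =    "An Example Title",
  journal =  "Example Journal",
  year =     "2012",
  keywords = "keyphrase extraction; bibliographic data",
}
'''

record = parse_bibtex_entry(SAMPLE)
# Assumption: keywords are separated by semicolons in this sample.
keywords = [k.strip() for k in record["keywords"].split(";")]
print(record["entry_type"], record["year"], keywords)
```

A regex-based reader like this breaks on brace-delimited or nested values; a dedicated BibTeX parser would be the safer choice for real archives.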
Publicly available BibTeX records can be found in online archives like the TUG bibliography archive. TUG's archive contains a vast collection of over 1.6 million categorized BibTeX records from various journals. The archive supports search capabilities using SQL commands Beebe [2009].
To create the BibRank dataset, we processed more than 30,000 BibTeX records extracted from the TUG bibliography archive. Currently, the dataset consists of 18,193 unique records with 22 attributes. These attributes represent the distinct values in all the bib records, including publication year, journal of publication, and bib archive. The dataset includes publications from 1974 to 2019. Table 1 provides statistics on authors, journals, topics, and bib files covered by the dataset.
The bib files, referring to the archives or databases from which the papers were imported, were categorized into one of the following 12 topics: science history journals, computer science journals and topics, ACM Transactions, cryptography, fonts and typography, IEEE journals, computational/quantum chemistry/physics, numerical analysis, probability and statistics, SIAM journals, mathematics, and mathematical and computational biology. The dataset can be expanded by processing additional bibliography files in BibTeX format.
The dataset file and the essential tools for altering and producing new datasets are available in the BibRank project's GitHub repository. This repository grants users access to the original data and equips them with the resources required to customize the data to their particular requirements or to generate entirely new datasets.

BibRank algorithm
The BibRank algorithm, comprising five steps, presents an innovative method for weighting candidate phrases, emphasizing the abstracts of scientific publications and based on the concept of a context for a group of BibTeX records.
1. Candidate Selection. The candidate phrases in the document are noun chunks. To identify the noun chunks, we apply rules based on sequences of POS tags. In our workflow, we use the Stanford CoreNLP Natural Language Processing Toolkit Manning et al. [2014], but other noun chunkers can be easily integrated into the platform.
2. PositionRank Weight Calculation. The PositionRank algorithm Florescu and Caragea [2017] assigns position weights to candidate phrases. Higher weights are given to the words appearing earlier in the document. For example, if a word occurs at positions 3, 6, and 8, its weight is calculated as follows: $\frac{1}{3} + \frac{1}{6} + \frac{1}{8} = \frac{5}{8} = 0.625$.
Figure 1 (excerpt): @Article{Wang:2009:EKF, author = "Zidong Wang and Xiaohui Liu and Yurong Liu and Jinling Liang and Veronica Vinciotti", title = "...
The final weight of each candidate phrase is determined by summing and normalizing the position weights of each word in the phrase. Additionally, the scores of each word are recursively computed using the PageRank algorithm, as described by Equation 1 Florescu and Caragea [2017], Mihalcea and Tarau [2004].
In Equation 1, $S(v_i)$ represents the weight of each word $i$ in a candidate phrase $p$, represented by the vertex $v_i$. The damping factor $d$ reflects the probability of jumping to a random vertex in the graph, and $p_i$ is the position weight of the word $i$. The set $In(v_i)$ contains the adjacent vertices pointing to vertex $v_i$, and $w_{ji}$ is the edge weight between $v_i$ and $v_j$. Finally, $Out(v_j)$ is the set of adjacent vertices pointed to by vertex $v_j$, and the corresponding normalization term is computed as $\sum_{v_k \in Out(v_j)} w_{jk}$.
3. Context Formulation. The computation of the context for a publication involves selecting a set of BibTeX records according to specific criteria. For instance, if we consider a computer science article published in 2012, the context could be formed by including all computer science papers published within the same year. With the original BibRank dataset containing 22 attributes, each attribute can potentially define a distinct context.
4. Bib Weight Calculation. The bib weights aim to capture the occurrence frequency of candidate phrases within the context. Each record includes a list of keyphrases, allowing for the calculation of weights for candidate phrases based on Equation 2.
In Equation 2, $\lambda_p$ is the bib weight, $\alpha$ is a factor used for normalization, $D$ is the set of all records that belong to the chosen context, $d$ is a record, and $c$ is the occurrence of a candidate phrase in the record's keyphrases list. $\alpha$ was calculated as the maximum bib weight across all keyphrases in the context documents.
5. Candidate Phrase Ranking and Selection. The ranking of candidate phrases is determined by combining their bib weights and position scores. The scores of individual words within each candidate phrase are added to the phrase's bib weight, resulting in a sum that determines the final ranking of the candidate phrases, as illustrated in Equation 3. The document's keyphrases are then determined by selecting the top N candidate phrases.
In Equation 3, $V_p$ is the set of words that belong to candidate phrase $p$, and $\lambda_p$ is the calculated bib weight for the candidate phrase $p$.
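Under the symbol definitions given in the steps above, Equations 1 to 3 can plausibly be written as follows. This is a sketch rather than a verbatim restatement: Equation 1 follows the position-biased PageRank form of Florescu and Caragea [2017], with $p_i$ denoting the position weight of word $i$; in Equation 2 the division by $\alpha$ and the notation $c(p, d)$ are assumptions derived from the description of $\alpha$ as a normalizing maximum; Equation 3 restates the sum described in step 5, and the name $Score(p)$ is introduced here for illustration. As in the definitions above, $d$ denotes the damping factor in Equation 1 and a record in Equation 2.

```latex
% Equation 1: position-biased PageRank score of word v_i
S(v_i) = (1 - d)\, p_i + d \sum_{v_j \in In(v_i)} \frac{w_{ji}}{O(v_j)}\, S(v_j),
\qquad O(v_j) = \sum_{v_k \in Out(v_j)} w_{jk}

% Equation 2 (assumed form): normalized bib weight of candidate phrase p
\lambda_p = \frac{1}{\alpha} \sum_{d \in D} c(p, d)

% Equation 3: final score of candidate phrase p
Score(p) = \lambda_p + \sum_{v_i \in V_p} S(v_i)
```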
As illustrated in Figure 2, the BibRank algorithm begins by processing the input text, extracting nouns and noun phrases like 'Keyword' and 'automatic identification,' which are considered as selected candidates. It then infers keyphrases, including 'Keyword extraction' and 'automatic identification,' assigning them scores of 0.38 and 0.30, respectively. These scores denote their relevance and significance to the document's main topic, calculated based on position weights and bib weights.
• Input Text: Keyword extraction is tasked with the automatic identification of terms that best describe the subject of a document.
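The five steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the platform's implementation: candidate phrases are supplied by hand instead of coming from a noun chunker, no stopword or POS filtering is applied, the bib weight is taken to be the keyphrase count divided by the maximum count in the context, and the record fields (`topic`, `year`, `keywords`) and all record values are hypothetical. The scores it produces are therefore not expected to match those reported for Figure 2.

```python
from collections import Counter, defaultdict


def position_weights(tokens):
    """Sum of inverse 1-based positions for every word (step 2)."""
    weights = defaultdict(float)
    for pos, tok in enumerate(tokens, start=1):
        weights[tok] += 1.0 / pos
    return dict(weights)


def word_scores(tokens, window=2, d=0.85, iterations=50):
    """Position-biased PageRank on an undirected co-occurrence graph."""
    edges = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != tok:
                edges[tok][other] += 1
                edges[other][tok] += 1
    pw = position_weights(tokens)
    total = sum(pw.values())
    bias = {w: v / total for w, v in pw.items()}  # normalized position weights
    scores = {w: 1.0 / len(bias) for w in bias}
    for _ in range(iterations):
        new_scores = {}
        for w in bias:
            incoming = sum(
                edges[u][w] / sum(edges[u].values()) * scores[u] for u in edges[w]
            )
            new_scores[w] = (1 - d) * bias[w] + d * incoming
        scores = new_scores
    return scores


def bib_weights(context_records):
    """Keyphrase counts in the context, divided by the maximum (step 4)."""
    counts = Counter(kp.lower() for rec in context_records for kp in rec["keywords"])
    alpha = max(counts.values(), default=1)
    return {kp: c / alpha for kp, c in counts.items()}


def rank(candidates, scores, bib, top_n=10):
    """Final score = bib weight + sum of the phrase's word scores (step 5)."""
    ranked = {
        p: bib.get(p.lower(), 0.0)
        + sum(scores.get(w, 0.0) for w in p.lower().split())
        for p in candidates
    }
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical records; the context is "computer science papers from 2012" (step 3).
records = [
    {"topic": "compsci", "year": 2012, "keywords": ["keyword extraction", "text mining"]},
    {"topic": "compsci", "year": 2012, "keywords": ["Keyword extraction"]},
    {"topic": "mathematics", "year": 2012, "keywords": ["numerical analysis"]},
]
context = [r for r in records if r["topic"] == "compsci" and r["year"] == 2012]
bib = bib_weights(context)

text = ("Keyword extraction is tasked with the automatic identification "
        "of terms that best describe the subject of a document.")
tokens = text.lower().replace(".", "").split()
scores = word_scores(tokens)
top = rank(["Keyword extraction", "automatic identification"], scores, bib, top_n=2)
print(top)
```

Keeping the context selection as a plain filter over records makes it easy to swap in any of the dataset's attributes (year, journal, topic) as the context criterion.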

BibRank platform
BibRank is a versatile online platform developed in Python that simplifies the integration of keyphrase extraction algorithms, encompassing three modules: Datasets, Algorithms, and Evaluation.
One of the standout attributes of the platform is its comprehensive support for keyphrase extraction datasets. It seamlessly incorporates user datasets and features multiple pre-integrated datasets, such as the BibRank dataset (see Section 3.1) and five others detailed in Table 2. This table provides crucial information about the papers linked to each dataset, the number of documents contained, and the document types, distinguishing between abstracts and full papers.
Moreover, BibRank facilitates users in crafting personalized datasets with ease. The platform offers user-friendly routines tailored to process BibTeX files, simplifying the generation of new datasets that align with the user's specific needs and requirements.

Table 2
Dataset          Paper                  Documents  Type
                 Schutz [2008]          2,304      Full papers
NUS              Nguyen and Kan [2007]  211        Full papers
Inspec           Hulth [2003]           2,000      Abstracts
WWW              Caragea et al. [2014]  1,330      Abstracts
KDD              Caragea et al. [2014]  755        Abstracts
BibRank Dataset                         18,193     Abstracts and Metadata

The platform offers a comprehensive range of keyphrase extraction algorithms, including the BibRank algorithm (refer to Section 3.2) and ten additional ones, all specified in Table 3. It provides a user-friendly interface for integrating the user's own keyphrase extraction algorithms. For smooth integration, the user's algorithm must extend a superclass that encompasses the blueprint for the crucial extraction operations, where the algorithm's name is designated as a class attribute. Additionally, the algorithm must incorporate a function that returns the extracted keyphrases and their corresponding weights. The platform incorporates PKE, an open-source toolkit for keyphrase extraction Boudin [2016].

Table 3
Algorithm     Paper                         Year  Type
YAKE          Campos et al. [2020]          2020  Statistical
TextRank      Mihalcea and Tarau [2004]     2004  Graph based
CollabRank    Wan and Xiao [2008]           2008  Graph based
TopicRank     Bougouin et al. [2013]        2013  Graph based
PositionRank  Florescu and Caragea [2017]   2017  Graph based
SGRank        Danesh et al. [2015]          2015  Hybrid Statistical-graphical
sCAKE         Duari and Bhatnagar [2019]    2018  Hybrid Statistical-graphical
KeyBERT       Grootendorst [2021]           2021  Sentence Embeddings

To assess the accuracy of a keyphrase extraction algorithm on a given dataset, the platform provides an evaluation module in the form of a Python script. Users can select the algorithm to be evaluated and specify the metadata for the dataset, such as the year of publication or journal. The evaluation script computes the recall (R), precision (P), and F1 scores, widely recognized as standard measures of algorithm performance.
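The integration contract described in this section (a superclass, the algorithm's name as a class attribute, and a function returning keyphrases with weights) could look roughly like the sketch below. The class name `KeyphraseExtractor`, the method name `extract`, and the toy `WordFrequencyExtractor` are all hypothetical; the platform's actual superclass and method signatures may differ.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import List, Tuple


class KeyphraseExtractor(ABC):
    """Hypothetical base class; the platform's real superclass may differ."""

    name = "base"  # the algorithm's name, declared as a class attribute

    @abstractmethod
    def extract(self, text: str, top_n: int = 10) -> List[Tuple[str, float]]:
        """Return (keyphrase, weight) pairs, best first."""


class WordFrequencyExtractor(KeyphraseExtractor):
    """Toy algorithm: rank words longer than three letters by frequency."""

    name = "WordFrequency"

    def extract(self, text: str, top_n: int = 10) -> List[Tuple[str, float]]:
        words = [w.strip(".,;:!?").lower() for w in text.split()]
        counts = Counter(w for w in words if len(w) > 3)
        total = sum(counts.values()) or 1
        return [(w, c / total) for w, c in counts.most_common(top_n)]


extractor = WordFrequencyExtractor()
result = extractor.extract(
    "Keyphrase extraction selects phrases. Keyphrase quality matters.", top_n=2
)
print(extractor.name, result)
```

An abstract base class keeps the evaluation code independent of any particular algorithm: anything that exposes `name` and `extract` can be evaluated in the same loop.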

Evaluation methodology
The widely accepted assumption that the gold standard serves as the reference truth for evaluating algorithms is acknowledged. However, a comprehensive twofold evaluation process was conducted to examine this assumption critically. The first evaluation aimed to assess the algorithms against the gold standard, while the second evaluation focused on evaluating the gold standard itself.
Datasets with manually assigned keywords were used as benchmarks to assess the algorithms' performance. The evaluations were carried out using the BibRank platform, where the algorithms were tested on the BibRank dataset with parameter adjustments. The default setting for the first parameter, determining the number of keywords to extract, was 10 for all algorithms. The second parameter, the tokenizer, utilized the Stanford CoreNLP toolkit, as explained in the BibRank algorithm section. The damping factor d was set to 0.85, and the window size was set to 2 based on experiments by Florescu and Caragea [2017]. Extracted keyphrases were compared to the manually assigned keywords in the gold standard dataset to measure the algorithms' performance, considering exact matches as successful hits.
Standard evaluation metrics such as recall, precision, and F1 score were computed.
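A minimal sketch of these metrics under exact matching is shown below. The function name `evaluate` and the example phrases are hypothetical, and the lowercasing and whitespace stripping applied before comparison are assumptions; the platform's evaluation script may normalize phrases differently.

```python
def evaluate(extracted, gold):
    """Precision, recall and F1 where only exact matches count as hits."""
    # Assumption: phrases are compared after lowercasing and stripping spaces.
    extracted_set = {p.strip().lower() for p in extracted}
    gold_set = {g.strip().lower() for g in gold}
    hits = len(extracted_set & gold_set)
    precision = hits / len(extracted_set) if extracted_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: 2 of 4 extracted phrases appear among 3 gold keywords.
extracted = ["keyword extraction", "automatic identification", "document", "terms"]
gold = ["Keyword extraction", "terms", "text mining"]
precision, recall, f1 = evaluate(extracted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Exact matching is strict: a near miss such as "keyword extractions" counts as an error, which is one reason scores on this task tend to be low in absolute terms.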
Evaluators with expertise were sought through a reputable freelancing platform to evaluate the gold standard. These evaluators were carefully selected based on specific criteria, including fluency in English and a proven track record in similar tasks. Two experts were assigned to evaluate the keywords produced for 100 documents by seven algorithms and by the gold standard. The evaluators were kept unaware of the algorithm names or the gold standard during the evaluation process to prevent potential bias. The evaluators rated each set of keywords using a five-point scale:
1. Very bad: The keywords are considered inadequate and do not meaningfully represent the text.
2. Bad: The keywords are a mix of poor and good choices, lacking consistency and not fully capturing the essence of the text.
3. Acceptable: The keywords are generally satisfactory and represent the text to a reasonable extent.
4. Good: The keywords are of good quality, although they may not fully encompass all the text's main ideas.
5. Very good: The provided keywords accurately summarize the text and effectively capture the main ideas.
Overall, our twofold evaluation approach provides a comprehensive analysis of both the algorithm and the gold standard, allowing us to understand the strengths and weaknesses of each.

Results
The evaluation of the algorithms involved three experiments, each utilizing a different section of the BibRank dataset.
The experiments focused on specific domains, namely "Computer science (compsci)," "ACM," and "history, philosophy, and science," consisting of 335, 127, and 410 papers, respectively. In choosing the dataset years, we aimed for diverse temporal coverage and ran tests on various combinations to ensure validity. For Computer science (compsci), bib scores were generated using publications from the years 1980 to 1987, and the test data was sourced from publications in 1988; ACM bib scores were derived from 1990 to 1996 and tested against 1997 to 2020 publications; for "history, philosophy, and science," scores were based on 2009 to 2011, testing with 2012 to 2014 publications. For a comprehensive overview of these experiments, including the categories used, please refer to Table 4. The table displays the categories the articles belong to and seven selected algorithms for evaluation. We selected these algorithms to exemplify various keyphrase extraction approaches discussed in the Related work section, showcasing the implementation of distinct methodologies for keyword extraction.
Upon closer inspection, the BibRank algorithm demonstrates consistent improvements across different datasets, as can be seen in Tables 5, 6, and 7. When compared to TextRank and PositionRank, which use comparable techniques, the integration of bib weights in the BibRank algorithm leads to a noticeable gain in performance.
1. YAKE (Yet Another Keyword Extractor) is a statistical keyphrase extraction algorithm that utilizes a "maximal marginal relevance" approach to promote diversity in the selected keywords. This ensures that the extracted keyphrases cover a wide range of topics and concepts.
2. The SGRank and sCAKE methods are algorithms used to extract keyphrases from a document. They employ statistical analysis and graph-based techniques, blending both advantages to identify important keywords. Notably, sCAKE stands out for integrating domain-specific knowledge into its process when analyzing documents.
3. KeyBERT represents a user-friendly and lightweight algorithm for keyword extraction. It harnesses the power of BERT transformers' embeddings to identify important keywords in a given text. Using an unsupervised technique, KeyBERT calculates the cosine similarity between each phrase and document to determine the most relevant keyphrases.
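The similarity-based ranking described in item 3 can be illustrated with a small sketch. The three-dimensional vectors below are toy values standing in for real sentence embeddings, and the function names are hypothetical; this is not the KeyBERT API, only the cosine-similarity idea behind it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(doc_vec, phrase_vecs, top_n=5):
    """Sort candidate phrases by similarity to the document vector."""
    scored = [(phrase, cosine(vec, doc_vec)) for phrase, vec in phrase_vecs.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:top_n]


# Toy vectors standing in for embeddings produced by a transformer model.
doc_vec = [1.0, 1.0, 0.0]
phrase_vecs = {
    "keyword extraction": [1.0, 1.0, 0.0],
    "automatic identification": [1.0, 0.0, 0.0],
    "weather": [0.0, 0.0, 1.0],
}
ranking = rank_by_similarity(doc_vec, phrase_vecs, top_n=2)
print(ranking)
```

Because the ranking depends only on the angle between vectors, long and short phrases are compared on equal footing regardless of embedding magnitude.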
The preceding sections contain in-depth discussions about graph-based techniques, including TextRank, PositionRank, and BibRank. These algorithms use graph-based approaches to analyze word relationships and extract essential keywords from a text. Our objective in incorporating these algorithms is to comprehensively evaluate various keyphrase extraction techniques.
In addition to using standard gold keyphrases, the chosen experts manually evaluated seven keyphrase extraction approaches. To gauge the performance of each method, the experts assigned scores from 1 to 5 to the generated keywords for 100 randomly selected documents. Table 8 summarizes the average performance of each evaluated approach. These evaluations offer valuable insights into the effectiveness of the diverse keyphrase extraction methods.
Figure 3 provides a visual display of the results for the keyphrase extraction algorithms. The domains on which the algorithms were evaluated are shown on the x-axis, while the F1 score is plotted on the y-axis.

Discussion
The YAKE algorithm and the gold standard sets of keyphrases received the lowest scores from the experts in our evaluation. This result was expected for YAKE, as it is the only statistical approach among the evaluated techniques.
Prior research Hasan and Ng [2014] has also indicated that models relying on statistical features exhibit lower average performance in keyphrase extraction tasks. However, the surprising finding was the performance of the gold standard keyphrases.
We conducted interviews with the experts who participated in the evaluation to gain deeper insights. One expert mentioned that the gold standard keyphrases are overly general and limited in scope. They are designed to capture the central ideas or keyphrases of the document, which may result in the omission of some important keywords. In contrast, algorithms such as BibRank, PositionRank, TextRank, and KeyBERT better understood the document's meaning, enabling them to extract more relevant and specific keyphrases.
Figure 4 presents an abstract that the experts evaluated, and the corresponding scores provided by the experts are listed in Table 9. The gold standard keywords received low scores despite including important keyphrases like "Chinese dependency parsing" and "unlabeled data." However, there were cases where essential keyphrases were missing, while some keywords not explicitly mentioned in the abstract were included in the gold standard set. For instance, the term "semi-supervised learning" was incorporated in the gold standard keyword list but did not appear in the original abstract.
YAKE achieved a low score, indicating that the algorithm lacks the contextual understanding exhibited by the other keyword extraction methods.
SGRank outperformed the gold standard, effectively highlighting essential keywords such as "long-distance word," "unlabeled attachment score," and "supervised learning method." sCAKE also demonstrated strong performance, successfully extracting detailed keywords related to different types of dependency parsers and incorporating "short dependency information." KeyBERT showcased robust performance, extracting comprehensive keywords such as "improves parsing performance" and "parsing approach incorporating," which enhanced the understanding of the paper's content.
TextRank consistently performed well, generating similar keywords to sCAKE and SGRank, indicating its consistency in identifying key concepts.
PositionRank, with a score of 5, provided additional context by introducing terms such as "short dependencies." BibRank consistently scored 5 in both evaluations, effectively extracting keywords related to various parser types, "short dependency information," and specific performance metrics like "high performance." It also included additional contextual keywords, such as "machine translation," providing a comprehensive overview of the abstract's content.
Overall, these evaluations shed light on the strengths and weaknesses of different keyphrase extraction methods and help us understand their performance characteristics in the context of academic literature.
The detailed results of our evaluations, substantiating the findings discussed in this paper, are recorded and made available for public scrutiny and exploration. These results can be found in our GitHub repository's "evaluation_results" folder.

Conclusions
This paper introduces the BibRank platform, a versatile online platform developed in Python, which simplifies the integration of keyphrase extraction algorithms. A new keyphrase extraction dataset, the BibRank dataset, is presented to benchmark keyphrase extraction algorithms. The paper also introduces a state-of-the-art keyphrase extraction algorithm, BibRank, which utilizes the notion of context to compute keyphrases.
The main keyphrase extraction algorithms are comprehensively evaluated in the study using a two-fold approach: evaluating the algorithms against the gold standard and evaluating the gold standard itself. The evaluations are conducted on the BibRank dataset using standard evaluation metrics. Expert evaluators assess the gold standard using a five-point scale.

Data Availability
The BibRank keyphrase extraction framework is readily available on GitHub to facilitate reproducibility. The repository includes:
• The implementation of BibRank and 18 other keyphrase extraction methods.
• A detailed installation guide.
• Examples of evaluations.
• The Bib dataset used for evaluation.
• Comprehensive instructions for running experiments with the BibRank model.

Funding
Eduard Barbu has been supported by the EKTB55 project "Teksti lihtsustamine eesti keeles".

Figure 1: BibTeX record example

Table 4: Evaluation results of selected keyphrase extraction algorithms, including BibRank

Table 9: The expert evaluation for the abstract presented in Figure 4

The results demonstrate that some algorithms, such as BibRank and PositionRank, outperform the gold standard in extracting relevant and specific keyphrases, while others, like YAKE, achieve lower scores due to their statistical nature. This evaluation provides valuable insights into the strengths and weaknesses of different keyphrase extraction methods in the context of academic literature. The BibRank algorithm demonstrates state-of-the-art performance when evaluated against the gold standard. The authors encourage researchers to use the BibRank platform for evaluating their own keyphrase extraction algorithms. To ensure reproducibility, the BibRank platform, BibRank algorithm, and the BibRank dataset are publicly available (see the Data Availability Statement) for use by the research community. Platforms such as BibRank and other keyphrase extraction tools have the potential to operate alongside VosViewer. If the research community adopts BibRank, we will consider adding a plugin for integration with VosViewer.