2. Related Work
This section reviews the main stages of the automatic keyword extraction pipeline and highlights the algorithms that influenced BibRank.
The keyword extraction pipeline comprises linguistic preprocessing, candidate phrase selection, keyphrase feature selection, and keyphrase ranking and selection. During linguistic preprocessing, the text is segmented into sentences and tokenized into words. Several language processing techniques are then applied, including lemmatization, stemming, POS tagging, stop word removal, and Named Entity Recognition (NER) [2]. Sometimes, POS tagging is followed by syntactic parsing, and NER is particularly valuable in languages with reliable NER systems. Candidate phrases are selected from the processed text using n-gram sequencing and noun-phrase chunking (NP chunking) [4]. To reduce the number of candidate phrases, rules based on acceptable sequences of POS tags are employed [5], such as selecting sequences that start with adjectives and end with a noun in English.
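The POS-based filtering rule described above can be illustrated with a minimal sketch. It assumes the tokens have already been tagged with Penn Treebank tags (`JJ*` for adjectives, `NN*` for nouns) by some external tagger; the function name `select_candidates`, the `max_len` cutoff, and the hand-tagged example sentence are hypothetical and are not taken from BibRank.

```python
from typing import List, Tuple

def select_candidates(tagged: List[Tuple[str, str]], max_len: int = 4) -> List[str]:
    """Collect maximal runs of adjectives and nouns that end with a noun."""
    candidates: List[str] = []
    run: List[Tuple[str, str]] = []

    def flush() -> None:
        # Drop trailing adjectives so that every phrase ends with a noun.
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        # Keep the phrase only if it is non-empty and not too long.
        if run and len(run) <= max_len:
            candidates.append(" ".join(word for word, _ in run))
        run.clear()

    for word, tag in tagged:
        if tag.startswith("JJ") or tag.startswith("NN"):
            run.append((word.lower(), tag))
        else:
            flush()  # any other tag ends the current run
    flush()
    return candidates

# Hand-tagged example sentence (the tags are an assumption, not tagger output).
tagged_sentence = [
    ("Automatic", "JJ"), ("keyphrase", "NN"), ("extraction", "NN"),
    ("uses", "VBZ"),
    ("graph-based", "JJ"), ("ranking", "NN"), ("algorithms", "NNS"),
]
phrases = select_candidates(tagged_sentence)
```

Here the verb "uses" splits the sentence into two runs, each of which starts with an adjective and ends with a noun, so both are kept as candidate phrases.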
The next step in the pipeline is feature selection for candidate phrases. Two types of features are calculated: in-document features and external features [2]. In-document features can be statistical [6], positional [3], linguistic [7], or context-based [8]. Statistical features such as the TF-IDF score are commonly used, while positional features indicate whether the candidate phrase occurs in the title, abstract, or main text. Context features, such as sentence embeddings computed by deep neural networks, are also used. External features require resources such as Wikipedia [9] to quantify the association strengths between keyphrases. An example of a supervised keyphrase extraction algorithm that uses external features is CeKE [8], which employs citation-based features created from the references used in a publication.
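As an illustration of a statistical in-document feature, the following sketch computes one common TF-IDF variant: term frequency normalized by document length, multiplied by the logarithm of the inverse document frequency. Other variants exist, and the function name `tf_idf` and the toy corpus are assumptions made for this example only.

```python
import math
from collections import Counter
from typing import Dict, List

def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: score} mapping per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

# Toy corpus of two tokenized documents (hypothetical data).
corpus = [["graph", "ranking", "graph"], ["graph", "keyphrase"]]
scores = tf_idf(corpus)
```

In this variant, a term that occurs in every document ("graph") receives a score of zero, while a term confined to one document ("ranking") receives a positive score.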
In the keyphrase ranking and selection step, each candidate phrase is assigned a weight based on the calculated features. The candidate phrases are then sorted by weight, and the most relevant ones are selected using an experimentally determined threshold.
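A minimal sketch of this step is given below. It assumes, purely for illustration, that the weight of a candidate is a linear combination of its feature values; the feature names, feature weights, and threshold are hypothetical and would have to be tuned experimentally.

```python
from typing import Dict, List, Tuple

def rank_candidates(
    features: Dict[str, Dict[str, float]],
    feature_weights: Dict[str, float],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Score each candidate, sort in descending order, keep scores >= threshold."""
    scored = {
        phrase: sum(feature_weights[name] * value for name, value in feats.items())
        for phrase, feats in features.items()
    }
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [(phrase, score) for phrase, score in ranked if score >= threshold]

# Hypothetical feature values for two candidate phrases.
candidate_features = {
    "keyphrase extraction": {"tfidf": 0.5, "in_title": 1.0},
    "paper": {"tfidf": 0.1, "in_title": 0.0},
}
selected = rank_candidates(
    candidate_features,
    feature_weights={"tfidf": 1.0, "in_title": 0.5},
    threshold=0.5,
)
```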
Among unsupervised methods, graph-based ranking algorithms such as TextRank [4] deserve mention. These algorithms draw inspiration from Google's PageRank algorithm [10] and have been applied successfully to text summarization and keyword extraction. The text document is represented as a graph in which candidate phrases are nodes and the relationships between them are edges. These relationships can be co-occurrence relations [11], syntactic dependencies [4], or semantic relations [9].
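The graph representation can be sketched for the co-occurrence case as follows. This is an illustrative construction, not the exact procedure of any cited method: two tokens are linked if they occur within `window` positions of each other, and the edge weight counts how often that happens. The function name and the default window size are assumptions.

```python
from collections import defaultdict
from typing import Dict, List

def cooccurrence_graph(tokens: List[str], window: int = 2) -> Dict[str, Dict[str, int]]:
    """Build an undirected weighted graph as a nested adjacency dictionary."""
    graph = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        # Look ahead at most `window` positions from the current token.
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:  # skip self-loops
                graph[word][other] += 1
                graph[other][word] += 1
    return {node: dict(neighbours) for node, neighbours in graph.items()}

tokens = ["keyphrase", "extraction", "graph", "ranking"]
graph = cooccurrence_graph(tokens, window=1)
```

With `window=1` only adjacent tokens are linked, so the example yields a simple chain of four nodes.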
In the keyphrase ranking step, an adapted PageRank algorithm iterates over the graph representation of the text until convergence, and the top-ranked candidate phrases are then selected. Another algorithm in this family is PositionRank [12]. Building upon the principles of TextRank, PositionRank introduces a bias towards frequently occurring candidate phrases that appear early in the document. It operates at the word level: it transforms the text into a graph, applies a position-biased PageRank algorithm, and then extracts candidate phrases.
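The idea of a position-biased PageRank can be sketched as below. This is a simplified illustration rather than the reference implementation of PositionRank or TextRank: the bias of a word is taken to be the normalized sum of the inverse positions of its occurrences, and the damping factor, tolerance, and example graph are assumptions.

```python
from typing import Dict, List

def position_bias(tokens: List[str]) -> Dict[str, float]:
    """Favour words that occur early and often: sum of 1/position, normalized."""
    raw: Dict[str, float] = {}
    for position, word in enumerate(tokens, start=1):
        raw[word] = raw.get(word, 0.0) + 1.0 / position
    total = sum(raw.values())
    return {word: value / total for word, value in raw.items()}

def biased_pagerank(
    graph: Dict[str, Dict[str, float]],
    bias: Dict[str, float],
    damping: float = 0.85,
    tol: float = 1e-8,
    max_iter: int = 200,
) -> Dict[str, float]:
    """Power iteration on an undirected weighted graph with a biased restart."""
    nodes = list(graph)
    score = {node: 1.0 / len(nodes) for node in nodes}
    out_weight = {node: sum(graph[node].values()) for node in nodes}
    for _ in range(max_iter):
        new_score = {}
        for node in nodes:
            # Each neighbour passes on its score in proportion to the edge weight.
            incoming = sum(
                score[nb] * weight / out_weight[nb]
                for nb, weight in graph[node].items()
            )
            new_score[node] = (1 - damping) * bias[node] + damping * incoming
        delta = sum(abs(new_score[n] - score[n]) for n in nodes)
        score = new_score
        if delta < tol:  # stop once the scores have stabilized
            break
    return score

tokens = ["keyphrase", "extraction", "ranking"]
# Symmetric chain graph: keyphrase - extraction - ranking (hypothetical example).
graph = {
    "keyphrase": {"extraction": 1.0},
    "extraction": {"keyphrase": 1.0, "ranking": 1.0},
    "ranking": {"extraction": 1.0},
}
scores = biased_pagerank(graph, position_bias(tokens))
```

In this example "keyphrase" and "ranking" occupy structurally identical positions in the graph, so any difference between their scores comes from the position bias alone, which favours the earlier word.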
Other work related to ours concerns the creation and visualization of bibliometric networks, where VOSviewer is a notable tool [13]. Although VOSviewer is not a keyphrase extraction tool, it is relevant software for building, visualizing, and exploring bibliometric networks. These networks can include journals, researchers, or individual publications, and they help users analyze trends and patterns in the scientific literature.
Author Contributions
Conceptualization, E.B. and A.E.; methodology, A.E. and E.B.; software, A.E. All authors have read and agreed to the published version of the manuscript.
Funding
Eduard Barbu has been supported by the EKTB55 project “Teksti lihtsustamine eesti keeles”.
Data Availability Statement
The BibRank keyphrase extraction framework is available on GitHub to facilitate reproducibility. The repository includes the implementation of BibRank and 18 other keyphrase extraction methods, a detailed installation guide, examples of evaluations, the Bib dataset used for evaluation, comprehensive instructions for running experiments with the BibRank model, and the reviewers’ full evaluation results. GitHub repository: https://github.com/dallal9/Bibrank (accessed on 3 October 2023).
Conflicts of Interest
The authors declare no conflict of interest. The funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.
References
1. Gusenbauer, M. Google Scholar to Overshadow Them All? Comparing the Sizes of 12 Academic Search Engines and Bibliographic Databases. Scientometrics 2019, 118, 177–214.
2. Merrouni, Z.A.; Frikh, B.; Ouhbi, B. Automatic keyphrase extraction: An overview of the state of the art. In Proceedings of the 2016 4th IEEE International Colloquium on Information Science and Technology (CiSt), Tangier, Morocco, 24–26 October 2016; pp. 306–313.
3. Merrouni, Z.A.; Frikh, B.; Ouhbi, B. Automatic keyphrase extraction: A survey and trends. J. Intell. Inf. Syst. 2019, 54, 391–424.
4. Mihalcea, R.; Tarau, P. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 25–26 July 2004; pp. 404–411.
5. Hasan, K.S.; Ng, V. Automatic keyphrase extraction: A survey of the state of the art. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, USA, 23–25 June 2014; pp. 1262–1273.
6. Danesh, S.; Sumner, T.; Martin, J.H. SGRank: Combining statistical and graphical methods to improve the state of the art in unsupervised keyphrase extraction. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, Denver, CO, USA, 4–5 June 2015; pp. 117–126.
7. Papagiannopoulou, E.; Tsoumakas, G. A review of keyphrase extraction. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2020, 10, e1339.
8. Caragea, C.; Bulgarov, F.; Godea, A.; Gollapalli, S.D. Citation-enhanced keyphrase extraction from research papers: A supervised approach. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1435–1446.
9. Li, D.; Li, S.; Li, W.; Wang, W.; Qu, W. A semi-supervised key phrase extraction approach: Learning from title phrases through a document semantic network. In Proceedings of the ACL 2010 Conference Short Papers, Uppsala, Sweden, 11–16 July 2010; pp. 296–300.
10. Page, L.; Brin, S.; Motwani, R.; Winograd, T. The PageRank Citation Ranking: Bringing Order to the Web. Proc. ASIS 1998, 98, 161–172.
11. Beliga, S.; Meštrović, A.; Martinčić-Ipšić, S. An overview of graph-based keyword extraction methods and approaches. J. Inf. Organ. Sci. 2015, 39, 1–20.
12. Florescu, C.; Caragea, C. PositionRank: An unsupervised approach to keyphrase extraction from scholarly documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1105–1115.
13. van Eck, N.J.; Waltman, L.; Dekker, R.; van den Berg, J. A comparison of two techniques for bibliometric mapping: Multidimensional scaling and VOS. J. Am. Soc. Inf. Sci. Technol. 2010, 61, 2405–2416.
14. Beebe, N.H. BibTeX meets relational databases. TUGboat 2009, 30, 252–271.
15. Manning, C.D.; Surdeanu, M.; Bauer, J.; Finkel, J.R.; Bethard, S.; McClosky, D. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, 23–25 June 2014; pp. 55–60.
16. Schutz, A.T. Keyphrase Extraction from Single Documents in the Open Domain Exploiting Linguistic and Statistical Methods. Master’s Thesis, National University of Ireland, Galway, Ireland, 2008.
17. Nguyen, T.D.; Kan, M.Y. Keyphrase extraction in scientific publications. In Proceedings of the International Conference on Asian Digital Libraries, Hanoi, Vietnam, 10–13 December 2007; pp. 317–326.
18. Hulth, A. Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, Sapporo, Japan, 11–12 July 2003; pp. 216–223.
19. Boudin, F. pke: An open source Python-based keyphrase extraction toolkit. In Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, Osaka, Japan, 11–16 December 2016; pp. 69–73.
20. Frank, E. Domain-specific keyphrase extraction. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 31 July–6 August 1999; pp. 668–673.
21. El-Beltagy, S.R.; Rafea, A. KP-Miner: Participation in SemEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, Uppsala, Sweden, 15–16 July 2010; pp. 190–193.
22. Campos, R.; Mangaravite, V.; Pasquali, A.; Jorge, A.; Nunes, C.; Jatowt, A. YAKE! Keyword extraction from single documents using multiple local features. Inf. Sci. 2020, 509, 257–289.
23. Wan, X.; Xiao, J. CollabRank: Towards a collaborative approach to single-document keyphrase extraction. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, UK, 18–22 August 2008; pp. 969–976.
24. Bougouin, A.; Boudin, F.; Daille, B. TopicRank: Graph-based topic ranking for keyphrase extraction. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), Nagoya, Japan, 14–18 October 2013; pp. 543–551.
25. Duari, S.; Bhatnagar, V. sCAKE: Semantic connectivity aware keyword extraction. Inf. Sci. 2019, 477, 100–117.
26. Grootendorst, M. MaartenGr/KeyBERT. 2021. Available online: https://zenodo.org/record/4461265 (accessed on 28 September 2023).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).