3.1. Data Sources
Web of Science (WoS, owned by Clarivate) and Scopus (produced by Elsevier) are two of the most well-known and scientifically recognized indexing databases, along with Google Scholar, Microsoft Academic, and PubMed. In many countries, the scientific performance of academia is evaluated based on indexing criteria (H-index, number of citations per paper, number of ISI-indexed papers, etc.). We chose to export data from Scopus and WoS only, because PubMed has a narrower scope and coverage [49], being specific to biomedical literature (which is not our interest); Google Scholar, an unpaid service, lacks a reliable mechanism for author attribution; and Microsoft Academic was closed in 2022.
Most reviews and articles extract their data from only one of these databases, because of issues regarding export formats and in order to avoid duplicates. Previous studies focused on the Scopus database have explored the progress of environmental, social, and governance research [52], while others used VOSviewer and Bibliometrix to analyze the impact of investment on sustainable development [53]. Another approach identified in recent studies maps the landscape of misinformation detection [54] using WoS as the source. The parallel analysis of sustainable development and Industry 4.0 has also been a main focus in the research field [55], alongside bibliometric analyses of digital technologies for triple bottom line sustainability [56].
pyBibX is an open-source Python library developed and shared on GitHub by Professor Valdecy Pereira of Universidade Federal Fluminense, Brazil, and maintained with contributions from the Python community. pyBibX reads data from .bib files (Scopus or WoS) and .txt files (PubMed).
Bibliometrix accepts input formats such as .bib (from WoS or Scopus databases), .txt (from WoS, Scopus, PubMed, and Cochrane Library databases), .ciw (from the WoS database), .csv (from Dimensions or The Lens databases), and .xlsx (from the Dimensions database).
In WoS, data can be exported in different formats, such as plain text, RIS, BibTeX, Excel, or tab-delimited files, but no more than 500 records can be exported at a time when the full record and cited references are included.
In Scopus, data can be exported as CSV, plain text, RIS, or BibTeX files, including citation information, bibliographical information, abstract and keywords, funding details, conference information, or references. Up to 20,000 documents can be downloaded at a time, regardless of the export format.
For this research, we exported information about 1885 documents from Scopus and 1426 documents from WoS (in three batches of 500, 500, and 426 records, combined into one final document, as sketched below). We performed a parallel analysis in Bibliometrix and a combined analysis in Python, using the pyBibX package. This approach allowed us to better identify the differences between the two indexing databases, thus viewing digitalization and the sustainable economy from multiple angles. The combined analysis performed in Python also worked with filtered data (duplicates were eliminated) and brought new analyses, such as the N-gram, Treemap, and AI analyses.
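Since WoS caps full-record exports at 500 records, the 1426 WoS records arrived as three partial files. A minimal sketch of stitching such chunks into a single file (the file names are placeholders, not our actual export names):

chunks = ["wos_0001-0500.bib", "wos_0501-1000.bib", "wos_1001-1426.bib"]  # hypothetical names
with open("wos_full.bib", "w", encoding="utf-8") as out:
    for name in chunks:
        with open(name, encoding="utf-8") as part:
            out.write(part.read())
            out.write("\n")  # keep a blank line between the last entry of one chunk and the next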
3.2. Data Collection
Similar searches were carried out in the two indexing databases. In WoS, the title, the abstract, and the author keywords were matched against the search (“digitalization” OR “digital*” OR (“artificial” AND “intelligence”) OR “AI” OR “blockchain” OR “cloud”) AND (“sustainability” OR “sustainable” OR (“triple” AND “bottom” AND “line”)). Additional filters were applied to the document type (Article) and the language (English). On 21 August 2024, when the query was performed, 1461 documents published between 2004 and 2024 were found. Because 97.60% of the documents were published between 2017 and 2024, this interval was chosen as the reference one; therefore, 1426 documents were selected for analysis. The number of records published in 2017 represents an increase of 1500% compared to 2004.
In Scopus, the same search, with the same filters, produced 1939 results on 21 August 2024. The earliest papers date back to 1996, but publication volume became significant only from 2017. In total, 97.22% of the documents were published between 2017 and 2024, the same interval chosen for WoS; therefore, 1885 documents were selected for analysis. For this Scopus query, the number of records published in 2017 represents an increase of 1300% compared to 1996.
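For reproducibility, the query can be rendered in each database's advanced-search syntax roughly as follows; the field codes (TS for the WoS topic search, TITLE-ABS-KEY for Scopus) and the appended Scopus filters are our reconstruction rather than the verbatim saved searches:

TS=(("digitalization" OR "digital*" OR ("artificial" AND "intelligence") OR "AI" OR "blockchain" OR "cloud") AND ("sustainability" OR "sustainable" OR ("triple" AND "bottom" AND "line")))

TITLE-ABS-KEY(("digitalization" OR "digital*" OR ("artificial" AND "intelligence") OR "AI" OR "blockchain" OR "cloud") AND ("sustainability" OR "sustainable" OR ("triple" AND "bottom" AND "line"))) AND DOCTYPE(ar) AND LANGUAGE(english)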
The search string used to retrieve relevant articles from WoS and Scopus also includes terms equivalent to digital (AI, blockchain, cloud), because in the literature they can be considered synonyms for, or components of, digitalization. The same holds for sustainability and triple bottom line, as explained in [56].
When formulating the queries on both databases, we considered the two main topics of interest: digitalization and sustainability. For each term, synonyms were included, as well as the family of words; in the case of digital*, the wildcard covers the following words: digital, digitally, digitalisation, digitalization, digitalized, digitalize, digitalism, digitization, digitize, digitized. Thus, we avoid spelling ambiguity (digitalisation in British English versus digitalization in American English) and capture technical jargon (triple bottom line for sustainability, as used in [56]).
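As a standalone illustration of how the wildcard collapses this word family (a sketch of generic prefix matching, not of the databases' own search engines):

from fnmatch import fnmatch

# Each variant beginning with the prefix "digital" matches the single pattern
variants = ["digital", "digitally", "digitalisation", "digitalization",
            "digitalized", "digitalize", "digitalism"]
print(all(fnmatch(word, "digital*") for word in variants))  # prints True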
Data cleaning and pre-processing were performed in Python for the combined dataset. The first dataset (the Scopus document) was read from a .bib file: of the total 1885 records, 1867 were read properly (the rest had incomplete data). The second dataset (the WoS document) was then read and merged with the first one, resulting in a complete document of 2021 records. Overall, 154 new documents were added from the WoS database, all of which could be read correctly, with no records lost during the import process.
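A minimal sketch of this loading-and-merging step, assuming the pbx_probe entry point and the merge_database helper as documented in the pyBibX README (the file names are placeholders):

from pyBibX.base import pbx_probe

# Load the Scopus export first; del_duplicated drops records flagged as duplicates
bibfile = pbx_probe(file_bib = 'scopus.bib', db = 'scopus', del_duplicated = True)

# Merge the WoS export into the same object, deduplicating against the Scopus records
bibfile.merge_database(file_bib = 'wos.bib', db = 'wos', del_duplicated = True)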
3.3. Tools and Software
Few applications are available for bibliometric analysis in research; consequently, in this paper, the parallel analysis was carried out with the help of the “Bibliometrix” package for the R programming language. Through the Shiny application “Biblioshiny”, the many commands already included in the “Bibliometrix” package are integrated into an interactive web interface that facilitates the generation of visualizations and access to specific indicators [57]. This interface offers the user eight categories of possible analyses, divided according to the object of the research (sources, authors, documents) and the topic of the study (conceptual, intellectual, and social structure): Overview; Sources; Authors; Documents; Clustering; Conceptual Structure; Intellectual Structure; and Social Structure.
Each of these categories has one to twelve possible views, which can be easily adjusted based on the intended outcome. For most visualizations, settings such as the number of evaluated records, the clustering technique, the time interval, and the graphic parameters can be changed to display the most important information in the most visually engaging way.
For the combined dataset, a bibliometric analysis was conducted using the pyBibX Python library, recently developed by a team of researchers from Universidade Federal Fluminense, a Brazilian university. The team, consisting of Valdecy Pereira, Marcio Pereira Basilio, and Carlos Henrique Tarjano Santos, presented a paper in 2023 explaining the capabilities of pyBibX and the benefits it brings to bibliometric and scientometric analysis [58]. The library was first published on GitHub in April 2024 (release 3.3.1); release 3.3.2 followed in June, and the newest release (3.3.4) appeared on 11 July 2024.
pyBibX offers four main capabilities, which are all used in our analysis of the complete dataset coming from the two sources:
data correction and manipulation, through operations like filtering (by year, sources, countries, languages, and/or abstracts), merging fields (authors, institutions, countries, languages, and/or sources that have multiple entries), or merging files from different databases, or from the same database, one at a time;
general analysis (such as an EDA (Exploratory Data Analysis) report; a Word Cloud built from the abstracts, titles, author keywords, or Keywords Plus; an N-gram bar plot built from the abstracts, titles, author keywords, or Keywords Plus; a Sankey Diagram with any combination of the following keys: authors, countries, institutions, journals, author keywords, Keywords Plus, and/or languages; a Treemap built from the authors, countries, institutions, journals, author keywords, or Keywords Plus, etc.);
network analysis (citation analysis, collaboration analysis, similarity analysis, and world map collaboration analysis);
Artificial Intelligence analysis (such as topic modeling using BERTopic to cluster documents by topic, visualizing the topic distribution, visualizing topics by their most representative words, visualizing the projection and clustering of documents by topic, visualizing a topic heatmap, etc.), as sketched after this list.
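A sketch of this AI workflow in code; the function names follow the topic-modeling section of the pyBibX README, but their exact argument lists may differ between releases, so treat them as assumptions:

# Cluster documents by topic with BERTopic, then inspect the result visually
bibfile.topics_creation(stop_words = ['en'], rmv_custom_words = [])
bibfile.graph_topics_distribution(view = 'notebook')  # documents per topic
bibfile.graph_topics(view = 'notebook')               # most representative words per topic
bibfile.graph_topics_projection(view = 'notebook')    # documents projected and clustered by topic
bibfile.graph_topics_heatmap(view = 'notebook')       # topic similarity heatmap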
As the authors mention in “pyBibX -- A Python Library for Bibliometric and Scientometric Analysis Powered with Artificial Intelligence Tools” [58], the library comes with new capabilities for bibliometric analysis that cannot be found in other tools like Bibliometrix, VOSviewer, SciMAT, ScientoPy, CiteSpace, Tethne, or many other specific packages. These unique analyses are the N-gram, the Treemap, and the AI features. In the authors’ opinion, the next best tool is Bibliometrix, which offers 11 of the 17 most important bibliometric key features.
3.4. Bibliometric Indicators
In Python, after obtaining the complete and cleaned dataset merged from Scopus and WoS, we generated the EDA report using the command report = bibfile.eda_bib().
The word_cloud_plot method was used to derive a Word Cloud of the 300 most important words from the abstracts: bibfile.word_cloud_plot(entry = 'abs', size_x = 15, size_y = 10, wordsn = 300). The size of each word reflects its frequency, making the plot useful for quickly identifying the most representative words, which should nevertheless be read in context. A check table was also generated to indicate each word and its importance in terms of frequency.
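The check table can be reproduced independently of the library; a minimal sketch, assuming the abstracts are available as a plain list of strings:

from collections import Counter

abstracts = ["...", "..."]  # placeholder: one string per document
tokens = [word.lower() for text in abstracts for word in text.split()]
for word, count in Counter(tokens).most_common(20):  # top rows of the check table
    print(word, count)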
The N-gram plot divides a text into sequences of a predefined number of words in order to reveal patterns. We used the command bibfile.get_top_ngrams(view = 'notebook', entry = 'title', ngrams = 3, stop_words = [], rmv_custom_words = [], wordsn = 15), which extracts the 15 most frequent three-word sequences from the titles.
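For intuition, trigram extraction reduces to sliding a three-word window over the tokens; a standalone sketch with invented example titles (pyBibX additionally removes stop words and supports other corpora):

from collections import Counter

def top_ngrams(texts, n = 3, topn = 15):
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for i in range(len(words) - n + 1):       # slide the n-word window
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(topn)

titles = ["digital transformation and sustainable development",
          "blockchain for sustainable development goals"]      # invented examples
print(top_ngrams(titles))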
The last plot that we used for the descriptive analysis of the complete dataset is the Treemap. It can be built from the authors, journals, countries, institutions, author keywords, or Keywords Plus. With the command bibfile.tree_map(entry = 'jou', topn = 20, size_x = 30, size_y = 10, txt_font_size = 12), we generated a Treemap of the 20 journals with the highest frequency.
In pyBibX, the network analysis focuses on interactions between citations, journals, countries, or authors. The adjacency analysis studies the interaction between such entities, and we used citation analysis to identify the citations between papers. In the resulting network plot, the blue nodes represent documents, and the red nodes represent citations. We used the command bibfile.find_nodes_dir(view = 'notebook', article_ids = [1880], ref_ids = []) to identify the references cited by the paper with id 1880, and the command bibfile.find_nodes_dir(view = 'notebook', article_ids = [], ref_ids = ['r_2629']) to identify the papers that cite the document r_2629.
Collaboration analysis is represented in pyBibX as an interactive plot between authors, countries, and institutions. Using the code bibfile.network_adj(view = 'notebook', adj_type = 'aut', min_count = 8, node_labels = True, label_type = 'name', centrality = None), we obtained a collaboration analysis for the authors who interacted at least 8 times.
Similarity analysis (also represented as an interactive plot) can be performed using the coupling method, which links documents that share common references, or the co-citation method, which links documents that are cited together by later papers. The command we used performs a co-citation analysis with a threshold of ten (cut_cocit = 10), linking documents that are co-cited at least ten times and thus identifying highly related documents:
bibfile.network_sim(view = 'notebook', sim_type = 'cocit', node_size = 10, node_labels = True, cut_coup = 0.3, cut_cocit = 10).
World map collaboration analysis is used to visualize the patterns of collaboration between authors or institutions from different countries. In the example, we show the countries with which researchers from the United States collaborated:
bibfile.network_adj_map(view = 'notebook', connections = False, country_lst = ['United States of America']).
The AI capabilities of pyBibX are provided mainly by Natural Language Processing (NLP) algorithms. Using the create_embeddings method, we created a list of 30 hot topics based on the abstracts.
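A sketch of this step; create_embeddings is the method named above, but since its argument list is not reproduced here, the parameters below are assumptions modeled on the library's other calls:

# Assumed signature: build NLP embeddings from the abstracts ('abs'),
# from which the list of hot topics is then derived
bibfile.create_embeddings(stop_words = ['en'], rmv_custom_words = [], corpus_type = 'abs')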
In the Biblioshiny interface, the graphs used in the analysis were generated through the interactive menu that displays the eight analysis categories (Overview; Sources; Authors; Documents; Clustering; Conceptual Structure; Intellectual Structure; and Social Structure). Among these categories, six have at least one relevant graph that was selected for the research in order to complete the overall perspective of the analysis.
The “Overview” category graphs, which showcase the database’s general details, have been converted into tables to make the information easier to extract. The indicators in this section concentrate on the main information about the data, the authors, the annual production of papers, and the annual average number of citations.
In relation to the database sources, the analyses were conducted using tables of the most relevant sources by number of publications, together with plots of the sources’ production over time and of their impact as measured by the H-index.
The most relevant authors are presented in the section with the same name, according to the number of articles they published, their scientific production during the explored period, and the countries that have the most relevant corresponding authors.
The category related to documents comprises two tables presenting the 10 most cited articles from each database, thus providing important information on the research directions pursued by authors on this subject so far.
Using the articles’ keywords as a framework, the “Thematic map” plot from the “Conceptual Structure” section was selected to express the status and relevance of the research themes. The plot’s configuration remains the default one, with “Keywords Plus” as the analysis field and “Walktrap” as the clustering algorithm.
For the “Social Structure” section, both available plots were included in the analysis. Collaborative networking between countries and the collaborative world map present the same information in different visual forms: the way in which authors from different countries collaborated with each other during the selected period. For “Collaborative networking between countries”, we kept the default settings, except for the clustering algorithm parameter, which, in this case, becomes “SpinGlass”.
These graphs and tables generated by Biblioshiny, along with the ones obtained with pyBibX, address the essential details required to answer the research questions.