IllustryFlow: A Modular Framework for Automated Bibliometric Analysis Using n8n and BERT-Enhanced Topic Classification

Niţu-Antonie, Vladimir; Niţu-Antonie, Renata Dana; Munteanu, Valentin Partenie

doi:10.3390/electronics15091943

by Vladimir Niţu-Antonie ^1,*, Renata Dana Niţu-Antonie ^2,* and Valentin Partenie Munteanu ^3

^1 Doctoral School of Economics and Business Administration, West University of Timisoara, 300115 Timisoara, Romania
^2 Department of Marketing, International Business and Economics, Faculty of Economics and Business Administration, West University of Timisoara, 300115 Timisoara, Romania
^3 Department of Management and Entrepreneurship, Faculty of Economics and Business Administration, West University of Timisoara, 300115 Timisoara, Romania
^* Authors to whom correspondence should be addressed.

Electronics 2026, 15(9), 1943; https://doi.org/10.3390/electronics15091943

Submission received: 25 March 2026 / Revised: 20 April 2026 / Accepted: 27 April 2026 / Published: 3 May 2026

Abstract

The accelerating growth of scientific publications has intensified the need for scalable and interoperable tools capable of supporting bibliometric analysis and research evaluation. In response to this challenge, this paper introduces IllustryFlow, a modular framework that combines n8n, an open-source workflow automation engine, with Illustry, a dynamic visualization platform, to extract, classify, and interpret scholarly data retrieved from OpenAlex. At the core of the framework is a multilingual BERT-based classification model implemented within the OpenAlex infrastructure, trained on the CWTS (Centre for Science and Technology Studies at Leiden University) classification schema and enriched with metadata features such as journal-level embeddings and citation graph information. IllustryFlow enables automated topic classification, clustering, and semantic visualization of citation networks, co-authorship structures, and thematic distributions. In this framework, Illustry and the custom n8n nodes are components developed by the authors, while OpenAlex and the OpenAlex-enhanced BERT model are integrated as external resources. The principal contribution of this study therefore consists of the architectural design and operational integration of these components into a unified, modular, automated, and reproducible bibliometric workflow. The proposed framework integrates an explicit and reproducible strategy for querying, semantic filtering, and selection of the bibliographic corpus. The framework was evaluated on a dataset of 1756 bibliographic records, and the entire workflow, including dashboard generation, was completed in approximately 90 s under the experimental conditions considered. The results support the feasibility of the framework for scalable bibliometric workflows and indicate its practical potential for the analysis of heterogeneous bibliographic corpora while maintaining reproducibility under the analyzed conditions.

Keywords:

bibliometric analysis; workflow automation; BERT; OpenAlex; n8n; visual analytics; scientific trends; citation networks

1. Introduction

Bibliometric analysis is commonly applied to understand trends and structures in scientific research. It supports the analysis of knowledge organization, research impact, and shifts in activity across different disciplines. Using publication metadata, citation information, and authorship relationships, bibliometric methods help describe the organization and development of scholarly communication [1]. The exponential growth of scientific research renders manual analysis impractical, necessitating automated tools to process vast datasets efficiently [2]. Such tools support evidence-based decision-making in research policy, funding allocation, and institutional evaluation by quantifying productivity and influence [3]. Furthermore, bibliometric analysis fosters innovation and collaboration by uncovering research gaps, interdisciplinary connections, and seminal works [4]. The integration of advanced computational techniques, including machine learning and data visualization, significantly enhances the precision and scalability of bibliometric studies, cementing their role in modern research evaluation [5].

Many of the difficulties encountered in bibliometric analysis are similar to those discussed in the context of big data and are often described using the five “V” dimensions: volume, velocity, variety, veracity, and value [6]. The volume of scholarly information is substantial, with a large number of publications and citations produced each year, which places pressure on conventional analysis methods [2]. In terms of velocity, new research outputs appear continuously, and this high rate of growth increases the need for timely data processing [7]. In terms of variety, bibliometric data come from a wide range of sources, including structured elements such as author information and identifiers, as well as less structured text such as abstracts, which makes data integration more complex [5]. Veracity, or data quality, also remains a concern, as missing or inconsistent metadata and author name ambiguities can affect the reliability of the analysis [8]. Finally, value refers to the need to derive actionable insights from complex datasets to inform research strategies and scientific discovery [3]. These challenges demand robust computational frameworks to manage and analyze large-scale bibliometric data effectively.

The research problem addressed in this paper is the lack of an integrated, automated, and reproducible framework for bibliometric analysis that combines data ingestion, semantic classification at the article level, and interactive visualization into a single workflow. Although there are useful tools and data sources for bibliometric analysis, they often treat data collection, semantic filtering, topic classification, and visual exploration separately, which increases manual intervention and reduces the consistency of the analysis. This paper describes a technical implementation and investigates whether such a modular framework can improve bibliometric analysis in terms of automation, the granularity of thematic classification, and integration between data processing and visual exploration. The contribution of this paper is evaluated both at the software architecture level and from the perspective of its analytical utility for bibliometric studies. Beyond methodological interest, such a framework is also relevant from a practical perspective, as it can support the assessment of institutional profiles, the identification of emerging areas, and the exploration of collaborative structures in contexts of academic management and research policy.

Three current directions in the specialized literature are referenced in this paper: studies on bibliometric analysis and the challenges associated with the volume, variety, and quality of metadata; recent developments on semantic classification in OpenAlex [9], including the use of multilingual BERT-type models and information from citation graphs; and research on the visualization of networks and thematic structure in bibliometrics, represented by tools such as VOSviewer v1.6.20 and modern dashboarding platforms. Existing approaches, however, also have important limitations. Classic bibliometric visualization tools are useful for exploring co-authorship, citation, or co-occurrence networks, but they often require manual preprocessing and offer limited semantic classification. At the same time, modern data sources and thematic classification models offer better granularity, but are not always integrated into a complete, reproducible workflow that unites ingestion, filtering, classification, and visualization in a single operational architecture.

The novelty of this work lies in the integration, in a single modular and automated framework, of three components that are usually treated separately in bibliometric analysis: the ingestion and normalization of bibliographic data, the fine-grained semantic classification at the article level, and the generation of interactive visualizations for exploring the results. The proposed framework leverages three complementary components: n8n for automated data flow orchestration, the OpenAlex-enhanced BERT model for article-level semantic classification, and Illustry for generating interactive visualizations. The relevance of this combination does not derive from the simple use of existing or newly created tools, but from their integration into a single, reproducible flow that unites data ingestion, semantic filtering, thematic classification, and visual exploration. In this regard, the work aligns with recent directions in bibliometric analysis that emphasize the need for automation, fine-grained semantic classification, and visual representations capable of supporting the interpretation of heterogeneous corpora [5,10,11,12,13].

To avoid any ambiguity regarding the nature of the contribution, it is important to distinguish between the adopted components and the original contributions of this paper. The adopted components include the n8n infrastructure, used as a basis for orchestrating workflows, and the OpenAlex-enhanced BERT semantic model [9], used for thematic classification at the article level. The original contributions of this paper consist of the development of the Illustry platform for publishing interactive visualizations, the design and implementation of the n8n nodes dedicated to citation processing and integration with Illustry, and the definition of the unified interoperability mechanism between these components. Furthermore, this paper argues that integrating these components results in a more unified analytical framework that is easier to use in practice compared to treating them separately.

2. Materials and Methods

2.1. Illustry Architecture

An overview of the bibliometric analysis and visualization platform’s design is shown in Figure 1 and Figure 2. The system is organized by separating the visual exploration component (frontend) from the data processing component (backend).

On the backend, bibliographic records are ingested and processed through parsing, cleaning, metadata extraction, author name disambiguation, and publication normalization before being stored for subsequent analytical tasks.

On the frontend, the platform provides interactive dashboards for the exploration of authorship patterns, citation structures, publication networks, and thematic groupings. These interfaces support filtering, query-based exploration, and visual analysis of bibliometric relationships, allowing users to examine the structure of scholarly output from multiple analytical perspectives. By separating data processing from visual exploration, the platform combines automated handling of bibliometric records with the interactive analytical functions expected in contemporary bibliometric visualization environments [13,14,15].

2.2. System Architecture and Modular Design

To address the challenges posed by large-scale bibliometric analysis, we developed a fully automated and modular framework built atop the n8n orchestration engine. The proposed framework integrates three custom-developed nodes—Wos, OpenAlexFetcher, and ArticleToIllustry—each purpose-built to perform advanced data ingestion, filtering, classification, and visualization for scientific publications [13,16,17].

To avoid terminological ambiguities, in this paper, several key terms are used with clearly defined meanings. The term framework refers to the proposed framework as a whole, i.e., the conceptual and operational integration of the components used for data ingestion, semantic classification and visual exploration. The term platform is reserved for the existing software systems used in the proposed framework, in particular n8n and Illustry. The term workflow refers to the orchestrated sequence of automated steps implemented in n8n, while pipeline refers to the technical flow of data processing and transformation between the main stages of the analysis. The term architecture is used to describe the internal structure of the system or a component, and node refers to a modular execution unit in n8n, such as Wos, OpenAlexFetcher or ArticleToIllustry. This convention is maintained throughout the manuscript to avoid conceptual overlaps and to make the contribution of this paper clearer.

From a methodological perspective, the contribution of the proposed framework lies both in the development of specific components and in their integration into a modular architecture with clearly delimited analytical functions. More precisely, the originality of this section lies in the design of an orchestration logic through which data ingestion, semantic enrichment, thematic classification and visual publishing are connected in a reproducible flow. This contribution is relevant because, in bibliometric practice, these stages are often carried out separately, using non-integrated tools and requiring considerable manual intervention.

Operationally, the proposed workflow can be understood as a sequence of six connected stages. First, the user provides a Web of Science export in TSV format, together with search and filter parameters, where applicable. Second, the Wos node parses the file and extracts the basic bibliographic fields needed for the next steps. Third, each record is checked in OpenAlex by a title-oriented search and additional author-based validation; if no valid match is obtained, the workflow switches to a fallback branch, where the record is semantically classified by the OpenAlex-enhanced BERT service. Fourth, the OpenAlexFetcher node retrieves, enriches, and semantically filters candidate records from OpenAlex based on explicit inclusion and exclusion rules. Fifth, the ArticleToIllustry node transforms the filtered bibliographic objects into analytical structures, such as thematic clusters, co-authorship networks, institutional graphs, temporal visualizations, and conceptual summaries. Finally, these results are published via the Illustry API as interactive dashboards for bibliometric exploration. This sequential description is intended to make explicit the data path, decision points, and input/output logic corresponding to the flow illustrated in Figure 3.

Figure 3 synthesizes the operational logic of the proposed workflow, highlighting the main processing steps, the decision point associated with the semantic fallback branch, and the transition from bibliographic ingestion to semantic enrichment and visual publishing. A detailed description of the six steps is presented in the text.

Because the workflow is parameterized, the thematic values of query, relevantTerms, excludeTerms, and maxArticles may vary across analytical scenarios. However, the retrieval, filtering, linking, and fallback logic remained fixed across runs and is reported explicitly here to ensure procedural reproducibility.

2.2.1. Data Ingestion via Web of Science (Wos Node)

The Wos node serves as the entry point of the workflow and is responsible for transforming the Web of Science bibliographic export into a structure that can be processed within the proposed pipeline. At the input level, the node receives a TSV file and extracts the essential bibliographic fields used in the following steps, including title (TI), author (AU), abstract (AB), source/journal (SO), year of publication (PY), and number of citations (TC), as illustrated in Figure 4. These fields are then normalized and prepared for the semantic linking and classification step.
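The field extraction described above can be sketched as follows. This is a minimal illustration in Python rather than the node's actual n8n implementation; the readable key names, the semicolon separator for authors, and the function name parse_wos_tsv are assumptions introduced for the example.

```python
import csv
import io

# Web of Science field tags named in the text, mapped to readable keys.
# The readable key names are illustrative assumptions.
WOS_FIELDS = {
    "TI": "title",
    "AU": "authors",
    "AB": "abstract",
    "SO": "source",
    "PY": "year",
    "TC": "times_cited",
}


def parse_wos_tsv(text):
    """Parse a tab-delimited WoS export into a list of normalized dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        record = {key: (row.get(tag) or "").strip() for tag, key in WOS_FIELDS.items()}
        # Assumption: multiple authors are separated by semicolons.
        record["authors"] = [a.strip() for a in record["authors"].split(";") if a.strip()]
        # Numeric fields become integers, or None when absent or malformed.
        for key in ("year", "times_cited"):
            record[key] = int(record[key]) if record[key].isdigit() else None
        records.append(record)
    return records


# Toy input with a single invented record.
sample = (
    "TI\tAU\tAB\tSO\tPY\tTC\n"
    "Mapping Science\tDoe, J; Roe, R\tAn abstract.\tJ INFORMETR\t2021\t12\n"
)
parsed = parse_wos_tsv(sample)
```

Missing fields are kept as empty values instead of raising an error, so that incomplete records can still be forwarded to later stages.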

Operationally, the node implements an end-to-end strategy for identifying records in OpenAlex, directly using the Works entity query via the title_and_abstract.search field, with descending ordering by relevance_score. For each eligible Wos record, a title-oriented OpenAlex search is generated, and the final selection is based not only on the relevance score, but also on strict bibliographic validation rules.

Before the matching step, the node applies an internal deduplication logic. Records are compared to a set of already processed titles, normalized to lowercase, to avoid introducing duplicates into the current execution. In addition, if essential fields, such as title or authors, are missing, the record is not considered eligible for direct bibliographic linking and is redirected to the semantic fallback branch, preserving the available metadata.
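A minimal sketch of this deduplication and eligibility check is given below, assuming records are dictionaries with title and authors keys (hypothetical names, as is route_records). It is illustrative only and does not reproduce the node's code.

```python
def route_records(records):
    """Split records into (linkable, fallback) lists, skipping repeated titles."""
    seen_titles = set()
    linkable, fallback = [], []
    for record in records:
        title = (record.get("title") or "").strip()
        authors = record.get("authors") or []
        if not title or not authors:
            # Essential fields missing: keep the metadata, send to fallback.
            fallback.append(record)
            continue
        key = title.lower()
        if key in seen_titles:
            continue  # duplicate within the current execution
        seen_titles.add(key)
        linkable.append(record)
    return linkable, fallback


# Toy records: one duplicate title and two incomplete entries.
toy_records = [
    {"title": "Mapping Science", "authors": ["Doe, J"]},
    {"title": "MAPPING SCIENCE", "authors": ["Doe, J"]},
    {"title": "", "authors": ["Roe, R"], "abstract": "Text only."},
    {"title": "No Authors", "authors": []},
]
linkable, fallback = route_records(toy_records)
```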

Wos-to-OpenAlex linking is performed in two steps. First, the node retrieves a list of candidate records from OpenAlex, ordered by relevance. Second, match validation is based on a combined title + author criterion. More precisely, a candidate is accepted as a valid match only if the title in OpenAlex matches the title in Wos through a case-insensitive comparison and if there is at least a plausible match between the surnames of the authors in Wos and those extracted from authorships.author.display_name in OpenAlex. In this way, the connection stage aims to reduce false matches and keep only bibliographically consistent records.
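The combined title + author criterion can be illustrated with the sketch below. The surname heuristics (text before the comma for Wos names, last token for OpenAlex display names) and all function names are assumptions made for the example; the text does not specify how a "plausible" surname match is computed.

```python
def surname_from_wos(name):
    """Assumption: WoS lists authors as 'Surname, Initials'."""
    return name.split(",")[0].strip().lower()


def surname_from_display_name(name):
    """Assumption: the last whitespace-separated token is the surname."""
    parts = name.strip().split()
    return parts[-1].lower() if parts else ""


def is_valid_match(wos_record, candidate):
    """Accept only on case-insensitive title equality plus a shared surname."""
    wos_title = wos_record["title"].strip().lower()
    cand_title = (candidate.get("title") or "").strip().lower()
    if wos_title != cand_title:
        return False
    wos_surnames = {surname_from_wos(a) for a in wos_record["authors"]}
    wos_surnames.discard("")
    cand_surnames = {
        surname_from_display_name(a["author"]["display_name"])
        for a in candidate.get("authorships", [])
    }
    return bool(wos_surnames & cand_surnames)


def first_valid_match(wos_record, candidates):
    """Candidates are assumed to arrive already ordered by relevance."""
    for candidate in candidates:
        if is_valid_match(wos_record, candidate):
            return candidate
    return None  # the caller would switch to the fallback branch


# Toy data for illustration.
wos = {"title": "Mapping Science", "authors": ["Doe, J"]}
good = {"title": "mapping science",
        "authorships": [{"author": {"display_name": "Jane Doe"}}]}
wrong_author = {"title": "Mapping Science",
                "authorships": [{"author": {"display_name": "Ann Lee"}}]}
wrong_title = {"title": "Mapping Sciences",
               "authorships": [{"author": {"display_name": "Jane Doe"}}]}
```

Because the title test is strict equality, a candidate that differs by a single character is rejected, which favors precision over recall.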

When a record cannot be validly identified in OpenAlex, the workflow activates a semantic fallback branch. In this case, the node sends an HTTP POST request to the BERT-based classification service via the configurable endpoint <classifierServerUrl>/invocations. The payload is built from the available abstract, represented as abstract_inverted_index, to which the journal name and other useful metadata are added when available. The result returned by the classifier is then converted into an object compatible with the OpenAlex schema, so that the record can continue through the same analytical flow as those retrieved directly from OpenAlex [9,13,17].
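As an illustration of the fallback request, the sketch below builds an inverted index from an abstract and prepares the POST request without sending it. Only the abstract_inverted_index field and the /invocations path come from the text; the remaining payload keys, the list wrapper, and the function names are assumptions.

```python
import json
import urllib.request


def to_inverted_index(abstract):
    """Map each token to the list of positions where it occurs."""
    index = {}
    for position, token in enumerate(abstract.split()):
        index.setdefault(token, []).append(position)
    return index


def build_fallback_request(classifier_server_url, record):
    """Prepare (but do not send) the POST request for the classifier."""
    # Only abstract_inverted_index is named in the text; the other keys
    # and the list-of-objects wrapper are assumptions for illustration.
    payload = [{
        "title": record.get("title", ""),
        "abstract_inverted_index": to_inverted_index(record.get("abstract", "")),
        "journal_display_name": record.get("source", ""),
    }]
    return urllib.request.Request(
        url=classifier_server_url.rstrip("/") + "/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Toy record and a placeholder local endpoint.
request = build_fallback_request(
    "http://localhost:8080/",
    {"title": "T", "abstract": "maps of science maps", "source": "J"},
)
request_body = json.loads(request.data.decode("utf-8"))
```

Sending the request (for example with urllib.request.urlopen) and converting the response into an OpenAlex-compatible object are left out, since the response format is not described here.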

The output of the node therefore consists of a collection of WorkResponse objects, obtained either by bibliographic matching validated with OpenAlex or by fallback semantic classification. From a methodological point of view, this component plays an essential role because it combines deterministic ingestion of bibliographic metadata with an explicit connection logic and a fallback mechanism for cases where the bibliographic signal is incomplete. Consequently, the Wos node does not only parse the initial input, but also acts as a controlled gateway into the broader workflow, ensuring the consistency of the data structure forwarded to the subsequent stages.

For reproducibility, the OpenAlex retrieval step should be understood as a parameterized query process over the Works entity. In the current implementation, candidate records are requested through title_and_abstract.search, ordered by relevance_score in descending order, with cursor pagination and batches of 100 records per page. The key parameters are query, relevantTerms, excludeTerms, and maxArticles. Inclusion is based on the presence of the query and relevant thematic terms, whereas exclusion is triggered by the presence of predefined excludeTerms during semantic filtering. Wos-to-OpenAlex linking is performed by exact case-insensitive title equality plus at least one plausible author surname correspondence. No fuzzy matching based on edit distance or similarity thresholds is currently used.

2.2.2. OpenAlex Enrichment and Filtering (OpenAlexFetcher Node)

As illustrated in Figure 5, the OpenAlexFetcher node is responsible for the semantic retrieval and thematic filtering stage of candidate publications in OpenAlex. At the input level, the node receives a user-defined search expression together with semantic filtering parameters, including relevant-term lists, exclusion-term lists, and the maximum number of articles to be retrieved. Operationally, the node uses the openalex-ts client to query the Works entity in OpenAlex by means of the title_and_abstract.search field, using descending order by relevance_score, the is_oa=true filter, cursor-based pagination, and batches of 100 results per page.

A representative example of an OpenAlex query, equivalent to the logic used by the node, is as follows:

GET /works?filter=title_and_abstract.search:bibliometric analysis scientific visualization topic classification,is_oa:true&sort=relevance_score:desc&per-page=100&cursor=*
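The same request, together with cursor-based pagination, can be sketched in Python as shown below. The node itself uses the openalex-ts client, so this is an equivalent illustration rather than its code; fetch_json is a hypothetical callable that returns the decoded JSON for a URL, and a stand-in is used here so that the example runs without network access.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openalex.org/works"


def build_works_url(query, cursor="*", per_page=100):
    """Build one page request for the Works entity."""
    params = {
        "filter": f"title_and_abstract.search:{query},is_oa:true",
        "sort": "relevance_score:desc",
        "per-page": per_page,
        "cursor": cursor,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def iter_works(query, fetch_json, max_articles):
    """Yield up to max_articles works, following meta.next_cursor."""
    cursor, yielded = "*", 0
    while cursor and yielded < max_articles:
        page = fetch_json(build_works_url(query, cursor))
        results = page.get("results", [])
        if not results:
            return
        for work in results:
            if yielded >= max_articles:
                return
            yield work
            yielded += 1
        cursor = page.get("meta", {}).get("next_cursor")


# Usage with a stand-in fetcher (no network): two canned pages.
pages = iter([
    {"results": [{"id": "W1"}, {"id": "W2"}], "meta": {"next_cursor": "abc"}},
    {"results": [{"id": "W3"}], "meta": {"next_cursor": None}},
])
requested = []


def fake_fetch(url):
    requested.append(url)
    return next(pages)


works = list(iter_works("bibliometric analysis", fake_fetch, max_articles=3))
```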

In practice, the exact value of the OpenAlex search expression is controlled by the query parameter, which allows the workflow to be adapted to different thematic scenarios. After retrieval, each publication is subjected to an explicit semantic filtering step based on the normalized content of its title, abstract, topics, and associated keywords. Normalization consists of lowercasing and the removal of formatting artifacts, so that the matching process is not affected by superficial orthographic variation.

The inclusion criteria are defined through the relevantTerms parameter as an explicit JSON object of thematic categories and associated terms. In the configuration used in this study, three categories were applied. The management category included the terms “management”, “business”, “organization”, “strategy”, “leadership”, “administration”, and “enterprise”. The decision_making category included “decision”, “decisions”, “decision-making”, “choice”, “judgment”, and “planning”. The visualization category included “visualization”, “visualizations”, “data visualization”, “visual”, “graph”, “chart”, “dashboard”, and “analytics”.

The exclusion criteria were defined through the excludeTerms parameter as the following list of terms: “medicine”, “medical”, “surgery”, “anesthesia”, “fetal”, “cancer”, “disease”, “biology”, “clinical”, and “healthcare”. Publications containing these exclusion terms in the analyzed textual fields were removed from the candidate set.

A publication was considered relevant only if it matched at least two distinct relevant thematic categories from the relevantTerms configuration. In the present implementation, this means that a work has to satisfy terms from at least two of the following categories: management, decision_making, and visualization. This rule was introduced to reduce false positives and retain only publications that were semantically consistent with the intended thematic scope.
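The inclusion and exclusion rules can be expressed compactly as in the following sketch, which uses the term lists reported above. Whole-word matching is an assumption (the text does not state whether substrings count), and the function names are illustrative.

```python
import re

RELEVANT_TERMS = {
    "management": ["management", "business", "organization", "strategy",
                   "leadership", "administration", "enterprise"],
    "decision_making": ["decision", "decisions", "decision-making", "choice",
                        "judgment", "planning"],
    "visualization": ["visualization", "visualizations", "data visualization",
                      "visual", "graph", "chart", "dashboard", "analytics"],
}
EXCLUDE_TERMS = ["medicine", "medical", "surgery", "anesthesia", "fetal",
                 "cancer", "disease", "biology", "clinical", "healthcare"]


def normalize(text):
    """Lowercase and collapse whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def contains_term(text, term):
    # Assumption: whole-word matching, so "graph" does not match
    # inside "bibliographic".
    return re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", text) is not None


def matched_categories(text):
    """Return the set of thematic categories with at least one matching term."""
    return {
        category
        for category, terms in RELEVANT_TERMS.items()
        if any(contains_term(text, term) for term in terms)
    }


def is_relevant(text, min_categories=2):
    """Apply exclusion first, then require at least two distinct categories."""
    text = normalize(text)
    if any(contains_term(text, term) for term in EXCLUDE_TERMS):
        return False
    return len(matched_categories(text)) >= min_categories
```

In practice the text passed to is_relevant would be the concatenation of the title, abstract, topics, and keywords of a work. With substring matching instead of whole-word matching, short terms such as "graph" or "visual" would match far more records, so the choice noticeably affects the size of the retained corpus.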

The key parameters of this stage include the search expression (query), relevant thematic categories (relevantTerms), exclusion terms (excludeTerms), maximum result limit (maxArticles), ordering by relevance_score, cursor pagination, and a batch size of 100 results per page. In terms of operational robustness, the node includes mechanisms for error handling and rate limiting, through controlled retries, backoff, and explicit handling of transient response codes, especially 429, 500, and 503.
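The retry behavior can be illustrated with a short sketch. The exponential schedule, the attempt limit, and the function signature are assumptions; the text states only that controlled retries and backoff are applied to transient codes such as 429, 500, and 503.

```python
import time

TRANSIENT_STATUS = {429, 500, 503}


def call_with_retries(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-transient status or attempts run out.

    send must return a (status_code, body) tuple.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status not in TRANSIENT_STATUS:
            return status, body
        if attempt < max_attempts - 1:
            # Assumed schedule: the delay doubles after each failed attempt.
            sleep(base_delay * (2 ** attempt))
    return status, body


# Usage with canned responses; delays are recorded instead of slept.
responses = iter([(429, None), (503, None), (200, {"results": []})])
delays = []
status, body = call_with_retries(lambda: next(responses), sleep=delays.append)
```

Injecting the sleep function keeps the retry logic testable without waiting for real delays.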

The output of the node consists of a filtered and enriched set of bibliographic objects compatible with the rest of the flow, ready for final classification, analytical aggregation, and visual publishing. In this sense, OpenAlexFetcher is the component that transforms the raw retrieval from OpenAlex into an explicit, transparent, and reproducible thematic corpus [13,17].

2.2.3. Topic Extraction and Visualization (ArticleToIllustry Node)

Figure 6 illustrates the ArticleToIllustry node that finalizes the analytical pipeline.

This component accepts a list of JSON objects and generates the visual and analytical outputs used for downstream exploration. Its functions include primary topic cluster extraction based on primary_topic frequency, entity graph construction for co-authorship and co-institution networks, country and time-series analysis using calendar heatmaps, semantic content summarization via word clouds for concepts and keywords, and citation trajectory modeling represented as multi-year bar charts.
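Two of these functions, topic cluster extraction by primary_topic frequency and co-authorship edge construction, can be sketched as follows. The input is assumed to follow the OpenAlex work structure (primary_topic.display_name and authorships[].author.display_name); the function names and the toy records are illustrative.

```python
from collections import Counter
from itertools import combinations


def topic_clusters(works):
    """Count works per primary topic, most frequent first."""
    counts = Counter(
        work["primary_topic"]["display_name"]
        for work in works
        if work.get("primary_topic")
    )
    return counts.most_common()


def coauthorship_edges(works):
    """Weight each unordered author pair by the number of shared works."""
    edges = Counter()
    for work in works:
        authors = sorted(
            {a["author"]["display_name"] for a in work.get("authorships", [])}
        )
        for pair in combinations(authors, 2):
            edges[pair] += 1
    return edges


def _authorships(*names):
    return [{"author": {"display_name": name}} for name in names]


# Toy works for illustration.
sample_works = [
    {"primary_topic": {"display_name": "Bibliometrics"},
     "authorships": _authorships("A", "B")},
    {"primary_topic": {"display_name": "Bibliometrics"},
     "authorships": _authorships("B", "A", "C")},
    {"primary_topic": None, "authorships": _authorships("C")},
]
clusters = topic_clusters(sample_works)
edges = coauthorship_edges(sample_works)
```

The resulting counts and weighted pairs correspond to the node and link lists that a graph or chart component would consume.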

All results are posted via a REST API to a self-hosted Illustry application, where they are rendered in Apache ECharts dashboards optimized for large-scale bibliometric data [12,14]. This design supports interactive exploration, configurable visual summaries, and the extraction of institutionally relevant insights from the processed publication corpus.

2.3. Use of the OpenAlex-Enhanced BERT Model in the Proposed Framework

To ensure scalable and domain-sensitive topic classification in large-scale bibliometric workflows, this study adopts the OpenAlex classification framework, a multilingual, transformer-based architecture that integrates contextual, relational, and source-level embeddings to produce fine-grained semantic labels at the article level. This model has been shown to offer significant improvements over traditional journal-based heuristics in both accuracy and granularity, particularly for cross-disciplinary and multilingual corpora [13,16]. The classification engine is centered on a fine-tuned multilingual BERT (mBERT) encoder, trained using more than 70 million labeled records derived from the OpenAlex graph and labeled by the CWTS Leiden Ranking taxonomy. Each training instance is annotated with topic labels drawn from a hierarchical topic graph of over 4000 nodes, structured by domain → field → subfield → topic, following a refined version of the Scopus ASJC taxonomy [18].

It is important to note that, in this research, the OpenAlex-enhanced BERT model was not developed or retrained by the authors, but is used as a pre-existing semantic model, adopted from the OpenAlex infrastructure. The purpose of this description is to explain the architectural logic and the types of information integrated by this existing model. From a reproducibility perspective, this paper describes the essential components of the model used, as reported in the OpenAlex sources and related literature, including the multilingual BERT encoder, the integration of textual, citation, and journal embeddings, and the use of the CWTS/OpenAlex hierarchical taxonomy. The complete training hyperparameters, the exact optimization procedure and the dataset splitting belong to the original OpenAlex model.

Consequently, the replicable dimension of this work mainly focuses on how this semantic classification is incorporated into an automated workflow of ingestion, filtering, classification and visualization, without considering the full retraining of the existing base model.

2.3.1. Multimodal Embedding Integration

To enhance classification fidelity across varying data sparsity levels, the model fuses three embedding modalities:

For the text-based embeddings, titles and abstracts are combined and encoded using a multilingual BERT model. This captures differences in meaning across disciplines and languages, especially in cases where metadata coverage is uneven [16].
For citation embeddings, the model integrates two graph-derived features that encode the position of a publication within the citation network relative to previously topic-labeled reference works. Citation 1 captures direct citation links between the focal article and gold-labeled topic exemplars, that is, publications already associated with well-defined topics in the OpenAlex/CWTS taxonomy. Citation 2 captures second-order citation proximity by considering links to works that cite those exemplars, thereby extending the relational signal beyond direct citation ties. Together, these two features provide a structured indication of the thematic neighborhood within the citation graph and are especially useful when textual metadata are sparse or semantically ambiguous [9].
For journal embeddings, instead of relying on static journal categories, journal identity is modeled via dynamic transformer-based embeddings (e.g., MiniLM). These vectors are trained jointly with the main model to encode topical biases of journals, supporting robust inference even when other features are missing [17].

Operationally, the multimodal fusion mechanism aggregates the three sources of semantic signal (textual, citation, and journal) into a single composite representation at the article level. Specifically, the textual embedding derived from the title and abstract, the embeddings based on citation relationships, and the journal-associated embedding are concatenated to form a unified input vector, which is then passed to a feedforward neural classifier. The classifier projects the multimodal representation into the candidate topic space and produces membership scores for possible thematic labels. In this configuration, classification does not depend exclusively on the textual content of the article, but on combining textual, relational, and editorial information into a unified representation. In addition, based on the description available in the original sources, the model includes robustness mechanisms such as stochastic masking of some features and exploitation of alternative signals when metadata are incomplete.
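The concatenation and feedforward projection can be illustrated with a toy forward pass in plain Python. The dimensions, weights, ReLU activation, and softmax output are assumptions chosen for readability; they do not reproduce the architecture or parameters of the OpenAlex model.

```python
import math


def concat(*vectors):
    """Concatenate modality embeddings into one input vector."""
    fused = []
    for vector in vectors:
        fused.extend(vector)
    return fused


def linear(x, weights, bias):
    """One dense layer: weights has one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, value) for value in x]


def softmax(logits):
    """Numerically stable softmax (an assumed choice of output scoring)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(text_emb, citation_emb, journal_emb, layers):
    """Feedforward pass over the fused vector; returns one score per topic."""
    hidden = concat(text_emb, citation_emb, journal_emb)
    for weights, bias in layers[:-1]:
        hidden = relu(linear(hidden, weights, bias))
    weights, bias = layers[-1]
    return softmax(linear(hidden, weights, bias))


# Toy example: a 4-dimensional fused input and two candidate topics.
toy_layers = [([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]], [0.0, 0.0])]
scores = classify([1.0, 0.0], [0.5], [0.0], toy_layers)
```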

2.3.2. Hierarchical Taxonomy and Clustering Capacity

The output layer maps each publication to one or more topic nodes, each representing a cohesive cluster discovered through community detection over the OpenAlex citation graph using the Leiden algorithm [19].

Each community is subsequently labeled using large language models to align with Scopus-style topic labels, and each label is linked to ASJC codes to support field-normalized evaluations [18].

Although the model’s primary role is supervised classification, the resulting topic representations can also support clustering applications, including the identification of latent research themes, inter-topic relationships and author-based topical proximity. In the present study, article-level clustering is derived from shared primary topics, enabling the construction of co-authorship networks, institutional clusters, and temporal topic flows [13].

2.3.3. Performance Characteristics and Limitations

Empirical evaluations on held-out sets demonstrate strong predictive performance, with Top-1 accuracy of 53%, Top-3 accuracy of 64%, and Top-5 accuracy of 67%, increasing to 72% Top-1 accuracy when full metadata are available [17]. The model shows strong resilience to metadata incompleteness due to its multimodal architecture but exhibits lower precision on rare or emergent topics with limited training samples. Additionally, non-Latin alphabets and very short texts reduce classification confidence, although these challenges are mitigated in production through journal inference and iterative retraining.
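For readers unfamiliar with the metric, Top-k accuracy is the share of items whose true label appears among the k highest-scored labels. The sketch below computes it on invented toy scores; it is not the evaluation code behind the figures reported above.

```python
def top_k_accuracy(score_rows, true_labels, k):
    """Share of items whose true label is among the k highest-scored labels."""
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        ranked = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += truth in ranked
    return hits / len(true_labels)


# Invented toy scores for three items over three labels.
toy_scores = [
    {"a": 0.6, "b": 0.3, "c": 0.1},
    {"a": 0.2, "b": 0.5, "c": 0.3},
    {"a": 0.5, "b": 0.4, "c": 0.1},
]
toy_truth = ["a", "c", "c"]
top1 = top_k_accuracy(toy_scores, toy_truth, 1)
top2 = top_k_accuracy(toy_scores, toy_truth, 2)
top3 = top_k_accuracy(toy_scores, toy_truth, 3)
```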

This theoretical backbone enables the ArticleToIllustry node to leverage primary topic labels for clustering records into thematic groups, subsequently visualized through high-level semantic maps and citation dynamics dashboards.

The confidence scores associated with the topic classification should be understood as the output scores of the classifier for the topic assigned to each article. In operational terms, they express the relative level of certainty with which the model associates a publication with the topic label selected from among the candidate topics. These values do not represent an external measure of accuracy validation, but an internal estimate of the strength of the classification produced by the model. For this reason, they are used in the present analysis as proxy indicators of the stability and practical consistency of the classification, without representing a substitute for an assessment on a manually labeled set. Consequently, Table 1 does not constitute a direct validation of the classifier’s performance in the sense of standard evaluation metrics, but rather a description of the distribution of internal certainty associated with the classifications generated in the two analyzed streams.

From this perspective, Table 1 summarizes the distribution of confidence scores for records enriched through OpenAlexFetcher and for records originating from Web of Science and subsequently classified through the Wos + BERT branch, providing a comparative picture of the classification behavior in the two data streams.

To complement these aggregate indicators, the evaluation was extended with a comparison across the most frequent thematic categories.

The results in Table 1 show that the OpenAlexFetcher-enriched stream produces higher confidence scores than the Wos + BERT branch. For the OpenAlex stream, the median confidence score is 0.9708 for 1756 articles, while for the Wos + BERT branch the median is 0.8228 for 42 articles. This difference is compatible with the distinct role of the two streams in the proposed architecture: the OpenAlex stream represents the main semantic enrichment and extended aggregation pathway, while the Wos + BERT branch reflects situations where classification needs to be performed under conditions of reduced or incomplete metadata. In this regard, the comparison does not aim at strict equivalence of the samples, but at illustrating how classification works in two different contexts of information availability.

The difference in size between the two streams, however, requires a cautious interpretation. With only 42 articles, the Wos + BERT branch cannot support general inferences regarding the robustness of the classification or the thematic coverage of the model in broader contexts. Nevertheless, the results are informative: they show that topic assignment remains operationally possible under conditions of limited bibliographic information, although the confidence values are higher when the classification benefits from more complete metadata. From this perspective, the comparison supports the practical utility of the framework for exploring heterogeneous bibliographic collections, while also indicating that metadata completeness influences classification confidence.

At the thematic granularity level, the OpenAlexFetcher stream assigned 462 unique topics for 1756 articles, while the Wos + BERT set covered 23 unique topics for 42 articles. These values suggest that the main semantic enrichment stream allows for a fine classification of the corpus and provides an adequate basis for the construction of thematic clusters and visual structures used later in the analysis. At the same time, the results from the WoS branch indicate that the model can provide usable classifications when the information signal is weaker, but in a more restricted empirical framework. Overall, the results support the feasibility and practical utility of the proposed framework for classifying and exploring heterogeneous bibliographic corpora. They suggest that integrating semantic classification into the bibliometric flow allows for maintaining an operational classification under different conditions of metadata completeness and provides a sufficiently stable basis for subsequent clustering and visualization steps. In the current form of the study, these conclusions should be viewed as empirical indications of the practical behavior of the framework, without being extended to an exhaustive experimental validation of the classifier’s performance.
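The per-stream descriptors discussed above (number of articles, median confidence, and number of unique topics) can be derived with a few lines of code. The following Python sketch illustrates one way to do so; the `ClassifiedRecord` structure, the function name, and the toy values are assumptions made for illustration and do not reproduce the data behind Table 1.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ClassifiedRecord:
    stream: str        # e.g., "OpenAlexFetcher" or "Wos + BERT"
    topic: str         # primary topic label assigned by the classifier
    confidence: float  # classifier output score for that topic


def summarize_streams(records: Iterable[ClassifiedRecord]) -> Dict[str, dict]:
    """Return article count, median confidence, and unique-topic count per stream."""
    by_stream: Dict[str, List[ClassifiedRecord]] = {}
    for record in records:
        by_stream.setdefault(record.stream, []).append(record)
    return {
        stream: {
            "articles": len(items),
            "median_confidence": median(r.confidence for r in items),
            "unique_topics": len({r.topic for r in items}),
        }
        for stream, items in by_stream.items()
    }


# Hypothetical toy records, not the values reported in the study.
toy_records = [
    ClassifiedRecord("OpenAlexFetcher", "A", 0.97),
    ClassifiedRecord("OpenAlexFetcher", "B", 0.99),
    ClassifiedRecord("OpenAlexFetcher", "A", 0.91),
    ClassifiedRecord("Wos + BERT", "A", 0.80),
    ClassifiedRecord("Wos + BERT", "C", 0.84),
]
summary = summarize_streams(toy_records)
for stream, stats in summary.items():
    print(stream, stats)
```

As in the text, these values describe the distribution of internal classifier scores and should not be read as accuracy measurements.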

An important methodological limitation of the present evaluation is the absence of an ablation study that isolates the contribution of each semantic component integrated into the classifier. In its current form, the comparison between the OpenAlex-enriched stream and the Wos + BERT branch does not allow for a rigorous separation of the effect of OpenAlex enrichment from that of citation embeddings and journal embeddings. Therefore, an important direction for future research is to conduct an ablation study that explicitly compares classification performance in configurations with and without OpenAlex enrichment, with and without citation embeddings, and with and without journal embeddings.

2.3.4. Execution Time Across Dataset Sizes

To evaluate the proposed framework’s operational behavior in relation to the amount of data, we analyzed the total execution time of the same workflow on datasets of different sizes. This comparison is relevant because the practical utility of an automated bibliometric workflow depends not only on the correctness of the classification and the quality of the visualizations, but also on the ability of the system to maintain reasonable processing times as the corpus size increases. Therefore, the operational scalability was examined by comparing the runtime for progressive subsets and for the full dataset, as reported in Table 2.

The results in Table 2 show that the total execution time increases with the size of the dataset, but at a moderate rate, from 46 s for 250 records to 90 s for 1756 records. The processing time per 100 records decreases from 18.4 s to 5.1 s, indicating better operational efficiency (measured by processing speed) as the data volume increases. This behavior suggests that the runtime does not grow in direct proportion to corpus size; instead, the workflow combines a relatively stable fixed-cost component with a slower increase in the volume-dependent steps. In particular, the Wos step remains approximately constant, while OpenAlexFetcher explains most of the increase in the total execution time. Overall, the results in Table 2 support the practical scalability of the proposed architecture under the analyzed experimental conditions, although validation on larger bibliographic collections is still needed.
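The per-100-record figures can be checked directly from the two endpoints quoted in the text. The short Python sketch below reproduces that arithmetic and, as an additional illustration, fits a straight line through the same two points to give a rough reading of the fixed and per-record cost. The two-point fit is an assumption introduced here for illustration only; it ignores the intermediate subsets in Table 2 and is not a validated cost model.

```python
from typing import Tuple


def seconds_per_100(total_seconds: float, n_records: int) -> float:
    """Processing time normalized to 100 records."""
    return 100.0 * total_seconds / n_records


def two_point_linear_fit(n1: int, t1: float, n2: int, t2: float) -> Tuple[float, float]:
    """Line through two (records, seconds) points: returns (intercept, slope)."""
    slope = (t2 - t1) / (n2 - n1)
    intercept = t1 - slope * n1
    return intercept, slope


# Endpoints quoted in the text: 46 s for 250 records, 90 s for 1756 records.
rate_small = seconds_per_100(46.0, 250)
rate_full = seconds_per_100(90.0, 1756)

# Illustrative assumption: runtime = fixed cost + per-record cost * records.
fixed_cost, per_record = two_point_linear_fit(250, 46.0, 1756, 90.0)

print(f"per 100 records: {rate_small:.1f} s vs {rate_full:.1f} s")
print(f"illustrative fixed cost: {fixed_cost:.1f} s, per record: {per_record:.4f} s")
```

Under this simplified reading, most of the 46 s measured for the smallest subset would be attributable to size-independent overhead, which is consistent with the falling per-100-record time.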

2.3.5. Consistency Analysis and Comparison Across Thematic Categories

To complement the aggregate assessment presented above, the analysis was extended in two directions: examining the consistency of the thematic classification and comparing the behavior of the model across the main thematic categories identified in the corpus. To this end, the top 10 topics were selected in order of frequency, so that the analysis would focus on the categories with the greatest empirical relevance in the dataset. This extension is necessary because the global descriptive indicators only provide a synthetic picture of the classification and do not sufficiently capture the internal variations between the dominant thematic areas. The corresponding results are summarized in Table 3.

The results in Table 3 indicate that the level of confidence of the classification varies between the thematic categories analyzed. For example, the topics “Service-Oriented Architecture and Web Services” and “Sustainable Supply Chain Management” record high values for the mean and median confidence scores, while categories such as “Economic and Business Development Strategies” or “Economic and Technological Systems Analysis” present lower mean values. At the same time, the differences between the mean and median suggest that the distribution of scores is not uniform across all topics, which indicates internal variations in the stability of the classification. Overall, the results presented in Table 3 support the idea that the robustness of the framework should be assessed not only at a global level, but also according to the behavior of the classification in different thematic areas. From this perspective, the comparison by categories provides a more informative picture of the practical performance of the framework than simply reporting aggregated indicators at the level of the entire corpus.
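A per-topic comparison of this kind can be produced by selecting the most frequent topics and computing the mean and median confidence for each. The Python sketch below shows one possible implementation; the function name, the tie-breaking rule (alphabetical order for equally frequent topics), and the toy values are assumptions and do not reproduce Table 3.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Dict, Iterable, List, Tuple


def top_topic_confidence(assignments: Iterable[Tuple[str, float]], n: int = 10) -> List[dict]:
    """Mean and median confidence for the n most frequent topics."""
    scores: Dict[str, List[float]] = defaultdict(list)
    for topic, confidence in assignments:
        scores[topic].append(confidence)
    # Most frequent first; ties are broken alphabetically (an arbitrary choice).
    ranked = sorted(scores.items(), key=lambda kv: (-len(kv[1]), kv[0]))[:n]
    return [
        {"topic": topic, "articles": len(vals), "mean": mean(vals), "median": median(vals)}
        for topic, vals in ranked
    ]


# Hypothetical (topic, confidence) pairs.
toy_assignments = [
    ("A", 0.9), ("A", 0.5), ("A", 1.0),
    ("B", 0.8), ("B", 0.6),
    ("C", 0.7),
]
per_topic = top_topic_confidence(toy_assignments, n=2)
for row in per_topic:
    print(row)
```

In the toy output, topic "A" has a mean below its median because of a single low score, which is the kind of asymmetry the comparison between mean and median is meant to expose.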

2.4. Visualizations with Illustry

Visualization is an essential part of bibliometric analysis when corpora become large and heterogeneous enough that thematic, relational, and temporal structures can no longer be efficiently tracked in tabular form. In the proposed framework, this function is achieved by integrating the Illustry platform into the n8n stream, so that semantically classified data can be automatically transformed into interactive dashboards usable for analytical exploration. This orientation is in line with the literature on knowledge mapping and visual representation of scientific structures which emphasizes the role of visualization not only as a descriptive tool, but also as a support for the interpretation of thematic relationships and network dynamics [7,15,20].

2.4.1. ArticleToIllustry Node: Architecture and Visualization Logic

To operationalize the topic-classified bibliometric data retrieved via OpenAlex, this study employs a custom n8n workflow component named ArticleToIllustry. This node automates the transformation of enriched article metadata into structured, semantically layered visualizations suitable for scholarly analysis. Its processing logic includes topic grouping, co-occurrence modeling, temporal mapping, and dashboard deployment via the Illustry API [12].

2.4.2. Technical Structure and Functionality

At the core, the ArticleToIllustry node accepts WorkResponse objects from the OpenAlex API and identifies the top primary topic clusters based on frequency of assignment. Each work is evaluated according to its primary topic ID, and the top six clusters are extracted. Subsequently, a ClusterMap object is built, which aggregates article-level metadata, including authors’ countries, affiliated institutions, publication year, and cited-by counts.
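The selection and aggregation logic described above can be sketched as follows. The actual node is implemented in TypeScript inside n8n; this Python version is only an illustration, and the function name, the dictionary layout of the resulting cluster map, and the toy records are assumptions. The field names `primary_topic`, `publication_year`, `cited_by_count`, and `authorships` follow the public OpenAlex work schema.

```python
from collections import Counter
from typing import Dict, List


def build_cluster_map(works: List[dict], top_n: int = 6) -> Dict[str, List[dict]]:
    """Group works under their primary topic, keeping only the top_n most frequent topics."""
    labelled = [w for w in works if w.get("primary_topic")]
    counts = Counter(w["primary_topic"]["id"] for w in labelled)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    cluster_map: Dict[str, List[dict]] = {topic_id: [] for topic_id, _ in ranked}
    for work in labelled:
        topic_id = work["primary_topic"]["id"]
        if topic_id not in cluster_map:
            continue  # topic is outside the top_n clusters
        institutions = [
            inst
            for authorship in work.get("authorships", [])
            for inst in authorship.get("institutions", [])
        ]
        cluster_map[topic_id].append({
            "title": work.get("title"),
            "year": work.get("publication_year"),
            "cited_by_count": work.get("cited_by_count", 0),
            "countries": sorted({i["country_code"] for i in institutions if i.get("country_code")}),
            "institutions": sorted({i["display_name"] for i in institutions if i.get("display_name")}),
        })
    return cluster_map


def _toy_work(topic_id: str, year: int, cites: int, country: str, institution: str) -> dict:
    """Hypothetical record with only the fields used above."""
    return {
        "title": f"Work on {topic_id}",
        "publication_year": year,
        "cited_by_count": cites,
        "primary_topic": {"id": topic_id, "display_name": topic_id},
        "authorships": [{"institutions": [{"display_name": institution, "country_code": country}]}],
    }


toy_works = [
    _toy_work("T1", 2021, 5, "RO", "Univ A"),
    _toy_work("T1", 2022, 1, "NL", "Univ B"),
    _toy_work("T2", 2020, 0, "RO", "Univ A"),
    {"title": "Unclassified", "primary_topic": None},
]
clusters = build_cluster_map(toy_works, top_n=1)
print({topic: len(items) for topic, items in clusters.items()})
```

Works without a primary topic are skipped rather than assigned to a residual cluster; whether the production node does the same is not stated in the source.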

The node then generates multiple types of visualizations, each following a standard schema for interoperation with the Illustry platform:

Temporal Publication Calendar—plots articles by publication date and cluster label, enabling chronological trend mapping.
Pie Chart (Countries)—displays the geographic distribution of research by authorship affiliation.
Co-Authorship Networks (Edge Bundling + Force Graph)—maps author collaboration intensity across and within topics [21].
Institutional Graphs—reveals co-affiliation patterns and regional research hubs through force-directed layouts.
Word Clouds—summarizes the most prominent concepts and keywords [15,20].
Bar Chart (Citations by Year)—highlights citation trajectories of top articles across a multi-year window.
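As an illustration of how two of the views listed above could be fed, the Python sketch below aggregates a country distribution (for the pie chart) and yearly citation totals (for the bar chart) from OpenAlex-style work records. The function names and output shapes are assumptions and do not represent the Illustry schema; `authorships`, `institutions`, `country_code`, and `counts_by_year` are fields of the public OpenAlex work object.

```python
from collections import Counter
from typing import Dict, List


def country_distribution(works: List[dict]) -> Dict[str, int]:
    """Number of works per country; a work counts once for each distinct country."""
    counts: Counter = Counter()
    for work in works:
        codes = {
            inst["country_code"]
            for authorship in work.get("authorships", [])
            for inst in authorship.get("institutions", [])
            if inst.get("country_code")
        }
        counts.update(codes)
    return dict(counts)


def citations_by_year(works: List[dict]) -> Dict[int, int]:
    """Total citations received per year, summed over all works."""
    totals: Counter = Counter()
    for work in works:
        for entry in work.get("counts_by_year", []):
            totals[entry["year"]] += entry["cited_by_count"]
    return dict(sorted(totals.items()))


# Hypothetical records containing only the fields used above.
sample_works = [
    {
        "authorships": [
            {"institutions": [{"country_code": "RO"}]},
            {"institutions": [{"country_code": "RO"}, {"country_code": "NL"}]},
        ],
        "counts_by_year": [
            {"year": 2023, "cited_by_count": 4},
            {"year": 2024, "cited_by_count": 6},
        ],
    },
    {
        "authorships": [{"institutions": [{"country_code": "RO"}]}],
        "counts_by_year": [{"year": 2024, "cited_by_count": 1}],
    },
]
countries = country_distribution(sample_works)
yearly_citations = citations_by_year(sample_works)
print(countries, yearly_citations)
```

Counting each work once per country (whole counting) is one of several possible conventions; fractional counting, as discussed in [18], would distribute a work's weight across its countries instead.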

The system’s design ensures robust data handling via optional feature fallbacks; for example, articles lacking citation metadata can still be clustered via journal embeddings or textual features. Once visualizations are compiled, the node programmatically deploys them to a user-specific project in Illustry, where they are rendered into an interactive dashboard.

2.4.3. Interactivity and Analytical Use Cases in Illustry

Once deployed, the visualizations support rich user interaction through the Illustry platform. Users can inspect semantic details through tooltip metadata attached to nodes, bars, and word-cloud items, including full author names, article titles, and institution labels, thereby enabling detail-on-hover exploration without overloading the visual abstraction [22]. They can also apply dynamic filters by country, concept, or year to isolate thematic subsets and temporal windows, using Illustry’s metadata indexing and dashboardState objects to manage active layers and constraints [23]. In addition, visual elements such as authors and individual articles are linked to the corresponding OpenAlex records, allowing users to review citation contexts and related metadata directly from the dashboard interface [24]. These functions are intended primarily for exploratory analysis rather than presentation alone, supporting the investigation of collaboration structures, topic development, and geographic patterns in scientific activity in line with established approaches in bibliometric mapping [15,25].

In its current form, the evaluation of the visualization component remains mainly functional and qualitative. Illustry supports exploratory and practically oriented bibliometric analysis by combining interactive filtering, navigation across aggregation levels, and complementary graphical views that facilitate the examination of thematic, institutional, and temporal relationships in the corpus. These features may support applications such as institutional profiling, the identification of emerging areas, and the exploration of collaboration structures. However, this study does not include user studies or standardized usability measurements, and the current evidence should therefore be interpreted as support for exploratory utility rather than as an exhaustive validation of usability.

3. Results and Discussion

The system supports multiple downstream use cases, including the following:

Identifying influential institutions based on centrality in institutional networks.
Mapping interdisciplinary clusters via keyword co-occurrence and topic overlaps.
Tracking the emergence of new subfields through citation bar charts and temporal mappings.
Performing comparative field analyses via country-level distribution pie charts.
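The first use case can be illustrated with a small centrality computation. The source does not specify which centrality measure is used, so the Python sketch below assumes normalized degree centrality on an institution co-affiliation network, in which two institutions are linked when they appear on the same work. The function name and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set


def institution_degree_centrality(works_institutions: Iterable[List[str]]) -> Dict[str, float]:
    """Normalized degree centrality: distinct co-affiliated partners divided by (n - 1)."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for institutions in works_institutions:
        unique = sorted(set(institutions))
        for name in unique:
            neighbours[name]  # register the node even if it has no partners
        for a, b in combinations(unique, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {name: 0.0 for name in neighbours}
    return {name: len(partners) / (n - 1) for name, partners in neighbours.items()}


# Hypothetical data: one list of institution names per work.
toy_affiliations = [["U1", "U2"], ["U1", "U3"], ["U4"]]
centrality = institution_degree_centrality(toy_affiliations)
for name, value in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.2f}")
```

Other measures, such as betweenness or eigenvector centrality, would rank institutions differently and may be more appropriate when brokerage between clusters is of interest.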

These scenarios can be translated into more concrete practical applications. At the institutional level, the framework can be used to map an organization’s research profile by identifying dominant thematic areas, collaborative networks, and high-visibility publications. It can also aid in the assessment of development opportunities, for example by detecting emerging subfields or areas where international collaboration is low. At the research policy level, the results can contribute to supporting decisions on prioritizing funding allocations, supporting certain areas or monitoring the evolution of strategic themes over time.

An illustrative example of use is the analysis of an institutional or thematic corpus in order to identify collaboration structures and the distribution of dominant themes. In such a scenario, the dashboards generated by IllustryFlow allow for a quick transition from an aggregate view of thematic clusters to the identification of the authors, institutions or publications that support these structures. This functionality is particularly relevant for internal assessments, strategic positioning exercises or preliminary exploratory analyses necessary to define research priorities.

Figure 7 presents the fine-grained topic classification generated from OpenAlex data and visualized with Illustry. The system provides a replicable and extensible framework for exploratory bibliometrics, scalable across domains, languages, and dataset sizes.

3.1. Comparative Analysis: IllustryFlow vs. VOSviewer

Bibliometric visualization tools continue to evolve, and different systems approach the task in different ways. VOSviewer v1.6.20 and the recently introduced IllustryFlow represent two approaches with distinct strengths and limitations. This section compares the two systems from the perspective of visualization logic, workflow integration, and thematic organization.

3.1.1. Overview of Tools

VOSviewer v1.6.20, developed by van Eck and Waltman at CWTS Leiden University, is a widely adopted tool for constructing and visualizing bibliometric networks. It supports co-authorship, citation, keyword co-occurrence, and bibliographic coupling visualizations based on Scopus or Web of Science data exports [8]. Its strength lies in its ease of use, interactive network layouts (VOS clustering), and deep integration with CSV-based datasets.

IllustryFlow, on the other hand, is a programmatically integrated dashboard generation engine built on top of the n8n orchestration platform. It accepts enriched JSON datasets via custom nodes (e.g., ArticleToIllustry) and pushes data to a local Illustry backend, which renders complex dashboards (including time series, hierarchical graphs, word clouds, and citation charts) based on the enhanced OpenAlex topic classification.

VOSviewer v1.6.20 offers highly interactive and customizable network maps based on co-occurrence matrices. Its VOS layout algorithm effectively preserves cluster proximity and size relationships, making it ideal for exploratory network analyses [26]. However, the tool is primarily centered on established network-based visualizations and usually requires manual preprocessing and filtering of datasets. In contrast, IllustryFlow supports a broader dashboard-based environment for bibliometric exploration.

The fundamental difference between the two tools is not in the typology and design of the visualizations, but in the processing architecture and the level of semantic integration. VOSviewer v1.6.20 is designed for classical exploration of bibliometric networks based on predefined bibliometric relations and predominantly manual workflows [26], whereas IllustryFlow is oriented towards automated scenarios in which data ingestion, semantic classification, and dashboard generation are integrated into a unified flow. In this context, IllustryFlow incorporates article-level topic assignment through the OpenAlex-enhanced BERT model [17], allowing thematic organization to be informed by textual, citation-based, and source-level signals [13,14,17].

3.1.2. Automation and Workflow Integration

From the perspective of operational integration and automation of the bibliometric workflow, the differences between the two approaches become clearer, and the proposed framework can offer two functional advantages over existing approaches. First, it reduces manual intervention by automating the steps of ingestion, filtering, and publishing results. Second, it moves beyond exclusive reliance on co-occurrence relationships by integrating a finer semantic classification based on the OpenAlex-enhanced BERT model, which is adopted from the OpenAlex infrastructure and connected here to the automated analysis workflow. The critical distinction lies in automation: VOSviewer v1.6.20 is a standalone application with limited batch scripting support that requires manual intervention for preprocessing, clustering, and visualization exports, and it lacks direct integration with APIs or CI/CD pipelines.

In contrast, IllustryFlow is natively modular, with nodes built in TypeScript inside the n8n workflow engine. The ArticleToIllustry node consumes WorkResponse objects based on the OpenAlex schema, processes them through a topic classification layer based on the previously developed OpenAlex-enhanced BERT model, and automatically generates dashboards through RESTful API calls.
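To make the API-driven ingestion step more concrete, the Python sketch below builds a request URL for the public OpenAlex works endpoint using its documented `search`, `filter`, `select`, `per-page`, and `cursor` parameters. It only constructs the URL and does not send a request. The function name, the chosen filters, and the selected fields are assumptions for illustration; the source does not describe the exact queries issued by the OpenAlexFetcher node.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

OPENALEX_WORKS_ENDPOINT = "https://api.openalex.org/works"


def build_works_url(search: str, from_date: str, to_date: str,
                    per_page: int = 200, cursor: str = "*") -> str:
    """Build a cursor-paginated OpenAlex works query (dates in YYYY-MM-DD format)."""
    if not 1 <= per_page <= 200:
        raise ValueError("OpenAlex accepts between 1 and 200 results per page")
    params = {
        "search": search,
        "filter": f"from_publication_date:{from_date},to_publication_date:{to_date}",
        # Request only the fields needed for clustering and visualization.
        "select": "id,title,publication_year,cited_by_count,primary_topic,authorships",
        "per-page": per_page,
        "cursor": cursor,  # "*" starts cursor paging; later pages use meta.next_cursor
    }
    return f"{OPENALEX_WORKS_ENDPOINT}?{urlencode(params)}"


url = build_works_url("bibliometric analysis", "2020-01-01", "2024-12-31")
query = parse_qs(urlsplit(url).query)
print(url)
print(query["filter"][0])
```

Cursor paging is generally preferable to page numbers for corpora of this size, because each response returns the cursor for the next page until no further results remain.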

Thus, IllustryFlow is programmatically composable, enabling the following:

Scalable containerized deployments.
Batch automation of projects.
Integration with semantic filtering logic (e.g., OpenAlexFetcher node).

VOSviewer v1.6.20 relies heavily on co-occurrence matrices and clustering based on frequency and association strength. While effective for certain exploratory tasks, this approach lacks semantic grounding and may conflate unrelated terms with high co-mentions.

In contrast, IllustryFlow incorporates a topic classification layer based on OpenAlex’s previously developed Enhanced BERT model [17], trained on over 70 million labeled records using citation embeddings, multilingual abstracts, and journal-based transformer vectors.

This semantic model allows for the following:

Fine-grained topic classification.
Hierarchical clustering by domain → field → subfield.
Improved disambiguation in multilingual or sparse-text scenarios [18,27].
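The domain → field → subfield hierarchy mentioned in the list above can be reconstructed from the primary topic attached to each work. The Python sketch below assumes the OpenAlex convention in which a topic object carries nested `domain`, `field`, and `subfield` entries with a `display_name`; the function name and the toy labels are hypothetical and do not correspond to actual taxonomy entries.

```python
from typing import Dict, List

Hierarchy = Dict[str, Dict[str, Dict[str, int]]]


def group_by_hierarchy(works: List[dict]) -> Hierarchy:
    """Count works per domain -> field -> subfield, based on each work's primary topic."""
    tree: Hierarchy = {}
    for work in works:
        topic = work.get("primary_topic")
        if not topic:
            continue  # unclassified works are left out of the hierarchy
        domain = topic["domain"]["display_name"]
        field = topic["field"]["display_name"]
        subfield = topic["subfield"]["display_name"]
        subfields = tree.setdefault(domain, {}).setdefault(field, {})
        subfields[subfield] = subfields.get(subfield, 0) + 1
    return tree


def _toy_topic(domain: str, field: str, subfield: str) -> dict:
    """Hypothetical topic object with only the hierarchy levels used above."""
    return {
        "domain": {"display_name": domain},
        "field": {"display_name": field},
        "subfield": {"display_name": subfield},
    }


hierarchy_works = [
    {"primary_topic": _toy_topic("Domain A", "Field A1", "Subfield A1a")},
    {"primary_topic": _toy_topic("Domain A", "Field A1", "Subfield A1a")},
    {"primary_topic": _toy_topic("Domain A", "Field A2", "Subfield A2a")},
    {"primary_topic": _toy_topic("Domain B", "Field B1", "Subfield B1a")},
    {"primary_topic": None},
]
hierarchy = group_by_hierarchy(hierarchy_works)
print(hierarchy)
```

A nested count of this kind maps naturally onto hierarchical views such as treemaps or sunburst charts.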

This design makes IllustryFlow better suited for automated, scalable, field-normalized classification in diverse corpora.

Compared with traditional bibliometric approaches based primarily on co-occurrence relations, IllustryFlow supports a more fine-grained and interpretable analysis by combining multimodal semantic classification with interactive visual exploration. At the article level, topic assignment is not derived solely from lexical proximity, but from semantic, citation-based, and source-level signals, which enables the construction of thematic clusters that are more coherent and more analytically meaningful. In contrast to clusterings based purely on co-occurrence or lexical similarity, this approach captures both semantic relationships and citation structures between publications, leading to groupings that are more suitable for identifying research areas, interdisciplinary overlaps, and thematic evolution over time. In addition, the dashboard-based visualization environment facilitates the exploration of relationships among authors, institutions, concepts, and publication years in a more integrated and readable manner. These differences should be understood as functional and architectural advantages of the proposed framework rather than as definitive experimental evidence of superiority over other tools.

3.2. Discussion

The results indicate that, under the conditions examined in this study, the proposed framework can support the operational exploration of heterogeneous bibliographic corpora. In particular, the flow allows for the maintenance of a functional thematic classification and the generation of useful visual structures for analyzing the relationships between authors, institutions, concepts and citation dynamics. However, these results must be interpreted in relation to the exploratory design of the study and the nature of the indicators used. Accordingly, the contribution of the present work should be understood primarily in terms of operational feasibility and exploratory value, rather than as evidence of general superiority over other tools.

The interpretation of the results also depends on the type of indicators used. Confidence scores reflect the internal consistency of the classification and do not constitute an external validation of the performance in the absence of labeled sets and metrics such as accuracy or F1-score. Consequently, Table 1 should be interpreted descriptively, as a distribution of operational confidence, and not as a strict comparative assessment of the classifier’s performance. The small size of the WoS branch requires cautious interpretation of the differences observed between streams.

The additional empirical analyses provide a more nuanced picture of the framework’s behavior. The runtime comparison across dataset sizes suggests practical operational scalability under the analyzed experimental conditions, while the comparison across the most frequent thematic categories indicates that classification confidence varies across topics. Together, these findings suggest that the framework’s behavior is influenced both by corpus size and by thematic composition, which should be taken into account when interpreting its analytical utility.

An important limitation concerns the visualization component, the evaluation of which remains mainly functional. Although the interactive capabilities of the platform, including filtering, support the exploratory utility of the Illustry environment, this study does not include user studies or experimental evaluations of performance in concrete analytical tasks. Consequently, the results should be understood as evidence of feasibility and exploratory utility rather than as an exhaustive validation of usability.

An additional relevant aspect concerns the possible disciplinary and temporal biases of the analyzed corpus. Fields with more complete metadata and higher citation density may benefit from more stable classification, whereas emerging or interdisciplinary areas may show greater variability. Similarly, temporal distribution influences network structure and citation dynamics because recent works have had less time to accumulate citations. These aspects do not invalidate the results, but they limit the scope of generalization and should be taken into account when interpreting the proposed framework.

As presented in Table 4, both VOSviewer v1.6.20 and IllustryFlow fulfill critical needs in bibliometric analysis but differ fundamentally in design philosophy. VOSviewer v1.6.20 excels in static network exploration and citation clustering with minimal setup, whereas IllustryFlow provides a modern, API-driven approach suited for real-time, large-scale, and semantically enriched bibliometric dashboards.

For use cases involving programmatic scalability, deep semantic classification, and dynamic dashboards, IllustryFlow offers clear advantages over traditional tools. However, VOSviewer v1.6.20 remains a valid choice for exploratory analyses, especially when integrated into manual research workflows or teaching environments.

4. Conclusions

This study investigated the feasibility of an integrated framework for bibliometric analysis, which combines automated data ingestion, semantic classification and interactive visualization in a unified workflow. The results suggest that the proposed approach can support the analysis of heterogeneous bibliometric corpora by connecting the stages of data collection, semantic enrichment, thematic organization and visual exploration in a coherent and reproducible process. In this way, the framework reduces the fragmentation specific to traditional bibliometric workflows and provides a more consistent basis for exploratory analysis.

The current evaluation suggests that the system may support operational thematic classification under varying conditions of metadata completeness, while providing an interactive visual environment for exploring collaboration networks, thematic distributions, and citation dynamics. From an application perspective, such a flow may be relevant for institutional profiling, identifying collaboration structures, and monitoring emerging research areas, especially in contexts where bibliometric analysis is used for evaluation and strategic planning.

At the same time, the results must be interpreted in light of the limitations of the current evaluation. In particular, the absence of large-scale benchmark datasets, ablation studies, and external reference labels limits the strength of conclusions regarding classification performance. In addition, variations in execution time associated with the size of the datasets, as well as possible disciplinary and temporal biases of the analyzed corpus, may influence cluster structure and citation dynamics.

Overall, the results obtained should be interpreted as evidence of the feasibility and exploratory utility of the proposed framework under the analyzed conditions, and not as an exhaustive validation of classifier performance or as a demonstration of superiority over other tools. The main contribution of this study lies in the automated and reproducible integration of data ingestion, semantic classification, and interactive visualization into a unified workflow, while more extensive empirical validations remain necessary to support stronger claims regarding performance and generality.

Future research directions should include extended empirical validations on larger and more diverse bibliographic collections, ablation analyses for semantic classification components, and systematic evaluations of the visualization layer through user studies and task-oriented comparisons with existing tools. These extensions would allow for a more precise assessment of the robustness, analytical value, and generalizability of the proposed framework.

Author Contributions

Conceptualization, R.D.N.-A. and V.P.M.; methodology, V.N.-A., R.D.N.-A. and V.P.M.; software, V.N.-A.; validation, V.N.-A., R.D.N.-A. and V.P.M.; formal analysis, V.N.-A.; investigation, R.D.N.-A. and V.P.M.; resources, R.D.N.-A. and V.P.M.; data curation, V.N.-A.; writing—original draft preparation, V.N.-A., R.D.N.-A. and V.P.M.; writing—review and editing, R.D.N.-A. and V.P.M.; visualization, V.N.-A.; supervision, R.D.N.-A. and V.P.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the West University of Timișoara, Romania, through the grant “Research Career Guidance and Counseling Center – Western Region”, funded by the Romanian Ministry of Research, Innovation and Digitalization via Romania’s National Recovery and Resilience Plan, Call No. PNRR-III-C9-2022-I10.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors are grateful to the editors and the anonymous reviewers for their guidance and valuable recommendations that helped to improve this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Alan, P. Statistical Bibliography or Bibliometrics? J. Doc. 1969, 25, 348–349. [Google Scholar]
Bornmann, L.; Mutz, R. Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references: Growth Rates of Modern Science: A Bibliometric Analysis Based on the Number of Publications and Cited References. J. Assoc. Inf. Sci. Technol. 2014, 66, 2215–2222. [Google Scholar] [CrossRef]
Hicks, D.; Wouters, P.; Waltman, L.; de Rijcke, S.; Rafols, I. The Leiden Manifesto for research metrics. Nature 2015, 520, 429–431. [Google Scholar] [CrossRef] [PubMed]
van Raan, A.F.J. Chapter 1 MEASURING SCIENCE CAPITA SELECTA OF CURRENT MAIN ISSUES. In Handbook of Quantitative Science and Technology Research; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2004. [Google Scholar]
Massimo, A. bibliometrix: An R-tool for comprehensive science mapping analysis. J. Informetr. 2017, 11, 959–975. [Google Scholar] [CrossRef]
Gandomi, A.; Haider, M. Beyond the hype: Big data concepts, methods, and analytics. Int. J. Inf. Manag. 2015, 35, 137–144. [Google Scholar] [CrossRef]
Chaomei, C. Science Mapping: A Systematic Review of the Literature. J. Data Inf. Sci. 2017, 2, 1–40. [Google Scholar] [CrossRef]
Waltman, L.; van Eck, N.J. A new methodology for constructing a publication-level classification system of science. J. Am. Soc. Inf. Sci. Technol. 2012, 63, 2378–2392. [Google Scholar] [CrossRef]
Available online: https://docs.openalex.org/ (accessed on 5 January 2026).
Available online: https://github.com/n8n-io/n8n (accessed on 5 January 2026).
Haupka, N.; Culbert, J.H.; Schniedermann, A.; Jahn, N.; Mayr, P. Analysis of the Publication and Document Types in OpenAlex, Web of Science, Scopus, PubMed and Semantic Scholar. Quant. Sci. Stud. 2025, 7, 179–194. [Google Scholar] [CrossRef]
Available online: https://impulsivelabs.github.io/Illustry-monorepo/de/ (accessed on 5 January 2026).
Priem, J.; Piwowar, H.; Orr, R. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv 2022, arXiv:2205.01833. [Google Scholar] [CrossRef]
Li, D.; Mei, H.; Shen, Y.; Su, S.; Zhang, W.; Wang, J.; Zu, M.; Chen, W. ECharts: A declarative framework for rapid construction of web-based visualization. Vis. Inform. 2018, 2, 136–146. [Google Scholar] [CrossRef]
Traag, V.A.; Waltman, L.; Van Eck, N.J. From Louvain to Leiden: Guaranteeing well-connected communities. Sci. Rep. 2019, 9, 5233. [Google Scholar] [CrossRef] [PubMed]
Gusenbauer, M.; Endermann, J.; Huber, H.; Strasser, S.; Granitzer, A.N.; Ströhle, T. Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections. Scientometrics 2025, 131, 2401–2438. [Google Scholar] [CrossRef]
Wolff, B.; Seidlmayer, E.; Förstner, K.U. Enriched BERT Embeddings for Scholarly Publication Classification. In International Workshop on Natural Scientific Language Processing and Research Knowledge Graphs; Springer Nature: Cham, Switzerland, 2024; pp. 234–243. [Google Scholar] [CrossRef]
Perianes-Rodriguez, A.; Waltman, L.; van Eck, N.J. Constructing bibliometric networks: A comparison between full and fractional counting. J. Informetr. 2016, 10, 1178–1195. [Google Scholar] [CrossRef]
Cheng, W.; Zheng, D. Integrating semantic clustering and citation analysis to construct a topic citation network for characterizing scholars’ contributions: A case study of price prize laureates. Inf. Dev. 2025. [Google Scholar] [CrossRef]
Börner, K.; Chen, C.; Boyack, K.W. Visualizing Knowledge Domains. Annu. Rev. Inf. Sci. Technol. 2005, 37, 179–255. [Google Scholar] [CrossRef]
Chaomei, C. CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. J. Am. Soc. Inf. Sci. Technol. 2006, 57, 359–377. [Google Scholar] [CrossRef]
Gusenbauer, M.; Haddaway, N.R. Which Academic Search Systems are Suitable for Systematic Reviews or Meta-Analyses? Evaluating Retrieval Qualities of Google Scholar, PubMed and 26 other Resources. Res. Synth. Methods 2020, 11, 181–217.
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186.
Small, H. Co-Citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents. J. Am. Soc. Inf. Sci. 1973, 24, 265–269.
Newman, M.E.J. The structure of scientific collaboration networks. Proc. Natl. Acad. Sci. USA 2001, 98, 404–409.
van Eck, N.J.; Waltman, L. Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics 2010, 84, 523–538.
Uğuz, S.; Tülü, Ç.N. Topic Modeling Analysis in the Field of Large Language Models with BERTopic (2020–2024). In Proceedings of the 2024 Innovations in Intelligent Systems and Applications Conference (ASYU), Ankara, Turkiye, 16–18 October 2024.

Figure 1. Illustry backend architecture for bibliometric data processing, including data ingestion, preprocessing, metadata normalization, storage, and support services required for subsequent analytical and visualization tasks. The figure highlights the main backend components involved in transforming raw bibliographic records into structured data prepared for semantic analysis and dashboard generation.

Figure 2. Illustry frontend for interactive bibliometric exploration, showing the web-based environment used to examine authorship patterns, citation structures, publication networks, and thematic groupings through dashboard-based visual analysis. The figure illustrates how users can explore multiple bibliometric dimensions through interactive visual components and filtering operations.

Figure 3. The modular n8n workflow integrating bibliographic ingestion from Web of Science, record matching and semantic enrichment through OpenAlex, fallback topic classification when no valid OpenAlex match is found, and visual publishing to Illustry within a reproducible analytical pipeline. The figure highlights the main processing stages and the semantic fallback decision point.

Figure 4. Web of Science ingestion via the Wos node.

Figure 5. Semantic filtering configuration in the OpenAlexFetcher node.

Figure 6. Visual analytics upload via the ArticleToIllustry node.

Figure 7. Example of an Illustry dashboard generated from OpenAlex-enriched bibliometric data, illustrating the interactive exploration of thematic clusters, collaboration structures, and citation-related patterns within the analyzed corpus. The dashboard supports transitions between aggregate patterns and article-, author-, or institution-level exploration.

Table 1. Comparative evaluation of topic classification confidence across two data streams.

Metric	OpenAlexFetcher	Wos + BERT Classifier
Total Articles	1756	42
Articles with Primary Topics	1756	42
Unique Primary Topics	462	23
Average Topic Confidence	0.8728	0.7938
Median Topic Confidence	0.9708	0.8228
Min Topic Confidence	0.0416	0.3846
Max Topic Confidence	1.0000	0.9664

Table 2. Comparative evaluation of pipeline execution time across different dataset sizes.

Dataset Size	Pipeline	Total Execution Time (s)	Average Time per 100 Records (s)
250	Wos → OpenAlexFetcher → ArticleToIllustry	46	18.4
500	Wos → OpenAlexFetcher → ArticleToIllustry	50	10.0
1000	Wos → OpenAlexFetcher → ArticleToIllustry	69	6.9
1756	Wos → OpenAlexFetcher → ArticleToIllustry	90	5.1
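
The last column of Table 2 follows directly from the first and third columns: total time divided by dataset size, scaled to 100 records. The short Python sketch below reproduces that arithmetic from the reported values. The function name `time_per_100` and the rounding to one decimal place are illustrative assumptions, not part of the IllustryFlow code base.

```python
# Dataset sizes and total execution times (seconds) as reported in Table 2.
runs = [(250, 46), (500, 50), (1000, 69), (1756, 90)]


def time_per_100(records: int, total_seconds: float) -> float:
    """Average execution time per 100 records, rounded to one decimal.

    Hypothetical helper for illustration; the rounding precision is an
    assumption chosen to match the precision shown in the table.
    """
    return round(total_seconds * 100 / records, 1)


per_100 = [time_per_100(n, t) for n, t in runs]

for (n, t), avg in zip(runs, per_100):
    print(f"{n:>5} records: {t:>3} s total, {avg:.1f} s per 100 records")
```

The per-100-record cost falls as the dataset grows, which would be consistent with a fixed start-up overhead being spread over more records, although the table alone does not establish the cause.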

Table 3. Comparison of topic classification confidence across ten most frequent thematic categories.

Primary Topic/Thematic Category	Number of Articles	Mean Confidence	Median Confidence
Big Data and Business Intelligence	343	0.781	0.944
Business Process Modeling and Analysis	112	0.890	0.993
Business and Economic Development	87	0.776	0.939
Digital Transformation in Industry	77	0.824	0.962
Data Visualization and Analytics	65	0.822	0.975
Economic and Technological Systems Analysis	59	0.759	0.926
Economic and Business Development Strategies	56	0.733	0.924
Service-Oriented Architecture and Web Services	55	0.935	0.990
Information Technology Governance and Strategy	48	0.787	0.951
Sustainable Supply Chain Management	43	0.891	0.989

Table 4. Feature comparison between VOSviewer and IllustryFlow.

Feature	VOSviewer	IllustryFlow
Interactivity	High (static app)	Very High (web dashboards)
Topic Classification	Co-occurrence only	Multilingual BERT-based
Automation	Minimal	Full CI/CD support
Scalability	Limited to desktop processing	Containerized & distributed
Data Support	CSV/WoS/Scopus	JSON (OpenAlex schema), TSV
Filtering	Manual/visual	Semantic + visual via Illustry GUI
Output	PNG/PDF/GraphML	ECharts dashboards (interactive)
