2.2. System Architecture and Modular Design
To address the challenges posed by large-scale bibliometric analysis, we developed a fully automated and modular framework built atop the n8n orchestration engine. The proposed framework integrates three custom-developed nodes (Wos, OpenAlexFetcher, and ArticleToIllustry), each purpose-built to perform advanced data ingestion, filtering, classification, and visualization for scientific publications [13, 16, 17].
To avoid terminological ambiguities, in this paper, several key terms are used with clearly defined meanings. The term framework refers to the proposed framework as a whole, i.e., the conceptual and operational integration of the components used for data ingestion, semantic classification and visual exploration. The term platform is reserved for the existing software systems used in the proposed framework, in particular n8n and Illustry. The term workflow refers to the orchestrated sequence of automated steps implemented in n8n, while pipeline refers to the technical flow of data processing and transformation between the main stages of the analysis. The term architecture is used to describe the internal structure of the system or a component, and node refers to a modular execution unit in n8n, such as Wos, OpenAlexFetcher or ArticleToIllustry. This convention is maintained throughout the manuscript to avoid conceptual overlaps and to make the contribution of this paper clearer.
From a methodological perspective, the contribution of the proposed framework lies both in the development of specific components and in their integration into a modular architecture with clearly delimited analytical functions. More precisely, the originality of this section lies in the design of an orchestration logic through which data ingestion, semantic enrichment, thematic classification and visual publishing are connected in a reproducible flow. This contribution is relevant because, in bibliometric practice, these stages are often carried out separately, using non-integrated tools and requiring considerable manual intervention.
Operationally, the proposed workflow can be understood as a sequence of six connected stages. First, the user provides a Web of Science export in TSV format, together with search and filter parameters, where applicable. Second, the Wos node parses the file and extracts the basic bibliographic fields needed for the next steps. Third, each record is checked in OpenAlex by a title-oriented search and additional author-based validation; if no valid match is obtained, the workflow switches to a fallback branch, where the record is semantically classified by the OpenAlex-enhanced BERT service. Fourth, the OpenAlexFetcher node retrieves, enriches, and semantically filters candidate records from OpenAlex based on explicit inclusion and exclusion rules. Fifth, the ArticleToIllustry node transforms the filtered bibliographic objects into analytical structures, such as thematic clusters, co-authorship networks, institutional graphs, temporal visualizations, and conceptual summaries. Finally, these results are published via the Illustry API as interactive dashboards for bibliometric exploration. This sequential description is intended to make explicit the data path, decision points, and input/output logic corresponding to the flow illustrated in Figure 3.
Figure 3 synthesizes the operational logic of the proposed workflow, highlighting the main processing steps, the decision point associated with the semantic fallback branch, and the transition from bibliographic ingestion to semantic enrichment and visual publishing. A detailed description of the six steps is presented in the text.
Because the workflow is parameterized, the thematic values of query, relevantTerms, excludeTerms, and maxArticles may vary across analytical scenarios. However, the retrieval, filtering, linking, and fallback logic remained fixed across runs and is reported explicitly here to ensure procedural reproducibility.
2.2.1. Data Ingestion via Web of Science (Wos Node)
The Wos node serves as the entry point of the workflow and is responsible for transforming the Web of Science bibliographic export into a structure that can be processed within the proposed pipeline. At the input level, the node receives a TSV file and extracts the essential bibliographic fields used in the following steps, including title (TI), author (AU), abstract (AB), source/journal (SO), year of publication (PY), and number of citations (TC), as illustrated in Figure 4. These fields are then normalized and prepared for the semantic linking and classification step.
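The extraction step can be sketched in a few lines of Python. This is only an illustration of the field mapping described above, not the node's actual implementation (which runs inside n8n): the output key names, the semicolon author separator, and the function name parse_wos_tsv are assumptions introduced for the example.

```python
import csv
import io

# WoS field tags named in the text, mapped to illustrative (assumed) key names.
FIELD_MAP = {
    "TI": "title",
    "AU": "authors",
    "AB": "abstract",
    "SO": "source",
    "PY": "year",
    "TC": "citations",
}


def parse_wos_tsv(text: str) -> list[dict]:
    """Parse a tab-delimited WoS export into normalized record dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        rec = {name: (row.get(tag) or "").strip() for tag, name in FIELD_MAP.items()}
        # Assumption: authors are separated by ';' in the AU column.
        rec["authors"] = [a.strip() for a in rec["authors"].split(";") if a.strip()]
        rec["year"] = int(rec["year"]) if rec["year"].isdigit() else None
        rec["citations"] = int(rec["citations"]) if rec["citations"].isdigit() else 0
        records.append(rec)
    return records


# Toy input, invented for illustration only.
SAMPLE = (
    "TI\tAU\tAB\tSO\tPY\tTC\n"
    "A Study of Dashboards\tDoe, J; Roe, R\tWe study dashboards.\tJ. Vis.\t2021\t7\n"
)
RECORDS = parse_wos_tsv(SAMPLE)
```

Missing or non-numeric year and citation values are mapped to None and 0 here; a real export may call for a different convention.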
Operationally, the node implements an end-to-end strategy for identifying records in OpenAlex, directly using the Works entity query via the title_and_abstract.search field, with descending ordering by relevance_score. For each eligible Wos record, a title-oriented OpenAlex search is generated, and the final selection is based not only on the relevance score, but also on strict bibliographic validation rules.
Before the matching step, the node applies an internal deduplication logic. Records are compared to a set of already processed titles, normalized to lowercase, to avoid introducing duplicates into the current execution. In addition, if essential fields, such as title or authors, are missing, the record is not considered eligible for direct bibliographic linking and is redirected to the semantic fallback branch, preserving the available metadata.
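A minimal sketch of this routing logic is given below. The record keys and the function name route_records are hypothetical, and the order of the two checks (eligibility first, then deduplication) is an assumption, since the text does not specify it.

```python
def route_records(records):
    """Split records into (eligible, fallback), skipping duplicate titles."""
    seen_titles = set()
    eligible, fallback = [], []
    for rec in records:
        title = (rec.get("title") or "").strip()
        authors = rec.get("authors") or []
        if not title or not authors:
            # Essential field missing: keep the metadata, use the semantic fallback.
            fallback.append(rec)
            continue
        key = title.lower()  # titles are compared in lowercase
        if key in seen_titles:
            continue  # duplicate within the current execution
        seen_titles.add(key)
        eligible.append(rec)
    return eligible, fallback


# Toy records, invented for illustration only.
ELIGIBLE, FALLBACK = route_records([
    {"title": "Visual Analytics", "authors": ["Doe, J"]},
    {"title": "visual analytics", "authors": ["Doe, J"]},  # duplicate title
    {"title": "No Authors Here", "authors": []},  # goes to fallback
    {"title": "", "authors": ["Roe, R"], "abstract": "kept"},  # goes to fallback
])
```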
Wos-to-OpenAlex linking is performed in two steps. First, the node retrieves a list of candidate records from OpenAlex, ordered by relevance. Second, match validation is based on a combined title + author criterion. More precisely, a candidate is accepted as a valid match only if the title in OpenAlex matches the title in Wos through a case-insensitive comparison and if there is at least a plausible match between the surnames of the authors in Wos and those extracted from authorships.author.display_name in OpenAlex. In this way, the connection stage aims to reduce false matches and keep only bibliographically consistent records.
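The validation rule can be illustrated as follows. The text does not define what counts as a "plausible" surname match, so this sketch assumes exact equality of lowercased surnames and a simple heuristic for extracting them; the helper names are hypothetical, while title and authorships.author.display_name follow the OpenAlex Works schema.

```python
def surname(name: str) -> str:
    """Heuristic surname: text before a comma ('Doe, J'), else the last token ('Jane Doe')."""
    name = name.strip()
    if "," in name:
        return name.split(",", 1)[0].strip().lower()
    parts = name.split()
    return parts[-1].lower() if parts else ""


def is_valid_match(wos_title, wos_authors, candidate) -> bool:
    """Accept a candidate only on title equality plus at least one shared surname."""
    oa_title = (candidate.get("title") or "").strip().lower()
    if wos_title.strip().lower() != oa_title:
        return False
    wos_surnames = {surname(a) for a in wos_authors} - {""}
    oa_surnames = {
        surname((a.get("author") or {}).get("display_name") or "")
        for a in candidate.get("authorships", [])
    } - {""}
    return bool(wos_surnames & oa_surnames)


# Toy candidate in the shape of an OpenAlex work (values invented).
CANDIDATE = {
    "title": "a study of dashboards",
    "authorships": [{"author": {"display_name": "Jane Doe"}}],
}
```

Under this rule, a candidate with an identical title but no overlapping surname is rejected, which is the behavior the two-step validation is meant to enforce.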
When a record cannot be validly identified in OpenAlex, the workflow activates a semantic fallback branch. In this case, the node sends an HTTP POST request to the BERT-based classification service via the configurable endpoint <classifierServerUrl>/invocations. The payload is built from the available abstract, represented as abstract_inverted_index, to which the journal name and other useful metadata are added when available. The result returned by the classifier is then converted into an object compatible with the OpenAlex schema, so that the record can continue through the same analytical flow as those retrieved directly from OpenAlex [9, 13, 17].
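The construction of the fallback request might look like the sketch below. Only the /invocations path and the abstract_inverted_index field come from the description above; the other payload keys (title, journal), the whitespace tokenization, and the function names are assumptions made for illustration. The request is built but not sent.

```python
import json
import urllib.request


def to_inverted_index(abstract: str) -> dict[str, list[int]]:
    """Map each token to the list of positions at which it occurs."""
    index: dict[str, list[int]] = {}
    for position, token in enumerate(abstract.split()):
        index.setdefault(token, []).append(position)
    return index


def build_fallback_request(classifier_server_url: str, record: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for the classification service."""
    payload = {
        "abstract_inverted_index": to_inverted_index(record.get("abstract", "")),
        # Hypothetical extra metadata keys; the real payload schema may differ.
        "title": record.get("title", ""),
        "journal": record.get("source", ""),
    }
    return urllib.request.Request(
        url=classifier_server_url.rstrip("/") + "/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL and toy record, invented for illustration only.
REQUEST = build_fallback_request(
    "http://localhost:8080/",
    {"title": "T", "abstract": "data about data", "source": "J"},
)
```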
The output of the node therefore consists of a collection of WorkResponse objects, obtained either by bibliographic matching validated with OpenAlex or by fallback semantic classification. From a methodological point of view, this component plays an essential role because it combines deterministic ingestion of bibliographic metadata with an explicit connection logic and a fallback mechanism for cases where the bibliographic signal is incomplete. Consequently, the Wos node does not only parse the initial input, but also acts as a controlled gateway into the broader workflow, ensuring the consistency of the data structure forwarded to the subsequent stages.
For reproducibility, the OpenAlex retrieval step should be understood as a parameterized query process over the Works entity. In the current implementation, candidate records are requested through title_and_abstract.search, ordered by relevance_score in descending order, with cursor pagination and batches of 100 records per page. The key parameters are query, relevantTerms, excludeTerms, and maxArticles. Inclusion is based on the presence of the query and relevant thematic terms, whereas exclusion is triggered by the presence of predefined excludeTerms during semantic filtering. Wos-to-OpenAlex linking is performed by exact case-insensitive title equality plus at least one plausible author surname correspondence. No fuzzy matching based on edit distance or similarity thresholds is currently used.
2.2.2. OpenAlex Enrichment and Filtering (OpenAlexFetcher Node)
As illustrated in Figure 5, the OpenAlexFetcher node is responsible for the semantic retrieval and thematic filtering stage of candidate publications in OpenAlex. At the input level, the node receives a user-defined search expression together with semantic filtering parameters, including relevant-term lists, exclusion-term lists, and the maximum number of articles to be retrieved. Operationally, the node uses the openalex-ts client to query the Works entity in OpenAlex by means of the title_and_abstract.search field, using descending order by relevance_score, the is_oa=true filter, cursor-based pagination, and batches of 100 results per page.
A representative example of an OpenAlex query, equivalent to the logic used by the node, is as follows:
GET /works?filter=title_and_abstract.search:bibliometric analysis scientific visualization topic classification,is_oa:true&sort=relevance_score:desc&per-page=100&cursor=*
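For readers who want to reproduce the request outside n8n, the query and its cursor pagination can be sketched in Python. The node itself uses the openalex-ts client; the functions below are hypothetical stand-ins, and fetch_page is an injected callable (assumed to return the decoded JSON page) so that the example runs without network access.

```python
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = "https://api.openalex.org/works"


def build_works_url(query: str, cursor: str = "*", per_page: int = 100) -> str:
    """Build the Works query URL with URL-encoded parameters."""
    params = {
        "filter": f"title_and_abstract.search:{query},is_oa:true",
        "sort": "relevance_score:desc",
        "per-page": per_page,
        "cursor": cursor,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_all(query, fetch_page, max_articles):
    """Follow meta.next_cursor until it is empty or max_articles is reached."""
    works, cursor = [], "*"
    while cursor and len(works) < max_articles:
        page = fetch_page(build_works_url(query, cursor))
        results = page.get("results", [])
        if not results:
            break
        works.extend(results)
        cursor = (page.get("meta") or {}).get("next_cursor")
    return works[:max_articles]


# Fake two-page response used instead of a live API call.
PAGES = {
    "*": {"results": [{"id": "W1"}, {"id": "W2"}], "meta": {"next_cursor": "c1"}},
    "c1": {"results": [{"id": "W3"}], "meta": {"next_cursor": None}},
}


def fake_fetch(url):
    return PAGES[parse_qs(urlparse(url).query)["cursor"][0]]


ALL_WORKS = fetch_all("bibliometric analysis", fake_fetch, max_articles=10)
CAPPED = fetch_all("bibliometric analysis", fake_fetch, max_articles=2)
DECODED = parse_qs(urlparse(build_works_url("bibliometric analysis")).query)
```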
In practice, the exact value of the OpenAlex search expression is controlled by the query parameter, which allows the workflow to be adapted to different thematic scenarios. After retrieval, each publication is subjected to an explicit semantic filtering step based on the normalized content of its title, abstract, topics, and associated keywords. Normalization consists of lowercasing and the removal of formatting artifacts, so that the matching process is not affected by superficial orthographic variation.
The inclusion criteria are defined through the relevantTerms parameter as an explicit JSON object of thematic categories and associated terms. In the configuration used in this study, three categories were applied. The management category included the terms “management”, “business”, “organization”, “strategy”, “leadership”, “administration”, and “enterprise”. The decision_making category included “decision”, “decisions”, “decision-making”, “choice”, “judgment”, and “planning”. The visualization category included “visualization”, “visualizations”, “data visualization”, “visual”, “graph”, “chart”, “dashboard”, and “analytics”.
The exclusion criteria were defined through the excludeTerms parameter as the following list of terms: “medicine”, “medical”, “surgery”, “anesthesia”, “fetal”, “cancer”, “disease”, “biology”, “clinical”, and “healthcare”. Publications containing these exclusion terms in the analyzed textual fields were removed from the candidate set.
A publication was considered relevant only if it matched at least two distinct relevant thematic categories from the relevantTerms configuration. In the present implementation, this means that a work has to satisfy terms from at least two of the following categories: management, decision_making, and visualization. This rule was introduced to reduce false positives and retain only publications that were semantically consistent with the intended thematic scope.
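The inclusion and exclusion rules can be expressed compactly as shown below. The term lists are those reported above; whole-word matching, the tag-stripping normalization, and the function names are assumptions, since the text does not specify how terms are matched against the normalized content.

```python
import re

RELEVANT_TERMS = {
    "management": ["management", "business", "organization", "strategy",
                   "leadership", "administration", "enterprise"],
    "decision_making": ["decision", "decisions", "decision-making", "choice",
                        "judgment", "planning"],
    "visualization": ["visualization", "visualizations", "data visualization",
                      "visual", "graph", "chart", "dashboard", "analytics"],
}
EXCLUDE_TERMS = ["medicine", "medical", "surgery", "anesthesia", "fetal",
                 "cancer", "disease", "biology", "clinical", "healthcare"]


def normalize(text: str) -> str:
    """Lowercase, drop markup-like artifacts, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def contains(term: str, text: str) -> bool:
    # Assumption: terms are matched as whole words, not arbitrary substrings.
    return re.search(r"\b" + re.escape(term) + r"\b", text) is not None


def matched_categories(text: str) -> list[str]:
    text = normalize(text)
    return [cat for cat, terms in RELEVANT_TERMS.items()
            if any(contains(t, text) for t in terms)]


def is_relevant(text: str, min_categories: int = 2) -> bool:
    """Reject on any exclusion term; otherwise require two or more categories."""
    norm = normalize(text)
    if any(contains(t, norm) for t in EXCLUDE_TERMS):
        return False
    return len(matched_categories(text)) >= min_categories
```

For instance, a title such as "Dashboard design for business strategy" satisfies the visualization and management categories and is retained, whereas the same text with the word "clinical" added would be removed by the exclusion list.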
The key parameters of this stage include the search expression (query), relevant thematic categories (relevantTerms), exclusion terms (excludeTerms), maximum result limit (maxArticles), ordering by relevance_score, cursor pagination, and a batch size of 100 results per page. In terms of operational robustness, the node includes mechanisms for error handling and rate limiting, through controlled retries, backoff, and explicit handling of transient response codes, especially 429, 500, and 503.
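A retry policy of this kind could be sketched as follows. The retry count, the base delay, and the exponential schedule are assumptions (the text only states that controlled retries and backoff are used); request_fn is a hypothetical callable returning a status code and a body.

```python
import time

TRANSIENT_STATUS = {429, 500, 503}  # response codes named in the text


def call_with_retries(request_fn, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call request_fn() -> (status, body), retrying transient failures."""
    for attempt in range(max_retries + 1):
        status, body = request_fn()
        if status not in TRANSIENT_STATUS:
            return status, body
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # exponential backoff: 1 s, 2 s, 4 s, ...
    return status, body  # retries exhausted: return the last transient response


# Simulated sequence of responses; delays are recorded instead of slept.
_responses = iter([(429, None), (503, None), (200, "ok")])
DELAYS = []
RESULT = call_with_retries(lambda: next(_responses), sleep=DELAYS.append)
```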
The output of the node consists of a filtered and enriched set of bibliographic objects compatible with the rest of the flow, ready for final classification, analytical aggregation, and visual publishing. In this sense, OpenAlexFetcher is the component that transforms the raw retrieval from OpenAlex into an explicit, transparent, and reproducible thematic corpus [13, 17].
2.2.3. Topic Extraction and Visualization (ArticleToIllustry Node)
Figure 6 illustrates the ArticleToIllustry node that finalizes the analytical pipeline.
This component accepts a list of JSON objects and generates the visual and analytical outputs used for downstream exploration. Its functions include primary topic cluster extraction based on primary_topic frequency, entity graph construction for co-authorship and co-institution networks, country and time-series analysis using calendar heatmaps, semantic content summarization via word clouds for concepts and keywords, and citation trajectory modeling represented as multi-year bar charts.
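Two of these functions, topic frequency counting and co-authorship edge construction, can be illustrated with a short Python sketch. The input dictionaries follow the OpenAlex Works schema (primary_topic.display_name, authorships.author.display_name); the function names and the "Unknown" placeholder label are assumptions.

```python
from collections import Counter
from itertools import combinations


def topic_clusters(works):
    """Count works per primary topic."""
    return Counter(
        (w.get("primary_topic") or {}).get("display_name", "Unknown") for w in works
    )


def coauthor_edges(works):
    """Count how often each unordered pair of authors appears on the same work."""
    edges = Counter()
    for w in works:
        names = {
            (a.get("author") or {}).get("display_name")
            for a in w.get("authorships", [])
        } - {None}
        for pair in combinations(sorted(names), 2):
            edges[pair] += 1
    return edges


def _authors(*names):
    return [{"author": {"display_name": n}} for n in names]


# Toy corpus, invented for illustration only.
WORKS = [
    {"primary_topic": {"display_name": "T1"}, "authorships": _authors("A", "B")},
    {"primary_topic": {"display_name": "T1"}, "authorships": _authors("A", "B", "C")},
    {"primary_topic": None, "authorships": []},
]
CLUSTERS = topic_clusters(WORKS)
EDGES = coauthor_edges(WORKS)
```

The resulting counters map directly onto node sizes and edge weights in a network visualization; institution-level graphs would follow the same pattern with affiliation names in place of author names.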
All results are posted via a REST API to a self-hosted Illustry application, where they are rendered in Apache ECharts dashboards optimized for large-scale bibliometric data [12, 14]. This design supports interactive exploration, configurable visual summaries, and the extraction of institutionally relevant insights from the processed publication corpus.
2.3. Use of the OpenAlex-Enhanced BERT Model in the Proposed Framework
To ensure scalable and domain-sensitive topic classification in large-scale bibliometric workflows, this study adopts the OpenAlex classification framework, a multilingual, transformer-based architecture that integrates contextual, relational, and source-level embeddings to produce fine-grained semantic labels at the article level. This model has been shown to offer significant improvements over traditional journal-based heuristics in both accuracy and granularity, particularly for cross-disciplinary and multilingual corpora [13, 16]. The classification engine is centered on a fine-tuned multilingual BERT (mBERT) encoder, trained using more than 70 million labeled records derived from the OpenAlex graph and labeled by the CWTS Leiden Ranking taxonomy. Each training instance is annotated with topic labels drawn from a hierarchical topic graph of over 4000 nodes, structured by domain → field → subfield → topic, following a refined version of the Scopus ASJC taxonomy [18].
It is important to note that, in this research, the OpenAlex-enhanced BERT model was not developed or retrained by the authors, but is used as a pre-existing semantic model, adopted from the OpenAlex infrastructure. The purpose of this description is to explain the architectural logic and the types of information integrated by this existing model. From a reproducibility perspective, this paper describes the essential components of the model used, as reported in the OpenAlex sources and related literature, including the multilingual BERT encoder, the integration of textual, citation, and journal embeddings, and the use of the CWTS/OpenAlex hierarchical taxonomy. The complete training hyperparameters, the exact optimization procedure and the dataset splitting belong to the original OpenAlex model.
Consequently, the replicable dimension of this work mainly focuses on how this semantic classification is incorporated into an automated workflow of ingestion, filtering, classification and visualization, without considering the full retraining of the existing base model.
2.3.1. Multimodal Embedding Integration
To enhance classification fidelity across varying data sparsity levels, the model fuses three embedding modalities: text-based, citation, and journal embeddings.
For the text-based embeddings, titles and abstracts are combined and encoded using a multilingual BERT model. This captures differences in meaning across disciplines and languages, especially in cases where metadata coverage is uneven [16].
For citation embeddings, the model integrates two graph-derived features that encode the position of a publication within the citation network relative to previously topic-labeled reference works. Citation 1 captures direct citation links between the focal article and gold-labeled topic exemplars, that is, publications already associated with well-defined topics in the OpenAlex/CWTS taxonomy. Citation 2 captures second-order citation proximity by considering links to works that cite those exemplars, thereby extending the relational signal beyond direct citation ties. Together, these two features provide a structured indication of the thematic neighborhood within the citation graph and are especially useful when textual metadata are sparse or semantically ambiguous [9].
For journal embeddings, instead of relying on static journal categories, journal identity is modeled via dynamic transformer-based embeddings (e.g., MiniLM). These vectors are trained jointly with the main model to encode topical biases of journals, supporting robust inference even when other features are missing [17].
Operationally, the multimodal fusion mechanism involves aggregating the three sources of semantic signal, textual, citation, and journal, into a single composite representation at the article level. Specifically, the textual embedding derived from the title and abstract, the embeddings based on citation relationships, and the journal-associated embedding are concatenated to form a unified input vector, which is then passed to a feedforward neural classifier. It projects the multimodal representation into the candidate topic space and produces membership scores for possible thematic labels. In this configuration, classification does not depend exclusively on the textual content of the article, but on combining textual, relational, and editorial information into a unified representation. In addition, based on the description available in the original sources, the model includes robustness mechanisms such as stochastic masking of some features and exploitation of alternative signals when metadata are incomplete.
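The concatenation-plus-classifier idea can be made concrete with a toy example. This is a didactic sketch only: the vector sizes, the random weights, the single linear layer, and the softmax output are all assumptions introduced here, and none of them reflect the actual architecture or parameters of the OpenAlex model.

```python
import math
import random


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(text_emb, citation_emb, journal_emb, weights, biases):
    """Concatenate the three embeddings and project them onto topic scores."""
    fused = text_emb + citation_emb + journal_emb  # list concatenation
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


random.seed(0)
# Toy dimensions (assumed): 4 text + 2 citation + 2 journal features, 3 topics.
TEXT_EMB = [random.uniform(-1, 1) for _ in range(4)]
CITATION_EMB = [random.uniform(-1, 1) for _ in range(2)]
JOURNAL_EMB = [random.uniform(-1, 1) for _ in range(2)]
WEIGHTS = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(3)]
BIASES = [0.0, 0.0, 0.0]

TOPIC_SCORES = classify(TEXT_EMB, CITATION_EMB, JOURNAL_EMB, WEIGHTS, BIASES)
```

Setting one of the three input vectors to zeros in this sketch mimics, in a very simplified way, the situation in which a modality is unavailable and the remaining signals must carry the classification.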
2.3.2. Hierarchical Taxonomy and Clustering Capacity
The output layer maps each publication to one or more topic nodes, each representing a cohesive cluster discovered through community detection over the OpenAlex citation graph using the Leiden algorithm [19].
Each community is subsequently labeled using large language models to align with Scopus-style topic labels, and each label is linked to ASJC codes to support field-normalized evaluations [18].
Although the model’s primary role is supervised classification, the resulting topic representations can also support clustering applications, including the identification of latent research themes, inter-topic relationships and author-based topical proximity. In the present study, article-level clustering is derived from shared primary topics, enabling the construction of co-authorship networks, institutional clusters, and temporal topic flows [13].
2.3.3. Performance Characteristics and Limitations
Empirical evaluations on held-out sets report a Top-1 accuracy of 53%, a Top-3 accuracy of 64%, and a Top-5 accuracy of 67%, with Top-1 accuracy increasing to 72% when full metadata are available [17]. The model is resilient to metadata incompleteness due to its multimodal architecture but exhibits lower precision on rare or emergent topics with limited training samples. Additionally, non-Latin scripts and very short texts reduce classification confidence, although these challenges are mitigated in production through journal inference and iterative retraining.
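For clarity, Top-k accuracy is the share of records whose reference label appears among the k highest-ranked predictions. The sketch below shows the computation on invented toy data; the labels and rankings are purely illustrative and unrelated to the figures reported for the model.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k):
    """Share of items whose gold label is among the first k ranked predictions."""
    hits = sum(
        1 for ranked, gold in zip(ranked_predictions, gold_labels)
        if gold in ranked[:k]
    )
    return hits / len(gold_labels)


# Toy rankings (best guess first) and reference labels, invented for illustration.
PREDICTIONS = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "b", "a"],
    ["a", "c", "b"],
]
GOLD = ["a", "a", "a", "b"]

TOP1 = top_k_accuracy(PREDICTIONS, GOLD, 1)
TOP2 = top_k_accuracy(PREDICTIONS, GOLD, 2)
TOP3 = top_k_accuracy(PREDICTIONS, GOLD, 3)
```

By construction the metric is non-decreasing in k, which is why Top-3 and Top-5 values are always at least as high as Top-1.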
This theoretical backbone enables the ArticleToIllustry node to leverage primary topic labels for clustering records into thematic groups, subsequently visualized through high-level semantic maps and citation dynamics dashboards.
The confidence scores associated with the topic classification should be understood as the output scores of the classifier for the topic assigned to each article. In operational terms, they express the relative level of certainty with which the model associates a publication with the topic label selected from among the candidate topics. These values do not represent an external measure of accuracy validation, but an internal estimate of the strength of the classification produced by the model. For this reason, they are used in the present analysis as proxy indicators of the stability and practical consistency of the classification, without representing a substitute for an assessment on a manually labeled set. Consequently, Table 1 does not constitute a direct validation of the classifier’s performance in the sense of standard evaluation metrics, but rather a description of the distribution of internal certainty associated with the classifications generated in the two analyzed streams.
From this perspective, Table 1 summarizes the distribution of confidence scores for records enriched through OpenAlexFetcher and for records originating from Web of Science and subsequently classified through the Wos + BERT branch, providing a comparative picture of the classification behavior in the two data streams.
To complement these aggregate indicators, the evaluation was extended with a comparison across the most frequent thematic categories.
The results in Table 1 show that the OpenAlexFetcher-enriched stream produces higher confidence scores than the Wos + BERT branch. For the OpenAlex stream, the median confidence score is 0.9708 for 1756 articles, while for the Wos + BERT branch the median is 0.8228 for 42 articles. This difference is compatible with the distinct role of the two streams in the proposed architecture: the OpenAlex stream represents the main semantic enrichment and extended aggregation pathway, while the Wos + BERT branch reflects situations where classification needs to be performed under conditions of reduced or incomplete metadata. In this regard, the comparison does not aim at strict equivalence of the samples, but at illustrating how classification works in two different contexts of information availability.
The difference in size between the two streams, however, requires a cautious interpretation. The Wos + BERT branch includes a small number of articles and cannot support general inferences regarding the robustness of the classification or the thematic coverage of the model in broader contexts. Nevertheless, the results remain relevant because they show that topic assignment remains operationally possible even under conditions of limited bibliographic information, even if the confidence values are higher when the classification benefits from more complete metadata. From this perspective, the comparison supports the practical utility of the framework for exploring heterogeneous bibliographic collections, while also indicating that metadata completeness influences classification confidence.
At the thematic granularity level, the OpenAlexFetcher stream assigned 462 unique topics for 1756 articles, while the Wos + BERT set covered 23 unique topics for 42 articles. These values suggest that the main semantic enrichment stream allows for a fine classification of the corpus and provides an adequate basis for the construction of thematic clusters and visual structures used later in the analysis. At the same time, the results from the WoS branch indicate that the model can provide usable classifications when the information signal is weaker, but in a more restricted empirical framework. Overall, the results support the feasibility and practical utility of the proposed framework for classifying and exploring heterogeneous bibliographic corpora. They suggest that integrating semantic classification into the bibliometric flow allows for maintaining an operational classification under different conditions of metadata completeness and provides a sufficiently stable basis for subsequent clustering and visualization steps. In the current form of the study, these conclusions should be viewed as empirical indications of the practical behavior of the framework, without being extended to an exhaustive experimental validation of the classifier’s performance.
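The per-stream descriptors discussed above (number of articles, median confidence, number of unique topics) can be derived from classified records along the following lines. The record keys, stream names, and values are hypothetical and serve only to show the computation; they are not the study data.

```python
from statistics import median


def summarize_streams(records):
    """Group records by stream and report count, median score, and unique topics."""
    grouped = {}
    for rec in records:
        grouped.setdefault(rec["stream"], []).append(rec)
    return {
        stream: {
            "n": len(items),
            "median_confidence": median(r["score"] for r in items),
            "unique_topics": len({r["topic"] for r in items}),
        }
        for stream, items in grouped.items()
    }


# Toy records, invented for illustration only.
RECORDS = [
    {"stream": "openalex", "score": 0.90, "topic": "T1"},
    {"stream": "openalex", "score": 0.95, "topic": "T2"},
    {"stream": "openalex", "score": 1.00, "topic": "T1"},
    {"stream": "wos_bert", "score": 0.70, "topic": "T3"},
    {"stream": "wos_bert", "score": 0.80, "topic": "T3"},
]
SUMMARY = summarize_streams(RECORDS)
```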
An important methodological limitation of the present evaluation is the absence of an ablation study that isolates the contribution of each semantic component integrated into the classification used. In its current form, the comparison between the OpenAlex-enriched stream and the Wos + BERT branch does not allow for a rigorous separation of the effect of OpenAlex enrichment from that of citation embeddings and journal embeddings. Therefore, an important direction for future research is to conduct an ablation study that explicitly compares classification performance in configurations with and without OpenAlex enrichment, with and without citation embeddings, and with and without journal embeddings.
2.3.4. Execution Time Across Dataset Sizes
To evaluate the proposed framework’s operational behavior in relation to the amount of data, we analyzed the total execution time of the same workflow on datasets of different sizes. This comparison is relevant because the practical utility of an automated bibliometric workflow depends not only on the correctness of the classification and the quality of the visualizations, but also on the ability of the system to maintain reasonable processing times as the corpus size increases. Therefore, the operational scalability was examined by comparing the runtime for progressive subsets and for the full dataset, as observed in Table 2.
The results in Table 2 show that the total execution time increases with the size of the dataset, but at a moderate rate, from 46 s for 250 records to 90 s for 1756 records. The processing time per 100 records decreases from 18.4 s to 5.1 s, indicating higher throughput as the data volume increases. This behavior suggests that the execution time does not grow in strict proportion to the corpus size, but combines a relatively stable fixed-cost component with a slower increase in the volume-dependent steps. In particular, the Wos step remains approximately constant, while OpenAlexFetcher explains most of the increase in the total execution time. Overall, the results in Table 2 support the practical scalability of the proposed architecture under the analyzed experimental conditions, although validation on larger bibliographic collections is still needed.
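The per-100-record figures follow directly from the totals quoted above, and the fixed-cost interpretation can be illustrated with a rough two-point estimate. The sketch below uses only the two endpoints mentioned in the text (250 and 1756 records); the intermediate rows of Table 2 are not used, so the fitted fixed and per-record costs are indicative values under a simple linear assumption, not measured quantities.

```python
def seconds_per_100(total_seconds: float, n_records: int) -> float:
    """Average processing time per 100 records."""
    return total_seconds / n_records * 100


def two_point_cost_model(point_a, point_b):
    """Fit t = fixed + per_record * n through two (n, t) observations."""
    (n_a, t_a), (n_b, t_b) = point_a, point_b
    per_record = (t_b - t_a) / (n_b - n_a)
    fixed = t_a - per_record * n_a
    return fixed, per_record


SMALL = (250, 46.0)   # (records, total seconds) as quoted in the text
FULL = (1756, 90.0)

RATE_SMALL = round(seconds_per_100(SMALL[1], SMALL[0]), 1)  # 18.4
RATE_FULL = round(seconds_per_100(FULL[1], FULL[0]), 1)     # 5.1
FIXED_COST, PER_RECORD = two_point_cost_model(SMALL, FULL)
```

Under this simplification, roughly 39 s of each run would be independent of corpus size and each additional 100 records would add about 2.9 s, which is consistent with the qualitative reading that a stable fixed cost dominates small runs.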
2.3.5. Consistency Analysis and Comparison Across Thematic Categories
To complement the aggregate assessment presented above, the analysis was extended in two directions: examining the consistency of the thematic classification and comparing the behavior of the model across the main thematic categories identified in the corpus. To this end, the top 10 topics were selected in order of frequency, so that the analysis would focus on the categories with the greatest empirical relevance in the dataset. This extension is necessary because the global descriptive indicators only provide a synthetic picture of the classification and do not sufficiently capture the internal variations between the dominant thematic areas. The corresponding results are summarized in Table 3.
The results in Table 3 indicate that the level of confidence of the classification varies between the thematic categories analyzed. For example, Service-Oriented Architecture and Web Services and Sustainable Supply Chain Management record high values for the mean and median confidence scores, while categories such as Economic and Business Development Strategies or Economic and Technological Systems Analysis present lower mean values. At the same time, the differences between the mean and median suggest that the distribution of scores is not uniform across all topics, which indicates internal variations in the stability of the classification. Overall, the results presented in Table 3 support the idea that the robustness of the framework should be assessed not only at a global level, but also according to the behavior of the classification in different thematic areas. From this perspective, the comparison by categories provides a more informative picture of the practical performance of the framework than simply reporting aggregated indicators at the level of the entire corpus.
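A per-topic breakdown of this kind can be computed as sketched below: topics are ranked by frequency, the most frequent ones are kept, and the mean and median confidence are reported for each. The record keys, topic names, tie-breaking rule, and scores are hypothetical and do not reproduce the values in Table 3.

```python
from collections import defaultdict
from statistics import mean, median


def top_topic_confidence(records, top_n=10):
    """Mean and median confidence for the top_n most frequent topics."""
    scores = defaultdict(list)
    for rec in records:
        scores[rec["topic"]].append(rec["score"])
    # Rank by descending frequency; ties are broken alphabetically (an assumption).
    ranked = sorted(scores.items(), key=lambda kv: (-len(kv[1]), kv[0]))[:top_n]
    return [
        {"topic": topic, "n": len(vals), "mean": mean(vals), "median": median(vals)}
        for topic, vals in ranked
    ]


# Toy records, invented for illustration only.
RECORDS = [
    {"topic": "A", "score": 0.9}, {"topic": "A", "score": 0.7},
    {"topic": "A", "score": 0.8}, {"topic": "B", "score": 0.6},
    {"topic": "B", "score": 1.0}, {"topic": "C", "score": 0.5},
]
TOP_TWO = top_topic_confidence(RECORDS, top_n=2)
```

A gap between the mean and the median of a topic in such a table signals a skewed score distribution, which is the pattern the comparison across categories is intended to expose.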