Deterministic Data Governance in Hybrid Financial Architectures

Ionescu, Sergiu-Alexandru; Diaconita, Vlad; Radu, Andreea-Oana; Dinca, Laurentiu Gabriel; Nagit, Ioana

doi:10.3390/electronics15081716

by Sergiu-Alexandru Ionescu *, Vlad Diaconita, Andreea-Oana Radu, Laurentiu Gabriel Dinca and Ioana Nagit

Department of Economic Informatics and Cybernetics, Bucharest University of Economic Studies, 010374 Bucharest, Romania

* Author to whom correspondence should be addressed.

Electronics 2026, 15(8), 1716; https://doi.org/10.3390/electronics15081716

Submission received: 2 March 2026 / Revised: 9 April 2026 / Accepted: 16 April 2026 / Published: 18 April 2026

(This article belongs to the Special Issue Information Systems, Management, and Digital Innovation: Complexity, Integration, and Transformation)

Abstract

Today, the architecture of financial institutions does not rely on a single technology. Instead, it combines several technologies in order to meet modern requirements and remain relevant: relational databases, Big Data platforms for analysis, and Cloud environments for distributed capacity, all within a complex data architecture. At the same time, European data governance regulations require mechanisms such as encryption, pseudonymization, and incremental versioning to be applied on each architectural layer. In this study, the impact of data governance is assessed by applying these mechanisms from the data-ingestion level onward, using structured, semi-structured, and unstructured data, across relational databases, Big Data analysis, and Cloud distributed systems. Execution time, CPU usage, and memory usage are measured in order to evaluate the impact of governance mechanisms on financial systems. The results show that governance can be successfully integrated, provided these mechanisms are embedded at the architectural level, ensuring that performance, scalability, and compliance are maintained across the entire processing pipeline.

Keywords:

financial systems; big data; integration; data governance; multi-tier architecture; hybrid storage systems

1. Introduction

The fast development of artificial intelligence, distributed systems, and Cloud infrastructure and services has impacted financial services [1], a field where large volumes of heterogeneous data move quickly [2]. These developments have affected financial analysis, algorithmic trading, fraud detection, and risk assessment. The financial sector is highly regulated [3,4]. In the EU, among other regulations, financial institutions must comply with the General Data Protection Regulation (GDPR). Such regulations dictate many aspects, including how customer information is stored, how transactions and patterns are analyzed, and how data is encrypted, pseudonymized [5] and audited using data governance mechanisms.

Predictive analyses and anti-fraud investigations [6] can be based on structured transactional records, real-time event streams, and semi-structured and unstructured documents. Traditional systems based on relational databases are deterministic but have limitations, especially regarding scalability and their schema-on-write approach, where rigid structures (e.g., relational tables) need to be defined before data can be inserted. Big Data systems and Data Lakes running on-premises or in the Cloud natively offer scalability, elasticity and parallel processing capabilities, making them indispensable in modern financial architecture [7]. Still, integrating heterogeneous approaches in a single architecture raises many governance requirements. Relational databases, Big Data platforms and Cloud services are discussed in the literature [8], but in isolation. There is limited empirical evidence on the performance of encryption, pseudonymization, and incremental versioning operating in different hybrid financial architectures.

Many works on hybrid transactional/analytical processing (HTAP) focus on storage design, synchronization, query optimization and benchmarking in integrated transactional and analytical systems [9]. However, achieving data sovereignty while still managing cross-border flows requires architectural models for data governance in hybrid environments [10].

To address this gap, our paper proposes and evaluates an integrated governance model. We construct a pipeline that uses relational databases (OLAP and OLTP) [11] and other Big Data engines deployed in a Cloud-enabled environment. Encryption, pseudonymization and incremental versioning are treated as core design components and applied depending on the technological or business requirements. To clarify technical and architectural trade-offs, we investigate how these mechanisms relate to performance, scalability and resource consumption in realistic financial data processing environments.

Our research is guided by the following research questions:

RQ1: How can a hybrid financial architecture be designed to ensure the deterministic application of data governance mechanisms across the RDBMS, Big Data and Cloud layers?
RQ2: What performance and computational resource costs are introduced by implementing data governance mechanisms within distributed financial workflows?
RQ3: To what extent do governance scenarios influence system performance, execution repeatability, and cross-run consistency across the evaluated RDBMS, Big Data, and Cloud workflows?

We address these research questions in the following sections, which define the methods, show how the integrated pipeline is constructed and show how the performance analysis is conducted. We also discuss the testing limitations, provide practical recommendations and propose future research directions.

2. Literature Review

2.1. Relational Databases Data Governance Mechanisms

For decades, relational databases have been the centerpiece of financial information infrastructure. They sustain real-time critical operations, guarantee ACID properties, and play a role in operational oversight and legal accountability; they also support audit mechanisms, transaction traceability, and access control [12]. These engines allow column-level encryption, which provides a primary level of protection for highly sensitive customer data, such as personal identification or transaction history [13].

Their importance in compliance-oriented architecture, especially for highly regulated industries, is well documented [14]. Yet, these protections often focus on structured data. With growing volumes of structured, semi-structured, and unstructured data, applying controls like encryption and pseudonymization is increasingly challenging due to reduced architectural flexibility and performance constraints [15,16].

Due to these constraints, the literature indicates the use of NoSQL technology in parallel with relational databases, as these operate without predefined schema and are designed for use with various data formats, making them a perfect addition for the financial environment [17,18]. Their column family and graph-based models provide flexibility for adding governance in earlier stages of data handling or data preparation, and before it is normalized. Due to this, encryption can be done without adding strict constraints on the entire dataset, and governance can be applied at the node or at the pipeline level [19].

Therefore, the integration of relational with non-relational models should become a structural requirement, not a technical alternative. In order to achieve a sustainable implementation of encryption, pseudonymization, and versioning in each processing layer, these governance mechanisms must be embedded throughout the entire financial data architecture [20].

2.2. Governance Mechanisms in Big Data Frameworks for Financial Analytics

In the specialized literature, Big Data technologies are shown as being able to manage large volumes of diverse data structures with high integration demands [21,22]. When Apache Spark is used, the processing model moves beyond the classical batch approach, supporting in-memory computation, machine learning, streaming, and SQL within a common execution environment, thus opening a wider operational scope. All of this provides the necessary environment for near real-time analysis [23].

Unlike relational databases, which store data in a centralized, structured manner, Big Data environments fragment data across multiple nodes [21,24]. Governance mechanisms therefore have to adapt to this fragmentation. Recent studies propose the implementation of specialized incremental versioning mechanisms in order to avoid data swaps that cause data loss or degradation [24,25]. As a result, incremental versioning has become the main technology used to ensure data reproducibility and to provide reliable analytic results in financial architectures [24,26].

In the case of pseudonymization and encryption, mechanisms have to adapt by moving from the storage area to the processing and metadata layer. In the literature, it is shown that pseudonymization must be performed before distribution to Spark or Hadoop, while encryption is applied both in transit and at rest through key management mechanisms [19,27,28,29].

Accordingly, mechanisms like Delta Lake, Apache Hudi and Iceberg are not mere technological extensions, but critical components of data governance in Big Data environments [26,30,31].

2.3. Data Governance in Cloud-Based Financial Services

The latest specialized literature confirms the impact of Cloud computing on financial infrastructures and the need for the adoption of a hybrid infrastructure that combines on-premises and distributed services provided by Cloud platforms, such as AWS, Microsoft Azure, and Google Cloud [7,32]. The biggest advantages of Cloud services are on-demand scalability and optimization of operational costs, together with advanced processing capabilities; however, these advantages come with major challenges to data security and sovereignty [33,34].

In multi-tenant Cloud environments, control over the underlying physical infrastructure is limited, which leads to an increased need for encryption, strict segregation of processing areas, and stronger control over sensitive data [35,36]. As a result, the literature highlights the importance of encrypting data before Cloud migration, as well as the need for end-to-end encryption, local control over encryption keys, and strict segregation between processing areas. Based on these constraints, encryption becomes an essential tool for ensuring legal compliance, not only a protection mechanism [36,37].

Pseudonymization is a governance mechanism recommended especially for data that identifies private individuals, as it hides personal identifiers before data transfer [19,38]. In this way, pseudonymization reinforces the principle of privacy by design, reducing legal exposure. Applied separately, neither encryption nor pseudonymization can prevent re-identification of information; for this reason, the technical and regulatory literature proposes a combined approach that significantly decreases the risk of re-identification [39].

The role of incremental versioning is increasingly significant, given its capacity to manage replication and support rollback functionality, necessary for data reconstruction to a specific point in time, complete audit trails, and alignment with compliance requirements [40,41,42,43,44].

2.4. Data Governance Integration Strategies and Findings

Research on governance mechanisms, such as encryption, pseudonymization, and incremental versioning, remains fragmented in the specialized literature. These mechanisms are treated either as security measures or as compliance requirements. However, there is no complete architectural model combining all of these technologies into a single data processing pipeline, taking into consideration the rigidity of relational systems, the flexibility of Big Data, and the constraints imposed by Cloud computing. These aspects, together with the complexity introduced by governance models, have not yet been integrated into a unified architecture that forms a single, cohesive ecosystem [45].

The literature indicates that the most efficient governance models integrate encryption, pseudonymization, and incremental versioning together, not as standalone technologies [19,26]. Through this integrated approach, data integrity and confidentiality are preserved in a distributed environment, while at the same time enabling audit and control analysis [46].

Although encryption and pseudonymization are frequently examined as security and compliance measures in the literature, their integration into a unified architectural framework, aligned with incremental versioning mechanisms, remains insufficiently explored [24,25]. The cumulative impact of these mechanisms on the performance and stability of hybrid systems has rarely been examined empirically [47].

Encryption is often perceived as a technical safeguard; however, its implications in relation to jurisdictional boundaries and data localization obligations are less frequently analyzed, even though such considerations are increasingly decisive in international financial data flows [10]. Existing research tends to treat pseudonymization predominantly as a legal requirement, with comparatively limited attention given to its role as a structural component embedded in operational architectures, designed for systematic application in distributed environments [38]. Incremental versioning represents the most significant gap identified [40,48]. The traditional literature only briefly examines data consistency and synchronization, in contexts such as disaster recovery or replication, without assigning versioning a central role in data governance.

This study addresses the gaps identified in the current literature by proposing, developing, and testing a comprehensive hybrid architecture. We focus on integrating a technical governance model with embedded pseudonymization at the architectural level.

3. Materials and Methods

The experimental methodology was designed to systematically assess the impact of data governance mechanisms, such as encryption, pseudonymization and incremental versioning, on performance, execution predictability and the stability of processing times in distributed architectures. The assessment is based on comparing governance scenarios with a baseline scenario in which data is processed without the application of these mechanisms, thereby enabling isolation of the effect of each technique on lead time, resource consumption and execution stability across repeated runs.

Data sets of varying structure were used to reflect the complexity of contemporary financial ecosystems [49]. They include tabular, semi-structured and unstructured data formats that collectively capture the influence of governance mechanisms in the data processing pipeline, from initial ingestion and intermediate transformations to predictive analysis and financial decision-making.

The experiments were conducted in SQL, Python and Apache Spark. Each distinct processing environment was configured to enable independent and controlled application of governance mechanisms [50]. This separation of environments ensured comparability of results and eliminated interference generated by the peculiarities arising from platform-specific implementation.

It should be noted that the experimental design is intentionally asymmetric as the RDBMS layer processes exclusively structured data, reflecting its operational role in transactional financial systems, while the Big Data and Cloud layers cover all three data types. Encryption, pseudonymization and incremental versioning mechanisms have been integrated directly into automated ETL workflows in each environment, providing operational conditions similar to real financial infrastructures.

Encryption was applied using column-level symmetric mechanisms to sensitive attributes prior to analytical operations, ensuring data confidentiality at all stages of processing [51,52]. Pseudonymization was performed by tokenization and/or hashing of sensitive fields, with token-value mapping managed separately to prevent the exposure of real identifiers within processing environments [53,54]. Incremental versioning was implemented through delta log structures, which record successive changes to datasets and enable the reconstruction of previous states, supporting audit, reconciliation and validation of data consistency.
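The delta log idea described above can be illustrated with a minimal, in-memory sketch. It is not the implementation used in the experiments: the `DeltaLog` class, its method names and the sample loan records are hypothetical, and production frameworks persist the log alongside the data files rather than holding it in memory.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class DeltaLog:
    """Append-only log of row-level changes (hypothetical, in-memory only)."""
    entries: List[Dict[str, Any]] = field(default_factory=list)

    def commit(self, op: str, key: str, row: Optional[Dict[str, Any]] = None) -> int:
        """Append one change and return the new version number (1-based)."""
        if op not in ("upsert", "delete"):
            raise ValueError(f"unsupported operation: {op}")
        if op == "upsert" and row is None:
            raise ValueError("an upsert needs a row")
        self.entries.append({"op": op, "key": key, "row": row})
        return len(self.entries)

    def state_at(self, version: int) -> Dict[str, Dict[str, Any]]:
        """Replay the first `version` entries to rebuild that dataset state."""
        state: Dict[str, Dict[str, Any]] = {}
        for entry in self.entries[:version]:
            if entry["op"] == "upsert":
                state[entry["key"]] = dict(entry["row"])
            else:
                state.pop(entry["key"], None)
        return state


log = DeltaLog()
v1 = log.commit("upsert", "loan-1", {"amount": 1000, "status": "open"})
v2 = log.commit("upsert", "loan-1", {"amount": 1000, "status": "closed"})
v3 = log.commit("delete", "loan-1")

# Any earlier state can be rebuilt for audit or reconciliation.
print(log.state_at(v1))
print(log.state_at(v3))
```

Because entries are only ever appended, every previous state stays reconstructible, which is the property that audit and reconciliation rely on.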

To assess governance mechanisms, a standardized series of operations representative of typical financial workflows was executed consistently across the three processing environments (SQL, Python and Apache Spark). These operations included data ingestion and cleansing, consistency validation, transformation, and enrichment procedures, aggregation processes, and indicator computation, together with read and write interactions specific to each data format. Encryption, pseudonymization and incremental versioning mechanisms were applied in a controlled manner at these stages, enabling the measurement of the overhead introduced by each policy, both individually and in combination.

The dataset used in the experiments is a synthetic variant of the financial credit dataset [55], generated in three distinct formats: a structured version, a semi-structured version in JSON format, and an unstructured version used for sentiment analysis. This unstructured representation is consistent with recent NLP work that analyzes financial consumer complaints and sentiment in text form [56]. The same logical structure was maintained in all experimental scenarios, allowing the impact of governance mechanisms to be isolated from variations induced by data format. Processing covered progressively increasing data volumes of 100,000, 500,000 and 1,000,000 records, in order to capture performance variations associated with dataset growth. The sequence of experimental stages, the relationship between data types, processing environments and data sizes, as well as the method used to collect performance indicators, are synthesized in Figure 1, thereby ensuring the traceability of the experimental design.

Each experimental configuration was executed 50 times, in order to support statistically meaningful comparisons and to reduce the influence of temporary resource fluctuations in distributed environments. During the experiments, performance indicators were collected, including total execution time, CPU utilization, memory consumption and GPU utilization.
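A repeated-run measurement of this kind can be sketched with the Python standard library. The harness below is illustrative only and is not the authors' instrumentation: the `measure` and `overhead_pct` helpers and the two toy workloads are hypothetical. It records wall-clock time and peak Python heap usage; CPU and GPU utilization would need additional tooling that is not shown, and `tracemalloc` itself slows execution, so absolute timings are inflated.

```python
import hashlib
import statistics
import time
import tracemalloc
from typing import Callable, Dict, List


def measure(workload: Callable[[], object], runs: int = 50) -> Dict[str, float]:
    """Run `workload` repeatedly and summarize time and peak traced memory."""
    durations: List[float] = []
    peaks: List[int] = []
    for _ in range(runs):
        tracemalloc.start()
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peaks.append(peak)
    mean = statistics.mean(durations)
    stdev = statistics.stdev(durations) if runs > 1 else 0.0
    return {
        "mean_s": mean,
        "stdev_s": stdev,
        # Coefficient of variation: a simple indicator of run-to-run stability.
        "cv": stdev / mean if mean > 0 else 0.0,
        "peak_bytes": float(max(peaks)),
    }


def overhead_pct(baseline: Dict[str, float], governed: Dict[str, float]) -> float:
    """Relative increase in mean execution time over the baseline, in percent."""
    return 100.0 * (governed["mean_s"] - baseline["mean_s"]) / baseline["mean_s"]


# Toy workloads standing in for a baseline and a governed scenario.
rows = [f"customer-{i}" for i in range(2000)]
baseline = measure(lambda: [r.upper() for r in rows], runs=5)
governed = measure(
    lambda: [hashlib.sha256(r.encode()).hexdigest() for r in rows], runs=5
)
print(f"overhead: {overhead_pct(baseline, governed):.1f}%")
```

Comparing each governed scenario against the same baseline isolates the cost of a mechanism, while the coefficient of variation gives one view of execution stability across runs.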

This experimental framework allows the systematic examination of the trade-offs between compliance requirements, performance constraints and scalability in hybrid financial architectures. By testing governance mechanisms under controlled and reproducible conditions, quantification of their associated costs becomes possible; their influence on processing efficiency and execution stability is also measurable.

4. System Architecture and Governance Mechanism Integration

This section outlines the practical implementation of the proposed hybrid architecture. It focuses exclusively on operational workflows and the effective application of governance mechanisms in real financial environments. Conceptual and comparative considerations have been detailed in the preceding sections.

4.1. Comparison of Technologies in Relation to Data Governance

As reflected in the literature, data governance mechanisms, particularly encryption, pseudonymization and incremental versioning, cannot be analyzed separately, but must be assessed in relation to the technical infrastructures within which they are implemented [57,58]. In contemporary financial architectures, relational databases, Big Data platforms and Cloud infrastructures represent the three fundamental pillars of data processing. Each offers distinct advantages and specific limitations that directly influence the application and integration of governance policies [18,59]. The proposed architecture follows that logic and examines each mechanism in relation to the layer in which it is enforced.

The technological differences between these layers are directly reflected in the way in which encryption, pseudonymization and incremental versioning are implemented. In this context, the structural separation between original datasets, pseudonymized data and the mapping table is central to GDPR compliance and to the preservation of effective operational control over identity information [38,60,61]. By clearly delineating these components, this distinction establishes the conceptual basis for the governance mechanisms implemented across the subsequent technological layers, starting with the relational environment, where control over data remains most comprehensive.
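The separation between original records, pseudonymized records and the mapping table can be sketched as follows. This is a minimal illustration assuming keyed hashing (HMAC-SHA256) as the pseudonymization function; the function names, the `customer_id` and `customer_token` field names and the demonstration key are hypothetical and are not taken from the experimental setup.

```python
import hashlib
import hmac
from typing import Dict, List, Tuple

# Hypothetical demonstration key; a real deployment would fetch it from a
# key management service and never embed it in code.
SECRET_KEY = b"demo-key-not-for-production"


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Deterministic keyed hash: the same input always yields the same token."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def split_records(
    records: List[Dict[str, str]], id_field: str, secret_key: bytes
) -> Tuple[List[Dict[str, str]], Dict[str, str]]:
    """Return (pseudonymized records, token-to-identifier mapping table)."""
    pseudonymized: List[Dict[str, str]] = []
    mapping: Dict[str, str] = {}
    for record in records:
        token = pseudonymize(record[id_field], secret_key)
        mapping[token] = record[id_field]
        # Copy every field except the real identifier, then attach the token.
        safe = {k: v for k, v in record.items() if k != id_field}
        safe["customer_token"] = token
        pseudonymized.append(safe)
    return pseudonymized, mapping


originals = [
    {"customer_id": "RO123", "amount": "2500"},
    {"customer_id": "RO456", "amount": "900"},
]
safe_rows, mapping_table = split_records(originals, "customer_id", SECRET_KEY)
```

Only `safe_rows` would travel to downstream layers, while `mapping_table` and the key stay in the controlled environment. Determinism keeps joins on the token possible without exposing the identifier.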

At the relational level, governance mechanisms benefit from precise implementation, as RDBMS environments provide a deterministic, transaction-oriented structure [12,62]. Such characteristics enable the integration of governance controls directly at the point of data ingestion. Relational environments facilitate column-level encryption using standardized algorithms (e.g., AES-256) and allow deterministic pseudonymization via hashing or tokenization. They also incorporate audit log functionalities that make it possible to trace and reconstruct the evolution of datasets during successive operations [63,64]. These functionalities are integrated directly into the database engine, as illustrated in Figure 2, where the flow between the source system, policy engine, storage layer and output demonstrates that data protection measures are applied prior to data propagation to other systems. Hence, the relational layer serves as the main point of legal and operational responsibility for the management of personal data, ensuring compliance with data protection requirements from the initial stages of processing.
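The audit log functionality mentioned above can be illustrated with a database trigger. The sketch below uses SQLite from the Python standard library purely as a stand-in, since the text does not name a specific relational engine; the table, column and trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        customer_token TEXT NOT NULL,   -- pseudonymized identifier only
        balance REAL NOT NULL
    );
    CREATE TABLE accounts_audit (
        audit_id INTEGER PRIMARY KEY AUTOINCREMENT,
        account_id INTEGER NOT NULL,
        old_balance REAL,
        new_balance REAL,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- Record every balance change inside the engine, not in application code.
    CREATE TRIGGER accounts_balance_audit
    AFTER UPDATE OF balance ON accounts
    BEGIN
        INSERT INTO accounts_audit (account_id, old_balance, new_balance)
        VALUES (OLD.id, OLD.balance, NEW.balance);
    END;
    """
)

conn.execute(
    "INSERT INTO accounts (id, customer_token, balance) VALUES (1, 'tok_a1', 100.0)"
)
conn.execute("UPDATE accounts SET balance = 80.0 WHERE id = 1")
conn.execute("UPDATE accounts SET balance = 95.0 WHERE id = 1")
conn.commit()

# The audit table allows the evolution of the row to be traced in order.
history = conn.execute(
    "SELECT old_balance, new_balance FROM accounts_audit ORDER BY audit_id"
).fetchall()
print(history)
```

Placing the trigger in the engine means the trail is written in the same transaction as the change itself, so the two cannot diverge.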

As data is transferred to the Big Data ecosystem, operational requirements change significantly because sensitive identifiers should be removed or transformed before data is distributed across processing nodes [19,38]. The move toward alternative architectures reflects the scalability limits of relational systems in large-scale and streaming environments, where data volume and processing speed exceed the thresholds supported by traditional transaction-oriented frameworks [21,24]. In this context, encryption mechanisms are administered through key management services, ensuring data protection both in transit and at rest [27,28,29]. At the same time, incremental versioning frameworks (such as Delta Lake, Apache Hudi and Apache Iceberg) maintain traceability for large-scale datasets, as they allow the reconstruction of prior data states and support systematic auditing of modifications [26,30,31]. These processes are illustrated in Figure 3, where the ingestion of pseudonymized data is followed by distributed processing with Apache Spark, together with audit, lineage and incremental versioning activities.

In the final stage of the processing pipeline, data is distributed to Cloud infrastructures, where governance requirements become strictly dependent on the legal framework and jurisdiction in which the processing takes place [33,34,65]. In EU–non-EU cross-border scenarios, Cloud infrastructures may receive only pseudonymized data, while encryption keys must be kept within the European Union under the direct control of the organization [60,61]. This is not an architectural limitation, but rather a deliberate governance decision, intended to preserve data sovereignty while enabling full use of the elastic capabilities of the global Cloud. Within Cloud environments, incremental versioning is further supported by native platform functionalities (e.g., object versioning and snapshot mechanisms), which facilitate controlled dataset replication and enable rollback procedures in the event of operational anomalies or incidents [42,43,44]. These flows are illustrated in Figure 4, which describes how encrypted and pseudonymized data is replicated to non-EU infrastructures, while encryption keys are retained within the on-premises environment or in infrastructures compliant with European regulations.

Overall, relational systems provide fine-grained control and deterministic audit, Big Data platforms enable parallel processing and versioning at scale, and Cloud infrastructures introduce geographic elasticity subject to rigorous application of governance mechanisms [33,34,45]. The integration of these technologies makes it possible to build a coherent pipeline in which data protection is maintained throughout the entire workflow.

In order to reinforce the comparative analysis presented and highlight how governance mechanisms are applied differently according to technological infrastructure, Table 1 summarizes the main features of relational systems, Big Data platforms and Cloud infrastructures, with a focus on their role in the integrated data protection, compliance and traceability framework.

4.2. Data Governance Integration Strategies

The integration of relational databases, Big Data platforms and Cloud infrastructures into a unified financial ecosystem requires the definition of an orchestrated processing flow, in which data governance mechanisms are applied sequentially, deterministically and consistently throughout the information lifecycle [24,45,58]. Unlike traditional approaches, where security and compliance are treated as adjacent or isolated mechanisms, the proposed architecture embeds governance as a cross-cutting, active and persistent function at all stages of data processing [66].

The first stage of the integration strategy is implemented within the relational layer, where data is retrieved directly from operational systems and subjected to initial encryption, pseudonymization and validation processes [13,38]. In this context, the relational database management system acts as the primary point of legal and operational control, ensuring that compliance policies are applied from the moment of ingestion. Column-level granular encryption and deterministic pseudonymization allow personal data to be protected before it is propagated to distributed environments, significantly reducing the risk of uncontrolled exposure of sensitive information [13,38].

In the next stage, the already pseudonymized data is transferred to the Big Data ecosystem through ETL processes or streaming mechanisms [67,68]. Real-time data flows are managed through Apache Kafka, while Apache Spark is used for pre-processing, aggregation and parallel analysis. Within this layer, encryption is administered through centralized Key Management Services (KMS), and dataset persistence is achieved through structures with native support for incremental versioning, such as Delta Lake, Apache Hudi and Apache Iceberg [26,27,28,29,30,31]. Through these mechanisms, a complete record of data modifications is maintained, supporting audit, reconciliation and traceability obligations characteristic of the financial sector.

The final stage of integration consists of controlled replication to Cloud infrastructures, primarily used for elastic analytics and computationally intensive tasks [69,70]. In cross-border scenarios, only pseudonymized data is transferred, and the encryption keys are managed within on-premises infrastructures or in Cloud environments located within the European Union [33,60,61]. In such a model, Cloud processing remains detached from identifiable data, while platform-level versioning and replication mechanisms safeguard dataset integrity through rollback and consistency enforcement [42,43,44].
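The replication rule described above can be expressed as a small policy check. This is a sketch under stated assumptions rather than the architecture's actual policy engine: the region labels, the `Dataset` fields and the `may_replicate` function are hypothetical, and the rule that intra-EU targets are always permitted is a simplifying assumption not stated in the text.

```python
from dataclasses import dataclass

# Hypothetical region labels; real deployments would use provider identifiers.
EU_REGIONS = frozenset({"eu-west", "eu-central"})
TRUSTED_KEY_LOCATIONS = EU_REGIONS | {"on-premises"}


@dataclass(frozen=True)
class Dataset:
    name: str
    pseudonymized: bool
    key_location: str  # where the encryption keys are held


def may_replicate(dataset: Dataset, target_region: str) -> bool:
    """Decide whether a dataset may be replicated to the target region."""
    if target_region in EU_REGIONS:
        # Assumption: intra-EU replication is not restricted by this check.
        return True
    # Cross-border case: only pseudonymized data may leave, and the keys
    # must remain on-premises or inside the EU.
    return dataset.pseudonymized and dataset.key_location in TRUSTED_KEY_LOCATIONS


governed = Dataset("loans_pseudo", pseudonymized=True, key_location="on-premises")
raw = Dataset("loans_raw", pseudonymized=False, key_location="on-premises")
exported_keys = Dataset("loans_keys_abroad", pseudonymized=True, key_location="us-east")

for ds in (governed, raw, exported_keys):
    print(ds.name, may_replicate(ds, "us-east"))
```

Evaluating such a check before each replication job makes the cross-border constraint an explicit, testable step in the pipeline.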

This sequence of steps highlights a clear separation of responsibilities across layers: primary control and compliance within the RDBMS, distributed processing and traceability within the Big Data layer, and scalability together with elastic analytics in the Cloud [45,58]. The governed integration flow between these components is synthesized in Figure 5, which explicitly illustrates the points at which encryption, pseudonymization and incremental versioning are applied within the proposed hybrid architecture.

The layered model integrates governance directly into the architecture, ensuring that compliance considerations are embedded alongside performance and scalability objectives. Each platform is thus utilized within its domain of operational efficiency, without compromising data protection. Within this framework, the following subsection presents the complete hybrid architecture and highlights the practical implementation of governance mechanisms in a regulated financial environment.

4.3. Data Governance in Hybrid Architectures and Practical Implementation

The proposed hybrid architecture integrates relational databases, Big Data platforms and Cloud infrastructures into a unified framework explicitly developed to ensure the consistent application of data governance mechanisms within regulated financial environments [41,60]. This design is organized around four primary operational layers, connected through a distinct integration layer, each with a clearly defined role within the data processing and governance pipeline [71]. The primary objective of this architecture is not merely to optimize performance or scalability, but to maintain strict legal and operational control over personal data, in accordance with European requirements concerning data protection and information sovereignty.

The first layer is represented by data sources. Information is aggregated from core banking systems, transactional logs, external APIs and unstructured data repositories [71,72]. This layer constitutes the single point of entry into the architecture and is responsible for providing raw data to subsequent control mechanisms. The heterogeneity of these sources justifies the need for uniform ingestion, classification and semantic validation policies from the initial phase.

The relational governance layer represents the core of legal and operational responsibility. The first critical data controls are applied at this level: formal classification, column-level granular encryption, deterministic pseudonymization, and compliance validation [13,38]. By embedding these mechanisms directly within the RDBMS engine, personal data is secured at source, prior to any propagation to distributed environments, thereby preventing sensitive information from exiting the relational layer without adequate technical protections [12,62,63,64].

The transfer flows between the relational layer and the Big Data ecosystem are detailed in Figure 5 (RDBMS–Big Data–Cloud governed integration flow), which illustrates how data moves through the architecture in batch mode, via JDBC or ETL mechanisms, and in real time, through CDC and event streaming [67,68]. By showing the sequencing of controls, the figure demonstrates that protective measures are enforced before data is exposed to distributed processing, strengthening the architectural implementation of “privacy by design”.

The Big Data platforms layer processes exclusively pseudonymized datasets. Large-scale parallel processing, along with support for batch and streaming workloads and advanced audit and lineage mechanisms, is performed at this layer. Technologies such as Delta Lake, Apache Hudi and Apache Iceberg are used for incremental versioning [26,30,31]. These frameworks enable the maintenance of a complete history of dataset changes and support reconciliation, audit and analytical consistency validation processes, which are essential in regulated financial environments.

In the Cloud layer, Cloud-based resources provide the flexible computational power required to perform sophisticated analytics, including predictive modeling, aggregated reporting and simulation processes. The architecture also includes a separation mechanism for operations involving both EU and non-EU jurisdictions: non-EU replication is restricted to pseudonymized data, and control over encryption keys remains exclusively within EU-based or on-premises environments [60,61]. Native object versioning mechanisms and snapshots enable controlled rollback operations and reduce the operational risks associated with distributed replication [42,43,44].

Table 2 presents a comparative synthesis of the governance mechanisms applied throughout the RDBMS–Big Data–Cloud processing pipeline. It summarizes the implemented governance scenarios and highlights the role of each layer in ensuring compliance, traceability and operational control.

Overall, the proposed hybrid architecture balances compliance, performance and operational flexibility [24,25]. Although it introduces additional complexity in flow orchestration and cryptographic key management, it also enables financial institutions to meet modern data analysis requirements without compromising the fundamental principles of governance, data protection and information sovereignty.

5. Performance Analysis

In hybrid financial architectures, data governance mechanisms entail operational costs that must be explicitly quantified and analyzed. This section focuses on determining the concrete impact of encryption, pseudonymization and incremental versioning on the performance of RDBMS, Big Data and Cloud systems, under reproducible experimental conditions and using data volumes relevant to financial applications.

5.1. Performance Analysis and Obtained Results

This section examines research questions RQ2 and RQ3 by analyzing how governance mechanisms influence execution time, resource utilization, and execution stability across repeated runs in hybrid financial architectures.

The performance analysis was conducted in a controlled and reproducible experimental environment designed to evaluate the practical implications of data governance policies in RDBMS–Big Data–Cloud environments. Particular attention was given to computational resource usage and execution time, which relate to RQ2 (the performance and resource costs introduced by governance mechanisms), and to overall system stability, which relates to RQ3 (the effects of governance on the stability and consistency of analytical results).

The test environment was implemented in Google Colab and configured with an NVIDIA A100 GPU accelerator, Python 3.11 and Apache Spark 3.5.4. This configuration was selected to ensure execution stability and reproducibility and to enable direct comparison of results across multiple runs. At the same time, it reflects infrastructures commonly used in contemporary financial environments for operational and analytical purposes.

Data used and representation formats

To ensure experimental relevance for financial information systems, the assessment was conducted using three fundamental categories of data: structured, semi-structured and unstructured. Each category was represented through specific formats and concrete samples, which were applied consistently across all platforms.

Structured data (Table 3) followed a relational format with a fixed, well-defined schema and data types. It is representative of transactional systems and financial reporting data. These datasets included attributes such as customer identifiers, credit scores, revenue level, payment statuses and risk classifications. A representative sample of structured data, reflecting a typical table used in banking systems, is presented below:

Semi-structured data (Table 4), encoded in JSON format, allowed for hierarchical structures and variable fields. This type of data is characteristic of modern integrations, mobile applications, API workflows and financial event monitoring systems. A sample used in the experimental setup is presented below:

Unstructured data (Table 5) consisted of free-text fields with no fixed schema and was used for advanced analysis such as sentiment analysis or the identification of reputational risks. This type of data commonly appears in internal documentation, customer correspondence and narrative financial reports, where qualitative content may provide insights relevant to risk monitoring and behavioral assessment. An example dataset is shown below:

These formats were selected to reflect the real complexity of financial ecosystems, where structured data coexists with heterogeneous semi-structured and unstructured data flows.

Testing and result aggregation methodology

For each data type, tests were conducted on datasets of 100,000, 500,000 and 1,000,000 records. Each combination of technology, data type and governance scenario was executed 50 times. The values reported in this section correspond to aggregated mean values from the consolidated results file. The procedure reduces incidental variation and strengthens the validity of comparative statistical analysis.
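A repeated-run measurement of this kind can be sketched in a few lines of Python. The harness below is an assumption about how such runs might be timed and aggregated, not the authors' script; `benchmark` and `sample_task` are hypothetical names, and the workload is a placeholder.

```python
import statistics
import time


def benchmark(task, runs: int = 50) -> tuple:
    """Run `task` repeatedly and return (mean, sample SD) of wall-clock seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


def sample_task():
    # Placeholder workload standing in for one governed pipeline run.
    sum(i * i for i in range(10_000))


mean_s, sd_s = benchmark(sample_task, runs=50)
```

Reporting the mean together with the standard deviation, as in the tables that follow, lets a reader judge both the typical cost and the spread across runs.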

Table 6 summarizes the main performance indicators for the full governance scenario, including average CPU utilization, average memory consumption and total execution time. Values are reported as mean ± standard deviation (SD). The SD values were approximately 3% for RDBMS, 5% for Big Data, and 7% for Cloud.

GPU utilization is negligible across all analyzed technologies and scenarios, suggesting that the governance policies applied are predominantly CPU-bound. CPU usage and execution time increase under full governance scenarios, particularly for unstructured data and in distributed environments. To assess the impact of governance on execution times, a Kruskal–Wallis non-parametric test was applied to the Baseline and All Policies scenarios (Table 7). All observed H-statistics substantially exceed the critical threshold $\chi^{2}_{0.05,\,df=1} = 3.84$, with p < 0.001 in every configuration. This confirms that the performance overhead introduced by governance mechanisms is statistically significant across all tested environments, directly addressing RQ2.
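For readers who wish to reproduce this kind of comparison, the H statistic can be computed directly from ranks. The Python sketch below implements the textbook Kruskal–Wallis formula with average ranks for ties and without the tie correction factor; the function name and the sample timings are hypothetical and are not taken from Table 7.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (average ranks for ties, no tie correction)."""
    pooled = sorted((value, g) for g, group in enumerate(groups) for value in group)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        i = j
    weighted = sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)


CRITICAL_CHI2_DF1 = 3.84  # chi-square critical value at alpha = 0.05, df = 1

# Hypothetical execution times in seconds, for illustration only.
baseline = [1.0, 1.1, 1.2, 1.3, 1.4]
all_policies = [2.0, 2.1, 2.2, 2.3, 2.4]
h_stat = kruskal_wallis_h(baseline, all_policies)
significant = h_stat > CRITICAL_CHI2_DF1
```

With two groups the statistic has one degree of freedom, which is why it is compared against 3.84. In practice a library routine such as `scipy.stats.kruskal` would normally be used, since it also applies the tie correction and returns a p-value.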

The additional overhead introduced by the combined governance workflow is thereby confirmed.

Performance Impact and System Stability Across Governance Policies

Performance was examined by comparing the Baseline with each governance scenario on the 1,000,000-record dataset. Table 8 presents the observed differences in execution time and CPU utilization across execution environments and data structures. Relative to the Baseline scenario, the largest increases in execution time were observed for Encryption in Big Data unstructured (+478.3%) and All Policies in Big Data unstructured (+353.9%) workloads, as well as for All Policies in RDBMS structured (+131.0%) and Cloud unstructured (+63.2%) workloads. Encryption produced moderate increases in structured environments but higher overhead in unstructured and distributed cases. Pseudonymization was generally less costly than the combined All Policies scenario, though it had a notable influence on certain structured tasks. These results indicate that the performance cost is influenced by both the policy implemented and the processing environment.
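The percentages above are relative changes against the Baseline. Assuming the usual definition, (governed − baseline) / baseline × 100, the calculation can be sketched in Python as follows; the function name and the timings are hypothetical and are not values measured in this study.

```python
def relative_overhead(baseline_s: float, governed_s: float) -> float:
    """Percentage change in execution time relative to the Baseline scenario."""
    if baseline_s <= 0:
        raise ValueError("baseline time must be positive")
    return (governed_s - baseline_s) / baseline_s * 100.0


# Hypothetical timings in seconds, for illustration only.
overhead = relative_overhead(baseline_s=10.0, governed_s=12.5)
speedup = relative_overhead(baseline_s=10.0, governed_s=8.0)
```

A positive result denotes added cost, while a negative result means the governed run finished faster than the Baseline.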

System stability was assessed in terms of the relative variation in execution times across the 50 runs. The RDBMS achieved the highest level of stability, with variations limited to approximately ±3%. Big Data platforms exhibited moderate variability, in the range of approximately ±5%, while the Cloud environment showed the highest dispersion, reaching up to ±7%, particularly in scenarios involving semi-structured data. These values reflect a trade-off between architectural elasticity and execution predictability, which is particularly relevant when designing systems subject to strict financial and regulatory constraints.
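One common way to express relative variation is the coefficient of variation, that is, the standard deviation as a percentage of the mean. The Python sketch below assumes this definition; the function name and the run times are hypothetical.

```python
import statistics


def relative_variation(durations) -> float:
    """Coefficient of variation: sample SD as a percentage of the mean."""
    return statistics.stdev(durations) / statistics.mean(durations) * 100.0


# Hypothetical run times in seconds, for illustration only.
runs = [10.0, 10.3, 9.7, 10.2, 9.8]
cv_percent = relative_variation(runs)
```

Normalizing by the mean makes the dispersion of fast and slow configurations comparable on a single percentage scale.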

The analysis of GPU utilization indicates that this resource is not engaged in RDBMS environments, and that utilization levels remain close to zero in Big Data and Cloud-based platforms. These findings suggest that the governance policies examined are predominantly CPU-dependent within the experimental configuration adopted. Accordingly, GPU usage does not influence the overall performance and is therefore not included in the detailed graphical assessment.

CPU utilization emerges as the principal indicator of the computational cost associated with governance mechanisms. As illustrated in Figure 6, for structured data, RDBMS architectures exhibit a predictable and controlled increase in CPU consumption, even under full governance scenarios. In the case of semi-structured and unstructured data, Big Data and Cloud platforms record significantly higher levels of CPU utilization, reflecting the additional costs generated by the processing of hierarchical structures, metadata validation and workload distribution mechanisms. Therefore, the performance impact of governance mechanisms depends on both the structural complexity of the data and the architecture employed.

In contrast to CPU utilization, memory consumption remains relatively stable. Variation is observed only in the RDBMS environment, while Big Data and Cloud values remain approximately zero. This suggests that the governance policies mainly increase processing cost, without notable memory growth.

Total execution time captures the combined influence of CPU load, system behavior and the structural complexity of data processing. As shown in Figure 7, relational database architectures preserve higher efficiency when processing structured datasets, even under extensive governance configurations, given the inherent optimization features of relational engines. This relative benefit, however, appears to diminish as governance policies become more complex. By contrast, in the processing of semi-structured and unstructured data, Big Data and Cloud-based platforms generally display longer execution times. At the same time, they provide greater scalability at elevated data volumes, a feature that remains critical when evaluating performance in distributed processing environments.

The graphical representations reinforce and expand upon the patterns identified in the tabular and statistical analysis. Together, they indicate that governance mechanisms generate a consistent and reproducible performance overhead, influenced by the type of data and the architecture employed. This overhead does not, however, compromise overall system stability, which supports the conclusions concerning RQ2 and RQ3.

5.2. Linking the Results to Financial Applications

The experimental results presented in the previous subsection demonstrate that data governance policies directly influence the performance of information systems. However, the significance of this impact can only be properly understood when considered in relation to the specific characteristics of financial applications. In this context, governance should not be viewed solely through the lens of computational overhead, but rather as a structural component that contributes to risk management, regulatory alignment and transparent, auditable financial decision-making.

Financial institutions operate within a highly regulated framework characterized by stringent requirements concerning data protection, process traceability and operational resilience. The findings suggest that each governance measure is associated with a certain performance cost; however, the magnitude of this impact appears to vary according to the technology used, the nature of the data processed and the intended purpose of the application. Accordingly, the selection and combination of governance controls should be proportionate to the criticality of the application and the corresponding risk profile.

Encryption plays a central role in financial systems that process sensitive information, including customer identification data, accounting records and transactional information. The performance analysis indicates that the introduction of encryption is linked to a predominantly CPU-bound cost profile. In structured RDBMS and Cloud workloads, its impact remained moderate (+36.8% and +7.9%, respectively), while in distributed semi-structured and unstructured workloads, the increase became substantially larger. This behavior can be explained by the fact that encryption adds transformation overhead at read/write and processing stages, while leaving memory demand relatively stable.

Considering the cost–benefit perspective, this additional overhead appears proportionate, as the gains in data confidentiality, protection against unauthorized access and regulatory compliance—particularly under frameworks such as the GDPR—may justify the incremental operational expenditure. In this context, encryption may be considered a foundational safeguard for critical systems rather than a mere optimization choice.

Pseudonymization represents a governance mechanism of particular relevance for analytical applications in the financial sector. The experimental findings suggest a comparatively lower performance impact than full encryption, particularly in Big Data semi-structured (−0.1%) and Cloud unstructured (−41.5%) workloads. At the same time, the results show that the overhead in RDBMS structured (+54.1%) and Cloud structured (+25.4%) workloads remains substantial and should not be overlooked.

The negative value observed for Cloud unstructured (−41.5%) is attributable to the nature of full-line SHA-256 hashing applied to plain text records. Each original record of approximately 300 characters is replaced by a fixed-length 64-character hash string, reducing output volume by approximately 80%. This reduction in data size lowers Cloud storage write latency to below the Baseline level. This behavior is specific to unstructured text governance and does not appear in structured or semi-structured scenarios, where hashing is applied selectively to a subset of columns, leaving overall record size essentially unchanged.
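The size effect described here is easy to verify, because a SHA-256 digest written in hexadecimal is always 64 characters long regardless of the input length. The Python sketch below uses a placeholder record of 300 characters; `hash_line` is a hypothetical helper name.

```python
import hashlib


def hash_line(record: str) -> str:
    """Replace a free-text record with its SHA-256 digest in hexadecimal."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


record = "x" * 300  # placeholder text of about 300 characters
digest = hash_line(record)
reduction = 1 - len(digest) / len(record)  # fraction of characters saved
```

For a 300-character input, the reduction is 1 − 64/300, or roughly 79%, in line with the approximate 80% figure quoted above.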

This is particularly important for applications such as credit scoring, risk assessment, fraud detection and predictive modeling, where data correlation is required but the exposure of customer identities must be restricted. From an operational standpoint, pseudonymization facilitates a clearer separation between identity protection and analytical processing, aligning with the principles commonly associated with “privacy by design” and “data minimization”.

Incremental versioning carries significant implications for audit and compliance-driven use cases. The results indicate a moderate increase in execution time under this policy (+34.8%) in RDBMS structured workloads, while the overhead in distributed environments remained substantially lower. Nevertheless, the capacity to reconstruct prior data states and document successive modifications remains particularly relevant in financial contexts, especially in connection with audit and supervisory requirements. In this light, the additional performance overhead associated with incremental versioning may be regarded as proportionate, given its contribution to traceability and accountability.

The scenario in which all governance policies are implemented concurrently appears to approximate most closely the operational conditions of critical financial systems. The results suggest that this configuration is associated with high computational demand, reflected in increased CPU utilization and longer execution times, particularly within unstructured and distributed environments. At the same time, no material degradation in data quality was observed, and system stability remained within operationally acceptable thresholds, notably in relational database architectures. In use cases involving statutory reporting, long-term financial data retention and the processing of highly sensitive information, such a level of governance control may be considered justified and, in many instances, necessary.

To consolidate these findings and clarify their practical relevance for financial applications, Table 9 summarizes the relationship between governance measures, performance implications and operational outcomes.

The comparative results show that governance controls should not be imposed uniformly, as this may generate disproportionate operational burdens. Instead, a hybrid architecture enables the level of governance to be adapted to the type of data and the purpose of the application, thereby optimizing the balance between performance, security and compliance.

6. Conclusions

This study analyses the impact of governance scenarios on hybrid financial systems. The hybrid financial architecture is clearly defined, with data governance applied across the distinct layers of each technology. The impact of each governance mechanism is thus assessed separately and compared with a baseline in which no governance is applied. The experiment outlines the impact on performance and stability across concrete financial scenarios, highlighting the fragile balance between governance and efficient financial systems.

Responding to RQ1, the study shows that, instead of a universal governance model, a tailored approach should be applied to each distinct layer (relational databases, Big Data and Cloud), adjusting governance to the sensitivity of the data and the operational context. This approach supports a layered governance model for hybrid financial systems.

In response to RQ2, the experiments show that the implementation of governance mechanisms in hybrid financial architectures is associated with measurable performance loss, mainly affecting CPU utilization and execution time. The magnitude of this impact depends on both data type and technology. While the impact on relational databases remains stable and predictable, even with full governance, it becomes more visible for semi-structured and unstructured data types on Big Data and Cloud technologies. This performance impact is likely due to metadata validation, versioning, and hierarchical data handling. The results confirm memory consumption remains stable, GPU consumption is minimal, and the governance mechanisms are predominantly CPU-dependent.

As for RQ3, despite additional preprocessing and validation steps required for applying governance models, the results indicate that system performance and execution stability are not materially reduced and remain within acceptable limits across multiple runs. The analysis shows relational databases exhibit greater stability in execution time compared to Big Data and Cloud technologies, which tend to show moderate variability due to their distributed and elastic characteristics. While the introduction of governance models adds some system variability, this remains within operational limits for critical financial systems.

A central conclusion of the study is that data governance should not be viewed solely as a factor of performance overhead, but as a fundamental instrument for risk mitigation and regulatory compliance. Although governance introduces computational overhead, it remains predictable and manageable when supported by a properly designed hybrid architecture. The benefits for data protection, traceability, and compliance appear to outweigh the additional operational costs, especially in highly regulated financial environments.

From a practical standpoint, financial institutions can balance performance, cost, and compliance by adopting hybrid architectures where transactional data remains in relational systems, large-scale analytics are handled by Big Data platforms and Cloud environments allow elasticity for peak workloads.

6.1. Practical Outcomes and Implementation Recommendations

In modern hybrid financial systems, governance should not be a one-size-fits-all model across the whole architecture. It should be adapted to each technological layer, taking into account its unique characteristics. When properly applied, encryption and incremental versioning work well in relational systems without compromising stability. In Big Data and Cloud environments, by contrast, the flexibility of the platforms allows for more selective governance implementation. When correctly configured, pseudonymization and structural validation can reduce performance impact, especially in large-scale processing contexts.

One notable finding with implications for infrastructure design is that CPU utilization is the principal cost component associated with governance measures, while memory consumption remains relatively stable. The findings suggest that, rather than concentrating efforts on reserving extra memory, configuration should focus on parallelization strategies and prioritized processing capacity. Structured, semi-structured and unstructured datasets have different behavioral patterns when subject to governance controls, and a uniform approach may lead to avoidable inefficiencies.

The study also showed that data type classification should be established at the architectural design stage. Applying a uniform governance approach across distinct data types risks measurable degradation in both performance and stability. When classification is done properly, governance can be adjusted accordingly, supporting both performance and regulatory needs.
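One way to make such a classification actionable is a lookup from data type and sensitivity to the set of controls to apply. The Python sketch below is purely illustrative: the mapping in `POLICY_MAP` is a hypothetical example, not a recommendation derived from the measurements, and an institution would define its own from a risk assessment.

```python
from enum import Enum


class DataType(Enum):
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


# Hypothetical mapping of (data type, sensitive?) to governance controls.
POLICY_MAP = {
    (DataType.STRUCTURED, True): ("encryption", "pseudonymization", "versioning"),
    (DataType.STRUCTURED, False): ("versioning",),
    (DataType.SEMI_STRUCTURED, True): ("pseudonymization", "versioning"),
    (DataType.SEMI_STRUCTURED, False): ("versioning",),
    (DataType.UNSTRUCTURED, True): ("pseudonymization",),
    (DataType.UNSTRUCTURED, False): (),
}


def select_policies(data_type: DataType, sensitive: bool) -> tuple:
    """Return the controls configured for a classified dataset."""
    return POLICY_MAP[(data_type, sensitive)]


chosen = select_policies(DataType.SEMI_STRUCTURED, sensitive=True)
```

Keeping the mapping in one declarative table means the governance level can be reviewed and changed without touching the processing code.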

6.2. Methodological Constraints

Although the results provide a solid empirical perspective on the impact of data governance mechanisms in RDBMS–Big Data–Cloud hybrid architectures, methodological limitations should be considered while interpreting them. These limitations do not invalidate the conclusions reached; rather, they define the scope of their applicability.

While the Google Colab environment supports reproducibility, it is also a methodological constraint, because it only partially captures real-world operational complexities such as network latency, cross-cluster contention, workload interference and distributed storage variability.

In addition, the study is constrained by the specific range of governance measures selected for analysis. While encryption, pseudonymization and incremental versioning were chosen as illustrative of core compliance-oriented controls in financial systems, other mechanisms, including granular access control schemes, adaptive governance policies and automated compliance monitoring tools, were not analyzed. These elements may introduce different operational characteristics and performance effects.

The evaluation is based predominantly on aggregated indicators (such as mean execution time and average resource utilization) calculated over multiple experimental runs. Although this approach enhances statistical stability, it may reduce sensitivity to outlier events or temporary latency increases, which are highly relevant in latency-sensitive financial systems.

The synthetic nature of the datasets used represents an additional limitation. Although they were designed to reflect the structural characteristics of real financial data, they do not fully capture the semantic complexity, historical dependencies and heterogeneity of production environments. The results should therefore be interpreted as context-dependent indicators rather than universally transferable metrics.

In addition, the study does not systematically examine the economic implications of governance implementation, such as Cloud resource expenditure or key and version management overhead. While such aspects are clearly pertinent to institutional decision-making, they remain outside the defined scope of this research.

6.3. Future Research Directions

The results establish a basis for continued investigation into data governance within hybrid financial architectures. Future research could examine adaptive governance models capable of dynamically adjusting protection and oversight mechanisms in response to contextual factors. Such models may align the intensity of encryption, pseudonymization or version control with data sensitivity, analytical objectives or latency constraints.

Further refinement of the framework may also involve incorporating financial and energy consumption indicators into the assessment of governance impact, thus enabling a more comprehensive evaluation of performance and sustainability considerations. Establishing explicit links between system performance, infrastructure costs and energy usage in Cloud-based environments would allow for more extensive cost–benefit evaluations, and provide support in long-term strategic planning within regulated financial contexts.

Given the empirical results, a further line of inquiry may concern the interaction between governance mechanisms and advanced analytics and machine learning workflows. A closer examination of how encryption, pseudonymization and incremental versioning influence both performance and predictive accuracy would allow for a direct correlation between data governance and the analytical value generated.

Validating the proposed architecture in real or semi-real operational environments represents another important direction for future research. Including variables like workload contention, network latency fluctuations, layered access control configurations and legacy system integration would enhance the external validity of the study; at the same time, it may reveal operational limitations that remain less visible in controlled experimental settings.

Finally, the development of lakehouse architectures and event-driven processing systems offers additional opportunities to expand the analytical framework presented. Future research may include investigating the effective integration of governance mechanisms within these emerging architectures, while maintaining an appropriate balance between performance, scalability and compliance.

Author Contributions

Conceptualization, S.-A.I.; data curation, S.-A.I. and V.D.; formal analysis, S.-A.I., V.D., A.-O.R., L.G.D. and I.N.; funding acquisition, S.-A.I. and V.D.; investigation, S.-A.I., V.D., A.-O.R., L.G.D. and I.N.; methodology, S.-A.I., V.D., A.-O.R., L.G.D. and I.N.; project administration, S.-A.I. and V.D.; resources, S.-A.I., V.D., A.-O.R., L.G.D. and I.N.; software, S.-A.I., V.D., A.-O.R., L.G.D. and I.N.; supervision, V.D.; validation, S.-A.I., V.D., A.-O.R., L.G.D. and I.N.; writing—original draft preparation, S.-A.I.; writing—review and editing, S.-A.I., V.D., A.-O.R., L.G.D. and I.N. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was co-financed by The Bucharest University of Economic Studies (ASE) during the PhD program.

Data Availability Statement

Scripts used for data creation and solution testing are available on request from the first author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Rizvi, S.K.A.; Rahat, B.; Naqvi, B.; Umar, M. Revolutionizing finance: The synergy of fintech, digital adoption, and innovation. Technol. Forecast. Soc. Change 2024, 200, 123112.
2. Gandomi, A.; Haider, M. Beyond the hype: Big data concepts, methods, and analytics. Int. J. Inf. Manag. 2015, 35, 137–144.
3. Hoofnagle, C.J.; van der Sloot, B.; Borgesius, F.Z. The European Union general data protection regulation: What it is and what it means. Inf. Commun. Technol. Law 2019, 28, 65–98.
4. State of California. California Consumer Privacy Act of 2018, Civil Code, Title 1.81.5. Official Legislative Text. 2018. Available online: https://cppa.ca.gov/regulations/pdf/ccpa_statute.pdf (accessed on 1 December 2025).
5. Miller, K.M.; Lukic, K.; Skiera, B. The impact of the General Data Protection Regulation (GDPR) on online tracking. Int. J. Res. Mark. 2026, 43, 48–70.
6. Sharma, R.K.; Bharathy, G.; Karimi, F.; Mishra, A.V.; Prasad, M. Thematic Analysis of Big Data in Financial Institutions Using NLP Techniques with a Cloud Computing Perspective: A Systematic Literature Review. Information 2023, 14, 577.
7. Cheng, M.; Qu, Y.; Jiang, C.; Zhao, C. Is cloud computing the digital solution to the future of banking? J. Financ. Stab. 2022, 63, 101073.
8. Gąsiorkiewicz, L.; Monkiewicz, J. Digital Finance and the Future of the Global Financial System; Routledge: Oxfordshire, UK, 2022.
9. Zhang, C.; Li, G.; Zhang, J.; Zhang, X.; Feng, J. HTAP Databases: A Survey. IEEE Trans. Knowl. Data Eng. 2024, 36, 6410–6429.
10. Program on International Financial Systems. Data Localization, Cloud Adoption, and the Financial Sector. 2024. Available online: https://www.pifsinternational.org/wp-content/uploads/2024/07/Report-on-Data-Localization-07.29.2024.pdf (accessed on 1 December 2025).
11. Dong, H.; Zhang, C.; Li, G.; Zhang, H. Cloud-Native Databases: A Survey. IEEE Trans. Knowl. Data Eng. 2024, 36, 7772–7791.
12. Gray, J. The Transaction Concept: Virtues and Limitations. In Proceedings of the Seventh International Conference on Very Large Data Bases, Cannes, France, 9–11 September 1981; Available online: https://jimgray.azurewebsites.net/papers/thetransactionconcept.pdf (accessed on 1 December 2025).
13. Carvalho, M.; Sá, F.; Bernardino, J. Evaluation of the Impact of AES Encryption on Query Read Performance Across Oracle, MySQL, and SQL Server Databases. Cryptography 2025, 9, 77.
14. Sun, B.; Zhao, S.; Tian, G. SQL queries over encrypted databases: A survey. Connect. Sci. 2024, 36, 2323059.
15. Rao, A.; Khankhoje, D.; Namdev, U.; Bhadane, C.; Dongre, D. Insights into NoSQL databases using financial data: A comparative analysis. Procedia Comput. Sci. 2022, 215, 8–23.
16. Pokorný, J. Integration of Relational and NoSQL Databases. Vietnam. J. Comput. Sci. 2019, 6, 389–405.
17. de la Vega, A.; García-Saiz, D.; Blanco, C.; Zorrilla, M.; Sánchez, P. Mortadelo: Automatic generation of NoSQL stores from platform-independent data models. Future Gener. Comput. Syst. 2020, 105, 455–474.
18. Deka, G.C. NoSQL Polyglot Persistence. In Advances in Computers; Elsevier: Amsterdam, The Netherlands, 2018; pp. 357–390.
19. Morabito, G.; Galletta, A.; Celesti, A.; Villari, M.; Fazio, M. Enhancing Data Privacy in Federated Data Spaces: Hierarchical Secret-Share based Pseudonymization. In Proceedings of the IEEE/ACM 16th International Conference on Utility and Cloud Computing, Taormina, Italy, 4–7 December 2023; ACM: New York, NY, USA, 2023; pp. 1–2.
20. Sankhe, M.; Mangla, M.; Shah, P.; Doshi, P.; Shah, J.; Vyas, D. A Unified Distributed Version Control System for SQL and Graph Databases. In Proceedings of the 2025 9th International Conference on Computing, Communication, Control and Automation (ICCCBEA), Pune, India, 22–23 August 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–7.
21. Rao, T.R.; Mitra, P.; Bhatt, R.; Goswami, A. The big data system, components, tools, and technologies: A survey. Knowl. Inf. Syst. 2019, 60, 1165–1245.
22. Hussain, K.; Prieto, E. Big Data in the Finance and Insurance Sectors. In New Horizons for a Data-Driven Economy; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 209–223.
23. Zaharia, M.; Xin, R.S.; Wendell, P.; Das, T.; Armbrust, M.; Dave, A.; Meng, X.; Rosen, J.; Venkataraman, S.; Franklin, M.J.; et al. Apache Spark. Commun. ACM 2016, 59, 56–65.
24. Schneider, J.; Gröger, C.; Lutsch, A.; Schwarz, H.; Mitschang, B. The Lakehouse: State of the Art on Concepts and Technologies. SN Comput. Sci. 2024, 5, 449.
25. Harby, A.A.; Zulkernine, F. Data Lakehouse: A survey and experimental study. Inf. Syst. 2025, 127, 102460.
26. Armbrust, M.; Das, T.; Sun, L.; Yavuz, B.; Zhu, S.; Murthy, M.; Torres, J.; van Hovell, H.; Ionescu, A.; Luszczak, A.; et al. Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proc. VLDB Endow. 2020, 13, 3411–3424.
27. Apache Software Foundation. Apache Hadoop Documentation: Transparent Encryption in HDFS. 2026. Available online: https://hadoop.apache.org/docs/r3.4.3/hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html (accessed on 1 December 2025).
28. Apache Software Foundation. Hadoop Key Management Server (KMS)—Documentation Sets. 2026. Available online: https://hadoop.apache.org/docs/stable/hadoop-kms/index.html (accessed on 1 December 2025).
29. Apache Software Foundation. Apache Spark Documentation: Security. 2026. Available online: https://spark.apache.org/docs/latest/security.html (accessed on 1 December 2025).
30. Apache Software Foundation. Timeline|Apache Hudi. 2026. Available online: https://hudi.apache.org/docs/next/timeline/ (accessed on 1 December 2025).
31. Apache Software Foundation. Apache Iceberg Documentation: Spark Queries—Time Travel. 2026. Available online: https://iceberg.apache.org/docs/latest/spark-queries/ (accessed on 1 December 2025).
32. Ogundipe, D.O. Conceptualizing cloud computing in financial services: Opportunities and challenges in Africa-US contexts. Comput. Sci. It Res. J. 2024, 5, 757–767.
33. European Banking Authority. Guidelines on Outsourcing Arrangements (EBA/GL/2019/02). Official Guideline. 2019. Available online: https://www.eba.europa.eu/sites/default/files/documents/10180/2551996/38c80601-f5d7-4855-8ba3-702423665479/EBA%20revised%20Guidelines%20on%20outsourcing%20arrangements.pdf (accessed on 1 December 2025).
34. European Central Bank. Guide on Outsourcing Cloud Services to Cloud Service Providers. Official Supervisory Guide. 2025. Available online: https://www.bankingsupervision.europa.eu/ecb/pub/pdf/ssm.supervisory_guides202507.en.pdf (accessed on 1 December 2025).
35. Gajbhiye, A.; Shrivastva, K.M.P. Cloud computing: Need, enabling technology, architecture, advantages and challenges. In Proceedings of the 2014 5th International Conference—Confluence The Next Generation Information Technology Summit (Confluence), Noida, India, 25–26 September 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 1–7.
36. Huaman, C.H.O.; Fuster, N.F.; Luyo, A.C.; Armas-Aguirre, J. Critical Data Security Model: Gap Security Identification and Risk Analysis In Financial Sector. In Proceedings of the 2022 17th Iberian Conference on Information Systems and Technologies (CISTI), Madrid, Spain, 22–25 June 2022; IEEE: Piscataway, NJ, USA, 2022.
Shaik, V.; K., N. Cloud databases: A resilient and robust framework to dissolve vendor lock-in. Softw. Impacts 2024, 21, 100680. [Google Scholar] [CrossRef]
Bolognini, L.; Bistolfi, C. Pseudonymization and impacts of Big (personal/anonymous) Data processing in the transition from the Directive 95/46/EC to the new EU General Data Protection Regulation. Comput. Law Secur. Rev. 2017, 33, 171–181. [Google Scholar] [CrossRef]
Lee, S.; Kim, Y.; Kwon, Y.; Cho, S. Secure privacy-preserving record linkage system from re-identification attack. PLoS ONE 2025, 20, e0314486. [Google Scholar] [CrossRef] [PubMed]
Achanta, M.; Vuppu, D. Implementing Data Versioning and Lineage Tracking in ETL Workflows. Int. J. Sci. Res. (IJSR) 2025, 14, 1312–1315. [Google Scholar] [CrossRef]
European Union. Regulation (EU) 2022/2554 of the European Parliament and of the Council of 14 December 2022 on Digital Operational Resilience for the Financial Sector. Official EUR-Lex Text. 2022. Available online: https://eur-lex.europa.eu/eli/reg/2022/2554/oj/eng (accessed on 1 December 2025).
Amazon Web Services. How S3 Versioning Works. 2026. Available online: https://docs.aws.amazon.com/AmazonS3/latest/userguide/versioning-workflows.html (accessed on 1 December 2025).
Microsoft. Blob Versioning. 2026. Available online: https://learn.microsoft.com/en-us/azure/storage/blobs/versioning-overview (accessed on 1 December 2025).
Google Cloud. Object Versioning. 2026. Available online: https://docs.cloud.google.com/storage/docs/object-versioning (accessed on 1 December 2025).
Armbrust, M.; Ghodsi, A.; Xin, R.; Zaharia, M. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. Proc. CIDR 2021, 8, 28. [Google Scholar]
Clemente-Castello, F.J.; Nicolae, B.; Katrinis, K.; Mustafa Rafique, M.; Mayo, R.; Fernandez, J.C.; Loreti, D. Enabling Big Data Analytics in the Hybrid Cloud Using Iterative MapReduce. In Proceedings of the 2015 IEEE/ACM 8th International Conference on Utility and Cloud Computing (UCC), Limassol, Cyprus, 7–10 December 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 290–299. [Google Scholar] [CrossRef]
Garad, A.; Riyadh, H.A.; Al-Ansi, A.M.; Beshr, B.A.H. Unlocking financial innovation through strategic investments in information management: A systematic review. Discov. Sustain. 2024, 5, 381. [Google Scholar] [CrossRef]
Eichler, R.; Giebler, C.; Gröger, C.; Schwarz, H.; Mitschang, B. Modeling metadata in data lakes: A generic model. Data Knowl. Eng. 2021, 136, 101931. [Google Scholar] [CrossRef]
Ionescu, S.A.; Radu, A.O. Assessment and Integration of Relational Databases, Big Data, and Cloud Computing in Financial Institutions: Performance Comparison. In Proceedings of the 2024 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), Craiova, Romania, 4–6 September 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–7. [Google Scholar] [CrossRef]
Apache Software Foundation. Spark SQL and DataFrames. 2025. Available online: https://spark.apache.org/docs/latest/sql-programming-guide.html (accessed on 1 December 2025).
Zhu, J.; Cheng, K.; Liu, J.; Guo, L. Full encryption. Proc. VLDB Endow. 2021, 14, 2811–2814. [Google Scholar] [CrossRef]
Nwatuzie, G.A.; Enyejo, L.A.; Umeaku, C. Enhancing Cloud Data Security Using a Hybrid Encryption Framework Integrating AES, DES, and RC6 with File Splitting and Steganographic Key Management. Int. J. Innov. Sci. Res. Technol. 2025, 10, 1555–1569. [Google Scholar] [CrossRef]
Stalla-Bourdillon, S. Identifiability, as a Data Risk: Is a Uniform Approach to Anonymisation About to Emerge in the EU? Eur. J. Risk Regul. 2025, 16, 1456–1474. [Google Scholar] [CrossRef]
Baumgartner, M.; Kreiner, K.; Wiesmüller, F.; Hayn, D.; Puelacher, C.; Schreier, G. Masketeer: An Ensemble-Based Pseudonymization Tool with Entity Recognition for German Unstructured Medical Free Text. Future Internet 2024, 16, 281. [Google Scholar] [CrossRef]
UCI Machine Learning. German Credit Risk. Kaggle Dataset Page, Modified 2016-12-14. 2016. Available online: https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data (accessed on 1 December 2025).
Kumar, J.A.; Ajay, K.; Kumar, P.N. Leveraging Social Media and Multitask Natural Language Processing to Detect Financial Consumer Complaints. Econ. Comput. Econ. Cybern. Stud. Res. 2025, 59, 145–162. [Google Scholar] [CrossRef]
Ionescu, S.A.; Diaconita, V. Transforming Financial Decision-Making: The Interplay of AI, Cloud Computing and Advanced Data Management Technologies. Int. J. Comput. Commun. Control 2023, 18, 1–19. [Google Scholar] [CrossRef]
Ionescu, S.A.; Diaconita, V.; Radu, A.O. Engineering Sustainable Data Architectures for Modern Financial Institutions. Electronics 2025, 14, 1650. [Google Scholar] [CrossRef]
Schreiner, G.A.; Knob, R.; Duarte, D.; Vilain, P.; Mello, R.d.S. NewSQL Through the Looking Glass. In Proceedings of the 21st International Conference on Information Integration and Web-Based Applications & Services, Munich, Germany, 2–4 December 2019; ACM: New York, NY, USA, 2019; pp. 361–369. [Google Scholar] [CrossRef]
European Parliament and the Council of the European Union. Regulation (EU) 2016/679 (General Data Protection Regulation). Official EUR-Lex Text. 2016. Available online: https://eur-lex.europa.eu/eli/reg/2016/679/oj (accessed on 1 December 2025).
European Data Protection Board. Recommendations 01/2020 on Measures That Supplement Transfer Tools to Ensure Compliance with the EU Level of Protection of Personal Data, Version 2.0. 2021. Available online: https://www.edpb.europa.eu/system/files/2021-06/edpb_recommendations_202001vo.2.0_supplementarymeasurestransferstools_en.pdf (accessed on 1 December 2025).
PostgreSQL Global Development Group. PostgreSQL Documentation: Transactions. 2026. Available online: https://www.postgresql.org/docs/current/tutorial-transactions.html (accessed on 1 December 2025).
Oracle Corporation. Oracle Database Reference: Unified Audit Trail. 2026. Available online: https://www.oracle.com/a/tech/docs/dbsec/unified-audit-best-practice-guidelines.pdf (accessed on 1 December 2025).
pgAudit Project. pgAudit. 2026. Available online: https://github.com/pgaudit/pgaudit/blob/main/README.md (accessed on 1 December 2025).
Arnal, J. The Banking Sector Is Increasingly Looking to the Cloud. CEPS Explainer, 2023-12. 2023. Available online: https://cdn.ceps.eu/wp-content/uploads/2023/11/UqkAUXOF-2023-12_CEPS-Explainer-banking-sector-looking-to-the-cloud.pdf (accessed on 1 December 2025).
Odebrecht, C. Research Data Governance. The Need for a System of Cross-organisational Responsibility for the Researcher’s Data Domain. Data Sci. J. 2025, 24, 12. [Google Scholar] [CrossRef]
Tun, M.T.; Nyaung, D.E.; Phyu, M.P. Performance Evaluation of Intrusion Detection Streaming Transactions Using Apache Kafka and Spark Streaming. In Proceedings of the 2019 International Conference on Advanced Information Technologies (ICAIT), Yangon, Myanmar, 6–7 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 25–30. [Google Scholar] [CrossRef]
Luo, C.; Cao, Q.; Li, T.; Chen, H.; Wang, S. MapReduce accelerated attribute reduction based on neighborhood entropy with Apache Spark. Expert Syst. Appl. 2023, 211, 118554. [Google Scholar] [CrossRef]
Runsewe, O.; Samaan, N. Cloud Resource Scaling for Time-Bounded and Unbounded Big Data Streaming Applications. IEEE Trans. Cloud Comput. 2021, 9, 504–517. [Google Scholar] [CrossRef]
Westerlund, M.; Hedlund, U.; Pulkkis, G.; Björk, K.M. A Generalized Scalable Software Architecture for Analyzing Temporally Structured Big Data in the Cloud. In Advances in Intelligent Systems and Computing; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 559–569. [Google Scholar] [CrossRef]
Rozony, F.Z.; Aktar, M.N.A.; Ashrafuzzaman, M.; Islam, A. A systematic review of big data integration challenges and solutions for heterogeneous data sources. Acad. J. Bus. Adm. Innov. Sustain. 2024, 4, 1–18. [Google Scholar] [CrossRef]
Putrama, I.M.; Martinek, P. Heterogeneous data integration: Challenges and opportunities. Data Brief 2024, 56, 110853. [Google Scholar] [CrossRef] [PubMed]

Figure 1. The experimental methodological framework for performance analysis in data governance scenarios.

Figure 2. Governed RDBMS architecture for encryption, pseudonymization, and logical audit.

Figure 3. Big Data architecture for encryption, pseudonymization, and incremental versioning.

Figure 4. Cloud governance architecture for EU–non-EU cross-border scenarios.

Figure 5. Data governance mechanisms in hybrid financial architectures.

Figure 6. Average CPU utilization by technology and data type.

Figure 7. Average total execution time for all technologies and data types.

Table 1. Comparative synthesis of data governance mechanisms in RDBMS, Big Data, and Cloud.

Governance Dimension	RDBMS (Relational Systems)	Big Data (Hadoop/Spark)	Cloud (EU vs Non-EU)
Architectural role	Initial control and governance layer, where data are validated and protected at ingestion	Parallel processing and large-scale audit layer	Elastic analytics layer with controlled geographic replication
Governance application stage	At the transaction level, before data leave operational systems	During preprocessing, prior to data distribution across the cluster	Only after pseudonymization and compliance validation
Encryption	Fine-grained column-level encryption and/or TDE using standard algorithms (e.g., AES-256)	Encryption in transit and at rest via KMS (HDFS-KMS, TLS)	Mandatory encryption prior to transfer, keys remain on-premises or within the EU
Pseudonymization	Deterministic hashing or tokenization directly in the RDBMS engine	Applied in ETL pipelines or Spark jobs before persistence	Accepts only pseudonymized data, especially for Non-EU zones
Key management	Internally managed keys, integrated with database security policies	Centralized KMS shared at the cluster level	Keys are not transferred; they remain within the EU in cross-border scenarios
Data versioning	Logical versioning via audit tables, triggers, and transaction logs	Native incremental versioning (Delta Lake, Hudi, Iceberg)	Object versioning (S3 Versioning, Blob Snapshots) for replication and rollback
Audit and traceability	Deterministic, complete, transaction-level audit	Dataset- and job-level audit and lineage	Access-, replication-, and usage-oriented audit
Typical use cases	Sensitive, operational data with strict compliance requirements	Large-scale analytics, batch and streaming processing	Elastic analytics, reporting, and ML on pseudonymized data
Main limitations	Limited scalability and high costs at large volumes	Operational complexity and orchestration overhead	Jurisdictional constraints and indirect control over encryption keys
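The deterministic pseudonymization row in Table 1 can be illustrated with a short sketch. It assumes a keyed hash (HMAC-SHA256) as the deterministic function; the paper only states "deterministic hashing or tokenization", so the choice of HMAC, the function name `pseudonymize`, the demo key, and the sample identifiers are all hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a deterministic pseudonym: the same value and key always give the same token."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Hypothetical key for illustration only; in the architecture described in Table 1
# the key would remain on-premises or within the EU.
DEMO_KEY = b"demo-key-not-for-production"

token_a = pseudonymize("customer-0001", DEMO_KEY)
token_b = pseudonymize("customer-0001", DEMO_KEY)  # same input, same token
token_c = pseudonymize("customer-0002", DEMO_KEY)  # different input, different token
```

Because the mapping is deterministic, pseudonymized records can still be joined across layers on the token, while re-identification requires access to the key.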

Table 2. Synthesis of integration strategies and governance mechanisms in the RDBMS–Big Data–Cloud hybrid architecture.

Architectural Layer	Managed Data Types	Integration Strategies	Applied Governance Mechanisms	Primary Role
Data sources	Raw, transactional, and personal data	Native connectors, APIs, and controlled ingestion	Initial classification and semantic validation	Legal boundary and single point of entry
Relational governance (RDBMS)	Personal data	Controlled ETL and CDC	AES-256 encryption, deterministic pseudonymization, and logical audit	Primary legal and operational control
Integration layer	Batch and streaming flows	JDBC, ETL, Kafka, and CDC	Sensitive data filtering and separation of personal vs non-personal data	Control of data propagation
Big Data platforms	Pseudonymized datasets	Distributed batch/streaming processing	Distributed audit, lineage, and incremental versioning	Analytical traceability and consistency
EU Cloud	Pseudonymized data	Controlled replication	Locally managed keys and snapshots	Compliant elastic analytics
Non-EU Cloud	Pseudonymized data only	Selective replication	No key access and read-only access	Global scaling without legal risk
Reconciliation and metadata	Metadata and versions	Incremental synchronization	Consistency validation, and version correlation	End-to-end integrity
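The cloud rows of Tables 1 and 2 can be read as a simple replication rule: only encrypted, pseudonymized datasets leave the on-premises layers, and the Non-EU zone receives neither key access nor write access. The following is a minimal sketch of that reading, not the authors' implementation; the class names, the `plan_replication` function, and the assumption that the EU zone is writable are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dataset:
    name: str
    pseudonymized: bool
    encrypted: bool


@dataclass(frozen=True)
class ReplicationPlan:
    zone: str
    key_access: bool
    read_only: bool


def plan_replication(dataset: Dataset, zone: str) -> Optional[ReplicationPlan]:
    """Return a replication plan, or None when the dataset may not be replicated."""
    if zone not in ("eu", "non-eu"):
        raise ValueError(f"unknown zone: {zone}")
    # Tables 1 and 2: cloud layers accept only encrypted, pseudonymized data.
    if not (dataset.pseudonymized and dataset.encrypted):
        return None
    if zone == "eu":
        # Assumption: the EU zone is writable and uses locally managed keys.
        return ReplicationPlan(zone, key_access=True, read_only=False)
    # Non-EU zone: no key access and read-only access (Table 2).
    return ReplicationPlan(zone, key_access=False, read_only=True)


raw = Dataset("transactions_raw", pseudonymized=False, encrypted=True)
safe = Dataset("transactions_pseudo", pseudonymized=True, encrypted=True)

blocked = plan_replication(raw, "non-eu")
non_eu_plan = plan_replication(safe, "non-eu")
eu_plan = plan_replication(safe, "eu")
```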

Table 3. Example of structured data.

Status of Existing Checking Account	Duration (Months)	Credit History	Purpose	Credit Amount	Employment Since	Credit Worthiness
A13	52	A32	A48	6104	A72	2
A14	58	A31	A45	19,673	A72	2
A11	16	A30	A49	13,160	A75	1
A13	26	A31	A40	13,326	A73	2
A13	40	A30	A45	5163	A71	2
A14	45	A32	A41	12,683	A75	2
A11	56	A33	A44	10,397	A75	2
A11	44	A34	A45	1512	A74	2
A13	44	A32	A46	19,066	A74	2

Table 4. Example of semi-structured data.

{"Status of existing checking account":"A13","Duration in month":52,"Credit history":"A32","Purpose":"A48","Credit amount":6104,"Present employment since":"A72","Installment rate in percentage of disposable income":1,"Personal status and sex":"A95","Other debtors guarantors":"A103","Present residence since":1,"Property":"A124","Age in years":41,"Other installment plans":"A143","Housing":"A153","Number of existing credits at this bank":1,"Job":"A171","Number of people being liable to provide maintenance for":1,"Telephone":"A191","Foreign worker":"A202","Creditworthiness":2}

Table 5. Example of unstructured data.

“The customer has a checking account status of A13, with a loan duration of 52 months. The credit history is A32, and the purpose of the loan is A48. The loan amount is 6104 units. The customer has been employed for A72 and has a job categorized as A171. The customer resides in housing type A153 and owns property of type A124. The customer has 1 existing credit(s) and is classified as a foreign worker.”
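Tables 3 to 5 show the same credit record in three representations. As an illustration of how such representations could be derived from one structured row, the sketch below serializes a record to JSON and renders it through a sentence template modeled on the Table 5 example. The paper does not describe its conversion procedure, so the function names and the truncated template are assumptions; the field values come from the first row of Table 3.

```python
import json

# First row of Table 3, keyed with the field names used in Table 4.
record = {
    "Status of existing checking account": "A13",
    "Duration in month": 52,
    "Credit history": "A32",
    "Purpose": "A48",
    "Credit amount": 6104,
    "Present employment since": "A72",
}


def to_semi_structured(rec: dict) -> str:
    """Serialize the structured record as a JSON document."""
    return json.dumps(rec)


def to_unstructured(rec: dict) -> str:
    """Render the record as free text, following the wording of the Table 5 example."""
    return (
        f"The customer has a checking account status of "
        f"{rec['Status of existing checking account']}, with a loan duration of "
        f"{rec['Duration in month']} months. The credit history is "
        f"{rec['Credit history']}, and the purpose of the loan is {rec['Purpose']}. "
        f"The loan amount is {rec['Credit amount']} units."
    )


json_doc = to_semi_structured(record)
text_doc = to_unstructured(record)
```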

Table 6. Comparative performance results for the ‘All Policies’ scenario (1,000,000 records; n = 50 runs per configuration).

Technology	Data Type	CPU (%) Mean ± SD	Memory (GB) Mean ± SD	Time (s) Mean ± SD	Overhead vs. Baseline
RDBMS	Structured	2.63 ± 0.08	0.0535 ± 0.0016	4.30 ± 0.13	+131% (baseline: 1.86 s)
Big Data	Structured	0.13 ± 0.01	0.0006 ± 0.00003	2.54 ± 0.13	+34% (baseline: 1.89 s)
Big Data	Semi-structured	1.46 ± 0.07	0.0005 ± 0.00003	5.23 ± 0.26	+45% (baseline: 3.62 s)
Big Data	Unstructured	1.12 ± 0.06	≈0	1.75 ± 0.09	+354% (baseline: 0.39 s)
Cloud	Structured	0.20 ± 0.01	≈0	2.61 ± 0.18	+33% (baseline: 1.96 s)
Cloud	Semi-structured	2.17 ± 0.15	≈0	6.31 ± 0.44	+42% (baseline: 4.43 s)
Cloud	Unstructured	1.56 ± 0.11	≈0	2.97 ± 0.21	+63% (baseline: 1.82 s)

Table 7. Kruskal–Wallis non-parametric test of governance impact on execution time.

Technology	Data Type	Baseline Mean ± SD (s)	All Policies Mean ± SD (s)	Δ (%)	H-Statistic	p-Value
RDBMS	Structured	1.86 ± 0.06	4.30 ± 0.13	+131	82.4	<0.001
Big Data	Structured	1.89 ± 0.09	2.54 ± 0.13	+34	53.2	<0.001
Big Data	Semi-structured	3.62 ± 0.18	5.23 ± 0.26	+45	67.8	<0.001
Big Data	Unstructured	0.39 ± 0.02	1.75 ± 0.09	+354	89.3	<0.001
Cloud	Structured	1.96 ± 0.14	2.61 ± 0.18	+33	61.5	<0.001
Cloud	Semi-structured	4.43 ± 0.31	6.31 ± 0.44	+42	79.2	<0.001
Cloud	Unstructured	1.82 ± 0.13	2.97 ± 0.21	+63	71.6	<0.001
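For readers who want to reproduce this kind of test, the following is a minimal, dependency-free sketch of the Kruskal–Wallis H statistic with the standard tie correction. The two samples are synthetic draws parameterized by the RDBMS means and standard deviations in Table 7, not the paper's measurements, and the table does not state how many scenario groups enter each reported H; the two-group setup, the seed, and the function names are assumptions.

```python
import math
import random
from itertools import chain


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic using average ranks and the tie correction."""
    pooled = sorted(chain.from_iterable(groups))
    n_total = len(pooled)
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    rank_sum_term = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


def p_value_two_groups(h):
    """Chi-square survival function for 1 degree of freedom (two groups only)."""
    return math.erfc(math.sqrt(h / 2))


# Synthetic execution times (seconds); NOT the measurements behind Table 7.
rng = random.Random(42)
baseline = [rng.gauss(1.86, 0.06) for _ in range(50)]
all_policies = [rng.gauss(4.30, 0.13) for _ in range(50)]

h_stat = kruskal_wallis_h(baseline, all_policies)
p_val = p_value_two_groups(h_stat)
```

When the two samples do not overlap at all, as here, H reaches the largest value a two-group comparison with 50 runs per group can take, approximately 74.26, and the chi-square approximation gives p < 0.001.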

Table 8. Performance impact relative to the Baseline scenario (1,000,000 records).

Data Type	Scenario	Baseline Mean Time (s)	Scenario Mean Time (s)	Δ Time (%)	Δ CPU
RDBMS
Structured	Encryption	1.862	2.547	+36.8	+0.426
Structured	Pseudonymization	1.862	2.869	+54.1	+0.740
Structured	Versioning	1.862	2.510	+34.8	+0.296
Structured	All Policies	1.862	4.302	+131.0	+1.535
Big Data
Structured	Encryption	1.893	2.006	+6.0	+0.117
Structured	Pseudonymization	1.893	2.363	+24.8	+0.005
Structured	Versioning	1.893	2.038	+7.6	0.000
Structured	All Policies	1.893	2.536	+34.0	+0.123
Semi-structured	Encryption	3.619	5.113	+41.3	+1.460
Semi-structured	Pseudonymization	3.619	3.617	−0.1	0.000
Semi-structured	Versioning	3.619	3.757	+3.8	0.000
Semi-structured	All Policies	3.619	5.233	+44.6	+1.460
Unstructured	Encryption	0.385	2.228	+478.3	+1.152
Unstructured	Pseudonymization	0.385	0.609	+57.9	+0.001
Unstructured	Versioning	0.385	0.431	+11.8	+0.002
Unstructured	All Policies	0.385	1.749	+353.9	+1.118
Cloud
Structured	Encryption	1.962	2.116	+7.9	+0.114
Structured	Pseudonymization	1.962	2.459	+25.4	+0.005
Structured	Versioning	1.962	2.148	+9.5	+0.001
Structured	All Policies	1.962	2.614	+33.2	+0.123
Semi-structured	Encryption	4.431	6.192	+39.8	+1.661
Semi-structured	Pseudonymization	4.431	4.416	−0.3	+0.021
Semi-structured	Versioning	4.431	4.581	+3.4	+0.027
Semi-structured	All Policies	4.431	6.305	+42.3	+1.688
Unstructured	Encryption	1.818	2.758	+51.7	+1.171
Unstructured	Pseudonymization	1.818	1.063	−41.5	−0.057
Unstructured	Versioning	1.818	1.685	−7.3	−0.003
Unstructured	All Policies	1.818	2.966	+63.2	+1.138

Note: Δ CPU is reported in absolute CPU units relative to the baseline scenario.
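The Δ Time (%) column in Table 8 is consistent with the usual relative-change formula, (scenario − baseline) / baseline × 100. The sketch below recomputes two entries from the mean times in the table; the function name is hypothetical, and the formula is inferred from the reported values rather than stated in the table.

```python
def relative_overhead(baseline_s: float, scenario_s: float) -> float:
    """Relative change in execution time versus the baseline, in percent."""
    return (scenario_s - baseline_s) / baseline_s * 100.0


# RDBMS, structured data, All Policies (Table 8: +131.0)
rdbms_all_policies = round(relative_overhead(1.862, 4.302), 1)

# Cloud, unstructured data, Pseudonymization (Table 8: -41.5)
cloud_unstructured_pseudo = round(relative_overhead(1.818, 1.063), 1)
```

A negative value, as in the second case, means the governed scenario ran faster than the baseline in the reported measurements.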

Table 9. Correlation of governance policies with applications in the financial domain.

Governance Policy	Performance Impact	Data Quality Impact	System Stability	Financial Applications
No governance (Baseline)	Maximum performance; minimal overhead	High quality; no protection	Very high	Exploratory analytics; internal prototyping
Encryption	Moderate CPU overhead; increased execution time	Full confidentiality	High	Core banking systems; regulated reporting
Pseudonymization	Moderate overhead; good scalability	Semantic quality preserved	High	Credit scoring; risk analysis
Incremental versioning	Moderate increase in execution time	Full traceability and auditability	Medium to high	Audit; compliance; historical reconstruction
All policies enabled	High cumulative overhead; scalable in hybrid architectures	Maximum security and data quality	High (RDBMS); medium (Cloud)	Mission-critical financial systems

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
