Article

Engineering Sustainable Data Architectures for Modern Financial Institutions †

by Sergiu-Alexandru Ionescu, Vlad Diaconita *,‡ and Andreea-Oana Radu *,‡
Department of Economic Informatics and Cybernetics, Bucharest University of Economic Studies, 010374 Bucharest, Romania
* Authors to whom correspondence should be addressed.
This article is a revised and expanded version of a paper entitled “Assessment and Integration of Relational Databases, Big Data, and Cloud Computing in Financial Institutions: Performance Comparison”, which was presented at International Conference on Innovations in Intelligent Systems and Applications (INISTA) in Craiova, Romania, in September 2024.
Current address: Academia de Studii Economice Bucuresti, Calea Dorobantilor nr. 15-17, Sector 1, 010552 Bucharest, Romania.
Electronics 2025, 14(8), 1650; https://doi.org/10.3390/electronics14081650
Submission received: 13 March 2025 / Revised: 10 April 2025 / Accepted: 13 April 2025 / Published: 19 April 2025

Abstract

Modern financial institutions now manage increasingly advanced data-related activities and place a growing emphasis on environmental and energy impacts. In financial modeling, relational databases, big data systems, and the cloud are integrated, taking into consideration resource optimization and sustainable computing. We suggest a four-layer architecture to address financial data processing issues. The layers of our design are for data sources, data integration, processing, and storage. Data ingestion processes market feeds, transaction records, and customer data. Real-time data are captured by Kafka and transformed by Extract-Transform-Load (ETL) pipelines. The processing layer is composed of Apache Spark for real-time data analysis, Hadoop for batch processing, and a Machine Learning (ML) infrastructure that supports predictive modeling. In order to optimize access patterns, the storage layer includes various data layer components. The test results indicate that real-time processing of market data, compliance reporting, risk evaluations, and customer analyses can be conducted while fulfilling environmental sustainability goals. The metrics from the test deployment support the implementation strategies and technical specifications of the architectural components. We also examined integration models and data flow improvements, with applications in finance. This study aims to enhance enterprise data architecture in the financial context and includes guidance on modernizing data infrastructure.

1. Introduction

Advancements in artificial intelligence, cloud computing, mobile technologies, and the blockchain have transformed the financial services industry [1]. These innovations, part of the FinTech movement, have become quite popular in recent years and are an important part of the regulatory and policy-making process. In today’s business environment, financial institutions cannot afford to ignore digital capabilities focused on data analytics, automation, and customer experience if they wish to improve their competitive positioning and foster innovation [2].
Indeed, in contemporary financial contexts, institutions are faced with ever-expanding data volumes and data types (logs, documents, multimedia for fraud forensics, etc.) and increasingly stringent regulatory demands, rendering traditional relational databases insufficient for today’s heterogeneous data environments [3]. Although such databases offer robust schema enforcement and efficient handling of structured records, they often falter when required to process high-velocity or unstructured content. In contrast, big data frameworks and cloud-based services have emerged as formidable solutions to accommodate scale and complexity [4]. However, integrating these newer tools into established relational infrastructures is far from straightforward, particularly in large organizations that focus on compliance [5].
In this context, we use the term modern financial institutions for organizations that operate in the financial sector with a technology-driven approach that promotes innovation. Balancing performance, scalability, and compliance requirements is something traditional financial systems struggle to achieve in isolation. These players offer a wide range of financial services that have been transformed through technology and data-driven innovations. They provide digital banking services that process transactions in real time: automated loans with instant approvals; data-driven investment services, including robo-advisory and algorithmic trading; personalized insurance with automated claims processing; sophisticated capital market services; smart risk management solutions; digital personal finance tools; and a wide range of corporate as well as consumer-oriented financial services, some of which target customers not previously catered to by financial institutions [6]. Such firms use advanced databases together with big data and operate over a hybrid infrastructure that combines on-premise and cloud environments, along with real-time capabilities and a superior customer experience. They integrate regulatory compliance into their structure, maintain strong defenses against cyber attacks, and rely on agile operations to respond quickly.
Despite these advancements, there is evidence for the standalone value of relational systems, big data technologies, and cloud computing, but comprehensive studies of their combined deployment in finance remain scarce [7]. This shortfall becomes especially problematic under stringent data governance regimes, exemplified by the GDPR and evolving cross-border transfer protocols, which impose additional complexities on performance, scalability, and data sovereignty. Indeed, reconciling real-time analytics with strict regulatory mandates is an ongoing challenge for cross-border financial operations.
Although there is substantial literature on each of the three pillars (relational databases, big data frameworks, and cloud platforms), as well as research that addresses combinations of these pillars [8], recent literature has highlighted the fragmentation of data architectures in financial institutions and the lack of a unified approach. Several banks have separate systems. For example, companies can have a traditional data warehouse for reporting and business intelligence and a separate data lake for raw and minimally processed data. Some may even have a new cloud stack for digital channels or new innovation projects. It is common to find “fragmented data warehouses and data lakes” in banking, where old and new platforms are managed simultaneously without full integration. In such situations, companies may face duplicate costs for storage, processing resources, software licensing, and maintenance across multiple environments. Separate environments can also lead to duplicate data. Even more concerning is the likelihood of security and compliance problems, since when data sit in silos, governance becomes harder. For example, a European bank may hold customer transactions in an on-premise Oracle database, export subsets to a Hadoop-based risk analytics system, and then copy the data a third time to a cloud data lake for AI-related projects. Clearly, tracking lineage and ensuring that all copies comply with GDPR erasure requests or audit-trail requirements is quite difficult.
Researchers have advocated a unified, integrated framework while noting significant gaps—particularly in performance benchmarking and scalability testing under mixed workloads. Traditionally, research has examined one environment at a time (for instance, benchmarking the database or benchmarking the Hadoop cluster). Compliance discussions also usually treat cloud outsourcing and big data governance as separate topics. Few studies have examined architectures that combine all three, that is, an end-to-end financial data pipeline that captures transactions in an RDBMS, performs big data processing for analytics, and is deployed both on-premises and in the cloud.
Studies have shown that current database benchmarks do not accurately reflect the reality of financial workloads, which span different types of data and carry strict security requirements. This limitation highlights the necessity for more comprehensive research methodologies in the field. These studies recommend new benchmarking methodologies that take into account complex business logic, diverse data, and strong security—in other words, they note that financial use cases merge transactional and analytic system characteristics. A survey of bank data management reported that balancing and integrating traditional and new tools is imperative to meet dynamic business needs.
To address this research gap, this research considers how relational databases, big data architectures, and cloud infrastructures can be orchestrated to meet both operational demands and legal requirements. In doing so, it compares performance metrics, such as processing speed, scalability, and resource utilization, and explores the practical feasibility of an integrated multilayer solution. Although relational systems have historically served structured reporting needs, they offer limited scope for growing volumes of semistructured and unstructured content [5]. While big data and machine learning approaches provide deeper insight, they can escalate processing costs as datasets expand, prompting greater reliance on the scalability and flexibility of cloud platforms [7]. By proposing a multi-layer architecture and benchmarking potential integration strategies, this study offers new directions for modernizing legacy systems without compromising governance standards or incurring unsustainable costs.
Building on the results presented in [9], this extended research significantly expands the scope and depth of the analyses on the technical and regulatory requirements of financial institutions, showing how technologies, such as real-time streaming (Kafka) and distributed processing (Spark and Hadoop), can be orchestrated with standard relational systems in a multilayer hybrid architecture. Specifically, this research was guided by the following research questions:
  • RQ1: What are the main trends, challenges, and strategies in integrating relational databases, big data, and cloud computing in financial institutions based on a systematic review of the recent literature?
  • RQ2: How can financial institutions implement a hybrid cloud architecture that optimizes operational efficiency while ensuring compliance with EU data protection requirements in the context of EU-US data transfers?
  • RQ3: How do these technologies impact financial data management and analysis in terms of scalability, processing speed, and cost-effectiveness?
The following sections analyze the current literature on system integration, investigate practical implementation problems (especially with security and regulatory compliance), and offer an empirical evaluation of various platforms across diverse workloads.

2. Literature Review

2.1. Relational and NoSQL Databases in Financial Services

For many years, relational database management systems (RDBMS) have formed the backbone of most financial IT systems because they offer strong consistency (with ACID properties), reliability, and well-defined schemas. Studies show that RDBMS are still the predominant choice for operational finance data storage: around 80% of enterprises’ operational databases are still relational, and even new financial applications rely on RDBMS in some form about 70% of the time as new solutions emerge [10].
The widespread use of RDBMS is largely due to the characteristics of financial data. The transactional nature of financial data, such as payments, trades, and account balances, fits well in structured tables. Furthermore, regulations require accurate and auditable records, which RDBMS readily provide. For example, core banking systems and payment processing platforms traditionally run on a commercial RDBMS (Oracle, SQL Server, DB2, etc.) or a modern open-source SQL database to ensure that each transfer or trade is processed transactionally.
Relational databases have long been used for managing financial data, as they are capable of handling both Online Transaction Processing (OLTP), i.e., systems specialized in the efficient, real-time execution of transactional tasks such as order entries, payments, and account updates, and Online Analytical Processing (OLAP), i.e., systems designed to aggregate and analyze larger volumes of historical data to generate added value for business intelligence. In particular, OLAP systems can be used during risk assessment and strategic decision making [11,12,13]. Furthermore, state-of-the-art cloud-native OLTP and OLAP databases offer storage layer consistency, compute layer consistency, multilayer recovery, and HTAP optimization [14,15].
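To make the distinction concrete, the short Python sketch below contrasts the two workload styles using an in-memory SQLite database as a stand-in for any RDBMS; the table and column names are illustrative only and are not drawn from the systems discussed in this study.

```python
# Illustrative only: hypothetical table and column names, not from the study's schemas.
import sqlite3  # stand-in for any RDBMS driver

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, account TEXT, amount REAL, booked_at TEXT)")

# OLTP-style work: short, consistency-critical transactions (e.g., booking a payment).
with conn:  # wraps the statement in a single ACID transaction
    conn.execute("INSERT INTO payments (account, amount, booked_at) VALUES (?, ?, ?)",
                 ("RO49AAAA1B31007593840000", 120.50, "2025-01-15"))

# OLAP-style work: scanning and aggregating history for reporting or risk analysis.
report = conn.execute(
    "SELECT account, SUM(amount) AS exposure FROM payments GROUP BY account"
).fetchall()
print(report)
```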
However, despite these strengths, as data volumes and velocity increase, performance and scalability problems arise. Except for some products (such as Oracle RAC, which uses clustering technology to distribute workloads across nodes while maintaining data consistency and integrity [16]), traditional RDBMS are not built for the “dynamic and distributed” environment of modern IT architecture and for horizontal scaling.
In particular, high-frequency trading (HFT) systems create huge streams of tick data and demand microsecond latency. In these cases, systems such as in-memory databases or specialized hardware solutions (e.g., FPGA implementations or kdb+) are preferred, since conventional relational engines often introduce excessive latency.
Additionally, banks are dealing with ever more diverse data (logs, documents, multimedia for fraud forensics, etc.) that do not fit in RDBMS tables. Consequently, the new literature insists that RDBMS need not be replaced but must be complemented.
Moreover, relational databases and data warehouses rely on pre-defined schemas. This presents challenges in adapting to unstructured or semi-structured data formats, which limits their scalability and flexibility. Traditional relational databases operate on a schema-on-write model, which, while efficient for structured data, proves rigid and less adaptable to evolving data requirements in big data environments. This rigidity makes them less suitable for current financial contexts, where real-time analytics and hybrid data environments are critical [17,18,19].
In response to these limitations, vendors of RDBMS systems have advanced their technologies to address the performance requirements of current workloads. Nevertheless, RDBMS, big data, and their integration appear ever more crucial to handle complex workloads.
Furthermore, despite these advancements, HTAP technology is still emerging and can be complicated to implement (ensuring isolation, synchronization across row/column stores, etc.).
The analytics companions of RDBMS, namely data warehouses, need to be upgraded and integrated with newer technologies, like cloud and NoSQL data models, for better flexibility. Many banks have started to extend their relational databases by horizontal partitioning, in-memory acceleration, or sharding. However, these options have limits without a complete re-architecture.
As an alternative approach, NoSQL databases based on document, key-value, columnar, or graph models can ingest data (JSON documents, logs, graphs, etc.) at speed without a fixed schema [20]. This is valuable in finance, where data come from diverse sources (mobile apps, social feeds, IoT, etc.) and change frequently. For example, a document database can store a customer profile with varying attributes and update it on the fly, unlike a rigid relational schema; a minimal sketch of this pattern is shown below. As NoSQL does not require a predefined schema, financial teams can adopt agile development for new data-centric applications. Data models can evolve as quickly as business needs change (e.g., due to new regulatory fields and new product data) without lengthy migrations. Furthermore, most NoSQL databases are natively designed to scale out across commodity servers or cloud instances, handling massive data volumes and throughput. Financial institutions often need to retain years of historical data (for risk modeling or compliance) and handle spikes (e.g., Black Friday transactions or market volatility). Some key use cases of NoSQL in the financial sector are shown in Table 1.
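As a purely illustrative sketch of this flexibility (assuming a locally running MongoDB instance; the collection and field names are hypothetical), a document store lets two customer profiles with different attribute sets coexist and lets a new regulatory field be added without a schema migration:

```python
# Sketch assuming a locally running MongoDB instance; field names are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
profiles = client["bank"]["customer_profiles"]

# Two customer documents with different attribute sets coexist in the same collection.
profiles.insert_one({"customer_id": "C-1001", "name": "A. Pop", "channels": ["mobile", "branch"]})
profiles.insert_one({"customer_id": "C-1002", "name": "B. Ionescu",
                     "kyc": {"status": "pending"}, "social_handles": ["@b_ionescu"]})

# A new regulatory field can be added on the fly, with no migration of existing documents.
profiles.update_one({"customer_id": "C-1001"},
                    {"$set": {"fatca_status": "non-US"}})
```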
In the cloud context, a large number of banks and other financial players are starting to address this by moving relational databases to the cloud (for elasticity) and consolidating them into data lakes fed by streaming pipelines. Cloud-managed database services (like Amazon RDS and Azure SQL) are gaining market share as they improve scalability and lower operational costs compared to an on-premises setup. Cloud-managed database solutions have their own issues, such as vendor lock-in (i.e., the difficulty of changing from one cloud provider to another). However, research has shown that lock-in dissolution practices can be readily implemented [27].

2.2. Big Data Frameworks and Analytics in Finance

Lately, the financial sector has taken huge steps toward using big data analytics frameworks to obtain insight from large datasets. Unlike RDBMS, which store data in structured tables, big data technologies can ingest and process large volumes of heterogeneous data: trading feeds, customer clickstreams, social media sentiment, transaction graphs, etc. Hadoop and Spark are two frameworks that are often mentioned. They allow for distributed storage and parallel computing in clusters, which is essential when data sizes run into petabytes.
Big data technologies have evolved significantly. Tools like Hadoop play a vital role in managing large, diverse datasets and providing scalable storage solutions [28]. By offloading heavy read-only analyses to Hadoop, institutions can relieve their OLTP databases and achieve faster insight generation. Apache Spark has been the next step forward: it surpasses traditional big data frameworks by processing data faster and efficiently handling streaming data and machine learning tasks [29,30,31,32].
Big data in financial services can have many benefits, such as better customer insight and engagement, improved fraud detection, and improved market analytics [33,34]. Large datasets can be mined to identify subtle patterns in customer behavior, leading to better services and inclusion. Similarly, fraud schemes concealed within millions of transactions can be exposed by mining larger datasets for trends that help with loss prevention. In trading operations, big data analysis can help formulate trading strategies, for example, by backtesting thousands of scenarios and monitoring market sentiment in real time from news and social media.
Research from both academia and industry consistently shows that big data has a positive impact on risk management and operations. A recent review found evidence that big data enhances risk management and boosts operational efficiency in FinTech applications [34]. Big data enables the financial sector to scale up, improving the prediction and mitigation of risks (e.g., credit, market, etc.) through the processing of diverse data (e.g., economic indicators, customer portfolios, etc.). Big data and AI-powered models speed up operationally intensive tasks, like customer service or credit underwriting, because they can process data much faster than manual processes. This has direct implications for financial inclusion because speedier analytics can enable services (like microloans or real-time payments) for a wider audience while controlling risks.
For real-time applications, new streaming frameworks are used to process events as they arrive, for example, screening each card transaction for fraud in a matter of seconds (or less). Data streaming has found various applications in capital markets and payments, as it effectively manages real-time data within trading and fraud detection [35]. The demand for real-time analytics has led to the use of Lambda and Kappa architectures, which combine batch and stream layers.
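The minimal sketch below shows the shape of such a per-transaction screen, assuming a Kafka broker on localhost, illustrative topic names, and a placeholder rule in place of a trained fraud model; it is not drawn from any production system described here.

```python
# Minimal sketch, assuming a Kafka broker on localhost and illustrative topic names.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("card-transactions",
                         bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

def looks_suspicious(txn: dict) -> bool:
    # Placeholder rule; a real deployment would call a trained model instead.
    return txn.get("amount", 0) > 10_000 or txn.get("country") not in txn.get("usual_countries", [])

for message in consumer:
    txn = message.value
    if looks_suspicious(txn):
        # Flag within seconds of the card swipe rather than in a nightly batch.
        producer.send("fraud-alerts", {"txn_id": txn.get("id"), "reason": "rule_hit"})
```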
Despite these benefits, big data has its challenges. Although these frameworks can be scaled horizontally, achieving low latency comparable to a relational system is hard. The operational latency of traditional Hadoop batch processing is on the order of minutes, which does not suit time-sensitive financial operations.
Beyond latency, big data technologies face further significant challenges. Their lack of robust transactional capabilities hinders real-time decision-making processes in financial institutions. Ensuring data quality and consistency during integration with other technologies, such as relational databases, also remains a complex task. The schema-on-read approach, while flexible, often complicates data governance and standardization efforts [36,37]. Moreover, while big data tools provide scalability, the computational resources required for their operation can become costly and difficult to manage, especially for smaller financial institutions. These challenges raise the need for better integration frameworks to bridge the gap between big data technologies and traditional database systems.
Furthermore, it is not trivial to ensure the accuracy and consistency of these fast-moving big data pipelines with the authoritative data in relational stores.
Another main concern in the big data literature for finance is data governance, privacy, and quality. Financial data often contain sensitive personal information and are subject to strict regulations, such as GDPR in the EU. A continuing concern is to ensure that there are controls in place to enforce compliance (who can access what data, how long data will be retained, and ensuring the anonymity of data for analytics) [38].
Data lakes—large central repositories hosted on HDFS, cloud storage, or similar platforms—are widely used in finance. However, early implementations resulted in “data swamps” without proper governance [39]. Research on financial data lakes makes clear the importance of data quality controls and metadata modeling [40]. Remaining challenges include the need for near-real-time processing, better data quality techniques, and more. These gaps suggest that, while big data technology is very powerful, it is not yet fully mature in addressing all of finance’s needs (e.g., ultra-low latency trading or rigid compliance reporting).
In recent developments, an increasing number of these frameworks are run in the cloud, which blurs the boundary between this section and the next. Several financial institutions are using cloud-based data lakes or analytics services (like AWS EMR, Google BigQuery, or Databricks Spark on Azure) for on-demand scalability. This has brought about new hybrid architectures like the “data lakehouse”, which aims to bring the schema and reliability of data warehouses together with the scalability of data lakes. The lakehouse is a data architecture that unifies analytics on all data sources in one platform; the idea is that mixing structured and unstructured data becomes one seamless task. This is obviously relevant to finance, where the same dataset might need to serve both traditional SQL queries and AI model training in high-dimensional similarity spaces. For instance, Databricks presented a Lakehouse for Financial Services solution in 2022 to assist organizations in managing everything from compliance to customer analytics on a single platform.

2.3. Cloud Computing in Financial Services

In the last five years, cloud computing has become a widely used computing model in financial services. An increasing number of banks and insurers are utilizing public cloud providers (Amazon Web Services, Microsoft Azure, Google Cloud, etc.) and private clouds. According to a 2023 US report, financial institutions across the spectrum view cloud services as an important part of their technology program, and most big banks are opting for a hybrid cloud strategy [41]. According to Technavio [42], the private and public cloud sector in financial services will grow by USD 106.43 billion during 2024–2028 due to the demand for big data storage and AI. An increasing number of companies are opting for hybrid cloud solutions that offer the best of both worlds, albeit on a limited basis. However, data safety and privacy issues remain serious concerns.
Indeed, cloud computing can represent the next significant technological transformation. Cloud computing is designed to provide computing services over the Internet and is characterized by scalable and flexible platforms that process and store massive datasets [36,43,44]. As a result, the need for extensive on-premises infrastructure has greatly diminished, which represents a paradigm shift in data management and analysis. Although there are concerns related to security, financial institutions have recognized the significant advantages of cloud computing. Its wide range of benefits can outweigh these challenges; overall, cloud services play a significant role in data management, and most current financial strategies are based on cloud computing [45,46].
From a regulatory perspective, financial supervisors around the world have taken a good look at cloud outsourcing generally—the European Banking Authority issued detailed Guidelines on Cloud Outsourcing and the DORA regulation to ensure operational resilience [47]. The main concerns are the privacy of customer data and ensuring that the data stored in the cloud are not illegally transferred to others and across borders. There is the risk that certain cloud providers will experience operational outages or incidents that could affect banks. There is also a concentration risk when too much of the industry relies on one or a few big providers. Lastly, there is a security risk as banks need to be sure that the cloud platforms they are using themselves have robust security. In addition, banks also need to ensure proper access and encryption controls.
One report [48] highlighted myths about cloud supervision, asserting that there is a similarity of risk types between on-premise and cloud models; instead, it is the governance of risk that is different. Using a cloud does not create entirely new forms of risk; banks just need to adapt their vendor risk and cybersecurity models. The CEPS analysis also points out two top fears—concentration risk and vendor lock-in—and it argues they can be mitigated (e.g., through multi-cloud strategies and contractual safeguards).
More specifically, concerns about data security, privacy, and regulatory compliance are significant barriers to the widespread adoption of cloud computing in the financial sector. Migration of sensitive financial data to the cloud requires reliable encryption, strict access controls, and adherence to data protection regulations [49,50].
Another challenge is integration with legacy systems. Many banks’ core banking systems were created decades ago and are not cloud-native. Moving these directly to the cloud may not help and could create more problems for banks. Reports in the literature suggest refactoring applications or using cloud-native principles (e.g., microservices). The US Treasury report identified gaps in human capital and tools to securely deploy clouds—this means that many financial firms lack the expertise to migrate and operate in the cloud. Despite these challenges, some banks are forming partnerships with FinTech companies or consulting firms to build cloud skills and DevOps practices.
From a performance perspective, cloud technology can provide cost efficiency, enhanced cybersecurity, and operational resilience by allowing the automatic scaling of computing resources to meet surges in demand. For example, cloud platforms can dynamically add processing capacity during peak trading periods or fraud spikes, avoiding the performance bottlenecks of legacy on-premises systems.
Furthermore, financial organizations seek architectures that can scale to large volumes of data, provide fast processing for real-time insights, and still comply with stringent regulations. There is evidence that the COVID-19 pandemic further accelerated cloud adoption in finance as institutions have sought more agile and scalable IT solutions [51].
In addition, the integration of artificial intelligence (AI) and machine learning (ML) represents another milestone in the digital revolution, designed to speed up financial data analysis. Their purpose is to empower organizations that face constant and increasingly complex challenges. The financial sector has been greatly influenced, especially in terms of fraud detection, risk management, and high-frequency trading, where the implementation of AI and ML has been a long-needed success, turning them into real assets for these critical areas [52,53].

2.4. EU General Data Protection Regulation vs. US Data Privacy Frameworks

The EU’s General Data Protection Regulation (GDPR), which came into force in May 2018, has changed data governance in financial services. The GDPR sets strict requirements related to lawful processing, explicit consent, and the rights of data subjects (access, erasure, and portability), along with penalties of up to 4% of global turnover. The Data Protection Authority (DPA) of each EU member state monitors compliance, enforcing a regulatory paradigm that further accentuates the importance of privacy by design and prompt breach notification (72 h). Financial institutions have been leading the way in this matter due to the enormous amount of personal and transactional data they handle. In the US, no single law has the scope of the GDPR. The law that concerns financial institutions the most is the Gramm-Leach-Bliley Act (GLBA) for financial privacy and security, together with state data breach notification statutes and new state privacy laws like the California Consumer Privacy Act (CCPA [54], amended by the CPRA in 2020 [55]). A proposed federal law, the American Data Privacy and Protection Act (ADPPA) [56], suggests that the US is getting closer to, but will not achieve, GDPR-like standards. The key features of the EU and US data protection standards are shown in Table 2.
Since 2018, the GDPR has helped improve data protection in financial institutions, not just in Europe but also, indirectly, all over the world. Most EU banks have transitioned from an uncertain position, in which parties were concerned whether they would ever be fully GDPR-compliant, to a steadier state of affairs. Today, privacy governance is accepted as business as usual, and it is an important exercise in building customer trust. Enforcement cases have sent a strong message: every organization must follow data protection rules. No one is above the rules, not even the biggest banks, which will be penalized if their practices are not continuously improved. American banks and FinTech companies were somewhat shielded by the lighter US rules but have nevertheless experienced a global backlash. They reinforced security (which was already a focus due to financial regulation) and started to embrace more consumer-friendly data practices as state laws emerged. The anticipated introduction of broad federal privacy legislation (such as the ADPPA) is encouraging US entities to catch up with the GDPR.
An encouraging trend is the increasing collaboration among the legal, IT, cybersecurity, and business units of financial institutions, particularly those addressing AI-related privacy concerns [59,60]. The GDPR and privacy laws have forced silos to break down. The chief compliance officer may often work together with the CISO (chief information security officer) and CTO (chief technology officer) on projects to ensure compliance is baked in. Cybersecurity conferences now include privacy tracks and vice versa [61] to discuss the different kinds of issues in the field. For instance, data governance ranges from purely legal issues—what does the law require?—to technical execution—can we easily retrieve all of the data of one person? This has led to the emergence of new roles, such as privacy engineers, technologists who can implement privacy features in software. Banks such as ING and Wells Fargo are putting together teams of privacy engineers and working with anonymization and consent technology. Academically, this is where law, computer science, and policy come together; thus, it is where many future gains (or failures) in privacy protection will occur.

2.5. Sustainable Practices

Financial institutions increasingly want sustainable computing solutions as they grow their cloud infrastructure and increase their big data usage.
There is evidence that the cloud is more energy-efficient than on-premise data centers [62], especially if certain improvements are implemented [63,64]. Financial institutions may find that they can reduce energy-related costs by directing IT resources to hyperscale data centers, such as those operated by Amazon AWS, Google Cloud, or Microsoft Azure. This trend, along with virtualization and AI-driven efficiencies, could substantially curb greenhouse gas emissions related to IT. Current estimates [65,66] place data centers’ electricity usage at approximately 1–2% of global electricity consumption, with carbon emissions ranging from 0.6% to over 3% of global greenhouse gas emissions. According to some higher-end projections, cloud data centers are expected to use 20% of global electricity and generate up to 5.5% of the world’s carbon emissions [67]. As such, energy modeling and prediction for data centers are essential [68].
The global volume of data has increased exponentially and is expected to reach nearly 181 zettabytes by 2025, roughly three times the 2020 level. The rising use of AI and other new digital technologies will increase these footprints [69]. Under high-growth scenarios, the demand for electricity in data centers across the globe will reach 800–1200 TWh by 2026, almost double the 2019 level [70]. Thanks to efficiency improvements and a cleaner energy mix, ICT emissions growth has thus far been constrained. For example, data center GHG emissions in Europe actually fell by 8% in the 2015–2020 period, despite a 24% increase in energy use over this period [71].
Sector benchmarks show continued improvements in IT sustainability, although this remains a work in progress. The average power usage effectiveness (PUE) of data centers is plateauing at 1.56 globally, and many older facilities have not yet optimized their PUE [72]. Meanwhile, best-in-class hyperscale centers operate at around 1.1 PUE when optimization strategies have been implemented.
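For reference, PUE is the ratio of total facility energy to the energy delivered to IT equipment, so a value of 1.0 would mean no overhead for cooling, power conversion, or lighting. A small illustrative calculation, with figures chosen to match the averages cited above, is given below.

```python
# PUE = total facility energy / IT equipment energy; the figures below are illustrative.
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    return total_facility_kwh / it_equipment_kwh

print(round(pue(1_560, 1_000), 2))   # 1.56, close to the reported global average
print(round(pue(1_100, 1_000), 2))   # 1.10, typical of optimized hyperscale facilities
```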
To illustrate these trends, consider some sector examples. Flowe Bank (Italy) is a cloud-native green neobank that deployed its IT operations on a cloud banking platform on Microsoft Azure, with ∼95–98% lower emissions than an on-premise setup. EQ Bank (Canada) also reports about 94–97% lower emissions due to utilizing a cloud infrastructure [73]. Big banks are increasingly downsizing their own data centers and partnering with cloud providers.
Although infrastructure is important, financial institutions are also looking at the software layer and data management to drive sustainability. Financial firms run complex models (e.g., risk simulations, option pricing, and credit scoring). Traditionally, the focus was on accuracy and speed, not energy. Now, there is a nascent movement to write more efficient code and use more efficient algorithms to achieve the same business outcome with less computation. For example, instead of running an extremely high-resolution Monte Carlo simulation for a risk that takes 1000 CPU-h, a bank might find a smarter statistical technique or use a surrogate model that takes 100 CPU-h with a negligible loss of precision—thus saving energy proportionally. Similarly, in AI, techniques like model pruning, quantization (using lower precision math), or using smaller pre-trained models can cut down the computation needed for tasks like fraud detection. The carbon footprint has emerged as a performance metric, so development teams are beginning to consider the carbon cost of a given computation, especially for large-scale or repetitive tasks. There are even tools emerging that integrate with code repositories to estimate the energy consumption of code changes. While still an emerging practice, this mindset is growing especially in Europe, influenced by the concept of “Green Software Engineering” [74].
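As a back-of-the-envelope illustration of the surrogate-model example above, the sketch below converts saved CPU-hours into energy and carbon figures; the per-core power draw and grid carbon intensity are assumed values, not measurements from this study.

```python
# Back-of-the-envelope sketch; per-core power draw and grid carbon intensity are assumed values.
CPU_WATTS_PER_CORE = 10.0        # assumed average draw per fully used core
GRID_KG_CO2_PER_KWH = 0.3        # assumed grid carbon intensity

def footprint(cpu_hours: float) -> tuple[float, float]:
    """Return (energy in kWh, emissions in kg CO2) for a given amount of CPU time."""
    kwh = cpu_hours * CPU_WATTS_PER_CORE / 1000.0
    return kwh, kwh * GRID_KG_CO2_PER_KWH

full_kwh, full_co2 = footprint(1_000)      # high-resolution Monte Carlo run
light_kwh, light_co2 = footprint(100)      # surrogate model with comparable accuracy
print(f"saved: {full_kwh - light_kwh:.1f} kWh, {full_co2 - light_co2:.1f} kg CO2")
```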
In implementing these technology strategies, a cross-disciplinary approach is often required. IT departments work with corporate sustainability teams, facilities managers, and, sometimes, external energy consultants. For instance, configuring a data center’s electrical design might involve an outside engineering firm, while setting cloud migration targets involves CIOs and CFOs making cost–benefit evaluations, including carbon accounting. It is also worth mentioning techno-ethical governance here, as banks implement advanced tech, like AI, to optimize themselves, and they are also aware of ethical considerations—ensuring algorithms do not inadvertently compromise reliability or fairness while optimizing for energy. For example, an AI that aggressively powers down servers to save energy must not risk the availability of critical systems. Governance frameworks are put in place to review such trade-offs. In essence, sustainable computing must still uphold the primary directives of banking IT: security, availability, and integrity. The good news is many sustainable practices align well with these (e.g., removing old inefficient servers reduces security vulnerabilities and failure rates). When conflicts arise (like needing redundancy vs. saving energy), governance bodies weigh priorities and sometimes choose a middle ground (like having redundancy but on a standby low-power mode rather than fully active).

2.6. Integration Approaches and Findings

Moving to the integration aspect, the integration of relational databases, big data technologies, and cloud computing brings significant opportunities, but also different challenges. The rigidity of schema-on-write in relational databases contrasts with the flexibility of schema-on-read approaches in big data systems, highlighting a divide in managing complex data structures [75,76]. Advanced integration strategies, including hybrid cloud models and ETL workflows, are needed to harmonize these technologies and unlock their full potential for the financial sector [77,78].
Using a mix of database technologies, each for what it does best, can overcome the one-size-fits-all limitation. In practice, this means maintaining relational databases for core transactional data while employing NoSQL or Hadoop-based stores for high-volume analytics data [79]. A polyglot or multi-model strategy allows institutions to combine SQL and NoSQL systems, although it requires careful design to ensure consistency and efficient data exchange. In some cases, organizations convert relational schema into a NoSQL format or migrate subsets of data to NoSQL systems to improve scalability [80]. This transformation can improve performance for certain read-heavy or analytical workloads, but it requires robust synchronization mechanisms between the two worlds.
Many financial institutions rely on data lakes [81], i.e., centralized repositories on scalable storage (often cloud-based) that ingest data from relational databases, transaction feeds, and external sources. Cloud-based data lakes enable the aggregation of diverse datasets (structured and unstructured) in one place, supporting advanced analytics and AI on a unified platform. For example, the adoption of cloud data lakes has reshaped operations by providing robust storage, advanced analytics, and on-demand scalability. Unlike traditional data warehouses, data lakes can store raw data in its native form, which is then processed by big data tools as needed. Another option is the “lakehouse” architecture [82], which combines the schema and performance benefits of data warehouses with the flexibility of data lakes, allowing SQL queries with ACID guarantees and big data processing on the same repository. This approach simplifies integration by reducing data movement between separate systems.
However, relying on public cloud platforms has its legal, operational, and reputational risks. For example, since the EU’s first major data protection law in 1995 (the Data Protection Directive, which was replaced by the GDPR in 2018), exporting personal data outside the EU is only allowed under strict conditions. In contrast, the United States has strong intelligence-gathering powers [83]. But many European financial institutions use cloud services based in the US (such as, for example, Amazon Web Services, Microsoft Azure, Google Cloud, and Oracle Cloud) for critical operations—such as online banking platforms, data analytics, or customer relationship management—and they currently rely on the EU–US Data Privacy Framework (DPF) for transatlantic data transfers (which replaced the Transatlantic Data Privacy Framework (TADPF) [84]). If the DPF is weakened or even annulled, financial institutions may have to quickly switch to Standard Contractual Clauses (SCCs) with additional technical or contractual controls [85], or they may need to seek data localization solutions within the EU, including on-premise solutions. Any of these solutions would likely lead to an increase in operational costs. Financial institutions also face compliance challenges in ensuring consistent data quality and lineage across integrated systems. When the data flow from a core banking SQL database into a Hadoop-based lake and then to a cloud analytics service, maintaining an audit trail is critical. Regulators (and internal risk officers) demand transparency on how data are transformed and used, especially for automated decision making (credit scoring, algorithmic trading, etc.). This can be implemented using automated tagging strategies that label records with attributes like the origin system (SQL vs. Hadoop), processing stage, or user role. This ensures a robust audit trail of "who did what, when, and why." While such rule-based systems excel at applying granular policies and generating detailed logs, permissioned blockchain solutions can add immutable, decentralized storage for critical audit events. A smart contract can be used for compliance enforcement. For example, when data cross borders (e.g., EU → US), a compliance contract can automatically validate whether appropriate legal mechanisms (SCCs, DPF, etc.) are in place. If not, the transaction is blocked or flagged.
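A rule-based version of this tagging and transfer check could look like the hedged Python sketch below; the record layout, region labels, and mechanism names are illustrative, and a production system would persist the audit log (or anchor it in a permissioned blockchain) rather than keep it in memory.

```python
# Rule-based sketch of the tagging and cross-border check described above; the record
# layout and region labels are illustrative, not a production implementation.
from dataclasses import dataclass, field
from datetime import datetime, timezone

APPROVED_MECHANISMS = {"SCC", "DPF"}          # legal bases named in the text

@dataclass
class DataRecord:
    payload: dict
    origin_system: str                        # e.g., "core-sql", "hadoop-lake"
    processing_stage: str
    audit_log: list = field(default_factory=list)

def transfer(record: DataRecord, destination_region: str, mechanism: str | None) -> bool:
    """Allow or block a transfer and append an audit-trail entry either way."""
    entry = {"when": datetime.now(timezone.utc).isoformat(),
             "from": record.origin_system, "to": destination_region,
             "stage": record.processing_stage, "mechanism": mechanism}
    allowed = destination_region == "EU" or mechanism in APPROVED_MECHANISMS
    entry["decision"] = "allowed" if allowed else "blocked"
    record.audit_log.append(entry)
    return allowed

record = DataRecord({"iban": "RO49..."}, origin_system="core-sql", processing_stage="etl")
print(transfer(record, destination_region="US", mechanism=None))    # False: no legal basis
print(transfer(record, destination_region="US", mechanism="DPF"))   # True: DPF in place
```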
To summarize our findings, we synthesize the findings of the literature review and offer our guidance on integration strategies for the different financial domains, together with expected performance outcomes, and compliance-related challenges in Table 3.
Cloud solutions make it easy to scale for unpredictable workloads and for use cases that include regulatory reporting and AI analytics. However, data locality rules and security issues can make this ad hoc elasticity difficult to implement. Many organizations combine on-premise systems with cloud ecosystems to adopt hybrid designs with container orchestration and fine-tuned security for their needs.

3. Materials and Methods

For the experimental phase, we selected financial datasets with both structured and unstructured data types so that the assessment framework was comprehensive. We established testing environments for each technology and also integrated advanced machine learning techniques by employing a Random Forest model, which is a well-established ensemble technique used for financial data analysis. Many researchers have compared Random Forest with other machine learning methods, and they have also shown its effectiveness in providing reliable results in areas such as financial fraud detection [86], forecasting prices [87,88], and corporate financial performance [89,90,91], as it is capable of handling high-dimensional and noisy data. Its interpretability, through measures such as feature importance, and its scalability across different data types further support its adoption in financial studies [92]. Moreover, it can reduce overfitting through bootstrap aggregation.
Figure 1 offers a high-level schematic of the methodological approach. Synthetic variations of the German Credit dataset are derived in structured, semi-structured, and unstructured forms. These data are then handled on three separate processing platforms, SQL, Python, and Spark, each of which is capable of managing predictive modeling and sentiment analysis. After that, the process investigates several dataset size variations (10,000, 500,000, and 1,000,000 records) to ensure the robustness of the performance test. Comprehensive performance metrics, such as memory use, CPU use, GPU use, total time, and query time, are collected and compiled, as shown in the evaluation subgraph constructed during the process. Each configuration was iterated 50 times to produce statistically significant findings. Finally, as depicted in the Results Analysis and Visualization section, the results were aggregated and compared across the three platforms using different data types and volumes. This offers guidance on scalability and resource optimization for financial data processing.
  • For Structured Workloads—SQL remains a stable choice for smaller transactional datasets or cases requiring stringent ACID properties and high-throughput queries;
  • For Mid-Range Machine Learning—Python pipelines (which include tools like scikit-learn, XGBoost, etc.) efficiently handle moderate data volumes, affording both model interpretability and a robust ecosystem of analytical libraries. Platforms like AWS, Azure, or GCP can supply on-demand computational resources that streamline training times, particularly when dataset sizes fluctuate or computational needs spike.
  • For Large-Scale or Unstructured Data—Spark’s distributed architecture accommodates vast, heterogeneous datasets; facilitates real-time analytics and batch workflows with considerable scalability; and its scalable cluster provisioning allows for institutions to extend the Spark infrastructure temporarily, thereby reducing capital expenditure and simplifying high-volume data processing.
Access to Source Data and Scripts: The scripts used for performance evaluation can be accessed at the following link: Source Data and Scripts (https://colab.research.google.com/drive/17Nj2lcUaUvk75GOHGyKB3o3cjS46iU1H?usp=sharing, accessed on 1 December 2024).
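For orientation, the condensed sketch below shows the kind of measurement loop such a benchmark implies: train a Random Forest on synthetic tabular data of a given size and record time and resource metrics per run. It is not the script linked above; the library choices (scikit-learn, psutil) and the synthetic features are assumptions made for illustration.

```python
# Condensed sketch of a benchmarking loop of the kind described above; not the authors' script.
import time
import numpy as np
import psutil
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def run_once(n_records: int, n_features: int = 20, seed: int = 0) -> dict:
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_records, n_features))     # stand-in for German Credit variants
    y = rng.integers(0, 2, size=n_records)           # binary good/bad credit label
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)

    proc = psutil.Process()
    proc.cpu_percent(interval=None)                  # reset the CPU counter for this run
    t0 = time.perf_counter()
    model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=seed)
    model.fit(X_tr, y_tr)
    return {"records": n_records,
            "total_time_s": time.perf_counter() - t0,
            "cpu_percent": proc.cpu_percent(interval=None),  # utilization since the reset
            "rss_mb": proc.memory_info().rss / 2**20,
            "accuracy": model.score(X_te, y_te)}

# Repeat each configuration many times (the study uses 50 iterations) and aggregate.
results = [run_once(10_000, seed=i) for i in range(3)]
print(results[0])
```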

4. System Architecture and Integration

4.1. Technology Comparison

As shown in the literature review, relational databases, big data systems, and cloud computing platforms are the main technologies that influence financial data management. Each of them offers its individual strengths. When harmonized, they can significantly influence financial modeling and data analysis [93].
Relational databases (Figure 2) are traditionally considered the gold standard of structured data management. They are purpose-built for efficient query processing, rigorous schema enforcement, and robust transactional support. Their strengths lie in established query optimizations and concurrency controls, with some DBs offering schema distribution [94]. In financial contexts, these systems ensure consistency and integrity when recording high-frequency transactions, managing account balances, or generating regulatory reports. Well-known examples include Enterprise DBs (Oracle Database, Microsoft SQL Server, and IBM DB2) and Open-Source DBs (PostgreSQL, MariaDB, and MySQL Community Edition). Of the Open-Source DBs, PostgreSQL offers the most permissive BSD-style license, which comes with minimal restrictions [95]. Both MySQL and MariaDB use the GPLv2 license [96,97]. They are free to use within a company, but distributing them with proprietary software creates copyleft obligations. As such, PostgreSQL is the most flexible solution for enterprises that want to maximize license flexibility while reducing legal expense. MySQL provides a well-established dual-licensing option for those who are satisfied with GPL compliance or are prepared to pay for commercial licenses. MariaDB strikes a balance between strong open-source commitments and commercial flexibility.
As financial institutions have to deal with ever-growing and increasingly heterogeneous datasets, big data frameworks, such as Hadoop and Apache Spark, are key platforms for distributed processing and storage (Figure 3) beyond relational databases. Hadoop accommodates large-scale data through its distributed file system, making it suitable for high-volume, batch-oriented analytics, while Apache Spark delivers rapid in-memory data processing and is especially valuable for time-sensitive operations, such as real-time market data analysis and iterative machine learning workflows [98]. These frameworks expand the analytical capabilities of organizations that must handle data that exceed the limits of traditional databases.
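The brief PySpark sketch below illustrates the two modes mentioned here, a batch aggregation over historical trades and in-memory reuse of the same dataset for an iterative calculation; the tiny in-memory dataset and column names are hypothetical, and a running Spark environment is assumed.

```python
# PySpark sketch; the tiny in-memory dataset stands in for a large trade history.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("market-data-batch").getOrCreate()

trades = spark.createDataFrame(
    [("XYZ", "2025-01-15", 101.5, 200), ("XYZ", "2025-01-15", 102.0, 100),
     ("ABC", "2025-01-15", 55.2, 500)],
    ["symbol", "trade_date", "price", "quantity"])

# Batch-style aggregation over the full history (the classic Hadoop/MapReduce workload).
daily_volume = trades.groupBy("symbol", "trade_date").agg(F.sum("quantity").alias("volume"))

# Caching pays off when the same dataset feeds several iterative analyses in memory.
trades.cache()
vwap = trades.groupBy("symbol").agg(
    (F.sum(F.col("price") * F.col("quantity")) / F.sum("quantity")).alias("vwap"))

daily_volume.show()
vwap.show()
```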
Cloud-based offerings (Figure 4) represent an alternative to on-site solutions in data management and computational resource allocation. By providing on-demand scalability, cloud providers, such as Amazon Web Services, Microsoft Azure, Google Cloud Platform, and IBM Cloud, effectively integrate relational and big data infrastructures with minimal on-premises overhead. This elastic model allows financial institutions to align infrastructure costs with workload intensity, while also leveraging advanced AI and analytics services for diverse business applications.
Figure 4 maps the capabilities of cloud computing to the requirements of the finance industry using a bipartite diagram. The upper part of the diagram highlights three core cloud attributes: elastic scalability, fault tolerance with cost optimization, and analytics with AI/ML support. The lower part of the diagram captures the four major cloud service providers (AWS, Microsoft Azure, Google Cloud Platform, and IBM Cloud) with their technical differentiators. All providers promise elasticity, availability, and analytics, but implementations differ in architecture, tooling, and SLAs. The arrows connecting the two parts signal which cloud service delivers which capability. For example, AWS’s cross-region replication architecture claims to provide 99.95% SLA compliance for payment systems. Azure’s paired-region model enables automated failover for core banking with low RPO and RTO during regional incidents. GCP’s multiregion databases enable real-time fraud detection with confidential computing. IBM’s validated financial services stack claims to provide hybrid deployment with FIPS 140-2 encryption. None of these connections rest on vague claims about the cloud; instead, they are all tied to specific, measurable service-level agreements, such as the 99.99% uptime for Azure virtual machines that are regulatory-compliant.
The comparison in Table 4 outlines the features, advantages, and drawbacks of the technologies.
Relational databases are essential for processing standard financial transactions because of their powerful query processing and strong data integrity functionality. Hadoop, as well as Apache Spark, allow for distributed storage and high-throughput processing, thus tackling the challenges posed by large volumes of unstructured data [99]. Cloud computing platforms, like AWS and Azure, provide dynamic capabilities on a cost-effective basis to financial institutions in order to manage volatile spikes in data. Together, these technologies are strengthening strategic decision-making processes within the sector [100].

4.2. Integration Strategies

The integration of relational databases, big data, and cloud computing is considered a milestone in addressing modern data management issues. This section investigates methodologies and technologies for the effective transfer and transformation of data across these platforms. Our objective is to achieve interoperability that utilizes the unique advantages of each technology while mitigating their shortcomings. The fusion of relational databases, big data frameworks, and cloud services represents a reliable method for managing large, complex volumes of structured and unstructured data. This framework (Figure 5) establishes the basis for a dynamic ecosystem comprising data transformation and integration, primarily through ETL (Extract, Transform, Load) processes and analytical operations, enhancing accurate, real-time decision making. Employing Apache Sqoop for smooth data migration between Hadoop systems and relational databases, alongside Apache Kafka for real-time data streaming, shows that interoperability can be achieved [101]. Such an approach enables the analysis of a large spectrum of data, which is necessary for the deep analytics and informed strategic directions needed to surpass conventional data management limitations.
For an efficient migration process between big data systems and relational databases, data transformation is necessary. This process necessitates advanced ETL workflows to provide accurate data translation and transportation between systems. Apache Sqoop is utilized for the bulk transfer of data and effectively manages both import and export operations, while Apache Kafka facilitates dynamic data streaming between relational and big data systems, enabling prompt analysis and fast decision-making processes [102]. In addition to these tools, the integration of relational databases with cloud computing offers even more advantages. Cloud data warehousing platforms, such as Amazon Redshift and Google BigQuery, offer strong capabilities concerning the analysis of information from relational databases within a cloud framework. Migration tools and hybrid cloud models not only facilitate the transition of databases to the cloud, but also enhance flexibility and scalability by integrating on-premises resources with cloud-based services [103,104].
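Because Sqoop itself is driven from the command line, the hedged Python sketch below shows the equivalent relational-to-streaming handoff in code: rows are extracted in bulk from a relational store and published to a Kafka topic for downstream processing. The connection string, table, and topic names are assumptions for illustration only, and running PostgreSQL and Kafka services are assumed.

```python
# Sketch of the relational-to-streaming handoff; connection details and topic names are assumed.
import json
import psycopg2
from kafka import KafkaProducer

conn = psycopg2.connect("dbname=corebank user=etl host=localhost")   # hypothetical DSN
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"))

with conn.cursor(name="txn_export") as cur:        # server-side cursor for bulk extraction
    cur.execute("SELECT id, account, amount, booked_at FROM transactions WHERE booked_at >= %s",
                ("2025-01-01",))
    for txn_id, account, amount, booked_at in cur:
        # Transform: map the relational row to the event schema used downstream.
        event = {"txn_id": txn_id, "account": account,
                 "amount": float(amount), "booked_at": booked_at}
        producer.send("core-transactions", event)

producer.flush()
```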
Leveraging cloud storage and computational services offers scalable solutions for big data projects, utilizing platforms like Amazon S3 and EC2, Google Cloud Storage, and Compute Engine to provide extensive data storage and processing capabilities. At the same time, services such as AWS EMR and Google Dataproc simplify the management and scaling of these initiatives while providing scalability and ease of operation within the cloud [105].
In addition, applications that span relational databases, big data, and cloud systems equally benefit from APIs that allow easy data transfer across systems and applications. Having the data in a standard format within systems, e.g., JSON, Avro, etc., helps in integration. Moreover, it is important to keep an eye on the data security and regulatory compliance across integration points. Some of the key issues are encryption, access controls, and adherence to data protection laws [106].
The integration of different technologies and approaches can be quite complex because each comes with its own set of advantages and disadvantages (Table 5). Thus, any integration initiative should tackle the issues of data compatibility and system interoperability. Moreover, a unified environment should work smoothly and efficiently. Optimizing cloud and big data resources requires resource management, proper allocation, and security. Banks and other financial institutions should address these aspects to ensure reliable performance and regulatory compliance.

4.3. Hybrid Architecture and Practical Implementation

Integrating technology within the financial sector is a critical challenge that also involves data security and privacy. Any integration should be designed for compliance with financial regulations and should rely on advanced security measures and encryption techniques to protect sensitive financial data and guarantee integrity and confidentiality [107].
The proposed architectural design and integration tactics for relational databases, big data solutions, and cloud platforms aim at improving financial performance analysis, particularly through machine learning. Thus, the unique needs of financial entities are met, with efficient data handling, scalable processing, and the incorporation of analytical innovations. Ensuring system compatibility, optimal resource usage in cloud and big data settings, and robust security compliance not only supports financial modeling and analysis, but also allows for flexible adjustments to future technological advances. A hybrid cloud model allows for financial or highly regulated institutions (e.g., EU institutions) to keep sensitive or regulated data within their own private data centers or in EU-based clouds. In such cases, non-sensitive data or workloads that require high elasticity can be placed in a public cloud, potentially outside the EU. This helps ensure compliance with data protection rules for critical datasets. Public cloud usage (including US-based services) can be limited to less privacy-critical applications or pseudonymized datasets. This approach helps ensure adherence to data protection rules for critical datasets while optimizing infrastructure costs. However, implementing such a model requires careful planning to avoid common pitfalls, including hidden data exposures, vendor lock-in, and encryption vulnerabilities. Table 6 summarizes these key advantages, potential challenges, and implementation steps for EU financial institutions when adopting a hybrid cloud strategy.
In addition, a hybrid setup can provide redundancy and valuable business continuity protection. If regulatory or political changes invalidate certain transfers to the public cloud, the institution can switch critical operations to the private cloud to reduce downtime and compliance risks. However, splitting data and workloads between private/on-premises and public clouds can introduce governance and security challenges that might not be worth the effort if the data are not particularly sensitive or heavily regulated. Furthermore, some applications, especially modern cloud-native services, may be difficult to split across regions while maintaining full functionality [108]. If the data are sensitive and heavily regulated, and storing them on-premises or in EU clouds is not an option (for example, due to a lack of needed features), a Transfer Impact Assessment (TIA) can be run to assess whether a US transfer can still be lawful when Standard Contractual Clauses (SCCs) are signed and enforced with technical safeguards (e.g., encryption, split processing, etc.). Other solutions, such as Binding Corporate Rules (BCRs) or derogations under GDPR Article 49, can be hard to secure, or consent can be rendered invalid if it is bundled or not freely given (e.g., in an employment context).
As shown in Figure 6, we propose a hybrid cloud approach. The architecture implements two primary data flow patterns based on data classification. Personal data, including customer information and transactions, flows through EU-compliant systems, while non-personal data, like market analytics, can remain on US cloud services. Of course, the final designs must be tailored to the compliance obligations of each firm, especially if the local/EU data residency laws are stricter than the DPF, or if the DPF is challenged. The integration layer handles three important patterns.
The first pattern refers to event-driven integration. Amazon Managed Streaming for Apache Kafka (Amazon MSK) in the US zone and Amazon EventBridge in the EU zone work together through a sophisticated event routing mechanism. Such managed systems, where the major cloud providers excel, are difficult to replicate on premises using other European cloud services. When market data arrive through external APIs, MSK processes these events and can trigger analytical workflows. However, the interaction becomes more complex when dealing with business events that might contain both personal and non-personal data. In these cases, EventBridge in the EU implements an event-filtering pattern:
  • Events are first processed through a payload analyzer that classifies data elements;
  • Personal data remain within the EU boundary, triggering local workflows;
  • Non-personal elements are extracted and routed to MSK for analytics;
  • A correlation ID system maintains the relationship context across boundaries.
For example, when processing a trading event, customer details stay in the EU, while anonymized trading patterns can flow to US analytics.
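A minimal sketch of this event-filtering pattern is shown below; the field classification, topic routing, and correlation handling are illustrative assumptions, and the print calls stand in for EventBridge and MSK publish operations.

```python
# Sketch of the event-filtering pattern: split an incoming business event into a
# personal part (kept in the EU zone) and a non-personal part (routed to analytics).
# Field classifications and routing targets are illustrative assumptions.
import json
import uuid

PERSONAL_FIELDS = {"customer_id", "name", "iban", "email"}     # assumed classification

def split_event(event: dict) -> tuple[dict, dict]:
    """Classify payload elements and attach a correlation ID to both halves."""
    correlation_id = str(uuid.uuid4())
    personal = {k: v for k, v in event.items() if k in PERSONAL_FIELDS}
    non_personal = {k: v for k, v in event.items() if k not in PERSONAL_FIELDS}
    personal["correlation_id"] = correlation_id
    non_personal["correlation_id"] = correlation_id
    return personal, non_personal

def route(event: dict) -> None:
    personal, non_personal = split_event(event)
    # In a real deployment these would be EventBridge put_events / MSK produce calls.
    print("EU workflow  :", json.dumps(personal))
    print("US analytics :", json.dumps(non_personal))

route({
    "customer_id": "c-1001", "iban": "DE89...", "instrument": "AAPL",
    "side": "BUY", "quantity": 50, "price": 187.4,
})
```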
The API Gateway in the US zone implements the second pattern, i.e., a facade pattern that presents a unified interface for external services while maintaining data sovereignty. This works through the following, as illustrated in the sketch after the list:
  • Request classification at the gateway level;
  • Dynamic routing based on data content;
  • Transformation rules that strip personal data before US processing;
  • Response aggregation that combines results from both zones.
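The following sketch illustrates the facade pattern under these assumptions; the field classification and the zone handlers are hypothetical stand-ins for real service calls.

```python
# Sketch of the facade pattern at the gateway: classify the request, strip personal
# data before any US-side processing, then aggregate responses from both zones.
# The zone handlers below are placeholders for real service calls.

PERSONAL_FIELDS = {"customer_id", "name", "iban"}    # assumed classification rules

def eu_service(payload: dict) -> dict:
    return {"kyc_status": "ok"}                       # placeholder EU-zone response

def us_analytics(payload: dict) -> dict:
    return {"risk_score": 0.12}                       # placeholder US-zone response

def gateway_handle(request: dict) -> dict:
    # 1. Classification: does the request touch personal data?
    contains_personal = any(f in request for f in PERSONAL_FIELDS)

    # 2. Dynamic routing and 3. transformation: personal fields never cross the boundary.
    eu_response = eu_service(request) if contains_personal else {}
    sanitized = {k: v for k, v in request.items() if k not in PERSONAL_FIELDS}
    us_response = us_analytics(sanitized)

    # 4. Response aggregation presents one unified answer to the caller.
    return {**eu_response, **us_response}

print(gateway_handle({"customer_id": "c-1001", "iban": "DE89...", "notional": 1_000_000}))
```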
The third pattern refers to ETL pipelines in the EU that handle sensitive data transformations but coordinate with US analytics systems. This is managed through the following, as illustrated in the sketch after the list:
  • A staged ETL approach where initial processing occurs in the EU;
  • Aggregation and anonymization steps that prepare data for cross-border movement;
  • Batch windows that optimize data transfer timing;
  • Checkpointing mechanisms that ensure consistency across zones.
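A minimal PySpark sketch of this staged ETL flow is given below, assuming hypothetical column names, a salted hash for pseudonymization, and daily batch windows; a production pipeline would add key management, quality checks, and auditing.

```python
# Sketch of the staged ETL pattern: personal data are pseudonymized and aggregated
# in the EU zone before a batch window moves the result across the boundary.
# Column names, salt handling, and paths are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("eu-staged-etl").getOrCreate()

raw = spark.read.parquet("s3a://eu-zone/raw/transactions/")   # stage 1: EU-only processing

prepared = (
    raw
    # Pseudonymize the customer identifier; the salt stays in an EU-held secret store.
    .withColumn("customer_hash",
                F.sha2(F.concat(F.col("customer_id"), F.lit("EU_SALT")), 256))
    .drop("customer_id", "name", "iban")
    # Aggregate to cohort level so no individual record leaves the EU zone.
    .groupBy("customer_hash", F.window("booked_at", "1 day").alias("day"))
    .agg(F.sum("amount").alias("daily_volume"), F.count("*").alias("trade_count"))
)

# Batch window plus checkpoint: write once per day with a version marker the US side can verify.
(
    prepared.withColumn("dataset_version", F.lit("2025-04-01.v1"))
    .write.mode("overwrite")
    .parquet("s3a://transfer-bucket/anonymized/daily/2025-04-01/")
)
```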
Such a hybrid approach reduces rather than eliminates all hazards through compartmentalization. To guarantee compliance, companies must still routinely review their architecture, maintain appropriate documentation, and closely examine data flows. There should be clear data maps illustrating precisely which kinds of data live in which contexts and their interactions.
Furthermore, the hybrid architecture introduces eventual consistency challenges. In the proposed hybrid cloud architecture, the eventual consistency manifests itself primarily through the data processing pipeline between the EU and US zones. This differs from traditional BASE (Basically Available, Soft state, Eventual consistency) model scenarios because we are not just dealing with eventually consistent data stores but with an entire processing pipeline that must maintain consistency while respecting data protection boundaries. The challenge is not just about data convergence, but about maintaining analytical integrity across segregated processing environments. When data need to be analyzed across both regions, there are specific consistency considerations.
First, there is the temporal aspect of data synchronization. When the EU HDFS cluster processes personal data and generates anonymized datasets for US analytics, there is a delay before these datasets become available in the US zone. This delay creates a time window in which US analytics systems might be working with slightly outdated information.
Second, there are state management complexities when dealing with long-running analytics processes. For example, if a financial analysis starts in the US zone while data are still being processed in the EU zone, there is a need for rather sophisticated mechanisms to ensure that the analysis incorporates all relevant data points. This often requires implementing versioning and checkpoint systems that can track the state of data across both zones.
A practical example would be analyzing customer transaction patterns. The EU zone processes the raw transaction data, anonymizing them before sending it to the US analytics cluster. During this process, we need to ensure that the anonymization process maintains consistent customer cohorts across different time periods, such that any aggregated metrics maintain their statistical validity. In doing so, the analysis results can be correctly mapped back to the original data contexts. To address these challenges, the architecture implements several key mechanisms: version tracking for all datasets moving between zones, explicit timestamp management for data synchronization, reconciliation processes that validate data consistency across regions, and compensation workflows that can adjust for any inconsistencies detected during processing.
As shown in Figure 7, we show a scenario in which the data originate in the EU. They are anonymized and versioned before being transferred for US analytics. Concurrent updates are handled through incremental versions, and consistency is ensured through a reconciliation process that validates the state of data across zones by checking the version metadata from the Version Store, validating the dataset state in EU HDFS, and confirming the consistency before analytics results are updated. By following this strategy, while we may have temporary inconsistencies as data flow through the system, we can maintain data integrity and provide accurate analytics results while respecting data protection boundaries.
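The following minimal sketch illustrates the version tracking and reconciliation idea with an in-memory version store standing in for a metadata service; dataset identifiers, checksums, and statuses are illustrative assumptions.

```python
# Minimal sketch of version tracking and reconciliation across zones.
# The "version store" is an in-memory stand-in for a metadata service.
import hashlib

version_store: dict[str, dict] = {}   # dataset_id -> {version, checksum, status}

def register_eu_dataset(dataset_id: str, version: int, payload: bytes) -> None:
    """EU side registers each anonymized dataset before transfer."""
    version_store[dataset_id] = {
        "version": version,
        "checksum": hashlib.sha256(payload).hexdigest(),
        "status": "transferred",
    }

def reconcile(dataset_id: str, version_seen: int, payload: bytes) -> bool:
    """US side confirms it analyzed the same version and content the EU side shipped."""
    meta = version_store.get(dataset_id)
    if meta is None or meta["version"] != version_seen:
        return False                                   # stale or unknown version
    if meta["checksum"] != hashlib.sha256(payload).hexdigest():
        return False                                   # content drifted in transit
    meta["status"] = "reconciled"                      # analytics results may be published
    return True

data = b"customer_hash,daily_volume\nabc123,10500\n"
register_eu_dataset("transactions_daily", version=3, payload=data)
print(reconcile("transactions_daily", version_seen=3, payload=data))   # True
```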
Building on these compliance and consistency foundations, we propose a comprehensive four-layer architecture (Figure 8) that integrates various technological components to address the complex requirements of modern financial data processing and analysis.
The data sources layer consists of several inputs that are characteristic of financial institutions. This layer manages market data feeds, transaction records, client information, documentary evidence, and external API connections. Ranging from structured transaction logs to unstructured documents, the variety of data sources reflects the heterogeneous character of financial data and therefore requires a flexible and robust architectural approach.
The data integration layer manages the flow between data sources and processing systems. The real-time streaming capabilities of Apache Kafka are well suited to time-sensitive financial activities, such as fraud detection and trading. While the API gateway handles external connectors, ETL pipelines manage data transitions across formats and platforms. For financial organizations under strict regulatory control, data validation systems that guarantee the accuracy and quality of incoming data are vital.
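As an illustration of the ingestion path in this layer, the sketch below validates an incoming transaction record and publishes it to a Kafka topic using the kafka-python client; the broker address, topic names, and validation rules are assumptions for the example.

```python
# Sketch of the integration layer's ingest path: validate an incoming record, then
# publish it to a Kafka topic for downstream processing. Broker address, topic names,
# and validation rules are illustrative; kafka-python is one of several client options.
import json
from kafka import KafkaProducer

REQUIRED_FIELDS = {"transaction_id", "account_id", "amount", "currency"}

def is_valid(record: dict) -> bool:
    """Basic data-validation gate before anything enters the pipeline."""
    return REQUIRED_FIELDS.issubset(record) and record["amount"] > 0

producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",                    # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

record = {"transaction_id": "t-001", "account_id": "a-42", "amount": 125.30, "currency": "EUR"}

if is_valid(record):
    producer.send("transactions.validated", value=record)
else:
    producer.send("transactions.quarantine", value=record)      # route bad records aside
producer.flush()
```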
The processing layer can meet varying computing needs. Apache Spark Streaming drives real-time processing for rapid data analysis demands, including trading algorithms and fraud detection. Hadoop MapReduce handles batch processing tasks, such as risk analysis and regulatory reporting, among other large-scale analytical tasks. Powered by frameworks such as TensorFlow and PyTorch, the specialized ML processing component supports comprehensive analysis jobs, such as credit scoring and market prediction.
The architecture's foundation is the storage layer, which maximizes data management using a variety of storage options. Typical RDBMS systems, such as Oracle, manage structured transactional data that require ACID compliance. While NoSQL databases give extra freedom for semi-structured data, HDFS offers distributed storage for massive datasets. Cloud platforms typically provide object storage in a multi-tiered scheme: hot storage for frequently accessed data, warm storage for occasionally accessed data, and cold storage for rarely accessed archive data. Object storage forms the core of a company's data lake and can effectively manage unstructured data, such as documents and images, with automatic policies that move data between tiers based on access patterns and age, optimizing both cost and performance. Each tier has different price points and access latencies: hot storage provides subsecond access but costs more per gigabyte, whereas cold storage is significantly cheaper but may have retrieval delays of several hours. This strategy enables organizations to balance performance requirements and storage costs throughout the data lifecycle.
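A minimal sketch of such an automated tiering policy is shown below using the boto3 client for Amazon S3; the bucket name, prefixes, and day thresholds are illustrative, and equivalent lifecycle mechanisms exist on other cloud platforms.

```python
# Sketch of an automated tiering policy for the object-storage layer: objects move from
# hot to warm to cold storage classes as they age. Bucket name, prefix, and day
# thresholds are illustrative assumptions.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="finance-data-lake",                        # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-documents",
                "Filter": {"Prefix": "raw/documents/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},   # warm tier
                    {"Days": 180, "StorageClass": "GLACIER"},      # cold/archive tier
                ],
            }
        ]
    },
)
```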
LLM-enhanced models, like NL2SQL models, can help financial firms democratize access to data and make better decisions. An NL2SQL model converts natural language queries into SQL [109]. Finance firms can leverage NL2SQL models to empower non-technical users to easily extract complex financial information (performance metrics, risk assessments, compliance reports, etc.) without learning SQL. These models reduce the burden on IT experts and enable real-time access to data. This self-service analysis capability reduces IT dependency, speeds up reporting, shortens turnaround times, and supports regulatory reporting compliance. Firms can also build on it to deploy other applications, such as chatbots or automated risk management tools.
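A hedged sketch of such an NL2SQL flow is shown below; the model call is a stubbed placeholder for an LLM, the schema and question are illustrative, and the guard that accepts only SELECT statements stands in for the fuller validation a production deployment would require.

```python
# Hedged sketch of an NL2SQL flow: a natural-language question is turned into SQL by
# a model call (stubbed here as a hypothetical function), the generated statement is
# sanity-checked, and only then executed. Schema and question are illustrative.
import sqlite3

SCHEMA_HINT = "TABLE transactions(account_id TEXT, amount REAL, booked_at TEXT)"

def generate_sql(question: str, schema: str) -> str:
    """Placeholder for an NL2SQL model call; a real system would prompt an LLM here."""
    # Stubbed output for the illustrative question below.
    return ("SELECT account_id, SUM(amount) AS total "
            "FROM transactions GROUP BY account_id ORDER BY total DESC LIMIT 5")

def run_safely(question: str, conn: sqlite3.Connection):
    sql = generate_sql(question, SCHEMA_HINT)
    if not sql.lstrip().upper().startswith("SELECT"):      # allow read-only queries only
        raise ValueError("Generated statement rejected: not a SELECT")
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions(account_id TEXT, amount REAL, booked_at TEXT)")
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)",
                 [("a-1", 100.0, "2025-01-01"), ("a-2", 250.0, "2025-01-02")])
print(run_safely("Which accounts have the highest total transaction volume?", conn))
```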
The dashed lines in the architectural diagram show the connections between layers, i.e., the bidirectional flow of data over the system. This architecture supports the analytical needs of financial organizations by allowing both the real-time processing of incoming data and the retroactive examination of old information. The modular character of the architecture allows component scaling and replacement to be possible without affecting the whole system, therefore offering the flexibility needed to fit changing financial technology needs. It offers a template for companies trying to upgrade their data processing capacity while preserving the dependability and resilience needed in financial operations, therefore reflecting major progress in financial technology infrastructure.
This architectural design combines various storage types, improving the management of different data types without compromising performance or compliance requirements. The variety of processing layers can handle anything from long-term risk assessment to real-time trading decisions. Furthermore, the robust integration layer ensures data consistency and reliability throughout the system, which is essential for maintaining operational integrity and meeting regulatory requirements.
In practice, organizations benefit from (1) advanced data classification and dynamic event filtering, which are often integrated into managed streaming tools (like Apache Kafka and Amazon MSK) that route data according to pre-established definitions; (2) versioning and reconciliation across EU/non-EU spaces to minimize latency and temporary divergences; (3) robust encryption and zero-trust principles, which ensure decryption keys and cryptographic controls stay compliant and within the EU; and (4) flexible scaling, with mission-critical/regulated operations remaining on-premises and computationally intensive processes being offloaded to the public cloud.

5. Performance Analysis

5.1. Problem Definition

In this section, we evaluate the performance of various technologies in handling structured, semi-structured, and unstructured data within the financial sector. Financial institutions frequently resort to data-driven decisions, but the diversity and volume of their datasets often expose significant limitations in current technologies. Relational databases like Oracle SQL excel at processing structured data, but they face inefficiencies when handling semi-structured or unstructured datasets. Big data technologies, such as Hadoop and Spark, address scalability and distributed processing needs but struggle with real-time analytics and transactional integrity. Similarly, cloud platforms provide scalable infrastructure, but they often introduce challenges in compliance, data integration, and security.
This study tested the ability of these technologies to overcome these challenges by implementing tailored machine learning models for each data type. Structured data were analyzed using the Random Forest algorithm, semi-structured data were processed through Gradient Boosted Trees (GBT), and unstructured data were examined via Sentiment Analysis. Performance was measured using key metrics, i.e., memory usage, CPU usage, GPU usage, total time, and query time, to provide a comprehensive understanding of the strengths and weaknesses of SQL, Python, and Spark in these contexts.
By exploring these technologies, our aim was to enrich the data analytics capabilities of the financial sector, proposing integrated solutions that optimize performance and address technological limitations in different types of data.
Although the SQL testing approach offers preliminary data insights and hypothetical decision-making criteria without actual predictive modeling, using Python and PySpark allows for the direct implementation of a Random Forest classifier for creditworthiness prediction. These methods are similar in terms of training, prediction, and evaluation, but their accuracy metrics can vary. When choosing between Python and PySpark, specific factors, such as dataset size, computational resources, and the intended scope of analysis, are considered. Python's scikit-learn is more user-friendly for smaller datasets, whereas PySpark is the preferable choice for processing large data volumes [110,111].
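The sketch below illustrates this choice with the same Random Forest workflow expressed in scikit-learn and in PySpark; the synthetic data, feature names, and hyperparameters are illustrative and do not reproduce the experimental setup.

```python
# Side-by-side sketch of a creditworthiness classifier in scikit-learn and PySpark.
# The data are synthetic stand-ins; column names and hyperparameters are illustrative.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# --- scikit-learn: convenient for datasets that fit on a single machine ---
sk_model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print("scikit-learn accuracy:", accuracy_score(y_test, sk_model.predict(X_test)))

# --- PySpark: the same model family, but distributed across a cluster ---
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier as SparkRF
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("credit-rf").getOrCreate()
cols = [f"f{i}" for i in range(10)]
pdf = pd.DataFrame(X, columns=cols)
pdf["label"] = y
sdf = spark.createDataFrame(pdf)

assembled = VectorAssembler(inputCols=cols, outputCol="features").transform(sdf)
train, test = assembled.randomSplit([0.75, 0.25], seed=42)
spark_model = SparkRF(numTrees=100, labelCol="label", featuresCol="features").fit(train)
acc = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy") \
    .evaluate(spark_model.transform(test))
print("PySpark accuracy:", acc)
```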

5.2. Performance Analysis and Results

This section discusses the results from performance tests carried out using SQL, Python, and Apache Spark across structured, semi-structured, and unstructured datasets. The findings directly address Research Question 3 (RQ3), which explores the influence of relational databases, big data platforms, and cloud technologies on financial data management and analysis. Analyzing execution time, memory usage, and CPU utilization allows us to build a practical evaluation framework, which is relevant for financial institutions aiming to improve the efficiency of their data storage and analytics systems.
The chosen evaluation metrics were used due to their importance in financial scenarios, where execution time directly shows the responsiveness and transaction processing speed, which is critical for financial institutions when dealing with real-time transactions and analytics. Memory usage metrics help evaluate how efficiently resources are being used, which has a direct impact on both cost-effectiveness and the scalability of financial systems. CPU utilization offers insight into computational efficiency and can highlight processing bottlenecks that might compromise system performance or stability, especially during peak financial workloads.
To improve the reliability of the results, each performance test was run 50 times. Tests were conducted on datasets, ranging from 100,000 to 1,000,000 records, using a consistent hardware setup on Google Colab with an NVIDIA A100 GPU. This environment was selected for its accessibility, reliability, and stable performance, which are important for ensuring the reproducibility and comparability of the results. The NVIDIA A100 GPU played a key role due to its strong parallel processing capabilities, making it well suited for handling the large datasets, complex computations, and high processing demands that are typical of financial data workflows. In addition, GPU acceleration led to noticeable reductions in execution time and enhanced the performance of the distributed tasks, aligning with the computational requirements of modern financial systems.
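A simplified sketch of such a repeated-measurement harness is given below, using psutil to sample memory and CPU around a placeholder workload; the workload, number of runs, and reported statistics are illustrative rather than the exact instrumentation used in the tests.

```python
# Sketch of a repeated-measurement harness: each workload is run many times while
# wall-clock time, resident memory, and CPU usage are recorded. The workload below
# is a placeholder; 50 repetitions mirror the setup described above.
import statistics
import time
import psutil

def workload() -> None:
    sum(i * i for i in range(1_000_000))       # stand-in for a query or training job

def benchmark(fn, runs: int = 50) -> dict:
    proc = psutil.Process()
    times, mem, cpu = [], [], []
    for _ in range(runs):
        proc.cpu_percent(interval=None)        # reset the CPU counter for this run
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
        mem.append(proc.memory_info().rss / 1e6)       # resident memory in MB
        cpu.append(proc.cpu_percent(interval=None))    # % CPU since the reset above
    return {
        "mean_time_s": statistics.mean(times),
        "stdev_time_s": statistics.stdev(times),
        "mean_mem_mb": statistics.mean(mem),
        "mean_cpu_pct": statistics.mean(cpu),
    }

print(benchmark(workload))
```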
Future evaluations could include additional factors to improve the depth and relevance of the testing framework. For example, monitoring GPU utilization may provide more detailed insights into how effectively hardware acceleration is used in data processing. Measuring network latency and disk I/O throughput would also be useful, particularly in the distributed or cloud-based environments common in financial institutions, offering a more complete picture of system performance. Finally, including metrics related to reliability and fault tolerance would help evaluate how well the platforms can handle real-world operational demands.
Table 7 summarizes the results observed during testing, highlighting SQL’s superior performance in handling structured datasets (primarily due to its low memory consumption and quick execution times), Python’s adaptability for semi-structured data, and Spark’s excellent capability with unstructured and large-scale semi-structured data, leveraging its distributed computing capabilities. These are, of course, synthetic tests that attempt to mimic real-world scenarios. Actual performance can be highly dependent on the nature of the queries, indexes, cluster size, network, etc. In our tests, we used Python 3.11 and Spark 3.5.4.
Cross-validation was performed using a k-fold methodology for machine learning components, helping to verify the consistency of our findings across different data partitions. The error analysis revealed several important considerations.
  • Data Quality Impact
    • Missing Values affected the processing time by 15.03%, as observed in SQL (from 3.91 to 4.50 s), Python (from 164.5 to 189.2 s), and Spark (from 163.1 to 187.5 s).
    • Data-Type Mismatches increased memory usage by up to 8.33%, such as SQL’s memory increasing from 0.132 MB to 0.143 MB, Python’s increasing from 488.6 MB to 527.7 MB, and similar patterns being observed for Spark.
    • Inconsistent Formatting required additional preprocessing overhead, with Spark query times for semi-structured data rising by 25% (from 3.28 to 4.10 s). SQL and Python also experienced proportional increases in query and preprocessing times.
  • System Stability
    • SQL maintained consistent performance (±3%), even for large datasets, such as those with 1,000,000 records.
    • Python showed higher variability (±7%), with processing times for semi-structured datasets ranging from 3.79 to 4.15 s.
    • Spark demonstrated moderate stability (±5% variation) across its distributed architecture, which was especially evident in unstructured datasets with query times ranging between 16.2 and 17.0 s.
  • Data-Type Samples and Results Metrics
Structured datasets (e.g., Table 8) are characterized by a consistent schema, making them ideal for systems like SQL. In this study, these datasets included information, such as customer account details, transaction logs, and financial histories. SQL efficiently processed these datasets due to its ability to handle predefined schema and execute complex queries rapidly.
Semi-structured datasets (e.g., Table 9), such as JSON files, combine structured elements with hierarchical data formats. This flexibility allows for varied data organization, making Python a suitable choice due to its extensive libraries, like Pandas for data manipulation and Scikit-learn for machine learning tasks.
Unstructured datasets, as illustrated in Table 10, include free-form text and customer feedback that pose unique challenges due to their lack of a predefined format. Spark excelled in processing these datasets, leveraging its distributed architecture to manage the computational load effectively. Textual data, such as customer reviews or social media sentiments, were analyzed using Spark’s ability to scale and perform real-time processing, enabling actionable insights for financial institutions. These datasets often require advanced natural language processing to extract meaningful information, such as identifying sentiment trends or detecting fraudulent patterns. Spark’s ability to transform disorganized, unstructured data into meaningful analytics demonstrates its critical role in modern financial data management.
  • Results and Visualization
As shown in Figure 9, SQL consistently demonstrated its efficiency in keeping memory usage remarkably low, even as the dataset size expanded from 100,000 records to 500,000 records and finally to 1,000,000 records. This trend holds not only for structured datasets but also for semi-structured and unstructured data, which is a testament to its resource-efficient design. Python, on the other hand, showed a notable increase in memory usage as the datasets grew, particularly for structured data, where it reached 488 MB at the largest size. This heavy reliance on memory highlights Python's limitations in scaling efficiently. Spark struck a balance, maintaining stable memory usage across all dataset types and sizes. This makes it particularly effective when dealing with large unstructured datasets, where managing resources is key.
As depicted in Figure 10, SQL efficiently managed CPU usage, delivering consistent performance across all dataset sizes, particularly for structured data. This stability reflects its ability to optimize processing power effectively. Spark also showed impressive performance, distributing tasks efficiently across its architecture, which was especially evident in its handling of large unstructured datasets. Python exhibited minimal CPU involvement, relying heavily on memory resources.
Both Python and Spark effectively utilize GPUs for processing semi-structured and unstructured data, but they do so in very different ways (Figure 11). Spark, in particular, stands out for how well it leverages GPUs to significantly reduce processing times for large datasets. This makes it a powerful choice for high-performance analytics. Python also uses GPUs, though its overall scalability limitations diminish its effectiveness. SQL, by design, does not rely on GPU resources, instead leveraging CPU power to achieve its results.
The performance of SQL continued to stand out for structured data, where it managed to process even the largest datasets with impressive speed (Figure 12). There was a slight increase in processing time for the unstructured data, but SQL still delivered reliable and consistent results. Python struggled more noticeably as the dataset sizes increased, particularly with structured data, which suggests scalability challenges when handling larger workloads. In contrast, Spark proved its robustness and adaptability. It reliably handled large datasets of all types, with its scalability becoming particularly evident when datasets exceeded 500,000 records, especially in the case of unstructured data.
SQL is clearly the leader in query performance, maintaining low query times across structured, semi-structured, and unstructured datasets, even at the largest scales (Figure 13). This reliability makes it ideal for scenarios where speed and precision are critical. Spark performs well too, especially when dealing with large unstructured datasets, where it often outpaces Python. However, Python continues to lag behind in query performance, with slower times across all data types.

5.3. Implications for Financial Data Management and Applications

By connecting these visualizations to the respective data types and dataset sizes, this study provides a comprehensive road map for financial institutions. This guidance helps institutions align technology choices with their specific needs, ensuring optimal performance and scalability across diverse data workloads.
SQL stands out as the most reliable choice for structured data, which plays a critical role in financial operations, like transaction processing and regulatory reporting. Python's versatility makes it ideal for experimental and semi-structured data tasks, including predictive modeling. Spark's ability to scale efficiently makes it the top choice for unstructured data, addressing the growing demand for advanced analytics in areas like fraud detection and customer insights. These findings highlight the need for a tailored approach when leveraging these technologies to address diverse data challenges.
Financial institutions can profit by integrating these technologies. SQL can efficiently handle operational workloads, Python can drive machine learning and transformation tasks, and Spark can process large-scale data in real time. Advanced ETL workflows and hybrid cloud deployments further amplify the synergy between these tools, creating a seamless, scalable set of interconnected systems.
From the above results and other previous studies [8,93,112,113], we observe an alignment in findings despite different contexts (finance vs. other domains): relational databases continue to be central for high-performance management of structured data and transactional integrity; big data frameworks like Spark provide the necessary scalability and capability to handle the growing variety and volume of data; and Python (and associated data science tools) adds flexibility for complex analytics and AI tasks. Rather than one replacing the others, the trend is clearly toward architectural convergence—making best use of each where appropriate and integrating them. Financial institutions, in particular, have moved in this direction to modernize their data infrastructure. The concept of a unified data platform (where data can be stored once and accessed through SQL, Python, or Spark as needed) is becoming reality, as seen with lakehouse implementations and cloud offerings.
One key trend is the improvement of performance in each area so that the gaps narrow. Spark’s continuous improvement of the DataFrame API and SQL support [114] has made it much more palatable to SQL users, and its performance in SQL workloads has improved dramatically (rivaling MPP databases in some cases). Meanwhile, SQL engines have adapted to semi-structured data and added extensibility (user-defined functions, ML integrations, etc.) to cover more use cases. Python, through projects like Numba and Cython, can compile critical code to speed up loops, and libraries like Rapids (cuDF) even allow Pandas-like operations on GPUs—effectively bringing some Spark-like speed to single-node workflows. On the horizon, we see the rise of data lakehouses and cloud data platforms that essentially combine these technologies under the hood: For example, Snowflake’s engine (a proprietary SQL DB) can now execute Python (via Snowpark), and it will orchestrate Spark jobs externally if needed. In addition, Databricks (Spark) is adding direct support for Pandas code and auto-generating query plans for it.
The performance measures noted in our experiments have implications in the financial domain, as depicted in Table 11. By aligning technology strengths with financial operations, the architecture can be adjusted according to evidence rather than theory. SQL is ideal for retail banking transactional systems that require low memory and predictable performance under heavy loads. Similarly, the incorporation of Spark to handle unstructured data efficiently provides a strong business case for customer analytics applications that deal with text-heavy feedback and communication. The links between how technologies perform and the business applications they serve demonstrate the importance of properly architected financial systems.
On the innovation front, big data and AI are equally relevant in the study of finance today as cloud computing. Many advanced analytics, from fraud detection to customer personalization, rely on machine learning APIs provided by the cloud or cloud GPU clusters for scalable ML training.
Another trend is the emphasis on real-time analytics and streaming data integration. Not long ago, batch processing was the norm in big data (MapReduce was initially batch-only). Now, use cases often demand instant insights (fraud detection as transactions stream, personalization as customers click, etc.). This has pushed frameworks to evolve: Spark Structured Streaming offers unified batch/stream processing, and databases like PostgreSQL are incorporating features for real-time analytics (logical replication to stream changes). Financial firms in high-frequency trading or fraud detection rely on hybrid setups where streaming events are processed by fast in-memory systems (Apache Flink or Spark Streaming) and then stored in both a data lake (for later analysis via Spark) and a relational store (for audit and quick queries). The ability of an architecture to handle both streaming and batch processing in a cohesive way is part of modernization.
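As a brief illustration of the streaming half of such a setup, the sketch below uses Spark Structured Streaming to read transactions from a Kafka topic, apply a simple screening rule, and write flagged events to a data lake path; the topic, threshold, and paths are illustrative assumptions, and a real deployment would use trained models rather than a fixed rule.

```python
# Sketch of stream screening: transactions arrive on a Kafka topic, are filtered in
# near real time, and flagged events are persisted for later batch analysis.
# Topic names, thresholds, and paths are illustrative; the Kafka connector package
# must be available on the Spark classpath.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("fraud-stream").getOrCreate()

schema = StructType([
    StructField("transaction_id", StringType()),
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("country", StringType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka.internal:9092")   # hypothetical broker
    .option("subscribe", "transactions.validated")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
    .select("t.*")
)

# A deliberately simple screening rule; production systems would apply trained models.
flagged = events.filter((F.col("amount") > 10_000) | (F.col("country") == "XX"))

query = (
    flagged.writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "s3a://finance-lake/flagged/")
    .option("checkpointLocation", "s3a://finance-lake/checkpoints/flagged/")
    .start()
)
query.awaitTermination()
```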
Cloud platforms can bring considerable elasticity and can reduce capital expenditures by provisioning or decommissioning computational resources as demand shifts, but they also require stringent oversight regarding data transfer rules and encryption. The most effective architectures seem to be hybrid: (1) relational systems provide ACID compliance and high-speed OLTP operations, (2) big data platforms facilitate large-scale or unstructured data handling, and (3) cloud services ensure dynamic allocation of storage and computation. This approach usually enables both operational agility and cost effectiveness, as organizations deploy workloads according to the data type, regulatory constraints, and performance requirements.
Regarding sustainability, in cloud environments, resources are elastically allocated, so institutions use servers only when they need them and can scale back at other times, which means that hardware does not sit idle. On-premises setups often run servers 24/7 regardless of demand (for reliability or because of static provisioning). Cloud providers can maximize utilization because they serve large pools of customers on shared infrastructure. For example, a bank may use cloud computing for an hour a day to model risk, and the same server may then run another company's workload. This increases the useful work per unit of energy consumed. The net effect is fewer total servers required industry-wide.
Some banks could turn sustainability into a tech use case itself, with analytic dashboards and AI tools to monitor the carbon footprint of their IT (and entire) operations. These tools ingest the data from data center power meters, cloud usage reports, etc., and then they present it to IT managers, creating feedback loops for improvement. For example, if a particular department’s analytics jobs are spiking energy usage, it can be identified and optimized. This is an internal application of big data for self-improvement. On the client-facing side, many banks have rolled out features in digital banking apps that show customers the carbon impact of their spending or allow them to invest in sustainable funds—while not directly an IT sustainability measure, it indicates a broader integration of sustainability into the digital strategy. A bank that has a culture of sustainability will more likely also apply it to IT.
Storing and processing data has an environmental cost, so "data hygiene" is gaining importance. Financial institutions are notorious for hoarding data due to regulatory needs (e.g., storing years of transaction records) and analytical ambitions. Now, data teams are instituting lifecycle policies: deleting or archiving data that are no longer needed, compressing data, and avoiding unnecessary duplication. Not only does this reduce storage hardware and energy, but it can also improve performance (i.e., less data to scan means queries run faster, saving CPU). Some banks can move cold data to more energy-efficient storage tiers, for instance, moving rarely accessed data to tape libraries or to cloud archival storage (which runs on lower-power systems at lower costs) [115], rather than keeping everything on spinning disks. There is also a reconsideration of real-time vs. batch processing: real-time streaming of data is useful but can be computationally heavy if overused, and if certain reports can be generated daily instead of streaming updates by the minute, this can reduce continuous processing loads. Essentially, aligning the level of data intensity with actual business need prevents over-engineering that wastes resources.
A noteworthy strategic consideration is the balance between edge computing and central cloud computing. Some financial services (like ultra-low-latency trading or ATMs) require local processing at the edge. Edge infrastructure, which processes data locally rather than in centralized facilities, typically operates at lower energy efficiency than consolidated data centers because of limited physical space for optimization and the inability to implement enterprise-scale cooling and power management solutions. To manage this, banks try to keep as much processing as possible in efficient central locations and push to the edge only what is absolutely necessary (thereby not proliferating thousands of mini data centers unnecessarily). Where edge is needed, there is also work on making those deployments as efficient as possible, e.g., using small form-factor servers that are energy-thrifty, or ensuring edge locations utilize existing infrastructure (like running an ATM's computer off the same power source that already lights the branch, to avoid redundant systems). These architecture decisions can minimize the overall energy footprint.
Technology strategy also involves continual monitoring of energy and carbon metrics as first-class operational metrics through dashboards linked to network operations centers. Just as server uptime and transaction throughput can be monitored in real time, so can kilowatt usage. This visibility is key to prompt action when something is amiss (such as, for example, a malfunctioning cooling unit causing others to work harder, spiking energy draw, etc.). With IoT sensors and detailed logs, IT staff can treat energy anomalies as incidents to be fixed. Some financial institutions have begun to incorporate energy efficiency metrics into performance evaluation systems by linking data center managers’ compensation incentives directly to documented improvements in energy conservation, thereby embedding sustainability goals into standard management practices.
However, even though cloud computing has an edge over on-premises setups in terms of sustainability, there might be a rebound effect. As computing becomes more efficient and cloud capacity is abundant, higher consumption might occur and could nullify efficiency improvements as a result of increased absolute energy use. There are other techno-ethical governance dilemmas: financial AI and high-frequency trading systems require large amounts of computing power, leading to questions about the environmental cost of such innovation. Concerns arise if companies rely on excessive carbon offsets or vague cloud provider claims instead of real reductions, which may be a form of greenwashing. In addition, sustainability is more than carbon usage: the high water usage of data centers for cooling and the e-waste resulting from their hardware refresh cycles also cause social issues. We can argue about which is better: on-premises or cloud sovereignty setups. Regulations frequently require data localization, forcing financial companies to maintain less energy-efficient local data centers rather than using more sustainable cloud options. These conflicts show that making green computing choices is not clear-cut, as it frequently involves complex compliance considerations rather than straightforward technical choices.

6. Conclusions

The financial services sector stands to gain from this study’s empirical results regarding the optimization of technical methodologies for specific financial functions. The findings suggest that, ranging from high-frequency trading to retail banking services, the selection of a technological framework considerably influences the efficiency and effectiveness of various financial service operations.
To answer RQ1, by reviewing the current literature, it is evident that financial institutions increasingly converge three salient technological strands: relational databases, big data frameworks, and cloud-based services. Relational databases remain fundamental for the secure handling of structured data (e.g., customer records, transaction logs, etc.), given their schema enforcement and sophisticated querying capabilities. However, these traditional systems can struggle with the scalability and agility required by unstructured or high-volume data flows. Conversely, big data architectures, including Hadoop and Apache Spark, deliver impressive distributed processing to accommodate diverse, voluminous datasets (e.g., textual communications, social media feeds, etc.). However, their adoption is often complicated by transactional integrity requirements and the stringent compliance norms that pervade financial contexts.
Cloud solutions offer flexible, on-demand scaling, which can alleviate pressures related to workload volatility, regulatory reporting peaks, and emerging use cases, such as AI-driven analytics. However, this elasticity can be diminished by data locality constraints (GDPR, EU-US Data Privacy Framework, etc.) and security concerns. As a result, some organizations are gravitating toward hybrid designs that weave together on-premise systems and distributed cloud-based ecosystems. Notably, container orchestration (e.g., Kubernetes) and sophisticated security configurations (encryption-at-rest, fine-grained access control, etc.) feature prominently in successful integration efforts. In general, there is no single solution that works for everything, but best practices show how important it is to use strict data classification, strong governance protocols, and iterative microservice-based designs to handle the complicated nature of modern financial data.
In addressing Research Question 2 (RQ2), we explored the practical, technical, and strategic challenges faced by financial institutions when integrating relational databases, big data solutions, and cloud computing across different regions. We have shown how the system maintains data integrity and complies with regional data protection requirements by keeping all personal data processing within the EU and validating the state and version history before finalizing analytics. Our proposal is based on a layered hybrid cloud model, i.e., one that combines local (EU-compliant) infrastructures with carefully selected public cloud services, as well as enables a balance between scalability, cost containment, and regulatory adherence. Critical to this effort is a clear delineation between personal or regulated data, which must remain under strict residency rules, and less sensitive or anonymized data, which can be securely relayed to external regions (including US-based cloud services) for large-scale analytics.
In reality, organizations gain from (1) advanced data classification and dynamic event filtering, which are frequently coordinated by managed streaming tools (like Apache Kafka and Amazon MSK) that selectively route information according to predetermined criteria; (2) versioning and reconciliation across EU and non-EU zones to maintain data integrity while reducing latency and temporary inconsistencies; (3) strong encryption and zero-trust principles, which guarantee that decryption keys and cryptographic controls stay within EU jurisdiction; and (4) flexible scaling, which keeps mission-critical and regulated operations on-premises while offloading computationally demanding tasks to the public cloud. Financial institutions can react faster to changes in workload demand and show careful adherence to changing EU–US data transfer frameworks by combining this segmented architecture with thorough auditing and policy enforcement.
In addressing RQ3, our empirical evaluations in comparing SQL, Python-based analytics, and Spark show the tradeoffs in performance and resource utilization across structured, semi-structured, and unstructured datasets. Relational databases (SQL) excel in query execution and resource efficiency when handling well-defined transactional datasets (e.g., high-volume payment processing, real-time account monitoring, etc.), but they also exhibit more limited scalability for vast or rapidly shifting data. Big data frameworks, represented chiefly by Hadoop/Spark ecosystems, offer reliable scaling and distributed analytics, thus confronting the challenges posed by large volumes of unstructured information. However, ensuring optimal cluster configuration and data partitioning requires specialized expertise and ongoing resource management.
Cloud platforms bring considerable elasticity and can reduce capital expenditures by provisioning or decommissioning computational resources as demand shifts, but they also require stringent oversight regarding data transfer rules and encryption. The most effective architectures seem to be hybrid: (1) relational systems provide ACID compliance and high-speed OLTP operations, (2) big data platforms facilitate large-scale or unstructured data handling, and (3) cloud services ensure the dynamic allocation of storage and computation. This approach usually enables both operational agility and cost effectiveness, as organizations deploy workloads according to data type, regulatory constraints, and performance requirements.
Within the framework of investment banking activities, the query performance shown by SQL systems (processing one million records in about 4.2 s) can help high-frequency trading operations, where millisecond latency can have a major effect on trading results. Along with real-time trade execution, this performance quality supports effective position monitoring and risk analysis. At the same time, Spark’s capabilities in distributed processing are valuable for managing unstructured data, which often forms the basis of comprehensive risk analysis, such as sentiment from social media and market news updates.
Different but equally challenging needs arise from the retail banking sector. Our performance measures show that SQL's effective handling of structured data especially fits the high-volume, transaction-intensive character of retail banking activities. Maintaining timely customer service while guaranteeing transaction accuracy depends mainly on low memory consumption (0.17 GB per million records) and fast execution times. Moreover, the incorporation of Spark's real-time processing features has shown notable benefits in fraud detection, where the capacity to examine trends across large volumes of real-time data can stop major losses.
Applications in the insurance industry reflect still another important area where our results have relevance. In the handling of claims, where the system needs to efficiently handle both organized claim forms and various unstructured documents, such as medical reports and photo evidence, the SQL–Spark architecture we assessed demonstrated considerable promise. In actuarial computations and risk profiling, where sophisticated prediction models must process many data types to produce accurate risk assessments, Python’s machine learning features have proved very helpful.
From a regulatory compliance standpoint, our findings suggest that industry criteria are met or exceeded by the evaluated technology. The ACID guarantees of SQL databases provide the transactional integrity expected by financial authorities and the obtained sub-second response times for trade execution conform with regulatory criteria. Furthermore, the storage efficiency shown in our tests (SQL uses roughly 0.17 GB per million records) provides affordable compliance with data retention criteria while keeping fast access for audit needs.
A major challenge identified is achieving concurrency and throughput for mixed workloads, where the aim is to deliver fast transactions and heavy analytics at the same time. There is considerable research into "Hybrid Transaction/Analytical Processing" (HTAP) databases that support both OLTP and OLAP. These systems, which often combine a row store with a column store under the hood, can run queries on fresh transactional data without offloading to a separate data warehouse. An HTAP database can let a bank run a fraud-detection query on the same transaction data while it processes a customer's payment, all within a single database. This blurs the historic line between relational OLTP systems and big data analytics, pointing toward integrated architectures.
The analysis of the cost–benefit relationship of the IT infrastructure in financial institutions exposes several significant factors. Usually, the switch to hybrid architectures calls for large initial expenditures in infrastructure and training. However, our study of resource consumption patterns points to possible operational savings through better system maintenance practices, lowered processing overhead, and optimal storage use. Although particular cost savings will depend on institution size, data volume, and current infrastructure, performance testing shows that hybrid architectures may provide significant operational cost savings, especially for organizations running sizable data processing activities. Future studies would benefit from a thorough investigation of actual implementation expenses among many types and scales of financial institutions.
Together with role-based access control tailored to financial roles, end-to-end AES-256 encryption for financial transactions meets current regulatory requirements while maintaining reliable system performance. Although these measures often have limited impact on transaction processing speeds (which is an important consideration for banks), practical implementation typically involves careful key management and ongoing monitoring to ensure minimal operational disruption.

6.1. Practical Contributions and Implementation Guidelines

This study offers several contributions for designing an integrated architecture. First, we provide an evidence-based migration path from legacy monolithic systems to integrated architectures, recommending a phased approach that prioritizes high-value workflows (risk modeling, fraud detection, etc.) before transitioning core transactional systems. Our performance benchmarks provide technology executives with additional quantifiable metrics to build business cases for infrastructure modernization. However, actual return-on-investment time frames will vary based on institutional scale and data complexity. For institutions concerned with implementation risks, our hybrid model enables controlled transitions with clearly defined fallback mechanisms, mitigating the operational disruptions typically associated with large-scale architectural changes. The inherent modularity of the architecture allows organizations to selectively upgrade components as technologies evolve, creating an infrastructure that can adapt to emerging capabilities in quantum computing, federated learning, and next-generation cryptography. Most importantly, our architecture provides a framework for financial innovation laboratories to rapidly prototype and deploy new data-driven products without compromising the stability of production systems, potentially transforming how institutions leverage their data assets for competitive advantage. Finally, for multinational financial organizations, our approach to data sovereignty challenges creates a replicable pattern for managing regulatory divergence across jurisdictions, a persistent challenge that is exacerbated by the fragmenting global regulatory landscape.

6.2. Methodological Limitations

We utilized a synthetic extension of the German Credit dataset in our study and performed all experiments in a Google Colab environment, which does not fully reflect the complexity of real-life financial data or of hybrid on-premises/cloud computing infrastructures. In addition, performance tests were conducted on a set of tasks (query time, CPU and GPU usage, etc.) for SQL, Python, and Apache Spark without capturing the concurrency, streaming ingestion, or multi-tenant workloads of larger financial organizations. As such, our findings may indicate some scaling behavior, but they may not fully generalize to production environments handling billions of transactions or petabyte-scale data.
We also did not examine the cost and governance issues that shape technology adoption decisions. Factors such as total cost of ownership, commercial licensing, encryption overhead, and multi-jurisdictional compliance might shift an institution's technology strategy significantly. Moreover, we did not include other tools in the mix, such as Apache Flink or specialized NoSQL/HTAP databases, which may perform well in niche cases. Although our method still provides useful comparative information, additional testing in larger-scale and more diverse live contexts would help refine the results.

6.3. Future Directions

This research aimed to support strategic decision making for integrated data solutions, especially regarding the best combinations of technologies to meet the needs of different financial institutions. The results give companies a basis for making decisions about which technologies to use and show where more study and development are needed.
Technology strategies for sustainable computing in finance revolve around using less energy for the same output, using cleaner energy, and managing computing demand intelligently. This involves efficient use of cloud services, making in-house infrastructure as efficient as possible, and ensuring that software and data use are optimized for minimal waste. Going forward, emerging tech like quantum computing (if realized) or more efficient AI algorithms could further shift the landscape—but those remain on the horizon. Current strategies are very much about the optimization and smart management of existing computing paradigms.
The stack that was proposed can be further integrated with new technologies like the blockchain and AI-driven financial models. The flexibility of the proposed stack supports market growth and the development of new products while maintaining high-performance standards. Because these systems can be changed to meet new rules, they may be able to help shape the future of banking services.
Although our testing setting is robust, it might not fully reflect the complexity of real-world financial systems where many services are running at the same time. Also, even though our performance metrics cover a lot of ground, real-world financial applications may come across variables that were not present in our controlled testing setting.

Author Contributions

Conceptualization, S.-A.I. and V.D.; methodology, S.-A.I. and A.-O.R.; software, S.-A.I.; validation, V.D., S.-A.I., and A.-O.R.; formal analysis, S.-A.I. and A.-O.R.; investigation, V.D., S.-A.I., and A.-O.R.; resources, S.-A.I. and A.-O.R.; data curation, S.-A.I. and A.-O.R.; writing—original draft preparation, S.-A.I.; writing—review and editing, V.D., S.-A.I., and A.-O.R.; visualization, S.-A.I. and A.-O.R.; supervision, V.D.; project administration, S.-A.I. and V.D. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was co-financed by The Bucharest University of Economic Studies (ASE) during the PhD program.

Data Availability Statement

Scripts used for data creation and solution testing are available online at Source Data and Scripts (https://colab.research.google.com/drive/17HNXZOY800Ey_Odva_nZh9KJtaJbyXmn?usp=sharing, accessed on 1 December 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Naeem, M.A.; Arfaoui, N.; Yarovaya, L. The contagion effect of artificial intelligence across innovative industries: From blockchain and metaverse to cleantech and beyond. Technol. Forecast. Soc. Change 2025, 210, 123822. [Google Scholar] [CrossRef]
  2. Rizvi, S.K.A.; Rahat, B.; Naqvi, B.; Umar, M. Revolutionizing finance: The synergy of fintech, digital adoption, and innovation. Technol. Forecast. Soc. Change 2024, 200, 123112. [Google Scholar] [CrossRef]
  3. Gąsiorkiewicz, L.; Monkiewicz, J. Digital Finance and the Future of the Global Financial System; Routledge: London, UK, 2022. [Google Scholar] [CrossRef]
  4. Ionescu, S.A.; Diaconita, V. Transforming Financial Decision-Making: The Interplay of AI, Cloud Computing and Advanced Data Management Technologies. Int. J. Comput. Commun. Control. 2023, 18, 5735. [Google Scholar] [CrossRef]
  5. Ogundipe, D.O. Conceptualizing Cloud Computing in Financial Services: Opportunities and Challenges in Africa-US Contexts. Comput. Sci. IT Res. J. 2024, 5, 757–767. [Google Scholar] [CrossRef]
  6. Abis, D.; Pia, P.; Limbu, Y. FinTech and consumers: A systematic review and integrative framework. Manag. Decis. 2025, 63, 49–75. [Google Scholar] [CrossRef]
  7. Ren, X. Research on Financial Investment Decision Based on Cloud Model and Hybrid Drosophila Algorithm Optimization. In Proceedings of the 2nd International Conference on Artificial Intelligence and Information Systems (ICAIIS), Chongqing, China, 28–30 May 2021. [Google Scholar] [CrossRef]
  8. Sharma, R.K.; Bharathy, G.; Karimi, F.; Mishra, A.V.; Prasad, M. Thematic Analysis of Big Data in Financial Institutions Using NLP Techniques with a Cloud Computing Perspective: A Systematic Literature Review. Information 2023, 14, 577. [Google Scholar] [CrossRef]
  9. Ionescu, S.A.; Radu, A.O. Assessment and Integration of Relational Databases, Big Data, and Cloud Computing in Financial Institutions: Performance Comparison. In Proceedings of the 2024 International Conference on Innovations in Intelligent Systems and Applications (INISTA), Craiova, Romania, 4–6 September 2024; pp. 1–7. [Google Scholar] [CrossRef]
  10. Dataversity. Database Management Trends in 2020. 2020. Available online: https://www.dataversity.net/database-management-trends-in-2020/ (accessed on 5 April 2025).
  11. Lin, W. A Design of Hybrid Transactional and Analytical Processing Database for Energy Efficient Big Data Queries. In Proceedings of the International Conference on Green, Pervasive, and Cloud Computing, Macao, China, 27–30 September 2024; pp. 128–138. [Google Scholar] [CrossRef]
  12. Dratwinska-Kania, B.; Ferens, A. Business Intelligence System Adoption Project in the Area of Investments in Financial Assets. In Proceedings of the International Conference on Artificial Intelligence: Theory and Applications (AITA), Bengaluru, India, 11–12 August 2023; Lecture Notes in Networks and Systems. Sharma, H., Chakravorty, A., Hussain, S., Kumari, R., Eds.; Volume 844, pp. 259–273. [Google Scholar] [CrossRef]
  13. Schreiner, G.A.; Knob, R.; Duarte, D.; Vilain, P.; Mello, R.d.S. NewSQL Through the Looking Glass. In Proceedings of the 21st International Conference on Information Integration and Web-Based Applications and Services (iiWAS), Munich, Germany, 2–4 December 2019; ACM: New York, NY, USA, 2019; pp. 361–369. [Google Scholar] [CrossRef]
  14. Dong, H.; Zhang, C.; Li, G.; Zhang, H. Cloud-Native Databases: A Survey. IEEE Trans. Knowl. Data Eng. 2024, 36, 7772–7791. [Google Scholar] [CrossRef]
  15. Zhang, C.; Li, G.; Zhang, J.; Zhang, X.; Feng, J. HTAP Databases: A Survey. IEEE Trans. Knowl. Data Eng. 2024, 36, 6410–6429. [Google Scholar] [CrossRef]
  16. Oracle Corporation. Introduction to Oracle Real Application Clusters (Oracle RAC). 2023. Available online: https://docs.oracle.com/cd/B28359_01/rac.111/b28254/admcon.htm#i1058057 (accessed on 5 April 2025).
  17. Pérez-Chacón, R.; Asencio-Cortés, G.; Troncoso, A.; Martínez-Álvarez, F. Pattern sequence-based algorithm for multivariate big data time series forecasting: Application to electricity consumption. Future Gener. Comput. Syst. 2024, 154, 397–412. [Google Scholar] [CrossRef]
  18. de la Vega, A.; Garcia-Saiz, D.; Blanco, C.; Zorrilla, M.; Sanchez, P. Mortadelo: Automatic generation of NoSQL stores from platform-independent data models. Future Gener. Comput. Syst. Int. J. Esci. 2020, 105, 455–474. [Google Scholar] [CrossRef]
  19. Demigha, S. Information Management (IM) and Big Data. In Proceedings of the 21st European Conference on Knowledge Management (ECKM), Coventry University, Online, 2–4 December 2020; Garcia-Perez, A., Simkin, L., Eds.; pp. 157–163. Available online: https://www.proquest.com/docview/2474920736?fromopenview=true&pq-origsite=gscholar&sourcetype=Conference%20Papers%20&%20Proceedings (accessed on 5 April 2025).
  20. Rao, A.; Khankhoje, D.; Namdev, U.; Bhadane, C.; Dongre, D. Insights into NoSQL databases using financial data: A comparative analysis. Procedia Comput. Sci. 2022, 215, 8–23. [Google Scholar] [CrossRef]
  21. DataStax. Handling 21K Transactions per Second; DataStax: Santa Clara, CA, USA, 2025. [Google Scholar]
  22. Morzaria, J.; Narkhede, S. How Stripe’s Document Databases Supported 99.999% Uptime with Zero-Downtime Data Migrations; Stripe Engineering Blog: San Francisco, CA, USA, 2024. [Google Scholar]
  23. Couchbase. NoSQL Database for Financial Services; Couchbase: Santa Clara, CA, USA, 2025. [Google Scholar]
  24. MongoDB. Implementing an Operational Data Layer; MongoDB: New York, NY, USA, 2025. [Google Scholar]
  25. Progress Software Corporation. Progress MarkLogic and Semaphore Help Broadridge Streamline Post-Trade Processing; Technical Report; Progress Software Corporation: Burlington, MA, USA, 2024. [Google Scholar]
  26. Kadambi, S.; Hunt, M. HBase at Bloomberg: High Availability Needs for the Financial Industry. 2014. Available online: https://www.slideshare.net/slideshow/case-studies-session-4a-35937605/35937605 (accessed on 1 December 2024).
  27. Shaik, V.; Natarajan, K. Cloud databases: A resilient and robust framework to dissolve vendor lock-in. Softw. Impacts 2024, 21, 100680. [Google Scholar] [CrossRef]
  28. Rao, T.R.; Mitra, P.; Bhatt, R.; Goswami, A. The big data system, components, tools, and technologies: A survey. Knowl. Inf. Syst. 2019, 60, 1165–1245. [Google Scholar] [CrossRef]
  29. Hassan, M.; Bansal, S.K. Semantic Data Querying Over NoSQL Databases with Apache Spark. In Proceedings of the IEEE 19th International Conference on Information Reuse and Integration for Data Science (IRI), Salt Lake City, UT, USA, 7–8 July 2018; pp. 364–371. [Google Scholar] [CrossRef]
  30. Almajed, R.; Khan, B.S.; Bassam Nassoura, A.; Irshad, M.S.; Amjad, M.; Hassan, M.; Maqbool, S.; Pradhan, M. Social Media Analytics through Big Data Using Hadoop Framework. In Proceedings of the 2023 International Conference on Business Analytics for Technology and Security (ICBATS), Dubai, United Arab Emirates, 7–8 March 2023; pp. 1–10. [Google Scholar] [CrossRef]
  31. Chaudhary, J.; Vyas, V. Propositional Aspects of Big Data Tools: A Comprehensive Guide to Apache Spark. Int. J. Intell. Syst. Appl. Eng. 2024, 12, 631–639. [Google Scholar]
  32. Luo, C.; Cao, Q.; Li, T.; Chen, H.; Wang, S. MapReduce accelerated attribute reduction based on neighborhood entropy with Apache Spark. Expert Syst. Appl. 2023, 211, 118554. [Google Scholar] [CrossRef]
  33. Hussain, K.; Prieto, E. Big Data in the Finance and Insurance Sectors. In New Horizons for a Data-Driven Economy; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 209–223. [Google Scholar] [CrossRef]
  34. Mhlanga, D. The role of big data in financial technology toward financial inclusion. Front. Big Data 2024, 7, 1184444. [Google Scholar] [CrossRef]
  35. Wang, J. Big Data Financial Fraud Detection with Machine Learning. In Proceedings of the 2024 IEEE 2nd International Conference on Electrical, Automation and Computer Engineering (ICEACE), Changchun, China, 29–31 December 2024; pp. 599–603. [Google Scholar] [CrossRef]
  36. Al-Jumaili, A.H.A.; Muniyandi, R.C.; Hasan, M.K.; Paw, J.K.S.; Singh, M.J. Big Data Analytics Using Cloud Computing Based Frameworks for Power Management Systems: Status, Constraints, and Future Recommendations. Sensors 2023, 23, 2952. [Google Scholar] [CrossRef]
  37. Runsewe, O.; Samaan, N. Cloud Resource Scaling for Time-Bounded and Unbounded Big Data Streaming Applications. IEEE Trans. Cloud Comput. 2021, 9, 504–517. [Google Scholar] [CrossRef]
  38. Hoofnagle, C.J.; van der Sloot, B.; Borgesius, F.Z. The European Union general data protection regulation: What it is and what it means. Inf. Commun. Technol. Law 2019, 28, 65–98. [Google Scholar] [CrossRef]
  39. Sousa, V.; Barros, D.; Guimarães, P.; Santos, A.; Santos, M.Y. Conceptual Formalization of Massive Storage for Advancing Decision-Making with Data Analytics. In Lecture Notes in Business Information Processing (LNBIP, Volume 477); Springer: Berlin/Heidelberg, Germany, 2023; pp. 121–128. [Google Scholar] [CrossRef]
  40. Eichler, R.; Giebler, C.; Gröger, C.; Schwarz, H.; Mitschang, B. Modeling metadata in data lakes—A generic model. Data Knowl. Eng. 2021, 136, 101931. [Google Scholar] [CrossRef]
  41. U.S. Department of the Treasury. The Financial Services Sector’s Adoption of Cloud Services; Technical Report; U.S. Department of the Treasury: Washington, DC, USA, 2023. [Google Scholar]
  42. Technavio. Private and Public Cloud Market in Financial Services to Grow by USD 106.43 Billion (2024–2028), Driven by Big Data Demand; AI Driving Market Transformation; Technical Report; Technavio: Elmhurst, IL, USA, 2025. [Google Scholar]
  43. Ghandour, O.; El Kafhali, S.; Hanini, M. Computing Resources Scalability Performance Analysis in Cloud Computing Data Center. J. Grid Comput. 2023, 21, 61. [Google Scholar] [CrossRef]
  44. Gajbhiye, A.; Shrivastva, K.M.P.D. Cloud Computing: Need, Enabling Technology, Architecture, Advantages and Challenges. In Proceedings of the 5th International Conference on Confluence—The Next Generation Information Technology Summit (Confluence), Noida, India, 25–26 September 2014; Shukla, B., Bansal, A., Hasteer, N., Singhal, A., Eds.; IEEE: New York, NY, USA, 2014; pp. 1–7. [Google Scholar] [CrossRef]
  45. Alsaghir, M. Digital risks and Islamic FinTech: A road map to social justice and financial inclusion. J. Islam. Account. Bus. Res. 2023. [Google Scholar] [CrossRef]
  46. Chen, X.; Guo, M.; Shangguan, W. Estimating the impact of cloud computing on firm performance: An empirical investigation of listed firms. Inf. Manag. 2022, 59, 103603. [Google Scholar] [CrossRef]
  47. European Union. Regulation (EU) 2022/2554 of the European Parliament and of the Council of 14 December 2022 on digital operational resilience for the financial sector. Off. J. Eur. Union 2022, L333, 1–79. [Google Scholar]
  48. Arnal, J. The Banking Sector Is Increasingly Looking to the Cloud; Technical Report 2023-12; Centre for European Policy Studies (CEPS): Brussels, Belgium, 2023. [Google Scholar]
  49. Ortiz Huaman, C.H.; Fernandez Fuster, N.; Cuadros Luyo, A.; Armas-Aguirre, J. Critical Data Security Model: Gap Security Identification and Risk Analysis In Financial Sector. In Proceedings of the 17th Iberian Conference on Information Systems and Technologies (CISTI), Madrid, Spain, 22–25 June 2022. [Google Scholar] [CrossRef]
  50. Wang, Y.; Zhu, M.; Yuan, J.; Wang, G.; Zhou, H. The intelligent prediction and assessment of financial information risk in the cloud computing model. arXiv 2024, arXiv:2404.09322. [Google Scholar] [CrossRef]
  51. Program on International Financial Systems. Data Localization, Cloud Adoption, and the Financial Sector. 2024. Available online: https://www.pifsinternational.org/wp-content/uploads/2024/07/Report-on-Data-Localization-07.29.2024.pdf (accessed on 1 December 2024).
  52. McCanless, M. Banking on alternative credit scores: Auditing the calculative infrastructure of US consumer lending. Environ. Plan. Econ. Space 2023, 55, 2128–2146. [Google Scholar] [CrossRef]
  53. Jani, Y. AI-Driven Risk Management and Fraud Detection in High-Frequency Trading Environments. Int. J. Sci. Res. 2023, 12, 2223–2229. [Google Scholar] [CrossRef]
  54. International Association of Privacy Professionals. Financial Privacy Pros Look Back on GDPR, Forward to CCPA; International Association of Privacy Professionals: Portsmouth, NH, USA, 2021. [Google Scholar]
  55. California Legislature. California Consumer Privacy Act of 2018, Civil Code, Title 1.81.5. Available online: https://leginfo.legislature.ca.gov/faces/codes_displayText.xhtml?division=3.&part=4.&lawCode=CIV&title=1.81.5 (accessed on 1 December 2024).
  56. Clarke, O. How Does the Proposed American Data Privacy and Protection Act Compare to GDPR? 2022. Available online: https://www.osborneclarke.com/insights/how-does-proposed-american-data-privacy-and-protection-act-compare-gdpr (accessed on 1 December 2024).
  57. Reuters. Capital One Fined $80M Over Cloud Breach, 2020. Available online: https://www.reuters.com/article/business/capital-one-to-pay-80-million-fine-after-data-breach-idUSKCN2522D8/ (accessed on 1 December 2024).
  58. Hanson, N. Judge Approves Settlement Ordering Plaid to Pay $58 Million for Selling Consumer Data. Settled in 2021 for $58M and Agreement to Improve Disclosures. 2022. Available online: https://www.courthousenews.com/judge-approves-settlement-ordering-plaid-to-pay-58-million-for-selling-consumer-data/ (accessed on 1 December 2024).
  59. Kamaruddin, S.; Mohammad, A.M.; Saufi, N.N.M.; Rosli, W.R.W.; Othman, M.B.; Hamin, Z. Compliance to GDPR Data Protection and Privacy in Artificial Intelligence Technology: Legal and Ethical Ramifications in Malaysia. In Proceedings of the 2023 International Conference on Disruptive Technologies (ICDT), Greater Noida, India, 11–12 May 2023; pp. 284–288. [Google Scholar] [CrossRef]
  60. Baldini, D.; Francis, K. AI Regulatory Sandboxes between the AI Act and the GDPR: The role of Data Protection as a Corporate Social Responsibility. In Proceedings of the CEUR Workshop Proceedings, Bari, Italy, 14–18 October 2024; Volume 3731. [Google Scholar]
  61. SOCRadar. Top 20 Cybersecurity Conferences and Events to Attend in 2025. 2025. Available online: https://socradar.io/top-20-cybersecurity-conferences-and-events-2025/ (accessed on 1 December 2024).
  62. Yashkova, O.; Graham, S. Evaluating Carbon Emissions Savings from Cloud Computing; Technical Report; International Data Corporation (IDC): Needham, MA, USA, 2024. [Google Scholar]
  63. Bhattacherjee, S.; Das, R.; Khatua, S.; Roy, S. Energy-efficient migration techniques for cloud environment: A step toward green computing. J. Supercomput. 2020, 76, 5192–5220. [Google Scholar] [CrossRef]
  64. Wang, G.; Wen, B.; He, J.; Meng, Q. A new approach to reduce energy consumption in priority live migration of services based on green cloud computing. Clust. Comput. 2025, 28, 207. [Google Scholar] [CrossRef]
  65. Moss, S. IEA: Global data center electricity consumption to ‘increase significantly’, but remain a small part of overall usage. Data Cent. Dyn. 2024. Available online: https://www.datacenterdynamics.com/en/news/iea-global-data-center-electricity-consumption-to-increase-significantly-but-remain-a-small-part-of-overall-usage/ (accessed on 1 December 2024).
  66. Carbone 4. Digital Report: Cloudy with a Chance of Hidden Emissions; Carbone 4: Paris, France, 2024. [Google Scholar]
  67. Buyya, R.; Ilager, S.; Arroba, P. Energy-efficiency and sustainability in new generation cloud computing: A vision and directions for integrated management of data centre resources and workloads. Softw. Pract. Exp. 2024, 54, 24–38. [Google Scholar] [CrossRef]
  68. Katal, A.; Dahiya, S.; Choudhury, T. Energy efficiency in cloud computing data centers: A survey on software technologies. Clust. Comput. 2023, 26, 1845–1875. [Google Scholar] [CrossRef]
  69. de Vries, A. The growing energy footprint of artificial intelligence. Joule 2023, 7, 2191–2194. [Google Scholar] [CrossRef]
  70. IEA. Electricity 2024: Analysis and Forecast to 2026; IEA Publications: Paris, France, 2024. [Google Scholar]
  71. Borderstep Institute. Data Centres in Europe—Opportunities for Sustainable Digitalisation—Part II; Technical Report; ECO—Association of the Internet Industry: Mumbai, India, 2020. [Google Scholar]
  72. Uptime Institute. Global Data Center Survey Results 2024; Technical Report; Uptime Institute: New York, NY, USA, 2024. [Google Scholar]
  73. Chioti, K. Banking Can Harness Cloud Technology to Hit Net Zero. Here’s How; World Economic Forum: Davos, Switzerland, 2022. [Google Scholar]
  74. Freed, M.; Bielinska, S.; Buckley, C.; Coptu, A.; Yilmaz, M.; Messnarz, R.; Clarke, P.M. An Investigation of Green Software Engineering; Springer: Berlin/Heidelberg, Germany, 2023; pp. 124–137. [Google Scholar] [CrossRef]
  75. Bagozi, A.; Bianchini, D.; De Antonellis, V.; Garda, M.; Melchiori, M. Personalised Exploration Graphs on Semantic Data Lakes. In Proceedings of the 27th International Conference on Cooperative Information Systems, (CoopIS)/International Conference on Ontologies, Databases, and Applications of Semantics (ODBASE)/Conference on Cloud and Trusted Computing (C and TC), Rhodes, Greece, 21–25 October 2019; Volume 11877, pp. 22–39. [Google Scholar] [CrossRef]
  76. Janković, S.; Mladenović, S.; Mladenović, D.; Vesković, S.; Glavić, D. Schema on read modeling approach as a basis of big data analytics integration in EIS. Enterp. Inf. Syst. 2018, 12, 1180–1201. [Google Scholar] [CrossRef]
  77. Biswas, N.; Mondal, A.S.; Kusumastuti, A.; Saha, S.; Mondal, K.C. Automated credit assessment framework using ETL process and machine learning. Innov. Syst. Softw. Eng. 2022, 21, 257–270. [Google Scholar] [CrossRef]
  78. Kanungo, S. Hybrid Cloud Integration: Best Practices and Use Cases. Inf. Technol. 2021, 12, 13. [Google Scholar]
  79. Pokorný, J. Integration of Relational and NoSQL Databases. Vietnam J. Comput. Sci. 2019, 6, 389–405. [Google Scholar] [CrossRef]
  80. Deka, G.C. NoSQL Polyglot Persistence. In Advances in Computers; Elsevier: Amsterdam, The Netherlands, 2018; pp. 357–390. [Google Scholar] [CrossRef]
  81. Fang, H. Managing data lakes in big data era: What’s a data lake and why has it became popular in data management ecosystem. In Proceedings of the 2015 IEEE International Conference on Cyber Technology in Automation, Control and Intelligent Systems, IEEE-CYBER 2015, Shenyang, China, 8–12 June 2015. [Google Scholar] [CrossRef]
  82. Harby, A.A.; Zulkernine, F. Data Lakehouse: A survey and experimental study. Inf. Syst. 2025, 127, 102460. [Google Scholar] [CrossRef]
  83. Adams, B. Striking a Balance: Privacy and National Security in Section 702 Us Person Queries. Wash. Law Rev. 2019, 94, 401–451. [Google Scholar]
  84. EU-US. Participation Requirements Data Privacy Framework (DPF) Principles. Data Priv. Framew. 2023, 1, 1. [Google Scholar]
  85. Bradford, L.; Aboy, M.; Liddell, K. Standard contractual clauses for cross-border transfers of health data after Schrems II. J. Law Biosci. 2021, 8, lsab007. [Google Scholar] [CrossRef]
  86. Al-Hashedi, K.G.; Magalingam, P. Financial fraud detection applying data mining techniques: A comprehensive review from 2009 to 2019. Comput. Sci. Rev. 2021, 40, 100402. [Google Scholar] [CrossRef]
  87. Sonkavde, G.; Dharrao, D.S.; Bongale, A.M.; Deokate, S.T.; Doreswamy, D.; Bhat, S.K. Forecasting Stock Market Prices Using Machine Learning and Deep Learning Models: A Systematic Review, Performance Analysis and Discussion of Implications. Int. J. Financ. Stud. 2023, 11, 94. [Google Scholar] [CrossRef]
  88. Vuong, P.H.; Phu, L.H.; Nguyen, T.H.V.; Duy, L.N.; Bao, P.T.; Trinh, T.D. A bibliometric literature review of stock price forecasting: From statistical model to deep learning approach. Sci. Prog. 2024, 107, 368504241236557. [Google Scholar] [CrossRef] [PubMed]
  89. Elsayed, N.; Abd Elaleem, S.; Marie, M. Improving Prediction Accuracy using Random Forest Algorithm. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 436–441. [Google Scholar] [CrossRef]
  90. Sun, J.; Zhao, M.; Lei, C. Class-imbalanced dynamic financial distress prediction based on random forest from the perspective of concept drift. Risk-Manag. Int. J. 2024, 26, 19. [Google Scholar] [CrossRef]
  91. Jin, Y. Distinctive impacts of ESG pillars on corporate financial performance: A random forest analysis of Korean listed firms. Financ. Res. Lett. 2025, 71, 106395. [Google Scholar] [CrossRef]
  92. Huen, T.H.; Ming, L.T. A Study on Life Insurance Early Claim Detection Modeling by Considering Multiple Features Transformation Strategies for Higher Accuracy. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 1075–1087. [Google Scholar] [CrossRef]
  93. Lytvyn, O.; Kudin, V.; Onyshchenko, A.; Nikolaiev, M.; Chaplynska, N. Integration of Digital Means in The Financial Sphere: The Potential of Cloud Computing, Blockchain, Big Data and AI. Financ. Credit. Act. Probl. Theory Pract. 2024, 1, 127–145. [Google Scholar] [CrossRef]
  94. Domaschka, J.; Hauser, C.B.; Erb, B. Reliability and Availability Properties of Distributed Database Systems. In Proceedings of the 2014 IEEE 18th International Enterprise Distributed Object Computing Conference, Ulm, Germany, 1–5 September 2014; pp. 226–233. [Google Scholar] [CrossRef]
  95. PostgreSQL Global Development Group. PostgreSQL License. Available online: https://www.postgresql.org/about/licence/ (accessed on 1 December 2024).
  96. MariaDB Corporation. Licensing FAQ—Distribution Under GPL. MariaDB Knowledge Base. Available online: https://mariadb.com/kb/en/licensing-faq/ (accessed on 1 December 2024).
  97. Oracle Corporation. MySQL FOSS License Exception. MySQL: Commercial License for OEMs, ISVs and VARs. Available online: https://www.mysql.com/about/legal/licensing/oem/ (accessed on 1 December 2024).
  98. Phan, T.D.; Pallez, G.; Ibrahim, S.; Raghavan, P. A New Framework for Evaluating Straggler Detection Mechanisms in MapReduce. ACM Trans. Model. Perform. Eval. Comput. Syst. 2019, 4, 14. [Google Scholar] [CrossRef]
  99. Wei, S. Financial Information Fusion and service Platform Based on Cloud Computing. In Proceedings of the 3rd International Conference on Materials Engineering, Manufacturing Technology and Control (ICMEMTC), Taiyuan, China, 27–28 February 2016; AER-Advances in Engineering Research. Volume 67, pp. 1901–1905. [Google Scholar] [CrossRef]
  100. Garad, A.; Riyadh, H.A.; Al-Ansi, A.M.; Beshr, B.A.H. Unlocking financial innovation through strategic investments in information management: A systematic review. Discov. Sustain. 2024, 5, 381. [Google Scholar] [CrossRef]
  101. Hirve, S.; Reddy, C.H.P. A Survey on Visualization Techniques Used for Big Data Analytics. In Proceedings of the International Conference on Computer, Communication and Computational Sciences (IC4S), Bangkok, Thailand, 20–21 October 2018; Advances in Intelligent Systems and Computing. Volume 924, pp. 447–459. [Google Scholar] [CrossRef]
  102. Tun, M.T.; Nyaung, D.E.; Phyu, M.P. Performance Evaluation of Intrusion Detection Streaming Transactions Using Apache Kafka and Spark Streaming. In Proceedings of the International Conference on Advanced Information Technologies (ICAIT), Yangon, Myanmar, 6–7 November 2019; pp. 25–30. [Google Scholar] [CrossRef]
  103. Boyko, N.; Shakhovska, N. Prospects for Using Cloud Data Warehouses in Information Systems. In Proceedings of the 13th IEEE International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT), Lviv, Ukraine, 11–14 September 2018; pp. 136–139. [Google Scholar] [CrossRef]
  104. Gorawski, M.; Lis, D.; Gorawski, M. The Use of a Cloud Computing and the CUDA Architecture in Zero-Latency Data Warehouses. In Proceedings of the 20th International Conference on Computer Networks (CN), Lwowek Slaski, Poland, 17–21 June 2013; Communications in Computer and Information Science. pp. 312–322. [Google Scholar]
  105. Westerlund, M.; Hedlund, U.; Pulkkis, G.; Bjork, K.M. A Generalized Scalable Software Architecture for Analyzing Temporally Structured Big Data in the Cloud. In Proceedings of the World Conference on Information Systems and Technologies (WorldCIST), Funchal, Portugal, 15–18 April 2014; Advances in Intelligent Systems and Computing. Volume 275, pp. 559–569. [Google Scholar] [CrossRef]
  106. Singh, S.; Liu, Y. A Cloud Service Architecture for Analyzing Big Monitoring Data. Tsinghua Sci. Technol. 2016, 21, 55–70. [Google Scholar] [CrossRef]
  107. Shakatreh, M.; Abu Orabi, M.M.; Al Abbadi, A.F.A. Impact of Cloud Computing on Quality of Financial Reports with Jordanian Commercial Banks. Montenegrin J. Econ. 2023, 19, 167–178. [Google Scholar] [CrossRef]
  108. Clemente-Castello, F.J.; Nicolae, B.; Katrinis, K.; Rafique, M.M.; Mayo, R.; Carlos Fernandez, J.; Loreti, D. Enabling Big Data Analytics in the Hybrid Cloud using Iterative MapReduce. In Proceedings of the IEEE/ACM 8th International Conference on Utility and Cloud Computing (UCC), Limassol, Cyprus, 7–10 December 2015; pp. 290–299. [Google Scholar] [CrossRef]
  109. Du, X.; Hu, S.; Zhou, F.; Wang, C.; Nguyen, B.M. FI-NL2PY2SQL: Financial Industry NL2SQL Innovation Model Based on Python and Large Language Model. Future Internet 2025, 17, 12. [Google Scholar] [CrossRef]
  110. Malakauskas, A.; Lakstutiene, A. Financial Distress Prediction for Small and Medium Enterprises Using Machine Learning Techniques. Inz. Ekon. Eng. Econ. 2021, 32, 4–14. [Google Scholar] [CrossRef]
  111. Tuan, N.M.; Meesad, P. A Study of Predicting the Sincerity of a Question Asked Using Machine Learning. In Proceedings of the 5th International Conference on Natural Language Processing and Information Retrieval (NLPIR), Sanya, China, 17–20 December 2021; pp. 129–134. [Google Scholar] [CrossRef]
  112. Cheng, M.; Qu, Y.; Jiang, C.; Zhao, C. Is cloud computing the digital solution to the future of banking? J. Financ. Stab. 2022, 63, 101073. [Google Scholar] [CrossRef]
  113. Tereszkiewicz, P.; Cichowicz, E. Findings from the Polish InsurTech market as a roadmap for regulators. Comput. Law Secur. Rev. 2024, 52, 105948. [Google Scholar] [CrossRef]
  114. The Apache Software Foundation. Spark SQL and DataFrames; The Apache Software Foundation: Wilmington, NC, USA, 2025. [Google Scholar]
  115. Khan, A.Q.; Matskin, M.; Prodan, R.; Bussler, C.; Roman, D.; Soylu, A. Cloud storage cost: A taxonomy and survey. World Wide Web 2024, 27, 36. [Google Scholar] [CrossRef]
Figure 1. The methodological framework.
Figure 2. Database systems.
Figure 3. Big data systems.
Figure 4. Cloud systems.
Figure 5. System integration.
Figure 6. Hybrid cloud infrastructure.
Figure 7. Sequence diagram: cross-border transaction analysis flow with data anonymization and version metadata tracking.
Figure 8. Interconnected architecture.
Figure 9. Memory usage vs. dataset size.
Figure 10. CPU usage vs. dataset size.
Figure 11. GPU usage vs. dataset size.
Figure 12. Total time vs. dataset size.
Figure 13. Query time comparison.
Table 1. Key NoSQL use cases in the financial sector.

Conventional Banking
Key NoSQL use cases:
- Fraud detection in real time (quick anomaly/graph analysis)
- 360° data hubs allowing customers to create unified profiles
- AI/analytics-based personalized product suggestions
- Processing large volumes of events and logs
- Managing sessions and caching alongside legacy SQL systems
Principal advantages:
- Scaling quickly for large transaction volumes
- Improved consumer data and quicker customization
- Lower latency for apps that interact with users
- Best-suited NoSQL model(s): graph and document DBs (e.g., Neo4j and MongoDB), with key-value stores for caching
Illustration: Major banks can use NoSQL for fraud analytics and real-time dashboards. Capital One uses Cassandra for mission-critical systems after a migration from Oracle [21].

FinTech Businesses
Key NoSQL use cases:
- Credit decisioning and scoring in real time
- Polyglot, microservices-driven architectures
- Monitoring user activity for customization
- Large-scale fraud and security monitoring
- Managing worldwide traffic and quick user growth
Principal advantages:
- Adaptable schema for ongoing innovation
- Smooth cloud scaling
- Low latency for AI models and end-user transactions
- Best-suited NoSQL model(s): document and key-value stores (e.g., MongoDB, DynamoDB, and Redis)
Illustration: DocDB, an in-house MongoDB extension, is used by Stripe/PayPal to sustain millions of queries per second [22]. Others, such as FICO, Wells Fargo, and Equifax, use Couchbase [23].

Insurance Companies
Key NoSQL use cases:
- 360-degree and multi-policy views for customers
- Processing claims using unstructured data
- Risk modeling using IoT/telematics data
- Fraud detection for suspect networks using graphs
- Storing and searching regulatory compliance data
- Combining disparate insurance and claims data
- Ingesting sensor and telematics data at high throughput
Principal advantages:
- Quicker processes for detecting fraud
- Streamlined claims and regulatory data management
- High-throughput ingestion of sensor/telematics data
- Best-suited NoSQL model(s): document and graph DBs (e.g., MongoDB and Neo4j) and wide-column stores (e.g., Cassandra)
Illustration: Large insurers utilize NoSQL to store telematics and consolidate data sources, while MetLife uses MongoDB for its integrated “MetLife Wall” [24].

Investment and Capital Markets
Key NoSQL use cases:
- High-volume time-series data storage for market and tick data
- Risk aggregation in real time (stress testing, value-at-risk, etc.)
- Reconciliation and post-trade processing (millions of trades)
- Graph-based surveillance (market manipulation, insider trading, etc.)
- Customized portfolio analytics and robo-advisors
Principal advantages:
- Low latency for ingesting market data
- The ability to scale to billions of trades
- The ability to integrate trade, chat, and risk data in a flexible way
- Best-suited NoSQL model(s): wide-column/time-series DBs (e.g., Cassandra and InfluxDB) with graph DBs for surveillance
Illustration: Major exchanges utilize NoSQL for trade surveillance, such as Broadridge, which uses MarkLogic (a multi-model NoSQL database) for post-trade processing [25], and Bloomberg, which uses HBase for time-series data [26].
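To make the document-store pattern in Table 1 concrete, the following minimal Python sketch (not part of the original study) stores card transactions in MongoDB and flags high-value payments for fraud review. The connection string, collection, and field names are illustrative assumptions; the sketch requires a reachable MongoDB instance and the pymongo driver.

# Minimal sketch, assuming a local MongoDB and illustrative field names.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")        # hypothetical connection
txns = client["bank"]["transactions"]

txns.create_index([("customer_id", 1), ("amount", -1)])  # speeds up per-customer scans

txns.insert_one({"customer_id": "C-1001", "amount": 9400.0,
                 "currency": "EUR", "channel": "card", "country": "RO"})

# Flag card transactions above a simple threshold for analyst review.
suspicious = txns.find({"amount": {"$gt": 5000}, "channel": "card"})
for doc in suspicious:
    print(doc["customer_id"], doc["amount"])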
Table 2. Comparative analysis: GDPR vs. US financial data protection regimes [54,55,56,57,58].

Scope and Coverage
EU—GDPR: All companies processing EU residents’ personal data (extraterritorial reach). No minimum company size threshold.
US—GLBA and CCPA/CPRA: Sector-specific (GLBA for financial institutions) and state-specific (CCPA for businesses meeting thresholds, California residents). Many federal privacy laws are sectoral; no omnibus federal law yet.

Legal Basis for Processing
EU—GDPR: Requires one of six legal bases (consent, contract, legal obligation, etc.); consent must be informed, freely given, specific, and unambiguous. Purpose limitation and data minimization are core principles.
US—GLBA and CCPA/CPRA: GLBA permits data sharing with disclosures and opt-out for some sharing; CCPA/CPRA does not require prior consent except for selling/sharing data or sensitive data use, but grants opt-out rights. No general requirement to minimize data (except sector regulations); emphasis on consumer notice and choice.

Consumer Data Rights
EU—GDPR: Robust rights: access, rectification, erasure (“right to be forgotten”), data portability, restriction of processing, and objection to profiling. All individuals in the EU (including bank customers) can request data copies or deletion (with some exceptions).
US—GLBA and CCPA/CPRA: CCPA/CPRA rights for California residents: knowing (access) what data are collected and shared, deleting personal data, opting out of sale or sharing, non-discrimination, and correcting inaccuracies (CPRA). GLBA provides privacy notices and opting out of sharing with third parties in some cases, but no right to deletion or portability.

Data Protection Officer
EU—GDPR: Required for large-scale data processing (common in banks); the DPO oversees the GDPR compliance program. Financial institutions have designated DPOs or similar privacy officers to liaise with DPAs.
US—GLBA and CCPA/CPRA: No DPO mandate in GLBA/CCPA. However, banks appoint Chief Privacy Officers or privacy compliance teams as a best practice. The GLBA Safeguards Rule requires a security program with a designated coordinator (often the CISO).

Enforcement and Penalties
EU—GDPR: Enforced by DPAs (one-stop shop for cross-EU issues). Fines up to EUR 20 million or 4% of global annual revenue. Hundreds of fines issued since 2018; EUR 1 billion in fines in 2021 alone. Banks have faced multi-million euro fines (e.g., EUR 6 million for CaixaBank) for GDPR violations.
US—GLBA and CCPA/CPRA: Enforcement by multiple regulators: federal banking regulators (such as the OCC and Federal Reserve) for safety and soundness (occasionally citing data security), the FTC for unfair/deceptive practices, and state AGs for the CCPA. Penalties vary: e.g., a USD 80 million OCC fine to Capital One for a 2019 breach; CCPA fines up to USD 2500 per violation (USD 7500 if intentional) imposed by the California CPPA.
Table 3. Synthesis of findings from a literature review on data integration in financial institutions.

Retail Banking
Integration strategies: Data lakes; lakehouse architectures; polyglot persistence (relational + NoSQL); hybrid cloud setups.
Performance outcomes: Improved transaction processing speeds; low latency for high-volume structured data analytics; enhanced customer experience through AI/ML.
Compliance and challenges: GDPR compliance; EU-US Data Privacy Framework (DPF); data encryption and pseudonymization; blockchain-based audit trails for transparency.

Capital Markets
Integration strategies: Real-time streaming (Kafka/Spark Streaming); hybrid cloud analytics; distributed SQL/NewSQL databases; AI/ML for high-frequency trading.
Performance outcomes: Significant improvements in real-time analytics; near-linear scaling; transaction speeds under 1 ms; enhanced predictive capabilities.
Compliance and challenges: Data sovereignty concerns; GDPR and DPF compliance; smart contracts for automated compliance enforcement; container orchestration (Kubernetes) for secure deployments.

FinTech
Integration strategies: API-driven microservices; managed cloud services; serverless architectures; accelerated adoption post-COVID-19.
Performance outcomes: Rapid scaling; high flexibility and speed-to-market; resilience in hybrid/multi-cloud; cost efficiency during demand fluctuations.
Compliance and challenges: Cross-border data transfer complexities; third-party cloud risk; SCCs with additional technical controls; automated tagging for data lineage.

Insurance and Risk Management
Integration strategies: Hadoop/Spark batch analytics; hybrid cloud deployment; AI/ML for risk assessment and fraud detection.
Performance outcomes: High scalability for batch analytics; fast data processing for claims and actuarial calculations; improved fraud detection rates.
Compliance and challenges: Strict governance for sensitive personal data; compliance managed via access controls; immutable audit logs using permissioned blockchain; data localization solutions.

General (Cross-Sector Trends)
Integration strategies: Polyglot persistence; cloud data lakes; lakehouse architectures; container orchestration; microservices-based designs.
Performance outcomes: Orders-of-magnitude improvements in processing efficiency; reduced analytics times; ability to adapt to regulatory changes; dynamic resource allocation.
Compliance and challenges: Integrated governance through strict data classification; encryption at rest and in transit; standardized metadata management; comprehensive audit trails; balance between cloud benefits and regulatory compliance.
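The Kafka/Spark Streaming pattern listed for capital markets in Table 3 can be illustrated with the following minimal PySpark sketch, which is not taken from the paper. The broker address, topic name, and message schema are assumptions, and the spark-sql-kafka connector package must be available on the Spark classpath.

# Minimal sketch, assuming a "market-ticks" Kafka topic with JSON messages.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window, avg
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("tick-stream-sketch").getOrCreate()

schema = StructType([
    StructField("symbol", StringType()),
    StructField("price", DoubleType()),
    StructField("event_time", TimestampType()),
])

ticks = (spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
         .option("subscribe", "market-ticks")                  # hypothetical topic
         .load()
         .select(from_json(col("value").cast("string"), schema).alias("t"))
         .select("t.*"))

# One-minute average price per symbol, written to the console for illustration.
query = (ticks.groupBy(window("event_time", "1 minute"), "symbol")
         .agg(avg("price").alias("avg_price"))
         .writeStream.outputMode("update").format("console").start())
query.awaitTermination()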
Table 4. Technology comparison.

Technology
Relational databases: Traditional data storage and management systems, primarily using structured query language (SQL) for data definition and manipulation.
Big data: Systems designed for storing, processing, and analyzing large volumes of complex data that exceed the capability of traditional databases.
Cloud computing: Platforms that deliver computing services over the Internet, including servers, storage, databases, networking, and software.

Primary Use
Relational databases: Efficient management of structured data, particularly for transactional processes.
Big data: Handling large-scale, complex datasets that are often unstructured or semi-structured and are suitable for analytical processes.
Cloud computing: Providing scalable and flexible IT resources as services, supporting a wide range of applications, including data processing and storage.

Data Model
Relational databases: Structured data model with a predefined schema.
Big data: Flexible data models, often schema-less or with the schema defined at read time, accommodating a wider variety of data types.
Cloud computing: Depending on the service model (IaaS, PaaS, SaaS, etc.), may range from structured to unstructured data handling capabilities.

Performance
Relational databases: High performance in managing small-to-medium-sized datasets, with optimized transaction processing capabilities.
Big data: Optimized for processing large volumes of data, often with parallel processing capabilities for big data workloads.
Cloud computing: Variable performance depending on the configuration and the service provider, generally high due to resource scalability.

Scalability
Relational databases: Limited scalability; scaling often requires significant effort and resources.
Big data: Highly scalable, i.e., designed to expand effortlessly to manage growing data volumes.
Cloud computing: Extremely scalable, allowing resources to be adjusted on demand, often with minimal-to-no downtime.

Flexibility
Relational databases: Less flexible in terms of handling diverse data types and rapid changes in schema.
Big data: Highly flexible in managing different types of data and accommodating changes in data structures.
Cloud computing: High flexibility in terms of resources, configurations, and the choice of services based on user needs.

Decision Support
Relational databases: Suitable for operational data processing and decision support within its capacity to manage structured data.
Big data: Excels at providing decision support for complex queries and analytics over large datasets.
Cloud computing: Supports decision-making processes by providing a wide range of services and tools for data analysis, AI, and machine learning.
Table 5. Data integration technologies.

ETL Processes
Advantages: Flexible data transformation; can integrate diverse data sources; supports complex data workflows.
Disadvantages: Can be resource-intensive; complex setup for large-scale data; time-consuming for real-time processing.

Direct Database Integration
Advantages: Efficient for real-time data transfer; maintains data integrity; suitable for operational reporting.
Disadvantages: Limited by database compatibility; may prove complex to set up; potential performance bottlenecks.

APIs and Web Services
Advantages: Flexible and customizable; support a wide range of data sources; can be used for real-time data integration.
Disadvantages: Require programming knowledge; potentially slower for large data volumes; dependent on network stability.

Apache Sqoop
Advantages: Efficient for transferring large datasets; supports incremental loads; good Hadoop ecosystem integration.
Disadvantages: Limited to Hadoop-compatible systems; can be complex for specific use cases; overhead for small data loads.

Apache Flume
Advantages: Ideal for streaming data; scalable and reliable; flexible configuration options.
Disadvantages: Overhead for small data loads; requires understanding of Flume configuration; primarily for log data.

Cloud Data Migration Tools
Advantages: Simplified cloud data migration; support multiple source and target databases; often include a graphical user interface.
Disadvantages: Can be expensive for large datasets; sometimes limited by the cloud environment; require cloud expertise.

Data Warehousing Solutions
Advantages: Process analytics in a scalable manner; integrate structured and unstructured data; support complex queries and reporting.
Disadvantages: Can be expensive; complex setup and management; require data warehousing expertise.

Hybrid Cloud Solutions
Advantages: Data deployment flexibility; balance on-premise and cloud advantages; good for data security and compliance.
Disadvantages: Can be complex to manage; require careful integration planning; potential security challenges.
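As a minimal illustration of the ETL row in Table 5 (and not the pipelines used in the study), the following Python sketch extracts transactions from a CSV export, applies a simple transformation, and loads the result into a relational store. The file name, column names, and the SQLite target are assumptions.

# Minimal ETL sketch: extract -> transform -> load, with illustrative names.
import pandas as pd
import sqlite3

# Extract: read a daily transaction export (hypothetical file).
df = pd.read_csv("transactions_2025-03-01.csv")

# Transform: normalize currency codes and derive a simple high-value flag.
df["currency"] = df["currency"].str.upper()
df["high_value"] = df["amount"] > 10_000

# Load: append into a relational table used for reporting.
with sqlite3.connect("reporting.db") as conn:
    df.to_sql("transactions", conn, if_exists="append", index=False)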
Table 6. Hybrid cloud for EU financial institutions.

Data Lockdown (advantage): Keep sensitive or regulated data on EU soil.
Hidden Exposures (caution flag): Even routine backups or support logs can slip outside the EU.
Classify Thoroughly (step to succeed): Know which data are sensitive and keep them firmly on-premises or in EU-based clouds.

Scalability Bliss (advantage): Use the public cloud for flexible, cost-effective growth.
Vendor Tangles (caution flag): Proprietary technology can lock a company in or complicate integrations.
Plan for Uncoupling (step to succeed): Ensure the architecture can pivot if laws or vendors change.

Resilience Edge (advantage): Combining private and public clouds helps withstand regulatory or political turbulence.
Encryption Gaps (caution flag): If US cloud providers hold the decryption keys, compliance is at risk.
Zero-Trust Everywhere (step to succeed): Encrypt data with keys the company controls and stores safely on-premises.
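A minimal sketch of the “Zero-Trust Everywhere” step in Table 6 is shown below: a record is encrypted with a key generated and retained on-premises before any copy is replicated to a public cloud. It uses the cryptography package; the key-handling workflow and the payload are illustrative assumptions, not the paper’s implementation.

# Minimal client-side encryption sketch, assuming the key never leaves the EU environment.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # generated and kept on-prem (e.g., in an HSM/KMS)
cipher = Fernet(key)

record = b'{"customer_id": "C-1001", "iban": "RO49AAAA1B31007593840000"}'
token = cipher.encrypt(record)       # only the ciphertext would be replicated to the public cloud

# Decryption is only possible where the key is available.
assert cipher.decrypt(token) == record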
Table 7. Performance comparison for a dataset size of 1,000,000.

Execution Platform | Dataset Type | Memory Usage (MB) | CPU Usage (Percent) | GPU Usage (Percent) | Total Time (s) | Query Time (s)
SQL | Structured | 0.1406 | 3.8872 | 0.0 | 3.9120 | 3.8927
SQL | Semi-Structured | 0.1409 | 3.8790 | 0.0 | 3.9047 | 3.8860
SQL | Unstructured | 0.1324 | 2.5668 | 0.0 | 2.5955 | 2.5770
Python | Structured | 488.6354 | 0.0 | 2.0 | 164.5462 | 164.5253
Python | Semi-Structured | 381.0949 | 0.0 | 5.0 | 8.1628 | 8.1434
Python | Unstructured | 47.6886 | 0.0 | 5.0 | 16.5125 | 16.4932
Spark | Structured | 488.6288 | 0.0 | 2.0 | 163.0830 | 163.0642
Spark | Semi-Structured | 381.0933 | 0.0 | 5.0 | 5.6105 | 5.5915
Spark | Unstructured | 56.2722 | 0.0 | 5.0 | 16.2426 | 16.2235
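The kind of metrics reported in Table 7 (memory delta, CPU share, and query time) can be collected with a harness along the lines of the Python sketch below. This is an illustrative measurement wrapper under stated assumptions, not the authors’ exact benchmark; it requires psutil, and the queried SQLite database and SQL text are hypothetical.

# Illustrative measurement harness, assuming a local "credit.db" with an "applicants" table.
import sqlite3
import time
import psutil

def measure(run_query):
    proc = psutil.Process()
    mem_before = proc.memory_info().rss / 1e6           # resident memory in MB
    psutil.cpu_percent(interval=None)                   # reset the CPU counter
    start = time.perf_counter()
    result = run_query()
    elapsed = time.perf_counter() - start
    cpu = psutil.cpu_percent(interval=None)             # average CPU share since the reset
    mem_delta = proc.memory_info().rss / 1e6 - mem_before
    return result, {"memory_mb": mem_delta, "cpu_percent": cpu, "query_time_s": elapsed}

def query():
    with sqlite3.connect("credit.db") as conn:           # hypothetical database
        return conn.execute("SELECT COUNT(*) FROM applicants WHERE age > 40").fetchone()

rows, metrics = measure(query)
print(rows, metrics)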
Table 8. Structured data sample.

Status of Existing Checking Account | Duration in Month | Credit History | Employment Since | Other Debtors | Age in Years
A13 | 54 | A30 | A75 | A103 | 68
A14 | 51 | A32 | A73 | A103 | 68
A11 | 22 | A30 | A73 | A103 | 35
A13 | 9 | A30 | A75 | A101 | 64
A13 | 30 | A30 | A74 | A103 | 27
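A minimal sketch of how the structured sample in Table 8 could be held and queried in a relational engine is given below; SQLite stands in for the database, and the column names simply follow the table header. It is illustrative only.

# Minimal relational sketch for the structured sample, assuming SQLite as the engine.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE applicants (
    checking_status TEXT, duration_months INTEGER, credit_history TEXT,
    employment_since TEXT, other_debtors TEXT, age_years INTEGER)""")
conn.executemany(
    "INSERT INTO applicants VALUES (?, ?, ?, ?, ?, ?)",
    [("A13", 54, "A30", "A75", "A103", 68),
     ("A14", 51, "A32", "A73", "A103", 68),
     ("A11", 22, "A30", "A73", "A103", 35)],
)
# Example of the kind of structured aggregation benchmarked in the study.
for row in conn.execute(
        "SELECT checking_status, AVG(age_years) FROM applicants GROUP BY checking_status"):
    print(row)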
Table 9. Semi-structured data sample.
{"Status of existing checking account":"A13", "Duration in month":54,"Credit history":"A30", "Purpose":"A42", "Credit amount":1658, "Present employment since":"A75", "Installment rate in percentage of disposable income":3, "Personal status and sex":"A94", "Other debtors guarantors":"A103", "Present residence since":3,"Property":"A124", "Age in years":68, "Other installment plans":"A141", "Housing":"A153", "Number of existing credits at this bank":1, "Job":"A173", "Number of people being liable to provide maintenance for":1, "Telephone":"A191", "Foreign worker":"A201", "Creditworthiness":2}
Table 10. Unstructured data sample.

The customer has a checking account status of A13, with a loan duration of 54 months. The credit history is A30, and the purpose of the loan is A42. The loan amount is 1658 units. The customer has been employed for A75 and has a job categorized as A173. The customer resides in housing type A153 and owns property of type A124. The customer has 1 existing credit(s) and is classified as a domestic worker.
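Turning the unstructured narrative in Table 10 back into structured fields is a typical preprocessing step; the short sketch below does so with regular expressions. The patterns are illustrative and tied to the wording of the sample sentence, not a general parser used in the study.

# Minimal field-extraction sketch for the unstructured sample.
import re

text = ("The customer has a checking account status of A13, with a loan duration of "
        "54 months. The credit history is A30, and the purpose of the loan is A42. "
        "The loan amount is 1658 units.")

fields = {
    "checking_status": re.search(r"checking account status of (\w+)", text).group(1),
    "duration_months": int(re.search(r"loan duration of (\d+) months", text).group(1)),
    "credit_history": re.search(r"credit history is (\w+)", text).group(1),
    "loan_amount": int(re.search(r"loan amount is (\d+)", text).group(1)),
}
print(fields)   # {'checking_status': 'A13', 'duration_months': 54, ...}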
Table 11. Connecting the performance results to financial domain applications.

Retail Banking
Key performance findings: SQL processed structured data with minimal memory usage (0.14 MB) and fast execution (3.91 s for 1 M records).
Optimal technology: SQL for core transactions; Spark for customer analytics.
Business application: Real-time account processing, instant payment authorization, and customer segmentation.
Implementation considerations: Implement a hybrid architecture with SQL for transactions and Spark for customer behavior analytics.

Capital Markets
Key performance findings: Spark showed superior performance for semi-structured data (5.59 s query time) with efficient GPU utilization (5%).
Optimal technology: Spark for market data analytics; SQL for order management.
Business application: High-frequency trading, real-time market surveillance, algorithmic trading, and risk assessment.
Implementation considerations: Deploy Spark clusters for tick data analysis with SQL interfaces for compliance reporting.

Risk Management
Key performance findings: Python demonstrated strong performance for ML workloads on semi-structured data (8.14 s).
Optimal technology: Python for model development; Spark for production ML pipelines.
Business application: Credit scoring, fraud detection, portfolio optimization, and stress testing.
Implementation considerations: Develop models in Python, deploy at scale with Spark ML, and maintain data lineage for regulatory compliance.

Customer Analytics
Key performance findings: Spark efficiently processed unstructured data (16.22 s for 1 M records) with balanced resource utilization.
Optimal technology: Spark for text processing; Python for visualization.
Business application: Sentiment analysis, customer feedback processing, churn prediction, and next-best-offer modeling.
Implementation considerations: Implement distributed NLP pipelines with effective data partitioning strategies.

Regulatory Reporting
Key performance findings: SQL demonstrated consistent and reliable performance (±3% variation) across all test runs.
Optimal technology: SQL for structured reports; big data platforms for complex compliance modeling.
Business application: GDPR compliance, transaction monitoring, AML reporting, and audit trails.
Implementation considerations: Implement strict data governance with versioning and immutable audit logs.
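The hybrid pattern recommended for retail banking in Table 11 (SQL for the transactional core, Spark for analytics) is illustrated by the minimal sketch below: Spark pulls the transactional tables over JDBC and builds a customer-level profile. The connection URL, credentials, and table name are placeholders, the appropriate JDBC driver must be on the Spark classpath, and this is a sketch of the integration pattern rather than the deployment tested in the paper.

# Minimal hybrid-analytics sketch, assuming a PostgreSQL core system reachable over JDBC.
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum, countDistinct

spark = SparkSession.builder.appName("hybrid-analytics-sketch").getOrCreate()

transactions = (spark.read.format("jdbc")
                .option("url", "jdbc:postgresql://core-db:5432/bank")   # hypothetical URL
                .option("dbtable", "public.transactions")
                .option("user", "analytics_ro")
                .option("password", "********")
                .load())

# Customer segmentation input: total spend and merchant diversity per customer.
profile = (transactions.groupBy("customer_id")
           .agg(spark_sum("amount").alias("total_spend"),
                countDistinct("merchant_id").alias("distinct_merchants")))
profile.show(10)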
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
