Article

A Secure-by-Design Approach to Big Data Analytics Using Databricks and Format-Preserving Encryption

by Juan Lagos-Obando 1, Gabriel Aillapán 1,*, Julio Fenner-López 1, Ana Bustamante-Mora 1 and María Burgos-López 2
1 Departamento de Ciencias de la Computación e Informática (DCI), Universidad de La Frontera, Temuco 4811230, Chile
2 Departamento Industrial, Tecnológico de Monterrey, Monterrey 64700, Nuevo León, Mexico
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(19), 10356; https://doi.org/10.3390/app151910356
Submission received: 1 May 2025 / Revised: 20 June 2025 / Accepted: 2 July 2025 / Published: 24 September 2025
(This article belongs to the Special Issue Cryptography in Data Protection and Privacy-Enhancing Technologies)

Abstract

Managing and analyzing data in data lakes for big data environments requires robust protocols to ensure security, scalability, and compliance with privacy regulations. The increasing need to process sensitive data emphasizes the relevance of secure-by-design approaches that integrate encryption techniques and governance frameworks to protect personal and confidential information. This study proposes a protocol that combines the capabilities of Databricks and format-preserving encryption to improve data security and accessibility in data lakes without compromising usability or structure. The protocol was developed using a design science methodology, incorporating findings from a systematic literature review, and validated through expert feedback and proof-of-concept experiments in banking environments. The proposed solution integrates multiple layers (data ingestion, persistence, access, and consumption), leveraging the processing capabilities of Databricks and format-preserving encryption to enable secure data management and governance. Validation results indicate that the protocol is effective in protecting sensitive data, with promising applicability in regulated industries. This work contributes to addressing key challenges in big data security and lays the groundwork for future developments in data governance and encryption techniques.

1. Introduction

The exponential growth of digital information over the past two decades has fundamentally transformed how organizations manage, store, and analyze data [1,2]. As industries such as finance, healthcare, and government expand their digital footprints, they increasingly rely on data lakes: scalable, flexible storage architectures capable of managing extensive volumes of structured, semi-structured, and unstructured data [3]. Data lakes offer opportunities for comprehensive analytics and innovation but simultaneously introduce new challenges related to data security, privacy preservation, and regulatory compliance [4,5].
In parallel, the accelerated adoption of cloud computing and distributed processing platforms, such as Databricks, has further reshaped the data landscape, enabling organizations to perform complex analytics on massive datasets more efficiently [6]. However, this growing computational capacity comes with heightened risks. The handling of sensitive personal and financial information at scale, often in hybrid or multicloud environments, increases the attack surface for cyber threats and subjects organizations to stricter regulatory scrutiny [7,8].
Security breaches, data misuse, and non-compliance with data protection regulations such as the General Data Protection Regulation (GDPR) can have devastating legal and reputational consequences. Recognizing these risks, the secure-by-design (SBD) paradigm has gained traction. This approach advocates integrating security mechanisms into every stage of system development, from the earliest architectural decisions to deployment and operation [7,9]. Unlike traditional perimeter-based security strategies, SBD frameworks promote resilience, adaptability, and vulnerability minimization from inception.
Among the techniques developed to protect sensitive information in complex data environments, format-preserving encryption (FPE) has emerged as a particularly promising solution [10,11]. By encrypting data without altering its original format, FPE enables the preservation of database schema integrity and the compatibility of encrypted data with existing analytics workflows. Its relevance has grown in sectors where maintaining the usability of data while ensuring its confidentiality is crucial, such as in banking environments handling customer financial data [12,13].
Despite advances in big data infrastructure, encryption techniques, and data governance frameworks, some existing solutions address these challenges in a fragmented manner, often applying security mechanisms such as strong encryption or data obfuscation that compromise the usability of the data for analytical purposes. This trade-off becomes particularly problematic in contexts that require both data protection and analytical continuity, such as financial or healthcare environments where insights must be derived from sensitive data. The absence of integrated SBD approaches capable of balancing protection with usability highlights a critical gap in current data lake implementations.
To address this gap, this paper proposes an SBD protocol for managing sensitive data in data lakes. The protocol uses, as an example, the distributed processing capabilities of Databricks, Delta Lake storage optimization, and FPE techniques to enhance data security and governance without compromising analytic usability. The adopted methodology follows the design science research approach [14,15], ensuring a structured, iterative development and validation process based on both theoretical foundations and practical experimentation.
This paper consists of the following sections: Section 2 contains the background concepts that underpin the proposed solution, including data portability, SBD principles, data lake architectures, ingestion strategies, and encryption techniques. Section 3 describes the methodology applied to design and validate the protocol. Section 4 details the protocol architecture and its components. Section 5 presents the results of the systematic mapping and validation from expert feedback. Finally, Section 6 discusses the implications of the findings and Section 7 concludes the document with the consolidated findings of the study.

2. Background

The development of a protocol for data portability in banking environments poses unique challenges, given the need to balance regulatory compliance, data security, and system interoperability. To address these challenges, it is essential to explore key concepts that form the foundation of this research. Data portability ensures that users retain control over their data and that it can be transferred between systems securely and seamlessly. The SBD approach is particularly relevant in domains where sensitive data must be handled with extreme care, such as banking or healthcare. In such cases, it is particularly important to integrate security into all stages of system development in order to protect personal, sensitive, and confidential information.
Additionally, the technical infrastructure supporting the protocol, such as data lake architecture, data ingestion processes, and data governance frameworks, plays a pivotal role in ensuring the secure and efficient management of data portability. Advanced encryption techniques, such as FPE, are also central to enabling the secure handling of sensitive banking data without compromising usability. The following sections provide an in-depth exploration of these topics, reviewing the literature and establishing the conceptual framework for the proposed protocol.

2.1. Data Portability and Interoperability in the Context of Cloud Computing and Data Protection

Data portability and interoperability are closely linked concepts within the fields of data protection and cloud computing. As a key provision of the GDPR, data portability grants users the right to transfer their personal data between services [16,17]. This right not only empowers individuals but also encourages platform interoperability by enabling data reuse across systems.
In cloud computing, data portability involves more than simply transferring files. Applications often require specific data formats to function correctly, making it essential to ensure that both structured and unstructured data remain accessible and compatible after migration [18]. To achieve this, standardized mechanisms for data export and conversion, along with interoperable storage services, must be in place.
Nevertheless, implementing data portability in practice remains challenging. One major obstacle is the lack of clear regulatory guidance on which data should be included in portability requests. This ambiguity creates uncertainty for users and service providers alike, slowing the development and adoption of effective solutions [19].

2.2. Security by Design

SBD is a development philosophy that integrates security into the earliest stages of the software lifecycle to minimize vulnerabilities and reduce costly rework [7,20].
By embedding both technical and organizational safeguards from the outset, SBD aims to produce systems that are inherently resilient to cyber threats [8]. Industry leaders like Google [21] and Microsoft [22] emphasize that security must be an integral part of product design, with a user-centered approach that considers developers as primary users. Given this, the SBD approach should extend to emerging domains such as cloud computing and big data to ensure secure system development in data-intensive environments.
In response to these challenges, ref. [9] proposes an SBD framework for deploying secure big data solutions in cloud environments. The methodology addresses risks early in the design phase, enabling the development of robust cloud applications. It supports secure infrastructure-as-a-service models by integrating Apache Hadoop 3.0 and other advanced technologies, mapping security concerns into a structured model for reliable BigCloud systems.
A complementary approach is presented by [4], who introduced a 12-phase methodological framework for integrating security and privacy into big data ecosystems. Based on established standards and best practices, their process spans from requirement analysis to risk management, aligning closely with the SBD principle of embedding security early in development. Also, the ADOC model exemplifies an SBD approach by using open-source tools to automate security across DevOps pipelines, including SAST/DAST integration in CI/CD workflows with Open Policy Agent (OPA) and AWS, enhancing policy enforcement and vulnerability detection [23].

2.3. Infrastructure

Big data infrastructure has become essential for managing complex, large-scale datasets that exceed the capabilities of traditional systems [1]. To handle such data efficiently, specialized cloud-based infrastructures are often required. Researchers have proposed various strategies to optimize resource allocation in cloud environments for big data processing [24].
Big data is commonly characterized by the so-called “V’s”: Volume, Variety, Velocity, Veracity, and Value [2]. These characteristics introduce significant challenges in data acquisition, storage, preprocessing, analysis, visualization, and security. As a result, the tools and infrastructure needed to manage big data are inherently complex.
Given these demands, organizations increasingly rely on external cloud services to provide scalable and cost-effective computational resources. Instead of deploying their own infrastructure, companies often opt for third-party solutions that offer the necessary capacity and flexibility.
Distributed computing plays a central role in big data processing, enabling efficient handling of large datasets. Cloud computing further enhances this capability by offering scalability, accessibility, and cost-efficiency [25].
Concrete implementations, such as those based on Databricks, AWS S3, and Apache Spark, illustrate how these technologies are applied in practice. For instance, one study describes a system processing 1 TB of data every five days using Spark for computation and Delta Lake on AWS S3 for storage [6]. Such architectures exemplify modern approaches to scalable data management.
Cloud-based infrastructures also come with specific advantages and limitations. On one hand, they offer high storage capacity, auto-scaling, and cost benefits; on the other, they raise concerns about data security and privacy [26]. While cloud providers typically ensure baseline security, risks increase once data enters production environments, highlighting the need for continuous protection strategies.

2.4. Format-Preserving Encryption

FPE is a symmetric encryption technique that encrypts a plaintext X into a ciphertext Y under a key K, while preserving the original format of the data [10]. An example of this format-preserving behavior is illustrated in Table 1, which shows a phone number, a text string, and a credit card number, along with their corresponding FPE versions.
This property makes FPE especially useful for masking sensitive information without altering database schemas or application logic, thereby supporting data portability and system interoperability. Several contributions have advanced the practical use of FPE. For instance, ref. [11] presents a framework to handle format-specific constraints, while ref. [27] outlines standardized block cipher modes suitable for FPE implementations.
Given its ability to protect data in place, FPE has become increasingly relevant in cloud computing and big data environments, where large volumes of personal and sensitive data are processed daily. In such contexts, it enables secure data handling without disrupting existing workflows or storage formats.
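As a minimal illustration of this property, the sketch below uses the pyffx package (one open-source FFX-based FPE implementation; the key and values are hypothetical) to encrypt a 16-digit number into another value from the same 16-digit domain:
```python
import pyffx

# Hypothetical key; in practice this would be supplied by a key management service
cipher = pyffx.Integer(b"demo-secret-key", length=16)

card_number = 4556737586899855
masked = cipher.encrypt(card_number)   # another value in the 16-digit domain
restored = cipher.decrypt(masked)      # reversible with the same key
assert restored == card_number
```
Because the ciphertext keeps the length and character set of the plaintext, existing column definitions and validation rules continue to work on the masked value.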
Notably, ref. [12] proposes an FPE-based data processing scheme using Apache Spark, with potential applications in banking systems. However, the proposal does not address aspects such as data ingestion or the governance model. In other applications, ref. [28] described the use of FPE for the encryption of biometric information used in banking. In a similar vein, ref. [13] introduced a method combining the FF1-SM4 algorithm for string desensitization with Paillier encryption for numerical fields, enabling homomorphic operations. The approach ensures data protection across private and public cloud deployments.
Finally, ref. [5] described an FPE scheme based on AES with XOR and translation techniques, ensuring referential integrity and compatibility with legacy systems. These studies collectively illustrate the growing adoption of FPE as a valuable tool for securing sensitive data in modern computing environments and provide a foundation for its integration into our proposed protocol.

2.5. Data Lake Architecture

Data lake repositories have gained widespread adoption due to their flexibility and advantages over more rigid storage systems like data warehouses. According to [3], a data lake serves as a landing zone for raw data from multiple sources. Similarly, ref. [29] defines a data lake as a scalable system capable of storing and analyzing structured, unstructured, and semi-structured data in its native format, primarily used by data specialists for knowledge extraction.
These definitions highlight a key characteristic of data lakes: the ability to handle diverse data formats and structures. To manage this complexity effectively, well-defined data lake architectures are essential. As noted in [30], such architectures provide a conceptual framework for organizing data flow and storage, supporting efficient data capture, use, and reuse while minimizing redundant processing [31].
Data lake architectures can generally be categorized into two types: zones and ponds [30]. Additionally, ref. [29] proposed an alternative architecture based on functionality and maturity levels, aiming to address some limitations of the zone and pond models.
For the purposes of this work, we adopt a zone-based architecture, which organizes data according to its level of refinement. One of its main benefits is that it allows access to raw data even when transformed or preprocessed versions are available [30]. This approach supports transparency and traceability throughout the data lifecycle.

2.6. Data Ingestion

Data ingestion facilitates the transfer of data from a source to a data lake [29]. As described in [32], ingestion can be classified into three types: batch, streaming, and orchestrated. Batch ingestion involves transferring data at defined intervals, typically hours, days, or months, while streaming ingestion enables continuous data transfer, ideally in real time. However, achieving true real-time capabilities often presents technical challenges due to latency and processing constraints.
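For illustration, a minimal PySpark sketch contrasting the two main ingestion modes is shown below; the paths, formats, and schema are hypothetical and assume a Databricks-style mounted data lake:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

# Batch ingestion: move a snapshot into the data lake at a defined interval
batch_df = spark.read.json("/mnt/source/daily-export/")            # hypothetical source
batch_df.write.mode("append").parquet("/mnt/datalake/raw/customers/")

# Streaming ingestion: continuous transfer, near real time subject to latency constraints
stream_df = (spark.readStream
             .format("json")
             .schema(batch_df.schema)                              # streaming reads need an explicit schema
             .load("/mnt/source/events/"))
query = (stream_df.writeStream
         .format("parquet")
         .option("checkpointLocation", "/mnt/datalake/_checkpoints/events/")
         .start("/mnt/datalake/raw/events/"))
```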
An approach to enriched data ingestion that incorporates access controls is also explored by [33], with a proposal that includes a taxonomy of transformations aimed at ensuring confidentiality or integrity, using mechanisms such as encryption or hashing. Additionally, the transformation of data through anonymization is addressed, with the understanding that encrypted data becomes unsuitable for analysis.

2.7. Data Governance

Data governance is a critical component of organizational information management strategies, essential for ensuring data quality, security, and compliance [34]. It involves defining policies, standards, and structures to manage data effectively throughout its lifecycle, from creation to disposal [35]. This is especially relevant for sectors like banking and public administration, where data integrity and confidentiality are paramount [34].
In cloud computing and big data environments, governance plays an even more crucial role in securing and managing vast volumes of information. It requires frameworks that regulate access, ownership, and protection [36]. As data volumes grow, organizations must adopt robust governance strategies to address emerging challenges in privacy and security [37]. Within governance models, several roles can be identified, each with defined responsibilities. According to [38], two key roles are data owners and data stewards.
Data owners are accountable for the datasets under their control. They define access rules, ensure proper usage, and establish governance policies to maintain compliance with internal and external standards. Data stewards, in turn, act as custodians of data within business processes. They ensure that governance policies are correctly applied and maintain data quality and consistency based on domain-specific knowledge.
These roles are clearly defined in the framework proposed by [38], emphasizing the importance of clear accountability and operational support in effective data governance.
Additional roles focused on security, such as chief information security officer (CISO) and data security officer, are also proposed in the literature [39]. However, these positions are less commonly implemented in practice.

3. Methodology

Given the need to protect sensitive and personal data in different domains without compromising its usability or original format and allowing for its recovery in specific contexts, this work explores the design of a protocol aimed at achieving a balance between data protection and practical utility. Based on insights from the literature and refined through expert feedback, the proposed protocol incorporates the premise that such a mechanism could effectively facilitate this balance. Its development was guided by the Design Science methodology, structuring the process into successive stages of problem investigation, solution design, validation with key stakeholders, and technical evaluation.
The methodology used for this work is design science, as described by [14], which is applied to research in information systems and software engineering. The author proposes the use of workflows known as the “regulative cycle” structured into four subtasks shown in Figure 1.
This approach emphasizes the creation and validation of artifacts to address real-world problems through four iterative phases: (1) conceptual analysis, where the problem and its context are thoroughly investigated to identify stakeholder needs and existing solutions; (2) solution design, in which an artifact, such as a model, framework, or protocol, is developed based on theoretical foundations and practical requirements; (3) design validation, where the proposed solution is evaluated to determine its effectiveness in solving the identified problem without full-scale implementation; and (4) implementation, which involves applying the artifact in a real-world setting to assess its impact and refine its applicability.
The definition of the four components of the regulative cycle is divided into two major areas: the understanding of the state of the art, which refers to the investigation of the problem, and the proposal of the protocol through its design, validation, and implementation.

3.1. Design Science

3.1.1. Conceptual Analysis

In the initial phase of this study, a systematic mapping of the literature was conducted to gather information on encryption techniques applied to big data tools and data lake repositories for the protection of personal and sensitive data. This mapping is presented in Section 3.2. To achieve this, the systematic mapping methodology proposed by Petersen [40] was employed, providing a structured approach to organizing the activities necessary for the development of the systematic mapping.
The primary objective of this systematic mapping is to identify and classify the various encryption techniques utilized in big data tools and data lake repositories. This phase aims to address the critical need for safeguarding personal and sensitive information by offering a comprehensive overview of existing methods, as well as the requirements for their practical implementation.

3.1.2. Solution Design

To develop the proposed protocol, we conducted a systematic literature review to identify foundational elements and requirements [41], synthesizing these insights within an SBD [21] protocol to create an initial version of that artifact [42].
This version was refined through two Delphi-based expert consultation rounds [43], ensuring theoretical robustness and practical applicability, particularly for regulated environments like banking.
The protocol’s design builds on findings from the systematic mapping (Section 5.1), incorporating key elements into its architecture. It leverages the software reference architecture (SRA) for semantic-aware big data systems [44] as a foundational reference.

3.1.3. Design Validation

Once the solution design stage concludes, the protocol validation stage begins. This validation is based on a variation of the usability and quality surveys originally proposed in [44,45] and later adapted by [46], applied to experts in the big data area working in banking environments, in order to obtain key indicators according to the qualitative scale of [47].
The first part of the survey focuses on the profile of participants. The key questions in this section aim to obtain dimensions of the experience of the experts expressed in time, as well as their role in the teams and tasks performed during their professional practice. The questions can be reviewed in Table 2.
For the usability criterion, the survey builds on [45]: a form of 10 numbered statements, answered on a Likert scale with values shown to the respondent between 1 and 5, corresponding to the concepts of “Strongly Disagree” through “Strongly Agree”, respectively.
For odd-numbered questions (SQ2.1, SQ2.3, SQ2.5, SQ2.7, and SQ2.9), high values are expected; for even-numbered questions (SQ2.2, SQ2.4, SQ2.6, SQ2.8, and SQ2.10), low values are desirable. The questions for this section are detailed in Table 3.
Expression (1) normalizes the desirable high values (aggregated in oddQuestionsScore) and the desirable low values (aggregated in evenQuestionsScore), producing scores between 0 and 100 that can be compared and evaluated qualitatively.
$$\text{Usability} = 2.5 \times \left(\text{oddQuestionsScore} + \text{evenQuestionsScore}\right) \quad (1)$$
where
$$\text{oddQuestionsScore} = \sum_{i=1}^{5}\left(\text{oddQuestionValue}[i] - 1\right), \qquad \text{evenQuestionsScore} = \sum_{i=1}^{5}\left(5 - \text{evenQuestionValue}[i]\right)$$
For the quality criterion, the survey builds on [44], using a form of 7 numbered statements, answered on a Likert scale with values shown to the respondent between 1 and 5, corresponding to the concepts of “Strongly Disagree” through “Strongly Agree”. The statements are aligned with the quality sub-characteristics as shown in Table 4.
In this section, the highest possible value for each question is desirable, so expression (2) normalizes the values, producing a quality score between 0 and 100 that can be compared and evaluated qualitatively.
$$\text{Quality} = \frac{100}{28} \times \sum_{i=1}^{7}\left(\text{questionScore}[i] - 1\right) \quad (2)$$
Finally, the resulting usability and quality scores are interpreted under the Bangor Qualitative Scale [47], which associates adjectives with the numerical values, as shown in Figure 2. This scale provides a perspective on the degree of “approval” the proposal obtains for the two criteria, with “marginally high” as the minimum desirable level of acceptability and “acceptable” as the ideal.
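As a quick reference, the following sketch (an assumption about how the scoring could be computed for a single respondent, not part of the original instrument) implements expressions (1) and (2):
```python
def usability_score(responses):
    """Expression (1): `responses` holds the 10 Likert values (1-5) for SQ2.1-SQ2.10, in order."""
    odd_score = sum(responses[i] - 1 for i in range(0, 10, 2))    # SQ2.1, SQ2.3, ... (high values desirable)
    even_score = sum(5 - responses[i] for i in range(1, 10, 2))   # SQ2.2, SQ2.4, ... (low values desirable)
    return 2.5 * (odd_score + even_score)

def quality_score(responses):
    """Expression (2): `responses` holds the 7 Likert values (1-5) for the quality statements."""
    return (100 / 28) * sum(r - 1 for r in responses)

# A fully neutral respondent (all 3s) lands at the midpoint of both 0-100 scales
print(usability_score([3] * 10))  # 50.0
print(quality_score([3] * 7))     # 50.0
```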

3.1.4. Implementation

The final stage of the workflow described in Section 3.1 was implemented through the development of a guide that documents the procedure for applying the proposed protocol. This guide integrates the feedback gathered during the design validation phase, enabling the refinement of the proposed procedures. As a result, the documented guidelines are ensured to be both applicable and effective, aligning with the requirements and expectations identified in the earlier stages of the workflow.

3.2. Systematic Mapping of the Literature

3.2.1. Research Questions

The proposed research questions, shown in Table 5, position us for the search and exploration of methodological and technical contributions in the literature, as well as in identifying challenges and gaps in the implementation of encryption.

3.2.2. Inclusion and Exclusion Criteria

All documents resulting from the search strings applied in the search engines were subjected to systematic inclusion and exclusion criteria, allowing for a quick filtering process.
The selected studies must meet the following requirements:
Inclusion Criteria:
(i)
Language: Only studies published in English and Spanish are included, ensuring that the documents are accessible and relevant to the international academic literature.
(ii)
Publication Date: Only works published between 2014 and January 2024 are considered to include all relevant documents in the field of data protection in big data and data lake environments.
(iii)
Sources: Only works from scientific journals and conferences are accepted.
Exclusion Criteria:
(i)
Study Domain: All works not focused on the field of information security in big data environments and data lake repositories are excluded, ensuring that the research is exclusively focused on the topic of interest.
(ii)
Accessibility: Documents that could not be fully accessed or were not relevant to the analysis were excluded.
(iii)
Duplication: Duplicate documents between academic search engines were excluded, retaining only one and discarding the others.
This filtering process ensures that the included studies are relevant, up-to-date, and come from reliable sources, facilitating the review of encryption techniques in the context of personal and sensitive data protection in big data and data lake environments.

3.2.3. Search and Selection Process

The search string used consists of key terms within the domain, applied in IEEE Xplore, Web of Science (WoS), Scopus, and ACM Library. The use of these search engines is based on their broad coverage, quality, and focus on scientific articles.
The key terms used were “data lake” and (“privacy” OR “encryption”). By applying the corresponding Boolean expressions in each search engine, the search string is structured as follows: (“data lake” AND (“privacy” OR “encryption”)). Including these terms allows us to focus on data lake repositories and, by extension, big data technologies, while the terms “privacy” and “encryption” enable the identification of works specifically discussing data privacy preservation and encryption within this domain.
Based on these search strings, inclusion criteria that could be automated by the search engines were applied, filtering directly by language, publication date, and sources whenever possible. The preliminary search results are reflected in Table 6, where applying the aforementioned search string yielded 26 documents from IEEE Xplore, 10 documents from WoS, 533 documents from Scopus, and 161 documents from ACM Library.
The document selection process consists of three stages. In the first stage, non-automated inclusion criteria and exclusion criteria are applied as an initial filter to the obtained results, reviewing the title, abstract, introduction, and conclusions. In the second stage, the documents addressing topics related to information security and cryptography in big data and/or data lakes are validated and identified. In the third stage, documents specifically discussing encryption applied to personal and sensitive data are filtered. This process results in the following list of relevant articles for the study: [33,48,49,50,51,52,53,54,55].

3.2.4. Classification Scheme

In this study, six classification schemes were applied to the selected studies:
  • Type of Contribution and Approach:
    The analyzed documents are classified according to the type of contribution and approach adopted. Regarding the contribution, three main categories are identified: (1) methodology, which encompasses systematic techniques and tools to address problems; (2) method or framework, which provides consistent structures or principles to solve specific problems; and (3) technique, which includes specific improvements such as algorithms or specific implementations.
    Regarding the approaches, the documents can adopt one of the following: (1) innovative, introducing significant advancements with new ideas, methods, or technologies; (2) positional, analyzing phenomena from a particular viewpoint in relation to the context and existing practices; or (3) canon, based on established practices, methods, or accepted standards in the field. Each document may make more than one type of contribution but adopts only one approach.
  • Encryption Techniques:
    In the context of big data and data lake repositories, the encryption techniques used to protect personal and sensitive data include advanced encryption standard (AES), recognized for its effectiveness and performance; homomorphic encryption (HE), which allows operations on encrypted data without the need to decrypt it; format-preserving encryption (FPE), which maintains the original format of the data, facilitating its integration with existing systems; elliptic curve cryptography (ECC), which stands out for offering a level of security comparable to other traditional cryptographic techniques but with smaller keys, reducing storage and processing requirements, ideal for resource-limited environments; and attribute-based encryption (ABE), which ensures fine-grained encrypted access control to externalized data.
    Other emerging techniques are also identified, expanding the available options according to the specific requirements presented by the documents.
  • Format Requirements:
    Format requirements for data are classified according to their state. For data in use, the requirements focus on the needs for analysis and machine learning, ensuring that the data can be processed efficiently without fully decrypting it. For data at rest, the requirements focus on the structure of the data, ensuring its correct integration and storage while maintaining its integrity. Finally, for data in transit, the requirements are grouped according to the communication protocols and technologies employed, ensuring secure transmission of data across networks or between systems.
  • Other Protection Strategies:
    Refers to other ways of protecting data in the context of the research. This includes anonymization, which involves modifying the original data to hide sensitive information and prevent the identification of individuals or entities; access control, which encompasses policies and mechanisms that determine who can access the data and under what conditions, ensuring that only authorized individuals have access to the information; and security audits, which involve the continuous monitoring and review of activities related to data access and usage, with the goal of identifying vulnerabilities and ensuring compliance with security policies.
  • Domain of Development of the Documents:
    The application domain of the document can be classified into three areas: industrial, healthcare, and academic. The industrial domain refers to documents where the research focus is developed in the context of an organization or industrial sector. The healthcare domain refers to research focused on medical data or data protection within the healthcare field. Finally, the Academic domain encompasses documents aimed at presenting general research without a specific focus on the industry or healthcare sector.
  • Challenges and Gaps:
    The challenges and gaps identified in the reviewed documents can be classified into four key areas: costs, which limit the adoption of advanced technologies; data standards, necessary to ensure interoperability and facilitate information exchange between systems; security and regulatory compliance, which are essential for protecting sensitive data and complying with regulations such as Chilean law N. 19.628; and data management and analysis, which refers to the challenges associated with efficiently managing and processing large volumes of data in big data and data lake contexts. This scheme highlights the most relevant areas for future research and the development of technological solutions.

4. Proposal

Based on the analysis in Section 2, this work presents a protocol designed to ensure data portability in data lakes while incorporating the principles of SBD. The main objective is to safeguard the protection of personally identifiable data and ensure proper control over its usage. This proposal positions security as a fundamental pillar in big data applications operating on data lakes. The protocol is structured into multiple layers, corresponding to the different stages of big data processes, following the model proposed by [44] and illustrated in Figure 3.
The protocol design encompasses the data ingestion stage, where data is collected from various sources and formats and initially stored in a data lake in its raw format. From this stage, the protocol enables two possible paths:
  • Unsecured Path:
    Represented in Figure 4, this path involves accessing the data without encryption measures, reserved exclusively for extraordinary cases, such as requests from entities with superior authority, for example, for judicial, law enforcement, or legal compliance purposes.
  • Secured Path:
    Represented in Figure 5, in this path, the data undergoes an encryption scheme based on masking through FPE and is transformed into the Delta Lake format. This allows controlled and secure data consumption under the supervision of data stewards, who regulate access and ensure compliance with security policies.
However, it is important to note that the protocol cannot guarantee control over the use of data once it has been transferred to third parties. Therefore, the responsibility lies with the data owners (D.O.), data stewards (D.S.), and requesters, who must adhere to the principles and regulations established for the ethical and secure handling of information.

4.1. Ingestion Layer

The ingestion layer is responsible for loading or collecting data from external sources and ensuring its proper entry into the protocol. Once inside, this layer immediately transfers the data to the persistence layer, where the data lake storage is located. This ensures that the data does not remain in the ingestion layer longer than necessary.
  • Data Ingestion. This component is responsible for extracting data from external sources and transporting it to the data lake. Although this component can be considered an autonomous process, within the context of this protocol, no distinction is made between the types of ingestion (batch or stream), as both are treated equally.

4.2. Persistence Layer

This is the data landing layer, where data is stored and can only be accessed by the data owner, who has authority over the data and is responsible for its proper management. This layer is divided into two containers, separating the raw, unencrypted data from the encrypted data in Delta Lake format.
  • Raw Container. This container within the proposed data lake stores the information exactly as it arrives into the protocol and can only be accessed directly by the data owner.
  • Delta Lake Container. This container stores the data in encrypted Delta Lake format. Through this container, the data steward will grant access to users via the secure access catalog component, allowing the data to be consumed.

4.3. Data Access Layer

This layer is responsible for access management and data masking using the tools provided by the Databricks platform, as well as offering the necessary interfaces for integration with other components of the protocol. In this layer, the data steward defines which data is securely transferred to the consumption layer, thanks to the data masking preprocessing. It also allows data to be transferred to the export layer in an unsecured manner, meaning without masking. The components of this layer are divided into services, tools, and workspaces provided by the Databricks platform.
  • Raw Control Catalog. This component is responsible for accessing raw data without the security layer provided by the masking process component. These data can only be accessed by the data owner, who has full control and is responsible for managing the shared, unprotected data. The unsecured data is then transferred to the consumption layer, where it is consumed by external parties that require the data in its raw form.
  • Secure Access Catalog. This component is responsible for accessing the data encrypted with FPE stored in the Delta Lake container. These data are managed by the data steward, who has the authority to share the masked data with the consumer layer, where external parties to the protocol can freely use the data.
  • Masking Process. This component enables data masking using an FPE scheme. Within the Databricks platform, Spark is used as the collector, which serves as the primary tool for connecting to data sources [46]. Above Spark is PySpark (3.5.0), a Python (3.11)-based interface that facilitates interaction with Spark. With these tools, the masking scheme is developed on Spark using PySpark, ensuring optimal performance in data processing. The transformed data is then converted into Delta Lake format and can be stored in the Delta Lake container, where it can later be accessed through the secure access catalog. A minimal sketch of this flow is shown after this list.
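The following sketch outlines how this layer could be wired together in PySpark; the paths, table names, masking function, and the Unity Catalog-style GRANT statement are assumptions for illustration, not the exact implementation of the protocol:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("masking-layer-sketch").getOrCreate()

def fpe_mask(value: str) -> str:
    # Placeholder for the FFX-based FPE routine described in Section 4.5;
    # the string reversal below is only a stand-in, NOT encryption.
    return value[::-1]

mask_udf = udf(fpe_mask, StringType())

# Raw container: only the data owner reads from here
raw_df = spark.read.parquet("/mnt/datalake/raw/customers/")

# Masking process: apply FPE to the sensitive column before persisting
masked_df = raw_df.withColumn("rut", mask_udf("rut"))

# Delta Lake container: masked data registered as a governed table
masked_df.write.format("delta").mode("overwrite").saveAsTable("delta_zone.customers")

# Secure access catalog: the data steward grants controlled read access
spark.sql("GRANT SELECT ON TABLE delta_zone.customers TO `analyst_group`")
```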

4.4. Consumer Layer

The consumption layer is the final stage of the protocol, responsible for the output of the data. The components in this layer represent the data consumed in either raw or Delta Lake formats, as needed. The exported data is consumed by actors external to the protocol and can be classified into two categories:
  • Safe Data. These are the encrypted data exported in Delta Lake format for secure consumption, typically intended for analysis by actors external to the protocol. These data undergo the masking scheme and, with authorization from the data steward, are protected before being shared.
  • Unsafe Data. These are the data exported in various formats but extracted directly from the data lake without going through the masking scheme. As a result, they are shared in an unsecured manner and lack the necessary controls to ensure their protection.

4.5. A POC Approach

Through previously documented work [56], we tested the technical feasibility of some elements of the protocol in Databricks. This allowed us to run tests such as protecting Chilean RUTs (the Chilean national identity number), achieving 100% correct encryption. This work also allowed us to review performance figures for the encryption and decryption of data batches between 10,000 and 70,000 records.
The FPE implementation employed is based on the FFX standard [57], a flexible extension of the Feistel model designed for the encryption of strings with arbitrary formats.
Leveraging the capabilities of Databricks with PySpark, two implementations were developed using user-defined functions (UDFs), enabling the execution of specific tasks such as complex computations or customized data manipulations, including data encryption using an FPE algorithm.
One of these implementations includes the FPE encryption algorithm to protect Chilean RUTs, while the other provides the corresponding decryption algorithm to retrieve the original content.
Algorithm 1 describes the FPE encryption process for Chilean RUTs. On line 1, the function is defined and receives a Chilean RUT as input. From lines 2 to 4, the input RUT is split into three components: the first digit, the base digits, and the verification digit.
Lines 5 to 7 initialize separate cipher instances, each configured with a specific alphabet tailored to the format of its corresponding RUT component:
  • 123456789 for the first digit.
  • 1234567890 for the base digits
  • 1234567890K for the verification digit.
From lines 8 to 10, each component is encrypted independently using its respective cipher. Finally, the resulting encrypted components are concatenated and returned as the encrypted RUT string.
For Algorithm 2, the steps are the same as in Algorithm 1 up to line 7. From lines 8 to 10, the cipher instances are used again, but in decryption mode, to recover the original RUT components. Finally, on line 11, the RUT is reconstructed from the decrypted components and returned on line 12.
In the context of the algorithms, CreateCipher serves as an abstraction of the FPE implementation based on the FFX specification. It is initialized with three parameters: the secret key, the target length of the input string, and the character alphabet to be used for encryption.
This approach ensures that the output preserves the original format and adheres to the standard structure of Chilean RUT.
Algorithm 1 RUT encryption process.
1:  function Encrypt(rut)
2:      first_digit ← rut[0]
3:      base_rut ← rut[1:-1]
4:      verifier_digit ← rut[-1]
5:      e_first ← CreateCipher(secret_key, 1, first_alphabet)
6:      e_base ← CreateCipher(secret_key, len(base_rut), base_alphabet)
7:      e_verifier ← CreateCipher(secret_key, 1, final_alphabet)
8:      encrypted_first ← e_first.Encrypt(first_digit)
9:      encrypted_base ← e_base.Encrypt(base_rut)
10:     encrypted_verifier ← e_verifier.Encrypt(verifier_digit)
11:     new_rut ← encrypted_first + encrypted_base + encrypted_verifier
12:     return new_rut
13: end function
Algorithm 2 RUT decryption process.
1:  function Decrypt(rut)
2:      first_encrypted ← rut[0]
3:      base_encrypted ← rut[1:-1]
4:      verifier_encrypted ← rut[-1]
5:      d_first ← CreateCipher(secret_key, 1, first_alphabet)
6:      d_base ← CreateCipher(secret_key, len(base_encrypted), base_alphabet)
7:      d_verifier ← CreateCipher(secret_key, 1, final_alphabet)
8:      first_decrypted ← d_first.Decrypt(first_encrypted)
9:      base_decrypted ← d_base.Decrypt(base_encrypted)
10:     verifier_decrypted ← d_verifier.Decrypt(verifier_encrypted)
11:     original_rut ← first_decrypted + base_decrypted + verifier_decrypted
12:     return original_rut
13: end function
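For reference, the sketch below approximates Algorithms 1 and 2 in PySpark, assuming the pyffx package as the FFX implementation and a hypothetical key; it illustrates the approach rather than reproducing the exact code evaluated in [56]:
```python
import pyffx
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

SECRET_KEY = b"hypothetical-secret-key"   # in practice, supplied by a key manager
FIRST_ALPHABET = "123456789"
BASE_ALPHABET = "1234567890"
FINAL_ALPHABET = "1234567890K"

def _ciphers(base_len):
    # One cipher per RUT component, mirroring lines 5-7 of Algorithms 1 and 2;
    # assumes the underlying FFX implementation accepts length-1 inputs, as in the algorithms.
    return (pyffx.String(SECRET_KEY, alphabet=FIRST_ALPHABET, length=1),
            pyffx.String(SECRET_KEY, alphabet=BASE_ALPHABET, length=base_len),
            pyffx.String(SECRET_KEY, alphabet=FINAL_ALPHABET, length=1))

def encrypt_rut(rut: str) -> str:
    first, base, verifier = rut[0], rut[1:-1], rut[-1]
    e_first, e_base, e_verifier = _ciphers(len(base))
    return e_first.encrypt(first) + e_base.encrypt(base) + e_verifier.encrypt(verifier)

def decrypt_rut(rut: str) -> str:
    first, base, verifier = rut[0], rut[1:-1], rut[-1]
    d_first, d_base, d_verifier = _ciphers(len(base))
    return d_first.decrypt(first) + d_base.decrypt(base) + d_verifier.decrypt(verifier)

# Registered as PySpark UDFs, the functions can mask or recover a RUT column at scale
encrypt_rut_udf = udf(encrypt_rut, StringType())
decrypt_rut_udf = udf(decrypt_rut, StringType())
```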

5. Results

5.1. Systematic Mapping Results

5.1.1. Data Extraction and Mapping

The data extraction is crucial for organizing relevant information and answering the research questions. The results of this study are shown using bubble charts and a bar graph. Figure 6 focuses on representing the results of the contributions and types of approaches found in the documents, providing a high-level view of the documents identified in the mapping and the trends observed. The bar chart in Figure 7 represents the encryption techniques found in big data and data lakes based on our exclusion criteria. Figure 8 details the requirements found for data to undergo encryption at different stages and other strategies for data protection identified in the documents. Figure 9 represents the domains of the documents found in the mapping and the challenges they present for future work in the area. The information presented in the bubble diagrams is divided into two axes: the positive and negative X-axis corresponds to the classification exposed in Section 3.2.4, while the Y-axis represents the context in which it is applied (on big data or data lakes).

5.1.2. Analysis and Discussion

RQ1 What types of contributions are found in the selected documents?
  • Figure 6 shows that the most common type of contribution in the selected documents is method or framework, accounting for 43.75%, followed by technique at 31.25%, and methodology at 25%.
Each document may make more than one type of contribution but adopts only one approach, with innovative approaches standing out at 40%. Of these, 30% are related to data lakes, and 40% are associated with big data.
Finally, documents with a positional approach represent 20% and are exclusively focused on data lakes.
The results reveal a tendency towards innovations in data lakes, while big data often involves the use of already established tools. With these tools, methods or frameworks are proposed over other types of contributions.
RQ2 What encryption techniques are used on big data tools for processing personal and sensitive data?
  • Figure 7 shows the encryption techniques used in big data tools to protect personal and sensitive data. According to our classification scheme, the most common techniques are HE (homomorphic encryption) and AES (advanced encryption standard), due to their ability to perform operations on encrypted data and their high performance, respectively.
Techniques like ECC (elliptic curve cryptography), FPE (format-preserving encryption), and ABE (attribute-based encryption) each appear once. Within the “Other” group, proxy re-encryption stands out, which is a technique based on the heterogeneous re-encoding of encrypted text, useful for scenarios requiring delegated secure access in distributed environments.
The results indicate a need for encryption techniques that offer features differentiating them from conventional ones like HE. While AES is more widely used than other forms of encryption such as FPE and ABE in this context, the latter techniques present promising security potential and contributions to the ecosystem. Their appeal lies in features such as format preservation and fine-grained access control.
RQ3 What encryption techniques are applied to data in data lake repositories?
  • Figure 7 presents the encryption techniques identified in data lakes, highlighting that HE (homomorphic encryption), AES (advanced encryption standard), and ECC (elliptic curve cryptography) are the most commonly used, each appearing twice.
These techniques stand out for their ability to handle large volumes of data securely: HE allows operations on encrypted data, AES is efficient and widely adopted, and ECC provides security with smaller keys, making it ideal for resource-constrained environments. On the other hand, techniques such as FPE (format-preserving encryption), ABE (attribute-based encryption), and others were not mentioned in the reviewed documents.
RQ4 What data format requirements are necessary for encryption in use, at rest, or in transit?
  • Figure 8 highlights that the most frequent format requirements correspond to data structure, accounting for 36.36%. These are primarily observed in big data (27.27%), while their presence in data lakes is lower (9.09%).
On the other hand, requirements related to transmission protocols and technologies also represent 36.36%, but these are more significant in data lakes (27.27%), with a smaller percentage in big data (9.09%). Finally, learning and analysis requirements make up 27.27% in total, with a greater incidence in data lakes (18.18%) than in big data (9.09%).
RQ5 What other strategies for protecting personal and sensitive data are found in the selected documents?
  • Figure 8 shows that additional strategies used to protect personal and sensitive data include access control and security audits, both representing a total of 38.46%, with differences in their distribution: access control is more common in big data (23.08%) than in data lakes (15.38%), while security audits are more prevalent in data lakes (23.08%) than in big data (15.38%).
On the other hand, anonymization is the least common practice, with 15.38% in big data and only 7.69% in data lakes.
RQ6 What are the industry domains presented where personal and sensitive data protection is applied?
  • Figure 9 shows that the academic domain is the most common, representing 30% of the documents in big data and 20% in data lakes, reflecting a predominant focus on general and theoretical research.
The health domain, present exclusively in data lakes with 30%, highlights its interest in protecting sensitive data related to medical information.
Lastly, the industrial domain accounts for 20% of the total, equally divided between big data and data lakes (10% each), focusing on practical applications for businesses, organizations, or related areas.
RQ7 In those documents that present future work, what kind of challenges and gaps are identified?
  • Figure 9 identifies the most common challenges and gaps in the reviewed documents. The primary challenge is security, privacy, and regulatory compliance, present in 30% of studies related to data lakes and 10% in big data. This reflects the importance of ensuring the protection of sensitive data and adhering to specific regulations.
Data management and analysis emerge as challenges in 20% of data lake documents and 10% in big data, emphasizing the technical difficulties related to data scalability and complexity. Other challenges, such as costs (20%) and data immutability and standards (10%), are unique to data lake environments.
These findings underscore the need for ongoing research to address these challenges in both data lake and big data contexts, with a particular focus on enhancing data security, managing large-scale datasets, and ensuring compliance with regulatory requirements.

5.2. Survey Results

To assess the usability and quality of the proposed protocol, the validation process described in Section 3.1.3 was carried out. This section presents the results of the survey through which the proposal was evaluated. The survey was administered to a company with 35 developers and consultants specializing in big data in the banking sector, obtaining 28 responses, a response rate of 80% of employees.

5.2.1. Participant Profile

Gathering the profile of respondents allows us to support and evaluate our proposal based on their professional experience. The results are presented in Table 7, highlighting the number of individuals by role and their average years of experience in IT, big data, and data lakes. Within this group of respondents, a subgroup of 10 individuals with more than five years of experience in the field was identified, providing perspectives based on a broader professional trajectory and enabling a more detailed analysis of the collected data.

5.2.2. Usability Assessment by Role

The usability evaluation yielded generally favorable results, with the overall average score exceeding 70 on a 0–100 scale. These values are presented in Figure 10, which shows that the mean score across participants is above the commonly accepted usability threshold. When disaggregating the data by professional role, two distinct clusters emerge:
The first cluster includes web and application developers (mean = 64.17) and participants currently undergoing training (mean = 66.25), both of which rated the protocol within the marginally acceptable range (classified as “Ok”).
The second cluster comprises roles that assessed usability more positively: big data consultants (mean = 73.96) and technical leads (mean = 72.50) placed usability in the “Good” category. Notably, business intelligence consultants (mean = 88.75) and the Director (mean = 92.50) rated the usability as “Excellent”.
Taken together, these results yield a global average usability score of 72.23, situating the protocol within the “Good” category. Furthermore, participants with extensive professional experience exhibited a notably higher average score of 77.25, approaching the upper boundary of the “Good” range and suggesting a positive correlation between experience and perceived usability.
From a statistical review, the usability responses had an average standard deviation between questions of 0.84, with a minimum of 0.59 in the first questions and a maximum of 1.06 in the ninth question.

5.2.3. Quality Assessment by Role

The results of the quality-related assessment indicate even more favorable outcomes compared to the usability evaluation, with the overall mean score approaching 80. These values are depicted in Figure 11, where a noticeable improvement is evident when analyzing responses across different organizational roles.
The only role with a mean score below the 70-point threshold was that of participants currently in training (mean = 66.96). In contrast, all other roles reported scores above this threshold. Web and application developers achieved a mean score of 72.62, while big data consultants reported a higher mean of 78.57, nearing the upper boundary of the “Good” category. Notably, all remaining roles surpassed the 80-point mark: technical leads (mean = 82.14), business intelligence consultants (mean = 87.50), and the director, who assigned the maximum score on the scale.
Overall, the mean score for quality assessment was 77.81, representing a marked increase relative to the usability results. Furthermore, individuals with considerable professional experience reported an average score of 85, thereby positioning their perception of quality within the “Excellent” category.
From a statistical review, the quality responses had an average standard deviation between questions of 0.78, with a minimum in the second, third, and sixth questions of ~0.73 and a maximum of 0.87 in the last question.

5.2.4. Analysis of Outlier Responses in Usability Evaluation

Among the collected data, two items deviated notably from the overall positive trend observed in the usability evaluation. These items correspond to Questions 2.4 (SD = 1.055) and 2.10 (SD = 1.026) in Table 3, titled “I think I would need the support of a technician to be able to use this protocol” and “I needed to learn a lot of things before I could get started with this protocol”, respectively.
As illustrated in Figure 12, responses to Question 2.4 revealed that 21.5% of participants perceived a degree of complexity in the implementation of the protocol. Specifically, 17.9% selected value 4 and 3.6% selected value 5 on the Likert scale, indicating agreement with the need for technical support. The largest proportion of responses clustered around the neutral midpoint (value 3), accounting for 35.5% of participants. Meanwhile, 42.9% of responses aligned with the expected trend, suggesting ease of use, distributed between value 1 (14.3%) and value 2 (28.6%).
Similarly, results for Question 2.10, shown in Figure 13, revealed that 17.9% of respondents indicated a need for prior learning to effectively use the protocol (value 4: 14.3%; value 5: 3.6%). Additionally, 39.2% of responses were neutral (value 3), highlighting possible uncertainty or variability in users’ prior knowledge. However, the majority of participants (42.9%) disagreed with the statement, suggesting that minimal learning was required; these responses were split between value 1 (14.3%) and value 2 (28.6%).
These results suggest that while the overall usability of the protocol was rated positively, a subset of users perceived potential barriers to initial adoption, particularly regarding the need for technical assistance or prior learning.

6. Discussion

The results obtained from the usability and quality questions indicate positive evaluations according to the scale proposed by [47]. Considering that scores above 70 are regarded as “good”, most roles within the organization achieved values within or above this threshold. The two roles with comparatively lower scores were “In Training” and “Web and Application Developer”. This trend may be attributed to limited experience in the big data domain: according to the profiling table (Table 7), these participants reported the least experience in general IT as well as in areas specifically related to big data and data lakes.
The overall average score closely aligns with the responses provided by big data consultants. This is particularly relevant, as these professionals are expected to be the primary users of the protocol in real-world scenarios. Their positive evaluation suggests a strong consistency between general perceptions and those of the domain experts.
Additionally, participants with substantial experience, those with the highest number of years working in the field, reported the highest scores on usability and quality questions. This finding supports the notion that the proposed protocol is well received even by individuals with advanced knowledge and expertise in the area.
In summary, the usability and quality results suggest that the protocol is well suited for implementation in the banking context.
No responses fell within the “not acceptable” range as defined by [47], and the majority were situated within the “acceptable” or higher categories. These findings indicate that the protocol meets, and in many cases exceeds, the minimum expected standards for usability and quality in this sector.

6.1. Analysis of Outlier Responses Regarding Protocol Usability

From a general perspective, the data show acceptable values; however, it is crucial to analyze the two questions that yielded responses deviating from the expected trend.
Based on the responses to question 2.4, as shown in Figure 12, there is a high concentration of responses at value 3, which may be interpreted as a neutral stance.
Additionally, a considerable number of participants indicated that they would require technical support to implement the protocol. Although this figure is lower than that of participants who believed they would not need assistance, the high proportion of neutral responses suggests uncertainty about the challenges of applying the protocol in a professional setting.
This combination of neutral and affirmative responses regarding the need for technical assistance may be due to a lack of familiarity with configuring data catalogs, as well as the inherent complexity involved in implementing a format-preserving encryption scheme.
On the other hand, this result may also be influenced by the diversity of roles among the survey participants. While the survey included big data consultants, who are expected to have greater knowledge on the topic, it also involved individuals in roles that may not be familiar with the tools required to use the protocol.
A similar analysis can be conducted for the responses to question 2.10, shown in Figure 13. Once again, there is a high concentration of responses at value 3, suggesting a neutral stance. However, there is also a significant number of affirmative responses indicating a need to acquire new knowledge prior to implementing the protocol.
As with question 2.4, the high proportion of neutral responses may reflect uncertainty in understanding the components of the protocol. This is further supported by the affirmative responses indicating that participants feel the need to learn various concepts before implementation.
This phenomenon may be partially explained by the profile of the respondents, as several were still in training and came from web application development backgrounds. However, the abundance of neutral responses also suggests that the protocol may be perceived as moderately complex from a theoretical standpoint.
Its comprehension may require familiarity with component diagrams and their layers, as well as more technical aspects such as the structure of a data catalog with designated stakeholders, or the concept of format-preserving encryption, which may be unfamiliar to individuals outside the field of cryptography.
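Since format-preserving encryption may be the least familiar of these concepts, a brief illustration can help. The sketch below uses the open-source pyffx library as a stand-in (an FFX-style cipher, not necessarily the configuration used in our protocol) to show that ciphertexts retain the length and alphabet of the plaintext, mirroring the behavior shown in Table 1; the key and sample values are hypothetical.

```python
# pip install pyffx  (third-party FFX-style FPE library; used here only for illustration)
import pyffx

key = b"demo-secret-key"  # in practice the key would come from a key management service

# Digits stay digits and the length is preserved.
id_cipher = pyffx.Integer(key, length=8)
ciphertext = id_cipher.encrypt(42437023)
print(ciphertext)                      # an integer with at most 8 digits
print(id_cipher.decrypt(ciphertext))   # 42437023

# Letters stay letters and the length is preserved.
word_cipher = pyffx.String(key, alphabet="abcdefghijklmnopqrstuvwxyz", length=5)
print(word_cipher.encrypt("hello"))    # a 5-letter lowercase string
```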

6.2. Key Contributions in Relation to Prior Research

The SBD approach proposed in this work not only addresses key gaps identified in the literature, such as the lack of standardized mechanisms for protecting sensitive data in data lake environments, but also offers a practical and integrated solution that combines FPE, robust data governance, and a scalable architecture based on Databricks. This holistic approach enables organizations to overcome current limitations related to the usability and integration of security protocols within analytical workflows.
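As a rough sketch of how this integration can look in a Databricks notebook, the example below applies a format-preserving transformation through a Spark UDF while persisting data to a Delta table. The fpe_encrypt helper, the landing path, the national_id column, and the target table name are hypothetical placeholders rather than the exact implementation of our protocol, and the Delta write assumes a Databricks or Delta-enabled cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()


def fpe_encrypt(value: str) -> str:
    """Hypothetical wrapper around an FPE cipher (e.g., FF1/FF3 or an FFX-style library)."""
    # ... call the chosen FPE implementation with a centrally managed key ...
    return value  # placeholder so the sketch runs end to end


fpe_udf = udf(fpe_encrypt, StringType())

# Ingest raw records and persist only the format-preserving ciphertext.
raw = spark.read.format("json").load("/mnt/landing/customers")       # hypothetical path
secured = raw.withColumn("national_id", fpe_udf("national_id"))      # hypothetical column
secured.write.format("delta").mode("append").saveAsTable("silver.customers_secured")
```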
In contrast to the work of [5], which focuses solely on FPE without incorporating governance or analytical tool integration, our solution embeds these elements within a comprehensive SBD framework. This allows for immediate deployment in real-world scenarios, particularly in highly regulated sectors such as finance, marking a significant evolution toward more complete and operationally viable solutions.
Our findings are consistent with those of [12], who demonstrated the effectiveness of FPE in enabling secure and usable encryption for sensitive datasets. However, our protocol builds upon this foundation by embedding FPE within a full architectural model that includes governance policies and access control mechanisms. This integration enhances its applicability and adaptability in professional and regulatory contexts.
In summary, our proposal not only fills critical gaps in existing literature but also provides an actionable, end-to-end protocol that ensures both data protection and operational efficiency in big data analytics environments.

7. Conclusions

This paper presented an SBD protocol for handling sensitive data in data lakes, combining the distributed processing capabilities of Databricks with FPE to ensure both data confidentiality and analytical usability. The proposed solution was developed using a design science research approach, integrating findings from a systematic literature review, and was validated through expert feedback in real-world banking environments. The results demonstrate that the protocol achieves high usability and quality scores, particularly among experienced professionals, and provides a structured, end-to-end process for secure data management.
The integration of FPE within a scalable architecture based on Delta Lake ensures that sensitive information remains protected throughout its lifecycle while preserving compatibility with existing analytics workflows. This is especially relevant in highly regulated industries such as finance and healthcare, where maintaining data utility while ensuring compliance with strict privacy regulations is critical. The approach presented here is not limited to any specific domain and can be applied broadly whenever there is a need to preserve both data security and format consistency across sensitive systems. The validation process, which included usability and quality assessments by domain experts, confirms that the protocol meets and often exceeds expected standards for implementation in production environments.
Furthermore, our proposal addresses key gaps identified in the literature, such as the lack of standardized mechanisms for protecting sensitive data in data lake environments. Unlike previous approaches that focus solely on encryption or data governance in isolation, our protocol embeds security at every stage, from ingestion to consumption, providing a comprehensive and actionable solution that aligns with the principles of SBD.
The responses obtained from the usability survey indicate that the protocol is well-received overall, with an average score above the “Good” threshold according to Bangor’s qualitative scale. While some roles reported a need for technical support or prior learning, these findings highlight opportunities for future improvements in documentation and training materials rather than limitations of the protocol itself.
Finally, the strong acceptance obtained during the validation of the protocol by key stakeholders aligns with the approach proposed in the methodology: designing a protocol that integrates security from the outset and maintains data usability proves to be highly relevant for professionals who handle sensitive information across multiple domains.

Author Contributions

Conceptualization and methodology, J.L.-O., G.A. and J.F.-L.; validation, J.L.-O. and G.A.; writing—original draft preparation, G.A., A.B.-M. and M.B.-L.; writing—review and editing, J.L.-O., G.A., J.F.-L., A.B.-M. and M.B.-L.; supervision, J.L.-O. and J.F.-L. All authors have read and agreed to the published version of the manuscript.

Funding

Funded (partially) by Dirección de Investigación, Universidad de La Frontera, Grant PP24-0027.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy.

Acknowledgments

The authors sincerely appreciate the anonymous reviewers for their valuable feedback and constructive recommendations, which have greatly improved the technical robustness and clarity of this article. We are truly grateful for their dedication and expertise in reviewing this manuscript.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ABE   Attribute-Based Encryption
AES   Advanced Encryption Standard
BD    Big Data
BI    Business Intelligence
CISO  Chief Information Security Officer
DL    Data Lake
ECC   Elliptic Curve Cryptography
FPE   Format-Preserving Encryption
GDPR  General Data Protection Regulation
HE    Homomorphic Encryption
IaaS  Infrastructure as a Service
IT    Information Technology
SBD   Secure by Design
SRA   Software Reference Architecture
SUS   System Usability Scale

References

  1. Chen, J.; Wang, H. Guest Editorial: Big Data Infrastructure I. IEEE Trans. Big Data 2018, 4, 148–149. [Google Scholar] [CrossRef]
  2. Rawat, R.; Yadav, R. Big data: Big data analysis, issues and challenges and technologies. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1022, 012014. [Google Scholar] [CrossRef]
  3. Panwar, A.; Bhatnagar, V. Data lake architecture: A new repository for data engineer. Int. J. Organ. Collect. Intell. (IJOCI) 2020, 10, 63–75. [Google Scholar] [CrossRef]
  4. Moreno, J.; Fernandez, E.B.; Serrano, M.A.; Fernandez-Medina, E. Secure development of big data ecosystems. IEEE Access 2019, 7, 96604–96619. [Google Scholar] [CrossRef]
  5. Gupta, S.; Jain, S.; Agarwal, M. Ensuring data security in databases using format preserving encryption. In Proceedings of the 2018 8th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 11–12 January 2018; pp. 1–5. [Google Scholar]
  6. Kumar, D.; Li, S. Separating storage and compute with the databricks lakehouse platform. In Proceedings of the 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA), Shenzhen, China, 13–16 October 2022; pp. 1–2. [Google Scholar]
  7. Mouratidis, H.; Kang, M. Secure by Design: Developing Secure Software Systems from the Ground Up. Int. J. Secur. Softw. Eng. 2011, 2, 23–41. [Google Scholar] [CrossRef]
  8. Shirtz, D.; Koberman, I.; Elyashar, A.; Puzis, R.; Elovici, Y. Enhancing Energy Sector Resilience: Integrating Security by Design Principles. arXiv 2024, arXiv:2402.11543. [Google Scholar]
  9. Awaysheh, F.M.; Aladwan, M.N.; Alazab, M.; Alawadi, S.; Cabaleiro, J.C.; Pena, T.F. Security by design for big data frameworks over cloud computing. IEEE Trans. Eng. Manag. 2021, 69, 3676–3693. [Google Scholar] [CrossRef]
  10. Bellare, M.; Ristenpart, T.; Rogaway, P.; Stegers, T. Format-preserving encryption. In Selected Areas in Cryptography, Proceedings of the 16th Annual International Workshop, SAC 2009, Calgary, AB, Canada, 13–14 August 2009; Revised Selected Papers 16; Springer: Berlin/Heidelberg, Germany, 2009; pp. 295–312. [Google Scholar]
  11. Weiss, M.; Rozenberg, B.; Barham, M. Practical solutions for format-preserving encryption. arXiv 2015, arXiv:1506.04113. [Google Scholar]
  12. Cui, B.; Zhang, B.; Wang, K. A data masking scheme for sensitive big data based on format-preserving encryption. In Proceedings of the 2017 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), Guangzhou, China, 21–24 July 2017; Volume 1, pp. 518–524. [Google Scholar]
  13. Wu, M.; Huang, J. A Scheme of Relational Database Desensitization Based on Paillier and FPE. In Proceedings of the 2021 3rd International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI), Taiyuan, China, 3–5 December 2021; pp. 374–378. [Google Scholar]
  14. Wieringa, R. Design science as nested problem solving. In Proceedings of the 4th International Conference on Design Science Research in Information Systems and Technology, Philadelphia, PA, USA, 7–8 May 2009; pp. 1–12. [Google Scholar]
  15. Wieringa, R.J. Design Science Methodology for Information Systems and Software Engineering; Springer: Berlin/Heidelberg, Germany, 2014. [Google Scholar]
  16. Wohlfarth, M. Data Portability on the Internet: An Economic Analysis. In Proceedings of the International Conference on Interaction Sciences, Seoul, Republic of Korea, 10–13 December 2017. [Google Scholar]
  17. Wohlfarth, M. Data Portability on the Internet. Bus. Inf. Syst. Eng. 2019, 61, 551–574. [Google Scholar] [CrossRef]
  18. Bozman, J.; Chen, G. Cloud Computing: The Need for Portability and Interoperability; IDC Executive Insights; IDC Corporate: Needham, MA, USA, 2010; pp. 74–75. [Google Scholar]
  19. Huth, D.; Stojko, L.; Matthes, F. A Service Definition for Data Portability. In Proceedings of the International Conference on Enterprise Information Systems, Heraklion, Greece, 3–5 May 2019. [Google Scholar]
  20. Kadam, S.P.; Joshi, S.D. Secure by design approach to improve security of object oriented software. In Proceedings of the 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 11–13 March 2015; pp. 24–30. [Google Scholar]
  21. Kern, C. Secure by Design at Google; Technical Report, Google Security Engineering; Google Research: Mountain View, CA, USA, 2024. [Google Scholar]
  22. Howard, M.; Lipner, S. The Security Development Lifecycle; Microsoft Press: Redmond, WA, USA, 2006. [Google Scholar]
  23. Paul, A.; Manoj, R.; S, U. Amazon Web Services Cloud Compliance Automation with Open Policy Agent. In Proceedings of the 2024 International Conference on Expert Clouds and Applications (ICOECA), Bengaluru, India, 18–19 April 2024. [Google Scholar] [CrossRef]
  24. Arostegi, M.; Torre-Bastida, A.I.; Bilbao, M.N.; Ser, J.D. A heuristic approach to the multicriteria design of IaaS cloud infrastructures for Big Data applications. Expert Syst. 2018, 35, e12259. [Google Scholar] [CrossRef]
  25. Megahed, M.E.; Badry, R.M.; Gaber, S.A. Survey on Big Data and Cloud Computing: Storage Challenges and Open Issues. In Proceedings of the 2023 4th International Conference on Communications, Information, Electronic and Energy Systems (CIEES), Plovdiv, Bulgaria, 23–25 November 2023; pp. 1–6. [Google Scholar]
  26. Zagan, E.; Danubianu, M. Cloud DATA LAKE: The new trend of data storage. In Proceedings of the 2021 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Ankara, Turkey, 11–13 June 2021; pp. 1–4. [Google Scholar]
  27. Dworkin, M. Recommendation for Block Cipher Modes of Operation. Methods and Techniques; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2001. [Google Scholar]
  28. Sayed, B.Y.; Mahmoud, A.M.; El-Rabaie, E.S.; Bauomy, N.A.S. CBSB: Robust Cancelable Biometric System for Banks Using Deep Learning. In Proceedings of the 2023 11th International Japan-Africa Conference on Electronics, Communications, and Computations (JAC-ECC), Alexandria, Egypt, 18–20 December 2023; pp. 30–33. [Google Scholar] [CrossRef]
  29. Sawadogo, P.; Darmont, J. On data lake architectures and metadata management. J. Intell. Inf. Syst. 2021, 56, 97–120. [Google Scholar] [CrossRef]
  30. Giebler, C.; Gröger, C.; Hoos, E.; Schwarz, H.; Mitschang, B. Leveraging the data lake: Current state and challenges. In Big Data Analytics and Knowledge Discovery, Proceedings of the 21st International Conference, DaWaK 2019, Linz, Austria, 26–29 August 2019; Proceedings 21; Springer: Berlin/Heidelberg, Germany, 2019; pp. 179–188. [Google Scholar]
  31. Madsen, M. How to Build an Enterprise Data Lake: Important Considerations Before Jumping; Third Nature Inc.: San Mateo, CA, USA, 2015; pp. 13–17. [Google Scholar]
  32. Gupta, S.; Giri, V. Practical Enterprise Data Lake Insights: Handle Data-Driven Challenges in an Enterprise Big Data Lake; Apress: Berkeley, CA, USA, 2018. [Google Scholar]
  33. Anisetti, M.; Ardagna, C.A.; Braghin, C.; Damiani, E.; Polimeno, A.; Balestrucci, A. Dynamic and scalable enforcement of access control policies for big data. In Proceedings of the 13th International Conference on Management of Digital EcoSystems, Virtual Event, 1–3 November 2021; pp. 71–78. [Google Scholar]
  34. Quinto, B. Big data governance and management. In Next-Generation Big Data: A Practical Guide to Apache Kudu, Impala, and Spark; Apress: Berkeley, CA, USA, 2018; pp. 495–506. [Google Scholar]
  35. Muñoz, A.P.; Martí, L.; Sánchez-Pi, N. Data Governance, a Knowledge Model Through Ontologies. In Proceedings of the Congreso Internacional de Tecnologías e Innovación, Guayaquil, Ecuador, 22–25 November 2021. [Google Scholar]
  36. Saed, K.A.; Aziz, N.A.; Ramadhani, A.W.; Hassan, N.H. Data Governance Cloud Security Assessment at Data Center. In Proceedings of the 2018 4th International Conference on Computer and Information Sciences (ICCOINS), Kuala Lumpur, Malaysia, 13–14 August 2018; pp. 1–4. [Google Scholar]
  37. Liu, W. How Data Security Could Be Achieved in The Process of Cloud Data Governance? In Proceedings of the 2022 2nd International Conference on Management Science and Software Engineering (ICMSSE 2022), Dali, China, 14–16 July 2022; Atlantis Press: Dordrecht, The Netherlands, 2022; pp. 114–120. [Google Scholar] [CrossRef]
  38. Dingre, S.S. Exploration of Data Governance Frameworks, Roles, and Metrics for Success. J. Artif. Intell. Cloud Comput. 2023, 2, 1–3. [Google Scholar] [CrossRef]
  39. Khatri, V.; Brown, C.V. Designing data governance. Commun. ACM 2010, 53, 148–152. [Google Scholar] [CrossRef]
  40. Petersen, K.; Feldt, R.; Mujtaba, S.; Mattsson, M. Systematic mapping studies in software engineering. In Proceedings of the 12th International Conference on Evaluation and Assessment in Software Engineering (EASE), Bari, Italy, 26–27 June 2008; BCS Learning & Development: Swindon, UK, 2008. [Google Scholar]
  41. Negri-Ribalta, C.; Lombard-Platet, M.; Salinesi, C. Understanding the GDPR from a requirements engineering perspective—A systematic mapping study on regulatory data protection requirements. Requir. Eng. 2024, 29, 523–549. [Google Scholar] [CrossRef]
  42. Sommerville, I. Software Engineering, 10th ed.; Series Software Engineering; Pearson: Boston, MA, USA, 2015; Volume 10. [Google Scholar]
  43. Steurer, J. The Delphi method: An efficient procedure to generate knowledge. Skelet. Radiol. 2011, 40, 959–961. [Google Scholar] [CrossRef]
  44. Nadal, S.; Herrero, V.; Romero, O.; Abelló, A.; Franch, X.; Vansummeren, S.; Valerio, D. A software reference architecture for semantic-aware Big Data systems. Inf. Softw. Technol. 2017, 90, 75–92. [Google Scholar] [CrossRef]
  45. Brooke, J. SUS: A ‘Quick and Dirty’ Usability Scale. In Usability Evaluation In Industry; CRC Press: Boca Raton, FL, USA, 1996; pp. 207–212. [Google Scholar] [CrossRef]
  46. Lagos, J.; Cravero, A. Process Formalization Proposal for Data Ingestion in a Data Lake. In Proceedings of the 2022 41st International Conference of the Chilean Computer Science Society (SCCC), Santiago, Chile, 21–25 November 2022; pp. 1–8. [Google Scholar]
  47. Bangor, A.; Kortum, P.; Miller, J. Determining what individual SUS scores mean: Adding an adjective rating scale. J. Usability Stud. 2009, 4, 114–123. [Google Scholar]
  48. Rieyan, S.A.; News, M.R.K.; Rahman, A.M.; Khan, S.A.; Zaarif, S.T.J.; Alam, M.G.R.; Hassan, M.M.; Ianni, M.; Fortino, G. An advanced data fabric architecture leveraging homomorphic encryption and federated learning. Inf. Fusion 2024, 102, 102004. [Google Scholar] [CrossRef]
  49. Yeng, P.K.; Diekuu, J.B.; Abomhara, M.; Elhadj, B.; Yakubu, M.A.; Oppong, I.N.; Odebade, A.; Fauzi, M.A.; Yang, B.; El-Gassar, R. HEALER2: A Framework for Secure Data Lake Towards Healthcare Digital Transformation Efforts in Low and Middle-Income Countries. In Proceedings of the 2023 International Conference on Emerging Trends in Networks and Computer Communications (ETNCC), Windhoek, Namibia, 16–18 August 2023; pp. 1–9. [Google Scholar]
  50. Shang, X.; Subenderan, P.; Islam, M.; Xu, J.; Zhang, J.; Gupta, N.; Panda, A. One stone, three birds: Finer-grained encryption with apache parquet@ large scale. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; pp. 5802–5811. [Google Scholar]
  51. Hamadou, H.B.; Pedersen, T.B.; Thomsen, C. The danish national energy data lake: Requirements, technical architecture, and tool selection. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 1523–1532. [Google Scholar]
  52. Revathy, P.; Mukesh, R. Analysis of big data security practices. In Proceedings of the 2017 3rd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), Tumkur, India, 21–23 December 2017; pp. 264–267. [Google Scholar]
  53. Rawat, D.B.; Doku, R.; Garuba, M. Cybersecurity in big data era: From securing big data to data-driven security. IEEE Trans. Serv. Comput. 2019, 14, 2055–2072. [Google Scholar] [CrossRef]
  54. Zhao, X.; Zhang, C.; Guan, S. A data lake-based security transmission and storage scheme for streaming big data. Clust. Comput. 2024, 27, 4741–4755. [Google Scholar] [CrossRef]
  55. Kai, L.; Liang, Z.; Yaojing, Y.; Dazhu, Y.; Min, Z. Research on Federated Learning Data Management Method Based on Data Lake Technology. In Proceedings of the 2023 International Conference on Computers, Information Processing and Advanced Education (CIPAE), Ottawa, ON, Canada, 26–28 August 2023; pp. 385–390. [Google Scholar]
  56. Gabriel, A.; Julio, F.; Juan, L. Chilean Rut Encryption Using FPE: A POC in Databricks. 2024. Available online: https://doi.org/10.5281/ZENODO.15320875 (accessed on 30 April 2025).
  57. Bellare, M.; Rogaway, P.; Spies, T. The FFX mode of operation for format-preserving encryption. NIST Submiss. 2010, 20, 1–18. [Google Scholar]
Figure 1. Representation of the design science cycle.
Figure 2. Qualitative rating scale [47].
Figure 3. Architectural representation of data flow and protocol components.
Figure 4. Representation of the process for obtaining unsecured data.
Figure 5. Representation of the process for obtaining secured data.
Figure 6. Bubble diagram. Visualization according to the type of contribution and approach of the documents.
Figure 7. Bar chart. Visualization of identified encryption techniques.
Figure 8. Bubble diagram. Visualization of requirements and other strategies found in the documents.
Figure 9. Bubble diagram. Visualization of domains and challenges found in the documents.
Figure 10. Usability scores based on the qualitative rating scale proposed by [47].
Figure 11. Quality scores based on the qualitative rating scale proposed by [47].
Figure 12. Results for the question: “I think that I would need the support of a technical person to be able to use this protocol”.
Figure 13. Results for the question: “I needed to learn a lot of things before I could get started with this protocol”.
Table 1. Example of FPE.
Original Text        | Encrypted Text (FPE)
42437023             | 56548362
Hello World          | Okvhu Lcsie
4243-7023-1234-5678  | 5654-8362-9876-0421
Table 2. Participant profiling questions.
Question ID | Question Text | Response Type
SQ1.1 | Regardless of whether you were a graduate or not, since approximately what date have you been working in IT? | Short answer (date)
SQ1.2 | Regardless of whether you were a graduate or not, since approximately what date have you been working in Big Data? | Short answer (date)
SQ1.3 | Regardless of whether you were a graduate or not, since approximately what date have you been working with Data Lake? | Short answer (date)
SQ1.4 | Which role within the company most closely matches your functions? | Multiple choice:
  • Director
  • Technical Leader
  • Big Data Consultant
  • BI Consultant
  • Web App Developer
  • Trainee
Table 3. Usability survey questions.
Question ID | Statement
SQ2.1 | I think that I would like to use this protocol frequently.
SQ2.2 | I found the protocol unnecessarily complex.
SQ2.3 | I thought the protocol was easy to use.
SQ2.4 | I think that I would need the support of a technical person to be able to use this protocol.
SQ2.5 | I found the various functions of this protocol were well integrated.
SQ2.6 | I thought there was too much inconsistency in this protocol.
SQ2.7 | I would imagine that most people would learn to use this protocol very quickly.
SQ2.8 | I found the protocol very cumbersome to use.
SQ2.9 | I felt very confident using the protocol.
SQ2.10 | I needed to learn a lot of things before I could get going with this protocol.
Table 4. Quality assessment questions.
Question ID | Quality Attribute | Statement
SQ3.1 | Usefulness | The presented protocol would be useful in my work.
SQ3.2 | Satisfaction | Overall, I feel satisfied with the presented protocol.
SQ3.3 | Trust | I would trust the protocol to handle my work with sensitive data.
SQ3.4 | Perceived Relative Benefit | Using the proposed protocol would be an improvement with respect to my current way of handling and analyzing sensitive data.
SQ3.5 | Functional Completeness | In general, the proposed protocol covers the needs of my work.
SQ3.6 | Functional Appropriateness | The proposed protocol facilitates the management of the work with sensitive data.
SQ3.7 | Willingness to Adopt | I would like to adopt the protocol in my work.
Table 5. Systematic mapping research questions.
ID | Research Question
RQ1 | What types of contributions are found in the selected documents?
RQ2 | What encryption techniques are used in big data tools for processing personal and sensitive data?
RQ3 | What encryption techniques are applied to data in data lake repositories?
RQ4 | What data format requirements are applied to data in data lake repositories?
RQ5 | What other strategies for protecting personal and sensitive data are found in the selected documents?
RQ6 | What are the industry domains presented where personal and sensitive data protection is applied?
RQ7 | In documents that present future work, what kind of challenges and gaps are identified?
Table 6. Search engine results with and without inclusion/exclusion criteria applied.
Search Engine | Query Applied | Inclusion/Exclusion Criteria Applied
Scopus | 53 | 35
WoS    | 10 | 1
IEEE   | 9  | 2
ACM    | 27 | 1
Table 7. Participants profile results.
Company Position | Quantity | Exp. in IT (yrs) | Exp. in BD (yrs) | Exp. in DL (yrs)
Big Data Consultant   | 13 | 5.47  | 1.87  | 1.77
Director              | 1  | 16.00 | 16.00 | 3.50
In Training           | 4  | 0.47  | 0.41  | 0.38
Web and App Developer | 6  | 1.17  | 0.64  | 0.58
Technical Lead        | 2  | 12.68 | 7.18  | 5.39
BI Consultant         | 2  | 9.86  | 3.26  | 3.26
