A Secure-by-Design Approach to Big Data Analytics Using Databricks and Format-Preserving Encryption
Abstract
1. Introduction
2. Background
2.1. Data Portability and Interoperability in the Context of Cloud Computing and Data Protection
2.2. Security by Design
2.3. Infrastructure
2.4. Format-Preserving Encryption
2.5. Data Lake Architecture
2.6. Data Ingestion
2.7. Data Governance
3. Methodology
3.1. Design Science
3.1.1. Conceptual Analysis
3.1.2. Solution Design
3.1.3. Design Validation
3.1.4. Implementation
3.2. Systematic Mapping of the Literature
3.2.1. Research Questions
3.2.2. Inclusion and Exclusion Criteria
Inclusion criteria:
- (i) Language: Only studies published in English and Spanish are included, ensuring that the documents are accessible and relevant to the international academic literature.
- (ii) Publication Date: Only works published between 2014 and January 2024 are considered to include all relevant documents in the field of data protection in big data and data lake environments.
- (iii) Sources: Only works from scientific journals and conferences are accepted.
Exclusion criteria:
- (i) Study Domain: All works not focused on the field of information security in big data environments and data lake repositories are excluded, ensuring that the research is exclusively focused on the topic of interest.
- (ii) Accessibility: Documents that could not be fully accessed or were not relevant to the analysis were excluded.
- (iii) Duplication: Duplicate documents between academic search engines were excluded, retaining only one and discarding the others.
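The criteria above can be read as a simple filter over candidate records. The following is a minimal sketch of that reading, not the tooling used in the study; the `Record` fields, the language codes, and the use of a DOI as the duplicate key are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


# Hypothetical record shape; the field names are assumptions, not from the study.
@dataclass(frozen=True)
class Record:
    doi: str
    language: str      # e.g. "en" or "es"
    published: date
    source_type: str   # e.g. "journal", "conference", "book"
    on_topic: bool     # security in big data / data lake environments
    full_text: bool    # the document could be fully accessed


ALLOWED_LANGUAGES = {"en", "es"}
ALLOWED_SOURCES = {"journal", "conference"}
WINDOW_START = date(2014, 1, 1)
WINDOW_END = date(2024, 1, 31)


def select(records):
    """Apply the inclusion criteria, then the exclusion criteria."""
    seen = set()
    kept = []
    for r in records:
        included = (
            r.language in ALLOWED_LANGUAGES
            and WINDOW_START <= r.published <= WINDOW_END
            and r.source_type in ALLOWED_SOURCES
        )
        if not included or not r.on_topic or not r.full_text:
            continue
        if r.doi in seen:  # duplicate across search engines: keep the first
            continue
        seen.add(r.doi)
        kept.append(r)
    return kept
```

Keeping the first occurrence of each DOI mirrors the rule of retaining one copy of a duplicated document; any other stable identifier would serve equally well.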
3.2.3. Search and Selection Process
3.2.4. Classification Scheme
- Type of Contribution and Approach: The analyzed documents are classified according to the type of contribution and the approach adopted. Regarding the contribution, three main categories are identified: (1) methodology, which encompasses systematic techniques and tools to address problems; (2) method or framework, which provides consistent structures or principles to solve specific problems; and (3) technique, which includes specific improvements such as algorithms or concrete implementations. Regarding the approach, a document can adopt one of the following: (1) innovative, introducing significant advancements with new ideas, methods, or technologies; (2) positional, analyzing phenomena from a particular viewpoint in relation to the context and existing practices; or (3) canon, based on established practices, methods, or accepted standards in the field. Each document may make more than one type of contribution but adopts only one approach.
- Encryption Techniques: In the context of big data and data lake repositories, the encryption techniques used to protect personal and sensitive data include advanced encryption standard (AES), recognized for its effectiveness and performance; homomorphic encryption (HE), which allows operations on encrypted data without the need to decrypt it; format-preserving encryption (FPE), which maintains the original format of the data, facilitating its integration with existing systems; elliptic curve cryptography (ECC), which offers a level of security comparable to traditional cryptographic techniques with smaller keys, reducing storage and processing requirements and making it suitable for resource-limited environments; and attribute-based encryption (ABE), which ensures fine-grained encrypted access control to outsourced data. Other emerging techniques are also identified, expanding the available options according to the specific requirements presented by the documents.
- Format Requirements: Format requirements for data are classified according to the state of the data. For data in use, the requirements focus on the needs of analysis and machine learning, ensuring that the data can be processed efficiently without fully decrypting it. For data at rest, the requirements focus on the structure of the data, ensuring its correct integration and storage while maintaining its integrity. Finally, for data in transit, the requirements are grouped according to the communication protocols and technologies employed, ensuring secure transmission of data across networks or between systems.
- Other Protection Strategies: This category covers other ways of protecting data in the context of the research. It includes anonymization, which involves modifying the original data to hide sensitive information and prevent the identification of individuals or entities; access control, which encompasses policies and mechanisms that determine who can access the data and under what conditions, ensuring that only authorized individuals have access to the information; and security audits, which involve the continuous monitoring and review of activities related to data access and usage, with the goal of identifying vulnerabilities and ensuring compliance with security policies.
- Domain of Development of the Documents: The application domain of a document is classified into one of three areas: industrial, healthcare, and academic. The industrial domain refers to documents where the research is developed in the context of an organization or industrial sector. The healthcare domain refers to research focused on medical data or data protection within the healthcare field. Finally, the academic domain encompasses documents that present general research without a specific focus on the industrial or healthcare sector.
- Challenges and Gaps: The challenges and gaps identified in the reviewed documents are classified into four key areas: costs, which limit the adoption of advanced technologies; data standards, necessary to ensure interoperability and facilitate information exchange between systems; security and regulatory compliance, which are essential for protecting sensitive data and complying with regulations such as Chilean Law No. 19.628; and data management and analysis, which refers to the challenges associated with efficiently managing and processing large volumes of data in big data and data lake contexts. This scheme highlights the most relevant areas for future research and the development of technological solutions.
4. Proposal
- Unsecured Path: Represented in Figure 4, this path involves accessing the data without encryption measures. It is reserved exclusively for extraordinary cases, such as requests from entities with superior authority, for example, for judicial, law enforcement, or legal compliance purposes.
- Secured Path: Represented in Figure 5, in this path the data undergoes an encryption scheme based on masking through FPE and is transformed into the Delta Lake format. This allows controlled and secure data consumption under the supervision of data stewards, who regulate access and ensure compliance with security policies.
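The choice between the two paths can be pictured as a small routing decision. The sketch below is only an illustration under stated assumptions: the purpose labels and the `approved_by_owner` flag are hypothetical names introduced here, chosen to reflect that the unsecured path is exceptional and that everything else goes through FPE masking.

```python
from enum import Enum


class Path(Enum):
    UNSECURED = "raw"      # data served without encryption
    SECURED = "fpe_delta"  # data masked with FPE and stored as Delta Lake


# Hypothetical labels for the extraordinary cases named in the text.
EXTRAORDINARY_PURPOSES = {"judicial", "law_enforcement", "legal_compliance"}


def choose_path(purpose: str, approved_by_owner: bool) -> Path:
    """Return the path a request should follow.

    The unsecured path is taken only when the purpose is extraordinary
    AND the data owner has approved it (an assumption of this sketch);
    every other request defaults to the secured path.
    """
    if purpose in EXTRAORDINARY_PURPOSES and approved_by_owner:
        return Path.UNSECURED
    return Path.SECURED
```

Defaulting to `Path.SECURED` keeps the sketch in line with a secure-by-design stance: the raw route must be justified explicitly, never reached by omission.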
4.1. Ingestion Layer
- Data Ingestion. This component is responsible for extracting data from external sources and transporting it to the data lake. Although this component can be considered an autonomous process, within the context of this protocol, no distinction is made between the types of ingestion (batch or stream), as both are treated equally.
4.2. Persistence Layer
- Raw Container. This container within the proposed data lake stores the information exactly as it arrives into the protocol and can only be accessed directly by the data owner.
- Delta Lake Container. This container stores the data in encrypted Delta Lake format. Through this container, the data steward will grant access to users via the secure access catalog component, allowing the data to be consumed.
4.3. Data Access Layer
- Raw Control Catalog. This component is responsible for accessing raw data without the security layer provided by the masking process component. These data can only be accessed by the data owner, who has full control and is responsible for managing the shared, unprotected data. The unsecured data is then transferred to the consumption layer, where it is consumed by external parties that require the data in its raw form.
- Secure Access Catalog. This component is responsible for accessing the data encrypted with FPE stored in the Delta Lake container. These data are managed by the data steward, who has the authority to share the masked data with the consumer layer, where external parties to the protocol can freely use the data.
- Masking Process. This component enables data masking using an FPE scheme. Within the Databricks platform, Spark is used as the collector, serving as the primary tool for connecting to data sources [46]. On top of Spark sits PySpark (3.5.0), a Python (3.11)-based interface that facilitates interaction with Spark. The masking scheme is developed on Spark using PySpark so that the transformation runs as part of Spark's distributed data processing. The transformed data is then converted into Delta Lake format and stored in the Delta Lake container, where it can later be accessed through the secure access catalog.
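The masking step can be thought of as a per-value function applied to the sensitive columns of each record. The sketch below shows that shape in plain Python so it runs without a cluster; `mask_rows` and `reverse_digits` are hypothetical names, and `reverse_digits` is a placeholder transformation, not encryption and not the scheme used in the protocol.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


def mask_rows(rows: Iterable[Row], sensitive: List[str],
              mask: Callable[[str], str]) -> List[Row]:
    """Return copies of the rows with the sensitive columns masked."""
    masked = []
    for row in rows:
        out = dict(row)  # copy, so the raw input is left untouched
        for column in sensitive:
            if out.get(column) is not None:
                out[column] = mask(out[column])
        masked.append(out)
    return masked


def reverse_digits(value: str) -> str:
    """Placeholder for the demo only; this is NOT encryption."""
    return value[::-1]


# On Databricks, the same per-value function could be wrapped with
# pyspark.sql.functions.udf, applied to a column with DataFrame.withColumn,
# and the result written with df.write.format("delta").save(<path>).
# That wiring is left as a comment because it needs a Spark session.
```

Passing the masking function as a parameter keeps the column-selection logic independent of the cipher, so a real FPE implementation can be substituted without touching the rest of the pipeline.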
4.4. Consumer Layer
- Safe Data. These are the encrypted data exported in Delta Lake format for secure consumption, typically intended for analysis by external actors to the protocol. These data undergo the masking scheme and, with authorization from the data steward, are protected before being shared.
- Unsafe Data. These are the data exported in various formats but extracted directly from the data lake without going through the masking scheme. As a result, they are shared in an unsecured manner and lack the necessary controls to ensure their protection.
4.5. A POC Approach
- 123456789 for the first digit.
- 1234567890 for the base digits.
- 1234567890K for the verification digit.
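One way to read the three alphabets is as a per-position constraint on a RUT string. The sketch below checks that constraint; it assumes the RUT is written without dots or hyphen (for example `12345678K`) and that the last character is the verification digit, neither of which is stated in the list itself. The function names are hypothetical.

```python
FIRST_DIGIT = "123456789"    # first digit: no leading zero
BASE_DIGITS = "1234567890"   # remaining digits of the number
CHECK_DIGIT = "1234567890K"  # verification digit, which may be K


def alphabet_for(position: int, length: int) -> str:
    """Return the alphabet that governs one position of a RUT string."""
    if position == 0:
        return FIRST_DIGIT
    if position == length - 1:
        return CHECK_DIGIT
    return BASE_DIGITS


def is_valid_shape(rut: str) -> bool:
    """Check that every character belongs to its position's alphabet."""
    if len(rut) < 2:
        return False
    return all(ch in alphabet_for(i, len(rut)) for i, ch in enumerate(rut))
```

This checks shape only; it does not verify that the final character is the arithmetically correct verification digit for the number.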
Algorithm 1. RUT encryption process.
Algorithm 2. RUT decryption process.
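To make the idea of an encrypt/decrypt pair over the RUT alphabets concrete, the sketch below uses a deliberately simple stand-in: a keyed, position-dependent shift within each position's alphabet. It is not the procedure of Algorithms 1 and 2, it is not FF1 or FFX, and it is not secure; it only demonstrates that a transformation can stay inside the per-position alphabets and be inverted with the same key. All function names are hypothetical, and the RUT is assumed to be written without dots or hyphen.

```python
import hashlib
import hmac

FIRST_DIGIT = "123456789"
BASE_DIGITS = "1234567890"
CHECK_DIGIT = "1234567890K"


def _alphabet(position: int, length: int) -> str:
    if position == 0:
        return FIRST_DIGIT
    if position == length - 1:
        return CHECK_DIGIT
    return BASE_DIGITS


def _offset(key: bytes, position: int, size: int) -> int:
    """Derive a keyed, position-dependent shift in the range [0, size)."""
    digest = hmac.new(key, str(position).encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % size


def _shift(rut: str, key: bytes, sign: int) -> str:
    out = []
    for i, ch in enumerate(rut):
        alphabet = _alphabet(i, len(rut))
        idx = alphabet.index(ch)  # raises ValueError on a malformed RUT
        step = sign * _offset(key, i, len(alphabet))
        out.append(alphabet[(idx + step) % len(alphabet)])
    return "".join(out)


def toy_encrypt(rut: str, key: bytes) -> str:
    """Toy stand-in for FPE: shift each character within its alphabet."""
    return _shift(rut, key, +1)


def toy_decrypt(rut: str, key: bytes) -> str:
    """Invert toy_encrypt by applying the opposite shift."""
    return _shift(rut, key, -1)
```

The output keeps the shape of a RUT, but its verification digit is not recomputed, so it will generally not be a valid check digit for the masked number. A production scheme would rely on a vetted FPE mode rather than this shift.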
5. Results
5.1. Systematic Mapping Results
5.1.1. Data Extraction and Mapping
5.1.2. Analysis and Discussion
- Figure 6 shows that the most common type of contribution in the selected documents is method or framework, accounting for 43.75%, followed by technique at 31.25%, and methodology at 25%.
- Figure 7 shows the encryption techniques used in big data tools to protect personal and sensitive data. According to our classification scheme, the most common techniques are HE (homomorphic encryption) and AES (advanced encryption standard), due to their ability to perform operations on encrypted data and their high performance, respectively.
- Figure 7 presents the encryption techniques identified in data lakes, highlighting that HE (homomorphic encryption), AES (advanced encryption standard), and ECC (elliptic curve cryptography) are the most commonly used, each appearing twice.
- Figure 8 highlights that the most frequent format requirements correspond to data structure, accounting for 36.36%. These are primarily observed in big data (27.27%), while their presence in data lakes is lower (9.09%).
- Figure 8 shows that additional strategies used to protect personal and sensitive data include access control and security audits, both representing a total of 38.46%, with differences in their distribution: access control is more common in big data (23.08%) than in data lakes (15.38%), while security audits are more prevalent in data lakes (23.08%) than in big data (15.38%).
- Figure 9 shows that the academic domain is the most common, representing 30% of the documents in big data and 20% in data lakes, reflecting a predominant focus on general and theoretical research.
- Figure 9 identifies the most common challenges and gaps in the reviewed documents. The primary challenge is security, privacy, and regulatory compliance, present in 30% of studies related to data lakes and 10% in big data. This reflects the importance of ensuring the protection of sensitive data and adhering to specific regulations.
5.2. Survey Results
5.2.1. Participant Profile
5.2.2. Usability Assessment by Role
5.2.3. Quality Assessment by Role
5.2.4. Analysis of Outlier Responses in Usability Evaluation
6. Discussion
6.1. Analysis of Outlier Responses Regarding Protocol Usability
6.2. Key Contributions in Relation to Prior Research
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
ABE | Attribute-Based Encryption |
AES | Advanced Encryption Standard |
BD | Big Data |
BI | Business Intelligence |
CISO | Chief Information Security Officer |
DL | Data Lake |
ECC | Elliptic Curve Cryptography |
FPE | Format-Preserving Encryption |
GDPR | General Data Protection Regulation |
HE | Homomorphic Encryption |
IaaS | Infrastructure as a Service |
IT | Information Technology |
SBD | Secure by Design |
SRA | Software Reference Architecture |
SUS | System Usability Scale |
References
- Chen, J.; Wang, H. Guest Editorial: Big Data Infrastructure I. IEEE Trans. Big Data 2018, 4, 148–149.
- Rawat, R.; Yadav, R. Big data: Big data analysis, issues and challenges and technologies. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1022, 012014.
- Panwar, A.; Bhatnagar, V. Data lake architecture: A new repository for data engineer. Int. J. Organ. Collect. Intell. (IJOCI) 2020, 10, 63–75.
- Moreno, J.; Fernandez, E.B.; Serrano, M.A.; Fernandez-Medina, E. Secure development of big data ecosystems. IEEE Access 2019, 7, 96604–96619.
- Gupta, S.; Jain, S.; Agarwal, M. Ensuring data security in databases using format preserving encryption. In Proceedings of the 2018 8th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 11–12 January 2018; pp. 1–5.
- Kumar, D.; Li, S. Separating storage and compute with the databricks lakehouse platform. In Proceedings of the 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA), Shenzhen, China, 13–16 October 2022; pp. 1–2.
- Mouratidis, H.; Kang, M. Secure by Design: Developing Secure Software Systems from the Ground Up. Int. J. Secur. Softw. Eng. 2011, 2, 23–41.
- Shirtz, D.; Koberman, I.; Elyashar, A.; Puzis, R.; Elovici, Y. Enhancing Energy Sector Resilience: Integrating Security by Design Principles. arXiv 2024, arXiv:2402.11543.
- Awaysheh, F.M.; Aladwan, M.N.; Alazab, M.; Alawadi, S.; Cabaleiro, J.C.; Pena, T.F. Security by design for big data frameworks over cloud computing. IEEE Trans. Eng. Manag. 2021, 69, 3676–3693.
- Bellare, M.; Ristenpart, T.; Rogaway, P.; Stegers, T. Format-preserving encryption. In Selected Areas in Cryptography, Proceedings of the 16th Annual International Workshop, SAC 2009, Calgary, AB, Canada, 13–14 August 2009; Revised Selected Papers 16; Springer: Berlin/Heidelberg, Germany, 2009; pp. 295–312.
- Weiss, M.; Rozenberg, B.; Barham, M. Practical solutions for format-preserving encryption. arXiv 2015, arXiv:1506.04113.
- Cui, B.; Zhang, B.; Wang, K. A data masking scheme for sensitive big data based on format-preserving encryption. In Proceedings of the 2017 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), Guangzhou, China, 21–24 July 2017; Volume 1, pp. 518–524.
- Wu, M.; Huang, J. A Scheme of Relational Database Desensitization Based on Paillier and FPE. In Proceedings of the 2021 3rd International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI), Taiyuan, China, 3–5 December 2021; pp. 374–378.
- Wieringa, R. Design science as nested problem solving. In Proceedings of the 4th International Conference on Design Science Research in Information Systems and Technology, Philadelphia, PA, USA, 7–8 May 2009; pp. 1–12.
- Wieringa, R.J. Design Science Methodology for Information Systems and Software Engineering; Springer: Berlin/Heidelberg, Germany, 2014.
- Wohlfarth, M. Data Portability on the Internet: An Economic Analysis. In Proceedings of the International Conference on Interaction Sciences, Seoul, Republic of Korea, 10–13 December 2017.
- Wohlfarth, M. Data Portability on the Internet. Bus. Inf. Syst. Eng. 2019, 61, 551–574.
- Bozman, J.; Chen, G. Cloud Computing: The Need for Portability and Interoperability; IDC Executive Insights; IDC Corporate: Needham, MA, USA, 2010; pp. 74–75.
- Huth, D.; Stojko, L.; Matthes, F. A Service Definition for Data Portability. In Proceedings of the International Conference on Enterprise Information Systems, Heraklion, Greece, 3–5 May 2019.
- Kadam, S.P.; Joshi, S.D. Secure by design approach to improve security of object oriented software. In Proceedings of the 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 11–13 March 2015; pp. 24–30.
- Kern, C. Secure by Design at Google; Technical Report, Google Security Engineering; Google Research: Mountain View, CA, USA, 2024.
- Howard, M.; Lipner, S. The Security Development Lifecycle; Microsoft Press: Redmond, WA, USA, 2006.
- Paul, A.; Manoj, R.; S, U. Amazon Web Services Cloud Compliance Automation with Open Policy Agent. In Proceedings of the 2024 International Conference on Expert Clouds and Applications (ICOECA), Bengaluru, India, 18–19 April 2024.
- Arostegi, M.; Torre-Bastida, A.I.; Bilbao, M.N.; Ser, J.D. A heuristic approach to the multicriteria design of IaaS cloud infrastructures for Big Data applications. Expert Syst. 2018, 35, e12259.
- Megahed, M.E.; Badry, R.M.; Gaber, S.A. Survey on Big Data and Cloud Computing: Storage Challenges and Open Issues. In Proceedings of the 2023 4th International Conference on Communications, Information, Electronic and Energy Systems (CIEES), Plovdiv, Bulgaria, 23–25 November 2023; pp. 1–6.
- Zagan, E.; Danubianu, M. Cloud DATA LAKE: The new trend of data storage. In Proceedings of the 2021 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Ankara, Turkey, 11–13 June 2021; pp. 1–4.
- Dworkin, M. Recommendation for Block Cipher Modes of Operation. Methods and Techniques; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2001.
- Sayed, B.Y.; Mahmoud, A.M.; El-Rabaie, E.S.; Bauomy, N.A.S. CBSB: Robust Cancelable Biometric System for Banks Using Deep Learning. In Proceedings of the 2023 11th International Japan-Africa Conference on Electronics, Communications, and Computations (JAC-ECC), Alexandria, Egypt, 18–20 December 2023; pp. 30–33.
- Sawadogo, P.; Darmont, J. On data lake architectures and metadata management. J. Intell. Inf. Syst. 2021, 56, 97–120.
- Giebler, C.; Gröger, C.; Hoos, E.; Schwarz, H.; Mitschang, B. Leveraging the data lake: Current state and challenges. In Big Data Analytics and Knowledge Discovery, Proceedings of the 21st International Conference, DaWaK 2019, Linz, Austria, 26–29 August 2019; Proceedings 21; Springer: Berlin/Heidelberg, Germany, 2019; pp. 179–188.
- Madsen, M. How to Build an Enterprise Data Lake: Important Considerations Before Jumping; Third Nature Inc.: San Mateo, CA, USA, 2015; pp. 13–17.
- Gupta, S.; Giri, V. Practical Enterprise Data Lake Insights: Handle Data-Driven Challenges in an Enterprise Big Data Lake; Apress: Berkeley, CA, USA, 2018.
- Anisetti, M.; Ardagna, C.A.; Braghin, C.; Damiani, E.; Polimeno, A.; Balestrucci, A. Dynamic and scalable enforcement of access control policies for big data. In Proceedings of the 13th International Conference on Management of Digital EcoSystems, Virtual Event, 1–3 November 2021; pp. 71–78.
- Quinto, B. Big data governance and management. In Next-Generation Big Data: A Practical Guide to Apache Kudu, Impala, and Spark; Apress: Berkeley, CA, USA, 2018; pp. 495–506.
- Muñoz, A.P.; Martí, L.; Sánchez-Pi, N. Data Governance, a Knowledge Model Through Ontologies. In Proceedings of the Congreso Internacional de Tecnologías e Innovación, Guayaquil, Ecuador, 22–25 November 2021.
- Saed, K.A.; Aziz, N.A.; Ramadhani, A.W.; Hassan, N.H. Data Governance Cloud Security Assessment at Data Center. In Proceedings of the 2018 4th International Conference on Computer and Information Sciences (ICCOINS), Kuala Lumpur, Malaysia, 13–14 August 2018; pp. 1–4.
- Liu, W. How Data Security Could Be Achieved in The Process of Cloud Data Governance? In Proceedings of the 2022 2nd International Conference on Management Science and Software Engineering (ICMSSE 2022), Dali, China, 14–16 July 2022; Atlantis Press: Dordrecht, The Netherlands, 2022; pp. 114–120.
- Dingre, S.S. Exploration of Data Governance Frameworks, Roles, and Metrics for Success. J. Artif. Intell. Cloud Comput. 2023, 2, 1–3.
- Khatri, V.; Brown, C.V. Designing data governance. Commun. ACM 2010, 53, 148–152.
- Petersen, K.; Feldt, R.; Mujtaba, S.; Mattsson, M. Systematic mapping studies in software engineering. In Proceedings of the 12th International Conference on Evaluation and Assessment in Software Engineering (EASE), Bari, Italy, 26–27 June 2008; BCS Learning & Development: Swindon, UK, 2008.
- Negri-Ribalta, C.; Lombard-Platet, M.; Salinesi, C. Understanding the GDPR from a requirements engineering perspective—A systematic mapping study on regulatory data protection requirements. Requir. Eng. 2024, 29, 523–549.
- Sommerville, I. Software Engineering, 10th ed.; Series Software Engineering; Pearson: Boston, MA, USA, 2015; Volume 10.
- Steurer, J. The Delphi method: An efficient procedure to generate knowledge. Skelet. Radiol. 2011, 40, 959–961.
- Nadal, S.; Herrero, V.; Romero, O.; Abelló, A.; Franch, X.; Vansummeren, S.; Valerio, D. A software reference architecture for semantic-aware Big Data systems. Inf. Softw. Technol. 2017, 90, 75–92.
- Brooke, J. SUS: A ‘Quick and Dirty’ Usability Scale. In Usability Evaluation In Industry; CRC Press: Boca Raton, FL, USA, 1996; pp. 207–212.
- Lagos, J.; Cravero, A. Process Formalization Proposal for Data Ingestion in a Data Lake. In Proceedings of the 2022 41st International Conference of the Chilean Computer Science Society (SCCC), Santiago, Chile, 21–25 November 2022; pp. 1–8.
- Bangor, A.; Kortum, P.; Miller, J. Determining what individual SUS scores mean: Adding an adjective rating scale. J. Usability Stud. 2009, 4, 114–123.
- Rieyan, S.A.; News, M.R.K.; Rahman, A.M.; Khan, S.A.; Zaarif, S.T.J.; Alam, M.G.R.; Hassan, M.M.; Ianni, M.; Fortino, G. An advanced data fabric architecture leveraging homomorphic encryption and federated learning. Inf. Fusion 2024, 102, 102004.
- Yeng, P.K.; Diekuu, J.B.; Abomhara, M.; Elhadj, B.; Yakubu, M.A.; Oppong, I.N.; Odebade, A.; Fauzi, M.A.; Yang, B.; El-Gassar, R. HEALER2: A Framework for Secure Data Lake Towards Healthcare Digital Transformation Efforts in Low and Middle-Income Countries. In Proceedings of the 2023 International Conference on Emerging Trends in Networks and Computer Communications (ETNCC), Windhoek, Namibia, 16–18 August 2023; pp. 1–9.
- Shang, X.; Subenderan, P.; Islam, M.; Xu, J.; Zhang, J.; Gupta, N.; Panda, A. One stone, three birds: Finer-grained encryption with apache parquet@ large scale. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; pp. 5802–5811.
- Hamadou, H.B.; Pedersen, T.B.; Thomsen, C. The danish national energy data lake: Requirements, technical architecture, and tool selection. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 1523–1532.
- Revathy, P.; Mukesh, R. Analysis of big data security practices. In Proceedings of the 2017 3rd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), Tumkur, India, 21–23 December 2017; pp. 264–267.
- Rawat, D.B.; Doku, R.; Garuba, M. Cybersecurity in big data era: From securing big data to data-driven security. IEEE Trans. Serv. Comput. 2019, 14, 2055–2072.
- Zhao, X.; Zhang, C.; Guan, S. A data lake-based security transmission and storage scheme for streaming big data. Clust. Comput. 2024, 27, 4741–4755.
- Kai, L.; Liang, Z.; Yaojing, Y.; Dazhu, Y.; Min, Z. Research on Federated Learning Data Management Method Based on Data Lake Technology. In Proceedings of the 2023 International Conference on Computers, Information Processing and Advanced Education (CIPAE), Ottawa, ON, Canada, 26–28 August 2023; pp. 385–390.
- Gabriel, A.; Julio, F.; Juan, L. Chilean Rut Encryption Using FPE: A POC in Databricks. 2024. Available online: https://doi.org/10.5281/ZENODO.15320875 (accessed on 30 April 2025).
- Bellare, M.; Rogaway, P.; Spies, T. The FFX mode of operation for format-preserving encryption. NIST Submiss. 2010, 20, 1–18.
Original Text | Encrypted Text (FPE) |
---|---|
42437023 | 56548362 |
Hello World | Okvhu Lcsie |
4243-7023-1234-5678 | 5654-8362-9876-0421 |
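The property illustrated by the table can be stated as a check: the encrypted text has the same length as the original, and each position keeps its character class (digit, uppercase letter, lowercase letter, or the identical separator). The sketch below tests that property for the three rows; `char_class` and `preserves_format` are hypothetical helper names, and the class definitions are an assumption about what "format" means here.

```python
def char_class(ch: str) -> str:
    """Classify one character for the purpose of format comparison."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "upper" if ch.isupper() else "lower"
    return "literal:" + ch  # separators must stay exactly the same


def preserves_format(plain: str, cipher: str) -> bool:
    """True if cipher has the same length and per-position classes as plain."""
    if len(plain) != len(cipher):
        return False
    return all(char_class(p) == char_class(c) for p, c in zip(plain, cipher))


# The three example rows from the table.
EXAMPLES = [
    ("42437023", "56548362"),
    ("Hello World", "Okvhu Lcsie"),
    ("4243-7023-1234-5678", "5654-8362-9876-0421"),
]
```

A check like this is useful as a regression test for a masking job, since a format violation would break downstream systems that validate field shapes.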
Question ID | Question Text | Response Type |
---|---|---|
SQ1.1 | Regardless of whether you were a graduate or not, since approximately what date have you been working in IT? | Short answer (date) |
SQ1.2 | Regardless of whether you were a graduate or not, since approximately what date have you been working in Big Data? | Short answer (date) |
SQ1.3 | Regardless of whether you were a graduate or not, since approximately what date have you been working with Data Lake? | Short answer (date) |
SQ1.4 | Which role within the company most closely matches your functions? | Multiple choice |
Question ID | Statement |
---|---|
SQ2.1 | I think that I would like to use this protocol frequently. |
SQ2.2 | I found the protocol unnecessarily complex. |
SQ2.3 | I thought the protocol was easy to use. |
SQ2.4 | I think that I would need the support of a technical person to be able to use this protocol. |
SQ2.5 | I found the various functions of this protocol were well integrated. |
SQ2.6 | I thought there was too much inconsistency in this protocol. |
SQ2.7 | I would imagine that most people would learn to use this protocol very quickly. |
SQ2.8 | I found the protocol very cumbersome to use. |
SQ2.9 | I felt very confident using the protocol. |
SQ2.10 | I needed to learn a lot of things before I could get going with this protocol. |
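The ten statements follow the layout of the System Usability Scale, with odd-numbered items worded positively and even-numbered items worded negatively. Assuming the standard SUS scoring rule is applied to responses on a 1 to 5 scale (this excerpt does not state the scoring procedure), a score can be computed as sketched below; `sus_score` is a hypothetical helper name.

```python
def sus_score(responses):
    """Compute a SUS score (0-100) from ten Likert responses in 1..5.

    responses[0] corresponds to SQ2.1, responses[9] to SQ2.10.
    """
    if len(responses) != 10:
        raise ValueError("SUS needs exactly ten responses")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("responses must be between 1 and 5")
    total = 0
    for item, r in enumerate(responses, start=1):
        # Odd items contribute (r - 1); even items contribute (5 - r).
        total += (r - 1) if item % 2 == 1 else (5 - r)
    return total * 2.5
```

For example, a respondent who answers 3 to every statement scores 50.0, and one who fully agrees with every positive item and fully disagrees with every negative item scores 100.0.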
Question ID | Quality Attribute | Statement |
---|---|---|
SQ3.1 | Usefulness | The presented protocol would be useful in my work. |
SQ3.2 | Satisfaction | Overall, I feel satisfied with the presented protocol. |
SQ3.3 | Trust | I would trust the protocol to handle my work with sensitive data. |
SQ3.4 | Perceived Relative Benefit | Using the proposed protocol would be an improvement with respect to my current way of handling and analyzing sensitive data. |
SQ3.5 | Functional Completeness | In general, the proposed protocol covers the needs of my work. |
SQ3.6 | Functional Appropriateness | The proposed protocol facilitates the management of the work with sensitive data. |
SQ3.7 | Willingness to Adopt | I would like to adopt the protocol in my work. |
ID | Research Question |
---|---|
RQ1 | What types of contributions are found in the selected documents? |
RQ2 | What encryption techniques are used in big data tools for processing personal and sensitive data? |
RQ3 | What encryption techniques are applied to data in data lake repositories? |
RQ4 | What data format requirements are applied to data in data lake repositories? |
RQ5 | What other strategies for protecting personal and sensitive data are found in the selected documents? |
RQ6 | What are the industry domains presented where personal and sensitive data protection is applied? |
RQ7 | In documents that present future work, what kind of challenges and gaps are identified? |
Search Engine | Documents After Query | Documents After Inclusion/Exclusion Criteria |
---|---|---|
Scopus | 533 | 5 |
WoS | 10 | 1 |
IEEE | 9 | 2 |
ACM | 27 | 1 |
Company Position | Quantity | Exp. in IT (yrs) | Exp. in BD (yrs) | Exp. in DL (yrs) |
---|---|---|---|---|
Big Data Consultant | 13 | 5.47 | 1.87 | 1.77 |
Director | 1 | 16.00 | 16.00 | 3.50 |
In Training | 4 | 0.47 | 0.41 | 0.38 |
Web and App Developer | 6 | 1.17 | 0.64 | 0.58 |
Technical Lead | 2 | 12.68 | 7.18 | 5.39 |
BI Consultant | 2 | 9.86 | 3.26 | 3.26 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Lagos-Obando, J.; Aillapán, G.; Fenner-López, J.; Bustamante-Mora, A.; Burgos-López, M. A Secure-by-Design Approach to Big Data Analytics Using Databricks and Format-Preserving Encryption. Appl. Sci. 2025, 15, 10356. https://doi.org/10.3390/app151910356