1. Introduction
Data privacy and the ethical challenges that accompany it remain under close scrutiny as the digital revolution advances. Data privacy is a fundamental right, yet it is often overlooked when data are exchanged for commercial and scientific purposes [1].
The ethical considerations in data management extend beyond mere privacy concerns. They encompass a broader spectrum of responsibilities, including the fair and transparent use of data, ensuring data integrity, and respecting the rights of data subjects. In this context, ethics in data management can be defined as the principles and practices that guide the responsible collection, processing, storage, and use of data, with a focus on protecting individual rights and maintaining public trust.
Requiring organizations to handle data securely and ethically protects individuals' rights to control their data and builds user trust. It also ensures compliance with strict regulations that mitigate risks such as unauthorized access, misuse, and breaches of sensitive data. As a result, understanding the complexities of data privacy is critical for advancing research, technology, and policy-making in the digital age [2]. These issues, together with ongoing advances in technology and concerns about the improper use of personal data by governments and corporations, have prompted laws that clarify data privacy rights and aim to ensure an appropriate level of protection for personal data worldwide [3].
This paper primarily draws its requirements for ethical and privacy considerations from two key regulations: the General Data Protection Regulation (GDPR) [
4] and the Health Insurance Portability and Accountability Act (HIPAA) [
5]. These regulations were chosen due to their comprehensive nature and wide-ranging impact on data management practices globally. The GDPR, a regulation (not a directive) in EU law, provides a robust framework for data protection and privacy in the European Union and the European Economic Area. HIPAA, on the other hand, offers specific guidelines for protecting sensitive patient health information in the United States. Together, these regulations provide a solid foundation for discussing data privacy and ethical considerations in database management.
It is important to note that, throughout this paper, we will use the term “data subject” to refer to individuals whose personal data is being processed, aligning with the terminology used in the GDPR. This choice reflects our focus on regulatory compliance and helps in precisely scoping the requirements discussed in this paper.
Regulations such as the aforementioned GDPR [4] and HIPAA [5] help organizations safeguard the personal data under their responsibility and ensure that they are informed of, and able to react to, incidents such as data breaches. In addition, HIPAA and GDPR significantly affect database systems, including their architecture, security protocols, and user access controls [6]. Adherence to these regulations requires features that support their requirements, such as fields for consent status and attributes for data classification in database schemas [7]. Under these regulations, access to data is carefully controlled, following the principle of least privilege so that only authorized users are granted access. Security measures are also tightened to protect sensitive data, requiring robust encryption methods, intrusion detection systems, and strict access controls. These are some of the measures that these regulations call for [8].
Database administrators (DBAs) are vital in ensuring the security, integrity, and confidentiality of stored data, and they are essential to ethically sound data practices within corporate environments. Their principal responsibility lies in implementing robust data security measures to safeguard databases against unauthorized access and malicious actors [9], while taking database design, security protocols, and user access permissions into consideration to comply with HIPAA and GDPR [10].
A critical ethical consideration in database management is the right of data subjects to request the removal of their data, often referred to as the “right to be forgotten” in the GDPR. Database designs and management practices must incorporate mechanisms to efficiently and completely remove an individual’s data upon request, while maintaining the integrity of the remaining data and complying with any legal retention requirements.
This paper presents a practical application of data privacy principles to the management of a patient health dataset that mimics real-world healthcare data. The dataset was acquired from Kaggle [11], a popular online platform that hosts datasets, machine learning competitions, tools, and tutorials for the data science community.
The results obtained from implementing data minimization, anonymization, pseudonymization, encryption, and access control practices in the “Healthcare Dataset” demonstrate significant improvements in the security and privacy of stored health information, albeit with a slight performance impact. Data minimization reduces the amount of sensitive information stored, lowering the risk of exposure and privacy breaches. Anonymization and pseudonymization add a layer of protection but reduce data granularity, which makes detailed analysis more challenging. Encryption using the Advanced Encryption Standard algorithm, facilitated by the pgcrypto extension in PostgreSQL, effectively safeguards sensitive data from unauthorized access, although at the cost of increased processing time. Lastly, access control measures enhance system security by carefully assigning permissions to database users.
In addition to encryption, this paper will also discuss hashing techniques, which are crucial for verifying data integrity. We will explore various hashing algorithms and their applications in database management, complementing our discussion on encryption methods.
This paper is structured as follows.
Section 2 provides a literature review on works related to database management, data privacy, and ethical considerations. In
Section 3, we introduce specific techniques and rules that database administrators should follow regarding ethical issues such as minimization, anonymization, pseudonymization and data encryption, access controls, and transparent communication with stakeholders.
Section 4 presents a real-world database scenario that shows how to implement ethical considerations and best practices.
Section 5 discusses the results and analyses obtained by applying ethical considerations and best practices. In
Section 6, we provide recommendations for more research in this area, critically evaluate the effects and implications of our findings, and summarize the study’s findings.
Throughout this paper, we aim to provide a comprehensive overview of data privacy and ethical considerations in database management, addressing not only technical aspects but also the broader ethical implications of data handling practices.
2. Literature Review
This section presents works related to data privacy and ethical concerns in database management.
In one paper [
12], the authors present a tool used in PostgreSQL to mask and anonymize the information of a database. They show techniques like masking parts of the text commonly applied to credit card numbers, where only the last four digits remain visible; date aging that substitutes the value with a range of ages; and nullifying columns where the actual values are swapped for a null value. This technique shows an example of how to help database managers protect critical user information to prevent data infringements.
Another paper [13] analyses the performance of encryption tools on three databases, one of which is PostgreSQL. For this database, the authors applied a tool called pgcrypto, which provides several functions, such as password hashing, Pretty Good Privacy (PGP) encryption functions, and random-data functions. The authors applied these security measures and evaluated the performance of the databases under different workloads. They state that PostgreSQL provides good data protection, although its performance may be slightly impacted.
In [14], the authors study privacy methods and anonymization techniques that protect sensitive data in databases against attacks and re-identification, and they evaluate the performance of these techniques. The methods applied include k-anonymity, l-diversity, suppression, generalization, anatomization, and slicing. The authors conclude that some information must be lost to protect data privacy and prevent re-identification, although attacks can still be performed.
The authors of [
15] present a study that analyses an anonymization process of databases of banking applications while applying anonymization methodologies to improve its security. The authors found some challenges, given that the anonymization process can be very complex and time consuming, mainly when dealing with vast amounts of data. In addition, they explain the different contexts, whether organizational, functional/business, or technical, when applying anonymization techniques.
Article [
16] addresses the possible risks and challenges related to data minimization, which consists of limiting the amount of personal information collected, stored, and processed to only what is necessary to achieve a specific objective. Despite being commonly recommended as an excellent privacy measure, the article points out that data minimization can have unexpected and potentially harmful consequences.
In one paper, the researchers demonstrated that, by carefully minimizing the data collected about a student’s daily activities, they could still obtain valuable insights for personalized learning while effectively protecting the student’s privacy. This application shows how data minimization can be used creatively in specific contexts to combine the need for data-driven decision making with robust privacy protections [
17].
Other authors published a study outlining a methodology for pseudonymizing medical data [
18]. This involved hashing and encrypting patient identifiers and securely storing the encryption key, allowing for data analysis while preserving patient confidentiality.
A study carried out in [
19] highlights the importance of choosing algorithms when comparing data encryption performance. The research concluded that, while solid encryption is essential, the algorithm also needs to ensure adequate processing to avoid possible negative impacts on the performance of the selected database.
In the context of ethical considerations in data management, the Royal Society’s report on “Privacy Enhancing Technologies” [
20] provides valuable insights into the intersection of technology and privacy. The report discusses various technologies and methodologies that can be employed to enhance privacy in data-intensive applications, which is particularly relevant to our discussion on ethical database management.
Recent developments in data governance and protection regulations, such as the EU’s Data Act [
21], Data Governance Act [
22], Digital Services Act (DSA) [
23], and the proposed EU AI Act [
24], are also worth considering for future research. These regulations introduce new requirements and considerations for data management and use, particularly in the context of AI and digital services.
Addressing the ethical implications of data management, a study by [
25] explores the ethical challenges posed by big data. The authors discuss issues such as privacy, anonymity, transparency, and trust in the context of large-scale data collection and analysis. This work provides a broader perspective on the ethical considerations that should inform database management practices.
In response to the growing concern over the right to be forgotten, ref. [
26] discusses the technical and legal challenges of implementing this right in database systems. The paper explores the tensions between data retention, the right to privacy, and the practical implications for database design and management.
Regarding the specific requirements derived from GDPR and HIPAA, ref. [
27] provides a comprehensive analysis of the GDPR’s impact on data management practices. Similarly, ref. [
28] offers insights into the implications of HIPAA for healthcare data management. These works help in understanding the specific requirements that shape ethical and privacy considerations in database management.
Although the work in these papers differs from ours, it serves as a guide for conducting our study.
3. Best Practices
This section presents ethical best practices for database administrators, such as data retention policies, data minimization, data anonymization and pseudonymization, access controls, data encryption, data hashing, and transparent communication with stakeholders.
3.1. Data Retention Policies
Data retention policies have a vital role in upholding data security and confidentiality. Their implementation and upkeep carry considerable weight for businesses, particularly those with data-centric models like research service providers [
29]. Without indefinite data retention, there is a risk of losing collective memory, leading to a misleading perception of privacy and freedom [
30].
Therefore, some of the best methods to develop data retention policies are the following:
Data assessment: Identify and classify the types of data the organization stores, considering the level of risk to which they are subject. Personal data, such as Personally Identifiable Information (PII), should be prioritized and managed rigorously.
Clear retention periods: Collaborate with data owners to establish and validate appropriate retention periods for the different data categories, balancing legal and operational mandates while safeguarding confidentiality.
Deletion processes: Implement efficient strategies for purging obsolete or irrelevant data within specified retention timelines, mitigating data accumulation and minimizing the risk of privacy breaches or unauthorized disclosures.
Restricted access control: Ensure strict control over access to sensitive data, permitting access only to the authorized personnel who need it for organizational functions, while maintaining comprehensive monitoring and logging mechanisms to detect and address any unauthorized access.
Continuous assessment and improvement: View data retention policies as dynamic, requiring ongoing assessment and enhancement, with data controllers staying up to date on evolving legal frameworks, security protocols, and technological advancements to ensure compliance and maximize effectiveness.
While following the above methods, some key principles must be observed when building data retention policies:
Have a precise, legal, and legitimate purpose;
Establish a precise and time-limited retention period;
Data must be relevant and necessary;
Data must be stored for the shortest possible time;
Data is safe and secure.
Implementing these policies has several implications:
Data volume management: The surge in data volume requires efficient storage, processing, and deletion approaches that preserve critical organizational data while reducing the risk of excessive accumulation.
Difficulty in data classification: Accurate and consistent data classification is essential for successfully applying retention policies. However, organizations often struggle to classify data accurately.
Ongoing regulatory changes: Data controllers must stay updated with changes to relevant laws and regulations, as well as security and privacy best practices and technologies.
Nonetheless, implementing data retention policies is challenging: data managers must balance operational needs against strict compliance requirements [29].
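As an illustration of the deletion processes described above, the following minimal sketch purges records whose retention period has elapsed. It uses Python's standard-library sqlite3 module rather than any particular production DBMS; the table `patient_records`, its columns, the data categories, and the retention periods are hypothetical and chosen only for illustration.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per data category, in days (assumed values).
RETENTION_DAYS = {"billing": 3650, "marketing": 365}


def purge_expired(conn, now):
    """Delete rows whose retention period has elapsed; return the rows removed."""
    removed = 0
    for category, days in RETENTION_DAYS.items():
        cutoff = (now - timedelta(days=days)).isoformat()
        cur = conn.execute(
            "DELETE FROM patient_records WHERE category = ? AND created_at < ?",
            (category, cutoff),
        )
        removed += cur.rowcount
    conn.commit()
    return removed


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient_records ("
    "id INTEGER PRIMARY KEY, category TEXT, created_at TEXT)"
)
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
rows = [
    ("marketing", (now - timedelta(days=400)).isoformat()),  # past retention
    ("marketing", (now - timedelta(days=30)).isoformat()),   # still retained
    ("billing", (now - timedelta(days=400)).isoformat()),    # still retained
]
conn.executemany(
    "INSERT INTO patient_records (category, created_at) VALUES (?, ?)", rows
)
removed = purge_expired(conn, now)
remaining = conn.execute("SELECT COUNT(*) FROM patient_records").fetchone()[0]
```

Keeping the retention periods in a single mapping makes them easy to review with data owners. In practice, such a purge would typically run as a scheduled job and be logged for accountability.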
3.2. Ethical Considerations in Data Retention
When implementing data retention policies, several ethical considerations must be taken into account:
Right to be forgotten: As per GDPR Article 17, data subjects have the right to request the erasure of their personal data. Database designs must incorporate mechanisms to efficiently and completely remove an individual’s data upon request [
31].
Purpose limitation: Data should only be collected and retained for specified, explicit, and legitimate purposes, as outlined in GDPR Article 5(1)(b) [
32].
Data minimization: Organizations should limit the collection and retention of personal data to what is necessary for the specified purposes, in line with GDPR Article 5(1)(c) [
33].
Storage limitation: Personal data should be kept in a form which permits the identification of data subjects for no longer than is necessary for the purposes for which the personal data are processed, as per GDPR Article 5(1)(e) [
34].
These ethical considerations ensure that data retention policies not only comply with legal requirements but also respect the rights and privacy of data subjects.
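One possible way to support erasure requests at the schema level is to let the database cascade the deletion to dependent rows while refusing erasure when a legal retention requirement applies. The sketch below uses Python's standard-library sqlite3 module; the tables `data_subjects` and `visits` and the `legal_hold` flag are hypothetical names introduced for illustration, not part of any regulation or of the case study.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE data_subjects (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        legal_hold INTEGER NOT NULL DEFAULT 0  -- 1 = must be retained by law
    );
    CREATE TABLE visits (
        id INTEGER PRIMARY KEY,
        subject_id INTEGER NOT NULL
            REFERENCES data_subjects(id) ON DELETE CASCADE,
        diagnosis TEXT
    );
    INSERT INTO data_subjects (id, name, legal_hold)
        VALUES (1, 'Alice', 0), (2, 'Bob', 1);
    INSERT INTO visits (subject_id, diagnosis)
        VALUES (1, 'flu'), (1, 'asthma'), (2, 'flu');
    """
)


def erase_subject(conn, subject_id):
    """Erase a data subject and all dependent rows unless a legal hold applies."""
    row = conn.execute(
        "SELECT legal_hold FROM data_subjects WHERE id = ?", (subject_id,)
    ).fetchone()
    if row is None or row[0]:
        return False  # unknown subject, or retention is legally required
    conn.execute("DELETE FROM data_subjects WHERE id = ?", (subject_id,))
    conn.commit()
    return True


erased_alice = erase_subject(conn, 1)  # erased together with her visits
erased_bob = erase_subject(conn, 2)    # refused because of the legal hold
visits_left = conn.execute("SELECT subject_id FROM visits").fetchall()
```

Declaring the cascade in the schema means that erasure cannot leave orphaned rows behind, which helps preserve the integrity of the remaining data. Backups and derived datasets would still need separate handling.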
3.3. Data Minimization
Regarding database security, a practical approach to safeguarding sensitive information is to reduce the amount of data collected and processed [8]. Collecting only appropriate data minimizes the risks of unauthorized access, data breaches, and misuse, and it aligns with the general principle, found in regulations such as the GDPR, of collecting the minimum data necessary to achieve a specific purpose.
For DBAs, implementing data minimization requires careful collaboration with other stakeholders to define essential data elements and formulate data guidelines [35]. It also requires strict controls over data processing activities to prevent the intentional or accidental dissemination of sensitive information. By reducing the overall data footprint and limiting access to authorized personnel only, DBAs can significantly improve the security posture of their databases [36].
Table 1 and
Table 2 show an example of how data minimization works in the context of students’ data:
This example illustrates how data minimization works given the large amount of information shown in
Table 1. By carefully minimizing the personal information collected about a student, one can retain the information that matters for the purpose at hand without disclosing the rest, thereby protecting the student’s privacy.
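The same idea can be sketched in a few lines of Python. The field names and the stated purpose (grade reporting) are hypothetical and do not reproduce the actual columns of the tables; the point is only that the fields to keep are declared explicitly and everything else is dropped.

```python
# Hypothetical full student record (illustrative fields only).
student = {
    "student_id": 1042,
    "name": "Jane Doe",
    "home_address": "12 Example Street",
    "phone": "555-0100",
    "course": "Databases",
    "grade": 17,
}

# Fields assumed necessary for the stated purpose (grade reporting).
REQUIRED_FIELDS = ("student_id", "course", "grade")


def minimize(record, required=REQUIRED_FIELDS):
    """Keep only the fields needed for the processing purpose."""
    return {field: record[field] for field in required}


minimal = minimize(student)
```

Using an explicit allow-list, rather than a list of fields to remove, means that any new field added to the record later is excluded by default.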
3.4. Ethical Implications of Data Minimization
While data minimization is generally considered a good privacy practice, it is important to consider its potential ethical implications:
Loss of context: Excessive data minimization might lead to a loss of important contextual information, potentially affecting the quality of data analysis or decision-making processes [
16].
Bias introduction: If not carefully implemented, data minimization could inadvertently introduce bias by selectively removing certain types of data [
37].
Balancing utility and privacy: There is an ongoing ethical challenge in finding the right balance between minimizing data for privacy protection and retaining enough data for meaningful analysis and use [
17].
DBAs and organizations must carefully consider these ethical implications when implementing data minimization strategies to ensure they protect privacy without compromising the utility and fairness of their data practices.
3.5. Anonymization and Pseudonymization of Data
Anonymization and pseudonymization methods are critical in safeguarding confidential data [
38]. These techniques replace directly identifiable information with non-identifiable alternatives, significantly reducing the risk of individuals being re-identified within the data.
Anonymization involves removing or concealing all Personally Identifiable Information (PII) from a database. Techniques such as hashing, tokenization, and generalization replace specific data with non-identifying codes or modify it to remove specificity [
39]. For example, in
Table 3, the application of anonymization techniques overwrites the name of the subject with fictitious personal information, as illustrated in
Table 4.
Pseudonymization is another approach to protect privacy, particularly in healthcare settings [
40]. For instance, patient records in a healthcare database may be pseudonymized by hashing the patient’s name and date of birth and storing the hash and encrypted data. Pseudonymized data cannot be directly linked to a specific individual without additional information.
Table 5 shows the student name before pseudonymization and
Table 6 shows the student name after pseudonymization.
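A minimal sketch of pseudonymizing a patient's name and date of birth is given below. Note that it uses a keyed hash (HMAC-SHA-256) rather than a plain hash; this is our assumption, chosen because a secret key makes it harder to recompute pseudonyms by guessing names. The key handling, field names, and sample record are likewise hypothetical.

```python
import hashlib
import hmac
import secrets

# Secret key kept separately from the data (assumption: held in a key vault).
PSEUDONYM_KEY = secrets.token_bytes(32)


def pseudonymize(name, date_of_birth, key=PSEUDONYM_KEY):
    """Derive a stable pseudonym from identifying fields with HMAC-SHA-256."""
    message = f"{name.strip().lower()}|{date_of_birth}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


record = {"name": "Jane Doe", "date_of_birth": "1990-04-12", "diagnosis": "asthma"}

# The stored row carries the pseudonym instead of the direct identifiers.
pseudonymized = {
    "patient_ref": pseudonymize(record["name"], record["date_of_birth"]),
    "diagnosis": record["diagnosis"],
}
```

Because the same inputs and key always yield the same pseudonym, records belonging to one patient can still be linked for analysis, while re-identification requires the separately stored key.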
There are other techniques used in the anonymization process to protect data [
39] that include the following:
Remove Attributes (Suppression): This method eliminates attributes from the dataset that could help identify individuals. It should be used when an attribute is irrelevant or unnecessary for analysis or when anonymization is unfeasible by any other means.
Character Replacement (Masking): This technique involves using neutral characters, such as the “*”, to hide personal and vital information. This replacement can partially hide a text or attribute, which can be sufficient to anonymize the data.
Scrambling/Shuffling: This technique involves randomly mixing or rearranging the data while retaining the values of the original attributes in the dataset. It allows individual attributes to be analyzed independently when their correlation with other attributes is not required.
Data Noise (Perturbation): This technique consists of minor modifications to the dataset’s attributes to make them less precise and remove possible identifications. It is crucial to understand how much noise can be applied to the data without compromising either the analysis or individuals’ privacy.
Generalization: This technique involves modifying attributes by changing their scale or order of magnitude to provide an overview of each attribute.
Aggregation: This method condenses data into summarized versions with fewer attributes through standardization and grouping similar data. Aggregation differs from generalization in that it actively alters the data, as opposed to the basic setting applied to each attribute in generalization.
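Several of the techniques listed above can be sketched as small Python functions. This is an illustrative sketch only: the function names, default parameters, and sample record are hypothetical, and real deployments would need to assess re-identification risk rather than rely on these transformations alone.

```python
import random


def suppress(record, attributes):
    """Suppression: drop identifying attributes entirely."""
    return {k: v for k, v in record.items() if k not in attributes}


def mask(value, visible=4, fill="*"):
    """Masking: hide all but the last `visible` characters."""
    hidden = max(len(value) - visible, 0)
    return fill * hidden + value[hidden:]


def generalize_age(age, width=10):
    """Generalization: replace an exact age with a range."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def shuffle_column(values, seed=None):
    """Shuffling: permute a column so values are kept but detached from rows."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def perturb(value, max_noise, rng):
    """Perturbation: add small bounded random noise to a numeric value."""
    return value + rng.uniform(-max_noise, max_noise)


patient = {"name": "Jane Doe", "card": "4111111111111111", "age": 47}
anonymized = suppress(patient, {"name"})
anonymized["card"] = mask(anonymized["card"])
anonymized["age"] = generalize_age(anonymized["age"])
```

Each function reduces precision in a different way, so the choice among them depends on which attributes the intended analysis still needs.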
Anonymization and pseudonymization techniques are essential for complying with regulations such as GDPR and HIPAA. They enable organizations to lawfully process and share personal data for various purposes while mitigating the risk of unauthorized access and misuse.
3.6. Ethical Considerations in Anonymization and Pseudonymization
While anonymization and pseudonymization are crucial for protecting privacy, they also raise several ethical considerations:
Re-identification risk: Even with these techniques, there is always a risk of re-identification, especially with advances in data analytics. Ethical database management requires ongoing assessment of these risks [
18].
Data utility vs. privacy: There is often a trade-off between the level of anonymization and the usefulness of the data. Striking the right balance is an ethical challenge [
41].
Informed consent: It is crucial to consider whether data subjects have given informed consent for their data to be anonymized or pseudonymized, especially if the original purpose of data collection changes [
42].
Transparency: Organizations should be transparent about their anonymization and pseudonymization processes to maintain trust with data subjects [
43].
These ethical considerations highlight the need for a thoughtful and balanced approach to implementing anonymization and pseudonymization techniques in database management.
3.7. Data Encryption
In database security, data encryption is essential to protect sensitive information. By selectively encrypting critical data and limiting access to encryption keys, DBAs can establish effective protection against unauthorized data access [35].
The efficacy of data encryption hinges on adherence to various best practices. First, meticulously evaluating the data types that justify encryption is essential. Sensitive data, including Personally Identifiable Information (PII), Protected Health Information (PHI), and proprietary data must be prioritized. Additionally, DBAs must opt for appropriate encryption algorithms that balance security strength and practical performance considerations. Regarding database encryption, data protection can be implemented across multiple levels, including cell, column, tablespace, and file levels.
There are two types of Data Encryption Techniques:
Symmetric database encryption uses the same secret key for data encryption and decryption as illustrated in
Figure 1. It is a simple and effective solution at a low cost. The main problem with this method is keeping the key secret and sharing it securely among authorized users. Examples of symmetric encryption algorithms include 3DES, AES, DES, QUAD, and RC4.
Asymmetric database encryption, also known as public-key encryption, encrypts and decrypts data using a pair of public and private keys, as shown in
Figure 2. Data are encrypted with the public key and can only be decrypted with the associated private key. Because the public key can be shared openly while the private key stays secret, asymmetric database encryption resolves the key distribution problem of symmetric encryption. Examples of asymmetric algorithms include Diffie–Hellman, ECC, El Gamal, DSA, and RSA.
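The public/private key relationship can be illustrated with a toy version of textbook RSA. The sketch below deliberately uses tiny primes and no padding, so it is insecure and serves only to show that a value encrypted with the public key is recovered with the private key. The primes, exponent, and message are arbitrary illustrative values, and messages are assumed to be integers smaller than the modulus.

```python
# Toy textbook RSA with tiny primes: for illustration only, NOT secure.
p, q = 61, 53                # private primes (real keys use far larger primes)
n = p * q                    # public modulus
phi = (p - 1) * (q - 1)      # Euler's totient of n
e = 17                       # public exponent, coprime with phi
d = pow(e, -1, phi)          # private exponent: inverse of e mod phi (Python 3.8+)

public_key = (e, n)
private_key = (d, n)


def encrypt(message, key):
    """Encrypt an integer message (0 <= message < n) with the public key."""
    exponent, modulus = key
    return pow(message, exponent, modulus)


def decrypt(ciphertext, key):
    """Decrypt with the matching private key."""
    exponent, modulus = key
    return pow(ciphertext, exponent, modulus)


ciphertext = encrypt(65, public_key)
recovered = decrypt(ciphertext, private_key)
```

Only `public_key` needs to be distributed to those who encrypt data. Production systems should rely on a vetted cryptographic library instead of code such as this.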
Data encryption is essential for database security, ensuring confidentiality while helping to maintain the integrity and availability of information. Among the symmetric encryption algorithms, the Data Encryption Standard (DES) has been fundamental to information security. By transforming plaintext into ciphertext using a shared secret key, symmetric algorithms ensure that the encrypted data remain indecipherable without the key in the event of unauthorized access to the database [44]. However, DES is no longer considered safe for new systems or applications requiring high security. The Advanced Encryption Standard (AES) is the symmetric-key encryption algorithm that replaced DES and is widely used in modern systems and applications that require a high level of security [45].
3.8. Data Hashing
Hashing is a fundamental cybersecurity and data management technique that transforms data into a hash value, a fixed-length sequence of characters. With a well-designed hash function, different inputs are, in practice, represented by distinct hash values. Hashing is widely used because it allows for the verification of data integrity and the protection of confidential information without exposing the original data.
Many programs can transform the text into a hash string. Common hashing algorithms include the following:
MD5: A hash function that compresses data or messages of any length into a 128-bit hash value [46]. The same input always produces the same output and, because the function is one-way, the process cannot be reversed.
Figure 3, from [
47], demonstrates how the MD5 algorithm is used.
Even though MD5 has been widely used for certificate authentication, it is vulnerable to hash collisions [48]. As a result, it is no longer acceptable where collision resistance is required, such as in digital signatures, as stated by [49]. It may still be used, or even preferred, where collision resistance is not a concern, because of its lower computational requirements compared with the more recent Secure Hash Algorithms.
SHA-256: A successor of SHA-1 and one of the most robust hash functions available [50]. It is a highly secure one-way hash function, which makes it computationally infeasible to recover the original message from the hash value [51]. The SHA-256 architecture describes the steps of message processing, including message input, padding, expansion, scheduling, and compression, to produce a 256-bit hash value.
Figure 4 demonstrates the different stages, including Message Padding for message length adjustment, Message Expansion to determine the original message length before padding, Message Scheduler to schedule message block processing, and Message Compression to generate a 256-bit message digest after 64 rounds [
51].
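The digest sizes and the padding stage mentioned above can be checked with a short sketch based on Python's standard-library hashlib module. The helper `sha256_pad` is a hypothetical name for an illustrative re-implementation of the padding step only; hashlib performs padding internally, so this function is not needed in practice.

```python
import hashlib
import struct


def sha256_pad(message: bytes) -> bytes:
    """SHA-256 padding: a 0x80 byte, zero bytes, then the 64-bit bit length."""
    bit_length = len(message) * 8
    padded = message + b"\x80"
    # Pad with zeros until the length is 56 bytes modulo 64.
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    # Append the original length in bits as a big-endian 64-bit integer.
    return padded + struct.pack(">Q", bit_length)


message = b"abc"
padded = sha256_pad(message)  # always a multiple of 64 bytes (512 bits)

md5_digest = hashlib.md5(message).hexdigest()        # 128-bit digest
sha256_digest = hashlib.sha256(message).hexdigest()  # 256-bit digest

# Each hexadecimal character encodes 4 bits.
md5_bits = len(md5_digest) * 4
sha256_bits = len(sha256_digest) * 4
```

Hashing the same message again yields identical digests, which is the property that makes hashes usable as integrity checks.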
In addition to MD5 and SHA-256, other hashing algorithms such as SHA-3, Blake2, and Whirlpool are also employed in various applications, offering a range of security and performance features to suit different cybersecurity and data management needs. These hashing algorithms ensure that multiple options are available, each with specific requirements for data integrity verification and secure information protection, enhancing the overall security of various systems.
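As a sketch of integrity verification with a selectable algorithm, the example below stores a digest for a record and later checks whether the record has changed. It relies on Python's standard-library hashlib, which always provides SHA-256, SHA-3, and BLAKE2; Whirlpool is omitted because its availability depends on the underlying OpenSSL build. The record contents and function names are hypothetical.

```python
import hashlib
import hmac


def fingerprint(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of `data` under the chosen hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def is_unchanged(data: bytes, expected: str, algorithm: str = "sha256") -> bool:
    """Recompute the digest and compare it with the stored one in constant time."""
    return hmac.compare_digest(fingerprint(data, algorithm), expected)


row = b"1042|Jane Doe|asthma"
stored = {
    name: fingerprint(row, name) for name in ("sha256", "sha3_256", "blake2b")
}

tampered = b"1042|Jane Doe|healthy"
row_ok = is_unchanged(row, stored["sha256"])
tampered_ok = is_unchanged(tampered, stored["sha256"])
```

A plain digest detects accidental or unauthorized modification only if the stored digest itself is protected; otherwise an attacker could replace both the record and its digest.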
3.9. Access Controls
In database security, establishing access controls is paramount since it reduces the likelihood of security incidents [52]. This process complements the collection of only necessary data by restricting access to prevent unauthorized disclosures, data breaches, and other security incidents. For database administrators (DBAs), integrating access control principles into the core of software systems is crucial for maintaining information integrity, confidentiality, and availability [53].
In addition to technical controls, developing clear guidelines and procedures for access control is essential. These guidelines should specify who can access specific data, under what circumstances, and for what purpose.
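Such guidelines can be made explicit as a deny-by-default permission table. The sketch below is a simplified, hypothetical model of role-based access control with an audit trail; the roles, tables, and actions are invented for illustration, and a real deployment would rely on the DBMS's own privilege system.

```python
# Hypothetical roles and permissions (illustrative only).
ROLE_PERMISSIONS = {
    "physician": {("patients", "read"), ("patients", "write")},
    "billing_clerk": {("invoices", "read"), ("invoices", "write")},
    "auditor": {("patients", "read"), ("invoices", "read")},
}

# Every access decision is recorded to support accountability.
audit_log = []


def is_allowed(role, table, action):
    """Least privilege: deny unless the role was explicitly granted the action."""
    allowed = (table, action) in ROLE_PERMISSIONS.get(role, set())
    audit_log.append((role, table, action, allowed))
    return allowed


physician_can_write = is_allowed("physician", "patients", "write")
auditor_can_write = is_allowed("auditor", "patients", "write")
unknown_can_read = is_allowed("intern", "patients", "read")
```

Because unknown roles and ungranted actions are denied by default, a missing entry in the table fails safely rather than exposing data.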
3.10. Ethical Considerations in Access Control
Implementing access controls raises several ethical considerations:
Privacy vs. Utility: Balancing the need for data protection with the need for data access to perform necessary functions is an ongoing ethical challenge [
54].
Transparency: Organizations should be transparent about their access control policies to maintain trust with data subjects and stakeholders [
55].
Non-discrimination: Access control measures should be implemented in a way that does not unfairly discriminate against certain groups or individuals [
56].
Accountability: There should be mechanisms in place to ensure that those who access data are accountable for their actions [
57].
These ethical considerations underscore the importance of the thoughtful and fair implementation of access controls in database management.
3.11. Transparent Communication with Stakeholders
Effective and transparent communication with target audiences and other stakeholders is crucial for addressing the challenges and opportunities of data management [58]. Communication is a powerful tool for explaining limitations and potential risks ahead of decision making. It is essential to find the right balance, avoiding both excessive communication, which causes information fatigue, and insufficient communication, which leads to misunderstandings and mistrust.
Regarding ethical considerations in data management, it is essential to deal with issues related to data privacy [
59]. Good data management practices must follow evolving privacy regulations, ensuring that sensitive information is collected, stored, and used following legal guidelines [
60]. By providing clear information about their data practices and allowing people to control their personal information, organizations can show their commitment to ethics.
To communicate efficiently with stakeholders, organizations need to adopt communication best practices, which encompass a four-step process: planning, implementation, monitoring, and evaluation [61].
Ideal practices for transparent communication are as follows:
Establish clear objectives and roles: This allows all stakeholders to understand the goals and their associated roles. This transparency contributes to clarity and aligns expectations from the beginning.
Two-way communication: Transparency involves maintaining open communication channels with stakeholders regarding the progress of projects or initiatives. Regular updates, together with opportunities for stakeholders to make inquiries and give feedback, encourage a sense of inclusion and mutual dedication.
Sharing positive and negative information: Demonstrating sincerity builds credibility and shows that the organization is not concealing information.
Use clear, concise language: Avoid technical terminology and overly complex terms that non-technical stakeholders may find challenging to comprehend. The goal is to ensure that information is accessible and easily understood by all relevant stakeholders.
Schedule regular communication: Set a periodic calendar of stakeholder meetings. This may involve weekly status updates, monthly progress reports, or ad hoc meetings as needed. Consistent communication keeps all parties informed and aligned.
3.12. Ethical Implications of Transparent Communication
Transparent communication in data management carries several ethical implications:
Trust building: Transparent communication helps build trust between organizations and their stakeholders, including data subjects [
62].
Informed consent: Clear communication is crucial for ensuring that data subjects can give truly informed consent for the use of their data [
63].
Accountability: Transparency in communication promotes accountability in data management practices [
64].
Empowerment: By providing clear information about data practices, organizations empower individuals to make informed decisions about their personal data [
65].
These ethical implications highlight the importance of prioritizing transparent communication in all aspects of data management.
5. Results and Analysis
In this section, we present and analyze the results obtained from implementing best practices in database management in our case study.
5.1. Data Minimization
To minimize the data in the dataset, we reduced the information stored in the “Gender” and “Test Results” columns.
In the “Gender” column, which records the patient’s biological sex, we replaced the entries “Male” and “Female” with “M” and “F”, respectively.
In the “Test Results” column, we replaced the categories “Normal”, “Abnormal”, and “Inconclusive” with “N”, “A”, and “I”, respectively.
After minimizing the data in these two columns, we obtained the results in
Table 7 and
Table 8:
Implementing data minimization practices in the dataset reduced the amount of data stored for these attributes, in keeping with the goal of limiting the risk of exposure and privacy violations. Reducing the amount of data collected also avoids unnecessary storage overhead and processing time.
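The recoding described above can be sketched in PostgreSQL as follows. This is a minimal illustration rather than the exact statements used in the case study: the temporary table patient_min_demo, its column names, and the sample rows are hypothetical stand-ins for the real dataset.

```sql
-- Hypothetical scratch table standing in for the two affected columns.
CREATE TEMPORARY TABLE patient_min_demo (
    id           integer PRIMARY KEY,
    gender       text,
    test_results text
);

INSERT INTO patient_min_demo VALUES
    (1, 'Male',   'Normal'),
    (2, 'Female', 'Abnormal'),
    (3, 'Female', 'Inconclusive');

-- Recode each category to its single-letter code.
-- Values outside the known categories are left unchanged rather than guessed.
UPDATE patient_min_demo
SET gender = CASE gender
                 WHEN 'Male'   THEN 'M'
                 WHEN 'Female' THEN 'F'
                 ELSE gender
             END,
    test_results = CASE test_results
                       WHEN 'Normal'       THEN 'N'
                       WHEN 'Abnormal'     THEN 'A'
                       WHEN 'Inconclusive' THEN 'I'
                       ELSE test_results
                   END;

SELECT id, gender, test_results FROM patient_min_demo ORDER BY id;
```

Keeping the ELSE branch means an unexpected value stays visible for review instead of being silently mapped to a wrong code.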
Ethical Implications of Data Minimization
While data minimization effectively reduces privacy risks, it is important to consider its ethical implications:
Data Utility vs. Privacy: By minimizing data, we may limit the potential for certain types of analysis or research. This presents an ethical dilemma between protecting individual privacy and potentially limiting beneficial uses of data [
16].
Potential Bias: Minimizing certain data fields (like gender) could potentially introduce or exacerbate bias in analyses if not carefully considered [
70].
Informed Consent: It is crucial to ensure that data subjects are aware of and consent to this type of data transformation, as it changes the nature of the data they initially provided [
71].
5.2. Anonymization and Pseudonymization of Data
To anonymize and pseudonymize the data in the dataset, we pseudonymized the information in the “Name” column and anonymized the information in the “Age”, “DateOfAdmission”, and “DischargeDate” columns and in the room number column.
In the “Name” column, we used the character replacement (masking) method to conceal the patient’s last name, replacing its characters with a placeholder such as “*” to prevent the direct identification of individuals. This method can be applied to either categorical or numerical attributes.
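A minimal sketch of this kind of masking in PostgreSQL is shown below. It assumes that everything after the first space in the name is the last name, which may not match how the case study split names; the table patient_mask_demo and the sample names are hypothetical.

```sql
-- Hypothetical scratch table with invented names.
CREATE TEMPORARY TABLE patient_mask_demo (
    id   integer PRIMARY KEY,
    Name text
);

INSERT INTO patient_mask_demo VALUES
    (1, 'Ana Silva'),
    (2, 'Rui Costa Pereira'),
    (3, 'Madonna');

-- Keep the first word and replace every later character with '*'.
-- Names without a space are left unchanged (assumption: no last name to mask).
UPDATE patient_mask_demo
SET Name = CASE
               WHEN position(' ' IN Name) > 0
               THEN split_part(Name, ' ', 1) || ' '
                    || repeat('*', length(Name) - position(' ' IN Name))
               ELSE Name
           END;

SELECT id, Name FROM patient_mask_demo ORDER BY id;
-- id 1 becomes 'Ana *****'
```

A mask of the same length as the original still reveals how long the last name was; a fixed-length mask would avoid that at the cost of a less recognizable format.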
In the “Age” column, we applied a generalization method that groups ages into 10-year intervals. This anonymization practice reduces the granularity of the data, making it less likely that specific individuals can be identified. We then applied a similar generalization to the patient admission and discharge dates, keeping only the month and year and removing the day.
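The two generalizations can be sketched as a single query. The interval labels (for example “30-39”) and the “YYYY-MM” date format are assumptions chosen for illustration, as are the table names patient_gen_demo and patient_gen_out; only the column names “Age”, “DateOfAdmission”, and “DischargeDate” come from the dataset.

```sql
-- Hypothetical scratch table with invented rows.
CREATE TEMPORARY TABLE patient_gen_demo (
    id              integer PRIMARY KEY,
    Age             integer,
    DateOfAdmission date,
    DischargeDate   date
);

INSERT INTO patient_gen_demo VALUES
    (1, 34, DATE '2023-01-15', DATE '2023-02-02'),
    (2, 70, DATE '2022-11-30', DATE '2022-12-01'),
    (3,  9, DATE '2021-06-05', DATE '2021-06-20');

-- Integer division by 10 gives the decade; multiplying back gives the
-- lower bound of the 10-year interval. Dates keep only year and month.
CREATE TEMPORARY TABLE patient_gen_out AS
SELECT id,
       format('%s-%s', (Age / 10) * 10, (Age / 10) * 10 + 9) AS AgeRange,
       to_char(DateOfAdmission, 'YYYY-MM')                   AS AdmissionMonth,
       to_char(DischargeDate, 'YYYY-MM')                     AS DischargeMonth
FROM patient_gen_demo;

SELECT * FROM patient_gen_out ORDER BY id;
```

Writing the generalized values to a separate table keeps the original columns intact while the transformation is being reviewed.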
By pseudonymizing the data in the column referring to the patient’s name, we obtained the results in
Table 9 and
Table 10:
This pseudonymization approach reduces the risk of re-identification while protecting patient privacy.
By anonymizing the data in the column referring to the patient’s age using the generalization technique, we obtained the results in
Table 11 and
Table 12:
In this way, patient privacy is preserved without significantly compromising the usefulness of the data for statistical and trend analysis.
Next, by anonymizing the data in the columns referring to the patient’s admission and discharge dates using the generalization technique, we obtained the outcomes in
Table 13 and
Table 14:
This simplification of data helps protect patient privacy while enabling temporally relevant analysis.
Finally, by anonymizing the data in the column referring to the patient’s room number using the suppression technique, we obtained the results in
Table 15 and
Table 16:
Implementing anonymization and pseudonymization in the dataset made it possible to remove or modify information that could identify patients, strengthening the privacy and security of the data.
Data anonymization and pseudonymization provided additional protection, making it more difficult to identify individuals from the available data. By suppressing sensitive attributes and generalizing information such as age and dates, it was possible to preserve patient privacy without compromising the usefulness of the data for statistical and trend analysis. However, these techniques reduced data granularity, making detailed analysis or cross-referencing more difficult.
Ethical Considerations in Anonymization and Pseudonymization
While these techniques enhance privacy, they also raise ethical concerns:
Re-identification Risk: Despite our efforts, there is always a risk of re-identification, especially with advances in data analytics. This presents an ongoing ethical challenge [
18].
Data Utility vs. Privacy: The reduction in data granularity may limit certain types of analysis, potentially impacting the utility of the data for research or healthcare improvements [
72].
Informed Consent: It is crucial to consider whether the original consent given by data subjects covers these transformations, as they significantly alter the nature of the data [
71].
Transparency: We must be transparent about these processes to maintain trust with data subjects and other stakeholders [
73].
5.3. Data Encryption
In pgAdmin4, we used the Advanced Encryption Standard (AES) algorithm with the help of the pgcrypto extension in PostgreSQL. Because AES is a symmetric cipher, only authorized users with access to the secret key can view or modify the information.
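A minimal sketch of column-level AES encryption with pgcrypto is given below. The text does not state which pgcrypto function was used, so the choice of pgp_sym_encrypt with the cipher-algo=aes256 option is an assumption; the table patient_enc_demo, the extra column MedicalConditionEnc, and the passphrase are hypothetical.

```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Hypothetical scratch table; MedicalConditionEnc holds the ciphertext.
CREATE TEMPORARY TABLE patient_enc_demo (
    id                  integer PRIMARY KEY,
    MedicalCondition    text,
    MedicalConditionEnc bytea
);

INSERT INTO patient_enc_demo (id, MedicalCondition) VALUES
    (1, 'Diabetes'),
    (2, 'Asthma');

-- Encrypt with AES-256 and clear the plaintext in the same statement.
-- SET expressions read the old row values, so the encryption still sees the text.
-- The passphrase literal is for illustration only and should not live in SQL.
UPDATE patient_enc_demo
SET MedicalConditionEnc = pgp_sym_encrypt(MedicalCondition,
                                          'demo-passphrase',
                                          'cipher-algo=aes256'),
    MedicalCondition    = NULL;

-- Only a caller who knows the passphrase can recover the value.
SELECT id,
       pgp_sym_decrypt(MedicalConditionEnc, 'demo-passphrase') AS MedicalCondition
FROM patient_enc_demo
ORDER BY id;
```

In practice the key would be supplied by the application or a key management service at query time rather than written into the statement.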
We executed the query “SELECT * FROM PatientRecord ORDER BY id” to analyze the database’s performance and the response time in milliseconds. Before data encryption, we observed a response time of less than 1 s, as shown in
Figure 5:
After encrypting the data in the columns referring to the patient’s medical condition and medication, we observed a response time of approximately 11 s, as illustrated in
Figure 6:
Data encryption resulted in a considerable increase in database response time, indicating a trade-off between security and performance that must be carefully evaluated to strike an appropriate balance. Encryption can therefore degrade system performance, although it provides greater security in protecting sensitive data, as seen in
Figure 7.
Data encryption, although essential for protecting privacy and data security, reduced the performance of our database queries. The additional time required to encrypt and decrypt information increased the system’s response time, making it less efficient. While security is paramount, it is important to consider performance and find a balance between these competing needs.
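One way to reproduce this kind of comparison without relying on the pgAdmin4 timing display is PostgreSQL’s EXPLAIN (ANALYZE), which runs a query and reports its execution time. The sketch below is only an illustration: the table enc_timing_demo, the 1000 generated rows, and the passphrase are hypothetical, and the timings it reports will differ from those in the figures above.

```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Hypothetical scratch table holding the same value in plain and encrypted form.
CREATE TEMPORARY TABLE enc_timing_demo AS
SELECT g                AS id,
       'condition ' || g AS plain_value,
       pgp_sym_encrypt('condition ' || g,
                       'demo-passphrase',
                       'cipher-algo=aes256') AS enc_value
FROM generate_series(1, 1000) AS g;

-- Baseline: read the plaintext column.
EXPLAIN (ANALYZE)
SELECT id, plain_value
FROM enc_timing_demo
ORDER BY id;

-- Same rows, but each value must be decrypted before it is returned.
EXPLAIN (ANALYZE)
SELECT id, pgp_sym_decrypt(enc_value, 'demo-passphrase') AS plain_value
FROM enc_timing_demo
ORDER BY id;
```

Comparing the “Execution Time” lines of the two plans gives a rough measure of the per-query decryption overhead on a given machine.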
Ethical Implications of Data Encryption
The implementation of data encryption raises several ethical considerations:
Security vs. Accessibility: While encryption enhances data security, it may limit data accessibility for legitimate uses, potentially impacting patient care or research [
19].
Performance Trade-offs: The significant increase in query execution time could affect the efficiency of healthcare operations, potentially impacting patient care. This presents an ethical dilemma between data security and operational efficiency [
45].
Key Management: The ethical responsibility of managing encryption keys is crucial, as loss of keys could result in permanent data loss [
35].
Transparency: It is important to be transparent about encryption practices to maintain trust with data subjects and comply with regulations [
74].
5.4. Access Controls
To implement access control on data using pgAdmin4, we adopted an approach that involved assigning permissions to database users. This included granting specific access privileges, such as SELECT, INSERT, UPDATE, and DELETE, only to users who required these permissions to perform their functions.
Initially, the user (or role) “henrique” was created in PostgreSQL using the command “CREATE ROLE henrique LOGIN PASSWORD 'string'”. Then “henrique” was granted “SELECT” and “INSERT” permissions for all tables in the public schema through the “GRANT” command. The “UPDATE” and “DELETE” permissions were revoked using the “REVOKE” command to ensure more restricted data management. Furthermore, additional restrictions were applied to user “henrique”: he can authenticate and access the database using his credentials, but he is prevented from having all administrative privileges (superuser), creating databases, creating new users (or roles), inheriting the privileges of parent roles, and initiating streaming replication and backups.
User “afonso” received the “SELECT” permission for all tables in the public schema, and the “INSERT”, “UPDATE”, and “DELETE” permissions were revoked. Additional restrictions similar to those for user “henrique” were also applied.
Lastly, user “eduardo” was granted “SELECT”, “INSERT”, “UPDATE”, and “DELETE” permissions for all tables in the public schema. User “eduardo” can authenticate and access the database using his credentials and is prevented from being a superuser, but he is allowed to create databases, create new users (or roles), inherit privileges from parent roles, and initiate streaming replication and backups.
These measures aim to guarantee precise and granular control of access to data, thus promoting the security and integrity of information stored in the health database against unauthorized access and improper manipulation.
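The configuration described above can be sketched as plain SQL. The restrictions listed correspond to the PostgreSQL role attributes LOGIN, SUPERUSER, CREATEDB, CREATEROLE, INHERIT, and REPLICATION. The passwords and the table public.patient_record_demo are hypothetical placeholders, and the script assumes it is run by a superuser in a scratch database where these three roles do not yet exist.

```sql
-- Hypothetical table so that the grants below have something to apply to.
CREATE TABLE IF NOT EXISTS public.patient_record_demo (
    id   integer PRIMARY KEY,
    note text
);

-- henrique: may log in, read, and insert; no administrative attributes.
CREATE ROLE henrique LOGIN PASSWORD 'change-me-1'
    NOSUPERUSER NOCREATEDB NOCREATEROLE NOINHERIT NOREPLICATION;
GRANT SELECT, INSERT ON ALL TABLES IN SCHEMA public TO henrique;
REVOKE UPDATE, DELETE ON ALL TABLES IN SCHEMA public FROM henrique;

-- afonso: read-only, with the same attribute restrictions.
CREATE ROLE afonso LOGIN PASSWORD 'change-me-2'
    NOSUPERUSER NOCREATEDB NOCREATEROLE NOINHERIT NOREPLICATION;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO afonso;
REVOKE INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public FROM afonso;

-- eduardo: full data access and most administrative attributes, but not superuser.
CREATE ROLE eduardo LOGIN PASSWORD 'change-me-3'
    NOSUPERUSER CREATEDB CREATEROLE INHERIT REPLICATION;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO eduardo;

-- Check the effective table privileges of each role.
SELECT r.rolname,
       has_table_privilege(r.rolname, 'public.patient_record_demo', 'SELECT') AS can_select,
       has_table_privilege(r.rolname, 'public.patient_record_demo', 'INSERT') AS can_insert,
       has_table_privilege(r.rolname, 'public.patient_record_demo', 'UPDATE') AS can_update,
       has_table_privilege(r.rolname, 'public.patient_record_demo', 'DELETE') AS can_delete
FROM pg_roles AS r
WHERE r.rolname IN ('henrique', 'afonso', 'eduardo')
ORDER BY r.rolname;
```

GRANT ... ON ALL TABLES IN SCHEMA affects only tables that exist when it runs; tables created later need their own grants or an ALTER DEFAULT PRIVILEGES rule.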
Ethical Considerations in Access Control
Implementing access controls raises several ethical considerations:
Balancing Access and Privacy: While restricting access enhances privacy, it may limit legitimate data use for research or patient care. This presents an ethical challenge in balancing data protection with data utility [
52].
Transparency and Accountability: It is crucial to maintain transparency about access control policies and ensure accountability for data access [
75].
Fairness in Access Distribution: The distribution of access rights must be fair and based on legitimate needs, avoiding any form of discrimination [
53].
Continuous Review: Access rights should be regularly reviewed to ensure they remain appropriate, reflecting the ethical responsibility to maintain data security over time [
76].
In conclusion, implementing these data management practices has proven critical to ensuring compliance with privacy regulations, protecting the confidentiality of health information, and promoting user confidence in the ethical management and use of data. These measures contribute significantly to mitigating the risks of privacy violations and ensuring respect for patients’ rights, even though they may affect the speed and efficiency of the database, most notably in the case of encryption.
5.5. Data Hashing
To implement data hashing, we used the SHA-256 algorithm provided by PostgreSQL’s pgcrypto extension. We applied hashing to the patient’s name and medical condition to verify data integrity.
Here is an example of how we implemented hashing:
UPDATE PatientRecord
SET NameHash = encode(digest(Name, 'sha256'), 'hex'),
    MedicalConditionHash = encode(digest(MedicalCondition, 'sha256'), 'hex');
This query creates hash values for the Name and MedicalCondition fields, storing them in the new columns NameHash and MedicalConditionHash, respectively.
To verify data integrity, we can compare the stored hash with a newly generated hash:
SELECT Name, NameHash,
       encode(digest(Name, 'sha256'), 'hex') AS ComputedHash,
       NameHash = encode(digest(Name, 'sha256'), 'hex') AS HashMatch
FROM PatientRecord;
This query computes a new hash for each Name and compares it with the stored hash, indicating whether they match.
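To show what a failed check looks like, the following self-contained sketch hashes two rows, changes one name without updating its hash, and reruns the comparison. The table patient_hash_demo and the sample names are hypothetical; the digest and encode calls are the same as in the queries above.

```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Hypothetical scratch table with invented names.
CREATE TEMPORARY TABLE patient_hash_demo (
    id       integer PRIMARY KEY,
    Name     text,
    NameHash text
);

INSERT INTO patient_hash_demo (id, Name) VALUES
    (1, 'Ana Silva'),
    (2, 'Rui Costa');

-- Store the SHA-256 hash of each name as a hex string.
UPDATE patient_hash_demo
SET NameHash = encode(digest(Name, 'sha256'), 'hex');

-- Simulate a change that bypasses the hash update.
UPDATE patient_hash_demo
SET Name = 'Ana Sousa'
WHERE id = 1;

-- Row 1 no longer matches its stored hash; row 2 still does.
SELECT id,
       Name,
       NameHash = encode(digest(Name, 'sha256'), 'hex') AS HashMatch
FROM patient_hash_demo
ORDER BY id;
```

A hash stored in the same row only detects changes made by someone who does not also recompute the hash, so the reference hashes are better protected when kept where ordinary writers cannot modify them.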
Ethical Considerations in Data Hashing
While hashing enhances data integrity, it also raises ethical considerations:
Privacy Implications: Although hash functions are one-way, identical inputs always produce identical hashes, so hashed values could be used to track or link records across databases, raising privacy concerns [
46].
Transparency: It is important to be transparent about the use of hashing techniques to maintain trust with data subjects [
77].
Data Utility: Hashed data may limit certain types of data analysis or use, potentially impacting research or healthcare outcomes [
51].
Collision Risks: While rare with modern algorithms like SHA-256, hash collisions could potentially lead to data integrity issues, raising ethical concerns about data accuracy [
50].
The implementation of data hashing provides an additional layer of security and data integrity verification, complementing our other data protection measures. However, it is crucial to balance these benefits against the ethical considerations and potential impacts on data utility.
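The linkage concern raised above can be illustrated with a short query. It also shows one possible mitigation, a keyed hash (HMAC) computed with pgcrypto’s hmac function and a different secret key per database. This mitigation is not part of the case study; it is offered only as a sketch, and the keys and the sample name are hypothetical.

```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;

SELECT
    -- Plain SHA-256: the same name hashes to the same value in any database,
    -- so two datasets can be joined on the hash.
    encode(digest('Ana Silva', 'sha256'), 'hex')
      = encode(digest('Ana Silva', 'sha256'), 'hex') AS plain_hashes_link,
    -- Keyed hash with a different secret per database: the values differ,
    -- so the hash alone no longer links the records.
    encode(hmac('Ana Silva', 'key-for-db-a', 'sha256'), 'hex')
      = encode(hmac('Ana Silva', 'key-for-db-b', 'sha256'), 'hex') AS keyed_hashes_link;
-- plain_hashes_link is true, keyed_hashes_link is false
```

A keyed hash still supports integrity checks inside one database, provided the key is managed with the same care as an encryption key.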
6. Conclusions and Future Work
This study’s implementation of data management practices proved essential to ensure that the health information in the “Healthcare Dataset” is adequately protected regarding privacy and security.
The implementation of data minimization, data anonymization and pseudonymization, data encryption, access control, and data hashing has proven to be essential for effectively protecting the privacy and security of patients’ health data.
Data minimization is essential to limit the amount of personal information collected and processed to the minimum necessary to achieve legitimate purposes, thus reducing the risks of exposure and misuse of data.
Anonymization, in turn, irreversibly eliminates the link between the data and the patient’s identity, allowing the information to be used for statistical or research purposes without compromising privacy. Pseudonymization replaces direct identifiers with a code, maintaining the possibility of re-identification, which makes it suitable for situations where data needs to be traceable, such as monitoring the adverse effects of medications.
Encryption is essential in protecting patient data, providing additional protection against unauthorized access by reversibly transforming information so that it can be read only with the appropriate key.
Access control establishes rules and restrictions on who can access, view, or modify data, ensuring that only authorized users can access and manipulate health data.
Data hashing provides an additional layer of security by ensuring data integrity and allowing for the verification of data authenticity. This is particularly important in healthcare settings where the accuracy of patient information is crucial.
While these security measures may introduce some performance overhead, particularly in the case of encryption, the privacy protection and regulatory compliance advantages outweigh this cost. Nevertheless, it is essential to optimize these implementations to minimize the performance impact and strike an appropriate balance between data security and system efficiency.
The ethical implications of these practices are significant and multifaceted. They include considerations of data utility versus privacy, the potential for bias introduction, the need for informed consent, and the importance of transparency in data handling processes. These ethical considerations underscore the complexity of managing healthcare data and the need for a thoughtful, balanced approach that respects individual privacy while enabling the beneficial use of data for research and improved patient care.
Adopting these practices is crucial to building patient trust and promoting a safe, ethical, and trustworthy digital health ecosystem.
This study highlights the need for a comprehensive and ongoing approach to ensuring the privacy and security of health data in the digital era. It serves as a basis for future research and development in health information security.
Future Work
In future work, additional studies should evaluate the effectiveness of various data protection techniques. This includes ensuring ongoing compliance with privacy regulations, investigating new anonymization and pseudonymization methods, and implementing monitoring and auditing systems.
Additionally, it would be interesting to consider ways to make the database structure more efficient in order to further reduce the negative impact of the additional security measures. This could be accomplished by using more advanced data processing methods or by optimizing database queries and operations.
Future research could also focus on the following:
Exploring advanced encryption techniques that balance security and performance more effectively.
Investigating the use of blockchain technology for enhancing data integrity and traceability in healthcare databases.
Developing more sophisticated anonymization techniques that better preserve data utility while ensuring privacy.
Studying the long-term impacts of these data protection measures on healthcare outcomes and research capabilities.
Exploring the ethical implications of emerging technologies like artificial intelligence and machine learning in healthcare data management.
Investigating ways to implement the “right to be forgotten” in healthcare databases while maintaining data integrity and complying with retention requirements.
Finally, it is imperative to stay up to date with changes in data privacy and security regulations and to monitor new technological developments and best practices in data management. Doing so will ensure that the system remains adapted to the ever-evolving needs and requirements of the digital healthcare environment.