A Novel Vertical Fragmentation Method for Privacy Protection Based on Entropy Minimization in a Relational Database

Many scholars have attempted to use encryption to resolve the problem of data leakage in outsourced data storage. However, encryption methods reduce data availability and are inefficient. Vertical fragmentation offers an alternative that avoids these drawbacks. It was first used to improve the access performance of relational databases, and nowadays some researchers employ it for privacy protection. However, some problems remain to be solved with vertical fragmentation for privacy protection in relational databases. First, current vertical fragmentation methods for privacy protection require the user to manually define privacy constraints, which is difficult to achieve in practice. Second, many vertical fragmentation solutions can meet the privacy constraints; however, there are currently no quantitative criteria for evaluating how effectively each solution protects privacy. In this article, we introduce the concept of information entropy to quantify privacy in vertical fragmentation, so that privacy constraints can be discovered automatically. Based on this, we propose a privacy protection model with a minimum entropy fragmentation algorithm to achieve minimal privacy disclosure in vertical fragmentation. Experimental results show that our method is suitable for privacy protection with a lower overhead.


Introduction
With the increase of human activities, more and more personal data needs to be stored. Some enterprises and individuals outsource their data to storage service providers. However, outsourced storage gives rise to new security risks. A major problem is that the data is stored with external storage providers, so owners lose control over their data, potentially exposing it to improper access and dissemination. For example, in 2007, the well-known cloud service provider Salesforce [1] leaked a large amount of data due to security issues.
A current issue is how to ensure that outsourced data does not suffer from privacy leakage, even if intruders obtain the original data. Currently, the most widely used method to protect outsourced data is encryption. To preserve the availability of encrypted data, techniques such as homomorphic encryption [2] and searchable encryption [3,4] have been developed. However, from the user's point of view, the use of encryption is a huge burden, because encryption is very inefficient in operations such as database queries. Other commonly used methods for data outsourcing include anonymization techniques such as k-anonymity [5,6], l-diversity [7], and t-closeness [8]. Still, these methods limit the availability of the outsourced data.
An interesting method used to ensure both the availability and privacy of outsourced data is vertical fragmentation. Vertical fragmentation was not specifically designed to protect data security. It is primarily used in relational databases to improve database access performance and optimize storage. It splits a large table into multiple small tables and stores them on multiple different servers. When the user submits a query request, the system translates it into multiple requests, which are executed concurrently on different servers. In this way, the efficiency of database querying is greatly improved. The vertical fragmentation method assumes that privacy is generated by the associations among data. For example, if the names and phone numbers in a database are leaked separately, the attacker cannot learn which number belongs to whom. However, if the names and phone numbers are leaked together, the attacker obtains accurate private information. Vertical fragmentation splits the database across different servers, greatly reducing the risk of leaking the entire database. Thus, we can use vertical fragmentation to protect the privacy of an outsourced database. In addition, vertical fragmentation has a big advantage in maintaining the availability of outsourced data compared to encryption and anonymization techniques, as these latter methods do not store the original data.
However, there are some problems with vertical fragmentation concerning privacy protection. First, it is very difficult for users to specify privacy constraints. In practice, users need to understand the semantics of each attribute and then determine whether there are privacy constraints. Some weakly related privacy constraints are difficult for users to find, especially when there are many attributes in the database. Second, there are many vertical fragmentation schemes that satisfy the privacy constraints. In order to evaluate which scheme is better, De Capitani di Vimercati's approach simulates the frequency of access to all attributes and ultimately chooses the scheme with the best access efficiency to perform the vertical fragmentation. In practice, however, the access frequencies of attributes cannot be known in advance.
Compared with De Capitani di Vimercati's method, our research focuses on optimizing vertical fragmentation to eliminate user-defined privacy constraints and to evaluate the effectiveness of privacy protection. We propose a new approach to vertically fragment a relational database to ensure the privacy and availability of the outsourced data. The approach consists of an entropy-based method to quantify the privacy of outsourced data and a greedy algorithm to achieve privacy protection. In summary, we make the following contributions:
(1) We introduce the concept of information entropy and give an example of entropy calculation for a sample database, thus avoiding the need for the user to manually define privacy constraints in vertical fragmentation. We can calculate the entropy value of any number of attributes to determine which attributes together have a greater probability of forming privacy constraints.
(2) We implement a vertical fragmentation method based on a greedy algorithm to minimize entropy. There are many solutions for vertically fragmenting databases. In order to minimize the risk of privacy leakage per fragment, we need to make the entropy of each fragment as small as possible. We propose a greedy algorithm to minimize the entropy for vertical fragmentation, which also works with low overhead.
The rest of this paper is structured as follows. Section 2 introduces related work. Section 3 describes the problem statement. We introduce our approach in Section 4. The experimental results are discussed in Section 5. In Section 6, we discuss the limitations and extensibility of our work. Conclusions are given in Section 7.

Related Work
In this section, we first introduce related work on vertical fragmentation with a focus on privacy protection (Section 2.1), and then we describe the research on information entropy (Section 2.2).

Vertical Fragmentation for Privacy Protection
With the development of encryption technology, encryption has come to be regarded as the leading method for privacy protection. However, encryption is inefficient at performing queries. To perform queries on encrypted data, researchers proposed indexing methods. Hacigumus et al. [9] proposed bucket-based index methods. Hore et al. [10] presented an efficient bucket-based index method. Wang et al. [11] proposed a hash-based index method. However, the indexing methods cannot resist inference attacks [12], and Order Preserving Encryption (OPE) [13] methods address this problem by supporting range queries.
These encryption methods protect privacy but do not solve the problem of how to execute queries efficiently: querying encrypted data remains very inefficient. An interesting method for achieving both privacy protection and efficient querying is vertical fragmentation. Vertical fragmentation is mainly used to improve database access performance, since it stores together the attributes that are frequently accessed by the same queries. De Capitani di Vimercati et al. were the first to use vertical fragmentation for privacy protection. As reported in Refs. [14-20], De Capitani di Vimercati et al. used vertical fragmentation to protect the privacy of a relational database. They employed user-defined privacy constraints to complete the vertical fragmentation. Considering that there are many solutions that satisfy the privacy constraints, they employed the user's query load to compute the solution with optimal query efficiency. Xiang et al. [21] observed that the user's query load on the outsourced data changes dynamically, so they proposed an adaptive partitioning strategy. Biskup et al. [22] used vertical fragmentation to protect confidentiality. Their model consists of two domains: one for storing encrypted data and a trusted local domain that stores fragments containing highly sensitive data without encrypting them. However, this method is not efficient. In order to reduce the overhead of query processing in data encryption methods, Ganapathy et al. [23] proposed distributed data storage for secure database services, in which data is stored on multiple servers. They also used the concept of privacy constraints and developed a heuristic greedy algorithm for the distribution problem. Partitioning the queries among the servers was achieved using a bottom-up state-based algorithm. For multi-relational databases, Bkakria et al. [24] defined an approach for the privacy protection of sensitive information in outsourced data, based on a combination of fragmentation and encryption.

Information Entropy
Shannon's theory of information entropy [25] provides the basic theory for quantifying and communicating information. Early studies that use information entropy to measure privacy are those by Diaz [26] and Serjantov et al. [27], who proposed the use of information entropy to measure the anonymity of anonymous communication systems. Assuming that the purpose of the attacker is to determine the true identity of the sender (or receiver) of a message, each user in the system can be estimated to be the true sender or receiver of the message with a certain probability. The attacker's guess that a user is the real sender or receiver is expressed as a random variable $X$. Let $p(x)$ be the probability function of $X$, and $S(X)$ be the set of possible values of $X$. In this way, we use the entropy $H(X)$ to quantify the system's privacy level:

$$H(X) = -\sum_{x \in S(X)} p(x) \log_2 p(x)$$

For example, when the set $S(X)$ is {1, 2} and the probability of each value is 50%, then $H(X) = 1$. If the only value of the random variable $X$ is 1, then the probability of taking the value 1 is 100%. Accordingly, $H(X) = 0$. Therefore, the greater the entropy of the random variable $X$, the greater the dispersion of $X$. The more dispersed the values, the greater the probability that a record is identified, and the higher the risk of information being leaked. If the entropy of the attribute is 0, the value of the attribute corresponds equally to all the records, and thus the security is the highest; it is hard for an attacker to identify the message sender or receiver.
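The two worked values above can be checked with a few lines of Python. This is a minimal sketch; the function name `shannon_entropy` is ours and is not part of the cited works.

```python
import math

def shannon_entropy(probabilities):
    """Return H(X) = sum of p(x) * log2(1 / p(x)) over outcomes with p(x) > 0."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

# S(X) = {1, 2}, each value with probability 50%.
uniform_two = shannon_entropy([0.5, 0.5])
# X always takes the value 1.
single_value = shannon_entropy([1.0])

print(uniform_two, single_value)
```

The first case reproduces $H(X) = 1$ and the second $H(X) = 0$.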

Problem Statement
To provide privacy protection with vertical fragmentation, we first examine what vertical fragmentation for privacy protection means in general. We use an example to explain it and show that the current vertical fragmentation method is inefficient and unrealistic. Then, we introduce automated vertical fragmentation, which requires a way to quantify privacy. Finally, we give the evaluation criteria for our method.

Vertical Fragmentation with Privacy Constraints
De Capitani di Vimercati et al. first proposed a vertical fragmentation strategy to protect privacy in a relational database. In their research, privacy is defined by the user in terms of privacy constraints. A privacy constraint corresponds to a set of attributes of a relational database. The attributes in the set together may lead to privacy leakage.
Definition 1. Privacy Constraints. Let a simple relation $R$ consist of the attributes $a_1, a_2, a_3, \ldots, a_n$. Every privacy constraint is a subset of this attribute set, that is, a privacy constraint $c \subseteq \{a_1, a_2, \ldots, a_n\}$.
For example, in a medical system, a patient's information sheet consists of the social security number (SSN), name, date of birth (DoB), zip code, disease, and physician, as shown in Table 1. De Capitani di Vimercati et al. observed that some attributes together cause a risk of privacy leakage. For example, the SSN and disease together form a privacy constraint, as they could disclose private information if leaked together. However, if only the SSN or only the disease information is acquired by others, no privacy is lost. De Capitani di Vimercati et al. reported that the SSN together with other attributes can be considered as privacy constraints, as described by $c_0, \ldots, c_5$. Similarly, the name is also considered very sensitive. Privacy constraints $c_6, \ldots, c_9$ show that the name paired with other attributes presents a privacy risk. Privacy constraint $c_{10}$ indicates that the birthday, zip code, and disease information together act as a quasi-identifier from which one can infer the patient's name. Similarly, with the birthday, zip code, and physician together, one can also infer the name and SSN, as privacy constraint $c_{11}$ describes. Privacy constraints $c_0$ to $c_{11}$ are manually defined by the user.
Definition 2. Fragmentation. Let $R(a_1, a_2, \ldots, a_n)$ be a relation schema, and let the fragmentation result be $F = \{f_1, f_2, \ldots, f_p\}$, where these fragments satisfy:
(i) $\forall f_i \in F: f_i \subseteq \{a_1, a_2, \ldots, a_n\}$;
(ii) $\forall f_i, f_j \in F,\ i \neq j: f_i \cap f_j = \emptyset$;
(iii) $\bigcup_{f_i \in F} f_i = \{a_1, a_2, \ldots, a_n\}$.
Condition (i) states that only attributes of the relation $R$ are considered by the fragmentation, condition (ii) ensures unlinkability between different fragments, and condition (iii) guarantees that all attributes are fragmented.
Using these privacy constraints, De Capitani di Vimercati et al. divided the patients' information sheet into four blocks { {SSN}, {name}, {DoB, zip code}, {disease, physician} }. These blocks were then stored on four different servers that do not communicate with each other. Thus, even if one server is compromised by an attack, the entire relational database is not leaked; that is, "two can keep a secret" [28]. This is the core concept of vertical fragmentation for privacy protection in De Capitani di Vimercati's research: to break the original associations for the purpose of privacy protection.
However, De Capitani di Vimercati's research still has some flaws. First, we need to know the semantics of each attribute as well as which attributes together may cause privacy leaks. As in the example above, we have to establish that the SSN and name attributes are sensitive, and recognize that they cause privacy leaks along with other attributes. Second, some privacy constraints are not easily found from semantics, such as the privacy constraint $c_{10}$ = {DoB, zip code, disease} in the above patients' information sheet. This makes the method very difficult to apply to real problems. In reality, the number of attributes will be very large, and their relationships will be very complicated, so manually defining privacy constraints is not realistic. Third, there are many vertical fragmentation schemes that can satisfy the privacy constraints. For example, { {SSN}, {name}, {DoB, disease}, {zip code, physician} } is also a valid scheme. However, De Capitani di Vimercati et al. do not provide privacy-oriented evaluation criteria to judge the quality of these schemes.

Evaluation Standard
We consider the following model. We assume that the attacker can obtain the data held by a single cloud service provider (CSP) through the network, but that the CSPs do not communicate with each other, so the risk of privacy leakage is concentrated on a single CSP. Our goal is to ensure the protection of private information on a single CSP. Based on the analysis of the example of the patients' information sheet, below is a list of the requirements that our approach must meet.
(1) Correct Fragmentation: Verify that the privacy constraints are broken, so that the fragment on a single CSP will not cause privacy leaks. Our approach must first satisfy the most basic privacy protection requirement, which is to break the potential privacy constraints that exist in relational databases. We must ensure that no fragment leaks private information: attributes that together can completely determine the records in the database are very likely to disclose private information when stored together.
Definition 3. Correct Fragmentation. Let $R(a_1, a_2, \ldots, a_n)$ be a relation schema, let the fragmentation result be $F = \{f_1, f_2, \ldots, f_p\}$, and let $C = \{c_0, c_1, \ldots, c_t\}$ be the privacy constraints. A correct fragmentation scheme must meet Definition 1 and the following requirement:
$$\forall c \in C,\ \forall f \in F: c \not\subseteq f$$
In the example of the patients' information sheet, with the scheme $F$ = {{SSN}, {name}, {DoB, disease}, {zip code, physician}}, none of the privacy constraints $c_0, \ldots, c_{11}$ is a subset of any fragment $f_i$.
(2) Minimal Fragmentation: This is a simple metric of the fragmentation scheme that avoids excessive fragments. Intuitively, if there are $n$ attributes in the data to be fragmented, we can simply divide it into $n$ fragments. The risk of privacy leakage is then lower than with other schemes. However, if we completely separate all attributes, the database query overhead becomes higher.
Definition 4. Minimal Fragmentation. Let $R(a_1, a_2, \ldots, a_n)$ be a relation schema. Among the fragmentation schemes that satisfy Definition 3, $F_i$ is a minimal fragmentation scheme iff there is no scheme $F_j$ satisfying Definition 3 such that $\#F_j < \#F_i$.
Here, # indicates the number of elements in a collection, so $\#F_i$ represents the number of fragments of scheme $F_i$. De Capitani di Vimercati et al. translated the problem into that of computing a maximum weighted clique of the fragmentation graph and implemented it with Ordered Binary Decision Diagrams (OBDDs) [29]. Their method is based on a heuristic algorithm and is not guaranteed to find the optimal solution.
(3) Lower Risk of Information Disclosure: Each fragment contains a portion of private information that is less than the total amount of private information. There are many vertical fragmentation schemes that can satisfy condition (1), but they differ in that each fragment contains a different amount of information. In the patients' information sheet, both scheme (a) { {SSN}, {name}, {DoB, zip code}, {disease, physician} } and scheme (b) { {SSN}, {name}, {DoB, disease}, {zip code, physician} } are vertical fragmentation solutions. However, the amount of information they contain is different; thus, we need to reduce the amount of information contained in each fragment as much as possible. The greater the amount of information contained in a fragment, the greater the risk of information leakage. Suppose there are $s$ fragments, and the amount of information contained in fragment $i$ is $I(i)$. To evaluate the risk of information leakage of the scheme, we use the sum of the squares of the information, $\sum_{i=1}^{s} I(i)^2$, which we need to minimize.
Definition 5. Lower Risk of Information Leakage. Let $R(a_1, a_2, \ldots, a_n)$ be a relation schema with fragmentation schemes (a) and (b). Let $I_a$ be the evaluation of the risk of information disclosure of scheme (a), and $I_b$ the evaluation of the risk of information disclosure of scheme (b). If $I_a < I_b$, scheme (a) has a lower risk of information leakage than scheme (b).
The first requirement indicates that our approach should achieve the goal of privacy protection, while the second indicates that the number of fragments needs to be minimized. The last requirement indicates that we should minimize the potential risk of information disclosure. Our goal is to meet the above three requirements, which is equivalent to a multi-objective constrained optimization problem [30].
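The conditions of Definitions 2 and 3 are easy to check mechanically. The sketch below is illustrative only: the function names are ours, and the constraint list is a small subset assumed from the running example ({SSN, disease}, $c_{10}$, and $c_{11}$ are named in the text; {name, disease} is an assumed instance of the "name paired with other attributes" constraints).

```python
def is_fragmentation(attributes, fragments):
    """Definition 2: fragments use only attributes of R, are disjoint, and cover R."""
    seen = set()
    for fragment in fragments:
        if not fragment <= attributes or seen & fragment:
            return False
        seen |= fragment
    return seen == attributes

def is_correct(attributes, fragments, constraints):
    """Definition 3: additionally, no privacy constraint lies inside one fragment."""
    return is_fragmentation(attributes, fragments) and not any(
        c <= f for c in constraints for f in fragments
    )

R = {"SSN", "name", "DoB", "zip", "disease", "physician"}
# Assumed, illustrative subset of the constraints c_0 ... c_11.
constraints = [
    {"SSN", "disease"},
    {"name", "disease"},
    {"DoB", "zip", "disease"},    # c_10
    {"DoB", "zip", "physician"},  # c_11
]

scheme_a = [{"SSN"}, {"name"}, {"DoB", "zip"}, {"disease", "physician"}]
scheme_b = [{"SSN"}, {"name"}, {"DoB", "disease"}, {"zip", "physician"}]
unfragmented = [set(R)]

for scheme in (scheme_a, scheme_b, unfragmented):
    print(len(scheme), is_correct(R, scheme, constraints))
```

Under these assumed constraints, both four-fragment schemes pass the check, while storing the whole table in one fragment does not.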

Approach
In this section, we describe the method used to achieve privacy protection with minimal information disclosure. Our approach is implemented in three steps. First, we introduce the concept of entropy to quantify private information. In this way, we can automatically discover privacy constraints to complete the fragmentation (Section 4.1). Second, we use the privacy constraints calculated in step 1 to obtain the minimal number of fragments (Section 4.2). Finally, we implement a minimum entropy fragmentation algorithm based on a greedy strategy, through which we can minimize the entropy of each fragment (Section 4.3).

Information Entropy to Quantify Privacy
To automatically discover privacy constraints, we need a method to quantify privacy. As an effective tool for measuring information, information entropy has proven to be an important contribution to the field of communications. As privacy is a kind of information, we can naturally consider using entropy to quantify it [31].
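Let $m$ be the number of records. It can be proved that, if there are no repeated records, the entropy value is $\log_2 m$. We define the probability that a record $X_i$ in the database takes a value $a$ via

$$P(X_i = a) = \frac{\#\{X_i = a\}}{m} \quad (3)$$

Here, $X_i$ is the random variable of the attribute, and # represents the number of elements in the collection. For example, in the patients' information sheet, each SSN value occurs in exactly one of the six records, so $P(\mathrm{SSN} = a) = 1/6$ for every SSN value $a$. Thus, the entropy of an attribute $a_j$ is:

$$EN(a_j) = -\sum_{a} P(X_j = a) \log_2 P(X_j = a) \quad (4)$$

As Table 1 shows for the patients' information sheet, the attributes' entropies can be calculated as follows: $EN(\mathrm{SSN}) = -\left(\tfrac{1}{6}\log_2\tfrac{1}{6}\right) \times 6 = \log_2 6 \approx 2.585$.

The joint entropy $EN(A, B)$ of two attributes $A$ and $B$ is the entropy of the combined records $ra_i \infty rb_i$. Here, ∞ is a string concatenation function.

Theorem 1. Let $R$ be a relation schema, $A$ and $B$ be attributes, $EN(A)$ and $EN(B)$ represent their entropy, and $EN(A, B)$ be the joint entropy of attributes $A$ and $B$. Then $EN(A, B) \geq \max\{EN(A), EN(B)\}$.

Proof of Theorem 1. From the definition of entropy, we know that the more non-repeating records there are, the larger the entropy. Consider records $ra_i$ and $ra_j$ of attribute $A$. If $ra_i = ra_j$, after adding attribute $B$ they become $ra_i \infty rb_i$ and $ra_j \infty rb_j$. However, $ra_i \infty rb_i = ra_j \infty rb_j$ holds only when $ra_i = ra_j$ and $rb_i = rb_j$. Combining $A$ with $B$ can therefore only reduce the number of repeating records, so $EN(A, B) \geq \max\{EN(A), EN(B)\}$.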
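As a concrete illustration of the attribute and joint entropy computations, the following sketch evaluates them on a small hypothetical table. The six rows below are invented for illustration and are not the contents of Table 1; the helper name `entropy` is ours.

```python
import math
from collections import Counter

def entropy(rows, attrs):
    """Entropy of the projection of rows onto attrs (joint entropy if several)."""
    m = len(rows)
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    # Each distinct value combination occurring c times has probability c / m.
    return sum((c / m) * math.log2(m / c) for c in counts.values())

# Hypothetical six-record table with a unique identifier column.
rows = [
    {"ssn": "s1", "zip": "z1", "disease": "flu"},
    {"ssn": "s2", "zip": "z1", "disease": "flu"},
    {"ssn": "s3", "zip": "z2", "disease": "flu"},
    {"ssn": "s4", "zip": "z2", "disease": "asthma"},
    {"ssn": "s5", "zip": "z3", "disease": "asthma"},
    {"ssn": "s6", "zip": "z3", "disease": "asthma"},
]

en_ssn = entropy(rows, ["ssn"])
en_zip = entropy(rows, ["zip"])
en_disease = entropy(rows, ["disease"])
en_joint = entropy(rows, ["zip", "disease"])
print(en_ssn, en_zip, en_disease, en_joint)
```

Because every `ssn` value is distinct, its entropy equals $\log_2 6$, and the joint entropy of `zip` and `disease` is at least as large as the entropy of either attribute alone.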

Theorem 2. Let $R$ be a relation schema, $A$ and $B$ be attributes, $EN(A)$ and $EN(B)$ represent their entropy, and $EN(A, B)$ be the joint entropy of attributes $A$ and $B$. Then $EN(A, B) \leq EN(A) + EN(B)$.
Proof of Theorem 2. We use mathematical induction. Assume that attribute $A$ consists of the records $ra_1, ra_2, \ldots, ra_m$, that $B$ consists of $rb_1, rb_2, \ldots, rb_m$, and that $n$ represents the number of non-repeating records of attribute $B$.

When $n = 1$, the records in attribute $B$ are all exactly the same, so $EN(B) = 0$ and $EN(A, B) = EN(A)$. That is to say, $EN(A, B) \leq EN(A) + EN(B)$.
Suppose that, when $n = t$, $EN(A, B)_t \leq EN(A)_t + EN(B)_t$ holds, and that there are $x$ duplicate $rb_k$ records in attribute $B$, $\#BR = \#\{rb_i \mid rb_i = rb_k\} = x$, $x \geq 1$. We mark them as $BR = \{rnb_1, rnb_2, \ldots, rnb_x\}$, and the corresponding elements in attribute $A$ are recorded as $AR = \{rna_1, rna_2, \ldots, rna_x\}$. At the same time, we denote attributes $A$ and $B$ with these records removed as $A - AR$ and $B - BR$. When $n = t + 1$, the number of repeating records decreases by one, so there are $x - 1$ duplicate $rb_k$ records in attribute $B$. We need to discuss two cases. (1) $\#\{ra_i \infty rb_i \mid rb_i = rb_k\} = x$, which means $\#\{rna_1, rna_2, \ldots, rna_x\} = x$.
When we change the duplicate record in attribute $B$, $ra_i$ and $ra_i \infty rb_i$ do not change. From the above, we only need to prove $EN(B)_{t+1} - EN(B)_t > 0$. It is easy to prove that, when $x > 1$ and $m > 1$, this difference is greater than zero. (2) $\#\{ra_i \infty rb_i \mid rb_i = rb_k\} = y$, $y < x$. The number of duplicate records is then $y$. There are at least two duplicate records in the set $AR$, thus $y \geq 2$; we mark them as $RNA$. When the number of repeating records in the set $BR$ decreases by one, if the records in the set $RB$ corresponding to $RNA$ do not change, the situation reduces to case (1). We therefore only need to consider the situation in which the records in the set $RB$ corresponding to $RNA$ change.
The claim then reduces to an inequality that holds because, when $m > 1$, the function involved is monotonically increasing. Therefore, $EN(A, B)_{t+1} \leq EN(A)_{t+1} + EN(B)_{t+1}$.
In summary, we obtain the following conclusion:

$$\max\{EN(A), EN(B)\} \leq EN(A, B) \leq EN(A) + EN(B)$$

With this knowledge, we can automatically discover the privacy constraints that exist in Table 1: EN(SSN) = EN(name) = 2.585 = EN(total table), which indicates that these attributes have privacy constraints, corresponding to $c_0$ to $c_9$. Likewise, EN(DoB, zip code, disease) = EN(DoB, zip code, physician) = 2.585 = EN(total table) also indicates that these attributes have privacy constraints, corresponding to $c_{10}$ and $c_{11}$. We know from the definition of entropy that entropy is the amount of information an attribute contains. That is to say, both SSN and name contain the full amount of information of the entire patients' information sheet, similar to a primary key. Thus, they are susceptible to privacy leaks along with other attributes. We propose an algorithm that automatically generates privacy constraints, shown in Algorithm 1.

Algorithm 1. Automatically generated privacy constraints (excerpt).
...
14: if Comb_entropy(i, j) > Entropy_tableA × threshold then
15:     Constraints(i, j) = Combinations(j);
...
17:   end for
18: end for
The automatically generated privacy constraints algorithm takes the table to be split as input, and the resulting set of constraints is saved in the array Constraints. Line 1 calculates the number of records and the number of attributes, marked as m and n, respectively. Line 2 calculates the entropy of the whole table. The function calEntropy takes the set of attributes and the number of records as input and uses Equations (3) and (4) to calculate the entropy of the attributes. Lines 3 to 5 calculate the entropy of each single attribute in TableA. In lines 6 to 18, we calculate the constraints of dimension 2 to n − 1. In line 7, the function calCombinations calculates, for each number of attributes, the combinations that may form privacy constraints, based on the entropy of all attributes. If i = 3, it returns all ternary combinations that may have privacy constraints. Whether a combination can become a privacy constraint is determined on the basis of Theorems 1 and 2. In lines 8 to 17, we decide whether these possible combinations become privacy constraints. On line 13, we calculate the entropy of the candidate combination. On line 14, we apply a threshold (0 < threshold ≤ 1) that the user can adjust. The larger the threshold, the stronger the privacy constraint requirement. When the entropy of a candidate combination exceeds the entropy of the entire table multiplied by the threshold, we consider it a privacy constraint. In this way, privacy constraints are discovered automatically, and users do not need to define them manually.
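Since only an excerpt of Algorithm 1 is reproduced here, the following Python sketch shows one possible reading of the description above; it is not the authors' listing. The names `entropy` and `discover_constraints` are ours, the table is hypothetical, single attributes are also tested (an assumption motivated by the SSN example), and the two pruning rules are our interpretation of how Theorems 1 and 2 are used.

```python
import math
from collections import Counter
from itertools import combinations

def entropy(rows, attrs):
    """Entropy of the projection of rows onto attrs."""
    m = len(rows)
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    return sum((c / m) * math.log2(m / c) for c in counts.values())

def discover_constraints(rows, attributes, threshold=0.8):
    """Return minimal attribute sets whose entropy exceeds threshold * EN(table)."""
    limit = entropy(rows, attributes) * threshold
    single = {a: entropy(rows, [a]) for a in attributes}
    constraints = []
    for size in range(1, len(attributes)):
        for combo in combinations(attributes, size):
            # Theorem 1: a superset of a known constraint also exceeds the limit.
            if any(set(c) <= set(combo) for c in constraints):
                continue
            # Theorem 2: the joint entropy cannot exceed the sum of the parts.
            if sum(single[a] for a in combo) <= limit:
                continue
            if entropy(rows, combo) > limit:
                constraints.append(combo)
    return constraints

# Hypothetical six-record table (not the contents of Table 1).
rows = [
    {"ssn": "s1", "zip": "z1", "disease": "flu"},
    {"ssn": "s2", "zip": "z1", "disease": "flu"},
    {"ssn": "s3", "zip": "z2", "disease": "flu"},
    {"ssn": "s4", "zip": "z2", "disease": "asthma"},
    {"ssn": "s5", "zip": "z3", "disease": "asthma"},
    {"ssn": "s6", "zip": "z3", "disease": "asthma"},
]
attributes = ["ssn", "zip", "disease"]
strict = discover_constraints(rows, attributes, threshold=0.8)
loose = discover_constraints(rows, attributes, threshold=0.7)
print(strict, loose)
```

With the threshold at 0.8, only the identifier-like attribute is reported; lowering it to 0.7 also flags the pair (`zip`, `disease`), which shows how the threshold changes the set of discovered constraints.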

Calculate Minimal Fragmentation
Given a relational database, there are many correct fragmentation schemes. We can quickly calculate a simple minimal fragmentation scheme using Theorems 1 and 2.
Theorem 3. Let $R(a_1, a_2, \ldots, a_n)$ be a relation schema, and let $EN(R)$ represent its entropy. $EN(a_i)$ is the entropy of each attribute. Let $F$ be a correct fragmentation that is close to the optimal minimal fragmentation scheme, let $\lceil \cdot \rceil$ be the ceiling function, and let threshold be a parameter that allows users to adjust the strength of privacy protection. Then

$$\#F = \left\lceil \frac{\sum_{i=1}^{n} EN(a_i)}{EN(R) \times threshold} \right\rceil$$

According to Theorem 2, the entropy of each fragment $f_j$ satisfies

$$EN(f_j) \leq \sum_{a_i \in f_j} EN(a_i) \quad (5)$$

We know from Section 4.1 that, when a fragment violates privacy constraints, its entropy must be greater than $EN(R) \times threshold$. When $F$ is a correct fragmentation, every fragment therefore satisfies

$$EN(f_j) \leq EN(R) \times threshold \quad (6)$$

From Equations (5) and (6), we can conclude the number of fragments stated above. With this method, we can quickly calculate a small number of fragments, but not necessarily an optimal solution.
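As a small numerical illustration, the sketch below assumes that the fragment count of Theorem 3 is the ceiling of the summed attribute entropies divided by $EN(R) \times threshold$. The function name `min_fragments` and the input values are hypothetical.

```python
import math

def min_fragments(attribute_entropies, table_entropy, threshold):
    """Estimated number of fragments: ceil(sum EN(a_i) / (EN(R) * threshold))."""
    if not 0 < threshold <= 1:
        raise ValueError("threshold must lie in (0, 1]")
    return math.ceil(sum(attribute_entropies) / (table_entropy * threshold))

# Hypothetical entropies: three attributes in a table with EN(R) = 3.0.
count_full = min_fragments([2.0, 2.0, 1.0], table_entropy=3.0, threshold=1.0)
count_half = min_fragments([2.0, 2.0, 1.0], table_entropy=3.0, threshold=0.5)
print(count_full, count_half)
```

Lowering the threshold tightens the per-fragment entropy budget, so the estimated number of fragments grows.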

Minimum Entropy Fragmentation Algorithm
By using entropy to quantify privacy, we can automatically discover the privacy constraints between attributes and the number of fragments. In this section, we achieve another goal, which is to minimize the amount of information disclosed in each fragment using the minimum entropy fragmentation algorithm.
From the definition of information entropy, we know that if the entropy of an attribute is $x$, then the amount of information it contains is $2^x$. Thus, the information disclosed after fragmentation is:

$$I(n) = \sum_{i=1}^{n} 2^{EN(f(i))} \quad (7)$$

Here, $EN(f(i))$ represents the entropy of fragment $f(i)$, and $n$ represents the number of fragments. The problem then becomes: given a relation $R(a_1, a_2, \ldots, a_n)$, find $F = \{f_1, f_2, \ldots, f_p\}$ such that $F$ is a correct minimal fragmentation and $I(p)$ is as small as possible. This is a constrained minimization problem. To obtain an approximately optimal solution that discloses the minimum amount of information, we employ a greedy algorithm, shown in Algorithm 2.
The minimum entropy fragmentation algorithm takes the table to be split as input, and the fragmentation result is saved in the array FS. Line 1 calculates the number of records and the number of attributes, marked as m and n, respectively. The function calEntropy takes the attribute and the number of records as input and uses Equations (3) and (4) to calculate the entropy of the attribute. Lines 2 to 4 calculate the entropy of each single attribute in TableA. In line 5, we calculate the entropy of the entire table. In line 6, we use Theorem 3 with the function calMinimalFragmentation to calculate the minimal number of fragments. In line 7, all of the attributes are arranged in descending order of entropy. Lines 8 to 10 place the attributes with the largest entropy into the result array FS. Line 14 uses Equation (7) to calculate the current amount of information disclosure, while lines 16 to 22 calculate the amount of information disclosure after adding the specified attribute to each fragment. On line 17, we use the function calSetSumEntropy to calculate the entropy of the set FS and determine whether there is a risk of privacy leakage after adding the attribute A_i. If there is such a risk, we set the amount of information disclosure to the maximum value of the system; otherwise, we use the function calInfoIncrement to calculate the increment of information disclosure. On line 23, the fragment that minimizes the amount of information disclosed is selected. Finally, the attribute is assigned to the target fragment. Thus, users do not need to specify privacy constraints at all, and the privacy level after fragmentation can be evaluated.
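The walk-through above can be sketched in Python as follows. This is our reading of the description rather than the authors' Algorithm 2: the names are ours, the number of fragments is passed in directly instead of being derived from Theorem 3, the table is hypothetical, and opening a new fragment when no existing fragment can accept an attribute is an assumption that the text does not state.

```python
import math
from collections import Counter

def entropy(rows, attrs):
    """Entropy of the projection of rows onto attrs."""
    m = len(rows)
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    return sum((c / m) * math.log2(m / c) for c in counts.values())

def greedy_fragmentation(rows, attributes, num_fragments, threshold=0.8):
    """Greedily assign attributes so that the disclosure sum of 2**EN grows least."""
    limit = entropy(rows, attributes) * threshold
    # Sort by descending entropy and seed each fragment with a top attribute.
    ordered = sorted(attributes, key=lambda a: entropy(rows, [a]), reverse=True)
    fragments = [[a] for a in ordered[:num_fragments]]
    for attr in ordered[num_fragments:]:
        best, best_cost = None, math.inf
        for fragment in fragments:
            joint = entropy(rows, fragment + [attr])
            if joint > limit:
                continue  # adding attr here would risk a privacy leak
            cost = 2 ** joint - 2 ** entropy(rows, fragment)
            if cost < best_cost:
                best, best_cost = fragment, cost
        if best is None:
            fragments.append([attr])  # assumption: fall back to a new fragment
        else:
            best.append(attr)
    return fragments

# Hypothetical six-record table (not the contents of Table 1).
rows = [
    {"ssn": "s1", "zip": "z1", "disease": "flu", "sex": "F"},
    {"ssn": "s2", "zip": "z1", "disease": "flu", "sex": "M"},
    {"ssn": "s3", "zip": "z2", "disease": "flu", "sex": "F"},
    {"ssn": "s4", "zip": "z2", "disease": "asthma", "sex": "M"},
    {"ssn": "s5", "zip": "z3", "disease": "asthma", "sex": "F"},
    {"ssn": "s6", "zip": "z3", "disease": "asthma", "sex": "M"},
]
result = greedy_fragmentation(rows, ["ssn", "zip", "disease", "sex"], num_fragments=2)
print(result)
```

In this toy run, `disease` joins the `zip` fragment because that adds the least information, while `sex` cannot join any fragment without making every record distinguishable and is therefore placed in a fragment of its own.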

Experiments
Considering that there is no published dataset for the problem defined in this paper, we tested the algorithm on the adult database of the UCI datasets (http://archive.ics.uci.edu/ml/datasets/Adult). The UCI repository, maintained by the University of California, Irvine, is intended for machine learning research. Table 2 gives detailed information on the test dataset. It has 14 attributes, which are age, workclass, fnlwgt (serial number), education, education_num (time of education), marital_status, occupation, relationship, race, sex, capital_gain, capital_loss, hours_per_week, and native_country. In the following, we replace these attributes with the serial numbers 1-14.
Table 2. Detailed information of the adult dataset.

Dataset | Number of Samples | Dimensions
Adult   | 32,561            | 14
To break constraints, we could not use semantics to determine which attributes together would lead to a privacy leak risk in the adult database. Instead, we propose that, if some attributes together can easily identify a record, then they form a privacy constraint. Such a set of attributes is similar to a primary key and can completely identify a record. First, we calculated the entropy of the original unfragmented adult database to be 14.9893. Following the previous definition, we let threshold = 0.8, which means that when the entropy of an attribute set exceeds 14.9893 × 0.8, it is considered to be a privacy constraint. Next, we calculated the entropy of groups of attributes. We listed the sets whose entropy exceeds 14.9893 × 0.8, which are the sets that form privacy constraints, as shown in Table 3. For example, the entropy of the attribute set in the first row, (1, 2, 4, 7), corresponding to age, workclass, education, and occupation, is 12.1658, which may represent a privacy constraint. The other attribute sets in the table represent other privacy constraints that exist in the adult database.
We then applied Algorithm 2 to the adult database. The minimum number of fragments is 4, and the final fragmentation result was F_1 = {3}, F_2 = {1, 11, 14}, F_3 = {2, 4, 5, 7, 12}, and F_4 = {6, 8, 9, 10, 13}. In the final fragmentation result, all possible privacy constraints in Table 3 have been broken. Concerning the lower information aspect, the entropy of the original unfragmented adult database is 14.9893, while the entropies of the final fragments F_1 to F_4 are 14.1583, 7.2534, 7.6337, and 7.4301, respectively. Therefore, the entropy is reduced after fragmentation, and the risk of a privacy leak is significantly reduced. Since our minimum entropy fragmentation algorithm is a greedy algorithm, the result is not guaranteed to be the optimal solution.
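The reported figures can be cross-checked with a short script. The sketch below uses only the numbers quoted above; the variable names are ours, and the disclosure comparison assumes the $2^{EN}$ measure of Section 4.3.

```python
fragments = {
    "F1": {3},
    "F2": {1, 11, 14},
    "F3": {2, 4, 5, 7, 12},
    "F4": {6, 8, 9, 10, 13},
}
table_entropy = 14.9893
threshold = 0.8
limit = table_entropy * threshold  # entropy above which a set counts as a constraint

# The fragments should be pairwise disjoint and cover attributes 1-14.
all_attributes = set().union(*fragments.values())
is_partition = (
    all_attributes == set(range(1, 15))
    and sum(len(f) for f in fragments.values()) == 14
)

# The example constraint (1, 2, 4, 7) should not fit inside any fragment.
constraint, constraint_entropy = {1, 2, 4, 7}, 12.1658
exceeds_limit = constraint_entropy > limit
is_broken = not any(constraint <= f for f in fragments.values())

# Information disclosed by the fragments versus the unfragmented table.
fragment_entropies = [14.1583, 7.2534, 7.6337, 7.4301]
disclosed = sum(2 ** e for e in fragment_entropies)
whole_table = 2 ** table_entropy
print(limit, is_partition, exceeds_limit, is_broken, disclosed < whole_table)
```

The checks confirm that the four fragments partition the 14 attributes, that the example constraint exceeds the threshold and is split across fragments, and that the summed disclosure of the fragments is below that of the unfragmented table.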

Performance Evaluation
To evaluate the performance of the minimum entropy fragmentation algorithm, we measured its time cost as a function of the number of records and the number of attributes. The adult database has 14 attributes and 32,561 records. As a baseline, we used the exhaustive method to calculate the minimum entropy fragmentation. On the adult database, the exhaustive method takes 17 s with only four attributes, but 1709 s when the number of attributes is increased to 10. Thus, the running time of the exhaustive method grows exponentially, and it cannot be applied in practice. To observe how the time cost of our algorithm varies with the number of records, we set the number of records to {1000, 2000, 4000, 8000, 16,000, 32,561} with a fixed number of attributes. Similarly, with a fixed number of records, we set the number of attributes to {4, 6, 8, 10, 12, 14}. Figure 2a shows the time cost on the adult database when the number of records varies and the number of attributes is fixed at 14. The correlation coefficient between the number of records and the time cost was 0.9667, which indicates a strong linear relationship between the number of records and the running time of the greedy algorithm. Figure 2b shows the time cost when the number of attributes varies and the number of records is fixed at 32,561. The correlation coefficient between the number of attributes and the time cost was 0.9977.
To assess the generalizability of our approach, we repeated the experiments on the census-income database from the UCI repository (http://archive.ics.uci.edu/ml/datasets/Census+Income), as depicted in Figure 3a,b. The census-income database has 199,523 records and 42 attributes. The correlation coefficients were calculated to be 0.9940 and 0.9806, respectively. Thus, the greedy fragmentation algorithm proposed in this paper is a polynomial time algorithm, which can quickly fragment the database with privacy protection.
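For reference, a correlation coefficient of this kind can be computed as follows. The sketch assumes that the reported values are Pearson correlation coefficients, which the text does not state explicitly. The record counts are the ones used in the experiment, while the two time series are synthetic stand-ins (not measured timings) chosen only to show how the coefficient behaves.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Record counts from the experiment.
record_counts = [1000, 2000, 4000, 8000, 16000, 32561]

# Synthetic stand-ins for time cost, NOT measurements from the paper.
linear_cost = [3 * n + 5 for n in record_counts]   # exactly linear growth
quadratic_cost = [n * n for n in record_counts]    # super-linear growth

print(pearson(record_counts, linear_cost))
print(pearson(record_counts, quadratic_cost))
```

An exactly linear series yields a coefficient of 1. The quadratic series still yields a high coefficient on these six points, so a value near 1 is best read as evidence of a strong linear trend over the tested range rather than as a proof of linear complexity.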

Discussion
Relational databases have long been the preferred solution for storing data. However, with the development of the internet, more computing resources are needed to handle fast-growing data. Computing resources can be extended by vertical scaling or by horizontal scaling. Vertical scaling generally means upgrading the CPU, memory, etc., of a single machine, but this method is costly and has limited scalability. Horizontal scaling instead uses a cluster of small computers for large-scale computing tasks. Since relational databases are not well suited to being built on clusters, NoSQL (Not Only SQL) databases have emerged.
NoSQL databases are divided into four categories according to their storage methods [32]: (1) key-value storage, (2) columnar storage, (3) document storage, and (4) graph storage. In the NoSQL data storage model, both columnar storage and document storage can be understood as extensions of key-value storage. These NoSQL databases are based on distributed storage, and the data is horizontally sharded and distributed to multiple nodes. The two major sharding strategies are order-preserving partitioning (OPP) and random partitioning (RP) [33]. OPP supports range queries, but it may cause load-balancing problems. Current sharding methods mainly consider load balancing and range queries, and do not take privacy into account. We believe that our proposed entropy method can be used to protect privacy in NoSQL key-value storage databases because the information stored by each node can be quantified as EN(node_1), EN(node_2), ..., EN(node_n). However, at present, it is inefficient to calculate the entropy value in a NoSQL database. When a data record is generated, we need to determine on which node it should be placed to minimize the entropy, so we need to fetch the data of each node and parse it to calculate the entropy, which is a time-consuming task. Techniques such as delayed record writes, or caching the feature information of each node for the entropy calculation, may reduce the time overhead. However, these will always reduce the write efficiency of the NoSQL database. Most NoSQL databases must support efficient write operations, so our proposed entropy calculation method is currently only suitable for read-heavy applications. To extend our method to NoSQL databases, we need to propose a new entropy calculation method. If the key features of each node are cached, we can estimate the entropy value of the node after adding new data. This will be our next step in future work.
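The caching idea mentioned above can be illustrated with a small Python sketch. It is a speculative illustration of the future-work direction, not a method from this paper: it assumes that EN(node) is the Shannon entropy of the record values stored on a node, that the cached "feature" of a node is a table of value counts, and that a new record goes to the node whose entropy would increase the least. The names Node, entropy_if_added, and place are hypothetical.

```python
import math
from collections import Counter

def entropy_from_counts(counts):
    """Shannon entropy (bits) of a distribution given as value -> occurrence count."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in counts.values())

class Node:
    """A storage node that caches value counts instead of re-reading its data."""

    def __init__(self):
        self.counts = Counter()  # cached feature: record value -> occurrences

    def entropy(self):
        return entropy_from_counts(self.counts)

    def entropy_if_added(self, record):
        # Estimate the entropy after a hypothetical insert, using only the cache.
        trial = self.counts.copy()
        trial[record] += 1
        return entropy_from_counts(trial)

def place(nodes, record):
    """Store `record` on the node with the smallest entropy increase; return its index."""
    best = min(
        range(len(nodes)),
        key=lambda i: nodes[i].entropy_if_added(record) - nodes[i].entropy(),
    )
    nodes[best].counts[record] += 1
    return best

nodes = [Node(), Node()]
placements = [place(nodes, r) for r in ["a", "b", "a"]]
print(placements)
```

Copying the counter on every trial keeps the sketch short; an actual implementation would presumably update the entropy incrementally, and it would still add work to every write, which is the efficiency concern raised above.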

Conclusions
In this paper, we introduced information entropy as a tool to quantify privacy, which allows us to automatically discover potential privacy constraints between attributes without requiring users to define such constraints manually. Based on this, we provided a greedy method for vertically fragmenting the data using information entropy, which we called the minimum entropy fragmentation algorithm. With this approach, we can evaluate how much private information is contained in different vertical fragments. Our method is currently not applicable to NoSQL databases, but it could become a feasible privacy protection scheme for them once the efficiency of the entropy calculation is optimized.

Figure 2. Time cost for experiments on the adult database with respect to (a) changing the number of records, and (b) changing the number of attributes.

Table 1. An example of a patient's information sheet and the associated privacy constraints.

Table A, Constraints) Input: Table A: the table to be fragmented