Analytics on Anonymity for Privacy Retention in Smart Health Data †

Advancements in smart technology, wearable and mobile devices, and the Internet of Things have made smart health an integral part of modern living, improving individual healthcare and well-being. By enhancing self-monitoring, data collection, and sharing among users and service providers, smart health can promote healthy lifestyles, enable timely treatments, and save lives. However, as health data become larger and more accessible to multiple parties, they become vulnerable to privacy attacks. One way to safeguard privacy is to increase users' anonymity, since anonymity increases indistinguishability and makes re-identification harder. Still, the challenge is not only to preserve data privacy but also to ensure that the shared data are sufficiently informative to be useful. Our research studies health data analytics focusing on anonymity for privacy protection. This paper presents a multi-faceted analytical approach to (1) identifying attributes susceptible to information leakages by using an entropy-based measure to analyze information loss, (2) anonymizing the data by generalization using attribute hierarchies, and (3) balancing anonymity and informativeness with our anonymization technique, which produces anonymized data satisfying a given anonymity requirement while optimizing data retention. Our anonymization technique is an automated Artificial Intelligence search based on two simple heuristics. The paper describes and illustrates the detailed approach and analytics, including pre- and post-anonymization analytics. Experiments on published data are performed on the anonymization technique. Results, compared with other similar techniques, show that our anonymization technique gives the most effective data sharing solution with respect to computational cost and the balance between anonymity and data retention. This paper presents an approach to health data analytics focusing on anonymity for privacy protection.
The approach is applicable to both data producers (e.g., users of fitness trackers, or glucose and heart rate monitors) and data consumers (e.g., weight loss application services, healthcare professionals) to safeguard a given health data set from information leakages and re-identification. A common concept relies on making data anonymous. An analytical approach is proposed for (1) identifying attributes susceptible to information leakages by using an entropy-based measure to analyze information loss, (2) transforming the data into a more anonymous form by generalization using attribute hierarchies, and (3) anonymization that balances anonymity requirements and optimal informativeness by an automated Artificial Intelligence search using two simple heuristics. Unlike existing techniques, our anonymization approach preserves maximum information by avoiding extensive generalizations while still complying with the anonymity requirements. The proposed anonymization follows k-anonymity; therefore, it inherits the limitations of k-anonymization as discussed in [21]. We describe and illustrate the detailed approach and analytics, including pre- and post-anonymization analytics. We have conducted experiments to evaluate the effectiveness of our anonymization approach. The results show that our approach balances the trade-off between preserving privacy and retaining maximum information at an efficient computational cost. Future work includes a framework designed to integrate different measures to improve anonymization techniques as well as to further increase anonymity and protect privacy. The added metrics will help further the analysis of the anonymized data in terms of privacy. In that way, we aim to better understand what needs to be improved in anonymization and how successful the anonymization is. Author Contributions: Curation, Writing–original


Introduction
Smart health improves well-being and quality of life by providing customized care and treatments using health data collected from smart health devices (e.g., trackers of movements and heart rates [1], or mobile EKG (electrocardiogram) monitors for heart rhythms [2]). Telemedicine increasingly relies on health devices to treat chronic diseases, e.g., by monitoring glucose [3], blood sugar levels [4], or blood pressure [5] for patients with heart diseases and diabetes. Advancements in wearable technology and the Internet of Things enable smart health in self-monitoring and delivery of users' health data to doctors, hospitals, and fitness service providers [6]. Smart health can promote healthy lifestyles, enable timely treatments, and save lives. Furthermore, collection and sharing of health data can help researchers navigate scientific discoveries. For example, genetic-testing companies collect users' DNA (deoxyribonucleic acid) and survey data to gain insights on genetic diseases like Parkinson's, late-onset Alzheimer's, or celiac disease [7].
While smart health brings great benefits, it also poses potential threats to privacy as health data often contains sensitive and disclosed information. Collecting, storing, and sharing these data can put users' privacy at risks of being re-identified (even if personally multi-faceted analytical approach in Section 3, which can be viewed as pre-anonymization and anonymization steps. Section 4 describes experiments to evaluate performance and effectiveness of our anonymization technique when compared with similar techniques. Section 5 provides post anonymization analytics and Section 6 concludes the paper.

Related Work
Much research in data privacy addresses issues of anonymity [8–10,21,22,24–29]. Many works aim to measure anonymity [11–14], whereas some are concerned with the utility of the result [24,25,29]. The majority of the metrics concerned with anonymization quality [11,13,14] use Shannon's entropy to quantify average information [13,14]. Work in [13] uses entropy to estimate the average number of correct re-identifications (of individuals) based on binary queries. More correct responses increase the attacker's information about the individuals in the database and reduce the anonymity of those individuals. Longpre et al. proposed a measure [14] to estimate the average information loss when an attacker acquires additional information through querying. Again, the more information the attacker gains (i.e., the higher the average information loss), the less anonymity users have, which makes it easier for the attacker to breach disclosure and identify the users. Our paper suggests a method to analyze the data using this latter measure to pinpoint areas in the data that are susceptible to privacy attacks. Some anonymization techniques consider the utility of the data after anonymization for evaluation [24,25,29]. Among those, work in [25] considers information loss and calculates utility accordingly, whereas others consider classification accuracy to measure utility [24,29].
A large body of research in anonymity is concerned with anonymization techniques that transform a given data set into a more anonymous form for privacy preservation [8–10,14–22]. Most of these techniques find anonymized data (via generalization) that comply with a k-anonymity requirement [8–10,19,20], which guarantees that each group of unique critical attribute values has at least k records so that individuals cannot be re-identified easily. Some anonymization techniques use exhaustive search to find the minimal k-anonymization with minimal distortion [9,22]. Although this approach is not practically feasible, it provides a concrete formal model for minimal k-anonymization. Work in [9] searches for k-anonymizations using a binary search. Since binary search is a blind search, computational cost can still pose a problem when searching for all possible k-generalizations, as discussed in [15]. Other approaches focus on efficiency rather than minimal k-anonymizations [19,20]. Unlike the approaches that search for minimal generalizations blindly or that focus on efficiency rather than minimal anonymization, our proposed approach [15] aims to efficiently search for anonymized data that strike a balance between satisfying k-anonymity requirements and maximizing retention of the original data.

Proposed Multi-Faceted Anonymity Analytics Approach
This section describes the proposed approach, which analyzes anonymity in multiple aspects to protect the data owner's privacy. Figure 1 shows a general overview of the approach. As shown in the figure, the approach identifies the attributes susceptible to information leakages in the pre-analytics process. The user can choose to increase the anonymity of the vulnerable attributes before the anonymization procedure, or to anonymize the data directly using the findings on the vulnerable attributes as guidance. The user can also choose to anonymize the smart health dataset without applying pre-analytics. Then, the data are anonymized by our IAB (Intelligent Anonymity Balance) anonymization technique, which produces data satisfying a given anonymity requirement while optimizing data retention. The anonymized data are then analyzed in the post-analytics process to see if the critical attributes in the resulting anonymized data are vulnerable to information leakage (e.g., via inference by attackers). If they are, further actions can be taken (e.g., alerting data publishers or injecting additional "fake" data to increase indistinguishability of the vulnerable individuals). The first two steps are described in this section, whereas the post-analytics step is described in Section 5. For easy referencing, since we describe and illustrate each section with the same data, we briefly introduce them along with common terms and notations below.
Future Internet 2021, 13, 274
Given a (data) table T (or relational database) with A, a set of attributes $A_1, A_2, \ldots, A_n$, a data record represents an instance of a tuple $(a_1, a_2, \ldots, a_n)$, where each data entry $a_i \in dom(A_i)$, the set of all possible values of $A_i$. Consider Table 1: each row represents a unique tuple of attribute values, where the last column represents the number of records for each row. Here Row 2 represents a unique tuple (F, Low, 35, 52000, 143, Black, No) with three instances of records. As shown in Table 1, Rows 1, 4, 10, and 13 are obviously vulnerable to privacy threats since each has one record instance, giving low anonymity and easy re-identification. Next, we describe the analytical approach.

Assessing Vulnerability to Information Leakages
Before we transform a given health data set into a more anonymous form, we may investigate whether (and in what areas) the data are susceptible to information loss if an attacker uses some of their information to make inferences. To do this, we propose an analysis on various structures of the data using the entropy-based measure of Longpre et al. [14] to estimate the average information loss in the respective areas. The motivation of this pre-anonymization analytics is not simply to apply an existing measure in a typical manner but to use the measure systematically to gain useful information for privacy protection. For example, the finding that a certain attribute is vulnerable to information leakage may be linked to low anonymity that can be alleviated by modifying the original data. Next, we briefly describe the measure and its derivations from two sources.

Proposition 1. Shannon's information quantification.
Let X be a discrete random variable with outcomes $x_1, x_2, \ldots, x_n$, let $p(x_i)$ be the probability of $x_i$ being the outcome, and let $I(x_i)$ be the amount (or value) of information received when learning that $x_i$ is the outcome (sent). Then $I(x_i) = \log_2(1/p(x_i))$.

Proof.
The more probable a piece of information is, the less informative it becomes. Thus, $I(x_i)$ is inversely proportional to $p(x_i)$. Furthermore, for information of value y, the amount of information is measured by the number of bits needed to store y, i.e., $\log_2(y)$ bits. Thus, Shannon's quantification of information is $I(x_i) = \log_2(1/p(x_i))$.

Proposition 2. Longpre et al.'s entropy-based measure.
Given a data table of n individuals, let $p(r_i)$ be the probability of individual $r_i$ being re-identified. An attacker makes queries, each of which has m possible answers represented as a sequence $\langle a_1, a_2, \ldots, a_m \rangle$. All n individuals are partitioned into m partitions, where each partition $E_j$ contains the individuals whose attribute value matches the jth answer of the query, $a_j$. Subsequently, the average information loss is estimated as

$$\Delta S(\{E_j\}) = \sum_{j=1}^{m} p(E_j)\,\bigl(S_0 - S_j\bigr),$$

where $S_0 = -\sum_{i=1}^{n} p(r_i)\log_2 p(r_i)$ and $S_j = -\sum_{r_i \in E_j} p(r_i \mid E_j)\log_2 p(r_i \mid E_j)$ represent the initial average amount of information (before queries) and the average amount of information after query answer j, respectively.
Proof. If the attacker knows $p(r_i)$, then the amount or value of the information can be quantified as $\log_2(1/p(r_i))$ by Proposition 1. Thus, the average of these information values over all individuals gives the entropy $S_0 = -\sum_{i=1}^{n} p(r_i)\log_2 p(r_i)$. (Note that if an attacker does not have any information about the individuals, then everyone in the table is equally likely to be identified, with $p(r_i) = 1/n$.) Now suppose an attacker makes queries as stated. Each individual $r_i$ belongs to exactly one partition. Thus, $\sum_{j=1}^{m} |E_j| = n$. Suppose an individual $r_i$ is found to be in $E_j$; then $p(r_i)$ becomes $p(r_i \mid E_j)$, which is $p(r_i)/p(E_j)$. Since the set of candidates shrinks from all n individuals to those in $E_j$, the attacker's uncertainty about $r_i$ is reduced. Thus, the attacker gains information about the individual, who becomes more vulnerable to a privacy breach. The average amount of information when answer j is matched is $S_j = -\sum_{r_i \in E_j} p(r_i \mid E_j)\log_2 p(r_i \mid E_j)$, where $p(r_i)$ is changed to $p(r_i \mid E_j)$. This gives an average loss estimated as $\Delta S(\{E_j\}) = \sum_{j=1}^{m} p(E_j)\,(S_0 - S_j)$. Note that $\Delta S(\{E_j\})$ is maximum when every $S_j$ is zero, in which case $\Delta S(\{E_j\}) = S_0$ (i.e., no uncertainty about the individuals remains after the query). Hence, the normalized average information loss is $\Delta S(\{E_j\})/S_0$, whose value is in [0, 1].
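The measure in Proposition 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the uniform prior $p(r_i) = 1/n$ mentioned in the proof, so that $S_0 = \log_2 n$ and $S_j = \log_2 |E_j|$, and the function name `normalized_information_loss` and the toy data are hypothetical.

```python
import math
from collections import Counter

def normalized_information_loss(answers):
    """Normalized average information loss, Delta S / S_0, for one query.

    `answers` holds the query answer of each individual; individuals that
    share an answer form one partition E_j. Assumes a uniform prior
    p(r_i) = 1/n, so S_0 = log2(n) and S_j = log2(|E_j|).
    """
    n = len(answers)
    s0 = math.log2(n)               # initial entropy under the uniform prior
    if s0 == 0:                     # a single individual: nothing to lose
        return 0.0
    loss = 0.0
    for size in Counter(answers).values():
        p_ej = size / n             # probability of landing in partition E_j
        s_j = math.log2(size)       # entropy remaining inside E_j
        loss += p_ej * (s0 - s_j)
    return loss / s0

# Hypothetical toy data: an attribute that splits 8 people into 2 and 6.
toy_loss = normalized_information_loss(["a"] * 2 + ["b"] * 6)
```

An attribute whose values are all distinct yields a normalized loss of 1 (every answer identifies one person), while an attribute with a single shared value yields 0.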

Analytics on Information Leakages
Instead of applying the entropy-based measure of Longpre et al. to the entire table, we analyze which attribute is most vulnerable to information leaks (i.e., leaks the largest amount of information) on average when an attacker obtains information on the attribute values.
We use Table 1 to illustrate and explain the concept. Suppose an attacker queries information on attribute Sex. Table 1 has a total of 60 individuals with 27 females (F) and 33 males (M). When an attacker has no information, every individual is equally likely to be identified, with $p(r_i) = 1/60$. Therefore, the initial average amount of information is $S_0 = -\sum_{i=1}^{60} (1/60)\log_2(1/60) = \log_2 60 \approx 5.9$. For attribute Sex, there are two possible answers: <F, M>.
Thus, we partition 60 individuals into E 1 and E 2 for those who are F and M, respectively.
Based on Proposition 2 and Table 1, we estimate the normalized average information loss for attribute Sex; the result is reported in Table 2. Similarly, we can apply the measure to estimate the average information loss when the attacker queries other attributes, except the disclosed one (e.g., genetic risk). Table 2 shows the overall results obtained. The normalized results show how much information is leaked on average when the attacker knows the attribute value of the person they are looking for. An attribute that discloses more information has a higher value, up to the maximum possible value of one. As shown in Table 2, for the data in Table 1, the Age attribute is the most vulnerable as it leaks the most information. Next is Zip, followed by Race. These results are not surprising, as these are typical key attributes that lead to re-identification. Although we have not done this, the entropy-based measure of Longpre et al. can be applied to a combination of attributes at any level to give different insights. Here we apply the measure to each non-disclosed attribute for systematic preliminary findings.
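As a worked instance of Proposition 2 for the Sex attribute, the following derivation uses only the counts given in the text (27 F, 33 M, 60 individuals) and assumes a uniform prior inside each partition. The values are computed here for illustration and rounded to three decimals; they are not copied from Table 2.

```latex
\begin{align*}
S_0 &= \log_2 60 \approx 5.907, \qquad
S_1 = \log_2 27 \approx 4.755, \qquad
S_2 = \log_2 33 \approx 5.044, \\
\Delta S(\{E_1, E_2\}) &= \tfrac{27}{60}\,(S_0 - S_1) + \tfrac{33}{60}\,(S_0 - S_2)
  \approx 0.45 \times 1.152 + 0.55 \times 0.862 \approx 0.993, \\
\frac{\Delta S(\{E_1, E_2\})}{S_0} &\approx \frac{0.993}{5.907} \approx 0.168 .
\end{align*}
```

Under these assumptions, knowing a person's Sex removes roughly 17% of the attacker's initial uncertainty.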
In general, this pre-anonymization analytics can help us decide which attributes to pay attention to when we try to protect privacy. For example, we may pick a set of the most vulnerable attributes and increase their anonymity by generalization. In anonymization techniques, a set of such attributes is known as quasi-identifiers or a shield, and it is specified by users. The next section gives more details on basic anonymization techniques.

Increasing Anonymity by Generalization
The analytics in Section 3.1 show that, once an attacker obtains the query answers, information on some attributes (or sets of attributes) can lead to more average information loss than others. To protect against such loss, a common practice to increase anonymity is generalization and compression [8–10]. This section describes these basic concepts in more detail, along with the concept of k-anonymity, which is used in many anonymization techniques including ours (to be described in Section 3.3).
Generalization replaces an attribute value by a more abstract form or a more general but semantically consistent value. For example, we can replace the zip "12345" by "123**", or replace a "city" by its "country". The former can be viewed as a suppression of the last two digits of the zip, where "*" represents any non-negative digit. The semantic consistency of attribute values is governed by the attribute's conceptual hierarchy. By generalizing, the number of records of each unique tuple increases, which increases the tuple's degree of anonymity. Consequently, individuals are more indistinguishable, and their identities are better protected. Generalization provides many advantages for preserving data privacy, including consistent interpretation, traceability, and minimal content distortion [10].
We now explain the concepts in more detail via illustrations on Table 1, continuing our analytics from Section 3.1, where we identified Age, Zip, and Race as vulnerable. One can focus on generalizing these attributes to increase their anonymity, or explore other attributes based on domain experts' advice. Here we consider three attributes, Alcohol Consumption (AC), Age, and Zip, and their corresponding conceptual hierarchies as shown in Figure 2. For AC, there are four attribute values in the domain although only three appear in Table 1. The Age attribute values are discretized into four ranges, and the Zip attribute values are strings of digits where a more general value uses "*" for any non-negative digit. The Zip hierarchy is general in that it is applicable to any string of digits other than 9's.
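Generalization along a hierarchy can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, and the AC parent map is an assumed fragment of the hierarchy in Figure 2 (which is not reproduced here), based on the text's later use of "Yes" as the generalized AC value.

```python
def generalize_zip(zip_code: str, levels: int = 1) -> str:
    """Mask `levels` more trailing digits of a zip code with '*'."""
    visible = len(zip_code.rstrip("*"))        # digits not yet masked
    visible = max(visible - levels, 0)
    return zip_code[:visible] + "*" * (len(zip_code) - visible)

def generalize_value(value: str, parent: dict) -> str:
    """Move a value one level up its hierarchy; top-level values stay put."""
    return parent.get(value, value)

# Assumed fragment of the AC hierarchy (hypothetical, for illustration).
AC_PARENT = {"Low": "Yes", "Med": "Yes", "High": "Yes"}

masked_once = generalize_zip("52000")          # one step up the Zip hierarchy
masked_twice = generalize_zip("12345", 2)      # the "12345" example in the text
ac_general = generalize_value("Med", AC_PARENT)
```

Representing each hierarchy as a child-to-parent map keeps a generalization step a constant-time lookup, and masking digits from the right mirrors the suppression view of Zip generalization described above.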

For simplicity and without loss of generality, we illustrate generalization on parts of Table 1, namely Rows 1, 2, 5, 6, 9, and 10 with four attributes: Sex, Alcohol Consumption (AC), Age, and Zip, as shown in Table 3a, which serves as the initial data table.
In Table 3a, Row 1 and Row 5 each have one record. This makes the individuals in these two rows vulnerable to re-identification. If an attacker knows that the person they are looking for is a Female (F) with Medium (Med) AC who lives in Zip 52000, they will be able to identify the person and infer her age of 35 (see Row 1). Similarly, Row 5 is the only record of a Female of Age 75, so this person can be identified and her sensitive information of having High AC can be leaked.
To increase the anonymity of the individuals in Rows 1 and 5, we generalize the AC cells of all rows of females (i.e., Rows 1, 2, 5, 6) in Table 3a to obtain the results shown in Table 3b, where the changes and important areas are colored. In this table, the individual in Row 1 gains anonymity since Row 1 can be merged with the individuals in Row 2, creating a tuple (F, Yes, 35, 52000) with four records. However, this generalization is not enough to increase the anonymity of the individual in Row 5.
To increase the anonymity of the individual in Row 5, with the goal of merging Row 5 with Row 6, we need to further generalize both rows on Age and Zip according to the taxonomies in Figure 2. By generalizing the Age attribute two steps up its hierarchy and the Zip to 5200*, we obtain the results shown in Table 3c. As shown in this table, Rows 5 and 6 can now be merged. By merging Row 1 with Row 2, and Row 5 with Row 6, we obtain the final table shown in Table 3d. Here none of the unique tuples of attribute values has only one record. In fact, the record number indicates the degree of anonymity. Table 3d shows that there are at least three people in each group with the same attribute values, and hence their identities and information are better protected.
There are many ways to generalize. The above shows generalization at a cell level (i.e., a data entry of a specific row and column of a table). Another type of generalization is applied to all attribute values of the same level in the hierarchy. Thus, when a table is generalized on attribute A, the generalization is applied only to the table rows whose values of A are the child, or siblings of the child, of the same parent in the hierarchy. For example, generalizing Table 3a on Age replaces the Age values of Rows 1, 2, and 6 by one parent range in the Age hierarchy and those of the remaining rows by another. To improve efficiency, many anonymization techniques including ours (Section 3.3) adopt this interpretation when applying generalization. Next, we formally define important concepts for anonymization, namely, the k-anonymity requirement and other relevant terminology.
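Attribute-level generalization followed by merging of identical tuples can be sketched as below. This is a simplified illustration under stated assumptions: a table is represented as a mapping from each unique tuple to its record count, the function name is hypothetical, and the third row of the toy table is invented for illustration (only the first two rows follow Rows 1 and 2 as described in the text).

```python
from collections import Counter

def generalize_attribute(table, attr_index, parent):
    """Generalize one attribute across the table and merge equal tuples.

    `table` maps each unique tuple to its number of records. `parent` maps
    a value to its parent in the attribute hierarchy; values without an
    entry are left unchanged.
    """
    merged = Counter()
    for row, count in table.items():
        value = row[attr_index]
        lifted = parent.get(value, value)
        new_row = row[:attr_index] + (lifted,) + row[attr_index + 1:]
        merged[new_row] += count        # identical tuples pool their records
    return dict(merged)

# Columns: (Sex, AC, Age, Zip). Third row is hypothetical.
toy_table = {
    ("F", "Med", 35, "52000"): 1,
    ("F", "Low", 35, "52000"): 3,
    ("M", "No", 35, "52000"): 2,
}
ac_parent = {"Low": "Yes", "Med": "Yes", "High": "Yes"}   # assumed hierarchy
generalized = generalize_attribute(toy_table, 1, ac_parent)
```

After generalizing AC, the first two tuples collapse into (F, Yes, 35, 52000) with four records, which matches the merge of Rows 1 and 2 described earlier; the table shrinks from three rows to two.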

k-Anonymity Requirement for Anonymization
An anonymity requirement specifies the anonymity degree required on a subset of privacy-critical attributes, called a shield (or quasi-identifiers [24,25]). Given the degree k and the shield S, the k-anonymity requirement on shield S, denoted by <S, k>, is defined to be a set of S-projected tuples in which each unique tuple is guaranteed to have a minimum of k records. Let $[t, n_t]$ denote an ordered pair of a unique tuple t and its corresponding number of records $n_t$. We say that <S, k> is violated if there is $[b, n_b]$ such that $n_b < k$ for some S-projected tuple b. The k-anonymity required on shield attributes helps the user protect privacy without over-generalizing the tuple. For example, consider Table 3a with a given anonymity requirement <{AC, Age, Zip}, 3>. Note that each row represents a unique tuple projected on the shield. Rows 1, 5, and 6 violate the given anonymity requirement since their numbers of records are lower than three. However, Table 3d contains four distinct tuples, each of which has three or more records. Thus, Table 3d satisfies the given anonymity requirement.
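Checking a requirement <S, k> amounts to projecting the table on the shield and counting records per projected tuple. The sketch below is a minimal illustration; the function name, the column layout, and the toy rows are assumptions made for the example and are not taken from Table 3a.

```python
from collections import Counter

def violating_tuples(table, shield, k):
    """Return the shield-projected tuples that have fewer than k records.

    `table` maps each unique tuple to its number of records, and `shield`
    lists the column indices of the shield attributes.
    """
    counts = Counter()
    for row, n in table.items():
        projected = tuple(row[i] for i in shield)
        counts[projected] += n          # rows equal on the shield pool records
    return {t: n for t, n in counts.items() if n < k}

# Columns: (Sex, AC, Age, Zip); shield = {AC, Age, Zip}. Hypothetical rows.
toy_table = {
    ("F", "Med", 35, "52000"): 1,
    ("F", "Low", 35, "52000"): 3,
    ("M", "Low", 35, "52000"): 2,
}
violations = violating_tuples(toy_table, shield=(1, 2, 3), k=3)
```

Here the two Low rows differ only in Sex, which is outside the shield, so together they contribute five records to the projected tuple (Low, 35, 52000) and satisfy k = 3; only (Med, 35, 52000) violates the requirement.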
In general, for a given table, one can define more than one anonymity requirement, each of which can have a different anonymity degree and shield. In practice, the anonymity requirement is user-specified. If the anonymity degree is too low, the shield may not be able to protect individual identities (e.g., when the projected tuple becomes personally identifiable). On the other hand, if the anonymity degree is set too high, the data may not be informative since almost all tuples would be the same after anonymization [15]; data privacy is then over-protected. Such k-anonymity requirements are used in many anonymization techniques [8–10,19,20]. Next, we describe our anonymization technique.

Balancing Generalization with Data Retention in Anonymization
Given a data table and a set of k-anonymity requirements, this section discusses an analytical approach to transforming the data into anonymized data that satisfy the k-anonymity requirements and at the same time retain as much of the original data as possible. In AI (Artificial Intelligence), we can view this problem as a search in the space of all possible generalized tables on all possible attributes. The simplest approach is to search exhaustively for a solution. To improve efficiency, heuristic search can be employed. Our approach relies on two simple heuristics: the number of rows violating the anonymity requirements and the total number of table rows. The interplay between the two heuristics gives a balance between anonymity compliance and data retention.

Intelligent Anonymity Balance (IAB) Algorithm
We now briefly describe our anonymization algorithm, IAB (Intelligent Anonymity Balance), which is also discussed in [15]. We are given a data table T with a set of attributes A and a taxonomy tree for each shield attribute. Without loss of generality, we assume one anonymity requirement R with shield S ⊆ A. An overview of the IAB algorithm is shown in Algorithm 1.

Algorithm 1 The IAB Anonymization Algorithm
Procedure IAB Anonymization
Inputs: T, a table with a set of attributes A; R, a set of anonymity requirements with a set of anonymity shield attributes S ⊆ A; and the corresponding taxonomy tree of each attribute in S.
Output: a generalized table of T that satisfies R.
The algorithm iteratively generalizes a table on an appropriate attribute using its corresponding taxonomy tree to increase the anonymity degree. In Lines 1-4, a generalized table of T on each attribute in S is generated and maintained in set W. Each generalized table keeps track of two key heuristics: the number of rows that violate R and the total number of rows in the table. The former tells how close we are to finding a table that satisfies the anonymity requirements R, while the latter measures how much data is preserved. Among the generalized tables in W, the algorithm selects a table that has the highest number of rows with the lowest number of violating rows to be further generalized (Lines 5-10). The selected table is removed from W (as shown in Line 11).
The generalization process repeats until there are no more tables left in W or no table in W has more rows than a table already found to satisfy R. In other words, we stop expanding a table when it satisfies R, or when it is no larger than the biggest table found so far that satisfies R (even though it violates R). By monotonicity of generalization, further generalization can never grow a table. Therefore, the algorithm further generalizes only tables that are larger than those found to satisfy R so far: if a table violates R but is already no larger than the biggest table found so far to satisfy R, further generalizing it cannot produce a larger table that satisfies R. When the search ends, the algorithm selects the largest table found with no violation of the anonymity requirement.
Note that more than one such table of the same size may exist. In that case, the algorithm selects the first one found, as it is the table obtained with the fewest generalization steps. In other words, it retains the most specific data, closest to the given data table. Since the generalization procedure monotonically decreases the number of rows, our approach uses this property to prune the fruitless paths of an exhaustive search, and it therefore still finds an optimal solution. The optimal solution is the one that maximizes the information preserved from the original table (i.e., the table size) while providing the desired privacy by satisfying the anonymity requirement (i.e., zero violating rows). In short, the optimal solution has the maximum number of rows (maximum information preservation) among the tables that satisfy the anonymity requirement (desired anonymity).
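The search described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation of Algorithm 1: a table is assumed to be a dictionary mapping a tuple of attribute values to its record count, each generalization step is assumed to lift every value of one attribute to its parent in the taxonomy, and the names `generalize`, `violations`, `iab` and the toy data are all hypothetical.

```python
from collections import Counter

def generalize(table, attr, parent):
    """Lift every value of column `attr` one level up; merge identical rows."""
    merged = Counter()
    for row, count in table.items():
        value = row[attr]
        new_row = row[:attr] + (parent.get(value, value),) + row[attr + 1:]
        merged[new_row] += count
    return dict(merged)

def violations(table, shield, k):
    """Count rows whose shield-value group holds fewer than k records."""
    group = Counter()
    for row, count in table.items():
        group[tuple(row[i] for i in shield)] += count
    return sum(1 for row in table if group[tuple(row[i] for i in shield)] < k)

def iab(table, shield, parents, k):
    """Return the largest zero-violation generalized table found (or None)."""
    best = table if violations(table, shield, k) == 0 else None
    frontier = [] if best is not None else [table]
    seen = set()
    while frontier:
        # Heuristics: most rows first, then fewest violating rows.
        frontier.sort(key=lambda t: (-len(t), violations(t, shield, k)))
        current = frontier[0]
        if best is not None and len(current) <= len(best):
            break  # monotonicity: no remaining table can beat `best`
        frontier.pop(0)
        for attr in shield:
            child = generalize(current, attr, parents[attr])
            key = frozenset(child.items())
            if child == current or key in seen:
                continue  # top of the taxonomy reached, or duplicate table
            seen.add(key)
            if violations(child, shield, k) == 0:
                if best is None or len(child) > len(best):
                    best = child  # ties keep the first table found
            else:
                frontier.append(child)
    return best

# Hypothetical toy data: columns are (Zip, Age); requirement <{Zip, Age}, 2>.
toy = {("52001", "30"): 1, ("52002", "30"): 1, ("52003", "40"): 2}
toy_parents = {
    0: {"52001": "5200*", "52002": "5200*", "52003": "5200*"},
    1: {"30": "Adult", "40": "Adult"},
}
result = iab(toy, shield=[0, 1], parents=toy_parents, k=2)
```

On the toy data, generalizing Zip yields a two-row table with no violations, while generalizing Age keeps three rows but still violates the requirement; expanding the latter only produces a one-row table, so the two-row table is returned.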

Illustration
We apply the algorithm described in Section 3.3.1 to Table 1 with a given anonymity requirement R = <{Zip, Age, AC}, 3>. Based on the number of records in each row, Table 1 contains 6 rows with fewer than 3 records. Thus, these rows, namely Rows 1, 4, 9, 10, 13 and 14, violate R. Generalizing these violating rows of Table 1 on attribute Zip (and also generalizing the Zip values of the remaining rows, since their Zip values are siblings of those in the violating rows), we obtain the table shown in Table 4. As shown in Table 4, Row 1 and Row 9 can be merged into the first row of the resulting table. Rows 4 and 14 together satisfy R, since their common tuple of shield attribute values, (Low, 35, 5200*), has three records. However, these two rows cannot be merged. Therefore, the resulting Table 4 has its number of violating rows reduced to two (i.e., Rows 10 and 13), with a total of 21 rows.
Let T(n, m) denote a generalized table T, where n is the number of rows violating R and m is the number of rows in T. Tables 1 and 4 are represented by T(6, 22) and T1(2, 21), respectively. The generalization process repeats. The whole process can be viewed as a search with T(6, 22) as the root, as shown in Figure 3. The search starts from the root T(6, 22), i.e., Table 1 (or T), with 6 violating rows and a total of 22 rows. We first apply to T generalization on Zip, Age and AC to obtain tables T1(2, 21), T2(4, 21) and T3(0, 18), respectively. Recall that T1(2, 21) is actually Table 4.
As seen in Table 4, after merging Rows 1 and 9, we have [(Med, 35, 5200*), 4]. Hence the violation in these two rows is eliminated. Rows 4 and 14 also have the same shield attribute values after the generalization, that is, [(Low, 35, 5200*), 3]. Therefore, T1 has 2 violating rows remaining, namely Rows 10 and 13. Moreover, because Rows 1 and 9 are merged, the number of rows in T1 becomes 21. Thus, T1(2, 21) is obtained. The rest of the resulting tables can be obtained similarly.
T3(0, 18) has zero violations; however, we continue to search because there might be a table with more rows and zero violations.
The frontier nodes at this point are T1(2, 21) and T2(4, 21). They have the same number of rows; therefore T1(2, 21), having fewer violating rows, is selected to be expanded further. By generalizing T1(2, 21) on the three attributes we get the tables T4(2, 21), T5(0, 21) and T6(0, 17). At this point we stop because we have obtained T5(0, 21). We do not continue to search even though there are still tables with violations, such as T2(4, 21), because none of them has more rows than the current result, that is, 21. This means we have already found the table with the greatest number of rows and zero violations, as further generalizing other tables would only result in a table of the same size or smaller. Thus, the optimal result T5(0, 21) has been found and the algorithm stops.
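The walk-through above can be replayed mechanically on the node labels alone. The sketch below is only an illustration: the (n, m) labels and the parent/child structure are those quoted in the text for Figure 3, while the function name `iab_trace` and the dictionary encoding are assumptions made here.

```python
# (violating rows n, total rows m) for each table, as quoted in the illustration.
labels = {"T": (6, 22), "T1": (2, 21), "T2": (4, 21), "T3": (0, 18),
          "T4": (2, 21), "T5": (0, 21), "T6": (0, 17)}
# Only the expansions described in the text are encoded.
children = {"T": ["T1", "T2", "T3"], "T1": ["T4", "T5", "T6"]}

def iab_trace(root):
    """Replay the two-heuristic search; return (solution, expanded nodes)."""
    frontier, best, expanded = [root], None, []
    while frontier:
        # Prefer the most rows, then the fewest violating rows.
        node = min(frontier, key=lambda t: (-labels[t][1], labels[t][0]))
        if best is not None and labels[node][1] <= labels[best][1]:
            break  # no frontier table is larger than the best solution
        frontier.remove(node)
        expanded.append(node)
        for child in children.get(node, []):
            n, m = labels[child]
            if n == 0:
                if best is None or m > labels[best][1]:
                    best = child
            else:
                frontier.append(child)
    return best, expanded

solution, expanded = iab_trace("T")
```

Only T and T1 are expanded: T3 is recorded as a first solution, T5 replaces it because it has more rows, and the search stops since neither T2 nor T4 has more than 21 rows.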

Evaluation and Experiments
This section evaluates the anonymization approach described in Section 3.3 by comparing its performance with that of two other similar anonymization techniques. We use two criteria for evaluating the resulting anonymized table: (1) the table must satisfy a given anonymity requirement with maximum data retention, and (2) the table must be found in a timely manner without excessive space.
Section 4.1 relates to (1) to evaluate correctness on Table 1, and Section 4.2 relates to (2) by discussion on experiments and results on public datasets. Since the anonymization is viewed as a search problem in this paper, we evaluate our approach by comparing the resulting table(s) with two other search methods: exhaustive search (Method 1) and greedy search (Method 2) [30]. The former is blind search, but the latter is a heuristic search, where the number of violating rows is the heuristic. We will compare results obtained by our approach with Methods 1 and 2.

On Correctness
Consider Table 1 and the anonymity requirement R = <{Zip, Age, AC}, 3>. A partial search tree obtained by Method 1 is shown in Figure 4. As shown in Figure 4, T corresponds to Table 1 and T2 is the table obtained from generalizing T on attribute Zip (or Table 4). Method 1 generates all possible generalized tables and chooses, as a solution, a table with the maximum number of rows among those with zero violations. The resulting table is shown in Table 5 = T5(0, 21) (see Figure 4), which is the same result obtained by our approach. All violations are eliminated (since Rows 4 and 14, Rows 6 and 10, and Rows 11 and 13 have 3, 5, and 4 records, respectively, satisfying R). Furthermore, T5 has 21 distinct rows as Rows 1 and 9 are merged.
Method 1 (exhaustive search) and our approach produce the same anonymized table. However, as we will see later, their computational costs differ significantly. A compromise between the two is to use a greedy search.
Method 2 uses the number of rows that violate the anonymity requirement as a heuristic and each time expands the table with the minimum number of violations, as it has the highest chance of leading to a table with no violations. As seen from Figure 5, the result is achieved after generating 3 tables; the search stops when the number of violations becomes 0. In Figure 5, each node is again a table annotated with its heuristic value. The root node T(6) represents Table 1, with 6 violating rows. After applying a generalization on the Zip, Age and AC attributes, the resulting tables have 2, 4 and 0 violating rows, respectively.
The greedy solution produces Table 6 = T3 as a result. As shown in Table 6, there is no violation, so the table from Method 2 (greedy search) is correct in that it satisfies R. However, its number of rows is 19, which is fewer than the 21 rows of our solution. Therefore, even though Method 2 satisfies (2), it fails (1). Our proposed approach, on the other hand, satisfies both (1) and (2). We now show the performance results (Section 4.2) as reported in [15].
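For comparison, the greedy search of Method 2 can be replayed on the violation counts quoted for Figure 5. This is a small sketch, not the implementation used in the experiments; the names T1, T2 and T3 for the tables generalized on Zip, Age and AC, and the function name `greedy`, are labels assumed here.

```python
# Number of violating rows per table, as quoted for Figure 5.
violating = {"T": 6, "T1": 2, "T2": 4, "T3": 0}
children = {"T": ["T1", "T2", "T3"]}

def greedy(root):
    """Always expand the child with the fewest violations; stop at zero."""
    current, generated = root, 0
    while violating[current] > 0:
        kids = children.get(current, [])
        if not kids:
            return None, generated  # dead end: nothing left to generalize
        generated += len(kids)
        current = min(kids, key=lambda t: violating[t])
    return current, generated

solution, generated = greedy("T")
```

The search returns T3 after generating three tables, which matches the description above: it stops at the first zero-violation table without checking whether a larger one exists.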

On Performances
To demonstrate the effectiveness of our method, we experiment with the public heart disease datasets [31] collected from three different health organizations: Cleveland Clinic Foundation (dataset 1), Hungarian Institute of Cardiology, Budapest (dataset 2) and V.A. Medical Center, Long Beach, CA (dataset 3). In each data set, we select the six attributes most pertinent to illustrating the privacy protection of our anonymization approach. For the same reason, we also add the Zip attribute for our experiments, giving a total of seven attributes. Figure 6 summarizes the attributes of the three data sets with their corresponding attribute values, along with the size of each data set.
Future Internet 2021, 13, x FOR PEER REVIEW
Figure 6. Summary of the Three Data Sets.
The mechanism to anonymize the data relies on data generalization based on the taxonomy of data of each attribute. Here the taxonomy trees for relevant attributes are shown in Figure 7.
Note that a combination of the attributes selected in Figure 6 can be used to re-identify an individual heart patient. Recall that our method aims to quickly find an anonymized table that satisfies an anonymity requirement and maximally preserves the original data. To better understand how our method performs with respect to the trade-off among the criteria (i.e., data preservation, privacy protection, and efficiency), we compare our method with two other methods that each focus on a single criterion. Method 1, the exhaustive search, aims to find a solution satisfying the anonymity requirement with maximum information preservation (i.e., retaining the greatest number of data rows), whereas Method 2, the greedy search, aims to find a solution satisfying the anonymity requirement most efficiently. In terms of search, Method 1 exhaustively searches for a solution that has the maximum number of rows in the table, while Method 2 greedily searches for a table with no rows violating the anonymity requirement. See [30] for more details on search algorithms.

Comparisons on Single Shield
Shield attributes in this experiment are Age, Sex, Smoker and Zip, and the given anonymity requirement is <{Zip, Smoker, Sex, Age}, 5>. We evaluate in terms of three metrics: number of generalizations, number of table rows, and time. For the number of generalizations, we measure the total number of generalizations applied during the search for a solution. It indicates the degree of privacy protection: the more generalizations we use, the more anonymous the table becomes (but the less data are preserved). Each generalization transforms a table into a new table. However, when generalizing a table on multiple attributes, the order in which the attributes are generalized does not affect the resulting table. For example, generalizing a table on attribute Age and then generalizing the resulting table on attribute Sex gives the same table as first generalizing on Sex and then on Age. Hence, we label the tables that have the same generalizations as duplicates and keep only one of them. The experimental results on the total number of generalizations are shown in Figure 8.
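One way to detect such duplicates is to identify a generalized table by how many times each attribute has been generalized, rather than by the order of the steps. The sketch below assumes that every step lifts a whole attribute one level in its taxonomy; the function names and the level encoding are hypothetical and are not taken from the paper.

```python
def generalize_once(levels, attribute):
    """Return a new level map with `attribute` lifted one taxonomy level."""
    new_levels = dict(levels)
    new_levels[attribute] += 1
    return new_levels

def state_key(levels):
    """Order-independent identity of a generalized table."""
    return tuple(sorted(levels.items()))

start = {"Age": 0, "Sex": 0}
age_then_sex = generalize_once(generalize_once(start, "Age"), "Sex")
sex_then_age = generalize_once(generalize_once(start, "Sex"), "Age")

# Keep only one representative per distinct set of generalizations.
seen, kept = set(), []
for candidate in (age_then_sex, sex_then_age):
    key = state_key(candidate)
    if key not in seen:
        seen.add(key)
        kept.append(candidate)
```

Both orders map to the same key, so only one of the two tables is kept and counted.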
As seen in Figure 8, although each method finds a solution that satisfies the anonymity requirement, Method 2 uses the fewest generalizations in all three data sets. This is as expected because the number of generalization steps affects how quickly we can find the solution. Method 1, in contrast, has the highest number of generalizations in all three data sets, also as expected: it aims to maximize the information retained and thus searches over all possible generalized tables for the solution with the highest number of rows. The results for our method lie in between because it is a trade-off solution that compromises among the three criteria.
The second metric is the number of (distinct) rows in the solution table. As shown in Figure 6, data sets 1-3 initially have 303, 294 and 200 rows, respectively. The number of rows measures the quality of the result in terms of information preservation: the more distinct rows the table has, the more original information is preserved. The comparison results are shown in Figure 9. As shown in Figure 9, our method and Method 1 produce solutions with the same number of rows in all three data sets. In fact, both obtain an anonymity-compliant solution with the optimal number of rows. However, as observed in Figure 8, our method uses less effort in terms of the number of generalizations applied. This favors our method in that it takes less work (i.e., fewer generalizations) and yet retains optimal information (i.e., number of rows). The third metric is the time that each method takes to find its anonymity-compliant solution. Figure 10 shows the comparison results.
As expected, Method 2 takes the least time by design (since it searches greedily and returns as soon as it finds a solution, see Section 4.1) and Method 1 takes the most time to find the solution in all three data sets. This is because time is associated with the effort spent on generalization and thus with the number of generalizations. Our method, in contrast, gives a compromise in that it is relatively fast at finding a solution and also retains a high number of data rows.

Comparisons on Varying Anonymity Requirements
Given a fixed k with varying shield attributes in the anonymity requirement, the biggest factor in both the number of generalizations and the total time is the choice of shield attributes. The taxonomy trees of the attributes and the number of selected attributes both affect the results.
Intuitively, the more attributes the shield has, the more alternatives for generalization there are. Similarly, if the shield attributes have deeper taxonomy trees, there will be more generalizations. As also discussed in [15], to demonstrate that our method still performs well on various shields, we experimented with different shield set sizes and attributes on the same data set, namely dataset 1 (Cleveland). The results are shown in Table 7. As shown in the top partition row of Table 7, the anonymity requirement with the greatest number of shield attributes produces the highest number of generalizations in all methods. In the next two partition rows, between the two anonymity requirements with three attributes, the requirement <{Zip, Smoker, Age}, 5> in the third partition row produces a greater number of generalizations than the requirement <{Zip, Smoker, Sex}, 5> for all methods. This is as expected because the taxonomy tree of Age is larger than that of Sex. In fact, the size of the taxonomy tree of a shield attribute can influence the number of generalizations more than the number of attributes in the shield. As shown in Table 7, the requirement <{Zip, Smoker, Sex}, 5> (second partition row) has more attributes than the requirement <{Zip, Age}, 5> (last partition row) and yet it produces a smaller number of generalizations. This is because the taxonomy tree of Age is deeper than those of Sex and Smoker. In all cases of varying shields in the anonymity requirement, Method 1 generates the highest number of generalizations and Method 2 the lowest, while ours is in between, as it is designed to balance the trade-off between privacy protection (i.e., generalizations) and data preservation (i.e., rows). Our method aims to obtain an anonymized table with maximum information preservation by generating only the required amount of generalization.
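A rough way to see why tree depth can outweigh the number of shield attributes is to count the distinct generalized tables an exhaustive search could visit. Assuming each attribute is generalized one whole level at a time and duplicates are removed, a table is determined by one level per attribute, so the count is the product of (depth + 1) over the shield. The depths below are hypothetical placeholders, not the depths of the taxonomy trees in Figure 7.

```python
from math import prod

# Hypothetical taxonomy depths (number of generalization levels per attribute).
depth = {"Zip": 2, "Smoker": 1, "Sex": 1, "Age": 4}

def lattice_size(shield):
    """Upper bound on the number of distinct generalized tables for a shield."""
    return prod(depth[attribute] + 1 for attribute in shield)

three_shallow = lattice_size(["Zip", "Smoker", "Sex"])  # 3 * 2 * 2
two_with_age = lattice_size(["Zip", "Age"])             # 3 * 5
```

With these assumed depths, the two-attribute shield containing Age spans more candidate tables than the three-attribute shield without it, which is the qualitative pattern reported for Table 7.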
As shown in Table 7, comparing the number of rows of the resulting tables generated by all methods using varying shields in the anonymity requirement, ours and Method 1 generate the maximum number of rows, while the results of Method 2 are slightly lower in all but one case.
In general, Method 1 finds the anonymity-compliant table that has the maximum number of distinct rows by searching through all possible generalizations. Thus, the search is exhaustive and an optimal solution (i.e., an anonymity-compliant generalized table with the maximum number of rows) is guaranteed. If there are multiple tables with the same number of rows, the first solution found is selected, as it has the fewest generalizations (less time). Even though Method 1 generates a solution with maximum information preservation, its exhaustive search, which requires many generalizations, may not be desirable in practice.
On the other hand, Method 2 finds an anonymized table by greedily searching for a generalized table that has the minimum number of anonymity violations (i.e., zero). Using a heuristic on the number of violating rows, Method 2 finds a solution without going through all possible generalized tables. Thus, its search is more efficient than that of Method 1. However, finding the optimal solution (i.e., a zero-violation generalized table with the maximum number of rows) is not guaranteed.
Our method combines Methods 1 and 2 by quickly finding a generalized table that has zero anonymity violations and is also the most informative table (i.e., it has the maximum number of distinct rows, like Method 1). The method is heuristic, using the two heuristics above (number of violating rows and number of rows), and thus saves time compared to the exhaustive Method 1. Furthermore, when a generalized table with no violation is found, further generalizing it is not necessary since, by the monotonicity property of generalization, generalization will not produce a table with a higher number of rows. The reason is that generalization creates rows with common values and therefore always maintains or shrinks the table size. Our method uses the monotonicity property to reduce search time while still guaranteeing an optimal solution (i.e., an anonymity-compliant generalized table with the maximum number of rows). The experimental results obtained are consistent with the design of each of the above methods.
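The pruning argument can be written compactly. The notation here is introduced only for this note and is an assumption, not the paper's: T' ⪯ T means that T' is obtained from T by zero or more generalization steps, |T| is the number of distinct rows of T, v_R(T) is the number of rows of T violating R, and T_best is the largest zero-violation table found so far.

```latex
% Monotonicity of generalization, and the pruning rule it justifies.
\begin{align*}
T' \preceq T \;&\Longrightarrow\; |T'| \le |T|
  && \text{(monotonicity)}\\
|T| \le |T_{\text{best}}| \ \text{and}\ v_R(T_{\text{best}}) = 0
  \;&\Longrightarrow\; |T'| \le |T_{\text{best}}| \ \text{for all } T' \preceq T
  && \text{(expanding } T \text{ cannot improve the solution)}
\end{align*}
```

The second line follows from the first by transitivity, which is why tables that are no larger than the best solution found so far can be left unexpanded without losing optimality.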

Post Anonymization Analytics
After an original data table has been anonymized, the table is ready to be released for public or sharing among appropriate parties. However, in case when the data that are privacy critical, further analyzing the anonymized table can be pursued. In this paper, we examine the resulting anonymized table obtained by our technique as described in Section 3.3. To illustrate, consider the anonymized Cleveland dataset 1 (as obtained in Section 4). By applying the approach described in Section 3.1 using the Longpre et al.'s entropy-based measure, on the anonymized Cleveland dataset 1, we can further assess the effectiveness of the anonymization. Table 8 shows the overall results of this post anonymization analytics where each row indicates vulnerability to information leakages (i.e., normalized average information loss) given an attacker obtains information on the corresponding attribute in each column. As shown in Table 8, the first row gives the vulnerability "Before Anonymization". We see that the Zip attribute is most vulnerable as it leaks most information of 0.99. Next is Cholesterol and Age that leaks 0.85 and 0.61, respectively. These results can help partially select potential shield attributes although in practice, they are user-specified.
Second row shows the vulnerability on anonymized table that is in compliance with the requirement of 5-anonymity on the shield attribute set {Age, Zip}, as denoted by <{Age, Zip}, 5>. This shield attributes agree with the vulnerability assessment for the most part and omit Cholesterol as it may not be acquired easily through binary query. As shown on second row of Algorithm 1, the Age attribute is now not leaking any information. Each record now has the same Age value as a result of anonymization (i.e., generalization).
Note that only the attributes in the shield set (i.e., Age and Zip) have reduced average leakage (e.g., Age's loss drops from 0.61 to 0, and Zip's loss from 0.99 to 0.4). This is expected, since generalization is applied only to those attributes and changes only their values. The values of all other attributes stay the same after anonymization, and so does the information disclosed through them.
Similarly, in the third row of Table 8, the anonymization satisfies <{Age, Zip, Smoker, Sex}, 5>. Compared with Row 2, two more attributes (i.e., Smoker and Sex) are added to the anonymity requirement, and the leakages on Age and Zip remain the same (i.e., they are generalized to the same level as in the previous case). However, the leakage on Smoker drops to 0, whereas the leakage on Sex remains the same. This means that the anonymization process generalizes the Smoker attribute but not the Sex attribute.
The overall analytics on leakage after anonymization indicate that the anonymization is effective, since every shield attribute either maintains or reduces its average information loss after the anonymization. Note that the average information loss is inversely related to anonymity. When the information loss is high (i.e., an attacker obtains more information), the anonymity is low because the attacker can use the information to better distinguish individuals for re-identification. Therefore, this measure can serve as an indicator of anonymity.
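The inverse relation between leakage and anonymity can be sketched with a simple stand-in measure. The code below is not the measure of Longpre et al.; it assumes, for illustration only, that the leakage of an attribute is the Shannon entropy of its value distribution divided by log2(n), where n is the number of records. The function name `normalized_leakage` and the toy columns are hypothetical.

```python
import math
from collections import Counter


def normalized_leakage(values):
    """Stand-in leakage score in [0, 1]: entropy of the column over log2(n).

    0 means every record shares one value (nothing distinguishes records);
    1 means every record has a unique value (the attribute identifies records).
    """
    n = len(values)
    if n <= 1:
        return 0.0
    counts = Counter(values)
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    return entropy / math.log2(n)


# Toy columns with 8 records each (values invented for this example).
zip_original = ["48103", "48104", "48105", "48109", "48201", "48202", "48207", "48209"]
zip_generalized = ["481**"] * 4 + ["482**"] * 4
age_generalized = ["*"] * 8

leak_original = normalized_leakage(zip_original)
leak_generalized = normalized_leakage(zip_generalized)
leak_suppressed = normalized_leakage(age_generalized)
print(leak_original, leak_generalized, leak_suppressed)
```

Under this assumed measure, an attribute whose values are all identical after generalization scores 0, in line with the Age column of Table 8, while a nearly unique attribute such as the original Zip scores close to 1.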

Conclusions
Smart health has significant impacts on healthcare and wellness. However, it also poses privacy threats to users. As health data grow larger and become more accessible to multiple parties, users lose more control of their data, which become increasingly vulnerable to attacks. Furthermore, the challenge is not only to protect the data but also to ensure that the shared data are sufficiently informative. Increasing users' anonymity is a basic remedy, as anonymity increases indistinguishability. The more indistinguishable people are, the more anonymous they become, and thus the better their information and identity are concealed.
This paper presents an approach to health data analytics focusing on anonymity for privacy protection. The approach is applicable to both data producers (e.g., users of fitness trackers, or of glucose and heart rate monitors) and data consumers (e.g., weight loss application services, healthcare professionals) to safeguard a given health data set from information leakage and re-identification. The underlying concept is to make the data anonymous.
The proposed analytical approach consists of (1) identifying attributes susceptible to information leakage by using an entropy-based measure to analyze information loss, (2) transforming the data into a more anonymous form by generalization using attribute hierarchies, and (3) anonymizing the data in a way that balances anonymity requirements against optimal informativeness by means of an automated Artificial Intelligence search using two simple heuristics. Unlike existing techniques, our anonymization approach preserves maximum information by avoiding extensive generalization while still complying with the anonymity requirements. The proposed anonymization follows k-anonymity; therefore, it inherits the limitations of k-anonymization as discussed in [21]. We describe and illustrate the detailed approach and analytics, including pre and post anonymization analytics. We have conducted experiments to evaluate the effectiveness of our anonymization approach. The results show that our approach balances the trade-off between preserving privacy and retaining maximum information at an efficient computational cost. Future work includes a framework that integrates different measures to improve anonymization techniques, increase anonymity, and better protect privacy. The added metrics will further the privacy analysis of the anonymized data. In this way, we aim to better understand how successful an anonymization is and what needs to be improved.