θ-Sensitive k-Anonymity: An Anonymization Model for IoT-Based Electronic Health Records

Abstract: The Internet of Things (IoT) is a rapidly growing technology that is increasingly used to digitize Electronic Health Records (EHR). IoT devices collect patients' data for the data holders, who then publish these data. However, the data collected through IoT-based devices are vulnerable to information leakage and pose a potential privacy threat. Privacy protection methods are therefore needed to prevent the identification of individual records in EHR. Significant research contributions exist, e.g., p+-sensitive k-anonymity and balanced p+-sensitive k-anonymity, for implementing privacy protection in EHR. However, these models have privacy vulnerabilities, which this paper identifies through two new types of attack: the sensitive variance attack and the categorical similarity attack. A mitigation solution, the θ-sensitive k-anonymity privacy model, is proposed to prevent these attacks. The proposed model works effectively for all k-anonymous group sizes and prevents sensitive variance, categorical similarity, and homogeneity attacks by creating more diverse k-anonymous groups. Furthermore, we formally model and analyze the base and the proposed privacy models to show the invalidation of the base model and the applicability of the proposed work. Experiments show that our proposed model outperforms the others in terms of privacy security (14.64%).


Introduction
Today's highly connected technological society generates a huge amount of digital data, termed Big Data, collected through internet-enabled devices known as the Internet of Things (IoT) [1]. Billions of these IoT devices sense and collect data, e.g., patients' Electronic Health Records (EHR) [1][2][3][4]. The collected data are then shared with corporate or government bodies for research and policymaking. However, preserving the privacy of individual records is an important goal when sharing data collected through IoT-enabled devices [1][2][3][4][5][6]. This is because these data contain names or other unique identification (explicit identifiers, A^EI), attributes such as age, gender, and zip code (quasi-identifiers, A^QI), and health-related private information (sensitive attributes, A^S) [7][8][9][10][11][12]. To preserve privacy, eliminating the A^EI before sharing or publishing the data is not enough [11]. For an attacker or adversary, the quasi-identifiers (QIs) are partial identifiers that can be linked to externally available data, e.g., voting or census data, to re-identify an individual and expose that individual's A^S; this is known as a linking attack [10][11][12].
To implement data privacy, many cryptographic techniques [13,14] have been proposed. However, these techniques have high computational overheads. A simpler approach is data anonymization, which conceals an individual's identity in a small crowd of records before data publishing. The publishing of such anonymized records is known as Privacy Preserving Data Publishing (PPDP) [11]. A plethora of PPDP methods have been proposed [7][8][9][10][11][12][15][16][17][18][19]. These techniques are broadly classified into:
• Identity disclosure prevention: Generalizing [7][8][9] the QI values of a group of records from more specific to less specific values, e.g., k-anonymity [7,8], where every record should be indistinguishable from at least k-1 other records. An intruder/attacker therefore cannot re-identify an individual with a probability higher than 1/k.
• Attribute disclosure prevention: Preventing the disclosure of private information (A^S information) about an individual. Examples are the l-diversity [15] and t-closeness [16] privacy models.
In this paper, a variance-based privacy model is proposed to prevent attribute disclosure risk. For sensitive attribute privacy, the p+-sensitive k-anonymity and (p, α)-sensitive k-anonymity [17] privacy model is a state-of-the-art privacy model in which the sensitive values are grouped into four categories. To create a k-anonymous group of records, called an equivalence class (EC), l-diversity [15] is applied. However, two new attacks are possible: the sensitive variance attack and the categorical similarity attack. These attacks breach the privacy of the p+-sensitive k-anonymity and (p, α)-sensitive k-anonymity [17] algorithm because the A^S values in an EC may come from a single sensitive category or show low diversity at the A^S category level. The proposed mitigation solution, the θ-sensitive k-anonymity privacy model, provides a numerical measure of privacy strength for thwarting the attribute disclosure risk. The proposed approach also appends a small number of noise tuples to increase the variability in an EC, if needed. To minimize the utility loss, the proposed algorithm uses bottom-up generalization (i.e., the local recoding mechanism [18]) because it distorts the data less than global recoding techniques [12]. The following section presents the motivation of our work.

Motivation
A broad examination of the PPDP models for preventing attribute disclosure risk [11,15-18] shows that the worth of each model lies in the diversity of an EC, i.e., in whether its sensitive values belong to different categories. Such variability of A^S values creates a diverse EC. Different privacy models employ different techniques to achieve variability in k-anonymous ECs. Repeated occurrences of the same sensitive value are the only obstacle to achieving the required diversity in an EC. The privacy models in [17] and [18] provide a meaningful approach to the attribute disclosure problem; however, the following limitations have been observed.
• p+-sensitive k-anonymity and (p, α)-sensitive k-anonymity [17]: This model is a modified version of p-sensitive k-anonymity [19] for preventing a similarity attack. However, the p+-sensitive k-anonymity and (p, α)-sensitive k-anonymity models can have zero diversity at the A^S category level, which may lead to a categorical similarity attack. A more powerful attack available to an adversary is the sensitive variance attack, which exploits low variability at the A^S category level. As the adversary's knowledge (background knowledge, BK) grows, the privacy level can be breached, which may cause attribute disclosures. The proposed θ-sensitive k-anonymity privacy model provides a privacy solution to prevent all such attacks.
• Balanced p+-sensitive k-anonymity and (p, α)-sensitive k-anonymity [18]: This model is an enhanced version of the p+-sensitive k-anonymity model. It balances the categorical-level sensitive attributes in each EC. However, it still has low diversity at the A^S category level and works only for k-anonymous ECs of size greater than three.
To solve the problems of homogeneity, categorical similarity, and sensitive variance attacks in the p+-sensitive k-anonymity and (p, α)-sensitive k-anonymity model [17], we propose the θ-sensitive k-anonymity privacy model in this paper. The categorical-level similarity and small EC size problems in the balanced p+-sensitive k-anonymity and (p, α)-sensitive k-anonymity model [18] are also addressed: the proposed model achieves a more balanced and diverse EC even at the category level and can be executed on ECs of small size k, i.e., k = 2.

Contributions
The proposed θ-sensitive k-anonymity privacy model multiplies the variance (σ²) of a fully diverse EC by an observed value (Observation 1, µ), which produces a threshold value θ. The θ value ensures prevention against attribute disclosure in each EC, which collectively results in the privacy of the given dataset.
The contributions of this paper are as follows:
• A new θ-sensitive k-anonymity privacy model is proposed in which privacy in an EC is achieved through a threshold value, θ. The θ value for an EC is obtained by multiplying the variance by an observation value. The variance-based diversity in an EC prevents the sensitive variance attack, which automatically prevents the categorical similarity attack. In the proposed model, the A^S values are checked not only against the subsequent ECs; a cross-check is also performed for the last EC. If the required privacy is not achievable with the existing A^S values, noise is added to reach the required diversity.
• We formally modeled and analyzed the base model in [17] and the proposed θ-sensitive k-anonymity privacy model using High-Level Petri Nets (HLPN).
• Simulation results show that the proposed θ-sensitive k-anonymity model has only 0.002679% higher privacy leakage than the baseline privacy, whereas its counterpart, the p+-sensitive k-anonymity model, has 14.65% higher privacy leakage than the baseline.
Paper Organization. The remainder of the paper is organized as follows. Section 2 explains related work. Preliminaries are discussed in Section 3. The considered attacks and the problem statement for p+-sensitive k-anonymity, along with its formal analysis, are presented in Section 4. Section 5 discusses the proposed θ-sensitive k-anonymity model and its formal analysis. In Section 6, the experiments and evaluations are provided. Section 7 concludes the paper.

Related Work
In this section, the literature related to the proposed privacy model is reviewed from various aspects. The data collected through IoT-enabled devices [1][2][3][4][5][6][20,21] must be anonymized before publishing because of the private information they contain. Anonymized data are published for maximum utility without disclosing the private information of an individual. For anonymization, the privacy models can be broadly classified into semantic [22,23] and syntactic [7][8][9][10][11][12][13][14][15][16][17][18][19] approaches. The semantic privacy models add a random amount of noise to preserve privacy, e.g., differential privacy models [22,23]. In differential privacy, the deletion or addition of an individual's record or of noise does not affect the data analysis results, while privacy is preserved. Syntactic privacy models create k-indistinguishable [7] ECs. In syntactic privacy, the two main privacy disclosure risks are identity disclosure [7][8][9][10][12] and attribute disclosure [11,15-18]. The k-anonymity model [7,8] is an example of preventing identity disclosure: it generalizes a set of records with respect to the QIs so that each record is indistinguishable from k-1 other records in the dataset. However, k-anonymity lacks the ability to provide attribute-level protection. Attribute disclosure releases the value of confidential attributes corresponding to an identified individual record. l-diversity [15] requires l distinct groups for the A^S in an EC; however, skewness and similarity attacks can still breach privacy because l well-represented sensitive attribute groups are not always possible over the existing A^S values. Similarly, for t-closeness [16], the threshold on the distance between the A^S distribution in an EC and that in the whole table results in low data utility, and the earth mover distance (EMD) is not an efficient prevention for attribute linkage [24,25].
In [26] by Torra, identity and attribute disclosure were both addressed. Jose et al. [27] proposed an adaptive two-step iterative anonymization approach. A privacy leakage for an attribute linkage attack was possible because of having numerous versions. An extended k-anonymity model was proposed by Rahimi et al. [28] to protect identity and attribute information. However, a BK attack is possible because the publisher is unaware of the adversary's knowledge. The k-join-anonymity model proposed by Sowmiyaa et al. [29] was the same as k-anonymity, which focuses only on identity disclosure risk. The (α, k)-anonymity model proposed by Wong et al. [30], used a global recoding technique, which has a high utility loss and, due to table linkage attack, it was susceptible to the disclosure of attributes.
The (k, e)-anonymization model proposed by Zhang et al. [31] publishes separate tables for A^S and QI to reduce the relationship between them, and adopts a permutation-based approach instead of generalization. In aggregated search, avoiding QI generalization is recommended for accuracy improvement; however, a probabilistic attack on the A^S is possible because of the one-time publication of the microdata. The (ɛ, m)-anonymity model [32] deals with numeric A^S; however, its applicability to categorical A^S is limited. Xiao et al. [33] worked on personalized anonymity, which uses a greedy personalized generalization approach. This model de-associated A^S and QI instead of modifying the association between them.
In Reference [19], the p-sensitive k-anonymity model relied on finding the closest neighbor. This model was then improved by Sun et al. [17] with a top-down specialization. In the generated anonymized datasets, the A^S values of each EC should come from at least p distinct categories. However, the algorithm developed in [17] is vulnerable to privacy leakage from sensitive variance, categorical similarity, and homogeneity attacks. In this paper, these privacy limitations are mitigated using the proposed θ-sensitive k-anonymity algorithm. The proposed privacy model is a syntactic privacy model for preventing attribute disclosure risk, which adds a fixed amount of noise to create k-anonymous ECs.

Preliminaries
Let an original microdata table MT = {EI, QI, S} (i.e., Table 1a) be the private static data (i.e., one-time release) that a publisher wants to publish. A tuple t ∈ MT belongs to an individual i, such that EI = {A^EI_1, A^EI_2, A^EI_3, …, A^EI_n}, QI = {A^QI_1, A^QI_2, A^QI_3, …, A^QI_n}, and S = {A^S} (this work considers only a single A^S). The k-anonymized data essentially consist of A^QI and A^S, while the A^EI are removed. An adversary can still link the A^QI with some external information (e.g., voter or census data) to perform a record linkage attack (i.e., identity disclosure) [34]. However, the k-anonymous A^QI values protect the records in an EC against the record linkage attack. For example, consider some common diseases in the 2-anonymous table (Table 1b) obtained from the original microdata Table 1a. Table 2 summarizes the notations used in this paper.

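Definition 1. k-anonymity [7,8]: A masked microdata table T' is k-anonymous if every tuple t ∈ T' is indistinguishable from at least k-1 other tuples with respect to A^QI, i.e., every EC G in T' contains at least k tuples:

$$T' \text{ is } k\text{-anonymous} \iff \forall G \in T' : |G| \geq k$$

where k is the anonymity level (as shown in Table 1b). The k-anonymity model hides each record in a crowd of at least k-1 other records, but it imposes no further restriction that would sufficiently protect the individuals' sensitive values. Under k-anonymity, the probability of linking a victim to a specific record through the A^QI values is at most 1/k.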
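The k-anonymity condition can be checked mechanically by grouping records on their quasi-identifier values. The following is a minimal sketch, assuming records are stored as dictionaries; the function name `is_k_anonymous`, the attribute names, and the sample records are hypothetical and are not taken from Table 1a or Table 1b.

```python
from collections import Counter


def is_k_anonymous(records, qi_attrs, k):
    """Return True if every QI combination occurs in at least k records."""
    groups = Counter(tuple(r[a] for a in qi_attrs) for r in records)
    return all(size >= k for size in groups.values())


# Hypothetical generalized records (illustrative only).
table = [
    {"age": "2*", "zip": "130**", "disease": "Flu"},
    {"age": "2*", "zip": "130**", "disease": "HIV"},
    {"age": "3*", "zip": "148**", "disease": "Cancer"},
    {"age": "3*", "zip": "148**", "disease": "Flu"},
]

print(is_k_anonymous(table, ["age", "zip"], 2))  # True
print(is_k_anonymous(table, ["age", "zip"], 3))  # False
```

Each group here corresponds to one EC, so an adversary who knows only the QI values faces at least k candidate records.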

Table 2. Notations used in this paper.
Symbol      Description
MS_n        Max frequency of A^S in EC_n
MS_{n-1}    Max frequency of A^S in EC_{n-1}
MS_c        Max frequency of A^S in EC_c
MS_b        Max frequency of A^S in EC_b
P           Places used in formal modeling
G_i         QI-group at index i
φ           Data types in formal modeling

Definition 2. l-Diversity [15]: A QI-block G_i (1 ≤ i ≤ m) in a masked microdata table T' having m QI-blocks is l-diverse if it contains at least l well-represented A^S values. In an l-diverse modified microdata table T', every QI-block is l-diverse:

$$T' \text{ is } l\text{-diverse} \iff \forall G_i \in T' : \mathrm{Count}(\mathrm{Dist}(G_i[A^{S}])) \geq l$$
Definition 3. t-closeness [16]: An EC is considered as t-closed if the distance between the distribution of the sensitive data in a class and the distribution of sensitive data in the whole table is equal to or less than threshold t. If every EC is t-closed, the whole table is t-closed. To calculate the distance while studying the transportation problem, researchers have explored some methods [33,35]. However, most of them focused on the Earth Mover Distance (EMD) method [15,36]. The EMD(P, Q) measures the minimum cost for transforming one distribution P to another distribution Q. It depends on the amount and distance of mass moved.
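As an illustration of Definition 3, the sketch below checks t-closeness for a categorical sensitive attribute. It assumes the equal-distance ground metric (every pair of distinct values is at distance 1), under which the EMD reduces to half the sum of absolute differences between the two distributions. The function names and the sample values are hypothetical.

```python
from collections import Counter


def distribution(values, domain):
    """Relative frequency of each domain value in `values`."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in domain]


def emd_equal_distance(p, q):
    """EMD when all distinct categorical values are at distance 1."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def is_t_close(ec_values, table_values, t):
    """Check whether one EC is t-close to the whole table."""
    domain = sorted(set(table_values))
    p = distribution(ec_values, domain)
    q = distribution(table_values, domain)
    return emd_equal_distance(p, q) <= t


# Hypothetical sensitive values: whole table and one EC.
table_values = ["Flu", "Flu", "HIV", "Cancer"]
ec_values = ["Flu", "Flu"]

print(is_t_close(ec_values, table_values, 0.5))  # True
print(is_t_close(ec_values, table_values, 0.4))  # False
```

Other ground distances (for example, ordered or hierarchical distances) would need a different EMD computation.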
Definition 4. p-sensitive k-anonymity [19]: The masked microdata table T' is p-sensitive k-anonymous if it is k-anonymous and each EC in T' has at least p distinct A^S values.
Here, G represents an EC that already satisfies k-anonymity and is a set of A^QI and A^S values. The number of distinct A^S values in G must be equal to or greater than p.
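Definition 4 can be sketched as a two-part check per EC: group size at least k and at least p distinct sensitive values. The code below is a minimal illustration; the helper names, attribute names, and records are hypothetical.

```python
from collections import defaultdict


def equivalence_classes(records, qi_attrs):
    """Group records that share the same QI values into ECs."""
    ecs = defaultdict(list)
    for r in records:
        ecs[tuple(r[a] for a in qi_attrs)].append(r)
    return list(ecs.values())


def is_p_sensitive_k_anonymous(records, qi_attrs, sensitive_attr, k, p):
    """Every EC needs at least k records and p distinct sensitive values."""
    for ec in equivalence_classes(records, qi_attrs):
        distinct = {r[sensitive_attr] for r in ec}
        if len(ec) < k or len(distinct) < p:
            return False
    return True


# Hypothetical records: the second EC repeats 'Flu'.
good_ec = [
    {"age": "2*", "disease": "Flu"},
    {"age": "2*", "disease": "HIV"},
]
bad_ec = [
    {"age": "3*", "disease": "Flu"},
    {"age": "3*", "disease": "Flu"},
]

print(is_p_sensitive_k_anonymous(good_ec, ["age"], "disease", k=2, p=2))           # True
print(is_p_sensitive_k_anonymous(good_ec + bad_ec, ["age"], "disease", k=2, p=2))  # False
```

The second call fails because the homogeneous EC exposes 'Flu' for every record in it, even though the table is 2-anonymous.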
Definition 5. Categorical similarity attack: A categorical similarity attack is possible when, in the l-diverse modified microdata T' (satisfying k-anonymity and l-diversity), the sensitive values of an EC all belong to a single sensitive category out of the p distinct A^S categories, and an adversary knows this.
Definition 6. Sensitive variance attack: The privacy leakage in an EC due to the low variability of sensitive values from p distinct A^S categories.
Definition 7. High-Level Petri Nets (HLPN) [37]: The behavior of a system and its mathematical properties are modeled via HLPN. An HLPN is a 7-tuple N = (P, T, F, φ, R, L, M0), where P is the set of places, represented by circles; T is the set of transitions in the system, represented by rectangular boxes, such that P ∩ T = ∅; F represents the flow relations such that F ⊆ (P × T) ∪ (T × P); φ maps places to data types; R represents the rules or properties for transitions that verify the correctness of the underlying system; L represents labels on F; and M0 is the initial marking.
The following section reviews the p+-sensitive k-anonymity model to highlight its shortcomings concerning the sensitive variance (S-Variance) attack.
Definition 8. p+-sensitive k-anonymity [17]: A masked microdata table T' is p+-sensitive k-anonymous if it fulfills k-anonymity and, for each EC in T', the number of distinct categories to which its A^S values belong is equal to or greater than p.
Here, C denotes the categorization of the A^S values in an EC that already fulfills the p-sensitive k-anonymous approach. The number of distinct categories (Table 3 [17]) in each EC must be equal to or greater than p. Table 4a, obtained from Table 1a, shows the p+-sensitive k-anonymity model with p = 2, k = 4, and c = 2. The ECs column in Table 4a is not part of the published table.
Definition 9. (p, α)-sensitive k-anonymity [17]: A modified microdata table T' is (p, α)-sensitive k-anonymous if it fulfills the k-anonymity property and each QI-group contains p distinct sensitive attribute values with a minimum weight of at least α.
Here, G represents all groups in the masked microdata table T' that already fulfill the p-sensitive k-anonymity property. A weight is assigned to each category, and the p sensitive values must have a weight w in each category that is at least α. Table 4b, obtained from Table 1a, shows (p, α)-sensitive k-anonymity.
The sensitive variance and categorical similarity attacks differ only slightly with respect to the variability of A^S in an EC. The sensitive variance attack is more powerful than the categorical similarity attack, i.e., the categorical similarity attack is a special case of the sensitive variance attack. Therefore, preventing attribute disclosure through the sensitive variance attack automatically covers disclosure through the categorical similarity attack. EC2 and EC3 in Table 4a, obtained through the p+-sensitive k-anonymous approach, are exposed to categorical similarity and sensitive variance attacks, as explained in Table 5. Table 5 shows the variance calculation for these ECs: the more diverse EC2 has a high variance and the less diverse EC3 has a small variance.
To calculate the variance of an EC, an ordered weight is given to the A^S values such that the higher the frequency (f), the lower the weight (x). For example, consider EC3 in Table 4a, i.e., Flu = 2, Cancer = 1, HIV = 1, where the number against each sensitive value is its frequency of occurrence in EC3.
If an EC, e.g., EC2, is fully diverse (i.e., size 4 and 4-diverse), then the frequencies are Hepatitis = 1, Phthisis = 1, Asthma = 1, Obesity = 1. Because each A^S value occurs only once, EC2 has a higher variance than EC3.
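The variance values quoted for EC2 and EC3 can be reproduced with a frequency-weighted population variance over the ordered weights. The sketch below is one reading of the procedure and is an assumption, not the authors' code: distinct values are ranked by descending frequency, the rank (1, 2, 3, …) is used as the weight x, and the variance is computed as Σ f·(x − mean)² / N. The function name `ec_variance` is hypothetical.

```python
from collections import Counter


def ec_variance(sensitive_values):
    """Frequency-weighted population variance of ordered weights.

    Assumption: the most frequent value gets weight 1, the next gets 2, etc.
    Ties in frequency do not change the result.
    """
    if not sensitive_values:
        raise ValueError("an EC must contain at least one record")
    freqs = sorted(Counter(sensitive_values).values(), reverse=True)
    n = sum(freqs)
    mean = sum(x * f for x, f in enumerate(freqs, start=1)) / n
    return sum(f * (x - mean) ** 2 for x, f in enumerate(freqs, start=1)) / n


ec2 = ["Hepatitis", "Phthisis", "Asthma", "Obesity"]  # fully diverse
ec3 = ["Flu", "Flu", "Cancer", "HIV"]                 # one duplicate

print(ec_variance(ec2))            # 1.25
print(round(ec_variance(ec3), 2))  # 0.69
```

Under this reading, a fully diverse EC of size n has weights 1..n, which gives the variances 0.25, 0.67, 1.25, and 2 for sizes 2 to 5.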
An adversary, using the category table (Table 3), can analyse the ECs in Table 4a and Table 4b published in [17]. The variability in some of the ECs is low concerning the category table. Therefore, the adversary can isolate the sensitive values that belong to a specific category and hence to individual records, and thus breaches the identity of an individual.
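The adversary's category-level analysis can be sketched as counting the distinct categories present in each EC. Table 3 is not reproduced in this section, so the `CATEGORY` mapping below is a hypothetical stand-in, as are the function names and the sample ECs.

```python
# Hypothetical stand-in for the category table (Table 3).
CATEGORY = {
    "HIV": 1, "Cancer": 1,
    "Hepatitis": 2, "Phthisis": 2,
    "Asthma": 3, "Obesity": 3,
    "Indigestion": 4, "Flu": 4,
}


def distinct_categories(ec_values, category=CATEGORY):
    """Set of sensitive categories that occur in one EC."""
    return {category[v] for v in ec_values}


def is_p_plus_sensitive(ec_values, p, category=CATEGORY):
    """p+-sensitivity for one EC: at least p distinct categories."""
    return len(distinct_categories(ec_values, category)) >= p


def has_categorical_similarity(ec_values, category=CATEGORY):
    """True if all sensitive values of the EC fall into one category."""
    return len(distinct_categories(ec_values, category)) == 1


ec_a = ["Hepatitis", "Phthisis", "Asthma", "Obesity"]
ec_b = ["Flu", "Indigestion", "Flu", "Indigestion"]

print(is_p_plus_sensitive(ec_a, p=2))    # True
print(has_categorical_similarity(ec_b))  # True
```

Note that `ec_a` has four distinct values but only two categories, which is the kind of low category-level diversity that the sensitive variance attack exploits.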

Critical Review of p + -Sensitive k-Anonymity Model
We formally modeled the p+-sensitive k-anonymity algorithm to check its invalidation with respect to a sensitive variance attack. The detailed formal verification of the p+-sensitive k-anonymity privacy model and its properties is given in [18], from Rule 1 to Rule 7; the model takes the original data as input from the end-user and processes it. The sensitive variance attack on the p+-sensitive k-anonymity model is shown in Figure 1, where the arrowheads show the data flow. Table 6 shows the variable types and their descriptions. The places and their descriptions are shown in Table 7. The attacker model in Figure 1 consists of three entities: the end-user, the adversary, and the trusted data publisher.

Table 6. Types used in high-level Petri nets (HLPN) for p+-sensitive k-anonymity.
Description of each data type:
User input for k-anonymity
p-sensitivity numeric value
Distinct categories set
Boolean value 1 or 0
Total distinct A^S values
Total distinct categories
Sensitive attribute for the i-th end user
Identifier attribute for the i-th end user

Table 7. Data-types, places, and their mapping.
In Figure 1, the input transition of the HLPN model carries the patients' records (original data). A trusted data publisher further processes the data to minimize the attribute disclosure risk. Generalization and the removal of identifying attributes transform the data into masked data. After generalization, the masked microdata table is ready to be published. An adversary then exploits the published data for its own benefit.
In this paper, the first seven rules in [18] are outlined briefly. For input k, the data publisher processes the original data to perform data generalization via the Generalize() function, and each EC is stored at the place micro mask table (MMT). The publisher then confirms the k-anonymity condition; if the check is successful, the corresponding Boolean variable is set. For each EC, the Dist() function calculates the distinct A^S values and stores its count at place ds. To further process this array, the Count() function counts the distinct A^S values and stores the result at place Count Ds. Before the categories are calculated, p-sensitive k-anonymity is verified in the masked data: a checking transition verifies that there are at least p distinct A^S values in each EC in the whole table, and the input value p is stored at a place for comparison. Thus, apart from the check for k-anonymity, a second check is performed for the p value. If the first check succeeds, the data already fulfill k-anonymity; if the second also succeeds, the transition is successful and the p-sensitive k-anonymous property is ensured. Next, the categories of the A^S values are computed using the function Get_Cat(). Both the A^S values and the categories are stored at place Giʹ for further processing. The actual improvement over the prior model, and the source of p+-sensitive k-anonymity, is the transition that follows: the distinct categories are calculated in a column using the sensitive values, and Comp C stores this number of distinct categories. The categories involved in each EC are then checked to confirm that there are at least p distinct categories. The minimum value for p is 2. The p+-sensitive k-anonymity properties are fulfilled if this final check succeeds.
Figure 1. HLPN for p+-sensitive k-anonymity attack model.
The p+-sensitive k-anonymity model is highly vulnerable to a sensitive variance attack. The main reason is the existence of non-diverse (low-variance) A^S values, such as 'Flu' in Table 4a and Table 4b, and 'HIV' in Table 4b. In Rule (1), through the function S-Variance_Attck(), an adversary attacks the released data using an external source of information, i.e., BK: the adversary takes the union of the published data with the external information and BK to locate an EC. In this way, specific individuals are mapped to specific ECs whose sensitive values belong to homogeneous categories, and hence the sensitive values from a specific category disclose an individual. Therefore, a sensitive variance attack occurs because of the low variance in the corresponding ECs.

Threshold θ-Sensitivity
The goal of the proposed θ-sensitive k-anonymity privacy model is to prevent the attribute disclosure of individual records in MT, collected through IoT-enabled devices [2-6]. Each EC in MT must satisfy the threshold value θ. The θ-sensitivity is the product of the variance (σ²) and Observation 1 (µ), as shown in Equation (1).

$$\theta = \sigma^{2} \times \mu \qquad (1)$$

where σ² is the variance of a fully diverse EC and µ is the observed value from Observation 1. The variance represents the diversity in an EC: high variance means high diversity, and vice versa. Achieving 100% diversity is almost impossible in all cases; however, a variance-based optimal frequency distribution of A^S values, combined with a fixed amount of noise addition, achieves enhanced data privacy in an EC. The proposed method is simple and effective. While examining each EC, if the variance of the EC is greater than θ (i.e., fully diverse), the next EC is examined. Otherwise, the variance of the same EC is increased above θ by swapping A^S values with the successor ECs or by adding some noise records. Because of the required noise addition, the proposed model resembles differential privacy [22,23] in this respect, but it remains a syntactic anonymization [9] approach. The variance calculation in Table 5 expresses the variability of the ECs in numerical form. To standardize the θ value for ECs of different sizes and to prevent the sensitive variance attack, we first consider a fully diverse EC (e.g., variance = 0.25 for EC size 2, 0.67 for size 3, 1.25 for size 4, 2 for size 5, and so on) and then multiply the variance by an observed value from Observation 1 (µ).

Observation 1 (μ)
Observation 1 (µ) is a decimal multiplier used to obtain θ; through it, the threshold value has full control over the EC diversity. During the simulation in Python, different values of µ were checked to find a suitable value. After executing the dataset for ECs of different sizes k, we concluded that suitable values of µ lie in the range 0.5 to 0.9. A smaller µ permits more frequent repetition of sensitive values in an EC, and a higher µ produces a more diverse EC. The question of which observed value should be chosen for ECs of different sizes is explained below.
Consider again the 2+-sensitive 4-anonymous Table 4a, where the EC2 variance is 1.25 and the EC3 variance is 0.69. The difference is caused by the duplicated sensitive value, i.e., Flu, in EC3. We propose an efficient way of removing the repetition of sensitive values to achieve a more diverse EC. For this, we calculate the θ value. For example, consider a fully diverse EC of size 4 with variance 1.25 and multiply it by an observed value between 0.5 and 0.9. Since 1.25 * 0.5 = 0.625 is less than 0.69, µ = 0.5 would accept EC3 despite its duplicate, whereas 1.25 * 0.6 = 0.75 is greater than 0.69, so µ = 0.6 rejects it. The difference between the two variances, i.e., 1.25 and 0.69, is caused by only one duplicated value, "Flu". Thus, the choice of µ depends on the privacy requirements and the level of diversity to be achieved. In this paper, we perform a very strict calculation to get fully diverse ECs. Therefore, for example, in the implementation of the proposed Algorithm 1, we multiply the variance of an EC of size 4 by the observed value µ = 0.6 to obtain a fully diverse EC. The same technique is applied to all other ECs as well. The θ obtained in this way in line 8 of the proposed Algorithm 1 in Section 5.2 is then checked in the conditional at line 10, inside a loop, to check all ECs against the θ requirements.
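The arithmetic in this observation can be replayed in a few lines. The sketch below assumes that a fully diverse EC of size n carries the weights 1..n, whose population variance is (n² − 1)/12; this closed form reproduces the values 0.25, 0.67, 1.25, and 2 quoted above. The function names and the candidate grid for µ are hypothetical.

```python
def full_diversity_variance(size):
    """Population variance of the weights 1..size, i.e., (size^2 - 1) / 12."""
    return (size ** 2 - 1) / 12


def theta(size, mu):
    """Threshold from Equation (1): variance of a fully diverse EC times mu."""
    return full_diversity_variance(size) * mu


def smallest_rejecting_mu(size, observed_variance, candidates):
    """First mu whose threshold exceeds the observed EC variance."""
    for mu in candidates:
        if theta(size, mu) > observed_variance:
            return mu
    return None


candidates = [0.5, 0.6, 0.7, 0.8, 0.9]
ec3_variance = 0.6875  # EC3 (Flu = 2, Cancer = 1, HIV = 1), reported as 0.69

print([round(full_diversity_variance(s), 2) for s in (2, 3, 4, 5)])  # [0.25, 0.67, 1.25, 2.0]
print(smallest_rejecting_mu(4, ec3_variance, candidates))            # 0.6
```

With µ = 0.5 the threshold is 0.625, which the EC with one duplicate still passes; µ = 0.6 is the first grid value that flags it.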
Definition 10. θ-sensitive k-anonymity: The modified microdata table T' fulfills θ-sensitive k-anonymity if it fulfills k-anonymity and the variance of each EC in T' is at least θ:

$$T' \text{ is } \theta\text{-sensitive } k\text{-anonymous} \iff \forall G \in T' : |G| \geq k \;\wedge\; \mathrm{Var}(G[A^{S}]) \geq \theta$$

where G represents a QI-group or EC that already satisfies k-anonymity and is a set of A^QI and A^S values, and Var(G[A^S]) is the variance of the A^S values in G. The number of distinct sensitive values in a QI-group, Count(Dist(A^S)), must be equal to or greater than p. The proposed θ-sensitive k-anonymity model produces the anonymized Table 8a (with noise) from the original microdata Table 1a, and Table 8b (without noise) from Table 4b. The A^QI values in Table 8a and Table 8b are generalized through local recoding (bottom-up generalization), which improves the utility of the anonymized data. The 4-diverse ECs in Table 8a and Table 8b have sensitive values from a minimum of three different sensitive categories in Table 3. Therefore, these tables offer more attribute privacy and are better protected against a sensitive variance attack.
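A table-level check for Definition 10 can be sketched by combining the size test with the variance test for every EC. The code below makes two assumptions that are not stated explicitly in the text: the variance uses frequency-ranked weights (most frequent value gets weight 1), and θ is computed from the size of each EC. All function names are hypothetical.

```python
from collections import Counter


def ec_variance(values):
    """Frequency-weighted population variance of ranked weights (assumed form)."""
    freqs = sorted(Counter(values).values(), reverse=True)
    n = sum(freqs)
    mean = sum(x * f for x, f in enumerate(freqs, start=1)) / n
    return sum(f * (x - mean) ** 2 for x, f in enumerate(freqs, start=1)) / n


def theta_for(size, mu):
    """Equation (1) with the variance of a fully diverse EC of this size."""
    return (size ** 2 - 1) / 12 * mu


def violating_ecs(ecs, k, mu):
    """Indices of ECs that break k-anonymity or fall below theta."""
    bad = []
    for i, ec in enumerate(ecs):
        if len(ec) < k or ec_variance(ec) < theta_for(len(ec), mu):
            bad.append(i)
    return bad


# Sensitive values of two ECs, following the EC2 and EC3 examples.
ecs = [
    ["Hepatitis", "Phthisis", "Asthma", "Obesity"],
    ["Flu", "Flu", "Cancer", "HIV"],
]

print(violating_ecs(ecs, k=4, mu=0.6))  # [1]
print(violating_ecs(ecs, k=4, mu=0.5))  # []
```

An empty result means the table satisfies the definition under these assumptions; a non-empty result lists the ECs that would need swapping or noise.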

The Proposed θ-Sensitive k-Anonymity Algorithm
The proposed θ-sensitive k-anonymity algorithm starts execution at line 3 by checking the size k used to create an EC (minimum cardinality k = 2). The algorithm can be executed for different values of k. If the minimum cardinality check fails, the condition becomes false and the algorithm jumps to line 50. If it is true, the loop in lines 5 to 7 calculates the variance for each G or EC of size m that belongs to the k-anonymous ECs and stores the variances in an array V.
Line 8 multiplies the average observed value µ by the variance σ² of an EC to get the threshold θ (i.e., Equation (1)). This θ value ensures the maximum level of diversity in an EC. θ mainly depends on µ: if µ is smaller for an EC, a less diverse EC will be obtained, and vice versa. The level of diversity in an EC is therefore completely controlled by µ. Based on a close examination of l-diversity [15] and t-closeness [16] and on experiments while executing the algorithm in Python, the µ value is chosen to achieve maximum diversity. The main loop runs from line 9 to line 49 and checks the obtained variances against θ for each EC of size m under the user input k. At line 10, if the variance V of the current EC is greater than θ, line 46 is executed and the algorithm moves on to the next EC. If it is less than θ, the current EC is named EC_c and the next EC is named EC_f. In lines 12-45, each branch of the if statement has two major functions: swapping and, if required, noise addition.
The else part of the if statement handles the ECs from the first up to EC_{n-1}, and its first part processes the last EC (lines 13-25). At line 12, if EC_c is the last class, i.e., EC_n, then the value of MS_n is calculated from EC_n. Similarly, the value of MS_{n-1} is calculated from EC_{n-1}. At line 15, the crossCheck() function checks whether each EC holds a most frequent A^S value that does not exist in the other EC; if so, the swap() function may be executed. The purpose of the cross-check is not to further increase or decrease V_{n-1}, because EC_{n-1} has already been processed by the else part of the current if statement; the function exists to increase the diversity of the last EC. If any A^S value from EC_n exists in EC_{n-1}, or vice versa, the swap at line 17 is not performed. If the swap is performed, V_n is recalculated and compared with θ (line 20). If V_n is still less than θ, the algorithm jumps to line 43 to add a distinct A^S value as noise, which increases the variance and achieves high diversity.
To process the ECs from the first EC up to EC_{n-1}, the else part of the if statement (line 12) executes. The algorithm searches for an EC_f with a higher variance than EC_c (lines 27-31). When the if statement finds that the condition on EC_f is satisfied, the function mfsv() is executed on both EC_c and EC_f to calculate the most frequent sensitive values in both ECs, MS_c and MS_f. Before MS_c and MS_f are swapped, the function backCheck() checks whether MS_f exists in EC_b, which is an EC ahead of EC_c. If MS_f exists in EC_b, that MS_f value is removed from a temporary array in Algorithm 1, and the next MS_f candidate in the same EC_f is checked against MS_c. This process continues until an A^S value is found in MS_f that does not exist in MS_c. Line 37 then swaps these two values along with their corresponding records. The swap serves two purposes: first, it reduces the frequency of the repeated A^S value; second, it increases the diversity of EC_c, which increases V_c. V_c is then recalculated and checked against θ; if it is greater than θ, the counter moves to the next EC.
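The swapping step can be illustrated with a deliberately simplified sketch. It is not Algorithm 1: it omits backCheck(), operates on bare sensitive values rather than whole records, and takes the first value of the other EC that is absent from the current EC instead of iterating over MS_f candidates. The names `swap_once`, `most_frequent`, and `ec_variance` are hypothetical, and the variance uses the assumed frequency-ranked weights.

```python
from collections import Counter


def ec_variance(values):
    """Frequency-weighted population variance of ranked weights (assumed form)."""
    freqs = sorted(Counter(values).values(), reverse=True)
    n = sum(freqs)
    mean = sum(x * f for x, f in enumerate(freqs, start=1)) / n
    return sum(f * (x - mean) ** 2 for x, f in enumerate(freqs, start=1)) / n


def most_frequent(values):
    """Most frequent sensitive value in an EC."""
    return Counter(values).most_common(1)[0][0]


def swap_once(ec_c, ec_f):
    """Swap one occurrence of ec_c's most frequent value into ec_f.

    The swap only happens if ec_f does not already hold that value and ec_f
    offers a value that ec_c lacks. Returns True if a swap took place.
    """
    ms_c = most_frequent(ec_c)
    if ms_c in ec_f:
        return False
    for j, candidate in enumerate(ec_f):
        if candidate not in ec_c:
            i = ec_c.index(ms_c)
            ec_c[i], ec_f[j] = candidate, ms_c
            return True
    return False


ec_c = ["Flu", "Flu", "Cancer", "HIV"]
ec_f = ["Hepatitis", "Phthisis", "Asthma", "Obesity"]

print(round(ec_variance(ec_c), 2))         # 0.69
print(swap_once(ec_c, ec_f))               # True
print(ec_variance(ec_c), ec_variance(ec_f))  # 1.25 1.25
```

In this example a single swap removes the duplicate from the current EC without creating one in the other EC, so both end up fully diverse.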
Here, an else branch is deliberately omitted. An else branch would add noise immediately whenever the variance is less than θ, even though more than one swap may be possible for a specific EC. Noise is added only once, after the frequency of every sensitive value in the EC has been checked. For example, when producing a 4-anonymous EC table from Table 1a, after one swap (e.g., 'HIV' is swapped with 'Obesity'), EC1 in Table 8a becomes 3-diverse and its variance still does not meet θ. An else branch might then add noise to increase the variance, even though a duplicated sensitive value, 'Cancer', still exists in the EC. Noise is not added at this moment, so that the frequency of the next duplicated value, 'Cancer', can be reduced by swapping it with another sensitive value from a different EC, if one exists. This is achieved by returning control to line 10. Since the increased variance is still less than θ, the procedure repeats and a new sensitive value from another EC is swapped with the next duplicated value. In this way, two swaps are performed and a 2-diverse EC becomes 4-diverse without adding any unnecessary noise, which increases data utility and yields a more diverse EC.
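The repeated swapping described above can be illustrated with a minimal Python sketch. It is not the authors' Algorithm 1: the helper names `most_frequent`, `swap_once`, and `diversify` are hypothetical, each EC is reduced to a plain list of sensitive values, the number of distinct values stands in for the variance check, and the disease values other than 'HIV', 'Cancer', and 'Obesity' are invented for illustration.

```python
from collections import Counter


def most_frequent(values):
    """Return the most frequent sensitive value (ties: the first one seen)."""
    return Counter(values).most_common(1)[0][0]


def swap_once(ec_a, ec_b):
    """Swap one duplicated value of ec_a with a value from ec_b that ec_a lacks.

    Assumption: the swap is skipped when the duplicated value already occurs
    in ec_b, so that ec_b does not gain a repeated value.
    Returns True if a swap was made.
    """
    if not ec_a:
        return False
    dup = most_frequent(ec_a)
    if ec_a.count(dup) < 2 or dup in ec_b:
        return False
    for j, candidate in enumerate(ec_b):
        if candidate not in ec_a:
            i = ec_a.index(dup)
            ec_a[i], ec_b[j] = candidate, dup
            return True
    return False


def diversify(ec_a, ec_b):
    """Keep swapping until no duplicate in ec_a can be replaced; return the swap count."""
    swaps = 0
    while swap_once(ec_a, ec_b):
        swaps += 1
    return swaps


# Hypothetical 4-anonymous ECs: ec1 is 2-diverse, ec2 is fully diverse.
ec1 = ["HIV", "HIV", "Cancer", "Cancer"]
ec2 = ["Obesity", "Flu", "Asthma", "Hepatitis"]
n_swaps = diversify(ec1, ec2)
print(n_swaps, ec1, ec2)
```

In this toy run the first swap exchanges 'HIV' with 'Obesity' and the second replaces one 'Cancer', so the 2-diverse EC becomes 4-diverse without any dummy record. Noise would only be considered once `swap_once` returns False.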
An EC is selected as a swap partner because its variance is greater than θ, but a given dataset may contain no EC with a variance higher than θ. In this case, the loop will not break (line 29) and the algorithm jumps to line 43, where the function addNoise() adds a dummy record with distinct sensitive value(s). Such an addition is considered noise in the real data, similar to the addition of noise in differential privacy [22,23]. The algorithm swaps first and adds noise only when no suitable swap remains. The purpose of these two functions, swap() and addNoise(), is to increase diversity while keeping utility as high as possible, as shown in the experimental evaluation in Section 6.
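The noise step can be sketched as follows. This is only an illustration under stated assumptions: records are plain dictionaries, the key name `disease`, the `is_noise` flag, and the function name `add_noise` are hypothetical, and the quasi-identifier values of the dummy record are left out.

```python
def add_noise(ec, sensitive_domain, key="disease"):
    """Append one dummy record whose sensitive value does not yet occur in the EC.

    Returns the dummy record, or None when every value of the domain is
    already present and no distinct value can be added.
    """
    present = {record[key] for record in ec}
    for value in sensitive_domain:
        if value not in present:
            dummy = {key: value, "is_noise": True}  # hypothetical marker field
            ec.append(dummy)
            return dummy
    return None


# Hypothetical last EC with a repeated value that no swap could remove.
last_ec = [{"disease": "Cancer"}, {"disease": "Cancer"}, {"disease": "Flu"}]
domain = ["Cancer", "Flu", "HIV", "Obesity"]
added = add_noise(last_ec, domain)
print(added, len(last_ec))
```

Marking the dummy record makes it easy to count the noise tuples afterwards, which is the quantity reported in the noise evaluation.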
The sanitized Table 4a from p+-sensitive k-anonymity is prone to homogeneity, categorical similarity, and sensitive variance attacks, whereas Table 8a from θ-sensitive k-anonymity secures the data from such attacks because of its greater diversity, even at the category level: the maximum value for category c is 4 through θ-sensitive k-anonymity, whereas for Table 4a the maximum value for c is 2. Table 8a therefore provides more protection against the categorical similarity attack. Further swapping of values is not possible in the last EC; thus, a single tuple is added as noise to increase the diversity and to prevent the categorical similarity attack and the sensitive variance attack. Such a small amount of noise does not greatly affect the utility of the data. Table 4b is the base table from which Table 8b is obtained using the θ-sensitive k-anonymity approach. Table 8b is also highly diverse at the categorical level and contains no repeated sensitive values, so it already has a high variance and no noise needs to be added. The anonymized data in both Table 8a and Table 8b, obtained through the proposed θ-sensitive k-anonymity algorithm, have no attribute disclosure risk and are protected against homogeneity [11], categorical similarity, and sensitive variance attacks, and even against skewness attacks [12].

Analysis of θ-Sensitive k-Anonymity Model Using Formal Modeling and Analysis
The proposed θ-sensitive k-anonymity model mitigates the vulnerability discussed in Section 4. Modeling θ-sensitive k-anonymity via HLPN involves the same end-user, data publisher, and unknown adversary, as shown in Figure 2. Tables 9 and 10, respectively, show the variable types and places, and their corresponding descriptions.
The θ-sensitive k-anonymity algorithm was modeled through the HLPN rules for the microdata input. The data publisher initially verifies the k-value input. The original data is k-anonymized (bottom-up generalization) after finalizing the individual records in an EC obtained through variance calculations. In Rule (2), k-anonymity masks the data.

Table 10. Mapping of data types in θ-sensitive k-anonymity model (columns: Places, Descriptions).

If the input k is less than the minimum size of an EC (i.e., k < 2), the condition fails. For a cardinality of 2 or above, the algorithm executes. The true and false outcomes of the k-anonymity check are depicted in Rule (3).
The threshold θ is calculated in Rule (4). The variance for fully diverse ECs for a specific k is calculated using the var() function. The main contributed functions are swap() and addNoise(), through which the algorithm processes all ECs. The transition performs all swapping and noise addition in the corresponding ECs. Rule (5) gives the transitions for the initial ECs; for the rest of the ECs, the same transition can be used in the same manner.

The main functionalities of the θ-sensitive k-anonymity model are described in Rule (6) and Rule (7). The variance in each k-anonymous EC is checked against θ in Rule (6). If the variance of the EC is greater than θ (i.e., (i14[1] > i13)), the algorithm moves to the next EC and updates the value in place MMT. If the variance of the EC is less than θ (i.e., (i14[1] < i13)), the transition stops. An EC with a higher variance is then sought, and the required available sensitive values are swapped from it. After all needed swaps have been performed, if the variance of AdjEC is still less than θ (i.e., (i32 < i13)), noise is added to increase its diversity. The proposed θ-sensitive k-anonymity algorithm starts by processing each EC of size k.

Figure 2. HLPN for θ-sensitive k-anonymity.
One MS value is swapped with the other after the check succeeds, and the result is saved in place AdjEC. The swap minimizes the frequency of the repeated sensitive value in the EC and increases its diversity. While processing the last EC, swapping is not possible in the forward direction. Thus, swapping is performed with a previous EC, on the condition that the variance of the already processed EC must not fall below θ. The crossCheck() function performs a two-way check: the two MS values must be distinct, and the swap must not change the variance of the previously processed EC, held at place StrictEC, back to an undesired value. Such an EC is called a strict EC. In other words, the swap increases the diversity of the last EC without increasing the frequency of any sensitive value in the strict EC. The values are then swapped and saved at place AdjEC. Rule (7) shows the whole process.

If the variance of AdjEC is still less than θ (i.e., (i34[1] < i35)), a dummy record called noise is added; this can happen whenever needed throughout the variance adjustment process. Rule (8) gives the final noise addition case for the last AdjEC. Its purpose is to raise the variance to a level greater than θ. It produces a highly diverse EC even if there are not enough diverse records in MMT.

In Rule (9), an adversary attacks the individuals' sensitive values. The adversary combines the already available background knowledge (BK) (i.e., i40[2]) with the published data (i.e., i38[2]) and attempts to disclose the patient's identity (i.e., i2[2]) and the sensitive values (i.e., i2[3]). The θ-sensitive k-anonymity model provides better protection against attribute disclosure attacks because swapping and noise addition yield a high variance in the corresponding ECs. The diversity of sensitive attribute values in the ECs defeats the adversary's BK and is more effective than in the p+-sensitive k-anonymity model. Therefore, the adversary does not obtain private information about the target individual, and the attack results in a null value.

Experimental Evaluation
In this section, we describe the experiments performed to show the effectiveness of the proposed θ-sensitive k-anonymity privacy model in comparison to the p+-sensitive k-anonymity model. The proposed algorithm diversifies the sensitive attribute values in a balanced way inside each EC without using the categorical approach. The utility and quality of the anonymized released data were checked with several quality measures.

Experimental Setup
All experiments were performed on a machine with an Intel Core i5 2.39 GHz processor and 4 GB RAM, running the Windows 10 operating system. The algorithm was written in Python 3.7. We used the Adult dataset, which contains age, zip code, salary, and occupation attributes and is openly accessible at the UC Irvine Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets. We considered age, zip code, and salary as the quasi-identifiers (QIs) and occupation as the sensitive attribute.
Experimental results show the usefulness of the proposed θ-sensitive k-anonymity privacy model and its protection against the categorical similarity attack and the sensitive variance attack, as compared to the p+-sensitive k-anonymity model. The quality of the sanitized publicly released data was evaluated with four utility metrics: discernibility penalty (DCP) [18,38,39], normalized average QI-group (CAVG) [17,18,38], noise calculation, and query accuracy [18,33]. The execution time of both algorithms was analyzed at the end of the experiments.

Discernibility Penalty (DCP)
The DCP, proposed in [38] and used in [18,39], assigns a penalty (cost) to each tuple in the generalized data set. The penalty reflects the number of tuples in the result set from which a sanitized tuple cannot be distinguished. Minimizing the discernibility cost is the objective. The penalty for a tuple t that belongs to an EC of size |EC|, i.e., t ∈ EC, is |EC|, and the penalty for each EC is therefore |EC|^2. The complete DCP penalty for the overall sanitized released dataset R* is given in Equation (2).

$$DCP(R^*) = \sum_{EC \in R^*} |EC|^2 \quad (2)$$

where the sum runs over all ECs in R*. A baseline can be obtained from the optimal DCP score calculation, as shown in [10]. For example, if k = 2 and the number of anonymized tuples is 10, the optimal DCP score is 2^2 + 2^2 + 2^2 + 2^2 + 2^2 = 20. This optimal score is called the baseline. The approach followed in this paper generates groups of size k, inclusive of the noise tuple(s). A higher k means a bigger group size, so the baseline moves up because of the higher DCP score. The p+-sensitive k-anonymity model generates groups based on p, which means that the number of tuples in a k-anonymous class can be greater than p. Figure 3 shows the DCP score for θ-sensitive k-anonymity, including a comparison with p+-sensitive k-anonymity and the baseline. In comparison to p+-sensitive k-anonymity, the DCP score of the proposed θ-sensitive k-anonymity algorithm is almost equal to the baseline, which implies that the proposed model assigned an optimal penalty to each EC and produced a near-optimal DCP score. The magnified subplots in Figure 3, with k = 12 and k = 16 for θ-sensitive k-anonymity, show the very minor difference from the baseline. This minor difference can also be seen in Table 11, which reports an average DCP difference of 47.2, or 0.002679%, from the baseline obtained from the simulation while calculating the DCP for the anonymized dataset R*. This means that the proposed θ-sensitive k-anonymity approach is 14.64% better than p+-sensitive k-anonymity and within 0.002679% of the baseline.
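The DCP score and its baseline can be computed with a few lines of Python. The sketch below is illustrative only: the function names `dcp`, `dcp_baseline`, and `gap_percent` are hypothetical, and the input is assumed to be the list of EC sizes in the released dataset.

```python
def dcp(ec_sizes):
    """Discernibility penalty: every tuple in an EC costs |EC|, so an EC costs |EC|**2."""
    return sum(size * size for size in ec_sizes)


def dcp_baseline(n_tuples, k):
    """Optimal DCP when n_tuples are split into groups of exactly k tuples."""
    if k < 1 or n_tuples % k != 0:
        raise ValueError("n_tuples must be a positive multiple of k")
    return (n_tuples // k) * k * k


def gap_percent(score, baseline):
    """Relative distance of a DCP score from the baseline, in percent."""
    return 100.0 * (score - baseline) / baseline


# The example from the text: k = 2 and 10 anonymized tuples.
print(dcp([2, 2, 2, 2, 2]), dcp_baseline(10, 2))

# Hypothetical release with two oversized groups.
print(gap_percent(dcp([2, 2, 3, 3]), dcp_baseline(10, 2)))
```

Groups that exceed k raise the score quadratically, which is why a grouping strategy that keeps every EC at exactly k tuples stays close to the baseline.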

Normalized Average (CAVG)
CAVG is another mathematically sound measure, which evaluates the quality of the sanitized data by the average EC size. It was proposed in [38] and applied in [17,18]. CAVG is calculated as in Equation (3).

$$C_{AVG} = \frac{|R^*|}{|\{EC\}| \cdot k} \quad (3)$$

where |R*| is the size of the overall sanitized released dataset and |{EC}| is the total number of ECs in R*. Data utility and CAVG are inversely related: a low CAVG value indicates high information utility. The goal is to have ECs of minimum size in R*. Figure 4 shows CAVG for p+-sensitive k-anonymity and θ-sensitive k-anonymity over k. p+-sensitive k-anonymity has lower data utility for small k, whereas it has high data utility for large k. The proposed technique has a balanced and stable utility for each input value of k. Thus, the proposed θ-sensitive k-anonymity model performs consistently for all values of k, compared to the p+-sensitive k-anonymity model.
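As a small worked example of the measure, the following Python sketch computes CAVG from a list of EC sizes. It assumes the commonly used normalized form, in which the average EC size is divided by k; the function name `c_avg` is hypothetical.

```python
def c_avg(ec_sizes, k):
    """Normalized average EC size: (|R*| / number of ECs) / k."""
    if not ec_sizes or k < 1:
        raise ValueError("need at least one EC and k >= 1")
    total_tuples = sum(ec_sizes)
    return total_tuples / (len(ec_sizes) * k)


# Every EC has exactly k = 4 tuples: the best possible value.
print(c_avg([4, 4, 4, 4], 4))

# Same 16 tuples in fewer, larger ECs: the value rises above 1.
print(c_avg([4, 6, 6], 4))
```

A value of 1 means every EC has exactly k tuples, and values above 1 indicate groups larger than required, hence more generalization and lower utility.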

Noise Addition
Among the different masking methods, one popular approach is the perturbation of data, i.e., noise addition. Here the noise consists of dummy tuples added to the original data, which help to achieve the required diversity, similar to differential privacy [22,23]. If there are not enough sensitive values to swap with, especially in the second-to-last and last ECs, the gap is filled with noise tuples to prevent disclosure risk. Thus, the good performance of the proposed model comes partly at the cost of noise addition. Figure 5 shows the number of tuples added as noise for different values of k. These tuples are added to achieve the required value of the threshold θ. The algorithm responds differently for different values of k, but the maximum number of noise tuples added for any single value of k is only six. In the processed "Adult" dataset, the total number of tuples was 160,150, and only 34 noise tuples, i.e., 0.021% of the total size, were added in total. Such an amount of utility loss is negligible. Part of this noise is sometimes needed only to make the dataset size divisible by the input k; for example, 160150/4 = 40037.5, whereas 160152/4 = 40038.
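The arithmetic behind these figures can be checked with a short Python sketch. The helper names `noise_to_fill` and `noise_share_percent` are hypothetical; the numbers are those quoted in the text.

```python
def noise_to_fill(n_tuples, k):
    """Number of dummy tuples needed so that n_tuples becomes a multiple of k."""
    return (-n_tuples) % k


def noise_share_percent(noise_tuples, n_tuples):
    """Share of noise tuples relative to the dataset size, in percent."""
    return 100.0 * noise_tuples / n_tuples


n = 160150
pad = noise_to_fill(n, 4)
print(pad, (n + pad) // 4)          # tuples to add for k = 4, resulting number of ECs
print(noise_share_percent(34, n))   # share of the 34 noise tuples reported in the text
```

This covers only the padding needed for divisibility; the remaining noise tuples reported for each k are those added to reach the diversity threshold.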

Query Accuracy
Query accuracy measures the precision of aggregate queries to check the utility of the anonymized data. It has been used by various research works [18,33]. To answer the aggregate queries, the built-in COUNT operator is used, where the QIs are the query predicates. Consider R* to be a sanitized release from original microdata R having at most m QIs, $A_i^{QI}$ (1 ≤ i ≤ m), where $D(A_i^{QI})$ is the domain of the i-th QI. The COUNT query takes the form shown in Equation (4).

$$\text{SQLQuery} = \text{select COUNT}(*) \text{ from } R^* \text{ where } A_1^{QI} \in D(A_1^{QI}) \text{ AND } \ldots \text{ AND } A_m^{QI} \in D(A_m^{QI}) \quad (4)$$

For each query, at least one tuple, or a small number of tuples, should be selected from each EC based on the query predicates. Two important parameters for query predicates are (1) the query dimensionality q and (2) the query selectivity ϑ. Query dimensionality is the number of QIs in the query predicate, while query selectivity concerns the number of values for each attribute $A_i^{QI}$ (1 ≤ i ≤ m). The query selectivity is calculated as $\vartheta = |T_Q| / |R|$, where $|T_Q|$ is the number of tuples output by query Q on relation R, and |R| is the total number of tuples in the whole dataset. The query error, Error(Q), is calculated in Equation (5).

$$Error(Q) = \frac{|count(R^*) - count(R)|}{count(R)} \quad (5)$$

where count(R*) is the result of the COUNT query on the anonymized dataset, while count(R) is the result of the COUNT query on the original microdata. More selective queries have a high error rate. Figure 6a shows the query error for the input value of k. We compare p+-sensitive k-anonymity and θ-sensitive k-anonymity using the query error rate for 1000 randomly generated aggregate queries. The error rate increases for high values of k because of the wide generalization range of the QIs: a query then selects a greater number of tuples than it does on the original microdata, and hence the error rate is high. Figure 6b shows that the more tuples are selected based on the predicates, the higher the error rate in the anonymized data.

Figure 7 shows the execution time for both the p+-sensitive k-anonymity model and the proposed θ-sensitive k-anonymity model. The execution time of both algorithms increased with the value of k because of the increase in the generalization range of the QIs. Since we did not consider the categorization of sensitive values, our approach took less time to execute than its counterpart. In the θ-sensitive k-anonymity model, the higher execution time for k = 10, k = 16, and k = 20 was due to the time taken to add more noise tuples to achieve the required diversity.
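The query error of Equation (5) can be illustrated with a minimal Python sketch. It is not the evaluation code used in the experiments: records are plain dictionaries, the attribute name `age` and all function names are hypothetical, and a generalized tuple is assumed to match a range predicate whenever its interval overlaps the queried range, which is one simple way to model why COUNT over generalized data over-selects.

```python
def count_original(records, predicate):
    """COUNT(*) on exact data; predicate maps attribute -> (low, high), inclusive."""
    return sum(
        all(low <= rec[attr] <= high for attr, (low, high) in predicate.items())
        for rec in records
    )


def count_anonymized(records, predicate):
    """COUNT(*) on generalized data; a tuple matches if its interval overlaps the range."""
    return sum(
        all(rec[attr][0] <= high and low <= rec[attr][1]
            for attr, (low, high) in predicate.items())
        for rec in records
    )


def query_error(count_anon, count_orig):
    """Relative error |count(R*) - count(R)| / count(R)."""
    if count_orig == 0:
        raise ValueError("the query selects no tuple in the original data")
    return abs(count_anon - count_orig) / count_orig


# Hypothetical microdata and a 2-anonymous generalization of it.
original = [{"age": 23}, {"age": 27}, {"age": 31}, {"age": 38}]
anonymized = [{"age": (20, 29)}, {"age": (20, 29)},
              {"age": (30, 39)}, {"age": (30, 39)}]

query = {"age": (25, 32)}
c_orig = count_original(original, query)
c_anon = count_anonymized(anonymized, query)
print(c_orig, c_anon, query_error(c_anon, c_orig))
print(c_orig / len(original))  # query selectivity on the original data
```

In this toy case both generalized intervals overlap the queried range, so the anonymized count is twice the true count. Wider intervals, as produced by larger k, make such over-selection more likely.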

Conclusion
In this paper, the huge amount of data (i.e., Big Data) collected through IoT-based devices was anonymized using the proposed θ-sensitive k-anonymity privacy model and compared with the p+-sensitive k-anonymity model. The purpose was to prevent attribute disclosure risk in anonymized data. The p+-sensitive k-anonymity model was shown to be vulnerable to privacy breaches from sensitive variance, categorical similarity, and homogeneity attacks. These attacks were mitigated by implementing the proposed θ-sensitive k-anonymity privacy model using Equation (1). In the proposed solution, the threshold value θ decides the diversity level for each EC of the dataset. The vulnerabilities in the p+-sensitive k-anonymity model and the effectiveness of the proposed θ-sensitive k-anonymity model were formally modeled through HLPN, which further supports the validity of the proposed technique. The experimental work showed the privacy protection and the improved utility of the released data using different mathematical measures. In future work, the proposed algorithm can be extended to 1:M data (a single record having many attribute values) [40] or to multiple sensitive attributes (MSA) [41][42][43], or it can be modeled by considering the dynamic data set [44] approach.