
θ-Sensitive k-Anonymity: An Anonymization Model for IoT based Electronic Health Records

National Engineering Laboratory for Mobile Network Technologies, Beijing University of Posts and Telecommunications, Beijing 100876, China
Department of Computer Science, COMSATS University Islamabad, Islamabad 45550, Pakistan
Cybernetica AS Estonia, Tallinn 13412, Estonia
Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, UK
Department of Computer Science, University of Peshawar, Peshawar 25120, Pakistan
Warwick Manufacturing Group, The University of Warwick, Coventry CV4 7AL, UK
Author to whom correspondence should be addressed.
Electronics 2020, 9(5), 716;
Received: 28 March 2020 / Revised: 19 April 2020 / Accepted: 20 April 2020 / Published: 26 April 2020
(This article belongs to the Special Issue Cyber Security for Internet of Things)


The Internet of Things (IoT) is an exponentially growing emerging technology, which is implemented in the digitization of Electronic Health Records (EHR). IoT applications are used to collect patients' data, which data holders then publish. However, the data collected through IoT-based devices are vulnerable to information leakage and pose a potential privacy threat. Therefore, privacy protection methods need to be implemented to prevent the identification of individual records in EHR. Significant research contributions exist, e.g., p+-sensitive k-anonymity and balanced p+-sensitive k-anonymity, for implementing privacy protection in EHR. However, these models have certain privacy vulnerabilities, which are identified in this paper through two new types of attack: the sensitive variance attack and the categorical similarity attack. A mitigation solution, the θ-sensitive k-anonymity privacy model, is proposed to prevent the mentioned attacks. The proposed model works effectively for all k-anonymous group sizes and can prevent sensitive variance, categorical similarity, and homogeneity attacks by creating more diverse k-anonymous groups. Furthermore, we formally modeled and analyzed the base and the proposed privacy models to show the invalidation of the base model and the applicability of the proposed work. Experiments show that our proposed model outperforms the others in terms of privacy protection (14.64%).

1. Introduction

The current highly-connected technological society generates a huge amount of digital data—termed Big Data, collected through internet-enabled devices, termed the Internet of Things (IoT) [1]. Billions of these IoT devices sense and collect the data e.g., the patient’s Electronic Health Records (EHR) [1,2,3,4]. The collected data are then shared with corporate or government bodies for research and policymaking. However, the privacy of the individual records is an important goal when sharing data that is collected through the IoT enabled devices [1,2,3,4,5,6]. This is because these data contain names or some unique identification (explicit identifiers— A ei ), such as age, gender, zip code (quasi-identifiers— A qi ), and some health-related private information (sensitive attributes— A s ) [7,8,9,10,11,12]. To preserve privacy, eliminating the A ei before sharing or publishing the data is not enough [11]. For an attacker or an adversary, the quasi-identifiers (QIs) are the partial identifiers that can be used to link to some externally available data e.g., voting or census data, to identify an individual A s , known as a linking attack [10,11,12].
To implement data privacy, many cryptographic techniques [13,14] have been proposed. However, these techniques have high computational overheads. Another, simpler approach is data anonymization: concealing an individual's identity in a small crowd of records before data publishing. The publishing of such anonymized records is known as Privacy Preserving Data Publishing (PPDP) [11]. A plethora of PPDP methods have been proposed [7,8,9,10,11,12,15,16,17,18,19]. These techniques are broadly classified into:
  • Identity disclosure prevention: Generalizing [7,8,9] the QI values of a group of records from more specific values to less specific values, e.g., k-anonymity [7,8], where every record should be indistinguishable from at least k−1 other records. Consequently, an intruder/attacker cannot re-identify an individual with probability higher than 1/k.
  • Attribute disclosure prevention: Preventing the disclosure of private information ( A s information) about an individual. Examples are the l-diversity [15] and t-closeness [16] privacy models.
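The two disclosure-prevention goals above can be checked mechanically on a generalized table. The sketch below is illustrative only (the toy records, attribute positions, and function names are our assumptions, not from the paper): it tests whether every QI combination appears at least k times, and whether every EC contains at least l distinct sensitive values.

```python
from collections import Counter

def is_k_anonymous(records, qi_idx, k):
    """Identity disclosure check: every QI combination occurs at least k times."""
    groups = Counter(tuple(r[i] for i in qi_idx) for r in records)
    return all(count >= k for count in groups.values())

def is_l_diverse(records, qi_idx, s_idx, l):
    """Attribute disclosure check: every EC has at least l distinct sensitive values."""
    ecs = {}
    for r in records:
        ecs.setdefault(tuple(r[i] for i in qi_idx), set()).add(r[s_idx])
    return all(len(svals) >= l for svals in ecs.values())

# toy table: (generalized age, generalized zip, sensitive disease)
t = [("2*", "476**", "Flu"), ("2*", "476**", "HIV"),
     ("3*", "479**", "Cancer"), ("3*", "479**", "Cancer")]
print(is_k_anonymous(t, (0, 1), 2))   # True: both ECs contain 2 records
print(is_l_diverse(t, (0, 1), 2, 2))  # False: the second EC is homogeneous
```

The second call illustrates why k-anonymity alone is insufficient: the table is 2-anonymous, yet the homogeneous EC still discloses the sensitive attribute.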
In this paper, a variance-based privacy model is proposed to prevent attribute disclosure risk. For sensitive attribute privacy, the p+-sensitive k-anonymity and (p, α)-sensitive k-anonymity [17] privacy models are state-of-the-art models in which the sensitive values are categorized into four categories. To create a k-anonymous group of records, called an equivalence class (EC), l-diversity [15] is applied. However, two new attacks are possible: the sensitive variance attack and the categorical similarity attack. These attacks breach the privacy of the p+-sensitive k-anonymity and (p, α)-sensitive k-anonymity [17] algorithms when the A s values come from a single sensitive category, or when there is low diversity at the A s category level. The proposed mitigation solution, the θ-sensitive k-anonymity privacy model, provides a numerical measure of privacy strength for thwarting attribute disclosure risk. The proposed approach also appends a small number of noise tuple(s) to increase the variability in an EC, if needed. To minimize utility loss, the proposed algorithm uses bottom-up generalization (i.e., the local recoding mechanism [18]), because it distorts the data less than global recoding techniques [12]. The following section presents the motivation of our work.

1.1. Motivation

Broadly examining the PPDP models for preventing attribute disclosure risk [11,15,16,17,18], it can be concluded that the worth of each model lies in the diversity of an EC, where the sensitive values belong to different categories. Such variability of A s values creates a diverse EC. Different privacy models employ different techniques to achieve variability in k-anonymous ECs. The repeated frequencies of the same sensitive values are the main obstacle to achieving the required diversity in an EC. The privacy models in [17] and [18] provide a meaningful approach to the attribute disclosure problem; however, the following limitations have been observed.
  • p+-sensitive k-anonymity and (p, α)-sensitive k-anonymity [17]: This model is a modified version of p-sensitive k-anonymity [19] for preventing a similarity attack. However, the p+-sensitive k-anonymity and (p, α)-sensitive k-anonymity models have zero diversity at the A s category level, which may lead to a categorical similarity attack. An even more powerful possible attack by an adversary is the sensitive variance attack, due to the low variability at the A s category level. With an upsurge in the adversary's knowledge (background knowledge—BK), the privacy level can be breached, which may cause attribute disclosures. The proposed θ-sensitive k-anonymity privacy model provides a privacy solution to prevent all such attacks.
  • Balanced p+-sensitive k-anonymity and (p, α)-sensitive k-anonymity [18]: This model is an enhanced version of the p+-sensitive k-anonymity model. It balances the categorical-level sensitive attributes in each EC. However, it still has low diversity at the A s category level and works only for ECs of k-anonymous size greater than three.
To solve the problems of homogeneity, categorical similarity, and sensitive variance attacks in the p+-sensitive k-anonymity and (p, α)-sensitive k-anonymity model [17], we propose the θ-sensitive k-anonymity privacy model in this paper. The categorical-level similarity and small EC size problems in the balanced p+-sensitive k-anonymity and (p, α)-sensitive k-anonymity model [18] are also addressed: the proposed model achieves a more balanced and diverse EC even at the category level, and executes on ECs of small k size, i.e., k = 2.

1.2. Contributions

The proposed θ-sensitive k-anonymity privacy model multiplies the variance ( σ² ) of a fully diverse EC by an observed value (Observation 1), which produces a threshold value θ. The θ value ensures prevention of attribute disclosure in an EC, which collectively results in the privacy of the given dataset.
The contributions of this paper are as follows:
  • A new θ-sensitive k-anonymity privacy model is proposed, in which privacy in an EC is achieved through a threshold value, i.e., θ. The θ value for an EC is obtained by multiplying the variance by an observation value. The variance-based diversity in an EC prevents the sensitive variance attack, which automatically prevents the categorical similarity attack. In the proposed model, the A s values are checked not only against the next ECs; a cross check is also performed for the last EC. If the required privacy is not achievable with the existing A s values, then noise is added to obtain the required diversity.
  • We formally modeled and analyzed the base model in [17] and the proposed θ -sensitive k-anonymity privacy model using High Level Petri-Nets (HLPN).
  • Based on the above points, simulation results show that our proposed θ-sensitive k-anonymity model has only 0.002679% higher privacy leakage than the baseline, whereas its counterpart, the p+-sensitive k-anonymity model, has 14.65% higher privacy leakage than the baseline.
Paper Organization. The remainder of the paper is organized as follows. Section 2 explains related work. Preliminaries are discussed in Section 3. The considered attacks and problem statement in p+-sensitive k-anonymity along with its formal analysis are presented in Section 4. Section 5 discusses the proposed θ -sensitive k-anonymity model and its formal analysis. In Section 6, the experiments and evaluations are provided. Section 7 concludes the paper.

2. Related Work

In this section, the literature related to the proposed privacy model is reviewed from various aspects. The data collected through the various IoT-enabled devices [1,2,3,4,5,6,20,21] must be anonymized before publishing because of the private information contained in it. Anonymized data are published to maximize their utility without disclosing the private information of any individual. For anonymization, privacy models can be broadly classified into semantic [22,23] and syntactic [7,8,9,10,11,12,13,14,15,16,17,18,19] approaches. Semantic privacy models add a random amount of noise to preserve privacy, e.g., differential privacy models [22,23]. In differential privacy, the deletion or addition of an individual's record or noise does not affect the data analysis results, while privacy is preserved. Syntactic privacy models create k-indistinguishable [7] ECs. In syntactic privacy, the two main privacy disclosure risks are identity disclosure [7,8,9,10,12] and attribute disclosure [11,15,16,17,18]. k-anonymity [7,8] is an example of preventing identity disclosure; it generalizes a set of records with respect to QIs. These k-anonymous records are indistinguishable from k−1 other records in the dataset. However, k-anonymity lacks the ability to provide attribute-level protection. Attribute disclosure reveals the value of a confidential attribute of an identified individual record. l-diversity [15] requires l distinct groups for the A s in an EC; however, skewness and similarity attacks can still breach privacy, because l well-represented sensitive attribute groups are not always possible over the existing A s values. Similarly, for t-closeness [16], the threshold for A s and its distance distribution in an EC yield low data utility, and the earth mover distance (EMD) is not an efficient prevention for attribute linkage [24,25].
In [26], Torra addressed both identity and attribute disclosure. Jose et al. [27] proposed an adaptive two-step iterative anonymization approach; privacy leakage through an attribute linkage attack was possible because numerous anonymized versions are released. An extended k-anonymity model was proposed by Rahimi et al. [28] to protect identity and attribute information. However, a BK attack is possible because the publisher is unaware of the adversary's knowledge. The k-join-anonymity model proposed by Sowmiyaa et al. [29] is essentially the same as k-anonymity and focuses only on identity disclosure risk. The (α, k)-anonymity model proposed by Wong et al. [30] used a global recoding technique, which has high utility loss and, due to the table linkage attack, was susceptible to attribute disclosure.
The (k, e)-anonymization model proposed by Zhang et al. [31] publishes separate tables, consisting of A s and QI, to reduce the relationship between them; instead of generalization, a permutation-based approach is adopted. In aggregated search, avoiding QI-generalization is recommended for accuracy improvement; however, a probabilistic attack on the A s is possible due to the one-time publication of the microdata. The (ɛ, m)-anonymity model [32] deals with numeric A s but is limited when handling categorical A s . Xiao et al. [33] worked on personalized anonymity using a greedy personalized generalization approach. This model de-associates A s and QI instead of modifying the association between them.
In Reference [19], the p-sensitive k-anonymity model finds the closest neighbors. This model was then improved by Sun et al. [17] with a top-down specialization: the generated anonymized datasets must draw, for each EC, from at least p distinct A s value categories. However, the algorithm developed in [17] is vulnerable to privacy leakage from sensitive variance, categorical similarity, and homogeneity attacks. In this paper, these privacy limitations are mitigated using the proposed θ-sensitive k-anonymity algorithm. The proposed privacy model is a syntactic privacy model for preventing attribute disclosure risk, which adds a fixed amount of noise to create k-anonymous ECs.

3. Preliminaries

Let an original Microdata Table (MT) = { EI, QI, S } (i.e., Table 1a) be the private static data (i.e., a one-time release) for a publisher to publish. A tuple t ∈ MT belongs to an individual i, such that EI = { A 1 ei , A 2 ei , A 3 ei … A h ei }, QI = { A 1 qi , A 2 qi , A 3 qi … A m qi }, and S = { A s } (this work considers only a single A s ). The k-anonymized data essentially consist of A qi and A s , while the A ei s are removed. This is because an adversary can link the A qi with some external information (e.g., voter or census data) to perform a record linkage attack (i.e., identity disclosure) [34]. However, the k-anonymous A qi values protect a record in an EC against the record linkage attack. For example, consider some common diseases in a 2-anonymous table (Table 1b) obtained from the original microdata Table 1a. Table 2 summarizes the notations used in this paper.
Definition 1.
k-anonymity [7,8]: A relation R with A qi over the schema R(A1, A2, …, An) in a masked microdata table T' is said to be k-anonymous if and only if the number of tuples for any combination of A i qi values, t( A in qi ), is greater than or equal to k in R.
iff |{A_i^qi × t(A_in^qi)}|_T ≥ k
where k is the anonymity level (as shown in Table 1b). The k-anonymity model hides each of the k records in a crowd of at least k−1 other records, but it does not impose any restrictions on the algorithm to sufficiently protect the individuals. Consequently, the probability of linking a victim to a specific record through the A qi s is at most 1/k.
Definition 2.
l-Diversity [15]: A QI block in a masked microdata table T' having m QI-blocks QI j (1 ≤ j ≤ m) is l-diverse if it contains at least l well-represented A s values. In an l-diverse modified microdata table T', every QI block is l-diverse.
iff |{A_i^qi × A_i^s}|_T ≥ l
Definition 3.
t-closeness [16]: An EC is considered t-close if the distance between the distribution of the sensitive data in the class and the distribution of the sensitive data in the whole table is equal to or less than a threshold t. If every EC is t-close, the whole table is t-close. To calculate this distance, researchers studying the transportation problem have explored several methods [33,35]. However, most of them focus on the Earth Mover Distance (EMD) method [15,36]. EMD(P, Q) measures the minimum cost of transforming one distribution P into another distribution Q; it depends on the amount of mass moved and the distance over which it is moved.
Definition 4.
p-sensitive k-anonymity [19]: The masked microdata table T’ is p-sensitive k-anonymous if it is k-anonymous and each EC in T’ has at least p distinct A s values.
iff |{A_i^qi × t(A_in^qi)}|_T ≥ k ∧ (∀G : {A_i^qi × A_i^s} ∈ T, A_n^s ← Count(Dist(A_i^s)) ≥ p)
where G represents an EC that already satisfies k-anonymity and is a set of A i s and A i qi values. The value of A n s must be equal to or greater than p, where A n s represents the number of distinct A s values in an EC.
Definition 5.
Categorical similarity attack: An attack in which an adversary learns that the sensitive values in an EC of the l-diverse modified microdata T' (satisfying k-anonymity and l-diversity) all belong to a single sensitive category out of the p distinct A s categories.
Definition 6.
Sensitive variance attack: The privacy leakage in an EC due to the low variability of sensitive values from p distinct A s categories.
Definition 7.
High-Level Petri Nets (HLPN) [37]: HLPN models the behavior of a system together with its mathematical properties. An HLPN is a 7-tuple N = (P, T, F, φ, R_n, L, M_0), where P, represented by circles, is the set of places; T is the set of transitions in the system, represented by rectangular boxes, such that P ∩ T = ∅; F represents the flow relations, such that F ⊆ (P × T) ∪ (T × P); φ maps places P to data types; R_n represents the rules or properties for transitions that verify the correctness of the underlying system; L represents the labels on F; and M_0 is the initial marking.
The following section reviews the p+-sensitive k-anonymity model, to highlight its shortcomings concerning sensitive variance or an S-Variance attack.

4. Problem Statement

Definitions 8 and 9 describe the p+-sensitive k-anonymity and ( p ,   α ) -sensitive k-anonymity models [17], respectively.
Definition 8.
p+-sensitive k-anonymity [17]: A masked microdata table T' fulfills p+-sensitive k-anonymity if it fulfills k-anonymity and, for each EC in T', the number of distinct categories to which its A s values belong is equal to or greater than p.
(∀G : {A_i^qi × A_i^s} ∈ G, C ∈ C_G, C_n ← Count(Dist(C)) ≥ p)
where C depicts the A s value categorization, which already fulfills the p-sensitive k-anonymous approach, and C n represents the number of distinct categories in Table 3 [17], which must be equal to or greater than p. Table 4a, obtained from Table 1a, shows the p+-sensitive k-anonymity model with p = 2, k = 4 and c = 2. The ECs column in Table 4a is not part of the published table.
Definition 9.
( p , α )-sensitive k-anonymity [17]: A modified microdata table T' fulfills the k-anonymity property, and there must be p distinct sensitive attribute values in each QI-group, each having a weight of at least α.
( G : { A i qi × A i s }   T   A n s   p   w c   α )
where G represents all groups in the masked micro table T' that already fulfill the p-sensitive k-anonymity property. A weight is assigned to each category, and each of the p sensitive values must have a category weight, i.e., w c , of at least α. Table 4b, obtained from Table 1a, shows ( p , α )-sensitive k-anonymity.
The sensitive variance and categorical similarity attacks differ only slightly with respect to the variability of A s in an EC. The sensitive variance attack is the more powerful of the two, i.e., categorical similarity attack ⊆ sensitive variance attack. Therefore, attribute disclosure through the sensitive variance attack automatically covers disclosure through the categorical similarity attack. EC2 and EC3 in Table 4a, obtained through the p+-sensitive k-anonymous approach, are vulnerable to categorical similarity and sensitive variance attacks, as explained in Table 5. Table 5 shows the variance calculation for these ECs, where the more diverse EC2 has a high variance and the less diverse EC3 has a small variance.
To calculate the variance of the ECs, an ordered weight is assigned to the A s values such that the higher the frequency (f), the lower the weight (x). For example, consider EC3 in Table 4a, i.e., Flu = 2, Cancer = 1, HIV = 1, where the numeric value against each sensitive value represents its frequency of occurrence in EC3. If an EC is fully diverse, e.g., EC2 of size 4 that is 4-diverse, then each sensitive value has frequency 1 (Hepatitis = 1, Phthisis = 1, Asthma = 1, Obesity = 1) and receives a distinct ordered weight. EC2, because each A s value occurs only once, has a higher variance than EC3.
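The weighting scheme above can be reproduced numerically. The sketch below is an illustrative reconstruction (assuming ordered weights x = 1, 2, … assigned in descending order of frequency, an interpretation that reproduces the Table 5 values); the function name is ours:

```python
from collections import Counter

def ec_variance(sensitive_values):
    """Frequency-weighted variance of ordered ranks: the more frequent a
    sensitive value, the lower its rank weight x (ties get consecutive ranks)."""
    freq = Counter(sensitive_values)
    # sort distinct values by descending frequency, assign weights 1, 2, ...
    ranked = sorted(freq.items(), key=lambda kv: -kv[1])
    weights = {v: rank for rank, (v, _) in enumerate(ranked, start=1)}
    n = len(sensitive_values)
    mean = sum(f * weights[v] for v, f in freq.items()) / n
    return sum(f * (weights[v] - mean) ** 2 for v, f in freq.items()) / n

ec2 = ["Hepatitis", "Phthisis", "Asthma", "Obesity"]  # fully diverse EC
ec3 = ["Flu", "Flu", "Cancer", "HIV"]                 # 'Flu' is duplicated
print(ec_variance(ec2))  # 1.25
print(ec_variance(ec3))  # 0.6875 ≈ 0.69, matching Table 5
```

The duplicated 'Flu' pulls EC3's variance well below that of the fully diverse EC2, which is exactly the signal the sensitive variance attack exploits.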
An adversary, using the category table (Table 3), can analyse the ECs in Table 4a and Table 4b published in [17]. The variability in some of the ECs is low concerning the category table. Therefore, the adversary can isolate the sensitive values that belong to a specific category and hence to individual records, and thus breaches the identity of an individual.

Critical Review of p+-Sensitive k-Anonymity Model

We formally modeled the p+-sensitive k-anonymity algorithm to check its invalidation with respect to a sensitive variance attack. The detailed formal verification of the working of the p+-sensitive k-anonymity privacy model, along with its properties, is given in [18] (Rule 1 to Rule 7); it takes the original data input from the end-user and processes it. The sensitive variance attack on the p+-sensitive k-anonymity model is shown in Figure 1, where the arrow heads show the data flow. Table 6 shows the variable types and their descriptions. The places P and their descriptions are shown in Table 7. The attacker model in Figure 1 consists of three entities: the end-user, the adversary, and the trusted data publisher.
In Figure 1, the input to the HLPN model, via transitions T, consists of patients' records (original data). A trusted data publisher further processes the data to minimize the attribute disclosure risk. Generalization and the removal of identifying attributes transform the data into masked data. After generalization, the masked microdata table is ready to be published. An adversary then exploits the published data for their own benefit.
In this paper, the first seven rules in [18] are outlined briefly. For input k, the data publisher processes the original data to perform data generalization via the Generalize() function, and each EC is stored at the place micro mask table (MMT). The publisher confirms the k-anonymity condition; if successful, the Condition variable is set to true. For each EC, the Dist() function calculates the distinct A s values and stores their count at place ds. To further process the array of t A s , the Count() function counts S n and stores it at place Count Ds. Before the calculation of C n , p-sensitive k-anonymity is verified in the masked data: transition CheckPK checks that there are at least p distinct A s values in each EC of the whole table. PLevel stores the input transition p value for comparison. Apart from the k-anonymity check, another check for the p value is performed; if it returns true, the data already fulfill k-anonymity. This concludes a successful transition and ensures the p-sensitive k-anonymous property. Next, the A s value categories are computed using the function Get_Cat(); both the A s values and the categories are stored at place Gi for further processing. The actual improvement over the prior model, and the source of p+-sensitive k-anonymity, is the transition CheckPPK. The distinct categories in a column are calculated using the sensitive values, and Comp C stores this number of distinct categories. The C n involved in each EC is checked with transition CheckPPK to confirm that there are at least p distinct categories; the minimum value for p is 2. The p+-sensitive k-anonymity properties are fulfilled if the condition variable returns true.
The p+-sensitive k-anonymity model is highly vulnerable to a sensitive variance attack. The main reason is the existence of non-diverse (low variance) A s values, similar to 'Flu' in Table 4a and Table 4b, and 'HIV' in Table 4b. In Rule (1), through the function S-Variance_Attck(), an adversary performs an attack on the released data using some external source of information, i.e., BK. In Rule (1):
R(Attack) = ∀ i_40 ∈ x_40, i_42 ∈ x_42, i_43 ∈ x_43, i_2 ∈ x_2 | S-Variance_Attck(i_40[2], i_42[2]) → i_43[2] = i_2[1] ∧ i_43[2] = i_2[3]
The adversary takes the union of the published data with the external information and BK to map individuals to ECs. In this way, specific individuals correspond to specific ECs whose sensitive values belong to homogeneous categories; hence, sensitive values from a specific category disclose an individual. Therefore, a sensitive variance attack occurs due to low variance in the corresponding ECs.

5. The Proposed θ -Sensitive k-Anonymity Privacy Model

5.1. Threshold θ -Sensitivity

The goal of the proposed θ-sensitive k-anonymity privacy model is to prevent attribute disclosure of the individual records in MT, collected through IoT-enabled [2,3,4,5,6] devices. Each EC in MT must satisfy the threshold θ value. The θ-sensitivity is the product of the variance ( σ² ) and Observation 1 (µ), as shown in Equation (1).
θ = Variance of a fully diverse EC (σ²) × Observation 1 (µ)    (1)
The variance value represents the diversity in an EC: high variance means high diversity, and vice versa. Achieving 100% diversity is almost impossible in all cases; however, a variance-based optimal frequency distribution of A s values, with some fixed amount of added noise, achieves enhanced data privacy in an EC. The proposed method in this paper is simple and effective. While examining each EC, if the variance of the EC is greater than θ, i.e., the EC is sufficiently diverse, the next EC is examined. Otherwise, the variance of that EC is increased above θ by swapping A s values with the successor ECs or by adding some noise records. Because of the required noise addition, our proposed model implies ε-differential privacy [22,23], but the proposed approach is a syntactic anonymization [9] approach.
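The examine–swap–add-noise loop just described can be sketched as follows. This is a simplified, illustrative reconstruction, not the paper's Algorithm 1: the rank-weight variance from Section 4's Table 5, the per-EC threshold θ = (k² − 1)/12 · µ, and all helper names are our assumptions.

```python
from collections import Counter

def ec_variance(svals):
    # rank-weighted variance: the more frequent a value, the lower its weight x
    freq = Counter(svals)
    ranked = sorted(freq.items(), key=lambda kv: -kv[1])
    w = {v: r for r, (v, _) in enumerate(ranked, start=1)}
    n = len(svals)
    mean = sum(f * w[v] for v, f in freq.items()) / n
    return sum(f * (w[v] - mean) ** 2 for v, f in freq.items()) / n

def theta_repair(ecs, mu, noise_pool):
    """For each EC below θ, try swapping its most frequent sensitive value with
    one from the successor EC (only if absent on both sides), else add noise."""
    for i, ec in enumerate(ecs):
        theta = (len(ec) ** 2 - 1) / 12 * mu  # θ = σ²(fully diverse EC) × µ
        if ec_variance(ec) >= theta:
            continue                           # EC is diverse enough; move on
        if i + 1 < len(ecs):                   # try a swap with the successor EC
            nxt = ecs[i + 1]
            dup = Counter(ec).most_common(1)[0][0]
            cand = Counter(nxt).most_common(1)[0][0]
            if cand not in ec and dup not in nxt:  # cross-check before swapping
                ec[ec.index(dup)], nxt[nxt.index(cand)] = cand, dup
        if ec_variance(ec) < theta:            # still low: append a noise tuple
            ec.append(next(v for v in noise_pool if v not in ec))
    return ecs

ecs = [["Flu", "Flu", "Cancer", "HIV"],
       ["Hepatitis", "Phthisis", "Asthma", "Obesity"]]
theta_repair(ecs, 0.6, ["Diabetes"])
print(ecs[0])  # one occurrence of 'Flu' swapped out; both ECs now meet θ
```

Note the sketch ignores some details of the real algorithm, e.g., the special cross check for the last EC and the effect of an appended noise tuple on the EC's own size.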

5.1.1. Variance ( σ 2 )

The variance calculation in Table 5 for the ECs depicts the variability in numerical form. To standardize the θ value for ECs of different sizes, and thereby prevent the sensitive variance attack, we first consider a fully diverse EC (e.g., variance = 0.25 for EC size 2, 0.67 for size 3, 1.25 for size 4, 2 for size 5, and so on) and then multiply the variance by an observed value from Observation 1 ( µ ).
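These reference variances are simply the variance of the rank weights 1…k, which has the closed form (k² − 1)/12. A small sketch (the helper name is illustrative) reproduces the values listed above:

```python
def fully_diverse_variance(k):
    # population variance of the rank weights 1..k; equals (k**2 - 1) / 12
    ranks = range(1, k + 1)
    mean = sum(ranks) / k
    return sum((x - mean) ** 2 for x in ranks) / k

for k in (2, 3, 4, 5):
    print(k, fully_diverse_variance(k))  # 0.25, 0.666..., 1.25, 2.0
```

The closed form makes the threshold cheap to compute for an EC of any size, with no need to tabulate the reference variances in advance.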

5.1.2. Observation 1 ( μ )

Observation 1 (µ) is the decimal multiplier used to obtain θ; this threshold value has full control over the EC diversity. During the simulation in Python, different values of µ were checked to find a suitable θ value. After executing the dataset for ECs of different k sizes, values of µ in the range of 0.5 to 0.9 were found suitable. A smaller observed µ value permits more frequent repetition of sensitive values in an EC, whereas a higher observed value produces a more diverse EC. However, "what observed value should be chosen for different size ECs?" is explained below.
Consider again the 2+-sensitive 4-anonymous Table 4a, with EC2 variance = 1.25 and EC3 variance = 0.69. The difference arises from the duplicated sensitive value, i.e., Flu, in EC3. We propose an efficient way of removing the frequency repetition of sensitive values to achieve a more diverse EC. For this, we calculate the θ value. For example, consider a fully diverse EC of size 4 with variance = 1.25 and multiply it by an observed value in the range 0.5 to 0.9: 1.25 × 0.5 = 0.625, which is less than 0.69, while 1.25 × 0.6 = 0.75, which is greater than 0.69. The difference between the two variances, i.e., 1.25 and 0.69, is caused by only one duplicated value, "Flu". Thus, the choice depends on the privacy requirements and the level of diversity we are interested in achieving. In this paper, we perform a very strict θ calculation to obtain fully diverse ECs. Therefore, in the implementation of the proposed Algorithm 1, we multiply the variance of a size-4 EC by the observed value µ = 0.6 to obtain a fully diverse EC. The same technique is applied to all other ECs as well. The θ obtained in this way in line 8 of the proposed Algorithm 1 in Section 5.2 is then checked in the conditional part at line 10, inside a loop that checks all ECs against the θ requirement.
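The µ selection above amounts to finding the smallest µ (in steps of 0.1) whose θ rejects the under-diverse EC. A minimal sketch of that check (the function name is illustrative; 0.69 is EC3's Table 5 variance):

```python
def theta(ec_size, mu):
    # θ = variance of a fully diverse EC of this size, (k² - 1)/12, times µ
    return (ec_size ** 2 - 1) / 12 * mu

ec3_variance = 0.69  # EC3 of Table 4a, with one duplicated 'Flu'
for mu in (0.5, 0.6, 0.7, 0.8, 0.9):
    flagged = ec3_variance < theta(4, mu)
    print(mu, theta(4, mu), flagged)
# µ = 0.5 gives θ = 0.625 and accepts EC3; µ = 0.6 gives θ = 0.75 and flags it
```

This is why µ = 0.6 is used for size-4 ECs: it is the smallest tested multiplier whose threshold detects even a single duplicated sensitive value.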
Definition 10.
θ-Sensitive k-anonymity: A modified microdata table T' fulfills θ-sensitive k-anonymity if it fulfills k-anonymity and the variance of each EC in T' is at least θ.
iff |{A_i^qi × t(A_in^qi)}|_T ≥ k ∧ (∀G : {A_i^qi × A_i^s} ∈ T, A_n^s ← Count(Dist(A_i^s)) ≥ θ)
where G represents a QI-group or EC that already satisfies k-anonymity and is a set of A i s and A i qi values. The value of A n s must be equal to or greater than p, where A n s is the number of distinct sensitive values in a QI-group. The proposed θ-sensitive k-anonymity model produces the anonymized Table 8a (with noise) from the original microdata Table 1a, and Table 8b (without noise) from Table 4b. The A qi values in Table 8a and Table 8b are generalized through local recoding (bottom-up generalization), which improves the utility of the anonymized data. The 4-diverse ECs in Table 8a and Table 8b have sensitive values from at least three different sensitive categories of Table 3. Therefore, these tables provide stronger attribute privacy and are better protected against a sensitive variance attack.

5.2. The Proposed θ -Sensitive k-Anonymity Algorithm

The proposed θ-sensitive k-anonymity algorithm starts execution by checking the k size for creating an EC (minimum cardinality k = 2) at line 3. The algorithm can be executed for different sizes of k. However, if the minimum cardinality check fails, the condition becomes false and the algorithm jumps to line 50. If it is true, the for loop runs from lines 5 to 7 to calculate the variance of each m-size G i qi or EC i that belongs to the k-anonymous ECs and assigns them to an array, i.e., V EC i .
Line 8 multiplies an average observed value µ and the variance σ² of an EC to obtain the threshold θ (i.e., Equation (1)). This θ value ensures the maximum level of diversity in an EC. θ mainly depends on µ: if µ is smaller, a less diverse EC will be obtained, and vice versa. The level of diversity we want in an EC is completely controlled by µ. After closely studying l-diversity [15] and t-closeness [16] and performing experiments while executing the algorithm in Python, the µ value is chosen to achieve maximum diversity. The main loop runs from lines 9–49 and checks the obtained variances, for each m-size EC under user input k, against θ. At line 10, if V EC is greater than θ, line 46 is executed and the algorithm moves on to the next EC. If it is less than θ, the current EC is named EC c and the next-index EC is named EC b . In lines 12–45, each branch of the if statement has two major functionalities: swapping and, if required, noise addition.
The if branch at lines 13–25 processes the last EC, while the else branch handles the first EC through EC_{n−1}. At line 12, if EC_c is the last class, i.e., EC_n, then MS_n is computed from A_EC_n^s; similarly, MS_{n−1} is computed from EC_{n−1}. At line 15, the crossCheck() function verifies that the most frequent A^s of each EC does not exist in the other, after which swap() may be executed. The purpose of the cross-check is to avoid increasing or decreasing V_EC_{n−1}, because EC_{n−1} has already been processed by the else branch of the current if statement; this step serves only to increase the diversity of the last EC. If any A^s value from EC_n exists in EC_{n−1}, or vice versa, the swap at line 17 is not performed. If the swap is performed, V_EC_n is recalculated and compared with θ (line 20). If V_EC_n is still less than θ, the algorithm jumps to line 43 to add a distinct A^s value as noise, increasing the variance and achieving high diversity.
To process the first EC through EC_{n−1}, the else branch of the if statement at line 12 executes. The algorithm searches for an EC_b whose variance exceeds θ (lines 27–31). When the EC_b-found condition is satisfied, the function mfsv() is executed on both EC_c and EC_b to determine the most frequent sensitive value in each. Before swapping MS_c and MS_b, the function backCheck() checks whether MS_b already exists in EC_c, the EC ahead of EC_b. If it does, that MS value is removed from a temporary array in Algorithm 1.
Algorithm 1: θ-sensitive k-anonymity
Input: Microdata Table (MT)
Output: θ-sensitive k-anonymous table (MMT)
 1 Procedure: θ-sensitive-k-anonymity(MMT, θ, k)
 2   Let k ∈ MMT
 3   if |k| ≥ 2 then
 4     Condition = true;
 5     for each m-size EC in G_i^qi: {A_i^qi × A_i^s} ∈ k do   ► G_i^qi set, consists of A_i^qi and A_i^s
 6       V_EC_i ← vari(A_EC_i^s)   ► vari(): variance of each m-size EC
 7     end for
 8     θ ← μ · σ²   ► θ: required threshold
 9     for each m-size EC_i in G_i^qi: {A_i^qi × A_i^s} ∈ k do
10       if V_EC_c < θ then
11         EC_b ← EC_c + 1
12         if EC_n = EC_c then
13           MS_n ← mfsv(A_EC_n^s)   ► mfsv(): most frequent A_EC_n^s
14           MS_{n−1} ← mfsv(A_EC_{n−1}^s)   ► mfsv(): most frequent A_EC_{n−1}^s
15           notExist ← crossCheck(MS_EC_n, MS_EC_{n−1})   ► crossCheck(): check existence on both sides
16           if notExist then
17             swap(MS_n, MS_{n−1})   ► swap(): MS values of last and 2nd-last ECs
18           end if
19           V_EC_n ← vari(A_EC_n^s)
20           if V_EC_n < θ then
21             break
22             jump to else part of condition, line 43
23           else
24             break
25           end if
26         else
27           for EC_b till EC_n in G_i^qi: {A_i^qi × A_i^s} ∈ k do
28             if V_EC_b > θ then
29               break loop
30             end if
31           end for
32           if EC_b = found then
33             MS_c ← mfsv(A_EC_c^s)   ► mfsv(): most frequent A_EC_c^s
34             MS_b ← mfsv(A_EC_b^s)   ► mfsv(): most frequent A_EC_b^s
35             MS_b ← backCheck(MS_EC_c, MS_EC_b)   ► backCheck(): find MS value in MS_EC_b absent from MS_EC_c
37             swap(MS_c, MS_b)   ► swap(): exchange MS values
38             V_EC_c ← vari(A_EC_c^s)   ► vari(): recompute variance
39             if V_EC_c > θ then
40               EC_c += 1
41             end if
42           else
43             NS ← addNoise(A_EC_c^s)   ► addNoise(): until variance > θ
44           end if
45         end if
46       else
47         EC_c += 1
48       end if
49     end for
50   else
51     Condition = false;
52   end if
The next MS in the same EC_b is then checked against MS_c, and this process continues until an A^s value is found in MS_EC_b that does not exist in MS_c. Line 37 then swaps these two MS values along with their corresponding records. The swap function achieves two important goals: it reduces the frequency of the repeated A^s, and it increases the diversity in EC_c, thereby increasing V_EC_c. V_EC_c is recalculated and compared with θ; if it is greater than θ, the counter for EC_c advances to the next EC.
Here, the absence of an else statement prevents noise from being added instantly whenever the variance is less than θ, because more than one swap may be possible for a specific EC_c; noise is added only once, after the frequency of every A^s in the EC has been checked. For example, when producing a 4-anonymous EC table from Table 1a, suppose one swap (e.g., 'HIV' for 'Obesity') makes the resulting EC1 in Table 8a 3-diverse with a variance still below θ. An else branch would add noise to increase the variance even though a duplicated A^s value, 'Cancer', still exists in EC_c. Instead, to reduce the frequency of this next duplicated A^s value by swapping it with another A^s from an EC_b, if one exists, noise is not added at this moment. Control returns to line 10, and since the increased variance is still less than θ, the procedure repeats and a new A^s from an EC_b is swapped with the next duplicated A^s value. In this way, two swaps turn a 2-diverse EC_c into a 4-diverse one without adding any unnecessary noise, which increases data utility and yields a more diverse EC.
EC_b is found by virtue of having a variance greater than θ, but it is possible that no EC in the dataset has a variance above θ, in which case the loop never breaks (line 29). The algorithm then jumps to line 43 and adds a dummy record with distinct A^s value(s) via the addNoise() function. Such an addition is treated as noise on the real data, much like the noise added in differential privacy [22,23]. The algorithm thus swaps values and adds noise judiciously: the purpose of these two functions, swap() and addNoise(), is to increase diversity while keeping utility as high as possible, which the algorithm achieves, as shown in the experimental evaluation in Section 6.
The sanitized Table 4a produced by p+-sensitive k-anonymity is prone to homogeneity, categorical similarity, and sensitive variance attacks, whereas Table 8a produced by θ-sensitive k-anonymity secures the data against these attacks through greater diversity, even at the category level: the maximum value of category c is 4 under θ-sensitive k-anonymity, against 2 for Table 4a. Table 8a therefore provides stronger protection against the categorical similarity attack. Because no further value swapping is possible in the last EC, a single tuple is added as noise to increase diversity and to prevent the categorical similarity and sensitive variance attacks; such a small amount of noise has little effect on data utility. Table 4b is the base table from which Table 8b is obtained using the θ-sensitive k-anonymity approach. Table 8b is also highly diverse at the categorical level and contains no repeated sensitive values, so no noise needs to be added to reach a high variance. The anonymized data in both Table 8a and Table 8b, obtained through the proposed θ-sensitive k-anonymity algorithm, carry no attribute disclosure risk and are resistant to homogeneity [11], categorical similarity, and sensitive variance attacks, and even to skewness attacks [12].

5.3. Analysis of θ -Sensitive k-Anonymity Model Using Formal Modeling and Analysis

The proposed θ-sensitive k-anonymity model mitigates the vulnerabilities discussed in Section 4. The HLPN model of θ-sensitive k-anonymity has the same end-user, data publisher, and unknown adversary, as shown in Figure 2. Table 9 and Table 10, respectively, list the variable types and places with their corresponding descriptions.
The θ-sensitive k-anonymity algorithm was modeled through HLPN rules for the microdata input. The data publisher first verifies the input k-value. The original data are k-anonymized (bottom-up generalization) after the individual records in each EC are finalized through variance calculations. In Rule (2), k-anonymity masks the data:
R(MaskData) = ∀ i2 ∈ x2, ∀ i3 ∈ x3, ∀ i4 ∈ x4 |
i4[1] := Mask{i2[2]} ∧ i4[2] := Mask{i2[3]} ∧ x4′ = x4 ∪ {i4[1], i4[2], i3}
If the input k is less than the minimum EC size (i.e., k < 2), the condition fails; for a cardinality of 2 or above, the algorithm executes. The true and false outcomes of the k-anonymity check are depicted in Rule (3):
R(Check k) = ∀ i5 ∈ x5, ∀ i6 ∈ x6 |
(Count(i5[1]) ≥ i5[3] → i6[1] := TRUE) ∧ (Count(i5[1]) < i5[3] → i6[1] := FALSE) ∧ x6′ = x6 ∪ {i6[1]}
The threshold θ is calculated in Rule (4). The variance of a fully diverse EC for a specific k is calculated using the vari() function. The key contributed functions are swap() and addNoise(), through which the algorithm processes all ECs; the transition Adjust Var performs this swapping and noise addition in the corresponding ECs. Rule (5) gives the Compute Var transition for the initial ECs; the same transition applies to the remaining ECs in the same manner. In Rule (4):
R(Calc Theta) = ∀ i10 ∈ x10, ∀ i11 ∈ x11, ∀ i12 ∈ x12 | i12 := {i11 · (i10)²} ∧ x12′ = x12 ∪ {i12}
In Rule (5):
R(Compute Var) = ∀ i8 ∈ x8, ∀ i9 ∈ x9 | i9[1] := ComputeVar(i8[1]) ∧ i9[2] := i8[2] ∧ x9′ = x9 ∪ {i9[1], i9[2]}
The main functionality of the θ-sensitive k-anonymity model is described in Rules (6) and (7). Rule (6) checks the variance of each k-anonymous EC against θ. If the variance of EC_c is greater than θ (i.e., i14[1] > i13), the algorithm moves to the next EC_c and updates the value in place MMT. If the variance of EC_c is less than θ (i.e., i14[1] < i13), the transition stops: the algorithm tries to find an EC_b and swaps the required available A^s values from it. After all needed swapping, if the variance of AdjEC_c is still less than θ (i.e., i32 < i13), noise is added to increase its diversity. In Rule (6):
R(Check Variance) = ∀ i13 ∈ x13, ∀ i14 ∈ x14, ∀ i15 ∈ x15, ∀ i19 ∈ x19, ∀ i23 ∈ x23, ∀ i24 ∈ x24, ∀ i32 ∈ x32, ∀ i33 ∈ x33 |
((i14[1] > i13) → i16[1] := i15[1] + 1 ∧ x16′ = x16[2] ∪ {i16}) ∧
((i14[1] < i13) → i16[2] := i15[1] + 1 ∧ x16′ = x16[2] ∪ {i16}) ∧
((i14[2] > i13) → x19′ = x19 ∪ {i19}) ∧
((i24 < i13) → x25′ = x25 ∪ {i25}) ∧
((i32 < i13) → x33′ = x33 ∪ {i33})
The proposed θ-sensitive k-anonymity algorithm starts by processing each k-size EC. The function Comp mfsv() computes the most frequent values of A_EC_c^s and A_EC_b^s, named MS_EC_c and MS_EC_b, respectively. The one-way checking function backCheck() looks, at place FoundEC_b, for an MS_EC_b value that does not exist in the earlier EC_c.
Once this check succeeds, MS_b is swapped with MS_c and the result is saved in place AdjEC_c; EC_c thus reduces the frequency of its repeated A^s value and gains diversity. When processing the last EC, EC_n, forward swapping is not possible, so swapping is performed with the previous EC, under the condition that the variance of the already processed EC_{n−1} must not decrease below θ. The crossCheck() function performs a two-way check confirming that MS_n and MS_{n−1} are distinct from each other and that the swap will not push the variance of EC_{n−1}, at place StrictEC_{n−1}, back to an undesired value; for this reason we call it strict EC_{n−1}. In other words, in addition to increasing the diversity of EC_n, the swap must not increase the frequency of any A^s value at place EC_{n−1}. The values are then swapped and saved at place AdjEC_n. Rule (7) shows the whole process:
R(Adjst Var) = ∀ i17 ∈ x17, ∀ i20 ∈ x20, ∀ i21 ∈ x21, ∀ i28 ∈ x28, ∀ i29 ∈ x29 |
((i17[1] ≠ i17[3]) → Comp mfsv(i20[1], i17[1]) ∧ True := backCheck(i20[1], i17[1]) ∧ i21 := swap(i20[1], i17[1]) ∧ x21′ = x21 ∪ {i21}) ∧
((i17[1] = i17[3]) → Comp mfsv(i17[3], i28[1]) ∧ True := crossCheck(i17[3], i28[1]) ∧ i29 := swap(i17[3], i28[1]) ∧ x29′ = x29 ∪ {i29[1]})
If the variance of AdjEC_c is still less than θ (i.e., i34[1] < i35), a dummy record, called noise, is added wherever needed during the variance adjustment process. Rule (8) gives the final noise-addition case for the last AdjEC_n; its purpose is to raise the variance above θ, producing a highly diverse EC even when MMT does not contain enough diverse records. In Rule (8):
R(Add Noise) = ∀ i34 ∈ x34, ∀ i35 ∈ x35, ∀ i36 ∈ x36 |
(i34[1] < i35) → i36 := addNoise(i34[2], i34[3], i34[4]) ∧ x36′ = x36 ∪ {i36[1], i36[2], i36[3]}
In Rule (9), an adversary attacks an individual's A^s values by combining already available background knowledge BK (i.e., i40[2]) with the published data (i.e., i38[2]) to disclose the patient's identity (i.e., i2[2]) and sensitive values (i.e., i2[3]). The θ-sensitive k-anonymity model provides better protection against attribute disclosure attacks because it maintains a high variance through swapping and noise addition in the corresponding ECs. The diversity of sensitive attribute values in the ECs defeats the adversarial BK and is more effective than the p+-sensitive k-anonymity model: the adversary obtains no private information about the target individual, and the attack yields a null value. In Rule (9):
R ( S Variance   Attack ) : =   i 38 x 38 ,   i 40 x 40 ,   i 41 x 41 | Att _ Dis ( i 38 [ 2 ] , i 40 [ 2 ] ) ( i 2 [ 1 ] i 2 [ 2 ] i 2 [ 3 ] ) ( i 41 [ 2 ]   i 41 [ 3 ] ) =

6. Experimental Evaluation

This section describes the experiments performed to show the effectiveness of the proposed θ-sensitive k-anonymity privacy model in comparison to the p+-sensitive k-anonymity model. The proposed algorithm diversifies the A^s values in a balanced way inside each EC without using the categorical approach. The utility and quality of the anonymized released data were checked with several quality measures.

6.1. Experimental Setup

All experiments were performed on a machine with an Intel Core i5 2.39 GHz processor and 4 GB RAM, running the Windows 10 operating system. The algorithm was written in Python 3.7. We used the Adult dataset, openly accessible at the UC Irvine Machine Learning Repository, which contains age, zip code, salary, and occupation attributes. We treated age, zip code, and salary as A^qi s and occupation as A^s.
The experimental results show the usefulness of the proposed θ-sensitive k-anonymity privacy model and its protection against the categorical similarity and sensitive variance attacks compared to the p+-sensitive k-anonymity model. The quality of the sanitized publicly released data was evaluated with four utility metrics: discernibility penalty (DCP) [18,38,39], normalized average QI-group size (CAVG) [17,18,38], noise calculation, and query accuracy [18,33]. The execution time of both algorithms was analyzed at the end of the experiments.

6.2. Discernibility Penalty (DCP)

The DCP, proposed in [38] and used in [18,39], assigns a penalty (cost) to each tuple in the generalized dataset according to how many tuples it cannot be distinguished from in the result set. Minimizing the discernibility cost is the optimization objective. The penalty for a tuple t belonging to an EC of size |EC|, i.e., t ∈ EC, is |EC|, so the penalty for each EC is |EC|². The complete DCP for the overall sanitized released dataset R* is given in Equation (2):
DCP(R*) = ∑_{i=1}^{|{EC}|} |EC_i|²
where |{EC}| is the total number of ECs in R*. A baseline can be obtained from the optimal DCP score calculation shown in [10]. For example, if k = 2 and the number of anonymized tuples is 10, the optimal DCP score is 2² + 2² + 2² + 2² + 2² = 20; this optimal score is called the baseline. The grouping approach followed in this paper is based on the k size, inclusive of the noise tuple(s). A higher k means a bigger group size, so the baseline moves up because of a higher DCP score. The p+-sensitive k-anonymity model generates groups based on p, meaning the number of tuples in a k-anonymous class can be greater than p. Figure 3 shows the DCP score for θ-sensitive k-anonymity, with a comparison against p+-sensitive k-anonymity and the baseline. Compared with p+-sensitivity, the DCP score of the proposed θ-sensitive k-anonymity algorithm is almost equal to the baseline, implying that the proposed model assigns an optimal penalty to each EC and produces an optimal DCP score. The magnified subplots in Figure 3 at k = 12 and k = 16 show the very minor difference from the baseline for θ-sensitive k-anonymity. This minor difference can also be seen in Table 11, with an average DCP gap of 47.2, or 0.002679%, relative to the baseline obtained from the simulation while calculating the DCP for the anonymized dataset R*.
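The DCP of Equation (2) and its baseline can be computed with a short sketch; the function names are illustrative.

```python
def dcp(ec_sizes):
    """Discernibility penalty (Equation (2)): each EC of size |EC| contributes |EC|^2."""
    return sum(s * s for s in ec_sizes)

def dcp_baseline(n_tuples, k):
    """Optimal score assuming every EC holds exactly k tuples (n_tuples divisible by k)."""
    return (n_tuples // k) * k * k

# 10 tuples anonymized as five 2-anonymous ECs:
print(dcp([2, 2, 2, 2, 2]))   # 20
print(dcp_baseline(10, 2))    # 20
```

Any larger-than-k EC raises the score above the baseline, e.g., `dcp([4, 2, 2, 2])` gives 28 for the same 10 tuples.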

6.3. Normalized Average (CAVG)

CAVG is another mathematically sound measure that evaluates the quality of the sanitized data by the average EC size. It was proposed in [38] and applied in [17,18]. CAVG is calculated as in Equation (3):
C_AVG = (|R*| / |{EC}|) ÷ k
where |R*| is the size of the overall sanitized released dataset and |{EC}| is the total number of ECs in R*. Data utility and CAVG are inversely proportional: a low CAVG value indicates high information utility, and the optimal goal is to minimize the size of the ECs in R*. Figure 4 shows CAVG for p+-sensitive k-anonymity and θ-sensitive k-anonymity over k. p+-sensitive k-anonymity has lower data utility for small k and higher data utility for large k, whereas the proposed technique maintains a balanced, sustained utility for every input value of k. Thus, the proposed θ-sensitive k-anonymity model performs efficiently for all sizes of k compared to the p+-sensitive k-anonymity model.
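Equation (3) reduces to a one-liner; a value of 1.0 means every EC is at the minimal size k, the optimal case. The function name is illustrative.

```python
def cavg(n_tuples, n_ecs, k):
    """Normalized average EC size (Equation (3)): (|R*| / |{EC}|) / k.
    A value near 1 means ECs are close to the minimal size k."""
    return (n_tuples / n_ecs) / k

print(cavg(160152, 40038, 4))   # 1.0 -> every EC holds exactly k = 4 tuples
```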

6.4. Noise Addition

Among the different masking methods, one popular approach is data perturbation, i.e., noise addition. These are dummy tuples added to the original data that help achieve the required diversity, similar to differential privacy [22,23]. If there are not enough A^s values to swap with, especially in the last two ECs, the gap is filled with noise tuples to prevent disclosure risk; this modest cost of noise addition is one reason for the good performance of the proposed model. Figure 5 shows the number of tuples added as noise for different values of k. These tuples are added to reach the required threshold θ. The algorithm responds differently for different values of k, but the maximum number of noise tuples added for any specific value of k is only six. In the processed Adult dataset, the total number of tuples was 160,150 and only 34 noise tuples in total, i.e., 0.021% of the total size, were added; such a utility loss is negligible. Part of this small amount of noise simply rounds the dataset size up to a multiple of the k-size input, for example, 160,150/4 = 40,037.5 but 160,152/4 = 40,038.
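The rounding component of the noise can be computed directly; this sketch (function name illustrative) gives the dummy tuples needed so the dataset divides evenly into k-size ECs.

```python
def noise_for_round_groups(n_tuples, k):
    """Dummy tuples needed so the dataset size is a multiple of k."""
    return (-n_tuples) % k

print(noise_for_round_groups(160150, 4))   # 2 -> 160152 / 4 = 40038 full ECs
```

The remaining noise tuples (beyond this rounding) are the ones added by addNoise() to push under-diverse ECs past θ.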

6.5. Query Accuracy

Query accuracy measures the precision of aggregate queries to check the utility of the anonymized data, and has been used in various research works [18,33]. To answer the aggregate queries, the built-in COUNT operator is used with the A^qi s as query predicates. Let R* be a sanitized release of the original microdata R with at most m QI attributes A_i^qi (1 ≤ i ≤ m), where D(A_i^qi) is the domain of the ith QI. The COUNT query of Equation (4) takes the form
SQLQuery = select COUNT(*) from R* where A_1^qi ∈ D(A_1^qi) AND ... AND A_m^qi ∈ D(A_m^qi)
For each query, at least one or a few tuples should be selected from each EC based on the query predicates. Two important parameters of the query predicates are (1) the query dimensionality q and (2) the query selectivity ϑ. Query dimensionality is the number of QIs in the query predicate, while query selectivity is the number of values for each attribute A_i (1 ≤ i ≤ n). The query selectivity is calculated as ϑ = |T_Q| / |R|, where |T_Q| is the number of tuples output by query Q on relation R and |R| is the total number of tuples in the whole dataset. The query error, Error(Q), is calculated in Equation (5):
Error(Q) = |count(R*) − count(R)| / count(R)
where count(R*) is the result of the COUNT query on the anonymized dataset and count(R) is the result of the COUNT query on the original microdata. More selective queries have a higher error rate.
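Both the selectivity ϑ and Equation (5) are simple ratios; a minimal sketch with illustrative function names:

```python
def selectivity(matched, total):
    """Query selectivity: fraction of the dataset's tuples the predicate selects."""
    return matched / total

def query_error(count_anon, count_orig):
    """Relative error of a COUNT query on anonymized vs. original data (Equation (5))."""
    return abs(count_anon - count_orig) / count_orig

# Generalized QI ranges match 110 tuples where the original data matched 100:
print(query_error(110, 100))   # 0.1 -> 10% relative error
```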
Figure 6a shows the query error against the input value of k. We compare p+-sensitive k-anonymity and θ-sensitive k-anonymity using the query error rate over 1000 randomly generated aggregate queries. The error rate increases for high values of k because of the wider generalization ranges of the A^qi s, which select more tuples than the original microdata and hence yield a higher error rate. Figure 6b shows that the more tuples the predicates select, the higher the error rate on the anonymized data.

6.6. Execution Time

Figure 7 shows the execution time of both the p+-sensitive k-anonymity model and the proposed θ-sensitive k-anonymity model. The execution time of both algorithms increases with k because of the larger A^qi generalization ranges. Since our approach does not consider sensitive-value categorization, it takes less time to execute than its counterpart. In the θ-sensitive k-anonymity model, the higher execution times at k = 10, k = 16, and k = 20 are due to the time taken to add more noise tuples to achieve the required diversity.

7. Conclusions

In this paper, the huge amount of data (i.e., Big Data) collected through IoT-based devices was anonymized using the proposed θ-sensitive k-anonymity privacy model and compared with the p+-sensitive k-anonymity model, with the aim of preventing attribute disclosure in the anonymized data. The p+-sensitive k-anonymity model was shown to be vulnerable to privacy breaches from sensitive variance, categorical similarity, and homogeneity attacks. These attacks were mitigated by the proposed θ-sensitive k-anonymity privacy model using Equation (1), in which the threshold θ decides the diversity level of each EC of the dataset. The vulnerabilities of the p+-sensitive k-anonymity model and the effectiveness of the proposed θ-sensitive k-anonymity model were formally modeled through HLPN, further validating the proposed technique. The experimental work demonstrated the privacy protection and improved utility of the released data using different mathematical measures. As future work, the proposed algorithm can be extended to 1:M data (a single record having many attribute values) [40] and to multiple sensitive attributes (MSA) [41,42,43], or remodeled for dynamic datasets [44].

Author Contributions

Conceptualization, R.K. and X.T.; methodology, R.K., X.T., and A.A.; software, R.K.; validation, X.T. and A.A.; formal analysis, R.K., T.K., and S.u.R.M.; writing—original draft preparation, T.K.; writing—review and editing, T.K., X.T., A.A., A.K., W.u.R., and C.M.; supervision, X.T.; funding acquisition, X.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (61932005), and 111 Project of China B16006.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

References
  1. Dang, L.M.; Piran, J.; Han, D.; Min, K.; Moon, H. A Survey on Internet of Things and Cloud Computing for Healthcare. Electronics 2019, 8, 768. [Google Scholar] [CrossRef][Green Version]
  2. Sun, W.; Cai, Z.; Li, Y.; Liu, F.; Fang, S.; Wang, G. Security and Privacy in the Medical Internet of Things: A Review. Secur. Commun. Netw. 2018, 2018, 1–9. [Google Scholar] [CrossRef]
  3. Baek, S.; Seo, S.-H.; Kim, S.J. Preserving Patient’s Anonymity for Mobile Healthcare System in IoT Environment. Int. J. Distrib. Sens. Netw. 2016, 12, 2171642. [Google Scholar] [CrossRef][Green Version]
  4. Liu, F.; Li, T. A Clustering K-Anonymity Privacy-Preserving Method for Wearable IoT Devices. Secur. Commun. Netw. 2018, 2018, 1–8. [Google Scholar] [CrossRef][Green Version]
  5. Wan, J.; Al-Awlaqi, M.A.A.H.; Li, M.; O’Grady, M.; Gu, X.; Wang, J.; Cao, N. Wearable IoT enabled real-time health monitoring system. EURASIP J. Wirel. Commun. Netw. 2018, 2018, 298. [Google Scholar] [CrossRef]
  6. Al-Khafajiy, M.; Baker, T.; Chalmers, C.; Asim, M.; Kolivand, H.; Fahim, M.; Waraich, A. Remote health monitoring of elderly through wearable sensors. Multimed. Tools Appl. 2019, 78, 24681–24706. [Google Scholar] [CrossRef][Green Version]
  7. Sweeney, L. k-anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl. Based Syst. 2002, 10, 557–570. [Google Scholar] [CrossRef][Green Version]
  8. Sweeney, L. Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl. Based Syst. 2002, 10, 571–588. [Google Scholar] [CrossRef]
  9. Song, F.; Ma, T.; Tian, Y.; Al-Rodhaan, M. A New Method of Privacy Protection: Random k-Anonymous. IEEE Access 2019, 7, 75434–75445. [Google Scholar] [CrossRef]
  10. Wang, J.; Du, K.; Luo, X.; Li, X. Two privacy-preserving approaches for data publishing with identity reservation. Knowl. Inf. Syst. 2018, 60, 1039–1080. [Google Scholar] [CrossRef][Green Version]
  11. Amiri, F.; Yazdani, N.; Shakery, A.; Chinaei, A.H. Hierarchical anonymization algorithms against background knowledge attack in data releasing. Knowl. Based Syst. 2016, 101, 71–89. [Google Scholar] [CrossRef]
  12. Yaseen, S.; Abbas, S.M.A.; Anjum, A.; Saba, T.; Khan, A.; Malik, S.U.R.; Ahmad, N.; Shahzad, B.; Bashir, A.K. Improved Generalization for Secure Data Publishing. IEEE Access 2018, 6, 27156–27165. [Google Scholar] [CrossRef]
  13. Liu, X.; Deng, R.H.; Choo, K.K.R.; Weng, J. An efficient privacy preserving outsourced calculation tool kit with multiple keys. IEEE Trans. Inf. Forensics Secur. 2016, 11, 2401–2414. [Google Scholar] [CrossRef]
  14. Michalas, A. The lord of the shares. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, Limassol, Cyprus, 8–12 April 2019; pp. 146–155. [Google Scholar] [CrossRef][Green Version]
  15. Machanavajjhala, A.; Gehrke, J.; Kifer, D.; Venkitasubramaniam, M. L-diversity: Privacy beyond k-anonymity. Int. Conf. Data Eng. 2006, 1, 24. [Google Scholar] [CrossRef][Green Version]
  16. Li, N.; Li, T.; Venkatasubramanian, S. t-Closeness: Privacy beyond k-Anonymity and l-Diversity. In Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering, Istanbul, Turkey, 15–20 April 2007; pp. 106–115. [Google Scholar]
  17. Sun, X.; Sun, L.; Wang, H. Extended k-anonymity models against sensitive attribute disclosure. Comput. Commun. 2011, 34, 526–535. [Google Scholar] [CrossRef]
  18. Anjum, A.; Malik, S.U.R.; Choo, K.-K.R.; Khan, A.; Haroon, A.; Khan, S.; Khan, S.U.; Ahmad, N.; Raza, B. An efficient privacy mechanism for electronic health records. Comput. Secur. 2018, 72, 196–211. [Google Scholar] [CrossRef]
  19. Campan, A.; Truta, T.M.; Cooper, N. p-sensitive k-anonymity with generalization constraints. Trans. Data Privacy 2010, 3, 65–89. [Google Scholar]
  20. Al-Khafajiy, M.; Webster, L.; Baker, T.; Waraich, A. Towards fog driven IoT healthcare. In Proceedings of the 2nd International Conference on Future Networks and Distributed Systems, Amman, Jordan, 26–27 June 2018; Volume 9, p. 9. [Google Scholar]
  21. Shahzad, A.; Lee, Y.S.; Lee, M.; Kim, Y.-G.; Xiong, N.N. Real-Time Cloud-Based Health Tracking and Monitoring System in Designed Boundary for Cardiology Patients. J. Sens. 2018, 2018, 1–15. [Google Scholar] [CrossRef]
  22. Domingo-Ferrer, J.; Soria-Comas, J. From t-closeness to differential privacy and vice versa in data anonymization. Knowl. Based Syst. 2015, 74, 151–158. [Google Scholar] [CrossRef][Green Version]
  23. Dwork, C. Differential privacy. In International Colloquium on Automata, Languages, and Programming; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1–12. [Google Scholar]
  24. Fung, B.C.; Wang, K.; Chen, R.; Yu, P.S. Privacy-preserving data publishing. ACM Comput. Surv. 2010, 42, 1–53. [Google Scholar] [CrossRef]
  25. Xu, Y.; Ma, T.; Tang, M.; Tian, W. A Survey of Privacy Preserving Data Publishing using Generalization and Suppression. Appl. Math. Inf. Sci. 2014, 8, 1103–1116. [Google Scholar] [CrossRef][Green Version]
  26. Torra, V. Transparency in Microaggregation; UNECE: Skovde, Sweden, 2015; pp. 1–8. Available online: (accessed on 25 August 2019).
  27. Panackal, J.J.; S.Pillai, A. Adaptive Utility-based Anonymization Model: Performance Evaluation on Big Data Sets. Procedia Comput. Sci. 2015, 50, 347–352. [Google Scholar] [CrossRef][Green Version]
  28. Rahimi, M.; Bateni, M.; Mohammadinejad, H. Extended K-Anonymity Model for Privacy Preserving on Micro Data. Int. J. Comput. Netw. Inf. Secur. 2015, 7, 42–51.
  29. Sowmiyaa, P.; Tamilarasu, P.; Kavitha, S.; Rekha, A.; Krishna, G.R. Privacy Preservation for Microdata by Using k-Anonymity Algorithm. Int. J. Adv. Res. Comput. Commun. Eng. 2015, 4, 373–375.
  30. Wong, R.C.-W.; Li, J.; Fu, A.W.-C.; Wang, K. (α,k)-Anonymity: An enhanced k-anonymity model for privacy preserving data publishing. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006; pp. 754–759.
  31. Zhang, Q.; Koudas, N.; Srivastava, D.; Yu, T. Aggregate Query Answering on Anonymized Tables. In Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering, Istanbul, Turkey, 17–20 April 2007; pp. 116–125.
  32. Li, J.; Tao, Y.; Xiao, X. Preservation of proximity privacy in publishing numerical sensitive data. In Proceedings of the 2008 ACM SIGMOD International Conference, Vancouver, BC, Canada, 9–12 June 2008; pp. 473–486.
  33. Xiao, X.; Tao, Y. Personalized privacy preservation. In Proceedings of the 2006 ACM SIGMOD International Conference, Chicago, IL, USA, 27–29 June 2006; p. 229.
  34. Christen, P.; Vatsalan, D.; Fu, Z. Advanced Record Linkage Methods and Privacy Aspects for Population Reconstruction—A Survey and Case Studies. In Population Reconstruction; Springer: Berlin, Germany, 2015; pp. 87–110.
  35. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
  36. Rubner, Y.; Tomasi, C.; Guibas, L.J. The Earth Mover’s Distance as a Metric for Image Retrieval. Int. J. Comput. Vis. 2000, 40, 99–121.
  37. Ali, M.; Malik, S.U.R.; Khan, S.U. DaSCE: Data Security for Cloud Environment with Semi-Trusted Third Party. IEEE Trans. Cloud Comput. 2015, 5, 642–655.
  38. Bayardo, R.J.; Agrawal, R. Data Privacy through Optimal k-Anonymization. In Proceedings of the 21st International Conference on Data Engineering (ICDE’05), Tokyo, Japan, 5–8 April 2005; pp. 217–228.
  39. LeFevre, K.; DeWitt, D.; Ramakrishnan, R. Mondrian Multidimensional K-Anonymity. In Proceedings of the 22nd International Conference on Data Engineering, Atlanta, GA, USA, 3–8 April 2006; p. 25.
  40. Gong, Q.; Luo, J.; Yang, M.; Ni, W.; Li, X.-B. Anonymizing 1:M microdata with high utility. Knowl. Based Syst. 2016, 115, 15–26.
  41. Wang, R.; Zhu, Y.; Chen, T.-S.; Chang, C.-C. Privacy-Preserving Algorithms for Multiple Sensitive Attributes Satisfying t-Closeness. J. Comput. Sci. Technol. 2018, 33, 1231–1242.
  42. Anjum, A.; Ahmad, N.; Malik, S.U.R.; Zubair, S.; Shahzad, B. An efficient approach for publishing microdata for multiple sensitive attributes. J. Supercomput. 2018, 74, 5127–5155.
  43. Khan, R.; Tao, X.; Anjum, A.; Sajjad, H.; Malik, S.U.R.; Khan, A.; Amiri, F. Privacy Preserving for Multiple Sensitive Attributes against Fingerprint Correlation Attack Satisfying c-Diversity. Wirel. Commun. Mob. Comput. 2020, 2020, 1–18.
  44. Zhu, H.; Liang, H.B.; Zhao, L.; Peng, D.Y.; Xiong, L. τ-Safe (l,k)-Diversity Privacy Model for Sequential Publication with High Utility. IEEE Access 2019, 7, 687–701.
Figure 1. HLPN for p+-sensitive k-anonymity attack model.
Figure 2. HLPN for θ-sensitive k-anonymity.
Figure 3. Discernibility penalty (DSP) score.
Figure 4. The ratio of CAVG.
Figure 5. The number of noise tuples added against each k.
Figure 6. (a) Query error for k. (b) Query error for selectivity.
Figure 7. Algorithm execution time.
Table 1. a. Original microdata. b. 2-Anonymous microdata.

(a)
ID | Name   | Age | Zip Code | Country | Disease
9  | YIN LI | 40  | 14243    | China   | Flu

(b)
ID | Age   | Zip Code    | Country | Disease
5  | >= 40 | 14054-14063 | America | Hepatitis
6  | >= 40 | 14054-14063 | America | Obesity
7  | >= 40 | 13073-14066 | Asia    | Asthma
8  | >= 40 | 13073-14066 | Asia    | Phthisis
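Table 1b is obtained by generalizing the quasi-identifiers Age and Zip Code until every record in a group of k shares the same values. A minimal sketch of this range generalization (the record layout and the `generalize` helper are illustrative assumptions, not the paper's implementation):

```python
def generalize(group):
    """Replace exact Age and Zip Code values with the group's ranges,
    so all records in the group share identical quasi-identifiers."""
    ages = [r["age"] for r in group]
    zips = [r["zip"] for r in group]
    age_gen = f"{min(ages)}-{max(ages)}"          # the paper uses bounds like ">= 40"
    zip_gen = f"{min(zips):05d}-{max(zips):05d}"  # zip range, e.g., 14054-14063
    return [{"age": age_gen, "zip": zip_gen,
             "country": r["country"], "disease": r["disease"]}
            for r in group]
```

Applied to records 5 and 6 of Table 1b's source data, both output records would carry Age "45-52"-style ranges and the Zip Code range 14054-14063, leaving only the sensitive Disease value distinct.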
Table 2. Summary of notations used.

Notation   | Description
MT         | Microdata Table
MMT        | Micro Mask Table
A          | Attributes in MT
PD         | Published Data
ECs        | Set of equivalence classes
EC_i       | k-anonymous group of tuples with the combination of A_i^qi and A^s
EC_c       | Equivalence class current
EC_b       | Equivalence class broken
V_ECi      | Variance for EC_i
MS_n       | Max frequency of A_i^s in an EC_n
MS_n-1     | Max frequency of A_i^s in an EC_n-1
MS_c       | Max frequency of A_i^s in an EC_c
MS_b       | Max frequency of A_i^s in an EC_b
A_i^qi     | Quasi-identifier for ith end user
A^s        | Sensitive attributes
A^id       | Identifier attribute
A_ECc^s    | Sensitive value in an EC_c
A_ECn^s    | Sensitive value in an EC_n
A_ECn-1^s  | Sensitive value in an EC_n-1
A_ECb^s    | Sensitive value in an EC_b
N          | Noise
M          | Total number of records in an EC
P          | Places used in formal modeling
φ          | Data types in formal modeling
G_i^qi     | QI-group at index i
Table 3. Category table.

Category ID | Sensitive Values
1           | HIV, Cancer
2           | Hepatitis, Phthisis
3           | Asthma, Obesity
4           | Indigestion, Flu
Table 4. a. 2+-Sensitive 4-Anonymous. b. (3,1)-Sensitive 4-Anonymous.

(a)
ECs | ID | Age   | Zip Code    | Country | Disease
EC1 | 1  | =< 40 | 14204-14247 | America | HIV
EC1 | 2  | =< 40 | 14204-14247 | America | Cancer
EC1 | 3  | =< 40 | 14204-14247 | America | Flu
EC1 | 4  | =< 40 | 14204-14247 | America | Indigestion
EC2 | 5  | >= 40 | 13073-14066 | ****    | Hepatitis
EC2 | 6  | >= 40 | 13073-14066 | ****    | Phthisis
EC2 | 7  | >= 40 | 13073-14066 | ****    | Asthma
EC2 | 8  | >= 40 | 13073-14066 | ****    | Obesity
EC3 | 9  | =< 40 | 14203-14247 | ****    | HIV
EC3 | 10 | =< 40 | 14203-14247 | ****    | Cancer
EC3 | 11 | =< 40 | 14203-14247 | ****    | Flu
EC3 | 12 | =< 40 | 14203-14247 | ****    | Flu

(b)
ID | Age   | Zip Code    | Country | Disease
1  | =< 40 | 14205-14247 | ****    | HIV
2  | =< 40 | 14205-14247 | ****    | HIV
3  | =< 40 | 14205-14247 | ****    | Cancer
4  | =< 40 | 14205-14247 | ****    | Flu
5  | >= 40 | 13073-14066 | ****    | Hepatitis
6  | >= 40 | 13073-14066 | ****    | Phthisis
7  | >= 40 | 13073-14066 | ****    | Asthma
8  | >= 40 | 13073-14066 | ****    | Obesity
9  | =< 40 | 14203-14247 | America | Cancer
10 | =< 40 | 14203-14247 | America | Flu
11 | =< 40 | 14203-14247 | America | Flu
12 | =< 40 | 14203-14247 | America | Indigestion
Table 5. Variance calculation for different equivalence classes (ECs) in Table 4a.

Sensitive Value | x | f | x² | fx | fx²
Obesity         | 4 | 1 | 16 | 4  | 16

First EC:  N = Σf = 4, Σfx = 10, Σfx² = 30
Variance (σ²) = Σfx²/N - (Σfx/N)² = 30/4 - (10/4)² = 1.25

Second EC: N = Σf = 4, Σfx = 7, Σfx² = 15
Variance (σ²) = Σfx²/N - (Σfx/N)² = 15/4 - (7/4)² = 0.69
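The variance computation of Table 5 can be reproduced with a short script. The disease-to-category mapping follows Table 3; the `ec_variance` helper and the record layout are assumptions for illustration:

```python
from collections import Counter

# Category IDs taken from Table 3.
CATEGORY = {"HIV": 1, "Cancer": 1, "Hepatitis": 2, "Phthisis": 2,
            "Asthma": 3, "Obesity": 3, "Indigestion": 4, "Flu": 4}

def ec_variance(diseases):
    """Population variance of the sensitive-value categories in one EC:
    sigma^2 = (sum f*x^2)/N - ((sum f*x)/N)^2, where x is a category ID
    and f its frequency within the class."""
    freq = Counter(CATEGORY[d] for d in diseases)
    n = sum(freq.values())
    sum_fx = sum(f * x for x, f in freq.items())
    sum_fx2 = sum(f * x * x for x, f in freq.items())
    return sum_fx2 / n - (sum_fx / n) ** 2
```

An EC covering all four categories gives 30/4 - (10/4)² = 1.25, matching the first column of Table 5; one covering categories {1, 1, 2, 3} gives 15/4 - (7/4)² ≈ 0.69, matching the second.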
Table 6. Types used in high-level Petri nets (HLPN) for p+-sensitive k-anonymity.

Data Type | Description
k         | User input for k-anonymity
p         | p-sensitivity numeric value
C         | Distinct categories set
Condition | Boolean value 1 or 0
Sn        | Total distinct A^s values
Cn        | Total distinct categories
A_i^si    | Sensitive attribute for ith end user
A_i^id    | Identifier attribute for ith end user
Table 7. Data types, places, and their mapping.

Place            | Mapping
φ(MT)            | ℙ(A^qi × A^s × A^id)
φ(MMT)           | ℙ(A^qi × A^s × k)
φ(KLevel)        | ℙ(k)
φ(CondTF)        | ℙ(Condition)
φ(Gi)            | ℙ(A^qi × A^s × k)
φ(ds)            | ℙ(A^s)
φ(CountDs)       | ℙ(Sn)
φ(Gi)            | ℙ(A^qi × A^s × k × C)
φ(PLevel)        | ℙ(p)
φ(CompC)         | ℙ(Cn)
φ(Publish Data)  | ℙ(A^qi × A^s)
φ(BK)            | ℙ(A^id × A^qi)
φ(SA Disc)       | ℙ(A_i^qi × A_i^si × A_i^id)
Table 8. a. θ-sensitive 4-anonymous (with noise). b. θ-sensitive 4-anonymous (without noise).

(a)
ID | Age   | Zip Code    | Country | Disease
1  | =< 40 | 14054-14247 | America | HIV
2  | =< 40 | 14054-14247 | America | Cancer
3  | =< 40 | 14054-14247 | America | Hepatitis
4  | =< 40 | 14054-14247 | America | Obesity
5  | >= 40 | 13073-14243 | Asia    | HIV
6  | >= 40 | 13073-14243 | Asia    | Phthisis
7  | >= 40 | 13073-14243 | Asia    | Asthma
8  | >= 40 | 13073-14243 | Asia    | Flu
9  | =< 40 | 14063-14247 | America | Cancer
10 | =< 40 | 14063-14247 | America | Flu

(b)
ID | Age   | Zip Code    | Country | Disease
1  | =< 40 | 14054-14247 | America | Hepatitis
2  | =< 40 | 14054-14247 | America | HIV
3  | =< 40 | 14054-14247 | America | Cancer
4  | =< 40 | 14054-14247 | America | Flu
5  | >= 40 | 13073-14243 | Asia    | HIV
6  | >= 40 | 13073-14243 | Asia    | Phthisis
7  | >= 40 | 13073-14243 | Asia    | Asthma
8  | >= 40 | 13073-14243 | Asia    | Flu
9  | =< 40 | 14063-14247 | America | Cancer
10 | =< 40 | 14063-14247 | America | Obesity
11 | =< 40 | 14063-14247 | America | Flu
12 | =< 40 | 14063-14247 | America | Indigestion
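The equivalence classes in Table 8 are accepted only when their sensitive values are sufficiently diverse. A hedged sketch of such an acceptance test, assuming the check is simply that the categorical variance of an EC must reach the threshold θ (the rule's exact form, the `satisfies_theta` name, and the category map are illustrative assumptions):

```python
from statistics import pvariance

# Category IDs taken from Table 3.
CATEGORY = {"HIV": 1, "Cancer": 1, "Hepatitis": 2, "Phthisis": 2,
            "Asthma": 3, "Obesity": 3, "Indigestion": 4, "Flu": 4}

def satisfies_theta(diseases, theta):
    """True if the population variance of this EC's sensitive-value
    categories meets the threshold theta (assumed acceptance rule)."""
    return pvariance([CATEGORY[d] for d in diseases]) >= theta
```

For example, the second EC of Table 8b (HIV, Phthisis, Asthma, Flu) covers all four categories and would pass a threshold such as θ = 1, whereas a class holding only category-4 diseases has zero variance and would fail, forcing the record swapping or noise addition the model prescribes.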
Table 9. Types used in HLPN for θ-sensitive k-anonymity.

Data Type   | Description
M           | Size of an EC
Condition   | Boolean value 1 or 0
σ           | A float type value to define Sigma
µ           | A float type value to define Mu
θ           | A float type value to define Theta
Found EC_b  | Equivalence class b when it is found
AdjEC_c     | Adjust equivalence class c
AdjEC_n     | Adjust equivalence class n
VarEC_s     | Variance of different equivalence classes
VarAdjEC_n  | Adjust variance for equivalence class n
VarAdjEC_c  | Adjust variance for equivalence class c
Table 10. Mapping of data types in θ-sensitive k-anonymity model.

Place           | Mapping
φ(MT)           | ℙ(A^id × A^qi × A^s)
φ(MMT)          | ℙ(EC_c × EC_b × EC_n × k)
φ(KValue)       | ℙ(k)
φ(CondTF)       | ℙ(Condition)
φ(Sigma)        | ℙ(σ)
φ(Mu)           | ℙ(µ)
φ(Theta)        | ℙ(θ)
φ(Found EC_b)   | ℙ(EC_b)
φ(VarEC_s)      | ℙ(V_ECc × V_ECb × V_ECn)
φ(AdjEC_c)      | ℙ(EC_c)
φ(AdjEC_n)      | ℙ(EC_n)
φ(StrictEC_n-1) | ℙ(EC_n-1)
φ(VarAdjEC_n)   | ℙ(V_ECn)
φ(VarAdjEC_c)   | ℙ(V_ECc)
φ(Need Noise)   | ℙ(V_ECc × A^id × A^qi × A^s)
φ(PublshdData)  | ℙ(A^qi × A^s)
φ(BK)           | ℙ(A^id × A^qi)
φ(SA Disc)      | ℙ(A_i^qi × A_i^si × A_i^id)
Table 11. DCP experiment values for each k.

Metric                                             | Baseline | θ-sensitive | p+-sensitive
Average val.                                       | 1761650  | 1761697.2   | 2059994.7
Diff. of θ and p+ avg. values with base avg. value | --       | 47.2        | 298344.7
Percent closer to baseline                         | --       | 0.002679    | 14.65
% diff. between θ and p+                           | --       | 14.64       | --
This means that our proposed θ-sensitive k-anonymity approach is 14.64% better than p+-sensitive k-anonymity and lies within 0.002679% of the baseline.

Khan, R.; Tao, X.; Anjum, A.; Kanwal, T.; Malik, S.u.R.; Khan, A.; Rehman, W.u.; Maple, C. θ-Sensitive k-Anonymity: An Anonymization Model for IoT based Electronic Health Records. Electronics 2020, 9, 716.