θ-Sensitive k-Anonymity: An Anonymization Model for IoT based Electronic Health Records

Razaullah Khan; Xiaofeng Tao; Adeel Anjum; Tehsin Kanwal; Saif ur Rehman Malik; Abid Khan; Waheed ur Rehman; Carsten Maple

doi:10.3390/electronics9050716

,

and

¹

National Engineering Laboratory for Mobile Network Technologies, Beijing University of Posts and Telecommunications, Beijing 100876, China

²

Department of Computer Science, COMSATS University Islamabad, Islamabad 45550, Pakistan

³

Cybernetica AS Estonia, Tallinn 13412, Estonia

⁴

Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, UK

Electronics2020, 9(5), 716;https://doi.org/10.3390/electronics9050716

This article belongs to the Special Issue Cyber Security for Internet of Things

Version Notes

Order Reprints

Abstract

The Internet of Things (IoT) is an exponentially growing emerging technology, which is implemented in the digitization of Electronic Health Records (EHR). The application of IoT is used to collect the patient’s data and the data holders and then to publish these data. However, the data collected through the IoT-based devices are vulnerable to information leakage and are a potential privacy threat. Therefore, there is a need to implement privacy protection methods to prevent individual record identification in EHR. Significant research contributions exist e.g., p⁺-sensitive k-anonymity and balanced p⁺-sensitive k-anonymity for implementing privacy protection in EHR. However, these models have certain privacy vulnerabilities, which are identified in this paper with two new types of attack: the sensitive variance attack and categorical similarity attack. A mitigation solution, the

θ

-sensitive k-anonymity privacy model, is proposed to prevent the mentioned attacks. The proposed model works effectively for all k-anonymous size groups and can prevent sensitive variance, categorical similarity, and homogeneity attacks by creating more diverse k-anonymous groups. Furthermore, we formally modeled and analyzed the base and the proposed privacy models to show the invalidation of the base and applicability of the proposed work. Experiments show that our proposed model outperforms the others in terms of privacy security (14.64%).

Keywords:

Internet of Things; big data; electronic health records; k-anonymity; privacy; security

1. Introduction

The current highly-connected technological society generates a huge amount of digital data—termed Big Data, collected through internet-enabled devices, termed the Internet of Things (IoT) []. Billions of these IoT devices sense and collect the data e.g., the patient’s Electronic Health Records (EHR) [,,,]. The collected data are then shared with corporate or government bodies for research and policymaking. However, the privacy of the individual records is an important goal when sharing data that is collected through the IoT enabled devices [,,,,,]. This is because these data contain names or some unique identification (explicit identifiers—

A^{ei}

), such as age, gender, zip code (quasi-identifiers—

A^{qi}

), and some health-related private information (sensitive attributes—

A^{s}

) [,,,,,]. To preserve privacy, eliminating the

A^{ei}

before sharing or publishing the data is not enough []. For an attacker or an adversary, the quasi-identifiers (QIs) are the partial identifiers that can be used to link to some externally available data e.g., voting or census data, to identify an individual

A^{s}

, known as a linking attack [,,].

To implement data privacy, a lot of cryptographic techniques [,] have been proposed. However, these techniques have high computational overheads. Another simple approach is data anonymization. Data anonymization is about concealing an individual’s identity in a small crowd of records before data publishing. The publishing of such anonymized records are known as Privacy Preserving Data Publishing (PPDP) []. A plethora of PPDP methods have been proposed [,,,,,,,,,,]. These techniques are broadly classified into:

Identity disclosure prevention: Generalizing [,,] the QI values of a group of records from more specific values to less specific values e.g., k-anonymity [,], where every record should be indistinguishable from at least k-1 other records. An individual having probability higher than 1/k cannot be re-identified by an intruder/attacker.
Attribute disclosure prevention: Preventing to reveal private information ( $A^{s}$ information) about an individual. Examples are l-diversity [] and t-closeness [] privacy models.

In this paper, a variance-based privacy model is proposed to prevent attribute disclosure risk. For sensitive attribute privacy, the p+-sensitive k-anonymity, (p, α)-sensitive k-anonymity [] privacy model is a state-of-the-art privacy model where the sensitive values are categorized into four categories. For creating a k-anonymous group of records called an equivalence class (EC), a l-diversity [] is applied. However, two new possible attacks are applied: sensitive variance attack and categorical similarity attack. These attacks breach the privacy of the p+-sensitive k-anonymity and (p, α)-sensitive k-anonymity [] algorithm, due to the

A^{s}

values from the single sensitive category or a low diversity at

A^{s}

category level. The proposed mitigation solution: the

θ

-Sensitive k-anonymity privacy model, is a numerical measure of privacy strength for thwarting the attribute disclosure risk. The proposed approach also appends small amounts of noise tuple(s) to increase the variability in an EC, if needed. To minimize the utility loss, the proposed algorithm uses a bottom-up generalization (i.e., the local recoding mechanism []) because it minimally distorts the data compared to global recoding techniques []. The following section presents the motivation of our work.

1.1. Motivation

Broadly examining the PPDP models for preventing attribute disclosure risk [,,,,], it was concluded that the worthiness of each model exists in the diversity of an EC where the sensitive values belongs to different categories. Such variability of

A^{s}

values creates a diverse EC. Different privacy models employ different techniques to achieve variability in k-anonymous ECs. The repeated frequencies of same sensitive values are the only obstruct in achieving the required diversity in an EC. The privacy models in [] and [] provide a meaningful approach in dealing with the attribute disclosure problem, however the following limitations have been observed.

p⁺-sensitive k-anonymity and (p, α)-sensitive k-anonymity []: This model is a modified version of the p-sensitive k-anonymity [], for preventing a similarity attack. However, the p⁺-sensitive k-anonymity and (p, α)-sensitive k-anonymity models have zero diversity at the $A^{s}$ category level, which may lead to a categorical similarity attack. A more powerful possible attack by an adversary is the sensitive variance attack, due to the low variability at $A^{s}$ category level. With an upsurge in the adversary’s knowledge (background knowledge—BK) the privacy level can be breached, which may cause attribute disclosures. The proposed $θ$ -sensitive k-anonymity privacy model provides a privacy solution to prevent all such attacks.
Balanced p⁺-sensitive k-anonymity and (p, α)-sensitive k-anonymity []: This model is an enhanced version of p⁺-sensitive k-anonymity model. It balances the categorical level sensitive attributes in each EC. However, it still has low diversity at ${the A}^{s}$ category level and works only for more than three k-anonymous size ECs.

To solve the problems of homogeneity, categorical similarity, and sensitive variance attacks in the p⁺-sensitive k-anonymity and (p, α)-sensitive k-anonymity model [], we propose the

θ

-Sensitive k-anonymity privacy model in this paper. The categorical level similarity and small EC size problems in the balanced p⁺-sensitive k-anonymity and (p, α)-sensitive k-anonymity model [] are also addressed by achieving a more balanced and diverse EC even at the category level and its execution on small k size EC, i.e., k = 2.

1.2. Contributions

The proposed

θ

-sensitive k-anonymity privacy model multiplies variance (

σ^{2}

) of a fully diverse EC with an observed value (observation 1) which produces a threshold value

θ

. The

θ

value ensures prevention against attribute disclosure in an EC which collectively results in the privacy of the given dataset.

The contributions of this paper are as follows:

A new $θ$ -sensitive k-anonymity privacy model is proposed where privacy in an EC is achieved through a threshold value, i.e., $θ$ . The $θ$ value for an EC is obtained by multiplying variance and an observation value. The variance-based diversity in an EC prevents the sensitive variance attack, which automatically prevents the categorical similarity attack. In the proposed model, the $A^{s}$ values checking is not only performed with next ECs, but a cross check is also performed during the last EC. If the required privacy is not achievable with the existing $A^{s}$ values, then a noise is added for the required diversity.
We formally modeled and analyzed the base model in [] and the proposed $θ$ -sensitive k-anonymity privacy model using High Level Petri-Nets (HLPN).
Based on the above points, simulation results show that our proposed $θ$ -sensitive k-anonymity model has only 0.002679% higher privacy leakage than its counterpart p⁺-sensitive k-anonymity model which has 14.65% higher privacy leakage with the base line privacy.

Paper Organization. The remainder of the paper is organized as follows. Section 2 explains related work. Preliminaries are discussed in Section 3. The considered attacks and problem statement in p⁺-sensitive k-anonymity along with its formal analysis are presented in Section 4. Section 5 discusses the proposed

θ

-sensitive k-anonymity model and its formal analysis. In Section 6, the experiments and evaluations are provided. Section 7 concludes the paper.

2. Related Work

In this section, the literature related to the proposed privacy model is studied from various aspects. The data collected through the various IoT enabled devices [,,,,,,,] must be anonymized before publishing because of the private information contained in it. Anonymized data are published for the sake of its maximum utility without disclosing the private information of an individual. For anonymization, the privacy models can be broadly classified into semantic [,] or syntactic [,,,,,,,,,,,,] approaches. The semantic privacy models add a random amount of noise for preserving privacy, e.g., differential privacy models [,]. In differential privacy, the deletion or addition of an individual’s record or noise does not affect the data analysis results while preserving the privacy. Syntactic privacy models create a k-indistinguishable [] ECs. In syntactic privacy, two main privacy disclosure risks are: identity disclosure [,,,,] and attribute disclosure [,,,,]. The k-anonymity [,] is an example of preventing identity disclosure that generalizes a set of records with respect to QIs. These k-anonymous records are indistinguishable from other k-1 records in a dataset. However, k-anonymity lacks the ability to provide attribute level protection. Attribute disclosure releases the value of confidential attributes corresponding to an identified individual record. Although in l-diversity [], l distinct groups for the

A^{s}

in an EC are required. However, the skewness and similarity attacks can breach the privacy because l-well sensitive attribute groups are not always possible over the existing

A^{s}

s. Similarly for t-closeness [], the threshold for

A^{s}

and its distance distribution in an EC has low data utility, and the earth mover distance (EMD) is not an efficient prevention for attribute linkage [,].

In [] by Torra, identity and attribute disclosure were both addressed. Jose et al. [] proposed an adaptive two-step iterative anonymization approach. A privacy leakage for an attribute linkage attack was possible because of having numerous versions. An extended k-anonymity model was proposed by Rahimi et al. [] to protect identity and attribute information. However, a BK attack is possible because the publisher is unaware of the adversary’s knowledge. The k-join-anonymity model proposed by Sowmiyaa et al. [] was the same as k-anonymity, which focuses only on identity disclosure risk. The (α, k)-anonymity model proposed by Wong et al. [], used a global recoding technique, which has a high utility loss and, due to table linkage attack, it was susceptible to the disclosure of attributes.

The (k, e)-anonymization model proposed by Zhang et al. [] publishes separate tables, consisting of

A^{s}

and QI to reduce the relationship between them, and where instead of generalization, a permutation-based approach has been adopted. Although in aggregated search, not using QI-generalization is recommended for accuracy improvement. However, a probabilistic attack is possible over the

A^{s}

due to the one-time publication of the microdata. The (ɛ, m)-anonymity model [] deals with the numeric

A^{s}

, however it is limited to work for categorical

A^{s}

. Xiao et al. [] worked on personalized anonymity that uses a greedy personalized generalization approach. This model de-associated

A^{s}

and QI instead of modifying the association between them.

In Reference [], the p-sensitive k-anonymity found the closest neighbor. This model was then improved by Sun at al. [] with a top-down specialization. The generated anonymized datasets should be from at least p distinct

A^{s}

values categories for each EC. However, the developed algorithm in [] is vulnerable to privacy leakage from sensitive variance, categorical similarity, and homogeneity attacks. In this paper, these privacy limitations were mitigated using the proposed

θ

–sensitive k-anonymity algorithm. The proposed privacy model is a syntactic privacy model for preventing attribute disclosure risk, which adds a fixed amount of noise to create k-anonymous ECs.

3. Preliminaries

Let an original Microdata Table (MT

) = {EI, QI, S}

(i.e., Table 1a) be the private static data (i.e., one-time release) for a publisher to publish. The

t \in MT

is a tuple that belongs to an individual i, such that EI = {

A_{1}^{ei}

,

A_{2}^{ei}, A_{3}^{ei} \dots A_{h}^{ei}

}, QI = {

A_{1}^{qi}

,

A_{2}^{qi}, A_{3}^{qi} \dots A_{m}^{qi}

}, and S = {

A^{s}

} (this work considers only single

A^{s}

). The k-anonymized data essentially consists of

A^{qi}

and

A^{s}

, while

A^{ei} s

are removed. This is because an adversary can link the

A^{qi}

with some external information (e.g., voter or census data) to perform a record linkage attack (i.e., identity disclosure) []. However, the k-anonymous

A^{qi}

values prevent the record against the record linkage attack in an EC. For example, consider some common diseases in a 2-anonymous (Table 1b) obtained from the original microdata Table 1a. Table 2 summarizes the notations used in this paper.

Definition 1.

k-anonymity [,]: Relation R having

A^{qi}

over the schema R(A₁,A₂, …, A_n) in a masked microdata table T’ is said to be k-anonymous if and only if, for any combination

A_{i}^{qi} \times

t(

A_{in}^{qi}

) values from start to end, is greater than or equal to k in R.

iff | \forall {A_{i}^{qi} \times t (A_{in}^{qi})} | \in T' \geq k

where k is the anonymity level (as shown in Table 1b). The k-anonymity model blends the k records into at least a k-1 crowd but it does not impose any restrictions on the algorithm to sufficiently protect the individuals. Consequently, the probability of linking a victim to a specific record through

A^{qi}

s is at most 1/k.

Definition 2.

l-Diversity []: A QI block in a masked microdata table T’ having m QI-blocks

Q I_{j} (1 \leq j \leq m)

is l-diverse, if it contains more than or equal to l well significant

A^{s}

values. In an l-diverse modified microdata table T’, every QI block is l-diverse.

iff | \forall {A_{i}^{qi} \times A_{i}^{s}} | \in T' \geq l

Definition 3.

t-closeness []: An EC is considered as t-closed if the distance between the distribution of the sensitive data in a class and the distribution of sensitive data in the whole table is equal to or less than threshold t. If every EC is t-closed, the whole table is t-closed. To calculate the distance while studying the transportation problem, researchers have explored some methods [,]. However, most of them focused on the Earth Mover Distance (EMD) method [,]. The EMD(P, Q) measures the minimum cost for transforming one distribution P to another distribution Q. It depends on the amount and distance of mass moved.

Definition 4.

p-sensitive k-anonymity []: The masked microdata table T’ is p-sensitive k-anonymous if it is k-anonymous and each EC in T’ has at least p distinct

A^{s}

values.

iff | \forall {A_{i}^{qi} \times t (A_{in}^{qi})} | \in T' \geq k \land (\forall G : {A_{i}^{qi} \times A_{i}^{s}} \in T' • A_{n}^{s} \leftarrow Count (Dist (A_{i}^{s})) \geq p)

where G represents an EC that already satisfies k-anonymity and is a set of

A_{i}^{s}

and

A_{i}^{qi}

. The value of

A_{n}^{s}

must be equal to or greater than p, where

A_{n}^{s}

represents distinct

A^{s}

values in an EC.

Definition 5.

Categorical similarity attack: If an adversary knows that the l-diverse modified microdata T’ (satisfying k-anonymity and l-diversity) has sensitive values belong to the single sensitive category in an EC from a p distinct

A^{s}

categories.

Definition 6.

Sensitive variance attack: The privacy leakage in an EC due to the low variability of sensitive values from p distinct

A^{s}

categories.

Definition 7.

High-Level Petri Nets (HLPN) []: The behavior of the system with its mathematical properties are modeled specifically via HLPN. An HLPN is a combination of 7-tuples

N = (P, T, F, φ, R_{n}, L, M_{0}),

where

P

represented by circles are the set of places.

T

is the set of transitions in the system represented by rectangular boxes, such that

P \cap T = \emptyset

.

F

represents the flow relations such that

F \subseteq (P \times T) \cup (T \cup P)

.

φ

. maps places

P

to the data types.

R_{n}

represents the rules or properties for transitions that verify the correctness of the underlying system.

L

represents labels on

F

, and

M_{0}

is the initial marking.

Table 2. Summary of notations used.

The following section reviews the p⁺-sensitive k-anonymity model, to highlight its shortcomings concerning sensitive variance or an S-Variance attack.

4. Problem Statement

Definitions 8 and 9 describe the p⁺-sensitive k-anonymity and

(p, α)

-sensitive k-anonymity models [], respectively.

Definition 8.

p⁺-sensitive k-anonymity []: A masked microdata T’, fulfills k-anonymity and for each

A^{s}

value belongs to distinct categories must be equal to or greater than p for each EC in T’.

(\forall G : {A_{i}^{qi} \times A_{i}^{s}} \in G \cup C \land \forall C \in G • C_{n} \leftarrow Count (Dist (C)) \geq p)

where C depicts

A^{s}

values categorizations that already fulfill a p-sensitive k-anonymous approach.

C_{n}

represents distinct categories in Table 3 [] and must be equal to or greater than p. Table 4a obtained from Table 1a, shows p⁺-sensitive k-anonymity model in which p = 2, k = 4 and c = 2. The ECs column in Table 4a is not part of a published table.

Table 3. Category table.

Definition 9.

(p, α)

-sensitive k-anonymity []: A modified microdata table T’ that fulfills the k-anonymity property and there must be p distinct sensitive attribute values in each QI-group having a minimum weight of at least

α

.

(\forall G : {A_{i}^{qi} \times A_{i}^{s}} \in T' \land A_{n}^{s} \geq p \land w_{c} \geq α)

where

G

represents all groups in masked micro table T’ that already fulfill the p-sensitive k-anonymity property. Weight should be assigned to each category and each sensitive value p must have weight in each category i.e.,

w_{c}

that must be at least

α

. Table 4b obtained from Table 1a, shows

(p, α)

-sensitive k-anonymity.

The sensitive variance and categorical similarity attacks have minor difference concerning the variability of

A^{s}

in an EC. The sensitive variance attack is more powerful than categorical similarity attack, i.e., categorical similarity attack

\in

sensitive variance attack. Therefore, the attribute disclosure through the sensitive variance attack automatically covers the disclosures through the categorical similarity attack. The EC2 and EC3 in Table 4a obtained through the p⁺-sensitive k-anonymous approach have categorical similarity and sensitive variance attacks and are explained in Table 5. Table 5 shows the variance calculation for these ECs, where a high variance for more diverse EC2 and small variance for less diverse EC3 can be seen.

Table 5. Variance calculation for different equivalence classes (ECs) in Table 4a.

To calculate the variance of the ECs, an ordered weight is given to the

A^{s}

values in such a way that the higher the frequency (f), the lower the weight (x) will be. For example, consider EC3 in Table 4a, i.e., Flu = 2, Cancer = 1, HIV = 1. The numeric value against each sensitive value represents its frequency occurrence in EC3. If an EC, e.g., EC2 is fully diverse i.e., size 4 and 4-diverse, then the order weight will be Hepatitis = 1, Phthisis = 1, Asthma = 1, Obesity = 1. In EC2, because of having a single occurrence for each

A^{s}

value, has a higher variance than EC3.

An adversary, using the category table (Table 3), can analyse the ECs in Table 4a and Table 4b published in []. The variability in some of the ECs is low concerning the category table. Therefore, the adversary can isolate the sensitive values that belong to a specific category and hence to individual records, and thus breaches the identity of an individual.

Critical Review of p⁺-Sensitive k-Anonymity Model

We formally modeled the p⁺-sensitive k-anonymity algorithm to check its invalidation concerning a sensitive variance attack. The detail formal verification of the working of p⁺-sensitive k-anonymity privacy model along with its properties is given in [] from Rule 1 to Rule 7, which gets original data input from the end-user and processes it. The sensitive variance attack over the p⁺-sensitive k-anonymity model is shown in Figure 1, where the arrow heads show the data flow. Table 6 shows variable types and their descriptions. The places

P

and its description are shown in Table 7. The attacker model in Figure 1 consists of three entities: the end-user, the adversary, and the trusted data publisher.

Figure 1. HLPN for p⁺-sensitive k-anonymity attack model.

Table 6. Types used in high-level Petri nets (HLPN) for p+-sensitive k-anonymity.

Table 7. Data-types, places, and their mapping.

In Figure 1, Transitions

T

, which are input to the HLPN model, consist of patients’ records (original data). A trusted data publisher further processes the data to minimize an attribute disclosure risk. Generalization and removing identifying attributes transform the data into masked data. After generalization, the masked microdata table is ready to be published. An adversary then exploits the published data for its benefits.

In this paper, the first seven rules in [] are outlined briefly. For input k, the data publisher processes the original data to perform data generalization via the

Generalize ()

function and each EC is stored at place the micro mask table (MMT); The publisher confirms the k-anonymity condition. If successful,

C o n d i t i o n

variable is set to

t r u e

. For each EC, the

Dist ()

function calculates the distinct

A^{s}

values and stores its count at place

ds

. To further process the array of t

A^{s}

, the

Count ()

function counts the

S_{n}

and stores it at place

Count Ds

. Before the calculation of

C_{n}

, p-sensitive k-anonymity is verified in masked data. Transition

C h e c k P K

checks at least p distinct

A^{s}

values in each EC in the whole table.

P L e v e l

stores the input transition p value for comparison. Apart from the checking condition for k-anonymity, another checking for p value is done. If it returns

t r u e

it means the data already fulfills k-anonymity. This concludes a successful transition, ensures the p-sensitive k-anonymous property. Next, computing

A^{s}

values categories using function

Get_Cat ()

. Both

A^{s}

values and categories are stored at place

Gi'

for further processing. Actual improvement to the prior model and source for p+-sensitive k-anonymity is the transition

C h e c k P P K

. Distinct categories are calculated in a column, using the sensitive values.

Comp C

stores this ‘number’ of distinct categories. The

C_{n}

involved in each EC is checked with transition

C h e c k P P K

to confirm that there must be at least p distinct categories. The minimum value for p is 2. The p+-sensitive k-anonymity properties are fulfilled if the

c o n d i t i o n

variable returns

t r u e

.

The p⁺-sensitive k-anonymity model is highly vulnerable against a sensitive variance attack. The main reason is the existence of non-diverse (low variance)

A^{s}

values similar to ‘Flu’ in Table 4a and Table 4b, and ‘HIV’ in Table 4b. In Rule (1) through function

S

-

Variance_Attck ()

, an adversary performs an attack on the released data using some external source of information, i.e., BK. In Rule (1):

\begin{array}{l} R (Attack) = \forall i 40 \in x 40, i 42 \in x 42, i 43 \in x 43, \forall i 2 \in x 2 | \\ S - Variance_Attck (i 40 [2], i 42 [2]) \to i 43 [2] = i 2 [1] \land i 43 [2] = i 2 [3] \end{array}

The adversary takes the union of the published data with the external information and BK to plot an EC. In this way, specific individuals correspond to some specific ECs that belong to homogenous categories and hence sensitive values from a specific category disclose an individual. Therefore, a sensitive variance attack occurs due to low variance in corresponding ECs.

5. The Proposed $θ$ -Sensitive k-Anonymity Privacy Model

5.1. Threshold $θ$ -Sensitivity

The goal of the proposed

θ

-Sensitive k-anonymity privacy model is to prevent the attribute disclosure of the individual records in MT, collected through the IoT [,,,,] enabled devices. Each EC in MT must satisfy the threshold

θ

value. The

θ

-Sensitivity, is the product of variance (

σ^{2}

) and Observation 1 (µ) as shown in Equation (1).

θ = Variance of a fully diverse EC (σ^{2}) * Observation 1 (μ)

(1)

The variance value represents the diversity in an EC. High variance means high diversity in an EC and vice versa, since achieving 100% diversity is almost impossible in all cases. However, the variance-based optimal frequency distribution of

A^{s}

values with some fixed amount of noise addition achieves an enhanced data privacy in an EC. The proposed method in this paper is simple and effective. During examining each EC, if the variance of an EC is greater than

θ

i.e., fully diverse, the next EC is examined. Otherwise, the variance for the same EC is increased by swapping the

A^{s}

values from the successor ECs or by adding some noise records, to make it above

θ

. Because of the required noise addition, our proposed model implies

ε

-differential privacy [,] but the proposed approach is a syntactic anonymization [] approach.

5.1.1. Variance ( $σ^{2}$ )

The variance calculation in Table 5 for ECs depicts the variability in a numerical form. To standardize the

θ

value for different size ECs, to prevent the sensitive variance attack, initially, we consider a fully diverse EC, e.g., if EC size = 2 variance = 0.25, if EC size = 3 variance = 0.67, if EC size = 4 variance = 1.25, if EC size = 5 variance = 2, and so on, then multiplying the variance with an observed value from Observation 1 (

μ

).

5.1.2. Observation 1 ( $μ$ )

A decimal multiplied part: Observation 1 (µ), for getting

θ

, the threshold value has full control over the EC diversity. During the simulation in Python, different values for µ were checked to get a suitable

θ

value. After executing the dataset for different k size ECs, the values of µ in the range of 0.5 to 0.9 were concluded. A smaller observed µ value results in the frequent repetition of sensitive values in an EC, and higher observed value produces a more diverse EC. However, “what observed value should be chosen for different size ECs?”, is explained below.

Consider again, the 2⁺-Sensitive 4-anonymous Table 4a, EC2 variance = 1.25, and EC3 variance = 0.69. The difference is because of the duplicated sensitive value i.e., Flu, in EC3. We propose an efficient way of removing the frequency repetition of sensitive values to achieve a more diverse EC. For this, we calculated the

θ

value. For example, consider a fully diverse EC of size 4 with variance = 1.25 and multiply it with an observed value, ranges between 0.5 and 0.9. Since, 1.25 * 0.5 = 0.625 is less than 0.69 and 1.25 * 0.6 = 0.75, which is greater than 0.69. The difference between the two values i.e., 1.25 and 0.69, is because of only one duplicated value “Flu”. Thus, it depends on privacy requirements and the level of diversity we are interested to achieve. In this paper, we perform a very strict

θ

calculation to get fully diverse ECs. Therefore, for example in the implementation part of the proposed Algorithm 1, we multiply a variance of 4 size EC with an observed value µ= 0.6 to have a fully diverse EC. The same technique is applied to all other ECs as well. The

θ

obtained in this way in line 8 of the proposed Algorithm 1 in Section 5.2, is then checked in the conditional part at line 10 inside a loop to check all ECs concerning

θ

requirements.

Definition 10.

θ

-Sensitive k-anonymity: The modified microdata table T’ fulfills

θ

-sensitive k-anonymity, if it fulfills k-anonymity and for each EC in T’, the variance for each EC must be at least

θ

.

iff | \forall {A_{i}^{qi} \times t (A_{in}^{qi})} | \in T' \geq k \land (\forall G : {A_{i}^{qi} \times A_{i}^{s}} \in T' • A_{n}^{s} \leftarrow Count (Dist (A_{i}^{s})) \geq θ)

where G represents a QI-group or EC that already satisfies k-anonymity and is a set of

A_{i}^{s}

and

A_{i}^{qi}

. The value of

A_{n}^{s}

must be equal to or greater than p, where

A_{n}^{s}

is the number of distinct sensitive values in a QI-group. The proposed

θ

-sensitive k-anonymity model produces the anonymized Table 8a (with noise) from the original microdata Table 1a and Table 8b (without noise) from Table 4b. The

A^{qi}

values in Table 8a and Table 8b are generalized through local recoding (bottom-up generalization) which improves the utility of the anonymized data. The 4-diverse ECs in Table 8a and Table 8b have sensitive values from a minimum of three different sensitive categories in Table 3. Therefore, these tables have more attribute privacy and are more protected from a sensitive variance attack.

5.2. The Proposed $θ$ -Sensitive k-Anonymity Algorithm

The proposed

θ

-sensitive k-anonymity algorithm starts execution by checking the k size to create an EC (minimum cardinality k = 2), at line 3. The algorithm can be executed on different size of k. However, if the minimum cardinality fails, the condition becomes

f a l s e

and jumps to line 50. If it is

t r u e

, the

f o r

loop works from line 5 to 7, to calculate the variance for each m size

G_{i}^{qi}

or

{EC}_{i}

that belongs to k-anonymous ECs and assigns them to an array i.e.,

V_{{EC}_{i}}

.

Line 8 multiplies an average observed value

μ

and variance

σ^{2}

for an EC to get a threshold

θ

(i.e., Equation (1)). This

θ

value ensures the maximum level of diversity in an EC.

θ

mainly depends on

μ

. If

μ

is smaller for an EC, low diverse EC will be obtained and vice versa. What level of diversity we want to have in an EC is completely controlled by

μ

. Deeply observing l-diversity [] and t-closeness [] and performing experiments while executing the algorithm in Python, the

μ

value is kept to achieve maximum diversity. The algorithm starts working from lines 9–49, which checks the obtained variances against user input k for each m size EC to

θ

. At line 10, if

V_{EC}

is greater than

θ

, line 46 is executed and the algorithm moves on to next EC. If it is less than

θ

, the current EC is named as

{EC}_{c}

, and the next index EC is named as

{EC}_{b}

. Lines 12–45, each part inside if statement has two major functionalities; swapping and require noise addition.

The else part of an if statement executes the ECs from first till

{EC}_{n - 1}

, and its first part processes the last EC (lines 13–25). At line 12, if

{EC}_{c}

is the last class, i.e.,

{EC}_{n}

, then from

A_{{EC}_{n}}^{s}

the value of

{MS}_{n}

is calculated. Similarly, the value of

{MS}_{n - 1}

is calculated from

{EC}_{n - 1}

. At line 15, a

crossCheck ()

function checks the existence of most frequent

A^{s}

that does not exist in each other ECs. The

swap ()

function may be executed. The purpose of the cross-check is not to further increase or decrease

V_{{EC}_{n - 1}}

because it has already been processed by the else part of the current if statement. This function is for the last EC to increase its diversity. If any of the

A^{s}

value from

{EC}_{n}

exists in

{EC}_{n - 1}

or vice versa, the swapping at line 17 will not be performed. If swapping is performed,

V_{{EC}_{n}}

is calculated to check with

θ

(line 20). If

V_{{EC}_{n}}

is still less than

θ

, then the algorithm jumps to line 43 to add a distinct

A^{s}

value as noise to increase its variance and to achieve a high diversity.

To process the first EC until the

{EC}_{n - 1}

, the else part of if statement executes (line 12). The algorithm finds an

{EC}_{b}

with

θ

greater than

{EC}_{c}

(lines 27–31). The if statement checks

{EC}_{b}

f

o u n d

condition, when it is satisfied, then a function

mfsv ()

is executed on both

{EC}_{c}

and

{EC}_{b}

, which calculates the most frequent sensitive values in both ECs. Before swapping the values for

{MS}_{c}

and

{MS}_{b}

, a function

backCheck ()

checks the existence of

{MS}_{b}

in

{EC}_{c}

, which is an EC ahead of

{EC}_{b}

. If the value of

{MS}_{b}

exists in

{EC}_{c}

, then that MS value is removed from a temporary array in Algorithm 1.

	Algorithm 1: $θ$ -sensitive k-anonymity
	Input: Microdata Table (MT)
	Output: $: θ - sensitive k - anonymous$ table (MMT)
1	$Procedure : θ - sensitive k - anonymity (MMT, θ, k)$
2	$Let k \subseteq MMT$
3	$i f \| k \| \geq 2$ $t h e n$
4	$Condition = true$ ;
5	$f o r e a c h m size EC in G_{i}^{qi} : {A_{i}^{qi} \times A_{i}^{s}} \in k$ do	► $G_{i}^{qi}$ set, consists of $A_{i}^{qi} & A_{i}^{s}$
6	$V_{{EC}_{i}} \leftarrow Compute vari (A_{{EC}_{i}}^{s})$	► $vari (A_{{EC}_{i}}^{s}$ ), calculate variance for each m size EC.
7	$e n d f o r$
8	$θ \leftarrow {μ * σ}^{2}$	► $θ$ , required threshold
9	$f o r e a c h {m size EC}_{i} {in G}_{i}^{qi} : {A_{i}^{qi} \times A_{i}^{s}} \in k d o$	► $G_{i}^{qi}$ set, consists of $A_{i}^{qi}$ and $A_{i}^{s}$
10	$i f V_{{EC}_{c}} < θ t h e n$
11	${EC}_{b} \leftarrow {EC}_{c} + 1$
12	$i f {EC}_{n} = {EC}_{c}$
13	${MS}_{n} \leftarrow Compute mfsv (A_{{EC}_{n}}^{s})$	►mfsv(), max frequent $A_{{EC}_{n}}^{s}$
14	${MS}_{n - 1} \leftarrow Compute mfsv (A_{{EC}_{n - 1}}^{s})$	►mfsv(),max frequent $A_{{EC}_{n - 1}}^{s}$
15	$notExist \leftarrow crossCheck ({MS}_{{EC}_{n}}, {MS}_{{EC}_{n - 1}})$	►crossCheck(), check both side existence
16	$i f notExist$
17	$swap ({MS}_{n}, {MS}_{n - 1})$	►swap(), last and 2^nd last ECs MS values
18	$e n d i f$
19	$V_{{EC}_{n}} \leftarrow Compute vari (A_{{EC}_{n}}^{s})$
20	$i f V_{{EC}_{n}} < θ$
21	$B r e a k$
22	$jump to else part of condition line 43$
23	$e l s e$
24	$B r e a k$
25	$e n d i f$
26	$e l s e$
27	$f o r {EC}_{b} {till EC}_{n} {in G}_{i}^{qi} : {A_{i}^{qi} \times A_{i}^{s}} \in K$
28	$i f V_{{EC}_{b}} > θ$
29	$B r e a k l o o p$
30	$e n d i f$
31	$B r e a k l o o p$
32	$i f {EC}_{b} = found$
33	${MS}_{c} \leftarrow Compute mfsv (A_{{EC}_{c}}^{s})$	►mfsv(), max frequency $A_{{EC}_{c}}^{s}$
34	${MS}_{b} \leftarrow Compute mfsv (A_{{EC}_{b}}^{s})$	►mfsv(), max frequency $A_{{EC}_{b}}^{s}$
35	${MS}_{b} \leftarrow backCheck ({MS}_{{EC}_{c}}, {MS}_{{EC}_{b}})$	►backCheck() find MS value in ${MS}_{{EC}_{b}}$ , not exists ${in MS}_{{EC}_{c}}$
36
37	$swap ({MS}_{c}, {MS}_{b})$	►swap(), exchange MS values
38	$V_{{EC}_{c}} \leftarrow Compute vari (A_{{EC}_{c}}^{s})$	►vari(), again compute variance
39	$i f V_{{EC}_{c}} > θ$
40	${EC}_{c} + = 1$
41	$e n d i f$
42	$e l s e$
43	$NS \leftarrow Compute addNoise (A_{{EC}_{c}}^{s})$	►addNoise(), until variance> $θ$
44	$e n d i f$
45	$e n d i f$
46	$e l s e$
47	${EC}_{c} + = 1$
48	$e n d i f$
49	$e n d f o r$
50	$e l s e$
51	$Condition = false;$
52	$e n d i f$

{MS}_{{EC}_{b}}

and next MS in same

{EC}_{b}

is checked with

{MS}_{c}

. This process continues until it finds a

A^{s}

value in

{MS}_{{EC}_{b}}

that do not exist in

{MS}_{c}

. Line 37 then swaps these two MS values along with their corresponding records. Two important purposes are achieved through this swap function. First, reducing the frequency of repeated

A^{s}

and second, increasing diversity in

{EC}_{c}

which results in increasing

V_{{EC}_{c}}

. The

V_{{EC}_{c}}

is again calculated and is checked with

θ

, if it is greater than

θ

, counter for

{EC}_{c}

moves to the next EC.

Here, the absence of the else statement adds noise instantly, in a situation when the variance is less than

θ

, because more than one swapping for a specific

{EC}_{c}

is possible. We add noise only once after completely checking the frequency of each

A^{s}

in an EC. For example, if to produce a 4-anonymous EC table from Table 1a, after one swapping e.g., ‘HIV’ swaps with ‘Obesity’, the resulting EC1 in Table 8a will become 3-diverse and its variance will not meet

θ

, the else part might add noise to increase variance even though there is a duplicated

A^{s}

‘Cancer’ value that still exists in

{EC}_{c}

. To reduce the frequency of the next duplicated

A^{s}

value i.e., ‘Cancer’, by swapping it with another

A^{s}

in

{EC}_{b}

if one exists, noise is not added at this moment. This is achieved by going control back to line 10, and since this increased variance is still less than

θ

, the procedure repeats and from an

{EC}_{b}

a new

A^{s}

is swapped with the next duplicated

A^{s}

value. In this way, two swapping procedures are performed and 2-diverse

{EC}_{c}

will become 4-diverse without adding any unnecessary noise, which results in increasing data utility and a more diverse EC.

{EC}_{b}

is found because of a variance greater than

θ,

there are chances that no EC exists in a given dataset having a higher variance than

θ,

in this case, the loop will not break (line 29). In that case, the algorithm will jump to line 43. It will add a dummy record with distinct

A^{s}

value(s) via function

addNoise ()

. Such an addition is considered as noise to the real data just like the addition of noise in differential privacy [,]. This algorithm performs very intelligent swapping and adds noise intelligently. The purpose of these two functions (i.e.,

swap ()

and

addNoise ())

, is to increase the diversity keeping the utility as high as possible, which is easily achieved in our algorithm as shown in the experimental evaluation, Section 6.

The sanitized Table 4a from p⁺-sensitive k-anonymity is prone to homogeneity, categorical similarity, and sensitive variance attacks, and Table 8a from

θ

-sensitive k-anonymity secures the data from such attacks because of more diversity, even at the category level, i.e., the maximum value for category c is 4 through

θ

-sensitive k-anonymity, where, for Table 4a, the maximum value for c is 2. Table 8a provides more protection against the categorical similarity attack. Further swapping of values is not possible in the last EC; thus, a single tuple is added as noise to increase the diversity and to prevent categorical similarity attack and sensitive variance attack. Such a small amount of noise does not highly affect the utility of the data. Table 4b is a base table to obtain Table 8b using the

θ

-sensitive k-anonymity approach. Table 8b is also highly diverse at the categorical level and there are no repeated sensitive values. Thus, there is no need to add noise and to have a high value of variance. The anonymized data, both in Table 8a and Table 8b, obtained through the proposed

θ

-sensitive k-anonymity algorithm, have no attribute disclosure risk and are defensive against homogeneity [], categorical similarity, and sensitive variance attacks, and even secure from skewness attacks [].

5.3. Analysis of $θ$ -Sensitive k-Anonymity Model Using Formal Modeling and Analysis

The proposed

θ

-sensitive k-anonymity model mitigates the vulnerability discussed in Section 4. Modeling the

θ

-sensitive k-anonymity via HLPN has the same end-user, data publisher, and unknown adversary, as shown in Figure 2. Table 9 and Table 10, respectively, show variable types and places, and their corresponding descriptions.

Figure 2. HLPN for

θ

-sensitive k-anonymity.

Table 9. Types used in HLPN for

θ

-sensitive k-anonymity.

Table 10. Mapping of data types in

θ

-sensitive k-anonymity model.

The

θ

-sensitive k-anonymity algorithm was modeled through the HLPN rules for the microdata input. The data publisher initially verifies the k-value input. The original data is k-anonymized (bottom-up generalization) after finalizing the individual records in an EC obtained through variance calculations. In Rule (2), the k-anonymity masks the data. In Rule (2):

R (MaskData) ≔ \forall i 2 \in x 2, i 3 \in x 3, i 4 \in x 4 |

i 4 [1] ≔ Mask {i 2 [2]} \land i 4 [2] ≔ Mask {i 2 [3]} \land x 4' ≔ x 4 \cup {i 4 [1], i 4 [2], i 3}

If an input k is less than the minimum size of an EC (i.e., <2) the condition fails. For cardinality having a minimum value of 2 or above, the algorithm executes. The k-anonymity for true or false are depicted in Rule (3).

\begin{array}{l} R (Check k) ≔ \forall i 5 \in x 5, i 6 \in x 6 | \\ Count (i 5 [1]) \geq i 5 [3] \to i 6 [1] ≔ TRUE \lor Count (i 5 [1]) ≱ i 5 [3] \to i 6 [1] ≔ FALSE \land x 6' ≔ x 6 \cup {i 6 [1]} \end{array}

The threshold

θ

is calculated in Rule (4). Variance for a fully diverse ECs for a specific k is calculated using the

var ()

function. The important contributed functions are

swap ()

and

addNoise ()

functions, through which the algorithm processes all ECs. Transition

A d j u s t V a r

performs all these swapping and noise additions in corresponding ECs. In Rule (5),

C o m p u t e V a r

transitions for the initial ECs. For the rest of the ECs, the same transition can be used in the same manner. In Rule (4):

\begin{array}{l} R (Calc Theta) ≔ \forall i 10 \in x 10, i 11 \in x 11, i 12 \in x 12 | \\ i 12 ≔ {i 11 * {(i 10)}^{2}} \land x 12' ≔ x 12 \cup {i 12} \end{array}

In Rule (5):

\begin{array}{l} R (Compute Var) ≔ \forall i 8 \in x 8, i 9 \in x 9 | \\ i 9 [1] ≔ Compute Var (i 8 [1]) \land i 9 [2] ≔ i 8 [2]) \land x 9' ≔ x 9 \cup {i 9 [1], i 9 [2]} \end{array}

The

θ

-sensitive k-anonymity model’s main functionalities are described in Rule (6) and Rule (7). Variance in each k-anonymous EC with respect to

θ

is checked in Rule (6). If variance of

{EC}_{c}

is greater than

θ

(i.e., (

i 14 [1] > i 13

)), move to next

{EC}_{c}

and update the value in place MMT. If the variance of

{EC}_{c}

is less than

θ

(i.e.,

(i 14 [1] < i 13)

), then transaction stops. We try to find

{EC}_{b}

, and swap required available

A^{s}

values from

{EC}_{b}

. After performing all needed swapping, if the variance of

{AdjEC}_{c}

is still less than

θ

(i.e., (

i 32 < i 13)

), the noise is added to increase its diversity. In Rule (6):

\begin{array}{l} R (Check Variance) ≔ \forall i 13 \in x 13, i 14 \in x 14, i 15 \in x 15, i 19 \in x 19, i 23 \in x 23, i 24 \in x 24, i 32 \in x 32, i 33 \in x 33 | \\ \land {((i 14 [1] > i 13) \to i 16 [1] ≔ i 15 [1] + 1 \\ \land x 16' ≔ x 16 [2] \cup {i 16}) ≔ ((i 14 [1] < i 13) \to i 16 [2] ≔ i 15 [1] + 1 \\ \land x 16' ≔ x 16 [2] \cup {i 16}) ≔ ((i 14 [2] > i 13)) \to i 19 \land x 19^{'} ∶ = x 19 \cup {i 19}) ≔ ((i 24 < i 13) \to i 25 \\ \land x 25 ≔ x 25 \cup {i 25}) ≔ ((i 32 < i 13) \to i 33 \land x 33 ≔ x 33 \cup {i 33})} \end{array}

The proposed

θ

-sensitive k-anonymity algorithm starts by processing each k size EC. The function

Comp mfsv ()

computes the max frequency of

A_{{EC}_{c}}^{s} {and A}_{{EC}_{b}}^{s}

, named as

{MS}_{{EC}_{c}}

and

{MS}_{{EC}_{b}}

, respectively. A one-way checking function:

backCheck ()

, checks for the existence of

{MS}_{{EC}_{b}}

at

{FoundEC}_{b}

that do not exist in earlier

{EC}_{c}

.

{MS}_{b}

is swapped with

{MS}_{c}

after the checking succeeds and is saved in place Adj

{EC}_{c}

.

{EC}_{c}

minimizes the frequency of the

A^{s}

value and increases diversity. While processing the last EC, i.e.,

{EC}_{n}

, swapping is not possible in the forward direction. Thus swapping with previous EC is performed with a condition that the variance of already processed

{EC}_{n - 1}

should not be decreased with

θ

. The

crossCheck ()

function confirms two-way checking, that the values for both

{MS}_{n}

and

{MS}_{n - 1}

are distinct and it should not change the variance of

{EC}_{n - 1}

at place

{StrictEC}_{n - 1}

to an undesired value again. In that case, we call it strict

{EC}_{n - 1}

. In other words, in addition to increasing the diversity in

{EC}_{n}

, it is also not increasing the frequency of

A^{s}

value at place

{EC}_{n - 1}

. Values are then swapped and are saved at place

{AdjEC}_{n}

. Rule (7) shows the whole process. In Rule (7):

R (Adjst Var) ≔ \forall i 17 \in x 17, i 20 \in x 20, i 21 \in x 21, i 28 \in x 28, i 29 \in x 29 | (i 17 [1] \neq i 17 [3]) \to Comp mfsv (i 20 [1], i 17 [1]) \land True ≔ backCheck (i 20 [1], i 17 [1]) \land i 21 ≔ swap (i 20 [1], i 17 [1]) \land x 21' ≔ x 21 \cup {i 21} (i 17 [1] = i 17 [3]) \to Comp mfsv {(i 17 [3], i 28 [1])}_{} \land True ≔ crossCheck {(i 17 [3], i 28 [1])}_{} \land i 29 ≔ swap {(i 17 [3], i 28 [1])}_{} \land x 29' ≔ x 29 \cup {i 29 [1]}

If the variance of

{AdjEC}_{c}

is still less than

θ

(i.e.,

(i 34 [1] < i 35)

), a dummy record called noise is added whenever needed throughout the variance adjustment process. In Rule (8), we have given the final noise addition case for last

{AdjEC}_{n}

. Its purpose is to increase the variance at a level greater than

θ

. It will produce a highly diverse EC even if there are not enough diverse records in MMT. In Rule (8):

R (Add Noise) ≔ \forall i 34 \in x 34, i 35 \in x 35, i 36 \in x 36 | (i 34 [1] < i 35) \to i 36 ≔ addNoise (i 34 [2], i 34 [3], i 34 [4]) \land x 36' ≔ x 36 \cup {i 36 [1], i 36 [2], i 36 [3]}

In Rule (9), an adversary attacks against the individual’s

A^{s}

values. Adversary combines the already available BK (i.e.,

i 40 [2]

) with the published data (i.e.,

i 38 [2]

) and performs attack to disclose the patient’s identity (i.e.,

i 2 [2]

) and the sensitive values (i.e.,

i 2 [3]

).

θ

-sensitive k-anonymity model can provide better privacy protection to prevent from attribute disclosure attacks because it considers the high value of variance due to swapping and noise addition in corresponding ECs. The diversity of sensitive attribute values in ECs prevents the adversarial BK and is more effective as compared to the p⁺-sensitive k-anonymity model. Therefore, the adversary did not get private information for the target individual and the attack results in a null value. In Rule (9):

R (S - Variance Attack) : = \forall i 38 \in x 38, i 40 \in x 40, i 41 \in x 41 | Att_Dis (i 38 [2], i 40 [2]) \neq (i 2 [1] \cup i 2 [2] \cup i 2 [3]) (i 41 [2] \cup i 41 [3]) = \emptyset

6. Experimental Evaluation

In this section, the experiments that were performed to show the effectiveness of the proposed

θ

-sensitive k-anonymity privacy model in comparison to the p⁺-sensitive k-anonymity model are described. The proposed algorithm wisely diversified the A^S values in a balanced way inside each EC without using the categorical approach. The utility and quality of the anonymized released data were checked with numerous quality measures.

6.1. Experimental Setup

All experiments were performed on a machine with an Intel Core i5 2.39 GHz processor with 4 GB RAM, using the Windows 10 operating system. The algorithm was written in Python 3.7. We used the Adults database, which contained age, zip code, salary, and occupation attributes, which is openly accessible at the UC Irvine Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets. We considered the age, zip code, salary as

A^{qi} s

and occupation as

A^{s}

.

Experimental results show the usefulness of the proposed

θ

-sensitive k-anonymity privacy model and protection against the categorical similarity attack and sensitive variance attack as compared to the p⁺-sensitive k-anonymity model. The quality of the sanitized publicly released data was evaluated with four utility metrics: discernibility penalty (DCP) [,,], normalized average QI-group (C_AVG) [,,], noise calculation, and query accuracy [,]. The execution time of both algorithms was analyzed at the end of the experiments.

6.2. Discernibility Penalty (DCP)

The DCP proposed in [] and used in [,] is an assignment of penalty (cost) to each tuple in the generalized data set. Through this penalty, the sanitized tuple cannot be distinguished among other tuples in the result set. Minimizing the discernibility cost is an optimal objective. The penalty for a tuple t that belongs to an EC of size |EC|, i.e.,

t ϵ EC

, will be |EC| and the penalty for each EC is

{| EC |}^{2}

. The complete DCP penalty for the overall sanitized released dataset

R^{*}

can be seen in Equation (2).

DCP (R^{*}) = \sum_{i = 1}^{| {EC} |} {| {EC}_{i} |}^{2}

(2)

where

{EC}

are the total number of ECs in

R^{*}

. A baseline can be obtained from the most optimal DCP score calculations as shown in []. For example, if k = 2 and the number of anonymized tuples are 10, the DCP optimal score will be

2^{2} + 2^{2} + 2^{2} + 2^{2} + 2^{2} = 20

. This optimal score is called the baseline. The approach to generate groups followed in this paper was based on k size, inclusive of the noise tuple(s). Higher k means bigger group size, so the baseline moves up because of a high DCP score. The p⁺-sensitive k-anonymity model generated groups based on p. It means the number of tuples can be greater than p in a k-anonymous class. Figure 3 shows the DCS score for

θ

-sensitive k-anonymity, including a comparison with p⁺-sensitive and baseline. In comparison to p⁺-sensitivity, the DCP score, through the proposed

θ

-sensitive k-anonymity algorithm, is almost equal to the baseline, which implies that the proposed model assigned an optimal penalty to each EC and produced an optimal DCP score. The magnified subplots in Figure 3 with k = 12 and k = 16 for

θ

-sensitive k-anonymity shows the very minor difference with baseline. This minor difference can also be seen in Table 11, with an average DCP score of 47.2 or 0.002679% with a baseline obtained from the simulation while calculating the DCP for the anonymized dataset

R^{*}

.

Figure 3. Discernibility penalty (DSP) score.

Table 11. DCP experiment values for each k.

6.3. Normalized Average (C_AVG)

C_AVG is another mathematically sound measurement that measures the quality of the sanitized data by the EC average size. It was proposed in [] and applied in [,]. Below in Equation (3), C_AVG can be calculated as

C_{AVG} = (\frac{| R^{*} |}{| {EC} |}) \div k

(3)

where

| R^{*} |

is the overall sanitized released dataset and

| {EC} |

are the total number of ECs in

R^{*}

. Data utility and C_AVG are inversely proportional. Low C_AVG value indicates high information utility. The optimal goal is to have a minimum size of ECs in

R^{*}

. Figure 4 shows C_AVG for p⁺-sensitive k-anonymity and

θ

-sensitive k-anonymity over k-anonymity. p⁺-sensitive has lower data utility over small k, where there is a high data utility for large k. The proposed technique has a very balanced and sustainable utility for each input value of k. Thus, the proposed

θ

-sensitive k-anonymity model performs efficiently for all sizes of k, compared to the p⁺-sensitive k-anonymity model.

Figure 4. The ratio of C_AVG.

6.4. Noise Addition

Among different masking methods, one popular approach is the perturbation of data, i.e., noise addition. These are dummy tuples, added to the original data that helps in achieving the required diversity similar to the differential privacy [,]. The reason is if there are not enough

A^{s}

values to swap with, especially in the second last and last ECs, the gap is filled with the noise tuples to prevent with disclosure risk. So, one of the reasons for such a good performance of the proposed model is the cost of noise addition. Figure 5 shows the number of tuples added as a noise for different values of k. These tuples are added to achieve the required value of the threshold

θ

. For different values of k, the algorithm responds differently but the maximum number of noise tuples added for a specific value of k is only six tuples. In the processed “Adult” dataset, the total number of tuples was 160,150 and only 34 noise tuples, i.e., 0.021% of the total size, were added in total. Such an amount of utility loss is negligible. This small amount of noise addition is sometimes due to get a round number when dividing the dataset size by the k size input, for example, 160150/4 = 40037.5 and 160152/4 = 40038.

Figure 5. The number of noise tuples added against each k.

6.5. Query Accuracy

Query accuracy measures precision for aggregate queries to check the utility of the anonymized data. It has been used by various research works [,]. To answer the aggregate queries, the built-in COUNT operator is used, where

A^{qi}

s are the query predicates. Consider

R^{*}

to be a sanitized release from original microdata

R

having maximum m as

A^{qi}

s

; A_{i}^{qi}

(

1 \leq i \leq m)

, where

D (A_{i}^{qi})

is the domain of i^th QI. The SQLQuery in Equation (4) for the COUNT query will work as

SQLQuery = select COUNT (*) {from R}^{*} {where A}_{1}^{qi} \in D (A_{1}^{qi}) AND . . . {AND A}_{m}^{qi} \in D (A_{m}^{qi})

(4)

Against each query, at least one or a few number of tuples should be selected from each EC based on query predicates. Two important parameters for query predicates are (1) query dimensionality q, and (2) the query selectivity

ϑ

. Query dimensionality comprises of the number of QIs in query predicate while query selectivity is the number of values for each attribute

A_{i}, (1 \leq i \leq n)

. The query selectivity is calculated as,

ϑ = \frac{| T_{Q} |}{| R |}

, where

| T_{Q} |

are the output number of tuples after using query Q on relation R, and

| R |

are the total number of tuples in the whole dataset. Query error i.e., Error(Q), is calculated in Equation (5).

Error (Q) = \frac{| count (R^{*}) - count (R) |}{count (R)}

(5)

where

count (R^{*})

depicts result set from the COUNT query on an anonymized dataset while

count (R)

is the result set from the COUNT query on the original microdata. More selective queries have a high error rate.

Figure 6a shows the query error for the input value of k. We compare the p⁺-sensitive k-anonymity and

θ

-sensitive k-anonymity using the query error rate for 1000 randomly generated aggregate queries. The error rate increases for the high value of k because of the high range in

A^{qi}

s. This selects a greater number of tuples than the original microdata and hence high error rate. In Figure 6b, it is depicted that the more we select tuples based on predicates, the higher the error rate will be in the anonymized data.

Figure 6. (a) Query error for k. (b) Query error for selectivity.

6.6. Execution Time

Figure 7 shows the execution time for both p⁺-sensitive k-anonymity model and for the proposed

θ

-sensitive k-anonymity model. The execution time for both of the algorithms increased with an increase in value of k because of the increase in

A^{qi} s

generalization range. Since we did not consider the sensitive values categorization, our approach took a small amount of time to execute as compared to its counterpart. In the

θ

-sensitive k-anonymity model, a higher execution time for k = 10, k = 16 and k = 20 was because of the time taken to add more noise tuples to achieve the required diversity.

Figure 7. Algorithm execution time.

7. Conclusions

In this paper, the huge amount of data (i.e., Big Data) collected through the IoT-based devices were anonymized using the proposed

θ

-sensitive k-anonymity privacy model in comparison to p⁺-sensitive k-anonymity model. The purpose was to prevent an attribute disclosure risk in anonymized data. The p⁺-sensitive k-anonymity model was considered to be vulnerable to a privacy breach from sensitive variance, categorical similarity, and homogeneity attacks. These attacks were mitigated by implementing the proposed

θ

-sensitive k-anonymity privacy model using Equation (1). In the proposed solution, the threshold

θ

value decides the diversity level for each EC of the dataset. The vulnerabilities in the p⁺-sensitive k-anonymity model and the effectiveness of the proposed

θ

-sensitive k-anonymity model were formally modeled through HLPN, which further ensures the validation of the proposed technique. The experimental work proved the privacy implementation and an improved utility of the released data using different mathematical measures. For future work consideration, the proposed algorithm can be extended to 1:M (single record having many attribute values) [], to multiple sensitive attributes (MSA) [,,], or can be modeled by considering the dynamic data set [] approach.

Author Contributions

Conceptualization, R.K. and X.T.; methodology, R.K., X.T., and A.A.; software, R.K.; validation, X.T. and A.A.; formal analysis, R.K., T.K., and S.u.R.M.; writing—original draft preparation, T.K.; writing—review and editing, T.K., X.T., A.A., A.K., W.u.R., and C.M.; supervision, X.T.; funding acquisition, X.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (61932005), and 111 Project of China B16006.

Conflicts of Interest

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

Dang, L.M.; Piran, J.; Han, D.; Min, K.; Moon, H. A Survey on Internet of Things and Cloud Computing for Healthcare. Electronics 2019, 8, 768. [Google Scholar] [CrossRef]
Sun, W.; Cai, Z.; Li, Y.; Liu, F.; Fang, S.; Wang, G. Security and Privacy in the Medical Internet of Things: A Review. Secur. Commun. Netw. 2018, 2018, 1–9. [Google Scholar] [CrossRef]
Baek, S.; Seo, S.-H.; Kim, S.J. Preserving Patient’s Anonymity for Mobile Healthcare System in IoT Environment. Int. J. Distrib. Sens. Netw. 2016, 12, 2171642. [Google Scholar] [CrossRef][Green Version]
Liu, F.; Li, T. A Clustering K-Anonymity Privacy-Preserving Method for Wearable IoT Devices. Secur. Commun. Netw. 2018, 2018, 1–8. [Google Scholar] [CrossRef]
Wan, J.; Al-Awlaqi, M.A.A.H.; Li, M.; O’Grady, M.; Gu, X.; Wang, J.; Cao, N. Wearable IoT enabled real-time health monitoring system. EURASIP J. Wirel. Commun. Netw. 2018, 2018, 298. [Google Scholar] [CrossRef]
Al-Khafajiy, M.; Baker, T.; Chalmers, C.; Asim, M.; Kolivand, H.; Fahim, M.; Waraich, A. Remote health monitoring of elderly through wearable sensors. Multimed. Tools Appl. 2019, 78, 24681–24706. [Google Scholar] [CrossRef]
Sweeney, L. k-anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl. Based Syst. 2002, 10, 557–570. [Google Scholar] [CrossRef]
Sweeney, L. Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl. Based Syst. 2002, 10, 571–588. [Google Scholar] [CrossRef]
Song, F.; Ma, T.; Tian, Y.; Al-Rodhaan, M. A New Method of Privacy Protection: Random k-Anonymous. IEEE Access 2019, 7, 75434–75445. [Google Scholar] [CrossRef]
Wang, J.; Du, K.; Luo, X.; Li, X. Two privacy-preserving approaches for data publishing with identity reservation. Knowl. Inf. Syst. 2018, 60, 1039–1080. [Google Scholar] [CrossRef]
Amiri, F.; Yazdani, N.; Shakery, A.; Chinaei, A.H. Hierarchical anonymization algorithms against background knowledge attack in data releasing. Knowl. Based Syst. 2016, 101, 71–89. [Google Scholar] [CrossRef]
Yaseen, S.; Abbas, S.M.A.; Anjum, A.; Saba, T.; Khan, A.; Malik, S.U.R.; Ahmad, N.; Shahzad, B.; Bashir, A.K. Improved Generalization for Secure Data Publishing. IEEE Access 2018, 6, 27156–27165. [Google Scholar] [CrossRef]
Liu, X.; Deng, R.H.; Choo, K.K.R.; Weng, J. An efficient privacy preserving outsourced calculation tool kit with multiple keys. IEEE Trans. Inf. Forensics Secur. 2016, 11, 2401–2414. [Google Scholar] [CrossRef]
Michalas, A. The lord of the shares. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, Limassol, Cyprus, 8–12 April 2019; pp. 146–155. [Google Scholar] [CrossRef]
Machanavajjhala, A.; Gehrke, J.; Kifer, D.; Venkitasubramaniam, M. L-diversity: Privacy beyond k-anonymity. Int. Conf. Data Eng. 2006, 1, 24. [Google Scholar] [CrossRef]
Li, N.; Li, T.; Venkatasubramanian, S. t-Closeness: Privacy beyond k-Anonymity and l-Diversity. In Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering, Istanbul, Turkey, 15–20 April 2007; pp. 106–115. [Google Scholar]
Sun, X.; Sun, L.; Wang, H. Extended k-anonymity models against sensitive attribute disclosure. Comput. Commun. 2011, 34, 526–535. [Google Scholar] [CrossRef]
Anjum, A.; Malik, S.U.R.; Choo, K.-K.R.; Khan, A.; Haroon, A.; Khan, S.; Khan, S.U.; Ahmad, N.; Raza, B. An efficient privacy mechanism for electronic health records. Comput. Secur. 2018, 72, 196–211. [Google Scholar] [CrossRef]
Campan, A.; Truta, T.M.; Cooper, N. p-sensitive k-anonymity with generalization constraints. Trans. Data Privacy 2010, 3, 65–89. [Google Scholar]
Al-Khafajiy, M.; Webster, L.; Baker, T.; Waraich, A. Towards fog driven IoT healthcare. In Proceedings of the 2nd International Conference on Future Networks and Distributed Systems, Amman, Jordan, 26–27 June 2018; Volume 9, p. 9. [Google Scholar]
Shahzad, A.; Lee, Y.S.; Lee, M.; Kim, Y.-G.; Xiong, N.N. Real-Time Cloud-Based Health Tracking and Monitoring System in Designed Boundary for Cardiology Patients. J. Sens. 2018, 2018, 1–15. [Google Scholar] [CrossRef]
Domingo-Ferrer, J.; Soria-Comas, J. From t-closeness to differential privacy and vice versa in data anonymization. Knowl. Based Syst. 2015, 74, 151–158. [Google Scholar] [CrossRef]
Dwork, C. Differential privacy. In International Colloquium on Automata, Languages, and Programming; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1–12. [Google Scholar]
Fung, B.C.; Wang, K.; Chen, R.; Yu, P.S. Privacy-preserving data publishing. ACM Comput. Surv. 2010, 42, 1–53. [Google Scholar] [CrossRef]
Xu, Y.; Ma, T.; Tang, M.; Tian, W. A Survey of Privacy Preserving Data Publishing using Generalization and Suppression. Appl. Math. Inf. Sci. 2014, 8, 1103–1116. [Google Scholar] [CrossRef]
Torra, V. Transparency in Microaggregation; UNECE: Skovde, Sweden, 2015; pp. 1–8. Available online: http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A861563&dswid=-2982 (accessed on 25 August 2019).
Panackal, J.J.; S.Pillai, A. Adaptive Utility-based Anonymization Model: Performance Evaluation on Big Data Sets. Procedia Comput. Sci. 2015, 50, 347–352. [Google Scholar] [CrossRef]
Rahimi, M.; Bateni, M.; Mohammadinejad, H. Extended K-Anonymity Model for Privacy Preserving on Micro Data. Int. J. Comput. Netw. Inf. Secur. 2015, 7, 42–51. [Google Scholar] [CrossRef][Green Version]
Sowmiyaa, P.; Tamilarasu, P.; Kavitha, S.; Rekha, A.; Krishna, G.R. Privacy Preservation for Microdata by using k-Anonymity Algorthim. Int. J. Adv. Res. Comput. Commun. Eng. 2015, 4, 373–375. [Google Scholar]
Wong, C.; Li, J.; Fu, W.; Wang, K. (α,k)-Anonymity: An enhanced k-anonymity model for privacy preserving data publishing. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining ACM, Philadelphia, PA, USA, 20–23 August 2006; pp. 754–759. [Google Scholar]
Zhang, Q.; Koudas, N.; Srivastava, D.; Yu, T. Aggregate Query Answering on Anonymized Tables. In Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering, Institute of Electrical and Electronics Engineers (IEEE), Istanbul, Turkey, 17–20 April 2007; pp. 116–125. [Google Scholar]
Li, J.; Tao, Y.; Xiao, X. Preservation of proximity privacy in publishing numerical sensitive data. In Proceedings of the 2008 ACM SIGMOD International Conference, Association for Computing Machinery (ACM), Vancouver, BC, Canada, 9–12 June 2008; pp. 473–486. [Google Scholar] [CrossRef]
Xiao, X.; Tao, Y. Personalized privacy preservation. In Proceedings of the 2006 ACM SIGMOD International Conference, Chicago, IL, USA, 27–29 June 2006; p. 229. [Google Scholar] [CrossRef]
Christen, P.; Vatsalan, D.; Fu, Z. Advanced Record Linkage Methods and Privacy Aspects for Population Reconstruction—A Survey and Case Studies. In Population Reconstruction; Springer: Berlin, Germany, 2015; pp. 87–110. [Google Scholar] [CrossRef]
Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
Rubner, Y.; Tomasi, C.; Guibas, L.J. The Earth Mover’s Distance as a Metric for Image Retrieval. Int. J. Comput. Vis. 2000, 40, 99–121. [Google Scholar] [CrossRef]
Ali, M.; Malik, S.U.R.; Khan, S.U. DaSCE: Data Security for Cloud Environment with Semi-Trusted Third Party. IEEE Trans. Cloud Comput. 2015, 5, 642–655. [Google Scholar] [CrossRef]
Bayardo, R.J.; Agrawal, R. Data Privacy through Optimal k-Anonymization. In Proceedings of the 21st International Conference on Data Engineering (ICDE’05), Tokyo, Japan, 5–8 April 2005; pp. 217–228. [Google Scholar]
Lefevre, K.; DeWitt, D.; Ramakrishnan, R. Mondrian Multidimensional K-Anonymity. In Proceedings of the 22nd International Conference on Data Engineering, Atlanta, GA, USA, 3–8 April 2006; p. 25. [Google Scholar]
Gong, Q.; Luo, J.; Yang, M.; Ni, W.; Li, X.-B. Anonymizing 1:M microdata with high utility. Knowl. Based Syst. 2016, 115, 15–26. [Google Scholar] [CrossRef]
Wang, R.; Zhu, Y.; Chen, T.-S.; Chang, C.-C. Privacy-Preserving Algorithms for Multiple Sensitive Attributes Satisfying t-Closeness. J. Comput. Sci. Technol. 2018, 33, 1231–1242. [Google Scholar] [CrossRef]
Anjum, A.; Ahmad, N.; Malik, S.U.R.; Zubair, S.; Shahzad, B. An efficient approach for publishing microdata for multiple sensitive attributes. J. Supercomput. 2018, 74, 5127–5155. [Google Scholar] [CrossRef]
Khan, R.; Tao, X.; Anjum, A.; Sajjad, H.; Malik, S.U.R.; Khan, A.; Amiri, F. Privacy Preserving for Multiple Sensitive Attributes against Fingerprint Correlation Attack Satisfying c-Diversity. Wirel. Commun. Mob. Comput. 2020, 2020, 1–18. [Google Scholar] [CrossRef]
Zhu, H.; Liang, H.B.; Zhao, L.; Peng, D.Y.; Xiong, L. τ-Safe (l,k)-Diversity Privacy Model for sequential publication with high utility. IEEE Access 2019, 7, 687–701. [Google Scholar] [CrossRef]

Figure 1. HLPN for p⁺-sensitive k-anonymity attack model.

Figure 2. HLPN for

θ

-sensitive k-anonymity.

Figure 3. Discernibility penalty (DSP) score.

Figure 4. The ratio of C_AVG.

Figure 5. The number of noise tuples added against each k.

Figure 6. (a) Query error for k. (b) Query error for selectivity.

Figure 7. Algorithm execution time.

Table 1. a. Original microdata. b. 2-Anonymous microdata.

(a)

ID	Name	Age	Zip Code	Country	Disease
1	JULIAN	34	14247	USA	HIV
2	KALEEM	40	14208	Pakistan	HIV
3	JOHANNA	26	14205	USA	Cancer
4	MICHAEL	25	14242	Canada	Cancer
5	JUDITH	40	14054	USA	Hepatitis
6	EVA	48	13073	Japan	Phthisis
7	HARIS	45	14066	Pakistan	Asthma
8	PAUL	40	14063	USA	Obesity
9	YIN LI	40	14243	China	Flu
10	BEVERLY	37	14203	Canada	Flu
11	DENISE	36	14204	Canada	Flu
12	JANETTE	35	14247	USA	Indigestion

(b)

ID	Age	Zip Code	Country	Disease
1	34–40	14208-14247	**	HIV
2	34–40	14208-14247	**	HIV
3	25–26	14205-14242	America	Cancer
4	25–26	14205-14242	America	Cancer
5	>= 40	14054-14063	America	Hepatitis
6	>= 40	14054-14063	America	Obesity
7	>= 40	13073-14066	Asia	Asthma
8	>= 40	13073-14066	Asia	Phthisis
9	35–40	14243-14247	**	Flu
10	35–40	14243-14247	**	Indigestion
11	36–37	14203-14204	America	Flu
12	36–37	14203-14204	America	Flu

Table 2. Summary of notations used.

Symbol	Description	Symbol	Description
$M T$	Microdata Table	$A_{i}^{qi}$	Quasi identifier for i^th end user
$MMT$	Micro Mask Table	$A^{s}$	Sensitive Attributes
A	Attributes in MT	$A^{id}$	Identifier Attribute
PD	Published Data	$A_{ECc}^{s}$	Sensitive value in an ${EC}_{c}$
$ECs$	Set of Equivalence classes	$A_{ECn}^{s}$	Sensitive value in an ${EC}_{n}$
${EC}_{i}$	k-anonymous group of tuples with the combination of $A_{i}^{qi}$ and $A^{s}$	$A_{ECn - 1}^{s}$	Sensitive value in an ${EC}_{n - 1}$
${EC}_{c}$	Equivalence Class current	$A_{ECb}^{s}$	Sensitive value in an ${EC}_{b}$
${EC}_{b}$	Equivalence Class broken	$N$	Noise
$V_{{EC}_{i}}$	Variance for ${EC}_{i}$	$M$	Total number of record in an EC
${MS}_{n}$	Max frequency of $A_{i}^{s}$ in an EC_n	${MS}_{c}$	Max frequency of $A_{i}^{s}$ in an EC_c
${MS}_{n - 1}$	Max frequency of $A_{i}^{s}$ in an EC_n-1	${MS}_{b}$	Max frequency of $A_{i}^{s}$ in an EC_b
$P$	Places used in formal modeling	$G_{i}^{qi}$	QI-group at index i
$φ$	Data Types in formal modeling

Table 3. Category table.

Category ID	Sensitive Values
1	HIV, Cancer
2	Hepatitis, Phthisis
3	Asthma, Obesity
4	Indigestion, Flu

Table 4. a. 2⁺-Sensitive 4-Anonymous. b. (3,1)-Sensitive 4-Anonymous.

(a)

ECs	ID	Age	Zip Code	Country	Disease
EC1	1	=< 40	14204-14247	America	HIV
	2	=< 40	14204-14247	America	Cancer
	3	=< 40	14204-14247	America	Flu
	4	=< 40	14204-14247	America	Indigestion
EC2	5	>= 40	13073-14066	****	Hepatitis
	6	>= 40	13073-14066	****	Phthisis
	7	>= 40	13073-14066	****	Asthma
	8	>= 40	13073-14066	****	Obesity
EC3	9	=< 40	14203-14247	****	HIV
	10	=< 40	14203-14247	****	Cancer
	11	=< 40	14203-14247	****	Flu
	12	=< 40	14203-14247	****	Flu

(b)

ID	Age	Zip Code	Country	Disease
1	=< 40	14205-14247	****	HIV
2	=< 40	14205-14247	****	HIV
3	=< 40	14205-14247	****	Cancer
4	=< 40	14205-14247	****	Flu
5	>= 40	13073-14066	****	Hepatitis
6	>= 40	13073-14066	****	Phthisis
7	>= 40	13073-14066	****	Asthma
8	>= 40	13073-14066	****	Obesity
9	=< 40	14203-14247	America	Cancer
10	=< 40	14203-14247	America	Flu
11	=< 40	14203-14247	America	Flu
12	=< 40	14203-14247	America	Indigestion

Table 5. Variance calculation for different equivalence classes (ECs) in Table 4a.

EC2						EC3
Sensitive Values	$x$	$f$	$x^{2}$	$f * x$	$f * x^{2}$	Sensitive Values	$x$	$f$	$x^{2}$	$f * x$	$f * x^{2}$
Hepatitis	1	1	1	1	1	Flu	1	2	1	2	2
Phthisis	2	1	4	2	4	Cancer	2	1	4	2	4
Asthma	3	1	9	3	9	HIV	3	1	9	3	9
Obesity	4	1	16	4	16			$N = \sum f = 4$		$\sum f x = 7$	$\sum f x^{2} = 15$
		$N = \sum f = 4$		$\sum f x = 10$	$\sum f x^{2} = 30$			$N = \sum f = 4$		$\sum f x = 7$	$\sum f x^{2} = 15$
Variance ( $σ^{2})$	$(\frac{\sum {fX}^{2}}{N} - {(\frac{\sum fX}{N})}^{2}) = (\frac{30}{4} - {(\frac{10}{4})}^{2}) = 1.25$					Variance $(σ^{2})$	$(\frac{\sum {fX}^{2}}{N} - {(\frac{\sum fX}{N})}^{2}) = (\frac{15}{4} - {(\frac{7}{4})}^{2}) = 0.69$

Table 6. Types used in high-level Petri nets (HLPN) for p+-sensitive k-anonymity.

Data Types	Description
k	User input for k-anonymity
p	p-sensitivity numeric value
C	Distinct categories set
Condition	Boolean value 1 or 0
S_n	Total distinct $A^{s}$ values
C_n	Total distinct categories
$A_{i}^{si}$	Sensitive Attribute for i^th end user
$A_{i}^{id}$	Identifier attribute for i^th end user

Table 7. Data-types, places, and their mapping.

Places	Description
$φ$ (MT)	ℙ ( $A^{qi}$ × $A^{s}$ × $A^{id})$
$φ$ (MMT)	ℙ ( $A^{qi}$ × $A^{s}$ × k)
$φ$ (KLevel)	ℙ (k)
$φ$ (CondTF)	ℙ (Condition)
$φ$ (Gi)	ℙ ( $A^{qi}$ × $A^{s}$ × k)
$φ$ (ds)	ℙ ( $A^{s})$
$φ$ (CountDs)	ℙ (S_n)
$φ$ ( ${Gi}^{'}$ )	ℙ ( $A^{qi}$ × $A^{s}$ × k× C)
$φ$ (PLevel)	ℙ (p)
$φ$ (CompC)	ℙ ( $C_{n}$ )
$φ$ (Publish Data)	ℙ ( $A^{qi}$ × $A^{s}$ )
$φ$ (BK)	ℙ ( $A^{id}$ × $A^{qi})$
$φ$ (SA Disc)	ℙ ( $A_{i}^{qi}$ × $A_{i}^{si}$ × $A_{i}^{id})$

Table 8. a. θ-sensitive 4-anonymous (with noise). b. θ-sensitive 4-anonymous (without noise).

(a)

ID	Age	Zip Code	Country	Disease
1	=< 40	14054-14247	America	HIV
2	=< 40	14054-14247	America	Cancer
3	=< 40	14054-14247	America	Hepatitis
4	=< 40	14054-14247	America	Obesity
5	>= 40	13073-14243	Asia	HIV
6	>= 40	13073-14243	Asia	Phthisis
7	>= 40	13073-14243	Asia	Asthma
8	>= 40	13073-14243	Asia	Flu
9	=< 40	14063-14247	America	Cancer
10	=< 40	14063-14247	America	Flu
11	=<40	14063-14247	America	Flu
12	=<40	14063-14247	America	Indigestion
13	=<40	14063-14247	America	Obesity

(b)

ID	Age	Zip Code	Country	Disease
1	=< 40	14054-14247	America	Hepatitis
2	=< 40	14054-14247	America	HIV
3	=< 40	14054-14247	America	Cancer
4	=< 40	14054-14247	America	Flu
5	>= 40	13073-14243	Asia	HIV
6	>= 40	13073-14243	Asia	Phthisis
7	>= 40	13073-14243	Asia	Asthma
8	>= 40	13073-14243	Asia	Flu
9	=< 40	14063-14247	America	Cancer
10	=< 40	14063-14247	America	Obesity
11	=< 40	14063-14247	America	Flu
12	=< 40	14063-14247	America	Indigestion

Table 9. Types used in HLPN for

θ

-sensitive k-anonymity.

Table 9. Types used in HLPN for

θ

-sensitive k-anonymity.

Data Types	Descriptions
M	Size of an EC
Condition	Boolean value 1 or 0
$σ$	A float type value to define Sigma
µ	A float type value to define Mu
$θ$	A float type value to define Theta
Found ${EC}_{b}$	Equivalence class b when it is found
${AdjEC}_{c}$	Adjust Equivalence class c
${AdjEC}_{n}$	Adjust Equivalence class n
${VarEC}_{s}$	Variance of different Equivalence classes
${VarAdjEC}_{n}$	Adjust variance for Equivalence class n
${VarAdjEC}_{c}$	Adjust variance for Equivalence class c

Table 10. Mapping of data types in

θ

-sensitive k-anonymity model.

Table 10. Mapping of data types in

θ

-sensitive k-anonymity model.

Places	Descriptions
$φ$ ( $MT$ )	ℙ ( $A^{id} \times A^{qi} \times A^{s}$ )
$φ$ ( $MMT$ )	ℙ ( ${EC}_{c} \times {EC}_{b} \times {EC}_{n} \times k)$
$φ$ ( $KValue$ )	ℙ (k)
$φ$ ( $CondTF$ )	ℙ (Condition)
$φ$ ( $Sigma$ )	ℙ ( $σ$ )
$φ$ ( $Mu$ )	ℙ ( $μ$ )
$φ$ ( $Theta$ )	ℙ ( $θ$ )
$φ$ ( ${Found EC}_{b}$ )	ℙ ( ${EC}_{b}$ )
$φ$ ( ${VarEC}_{s}$ )	ℙ ( $V_{{EC}_{c}} \times V_{{EC}_{b}} \times V_{{EC}_{n}}$ )
$φ$ ( ${AdjEC}_{c}$ )	ℙ ( ${EC}_{c}$ )
$φ$ ( ${AdjEC}_{n}$ )	ℙ ( ${EC}_{n}$ )
$φ$ ( ${StrictEC}_{n - 1}$ )	ℙ ( ${EC}_{n - 1}$ )
$φ$ ( ${VarAdjEC}_{n}$ )	ℙ ( $V_{{EC}_{n}}$ )
$φ$ ( ${VarAdjEC}_{c}$ )	ℙ ( $V_{{EC}_{c}}$ )
$φ$ ( $Need Noise$ )	ℙ ( $V_{{EC}_{c}} \times A^{id} \times A^{qi} \times A^{s}$ )
$φ$ ( $PublshdData$ )	ℙ ( $A^{qi} \times A^{s}$ )
$φ$ ( $BK$ )	ℙ ( $A^{id} \times A^{qi}$ )
$φ$ ( $SA Disc$ )	ℙ ( $A_{i}^{qi}$ × $A_{i}^{si}$ × $A_{i}^{id}$ )

Table 11. DCP experiment values for each k.

k	Baseline	θ-Sensitive	p⁺-Sensitive
2	320300	320303	571227
4	640600	640605	778626
6	960900	960912	1214207
8	1281200	1281215	1467959
10	1601500	1601520	1876310
12	1921800	1921824	2096543
14	2242100	2242145	2632775
16	2562400	2562470	3017773
18	2882700	2882812	3315591
20	3203000	3203166	3628936
Average Val.	1761650	1761697.2	2059994.7
Diff. of θ and p⁺ avg. values with base avg. value	--	47.2	298344.7
Percent Closer to baseline	--	0.002679235	14.65
% diff. between θ and p⁺	--	14.64	--
This means that our proposed approach θ-sensitive, k-anonymity is 14.64% better than p⁺-sensitive k-anonymity and 0.002679% closer to the baseline.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

θ-Sensitive k-Anonymity: An Anonymization Model for IoT based Electronic Health Records

Abstract

1. Introduction

1.1. Motivation

1.2. Contributions

2. Related Work

3. Preliminaries

4. Problem Statement

Critical Review of p⁺-Sensitive k-Anonymity Model

5. The Proposed $θ$ -Sensitive k-Anonymity Privacy Model

5.1. Threshold $θ$ -Sensitivity

5.1.1. Variance ( $σ^{2}$ )

5.1.2. Observation 1 ( $μ$ )

5.2. The Proposed $θ$ -Sensitive k-Anonymity Algorithm

5.3. Analysis of $θ$ -Sensitive k-Anonymity Model Using Formal Modeling and Analysis

6. Experimental Evaluation

6.1. Experimental Setup

6.2. Discernibility Penalty (DCP)

6.3. Normalized Average (C_AVG)

6.4. Noise Addition

6.5. Query Accuracy

6.6. Execution Time

7. Conclusions

Author Contributions

Funding

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

θ-Sensitive k-Anonymity: An Anonymization Model for IoT based Electronic Health Records

Abstract

1. Introduction

1.1. Motivation

1.2. Contributions

2. Related Work

3. Preliminaries

4. Problem Statement

Critical Review of p+-Sensitive k-Anonymity Model

5. The Proposed θ -Sensitive k-Anonymity Privacy Model

5.1. Threshold θ -Sensitivity

5.1.1. Variance ( σ 2 )

5.1.2. Observation 1 ( μ )

5.2. The Proposed θ -Sensitive k-Anonymity Algorithm

5.3. Analysis of θ -Sensitive k-Anonymity Model Using Formal Modeling and Analysis

6. Experimental Evaluation

6.1. Experimental Setup

6.2. Discernibility Penalty (DCP)

6.3. Normalized Average (CAVG)

6.4. Noise Addition

6.5. Query Accuracy

6.6. Execution Time

7. Conclusions

Author Contributions

Funding

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

Critical Review of p⁺-Sensitive k-Anonymity Model

5. The Proposed $θ$ -Sensitive k-Anonymity Privacy Model

5.1. Threshold $θ$ -Sensitivity

5.1.1. Variance ( $σ^{2}$ )

5.1.2. Observation 1 ( $μ$ )

5.2. The Proposed $θ$ -Sensitive k-Anonymity Algorithm

5.3. Analysis of $θ$ -Sensitive k-Anonymity Model Using Formal Modeling and Analysis

6.3. Normalized Average (C_AVG)