3.3. EARH-GA
Let us consider a transaction database D = {T1, T2, …, Tn}, where each transaction Ti is a subset of the item set I. Let CT ⊆ D, where CT indicates the set of transactions in D that contain the victim item. ℕ refers to the set of positive integers. The solution space is set as S ⊆ ℕ^δ, where each element s ∈ S is a vector s = (t1, t2, …, tδ), where each ti is the ID of a transaction in CT.
δ is the number of transactions to be altered with the aim of hiding the sensitive rule. If we have a rule X ⇒ Y, then δ is calculated by Equation (1):
δ = α(X ⇒ Y) − ⌊MC × α(X)⌋,(1)
where α(X ⇒ Y) is the support count of the rule and α(X) is the support count of its left-hand side.
A Venn diagram that represents the relationship between X and Y is presented in Figure 1.
Two approaches are most commonly used to alter transactions so as to push the confidence of a sensitive rule X ⇒ Y below the MC. (1) Item deletion: this approach decreases the support count of the right-hand side (RHS) of the rule, Y, by eliminating certain items; the items are removed from sensitive transactions that support both X and Y. (2) Item addition: this approach, on the other hand, increases the support count of the itemset on the left-hand side (LHS) of the rule, X, by adding items to transactions that contain X. With the first approach, a few nonsensitive rules may be hidden, but all remaining rules are authentic. The second strategy, by contrast, may generate illegitimate rules in the sanitized database and may also fail to conceal the sensitive rules. The item deletion approach therefore has an advantage over item addition. Since the data sanitization process aims to protect sensitive knowledge against mining techniques, hiding failure is the most critical of the side effects. Thus, we adopt the item deletion approach.
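To make the effect of item deletion concrete, the following toy sketch (the transactions and item names are illustrative, not from the paper) shows how removing the victim item from transactions that support both X and Y lowers the rule's confidence:

```python
def support_count(db, itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

# Toy database: three transactions support x => y, one supports only x.
db = [{"x", "y"}, {"x", "y"}, {"x", "y"}, {"x"}]
X, Y = {"x"}, {"y"}

conf_before = support_count(db, X | Y) / support_count(db, X)  # 3/4 = 0.75

# Item deletion: remove the victim item "y" (RHS) from one supporting transaction.
db[0].discard("y")
conf_after = support_count(db, X | Y) / support_count(db, X)   # 2/4 = 0.5
```

Note that deleting "y" lowers the numerator (support of the rule) while leaving the denominator (support of X) unchanged, which is exactly why the confidence drops.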
Genetic optimization is proposed to convert D to D′. In order to achieve genetic optimization, two things have to be defined: the solution encoding and the objective function. The solution is encoded as a vector of indices indicating the transactions that will be altered, while the objective function is the ratio of the number of lost NSARs to the total number of NSARs. The genetic algorithm is called once per sensitive rule, so the number of iterations equals the number of sensitive rules, and each iteration deals with one sensitive association rule. First, our approach elects the victim item from the sensitive association rule. The item with minimum support on the RHS of the rule is chosen as the victim item, since such items produce fewer frequent itemsets; the association rules containing the victim item are therefore less frequent than those containing other items, so deleting the victim item has a minor effect on the nonsensitive association rules. Second, the algorithm calculates δ, the number of transactions that need to be altered, using Equation (3). Third, it selects the transactions that contain the victim item as candidate transactions. Fourth, the GA is called and returns the selected transactions; these are the transactions whose alteration has the minimum impact on the nonsensitive rules, and their number is δ. Finally, the algorithm removes the victim item from the selected transactions according to the best solution, the database is updated, and the algorithm moves on to the next SAR. Optimizing association rule hiding using a genetic algorithm is presented in Algorithm 1. To elaborate on how the genetic algorithm works, we present its pseudocode in Algorithm 2. The genetic algorithm receives DB′, the dataset that has to be altered, along with AR, SAR, CT, δ, VI, MS, and MC.
The result of the genetic algorithm is the set of selected transactions, which represents the algorithm's global best solution. After initializing the population, i.e., a set of random vectors drawn from the solution space, the fitness of each solution is evaluated using the fitness function: the number of lost NSARs over the total number of NSARs when the solution is applied. Applying a solution means deleting the victim item from the transactions selected by that solution. A set of iterative steps is then performed; in each step, the elites are selected and used to produce offspring through two operations, crossover and mutation, and the objective function is called to evaluate all the solutions in the population.
Algorithm 1. Optimizing association rule hiding using a genetic algorithm.
Input: Database (DB), association rules (AR), sensitive association rules (SAR), SARcounter = 0, minimum support (MS), minimum confidence (MC)
Output: DB′
Start
  DB′ = DB
  While SARcounter < |SAR|
    SARcounter = SARcounter + 1
    for i = 1 to |items in the right-hand side of SAR|
      itemSupport(i) = Support(DB′, item(i))
    end
    VI = the item with minimum itemSupport  // VI indicates the victim item
    δ = Support(DB′, SAR items) − floor(MC × Support(DB′, items in the left-hand side of SAR))
    candidateTransactions = transactions to which VI belongs
    selectedTransactions = GeneticAlgorithm(DB′, AR, SAR, NSAR, candidateTransactions, δ, VI, MS, MC)
    DB′ = Remove(selectedTransactions, VI, DB′)
  EndWhile
End
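The outer loop of Algorithm 1 can be sketched in Python as follows (a sketch, not the authors' implementation; the GA is passed in as a callback, and the rule-mining step is omitted):

```python
from math import floor

def support_count(db, itemset):
    """Number of transactions in db containing every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def hide_rules(db, sensitive_rules, nsar, MC, genetic_algorithm):
    """One pass of Algorithm 1: hide each sensitive rule (lhs, rhs) in turn."""
    db = [set(t) for t in db]                  # DB' starts as a copy of DB
    for lhs, rhs in sensitive_rules:           # one iteration per SAR
        # Victim item: the RHS item with minimum support in DB'.
        victim = min(rhs, key=lambda i: support_count(db, {i}))
        # delta = alpha(X => Y) - floor(MC * alpha(X))
        delta = support_count(db, lhs | rhs) - floor(MC * support_count(db, lhs))
        if delta <= 0:
            continue                           # rule already below MC
        candidates = [tid for tid, t in enumerate(db) if victim in t]
        selected = genetic_algorithm(db, candidates, delta, victim, nsar)
        for tid in selected:                   # apply the best solution
            db[tid].discard(victim)
    return db
```

A trivial stand-in for the GA (e.g., one that simply returns the first δ candidate IDs) is enough to exercise this loop; the real GA only changes which δ transactions are altered.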
Algorithm 2. The adopted genetic algorithm pseudocode.
Input: Database (DB′), association rules (AR), sensitive association rule (SAR), nonsensitive association rules (NSAR), candidate transactions (CT), δ, victimItem (VI), MS, MC
Output: Global best individual
Start
  initialPopulation = initialize population randomly
  FitnessValues = objectiveFunction(initialPopulation, DB′, CT, VI, AR, NSAR, MS, MC, δ)
  while (!stopCondition)
    bestFitIndividuals = selectElite(FitnessValues)
    newIndividuals = crossover and mutation(bestFitIndividuals)
    FitnessValues = objectiveFunction(newIndividuals, DB′, CT, VI, AR, NSAR, MS, MC, δ)
  endwhile
  globalBestIndividual = individual with min(FitnessValues)
End

objectiveFunction(Population, DB′, CT, VI, AR, NSAR, MS, MC, δ)
  For each solution from Population
    NewDB′ = DeleteVictim(DB′, VI, solution)
    LostRules = FindLostRules(NewDB′, DB′)
    LNSAR = FindLostNonSensitive(LostRules, NSAR)
    FitnessValues.Add(LNSAR/NSAR)
  end
End
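A minimal, self-contained Python sketch of Algorithm 2 follows. The rule-mining step is abstracted behind a `lost_nonsensitive(solution)` callback (a hypothetical stand-in for FindLostRules and FindLostNonSensitive); the population size, generation count, and stop condition are illustrative choices, while the elite count ceil(0.05 × population size) and the mutation rate 0.01 follow the values used in the tracing example below.

```python
import math
import random

def genetic_algorithm(candidates, delta, lost_nonsensitive, n_nsar,
                      pop_size=20, generations=30, mutation_rate=0.01, seed=0):
    """Return the chromosome (delta distinct transaction IDs) with minimum
    fitness = lost NSARs / |NSARs|."""
    rng = random.Random(seed)

    def fitness(sol):
        return lost_nonsensitive(sol) / n_nsar     # to be minimized

    def crossover(a, b):
        mask = [rng.random() < 0.5 for _ in range(delta)]  # random binary vector
        child = [x if m else y for m, x, y in zip(mask, a, b)]
        for i in range(delta):                     # repair duplicate IDs
            if child[i] in child[:i]:
                child[i] = rng.choice([c for c in candidates if c not in child])
        return child

    def mutate(sol):
        for i in range(delta):
            if rng.random() < mutation_rate:       # per-entry mutation rate
                free = [c for c in candidates if c not in sol]
                if free:
                    sol[i] = rng.choice(free)
        return tuple(sol)

    population = [tuple(rng.sample(candidates, delta)) for _ in range(pop_size)]
    n_elite = max(1, math.ceil(0.05 * pop_size))   # elite count, as in the paper
    for _ in range(generations):
        population.sort(key=fitness)
        elites = population[:n_elite]              # elites survive unchanged
        offspring = [mutate(crossover(rng.choice(elites), rng.choice(elites)))
                     for _ in range(pop_size - n_elite)]
        population = elites + offspring
    return min(population, key=fitness)
```

Because the elites are carried forward unchanged, the best fitness in the population never worsens from one generation to the next.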
3.4. Tracing Example
The goal of this example is to elaborate on the operation of the genetic algorithm. We present a small dataset in Table 2 and walk through all the steps required to obtain a new dataset in which a given sensitive rule is hidden.
Table 2 provides an example of a transaction database. Each transaction in the dataset comprises a subset of the item set I = {a, b, c, d, e, f, g, h}; the minimum support (MS) is 37.5% and the minimum confidence (MC) is 75%. Twenty-four association rules are shown in Table 3. These rules are the result of the Apriori data mining algorithm [10]. Other popular ARM algorithms are FP-Growth [32] and Eclat [33].
If the rule {l} ⇒ {m, n} were chosen as a sensitive rule, then the victim item would be (n), because it has lower support than (m): α(n) = 5 and α(m) = 6. Therefore, (n) should be removed from a certain number of transactions; this number is δ, which can be computed using Equation (2). Altering one transaction is enough to hide the rule {l} ⇒ {m, n}, owing to the small size of the dataset in our example. Now let us suppose a dataset with 120 transactions in which the support count of the rule is 98 (α(X ⇒ Y) = 98), the support count of the itemset on the LHS of the rule is 102 (α(X) = 102), and the minimum confidence threshold is 0.75 (MC = 0.75); this case can be formulated as Equation (3). Then δ = 98 − ⌊0.75 × 102⌋ = 22, so twenty-two transactions are sanitized. The support count of the rule is thereby reduced to 76, and the revised confidence of the rule would be 76/102 ≈ 0.745, which is below MC.
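The arithmetic of this larger example can be checked directly (a quick sketch using only the numbers given above; the revised confidence is the reduced rule support over α(X)):

```python
from math import floor

alpha_rule = 98    # support count of X => Y
alpha_lhs = 102    # support count of X
MC = 0.75

delta = alpha_rule - floor(MC * alpha_lhs)    # 98 - 76 = 22 transactions
new_conf = (alpha_rule - delta) / alpha_lhs   # 76 / 102, just under MC
assert delta == 22 and new_conf < MC
```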
Obtaining the best transactions is the result of applying a genetic algorithm, and the next section will demonstrate how this works.
Let us get back to the previous example, but with βmin = 50%, and with the sensitive rules (SARs) of {a} ⇒ {b, c} and {d, f} ⇒ {e}. First, we want to hide {a} ⇒ {b, c}; the victim item is (c), and
δ is now 2 as computed using Equations (4) and (5). The sensitive rules are shown in
Table 4.
Candidate transactions support the sensitive rule, so they are the transactions with IDs {1, 3, 5, 7} in the transaction dataset in Table 2; they also form the solution space for the genetic algorithm. The solution length (chromosome length) is 2 because δ = 2. If the population size is 4, then the initial generation would be something like {1, 3}, {5, 7}, {1, 5}, and {3, 7}.
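An initial population of this form can be generated, for instance, as follows (a sketch; the RNG seed is an arbitrary choice, so the concrete chromosomes may differ from those listed above):

```python
import random

# Each chromosome is delta = 2 distinct transaction IDs drawn from the
# candidate set {1, 3, 5, 7} (the transactions supporting {a} => {b, c}).
rng = random.Random(42)
candidates, delta, pop_size = [1, 3, 5, 7], 2, 4
population = [rng.sample(candidates, delta) for _ in range(pop_size)]
```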
Next, the objective function is applied to each of these solutions. For example, for solution {1, 3}, if we delete (c) from the transactions with IDs 1 and 3, the new dataset would be DB′, as shown in Table 5, and the new rules would be those shown in Table 6.
The fitness value of this solution is #lost NSARs/#NSARs = 1/22. Let us say that the remaining solutions have fitness values of 2/22, 5/22, and 4/22. The next step is to select the elite individuals: the elites are the best individuals, which are guaranteed to survive to the next generation and are used to produce new individuals. The elite count is ceil(0.05 × population size); in our example, it is ceil(0.05 × 4) = 1, and the elite individual is {1, 3}, with a fitness of 1/22.
In order to elaborate the crossover and mutation, we assume two parent solutions, A and B, each containing the transactions that have to be altered. The crossover is done by generating one random binary vector with the same length as the parents; the child then inherits each gene from A where the vector entry is 1 and from B where it is 0. Following the crossover, mutation is performed, which applies a minor tweak to a chromosome in order to obtain diverse solutions. In this operation, a fraction of the entries of an individual is first chosen; each selected entry has a probability of 0.01 of being mutated. In the next step, a mutated entry is replaced by a random number drawn between the upper and lower limits for that entry. For example, for a child with six entries (3, 6, 9, 1, 12, 5) and a randomly selected fraction of 50%, entries (1, 2, 3) are checked against the mutation rate; if entry (3) is a hit, the number (9) is replaced by another random number selected uniformly between the upper and lower limits for this entry.
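The crossover and mutation just described can be sketched as follows (parent B's values and the entry bounds [1, 12] are illustrative assumptions; the fraction 0.5 and rate 0.01 are the values from the text):

```python
import random

rng = random.Random(7)

def uniform_crossover(a, b):
    mask = [rng.random() < 0.5 for _ in range(len(a))]   # random binary vector
    # The child takes the gene from A where the mask is 1, from B where it is 0.
    return [x if m else y for m, x, y in zip(mask, a, b)]

def mutate(child, low, high, fraction=0.5, rate=0.01):
    n = max(1, int(len(child) * fraction))               # entries to check
    for i in rng.sample(range(len(child)), n):
        if rng.random() < rate:                          # per-entry mutation rate
            child[i] = rng.randint(low, high)            # redraw within the bounds
    return child

child = uniform_crossover([3, 6, 9, 1, 12, 5], [2, 8, 4, 10, 7, 11])
child = mutate(child, low=1, high=12)
```

Note that in the actual hiding problem each gene is a transaction ID, so the bounds would be the valid range of candidate transaction IDs for that entry.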