The developed MOPSO-based framework for data sanitization described in this section consists of two steps. In the first (data processing) step, the frequent itemsets and the pre-large itemsets are discovered and placed into the FI and PF sets, respectively, for later processing; details are provided in the next subsection. Moreover, the transactions containing any confidential information are projected as a new database for later processing. In the second (evolution) step, the two updating strategies of gbest and pbest for particles with dominance relationships and the GMPSO algorithm are designed to iteratively update the particles during the evolutionary process. The pre-large concept is also utilized to reduce the execution time of the evolutionary computation. Details are given as follows.
4.1. Data Processing
Before the sanitization process, the user has to specify the confidential information to be hidden, and such itemsets are placed in the CI set. The frequent itemsets and the pre-large itemsets are then discovered against the minimum and pre-large (lower) support counts and placed into the FI and PF sets, respectively. Because the evolution process requires a certain number of computations, the pre-large itemsets are used as a buffer to reduce the computational cost of evaluating the artificial-cost side effect in the evolution process. The lower support threshold (lowsup) of the pre-large concept is derived as follows:

lowsup = minsup × (|D| − m) / |D|,

where minsup is the minimum support threshold, |D| is the number of transactions in the database, and m is the size of a particle in the designed MOPSO-based framework, which is defined through Equation (5).
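As a quick illustration, the bound above can be computed directly; the function name and the sample numbers below are illustrative only.

```python
def lower_support(minsup: float, db_size: int, particle_size: int) -> float:
    """Lower support threshold of the pre-large concept.

    Itemsets whose support ratio falls below this bound cannot become
    large even if up to `particle_size` transactions are deleted (worst
    case: the itemset's count is unchanged while the database shrinks
    from db_size to db_size - particle_size transactions).
    """
    return minsup * (db_size - particle_size) / db_size

# e.g., minsup = 0.10, |D| = 1000 transactions, particle size m = 40:
low = lower_support(0.10, 1000, 40)   # 0.096
# itemsets with support ratio in [0.096, 0.10) are kept as pre-large (PF)
```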
Thanks to the discovered large and pre-large itemsets, multiple database scans can be greatly reduced during the evolutionary process. To find better transactions for deletion in PPDM, the database is first processed to project the transactions in which any of the confidential information in the set CI appears; these transactions form the projected database. Each transaction in the projected database contains at least one itemset from CI and is thus a candidate of the particles for later deletion during the evolutionary process. The detailed algorithm is given in Algorithm 1.
Algorithm 1: Data Processing
In the data processing, the inputs D and CI are the original database and the set of confidential itemsets, respectively, and the outputs are the projected database, the set of large itemsets (FI), and the set of pre-large itemsets (PF). First, the original database is scanned to obtain FI and PF using minsup and lowsup, respectively (line 1). The algorithm used for mining FI and PF can be a variant of the pre-large algorithm [45,46]. The transactions containing any of the confidential itemsets are then projected (lines 2–4). The size of each particle is then calculated (line 5) for the later evolutionary process. Thanks to the advantages of the pre-large itemsets, the artificial-cost side effect can be easily maintained and updated because the potential itemsets that may become large itemsets are kept in PF. Finally, the projected database, the frequent itemsets (FI), and the pre-large itemsets (PF) are returned as the outputs (line 6) for the next evolution process.
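Since the figure for Algorithm 1 is not reproduced here, the following Python sketch mirrors the steps just described. The exhaustive subset counting and the container choices are illustrative assumptions, not the paper's pre-large implementation.

```python
from collections import Counter
from itertools import combinations

def data_processing(D, CI, minsup, lowsup):
    """Sketch of Algorithm 1: returns the projected database, FI, and PF.

    D  -- list of transactions, each a set of items
    CI -- set of confidential itemsets (frozensets)
    """
    n = len(D)
    # Line 1: scan the database to count itemsets.  (A real implementation
    # would use a pre-large mining algorithm; exhaustive subset counting
    # is used here purely for illustration.)
    counts = Counter()
    for t in D:
        for k in range(1, len(t) + 1):
            for iset in combinations(sorted(t), k):
                counts[frozenset(iset)] += 1
    FI = {i for i, c in counts.items() if c / n >= minsup}
    PF = {i for i, c in counts.items() if lowsup <= c / n < minsup}
    # Lines 2-4: project the transactions containing any confidential itemset.
    projected = [t for t in D if any(ci <= t for ci in CI)]
    # Line 5: the particle size m (Equation (5)) would be computed here.
    return projected, FI, PF
```

For example, with a four-transaction database and CI = {{a, b}}, only the two transactions containing both a and b are projected for possible deletion.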
4.2. Evolution Process
In the second evolution step, the particles are evaluated to obtain better solutions for the next iteration. In the MOPSO-based framework, each particle represents a possible solution with m vectors, and each vector holds a transaction ID (TID) that indicates a transaction for deletion. Note that a vector in a particle can hold a null value. In the updating stage of the evolutionary process, the formulas are thus defined as follows:

r_i^(t+1) = rand(x_i^(t) ∪ {null}), (12)

x_i^(t+1) = v_i^(t+1) ∪ r_i^(t+1), with v_i^(t+1) = (x_i^(t) − pbest_i) ∪ (gbest − x_i^(t)), (13)

where x_i^(t) denotes the i-th particle at iteration t, '−' is the set difference over TIDs, and rand(·) randomly draws TIDs from the elder particle (or the null value) until the particle reaches its size m.
In the updating process, the designed GMPSO handles data sanitization in PPDM as a discrete problem; thus, several parameters used in traditional PSO are unnecessary in the designed GMPSO. For the updating process, TIDs from the elder particle or the null value are randomly chosen to fill the particle to its full size, as shown in Equation (12). The result is then combined with the updated velocity of the particle, as shown in Equation (13). This method increases the randomness of exploration during the evolution process.
For instance, suppose that the particle size is set to 4, a particle is initially set as [1, 2, 5, 8], gbest is initially set as [0, 2, 0, 5], and pbest is initially set as [2, 1, 0, 6]. Note that 0 represents a null value, which indicates that no transaction is selected for deletion. Thus, the particle can be updated as follows: {[1, 2, 5, 8] − [2, 1, 0, 6]} ∪ {[0, 2, 0, 5] − [1, 2, 5, 8]} = [5, 8] ∪ [0] = [5, 8, 0]. Because the size of a particle is set to 4, one transaction is then randomly selected from the elder content (1, 2, 5, 8); here, 1 is selected. Thus, the particle is updated as [5, 8, 0, 1].
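The worked example above can be reproduced with a small sketch; the helper below is our illustrative reading of the discrete update, with the final pick drawn at random from the elder particle contents or the null value.

```python
import random

NULL = 0   # 0 denotes the null value (no transaction selected)

def update_particle(x, pbest, gbest, rng=random.Random()):
    """Discrete GMPSO-style update (sketch): keep the TIDs of x that are
    not in pbest, add the TIDs of gbest that are not in x, then pad with
    random picks from the elder particle (or null) up to size m."""
    m = len(x)
    velocity = [t for t in x if t not in pbest] + \
               [t for t in gbest if t not in x]
    new = list(dict.fromkeys(velocity))[:m]   # de-duplicate, cap at m
    while len(new) < m:                       # random fill from the elder content
        new.append(rng.choice(list(x) + [NULL]))
    return new

updated = update_particle([1, 2, 5, 8], [2, 1, 0, 6], [0, 2, 0, 5])
# always begins with [5, 8, 0]; the fourth slot is a random elder TID or 0
```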
Because the MOPSO-based framework solves a multi-objective problem, the pbest of each particle cannot be determined simply from a single fitness value. Thus, the non-domination relation is adopted to obtain pbest during the updating process. The local updating strategy is defined as follows:
Strategy 1 (Local Updating Strategy, LUS). If the current particle dominates its last pbest, pbest is replaced by the current particle; otherwise, a random selection is conducted to choose one of the two as pbest for the next iteration.
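A minimal sketch of this strategy follows; the function names are our own, and the dominance check assumes the four side effects are all to be minimized.

```python
import random

def dominates(a, b):
    """True if fitness vector a Pareto-dominates b (minimization):
    no worse in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def local_update(particle, fit_particle, pbest, fit_pbest, rng=random.Random()):
    """Strategy 1 (LUS): replace pbest when the current particle
    dominates it; otherwise randomly select one of the two."""
    if dominates(fit_particle, fit_pbest):
        return particle, fit_particle
    return rng.choice([(particle, fit_particle), (pbest, fit_pbest)])
```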
To obtain gbest, a global updating strategy is then designed to obtain a better updating solution during the evolution process. In the designed GMPSO algorithm, a grid-based method is applied to the set of candidate particles, and each particle is assigned a probability based on the number of grids and the number of particles in each grid. A random function is then performed to choose one of the candidate particles as gbest based on their assigned probabilities. The global updating strategy can be stated as follows:
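One way to realize this grid-based selection is sketched below: fitness vectors are hashed into grid cells, and each particle's selection probability is set inversely proportional to its cell's crowding. The cell-hashing details are our assumption, since the paper's exact grid layout is not specified here.

```python
import random
from collections import Counter

def grid_prob(fitness, grid_size=5):
    """GridProb (sketch): assign each particle a selection probability
    inversely proportional to how crowded its grid cell is."""
    dims = range(len(fitness[0]))
    lo = [min(f[d] for f in fitness) for d in dims]
    hi = [max(f[d] for f in fitness) for d in dims]
    def cell(f):
        # map each objective value to a grid index in [0, grid_size - 1]
        return tuple(min(grid_size - 1,
                         int(grid_size * (f[d] - lo[d]) / ((hi[d] - lo[d]) or 1)))
                     for d in dims)
    cells = [cell(f) for f in fitness]
    occupancy = Counter(cells)
    weights = [1.0 / occupancy[c] for c in cells]
    total = sum(weights)
    return [w / total for w in weights]

def select_gbest(pool, fitness, rng=random.Random()):
    """Strategy 2 (GUS): roulette-wheel pick of gbest using grid probabilities."""
    return rng.choices(pool, weights=grid_prob(fitness), k=1)[0]
```

Particles in sparsely populated cells receive larger probabilities, which favors under-explored regions of the Pareto front.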
Strategy 2 (Global Updating Strategy, GUS). Without loss of generality, the multi-objective optimization problem with n objectives (n ≥ 2) to be minimized can be defined as:

min F(x) = (f_1(x), f_2(x), …, f_n(x)),

which is subject to {g_j(x) ≤ 0, j = 1, …, J} and {h_k(x) = 0, k = 1, …, K}, and x ∈ Ω. Moreover, x is the vector of optimization variables; Ω is the optimized domain, which can be defined by the Cartesian product of the domain of each optimization variable; F is the vector of objective functions of the problem; and g_j and h_k represent, respectively, the inequality and equality constraints of the problem. The set of feasible solutions is represented by X_f ⊆ Ω.
Essentially, the goal of multi-objective evolutionary algorithms is to obtain a diverse set of estimates of the Pareto optimal set, which contains the non-dominated solutions of the multi-objective problem.
Pareto dominance [
47] has been the most commonly adopted criterion used to discriminate among solutions in the multi-objective context, and therefore it has been the basis to develop most of the MOEAs proposed so far. A theorem of Pareto dominance relation is given below.
Theorem 1. A feasible point a Pareto-dominates another feasible point b if it holds that F(a) ≤ F(b) and F(a) ≠ F(b), where the relation operations ≤ and ≠ are defined as: f_i(a) ≤ f_i(b) for all i ∈ {1, …, n}, and f_i(a) < f_i(b) for at least one i ∈ {1, …, n}.
In the above theorem, a and b represent two different decision vectors. This dominance relation is usually denoted as a ≺ b. Based on Pareto dominance, the four side effects can be considered as four objectives, and the goal of the optimization in PPDM is to find a set of non-dominated solutions for transaction deletion. Thus, Pareto dominance can be utilized in the designed algorithm while preserving the correctness of the dominance relationships among the derived solutions. The details of the designed GMPSO algorithm are described in Algorithm 2.
Algorithm 2: Proposed GMPSO Algorithm
Here, the inputs are the projected database, the set of confidential information (CI), the set of large itemsets (FI), the set of pre-large itemsets (PF), and the population size (N). In addition, four fitness functions respectively represent the number of hiding failures, the missing cost, the artificial cost, and the database dissimilarity (Dis). First, N particles are randomly initialized as m vectors based on the size of each particle (line 1). The generated particles are then put into the initial population. The candidate set of the Pareto front is then set as null (line 4). Next, an iterative process is applied until the termination criterion is achieved, for example, a maximum number of iterations (lines 5–18). Each particle in the candidate set of the Pareto front is then evaluated using the four fitness functions (line 7) to find the non-dominated solutions (lines 8–14). The GridProb function is then applied to the satisfied solutions to assign a probability to each particle in the grid. The pseudo-code of the GridProb function is illustrated in Algorithm 3.
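Under this description, Algorithm 2's outer loop can be sketched as follows; the callback signatures (evaluate, select_gbest, update) are illustrative placeholders rather than the paper's exact interface.

```python
def gmpso(population, evaluate, select_gbest, update, iterations=100):
    """Sketch of Algorithm 2's main loop: maintain an external archive
    (the candidate Pareto front) of non-dominated particles and evolve
    the swarm for a fixed number of iterations.

    evaluate(p) returns the 4-tuple of side effects for particle p:
    (hiding failures, missing cost, artificial cost, dissimilarity).
    """
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and a != b

    swarm = list(population)
    archive = []   # candidate set of the Pareto front (line 4)
    for _ in range(iterations):                    # lines 5-18
        for p in swarm:                            # lines 7-14
            f = evaluate(p)
            if any(dominates(evaluate(q), f) or evaluate(q) == f
                   for q in archive):
                continue                           # dominated or duplicate
            archive = [q for q in archive if not dominates(f, evaluate(q))]
            archive.append(p)
        gbest = select_gbest(archive)              # grid-based GUS
        swarm = [update(p, gbest) for p in swarm]  # Equations (12) and (13)
    return archive
```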
Algorithm 3: GridProb(Pool)
In Algorithm 3, the size of the grid is initially set based on user preference. The number of particles within each grid is then found (line 3), and the probability of each particle within a grid is calculated (line 5). Each particle thus receives a probability for the later selection of gbest, which increases the diversity of the solutions. To clearly explain the abbreviations used in this paper, the explanations of all acronyms are given in Appendix A.