Efficiently Supporting Online Privacy-Preserving Data Publishing in a Distributed Computing Environment

Abstract: There has recently been an increasing need for the collection and sharing of microdata containing information regarding an individual entity. Because microdata typically contain sensitive information on an individual, releasing them directly for public use may violate existing privacy requirements. Thus, extensive studies have been conducted on privacy-preserving data publishing (PPDP), which ensures that any microdata released satisfy the privacy policy requirements. Most existing privacy-preserving data publishing algorithms consider a scenario in which a data publisher, receiving a request for the release of data containing personal information, anonymizes the data prior to publishing, a process that is usually conducted offline. However, with the increasing demand for the sharing of data among various parties, it is more desirable to integrate the data anonymization functionality into existing systems that are capable of supporting online query processing. Thus, we developed a novel scheme that is able to efficiently anonymize the query results on the fly, and thus support efficient online privacy-preserving data publishing. In particular, given a user's query, the proposed approach effectively estimates the generalization level of each quasi-identifier attribute, thereby achieving the k-anonymity property in the query result datasets based on the statistical information without applying k-anonymity on all actual datasets, which is a costly procedure. The experiment results show that, through the proposed method, significant gains in processing time can be achieved.


Introduction
Recently, there has been an increasing need for the collection and sharing of microdata, which contain information on an individual entity. Microdata are a valuable source of information in diverse areas. Many different organizations, including healthcare providers, apply data analysis techniques to large volumes of microdata to extract hidden knowledge with the goal of improving their decision-making capabilities. However, microdata typically contain sensitive information about an individual, and thus directly releasing such data for public use may violate existing privacy requirements. To avoid the privacy problems that occur through the release of microdata for public use, extensive studies have been conducted in the area of privacy-preserving data publishing (PPDP) [1][2][3][4]. These methods ensure that the microdata released satisfy privacy requirements such as k-anonymity [1,2]. Although such methods differ in the way in which the original microdata are transformed into another format that is releasable for public use, they are all based on the same principle, that is, individuals cannot be uniquely identified in the data released.
Most existing privacy-preserving data publishing algorithms consider a scenario in which data owners, on receiving a request for the release of data containing personal information, anonymize the data before publishing them, a process that is conducted offline. However, with the increasing demand for data sharing among various parties, such an offline data publishing scenario is insufficient to support a large volume of data release requests. Instead, it is more desirable to integrate the data anonymization functionality into existing systems that are capable of supporting online query processing, such as a database management system or data warehouse. As our motivating application, let us consider the example scenario shown in Figure 1, in which databases are managed by either a database management system or a data warehouse. Here, a data publisher can submit a normal SQL query with anonymization parameters to the system. Then, the system returns the resulting anonymized set to the data publisher who, in turn, releases it to data users for data analytics purposes. However, thus far, the existing privacy-preserving data publishing techniques have overlooked this online privacy-preserving data publishing scenario. The major challenge in supporting online privacy-preserving data publishing is the efficiency of query processing. Applying data anonymization to query results during the query processing phase clearly adds significant overhead, thereby degrading query performance. Thus, to address this efficiency challenge, in this study, we developed a novel scheme that is able to efficiently anonymize the query results on the fly, and therefore support efficient online privacy-preserving data publishing.
In online query auditing, an aggregate query, such as sum, max or min, posed over sensitive data is denied if the query result can reveal sensitive information [5][6][7]. That is, given a sequence of queries that have already been answered, online query auditing denies a new query whenever an answer to the query, along with previous query results, can reveal private data. Furthermore, it is known that anonymization methods are vulnerable to re-identification attacks caused by the release of multiple anonymous datasets when previously released anonymous datasets are not taken into account. Thus, in the literature, extensive studies have been conducted to support continuous data publishing using anonymization methods [8][9][10][11][12]. For example, Wang and Fung [9] studied the problem of sequentially releasing k-anonymous tables with different sets of attributes of the original table. In addition, Fung et al. [10] addressed the problem of continuously publishing k-anonymous tables of an original table into which a new set of records is continuously inserted. Moreover, Xiao and Tao [11] proposed an anonymization algorithm that supports the continuous publication of microdata in the presence of inserted, deleted, and updated records. We note that the method proposed in this paper is a general framework that can be extended along with these existing methods to support a continuous anonymous table release.
The rest of this paper is structured as follows: In the next section, we present the related work. In Section 3, we introduce the background, and then formally define the problem addressed in this paper. In Section 4, we present our algorithm for efficiently supporting online privacy-preserving data publishing. In Section 5, we experimentally evaluate our approach using real datasets. Finally, we provide some concluding remarks in Section 6.

Related Work
Extensive studies have been conducted in the area of privacy-preserving data publishing (PPDP). The most popular anonymization algorithm, k-anonymity, was first formulated in [1]. Various algorithms have been proposed to achieve the k-anonymity requirement. LeFevre et al. found full-domain optimal k-anonymous generalizations with a bottom-up pruning approach [2]. Wang et al. proposed a bottom-up generalization algorithm to find a minimal k-anonymization for classification [13]. Fung et al. presented the top-down specialization scheme in which the specialization process terminates if further specialization on quasi-identifier attribute values violates the k-anonymity requirement [14]. Mondrian [15] is a multidimensional generalization model that anonymizes data by recursively partitioning the space across the dimensions. Clustering-based methods have also been proposed to effectively find a k-anonymous table. For example, the methods in [16,17] group k similar records into a cluster and generalize each cluster to achieve k-anonymity. Besides k-anonymity, many privacy metrics have been proposed in the literature. Machanavajjhala et al. [3] introduced l-diversity, which requires that each equivalence class has at least l well-represented values of a sensitive attribute. Li et al. proposed t-closeness, which requires that the distribution of a sensitive attribute in each equivalence class is similar to its distribution in the entire table [4]. p+-sensitive k-anonymity was proposed to prevent similarity attacks, and thus to reduce the potential threat of attribute disclosure [18][19][20]. Kim et al. [21] developed a delay-free anonymization method to publish electronic health data streams. In [22], a utility-preserving anonymization method for PPDP was proposed, which preserves the utility of health data by inserting counterfeit records and creating a catalog of the counterfeit records in the process of data anonymization. Khan et al. [23] introduced the θ-sensitive k-anonymity privacy model, in which the threshold θ determines the diversity level of an equivalence class, to prevent the sensitive variance attack when publishing electronic health records. A comprehensive survey of privacy-preserving data publishing can be found in [24][25][26][27][28][29].
Differential privacy (DP) [30], which is the strongest scheme for protecting individuals' privacy in released data, has been extensively studied in diverse areas, including data mining and medical analysis. DP guarantees that an attacker with arbitrary background knowledge cannot infer with high confidence whether a particular individual is participating in the query result (or the published data). DP can be used in two different settings. The first one is the offline setting, where a statistical summary, such as histograms or a set of synthetic data that mimic the original data, is released for public use [31]. The second one is the online setting, where the user issues a statistical query to the original database, and then a perturbed version of the query result is returned to the user [32]. With its strong privacy guarantees, DP has been used in various application areas and many variants of DP have been proposed in the literature, such as local differential privacy [33][34][35][36] and geo-indistinguishability [37,38]. DP can be used for publishing location data in a privacy-preserving manner by using a spatial histogram [39,40]. DP-compliant spatial histograms are constructed by first partitioning a spatial domain into several cells and then adding carefully calibrated noise to the true count of objects located within the boundaries of each cell. Unlike anonymization methods, DP is mostly used for the release of aggregated results, such as histograms or cross tabulations. However, several recent attempts have been made to apply DP along with an anonymization algorithm to the publishing of microdata. For example, Lee and Chung [41] proposed a method for releasing the ε-DP version of an original dataset. The method proposed in [41] uses anonymization methods based on generalization, suppression, and insertion, along with DP, to generate an ε-DP version of an original dataset. Guo et al. [42] proposed a method based on the combination of k-anonymity and DP for publishing physiological signals collected by wearable devices.

Background
The most popular anonymization algorithm, k-anonymity, was formulated in [1]. The k-anonymity algorithm guarantees that, for each record, there are at least k − 1 other records included in the released data that have the same values for a set of quasi-identifier attributes (which are defined as special attributes that can be linked with external data to uniquely identify individual records in the released data), thereby ensuring that every record in the released data is indistinguishable from at least k − 1 others, despite a linkage attack [1,2]. Each record in a dataset is generalized into an indistinguishable group, called an equivalence class, by replacing the specific values of the quasi-identifier attributes with more general values. For instance, let us consider the example table in Figure 2a, in which the attributes Age and Zip are quasi-identifier attributes, and the attribute Disease is a sensitive attribute. Let us further assume that the domain generalization hierarchies of Age and Zip are defined as in Figure 3. Then, the k-anonymous table in Figure 2b is obtained by replacing the values of the quasi-identifier attributes, Age and Zip, of each record with more general values defined in the domain generalization hierarchies. For example, the first record, ⟨14, 3068, Pneumonia⟩, in Figure 2a is generalized as ⟨10-20, 3060-3070, Pneumonia⟩ in Figure 2b, and thus is indistinguishable from the next three records (i.e., RID = 2, 3, 4). Many k-anonymity algorithms employ the concept of a generalization lattice to compute an anonymous table. A generalization lattice over attribute domain generalization hierarchies is constructed using the set of all possible combinations of the generalization levels of each attribute (Figure 3). Then, an optimal k-anonymized table is computed by traversing the generalization lattice in a bottom-up manner until the k-anonymity property is satisfied. See [2] for a more detailed description of the k-anonymity algorithm.
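The generalization step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records other than the first one are hypothetical (Figure 2a is not reproduced here), the hierarchy is approximated by fixed-width ranges, and the names generalize_range and is_k_anonymous are introduced for this sketch only.

```python
from collections import Counter

def generalize_range(value, width):
    """Map an integer to a range label of the given width, e.g. 14 -> '10-20'."""
    low = (value // width) * width
    return f"{low}-{low + width}"

def is_k_anonymous(records, k):
    """Return True if every (Age, Zip) combination occurs at least k times."""
    counts = Counter((age, zip_code) for age, zip_code, _disease in records)
    return all(count >= k for count in counts.values())

# Hypothetical microdata (Age, Zip, Disease); only the first record comes from the text.
raw = [
    (14, 3068, "Pneumonia"),
    (12, 3061, "Flu"),
    (18, 3065, "Bronchitis"),
    (19, 3069, "Flu"),
]

# Generalize both quasi-identifiers by one level (assumed width of 10).
generalized = [
    (generalize_range(age, 10), generalize_range(zip_code, 10), disease)
    for age, zip_code, disease in raw
]

print(is_k_anonymous(raw, 4))          # False: every raw record is unique
print(is_k_anonymous(generalized, 4))  # True: all four records share one class
```

After generalization, the first record becomes ('10-20', '3060-3070', 'Pneumonia'), matching the example in the text, and the four records form a single equivalence class.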

Problem Statement
Let us assume the relation $R(A_1, A_2, \cdots, A_m)$. Let $A = \{A_1, A_2, \cdots, A_m\}$ be the set of attributes in $R$. Let us further assume that $Q \subseteq A$ is a set of quasi-identifier attributes and $S \subseteq A$ is a set of sensitive attributes. In this paper, we focus on a selection query on a single relation, which can be written as follows:
SELECT proj_list FROM R WHERE pred_1 AND pred_2 AND · · · AND pred_l
Here, we further assume the following:
• proj_list consists of attributes that are in either $Q$ or $S$,
• the condition in the WHERE clause consists of $l$ conjunctive selection conditions, $pred_1, pred_2, \cdots, pred_l$, and
• each predicate $pred_i$ can be either an equality condition or a range condition on an attribute not in proj_list.
In this paper, we consider a scenario in which the results of a query are anonymized using k-anonymity with generalization, which is the most popular type of scheme.
We assume a distributed system running on a shared nothing architecture in which the data are horizontally split and stored in multiple nodes. In a shared nothing architecture, each node has its own private resources, such as memory or disk, and thus does not share resources with other nodes. In a distributed environment, given a selection query on a single relation, every slave node executes the query against its own data and sends the query result to the master node. The master node then aggregates all query results from all nodes and sends them to the user. In this distributed query processing scenario, none of the slave nodes can locally anonymize their own query results because the generalization level of each quasi-identifier attribute used to satisfy the k-anonymity requirement cannot be determined without combining all of the results from each of the slave nodes. Hence, the straightforward solution is to first aggregate the results from every slave node and then globally anonymize the aggregated results at the master node (Figure 4a). However, considering the large volume of data stored in the system, this global approach is highly inefficient because the data anonymization is performed solely by the master node. Furthermore, with a global approach, the resources of slave nodes are not utilized during the anonymization phase. A more promising solution is for each node to locally anonymize its own results and send the anonymized results to the master node, which combines the anonymized results from every slave node (Figure 4b). Thus, in this study, we develop a method that enables the query result at each slave node to be locally anonymized as much as possible, thereby fully utilizing the resources of the slave nodes during the anonymization phase.

Efficient Support of Online Data Publishing
In this section, we describe the proposed algorithm for efficiently anonymizing the query results on the fly in a distributed environment to support online privacy-preserving data publishing. The proposed approach is summarized as follows (Figure 5):

1. First, given a query, the master node estimates the generalization level of each quasi-identifier attribute needed to satisfy the k-anonymity property over the query result datasets, and then sends it to each slave node along with the user query (Section 4.1);
2. Each slave node then executes the user query, anonymizes its own query results based on the generalization information received from the master node, and sends the anonymized query results to the master node (Section 4.2);
3. Finally, the master node aggregates the anonymized query results from every slave node and returns the aggregated results to the user (Section 4.3).
It is well known that k-anonymity algorithms are generally computationally expensive and complex, making them difficult to perform well with large amounts of data [43]. Thus, several approximation methods requiring a trade-off between data utility and computing time have been proposed [44][45][46][47]. We also note that the approach proposed in this study is an approximation-based algorithm in that it trades off between data utility and computing time. We will now describe each of the above three steps in detail.

Phase I: Estimating the Generalization Level
In this paper, we estimate the generalization level of each quasi-identifier attribute to achieve k-anonymity over the query results by leveraging statistical information, such as the histograms that are maintained for query optimization purposes in most commercial database management systems. In general, a histogram on attribute $A_i$ is constructed by dividing the entire value range of $A_i$ into $w$ disjoint subranges, $H(A_i) = \{x_1, x_2, \cdots, x_w\}$. Each subrange $x_j$ usually stores $x_j^s$, $x_j^e$, $f_j$, and $dv_j$. Here, $x_j^s$ and $x_j^e$ represent the start point and the end point of the subrange $x_j$, respectively. Furthermore, $f_j$ corresponds to the number of tuples whose $A_i$ values lie between $x_j^s$ and $x_j^e$, and $dv_j$ represents the number of distinct values in $x_j$.
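To make the histogram structure concrete, the following Python sketch builds a simple equi-width histogram whose buckets store the four quantities described above. It is an illustration only: the text does not say which histogram type is used, so the equi-width layout, the inclusive integer bounds, and the names Bucket and build_histogram are assumptions of this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bucket:
    start: int     # start point of the subrange (x_j^s)
    end: int       # end point of the subrange, inclusive (x_j^e)
    freq: int      # number of tuples falling in the subrange (f_j)
    distinct: int  # number of distinct values in the subrange (dv_j)

def build_histogram(values, lo, hi, w):
    """Split the integer domain [lo, hi] into at most w equal-width buckets."""
    width = -(-(hi - lo + 1) // w)  # ceiling division
    buckets = []
    for start in range(lo, hi + 1, width):
        end = min(start + width - 1, hi)
        inside = [v for v in values if start <= v <= end]
        buckets.append(Bucket(start, end, len(inside), len(set(inside))))
    return buckets

# Hypothetical attribute values over the domain [0, 9], split into two buckets.
hist = build_histogram([1, 1, 2, 5, 7, 7, 7, 9], lo=0, hi=9, w=2)
for bucket in hist:
    print(bucket)
```

A production system would normally read these statistics from the optimizer's catalog rather than rebuild them from the data.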
Given a query, we first estimate the size of the query result. Estimating the query result size, which is known as cardinality estimation, has been extensively studied over the past several decades [48][49][50][51][52][53]. Although there are many complex algorithms that can provide a very high level of accuracy in cardinality estimation, our approach uses a solution based on the assumption of attribute value independence. Cardinality estimation in many database management systems indeed relies on this assumption owing to its simplicity and reasonably good accuracy. For example, PostgreSQL [54], which is a well-known open-source DBMS, assumes that all attributes are mutually independent and maintains one-dimensional histograms [53].
Given a histogram $H(A_i) = \{x_1, x_2, \cdots, x_w\}$ and a predicate $pred_u$ involving attribute $A_i$, let $s_u$ be the selectivity ratio associated with $pred_u$. The overall distribution of the values of attribute $A_i$ is captured by the histogram $H(A_i)$. Furthermore, we assume that attribute values in each subrange of $H(A_i)$ are uniformly distributed, which is a common assumption in modern database systems [53]. Then, the selectivity ratio $s_u$ is obtained as follows.
For the equality predicate (i.e., $\sigma_{A_i = val}(R)$, where $val$ denotes any integer value located within the range between $x_j^s$ and $x_j^e$), $s_u$ is defined as:
$$s_u = \frac{f_j / dv_j}{|R|}$$
Here, $|R|$ is the number of tuples in the relation $R$.
For the range predicate (i.e., $\sigma_{val_1 \leq A_i \leq val_2}(R)$, where $val_1$ and $val_2$ are any integer values located within the range of $x_j^s$ and $x_j^e$), $s_u$ is the fraction of the subrange $x_j$ covered by the predicate, multiplied by the fraction of tuples in $x_j$:
$$s_u = \frac{val_2 - val_1 + 1}{x_j^e - x_j^s + 1} \times \frac{f_j}{|R|}$$
Note that the above equation considers the case in which $val_1$ and $val_2$ are located within the same subrange $x_j$. Let us assume the case where $val_1$ and $val_2$ are located in different subranges, $x_j$ and $x_{j+k}$, respectively. In this case, the subranges $x_{j+1}, x_{j+2}, \cdots, x_{j+k-1}$ are fully covered by the range predicate, while the subranges $x_j$ and $x_{j+k}$ are partially covered. Thus, in this case, the selectivity ratio $s_u$ is computed as:
$$s_u = \frac{1}{|R|} \left( \frac{x_j^e - val_1 + 1}{x_j^e - x_j^s + 1} f_j + \sum_{h=j+1}^{j+k-1} f_h + \frac{val_2 - x_{j+k}^s + 1}{x_{j+k}^e - x_{j+k}^s + 1} f_{j+k} \right)$$
Given a query having $l$ predicates, $pred_1, pred_2, \cdots, pred_l$, let $R_{res}$ be the corresponding result relation. Then, given the selectivity ratios $s_1, s_2, \cdots, s_l$, computed as explained previously, the query result size is estimated as:
$$|R_{res}| = |R| \times \prod_{u=1}^{l} s_u$$
That is, based on the assumption of attribute value independence, the query result size is computed from the product of all selectivity ratios, $s_1, s_2, \cdots, s_l$.
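The selectivity and result-size estimates can be sketched as follows. This is a minimal illustration, assuming integer attribute values, inclusive bucket bounds, and uniform distribution inside each bucket; the Bucket type, the function names, and the example histogram are hypothetical and are not taken from the paper's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bucket:
    start: int     # inclusive start of the subrange
    end: int       # inclusive end of the subrange
    freq: int      # tuples in the subrange
    distinct: int  # distinct values in the subrange

def eq_selectivity(hist, val, total):
    """Selectivity of A = val: tuples per distinct value in the bucket, over |R|."""
    for b in hist:
        if b.start <= val <= b.end and b.distinct > 0:
            return b.freq / b.distinct / total
    return 0.0

def range_selectivity(hist, lo, hi, total):
    """Selectivity of lo <= A <= hi: each bucket contributes its covered fraction."""
    matched = 0.0
    for b in hist:
        overlap = min(hi, b.end) - max(lo, b.start) + 1
        if overlap > 0:
            matched += b.freq * overlap / (b.end - b.start + 1)
    return matched / total

def estimate_result_size(total, selectivities):
    """Independence assumption: |R| times the product of all selectivities."""
    size = float(total)
    for s in selectivities:
        size *= s
    return size

# Hypothetical histogram over a relation with 1000 tuples.
hist = [Bucket(0, 9, 100, 10), Bucket(10, 19, 300, 10), Bucket(20, 29, 600, 10)]
s_eq = eq_selectivity(hist, 15, 1000)            # 300 / 10 / 1000
s_range = range_selectivity(hist, 5, 24, 1000)   # (50 + 300 + 300) / 1000
print(s_eq, s_range, estimate_result_size(1000, [s_range, 0.5]))
```

The single loop in range_selectivity covers both the same-subrange case and the multi-subrange case, because a fully covered bucket simply contributes a fraction of one.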
Once the number of query results is computed, we next estimate the generalization level needed to achieve the k-anonymity property over the query results. Given a projection list, proj_list, of a query, let $Q_{proj} = \{A_1, A_2, \cdots, A_y\}$ be the set of quasi-identifier attributes in proj_list (where $Q_{proj} \subseteq Q$). Let us further assume that $L(N, E)$ is a generalization lattice constructed with the attributes in $Q_{proj}$, where $N$ and $E$ are the sets of nodes and edges, respectively. The set of possible values for the quasi-identifier attributes in $Q_{proj}$ at a specific node $n_i \in N$ is then defined as follows:
$$EQ_{n_i} = V_{A_1} \times V_{A_2} \times \cdots \times V_{A_y}$$
Here, $V_{A_t}$ ($1 \leq t \leq y$) is the set of possible values for the quasi-identifier attribute $A_t$ at node $n_i$. Note that each possible value combination in $EQ_{n_i}$ indeed corresponds to an equivalence class in the k-anonymity algorithm. For example, each element in $EQ_{n_5}$ corresponds to an equivalence class at the node $n_5$ in the generalization lattice.
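As a small illustration of the lattice and of the value combinations at a node, the sketch below enumerates lattice nodes bottom-up and builds the Cartesian product of generalized values at one node. The two-level hierarchies in LEVELS are hypothetical (they are not those of Figure 3), and the ordering of nodes within the same height is an arbitrary choice of this sketch.

```python
from itertools import product

# Hypothetical hierarchies: for each attribute, the value set at each level
# (level 0 is the most specific).
LEVELS = {
    "Age": [["10-20", "20-30"], ["10-30"]],
    "Zip": [["3060-3070", "3070-3080"], ["3060-3080"]],
}

def lattice_nodes(levels):
    """All combinations of per-attribute levels, ordered bottom-up by height."""
    attrs = list(levels)
    nodes = product(*(range(len(levels[a])) for a in attrs))
    return sorted(nodes, key=lambda n: (sum(n), n))

def equivalence_classes(levels, node):
    """Cartesian product of the attribute value sets at the given lattice node."""
    attrs = list(levels)
    return list(product(*(levels[a][lvl] for a, lvl in zip(attrs, node))))

print(lattice_nodes(LEVELS))
print(equivalence_classes(LEVELS, (1, 0)))
```

Each tuple returned by equivalence_classes plays the role of one candidate equivalence class at that node.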
Given a result relation, $R_{res}$, and a node, $n_i$, in a generalization lattice, let $R_{res}[v_1, v_2, \cdots, v_y]$ be an equivalence class whose values correspond to $(v_1, v_2, \cdots, v_y) \in EQ_{n_i}$ at node $n_i$. The size of an equivalence class is then estimated as:
$$|R_{res}[v_1, v_2, \cdots, v_y]| = |R_{res}| \times \prod_{t=1}^{y} s_{A_t = v_t}$$
Here, $s_{A_t = v_t}$ ($1 \leq t \leq y$) is the selectivity ratio associated with the quasi-identifier attribute $A_t$ and the value $v_t$. Note that $s_{A_t = v_t}$ is estimated by leveraging a histogram, $H(A_t)$. For example, $s_{Age = 10\text{-}20}$ corresponds to the selectivity ratio associated with $\sigma_{10 \leq Age \leq 20}(R)$, which is computed using a histogram, $H(Age)$, as described earlier. Similarly, $s_{Zip = 3071\text{-}3080}$ can be computed using a histogram $H(Zip)$.
k-anonymity is achieved if each equivalence class contains at least k tuples. Thus, given node $n_i$ in a generalization lattice, our approach checks whether the equivalence class having the minimum size satisfies the k-anonymity property as follows:
$$\min_{(v_1, v_2, \cdots, v_y) \in EQ_{n_i}} |R_{res}[v_1, v_2, \cdots, v_y]| \geq k$$
Our approach traverses the nodes of a generalization lattice in a bottom-up manner, as in [2], until a node that satisfies the above condition is found. However, it should be noted that, unlike the method in [2], our approach estimates the generalization level of each quasi-identifier attribute for k-anonymity based on the estimation method presented in this subsection, instead of performing k-anonymity on actual datasets.
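Phase I as a whole can be sketched as a bottom-up search over the lattice that uses only estimated class sizes. The code below is a simplified illustration, not the paper's implementation: the selectivity tables in sel are hypothetical and assumed to have been derived from histograms beforehand, and the tie-breaking order within a lattice level is an assumption.

```python
from itertools import product
from math import prod

def estimated_min_class_size(result_size, per_attr_selectivities):
    """Smallest estimated class size: |R_res| times the product of selectivities."""
    return min(
        result_size * prod(combo)
        for combo in product(*(s.values() for s in per_attr_selectivities))
    )

def estimate_generalization_node(result_size, sel, k):
    """Return the first node (bottom-up) whose smallest estimated class has >= k tuples.

    sel[attr][level] maps each generalized value to its selectivity ratio.
    """
    attrs = list(sel)
    nodes = sorted(
        product(*(range(len(sel[a])) for a in attrs)),
        key=lambda n: (sum(n), n),
    )
    for node in nodes:
        per_attr = [sel[a][lvl] for a, lvl in zip(attrs, node)]
        if estimated_min_class_size(result_size, per_attr) >= k:
            return node
    return None  # even the most general node is estimated to be too small

# Hypothetical selectivities for two quasi-identifiers with two levels each.
sel = {
    "Age": [{"10-20": 0.2, "20-30": 0.8}, {"10-30": 1.0}],
    "Zip": [{"3060-3070": 0.5, "3070-3080": 0.5}, {"3060-3080": 1.0}],
}
print(estimate_generalization_node(100, sel, k=15))  # (0, 1)
```

With an estimated result size of 100 and k = 15, node (0, 0) has a smallest estimated class of about 10 tuples and is rejected, while node (0, 1) reaches about 20 and is returned. No actual tuples are touched during this search, which is where the efficiency gain comes from.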

Phase II: Executing a Query and Anonymizing Local Query Results
Upon receiving the user query from the master node, each slave node executes the received query over its local data collections and applies anonymization to the query results according to the generalization information received from the master node. It then returns the anonymized results to the master node. We note that this phase is executed in parallel by the slave nodes, which allows the resources of the slave nodes to be fully utilized during the anonymization phase.
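The following sketch mimics Phase II on a single machine, with one thread standing in for each slave node. It is an assumption-laden illustration: the hierarchies are approximated by the hypothetical bucket widths in WIDTHS, the partitions are made-up local query results, and a real deployment would run this logic on separate nodes rather than in a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical hierarchies: bucket width per generalization level (1 = unchanged).
WIDTHS = {"Age": [1, 10, 20], "Zip": [1, 10, 100]}

def generalize(value, width):
    """Replace a value by the label of the fixed-width range that contains it."""
    if width == 1:
        return str(value)
    low = (value // width) * width
    return f"{low}-{low + width}"

def anonymize_locally(rows, levels):
    """Generalize (Age, Zip, Disease) rows to the levels received from the master."""
    age_width = WIDTHS["Age"][levels["Age"]]
    zip_width = WIDTHS["Zip"][levels["Zip"]]
    return [
        (generalize(age, age_width), generalize(zip_code, zip_width), disease)
        for age, zip_code, disease in rows
    ]

# Hypothetical local query results held by two slave nodes.
partitions = [
    [(14, 3068, "Pneumonia"), (25, 3071, "Flu")],
    [(19, 3062, "Flu")],
]
levels = {"Age": 1, "Zip": 1}  # generalization levels estimated in Phase I

with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    local_results = list(
        pool.map(lambda part: anonymize_locally(part, levels), partitions)
    )
print(local_results)
```

Because every node applies the same levels, the locally anonymized rows can be concatenated at the master without any reconciliation of value formats.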

Phase III: Aggregating (and Further Anonymizing) Locally Anonymized Results
In the final phase, the master node aggregates the anonymized results from every slave node. Because the method proposed in this paper estimates the generalization level of each quasi-identifier attribute based on the histograms, the aggregated results may not satisfy the k-anonymity requirement with minimal generalization when either an under- or overestimation occurs. An underestimation corresponds to a case in which the node in the generalization lattice estimated by the algorithm described in Section 4.1 is located, in the traversal order of the lattice nodes, before the node at which k-anonymity with minimal generalizations is achieved. Similarly, an overestimation is defined as a case in which the estimated node is located, in the traversal order of the lattice nodes, after the node at which k-anonymity with minimal generalizations is satisfied.
For example, in Figure 6a, let us assume that k-anonymity with minimal generalizations is achieved with the node $\langle O_0, P_2 \rangle$, which is highlighted with the red oval. Furthermore, let us assume that the generalization lattice is traversed in a bottom-up manner and that nodes at the same level are traversed from left to right. In this example, the node $\langle O_2, P_0 \rangle$, estimated using the algorithm described in Section 4.1, corresponds to an underestimation case. On the other hand, the estimated node $\langle O_2, P_1 \rangle$ is an overestimation case.
Figure 6. (a) Example of an underestimation and an overestimation in Phase I, in which we assume that k-anonymity with minimal generalizations is achieved with the node $\langle O_0, P_2 \rangle$, and (b) for the case of an underestimation, the master node may apply the anonymization process on the actual aggregated results (blue area), skipping the nodes that were already visited during Phase I (red area).
Hence, after aggregating the anonymized results from every slave node, the master node needs to check whether k-anonymity is satisfied over the aggregated results. If so, the aggregated results are returned to the user. However, if k-anonymity is not satisfied owing to an underestimation of the generalization level of each quasi-identifier attribute, the master node needs to conduct further anonymization of the aggregated results until k-anonymity is satisfied. It should be noted that, even in such a situation, the proposed method is more efficient than the baseline approach (i.e., the global approach shown in Figure 4a), because the nodes in the generalization lattice that were already visited during the generalization estimation phase in Section 4.1 can be skipped during the anonymization process of the master node. For example, consider the underestimation example in Figure 6b, in which the algorithm described in Section 4.1 estimates that the k-anonymity requirement is satisfied with the node $\langle O_2, P_0 \rangle$, even though in reality it is not. In this case, the master node conducts the anonymization process on the actual aggregated results, starting from the node $\langle O_1, P_1 \rangle$, and thus skips the nodes that were already visited during Phase I. This anonymization process continues until k-anonymity is satisfied. In the example in Figure 6b, the anonymization process stops at the node $\langle O_0, P_2 \rangle$, where k-anonymity is satisfied.
By contrast, an overestimation causes a loss of information in the released microdata because the quasi-identifier attributes are more generalized than necessary. With the anonymized results received from the slave nodes, the master node cannot detect whether an overestimation has actually occurred, and therefore returns more coarse-grained k-anonymous results to the user. This, in turn, leads to a loss in the data utility of the released microdata. That is, the algorithm proposed in this paper achieves a high level of efficiency in applying k-anonymity by trading some information loss for efficiency. However, as described in the experiment section, the proposed approach does not cause a significant reduction in the information of the released microdata, despite the occurrence of an overestimation, while achieving a high level of efficiency.
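The handling of an underestimation in Phase III can be sketched as a resumed lattice traversal that starts at the estimated node and skips every node visited during Phase I. The sketch makes one simplifying assumption that the text does not spell out: the master is able to re-generalize from ungeneralized quasi-identifier values. The bucket widths, rows, and function names are hypothetical.

```python
from collections import Counter
from itertools import product

def generalize(value, width):
    """Replace a value by the label of the fixed-width range that contains it."""
    if width == 1:
        return str(value)
    low = (value // width) * width
    return f"{low}-{low + width}"

def apply_node(rows, widths, node):
    """Generalize the quasi-identifiers of each row to the levels of a lattice node."""
    return [
        tuple(generalize(v, widths[i][lvl]) for i, (v, lvl) in enumerate(zip(row[:-1], node)))
        + (row[-1],)
        for row in rows
    ]

def is_k_anonymous(rows, k):
    """Check the actual class sizes; the last column is the sensitive attribute."""
    counts = Counter(row[:-1] for row in rows)
    return all(count >= k for count in counts.values())

def finalize(rows, widths, estimated_node, k):
    """Resume the bottom-up traversal at the estimated node until k-anonymity holds."""
    nodes = sorted(product(*(range(len(w)) for w in widths)), key=lambda n: (sum(n), n))
    for node in nodes[nodes.index(estimated_node):]:  # earlier nodes are skipped
        candidate = apply_node(rows, widths, node)
        if is_k_anonymous(candidate, k):
            return node, candidate
    return None, []

# Hypothetical aggregated results (Age, Zip, Disease) and per-level bucket widths.
widths = [[1, 10], [1, 10]]
rows = [(14, 3068, "Pneumonia"), (12, 3061, "Flu"), (18, 3075, "Flu"), (19, 3079, "Cold")]
node, released = finalize(rows, widths, estimated_node=(0, 1), k=2)
print(node, released)
```

In this run the estimated node (0, 1) is an underestimation, so the master moves on to (1, 0) and then stops at (1, 1), where both equivalence classes contain two records. An overestimation would pass the check at the first node and be returned unchanged, which matches the observation that the master cannot detect it.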

Experiment Evaluations
In this section, we describe the experimental evaluation of the performance of the proposed approach. First, we describe the experimental setup and then discuss the results.

Experiment Setup
To evaluate the proposed approach, we used the National Patients Sample (NPS) dataset from the Health Insurance Review and Assessment (HIRA) service in Korea [55]. The NPS dataset consists of electronic health records of 3% of the Korean people sampled in 2011. We randomly selected 5 M records with seven attributes (Age, Sex, Length of stay in the hospital, Location, Surgery status, Disease, and Height of patient) from the NPS dataset. We consider the first five attributes (Age, Sex, Length, Location, Surgery) to belong to $Q$, and the Disease attribute to belong to $S$. In the experiments, we focused on the following range query:
SELECT Age, Sex, Length, Location, Surgery, Disease FROM R WHERE $min_{height}$ ≤ Height AND Height ≤ $max_{height}$
Here, the values of $min_{height}$ and $max_{height}$ were varied during the experiments. In addition to reporting the experimental results for the method proposed in this paper (which is based on local anonymization in a distributed environment), we also report the results for the k-anonymity algorithm that is based on global anonymization in a distributed environment.
One way to evaluate the performance of the proposed approach is to implement the proposed scheme on commercial or open-source distributed DBMSs and conduct comprehensive experiments in real application environments. This, however, is out of scope at this stage of the research. Thus, in this paper, we simulated a distributed query processing environment as follows: we used a cluster with one master node and five slave nodes for the experiments. Each node has a 3.30 GHz CPU, and a 1 Gbps LAN is used for communication between nodes. The data used in the experiment are horizontally partitioned into five fragments, which are stored across the five slave nodes. That is, each slave node has a relation with the same set of attributes (i.e., Age, Sex, Length, Location, Surgery, Disease, and Height), and records are randomly and evenly distributed among the five slave nodes. Each slave node has its own local (standalone) DBMS, MySQL [56], managing local data. The communication between the master and a slave node is implemented using standard TCP/IP. Upon receiving a user query, the master node sends it to the slave nodes, which run in parallel. Then, each slave node runs the user query against the local data, and returns query results to the master node. We ran each query five times and the averaged values are presented in the paper.
The key observations based on Figure 7 can be summarized as follows: the proposed method (Est_kAnonymity, which is based on local anonymization in a distributed environment) significantly outperforms the original k-anonymity algorithm (kAnonymity, which is based on global anonymization in a distributed environment) in terms of the execution time. As the number of results increases, the performance gap between Est_kAnonymity and kAnonymity increases.
Figure 7 also shows whether the generalization level of each quasi-identifier attribute for k-anonymity is correctly estimated (marked with 'C' in the figure), or under- or overestimated (marked with a 'U' or 'O', respectively, in the figure) by the estimation method described in Section 4.1. As can be seen in Figure 7, an underestimation causes a slight increase in the execution time because the anonymization is applied on the actual query result dataset by the master node.
Figure 8a shows the loss metric (LM) [57] for a varying number of query results. Note that LM measures the amount of information that is lost due to a generalization of the quasi-identifier attributes, ranging from zero to one (a lower value is better). The proposed method, Est_kAnonymity, shows a very similar pattern to kAnonymity in terms of the LM. As can be seen in Figure 8a, the LM slightly increases with the proposed approach. In particular, the increases are observed when the generalization level is overestimated. This is because the overestimation causes the values of the quasi-identifier attributes to be more generalized than needed, which results in the increased LM. However, an underestimation does not lead to an increase in the LM. The reason is that, in the case of an underestimation, the values of the quasi-identifier attributes are less generalized than needed, and thus the master node applies further anonymization on the actual aggregated results until k-anonymity is satisfied, which corresponds to Phase III in Section 4.3. To further compare the LM between kAnonymity and Est_kAnonymity, we plot the LM results in Figure 8b, where the x-axis represents the LM quantity for kAnonymity and the y-axis represents the LM quantity for Est_kAnonymity. Here, the red circles correspond to the overestimation cases, whereas the blue circles represent either the underestimation or correct estimation cases.
As can be seen in Figure 8b, in most cases the circles are located on the dotted diagonal line, which indicates that there is no loss in data utility with the proposed approach. Even when an overestimation occurs, the red circles are located close to the diagonal line, indicating that the proposed approach does not cause a significant reduction in the information of the released microdata.
Figure 9 shows (a) the execution times and (b) the loss metric (LM) for varying values of k. During the experiments, the value of k was set to 3, 5, 9, and 13, and the values of min height and max height were set such that the number of query results was about 2.3 M. Key observations based on Figure 9 can be summarized as follows: as expected, the proposed method (Est_kAnonymity) significantly outperforms the original k-anonymity algorithm (kAnonymity) in terms of the execution time. The performance gap between Est_kAnonymity and kAnonymity widens as the value of k increases. The figure also shows whether the generalization level for k-anonymity is correctly estimated, underestimated, or overestimated (i.e., 'C', 'U', and 'O'). Once again, as shown in Figure 9b, the LM is slightly increased with the proposed approach, particularly when the generalization level is overestimated (i.e., k = 3, 5).
Finally, Figure 10 shows how the execution time (shown in Figure 9a) is split among three phases: (1) estimating the generalization level (Phase I); (2) executing a query and anonymizing the query results (Phase II); and (3) aggregating (and further anonymizing) the locally anonymized query results (Phase III). As can be seen in the figure, Phase III is the major contributor to the execution time in all cases. In particular, a significant increase in the execution time of Phase III is observed when the generalization level is underestimated (i.e., k = 13).
This is because, in the case of an underestimation, the master node must perform further anonymization on the aggregated results until k-anonymity is satisfied, which causes a significant increase in the execution time of Phase III. Note that Phases I and III are executed by the master node, whereas Phase II is executed by each slave node. Thus, the distributed nature of the presented algorithm affects the execution time of Phase II. That is, if more slave nodes are used, the execution time of Phase II will be reduced.
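The interplay between the phases, and the reason an underestimation makes Phase III more expensive, can be illustrated with the following minimal sketch. It is not the algorithm of Section 4.3: it assumes a single numeric quasi-identifier (Age), a hypothetical hierarchy in which each generalization level doubles the bucket width, and an estimated level that is simply passed in rather than computed from statistics as in Phase I.

```python
from collections import Counter

BASE_WIDTH = 5  # hypothetical bucket width at generalization level 0


def generalize_age(age, level):
    """Map an age (or the lower bound of an existing bucket) to the
    bucket [lo, hi) that contains it at the given generalization level."""
    width = BASE_WIDTH * 2 ** level
    lo = (age // width) * width
    return (lo, lo + width)


def is_k_anonymous(values, k):
    """True if every distinct generalized value occurs at least k times."""
    return all(count >= k for count in Counter(values).values())


def phase2_local(fragment_ages, level):
    """Phase II (each slave node): anonymize local query results at the
    level estimated in Phase I."""
    return [generalize_age(age, level) for age in fragment_ages]


def phase3_master(partials, level, k, max_level=6):
    """Phase III (master node): aggregate the locally anonymized results
    and, if the estimate was too low, keep generalizing until k-anonymity
    holds (or max_level is reached)."""
    merged = [value for part in partials for value in part]
    while not is_k_anonymous(merged, k) and level < max_level:
        level += 1
        # Buckets nest across levels, so re-bucketing the lower bound suffices.
        merged = [generalize_age(lo, level) for lo, _ in merged]
    return merged, level


# Hypothetical local query results on three slave nodes.
fragments = [[21, 22, 23], [26, 27], [31]]

# Underestimated level (0): the master needs extra passes in Phase III.
under, under_level = phase3_master(
    [phase2_local(f, 0) for f in fragments], level=0, k=2)

# Correctly estimated level (2): Phase III only aggregates.
exact, exact_level = phase3_master(
    [phase2_local(f, 2) for f in fragments], level=2, k=2)
```

In this toy run both cases end at the same level with the same released buckets, so the underestimation costs extra work on the master but no extra information loss, whereas starting from a level above 2 would satisfy k-anonymity immediately with coarser buckets. Phase II is the only part that is spread over the slave nodes, which is why adding slaves shortens that phase alone.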

Results and Discussion
The experimental results in this section verify that, with the proposed method, significant processing time gains can be achieved without a significant reduction in the information of the released microdata.
Figure 10. The way that the execution time is split among three phases when varying k.

Conclusions
Most existing privacy-preserving data publishing algorithms consider an offline data publishing scenario in which the data publisher first anonymizes the data in an offline manner, and then releases the anonymized data for public use. However, with the increasing demand for the sharing of microdata among various parties, an offline privacy-preserving data publishing scenario is insufficient to support the large volume of requests for the release of data. Instead, it is more desirable to integrate the data anonymization functionality into existing systems that are capable of supporting online query processing. In this paper, with the aim of supporting efficient online privacy-preserving data publishing, we presented a novel scheme that is able to efficiently anonymize the query results on the fly. In particular, given a user's query, the proposed approach estimates, based on statistical information, the generalization level of each quasi-identifier attribute needed to achieve the k-anonymity property in the query result datasets. The proposed algorithm achieves a high level of efficiency in applying k-anonymity at the cost of a slight increase in the information loss of the released microdata. The experimental results on a real dataset show that significant processing time gains can be achieved with the proposed method, while avoiding a significant reduction in the information of the released microdata. Future work will include an investigation into various types of queries containing complex operations, such as joins or aggregations.