In-Memory Data Anonymization Using Scalable and High Performance RDD Design

Recent studies in data anonymization techniques have primarily focused on MapReduce. However, existing MapReduce-based approaches often suffer from performance overheads caused by poorly balanced data allocation, expensive disk I/O and network transfer, and the lack of support for iterative tasks. We propose “SparkDA”, a novel anonymization technique designed to take full advantage of the Spark platform to generate privacy-preserving anonymized datasets efficiently. Our proposal offers better partition control, in-memory operation, and cache management for the iterative operations that are heavily used in data anonymization processing. It is based on Spark’s Resilient Distributed Dataset (RDD) with two critical RDD operations, FlatMapRDD and ReduceByKeyRDD. The experimental results demonstrate that our proposal outperforms the existing approaches in terms of performance and scalability while maintaining high data privacy and utility levels. This illustrates that our proposal can be used in a wider range of big data applications that demand privacy.


Introduction
The rapid growth of data from many domains (e.g., social media, smartphones, IoT) has brought in a new era where extracting potential information using data analytics and data mining has become a top business priority for many organizations. Such practices, however, have also raised data privacy concerns in the absence of appropriate data protection mechanisms.
Data anonymization approaches are used to conceal private information in such a way that identifiable (sensitive) information is buried among non-identifiable groups [1,2]. Many different data anonymization algorithms have been proposed for this purpose, including K-anonymization [3], l-diversity [4], t-closeness [5] and others [6,7].
The recent growth of big data has created a high demand for distributed processing platforms that are equipped with a core set of features, for example, scalable processing units, large execution engines, and high capacity storage. Many existing anonymization methods used to run on a single machine have been redesigned to work with these new platforms (e.g., MapReduce) as the size of the input data increases massively [8][9][10].
However, many existing studies show that data anonymization methods implemented on the MapReduce platform often have performance bottlenecks because the underlying platform lacks appropriate support for many core anonymization tasks. For example, MapReduce does not support allocating data across partitions on different nodes in a balanced fashion, which increases network overhead, and it does not support cache operations for saving the data produced while a task is still

• We provide a clearer example of the general approach involved in a basic data anonymization technique, with a flowchart added to assist the understanding of its main tasks. In addition, a mapping table is provided to further illustrate the relationship between the symbols and notations we use and database concepts.

• We provide a more detailed description of the two critical RDDs involved in our proposal, FlatMapRDD and ReduceByKeyRDD. These are designed to provide better partition management, in-memory access to the various data produced during the anonymization process, and effective cache management. We also describe how our RDD-based approach can reduce the significant overhead associated with its MapReduce counterparts.

• We provide a new performance comparison between our proposal and the most up-to-date existing K-anonymity-based approaches and show that our proposal offers a very competitive performance advantage.

• In addition to the utility metrics Discernibility Metric (DM) and Minimal Distortion (MD), we provide a new set of privacy metrics, Kullback-Leibler Divergence (KLD) and Information Entropy (I_E), to extensively investigate the privacy and utility trade-offs of our proposal.

• We also provide insights into the performance associated with the different memory management strategies offered by Spark. We discover that side effects can occur when demands for memory access are excessive.
The paper is structured as follows. In Section 2, we review recent related studies, while in Section 3, we describe the issues associated with the MapReduce approach along with a basic data anonymization technique as background. In Section 4, we discuss the details of our proposal along with the main algorithms involved in our RDDs. Section 5 describes the privacy and utility metrics we utilise and how we use them in the context of our proposal. In Section 6, we discuss the results of our experiments and the key findings. Section 7 provides the conclusion and planned future work.

Related Work
To address the overheads associated with MapReduce, a number of Spark-based approaches have been proposed in recent years. Reference [21] proposed the INCOGNITO framework for full-domain generalization using Spark RDDs. Though their experimental results illustrate improvements in scalability and execution efficiency, their proposal does not provide any insights into privacy and utility trade-offs. Anonylitics [19] utilized Spark's default iteration support to implement data anonymization. However, their approach does not address the potential memory exhaustion that arises when memory cannot accommodate the growing volume of intermediate data produced as the number of iterations increases. PRIMA [16] proposes an anonymization strategy for the Mondrian algorithm with Optimal Lattice Anonymization (OLA), which is used to define the utility and generalization level rules in order to limit the data utility loss. Reference [22] proposes a distributed Mondrian approach that splits the input data into partitions allocated to each node of the cluster by using Spark k-means. A series of Spark jobs runs on each cluster node to produce anonymized results. These anonymized results are later merged by another cluster node.
The study closest to ours is Reference [17], which provided a distributed Top-Down Specialization (TDS) algorithm to provide K-anonymity using Apache Spark. However, their solution focuses on addressing the scalability and partition management originally proposed by Reference [23]. They provide neither the details of the Spark features they utilized nor any insights into privacy and utility trade-offs. Al-Zobbi et al. [18] proposed a sensitivity-based anonymization using user-defined functions in Spark. The authors provide a strategy for reducing data transmission between memory and disk based on serialized data objects implemented with RDDs, and validate that a Spark-based approach can be many times faster than MapReduce counterparts such as Reference [11].

Background
We first compare MapReduce and Spark and discuss the issues involved in each. This is followed by a description of the main tasks involved in a basic data anonymization strategy (e.g., Datafly [24]).

MapReduce vs. Spark
For many years before Spark, Hadoop MapReduce [8] was a widely used distributed processing platform for many big data applications. The fundamental building blocks of MapReduce are Map and Reduce. At the start, MapReduce divides the (large) input data into several smaller chunks. Each chunk of data (i.e., typically a collection of records) is assigned to one of multiple mappers. The data contained in a mapper is organized as key-value pairs. Each mapper processes the data based on the key-value pairs, and the results, often called intermediate data, are stored on the local disk where the mapper resides. Once the processing of all mappers is complete, a reducer reads the results from all mappers. Figure 1a shows the full execution cycle of a MapReduce job and the data movements involved at each phase. We argue that many performance overheads occur while MapReduce executes a job, especially in the following phases.

Figure 1.

• Problem 4: For a task of an iterative nature, the result is first written to the local disk. If this result needs to be used again in a subsequent iteration, the mapper needs to access the disk again for each iteration. This architectural design is not only ineffective but also results in a tremendous performance bottleneck, as it can cause a severe execution queue. To avoid the queue, a MapReduce developer has to manually create a series of sequential MapReduce jobs for the mappers. Even with this choice, each iteration often has to wait for the previous one to complete (due to the issue discussed in Problem 1).
Spark utilises Resilient Distributed Datasets (RDDs) as the building block for processing Spark jobs. RDDs hold immutable collections of records which are partitioned and can be processed separately in parallel. Similar to MapReduce, the input data is split into several smaller blocks. Each block can then be further divided into several partitions. An input RDD is created to hold all the partitions at the beginning. Spark then assigns partitions in a manner that accounts for the processing capability of each worker node, so that each node holds the number of partitions it can process most effectively. This capability of Spark can reduce the issue associated with Problem 1 discussed earlier.
Once the initial partition allocation is complete, more RDDs are created to process the data contained in each partition; this is called a transformation in Spark. The intermediate data created by each RDD transformation is written to memory and referenced as necessary. This in-memory access can effectively reduce the performance overhead we discussed in Problems 2 and 4.
In MapReduce jobs, the execution on each node happens as a separate unit of work. The result of each node, the collection of intermediate data, is not shared but is written out at each node due to the data locality principle of MapReduce. The only way to share the intermediate data with a reducer is via data transfer across the network. Spark offers data sharing across different RDDs, including the results produced by previous stages and the intermediate data produced by different RDDs. This feature of Spark can address the concerns we discussed in Problems 3 and 4.
The execution flow of Spark is illustrated in Figure 1b, from reading the input data into memory, to processing the data in different partitions, and then processing the partitions through RDD transformations.

Data Anonymization
Data anonymization refers to the process of transforming a set of original data into anonymized data in such a way that uniquely identifiable attributes are no longer present in the anonymized dataset while statistical information about the original dataset is preserved. Two separate techniques are used for data transformation: generalization and suppression.

• Generalization replaces the value of an attribute with a less specific value. A Domain Generalization Hierarchy (DGH), which is typically defined and provided by a domain expert, is used to find the granularity of the generalization levels to be applied to each attribute.
• Suppression, similar to generalization, replaces the original attribute with a value that does not release any statistical information about the attribute at all.

Figure 2 demonstrates a generalization approach for applying the generalization levels (GLs) defined in a DGH. For example, GL0 represents the first level of generalization while higher levels of generalization are represented by GL2 and GL3. "*" is an example of suppression, which appears in many attributes as the highest generalization level. Each additional masked digit corresponds to one more generalization level: for example, 114* represents GL1, while 11**, 1*** and * represent GL2, GL3, and GL4, respectively.

Though many variations of data anonymization methods have been proposed, our approach follows one that is similar to Datafly [24]. The flow of the Datafly algorithm is depicted in Figure 3. In this approach, data anonymization starts by counting the frequency, i.e., the number of appearances in the record set, over the Quasi-Identifier attributes (QID). The QID refers to a set of attributes that can uniquely distinguish an individual (e.g., age, date of birth, or address). Starting from the attribute with the highest frequency count, the technique generalizes each attribute until the K-anonymity constraint [3] is fully satisfied.

Table 1 illustrates the iterations in which generalization is applied, from the original data to a fully anonymized dataset. It starts with the original data depicted in Table 1 (a). The original data is transformed based on counting the frequency of unique attributes and the frequency of unique tuples.
Table 1 (b) now contains the frequency counts. Starting from the attribute with the highest frequency count, generalization based on the DGH (an example is shown in Figure 2) is applied. For example, the attribute "Age" is generalized first because it has the highest frequency count, at 6. Table 1 (c) depicts partially anonymized data. Note that multiple levels of generalization can be performed at this stage as long as this does not violate the K-anonymity constraint. The final, fully anonymized result is presented in Table 1 (d), which meets the K = 2 constraint.
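The digit-masking style of generalization described above (114*, 11**, 1***, *) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `generalize_code` and `satisfies_k`, and the sample value "1141", are hypothetical and do not come from the paper, and a real DGH supplied by a domain expert may be richer than simple character masking.

```python
from collections import Counter


def generalize_code(value: str, level: int) -> str:
    """Mask the last `level` characters of `value` with '*'.

    Level 0 returns the value unchanged; masking every character
    collapses to a single '*', i.e., full suppression.
    (Hypothetical stand-in for a lookup in a real DGH.)
    """
    if level <= 0:
        return value
    if level >= len(value):
        return "*"
    return value[: len(value) - level] + "*" * level


def satisfies_k(values, k: int) -> bool:
    """True when every distinct value appears at least k times."""
    return all(count >= k for count in Counter(values).values())


if __name__ == "__main__":
    for gl in range(5):
        print(gl, generalize_code("1141", gl))
```

With the assumed value "1141", levels 1 to 4 yield 114*, 11**, 1*** and *, matching the pattern described for Figure 2. `satisfies_k` is the check that decides whether another generalization round is needed.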

SparkDA
In this section, we describe the details of our approach, named SparkDA. We first describe the symbols and notations we use. Then, we describe our two RDDs, FlatMapRDD and ReduceByKeyRDD, and the algorithms each of the RDDs executes.

Basic Symbols and Notations
The elements of the data across different scopes are outlined using the symbols and notations in Table 2. The mapping diagram of our proposed notations to a relational database concept is demonstrated in Figure 4. A set that contains freq(qid_tuple) associated with a qid_tuple,

RDD-Based Data Anonymization
In our proposed approach, a data anonymization technique is implemented through the use of two Spark RDD transformations, FlatMapRDD and ReduceByKeyRDD, respectively.

FlatMap Transformation (FlatMapRDD)
The overall purpose of the FlatMapRDD is to compute the frequency of both the distinct attributes and the distinct tuples for all quasi-identifiable attributes. The frequency counts are then used to decide if further anonymization is necessary.
Algorithm 1 illustrates the working of the FlatMapRDD algorithm. The algorithm starts by loading the input data into QID_Tuple. At this initial stage, QID_Tuple contains the original quasi-identifiable attributes. The first part of the algorithm (steps 2-8) identifies the frequency counts. To do this, it first measures the size of QID_Tuple to compute the total number of qid_tuple(s) it contains (step 3). The current qid_tuple is then compared to the next qid_tuple. If the two match, the frequency count is incremented by 1. This is repeated for every qid_tuple within QID_Tuple. The algorithm does not update the frequency count if the qid_tuple and the subsequent qid_tuple differ, as this indicates two different records. When the iteration through QID_Tuple completes, the frequency counts of each unique tuple are saved in FreqSet (step 7). It should be noted that Spark sorts the qid_tuple(s) within the partition of each executing node, so the frequency count of each qid_tuple always equals the number of times that qid_tuple appears in the dataset, and the total frequency count over all qid_tuple(s) equals the number of records in the dataset.
The second part of the algorithm (steps 9-22) identifies the count of distinct attributes within a QID. To do this, it first measures the size of QID_Tuple to compute the total number of QID(s) it contains. Subsequently, the current qid is compared to the next qid. If the two qid(s) match, the distinct qid count is incremented by 1. This is repeated for every qid in the QID. When the iteration through the QID(s) completes, the distinct counts of each unique attribute for all qid(s) are saved in dint_qid-cnt (step 22). The algorithm returns FreqSet and dint_qid-cntSet along with QID_Tuple to ReduceByKeyRDD.
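The two counts produced at this stage can be illustrated with a small, Spark-free Python sketch. It emulates a flatMap followed by a reduce-by-key using only the standard library. The helper names (`flat_map`, `reduce_by_key`, `tuple_frequencies`, `distinct_attribute_counts`) and the sample rows are hypothetical, and the sketch counts by key rather than comparing adjacent sorted tuples as Algorithm 1 does, so it should be read as an approximation of the idea, not as the authors' implementation.

```python
from collections import Counter
from itertools import chain


def flat_map(func, records):
    """Apply func to each record and flatten the emitted lists."""
    return list(chain.from_iterable(func(r) for r in records))


def reduce_by_key(pairs):
    """Sum the values of (key, value) pairs that share a key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def tuple_frequencies(qid_tuples):
    """Frequency of each distinct quasi-identifier tuple."""
    return reduce_by_key(flat_map(lambda t: [(t, 1)], qid_tuples))


def distinct_attribute_counts(qid_tuples, names):
    """Number of distinct values observed for each attribute."""
    pairs = flat_map(
        lambda t: [((name, value), 1) for name, value in zip(names, t)],
        qid_tuples,
    )
    per_value = reduce_by_key(pairs)  # keys are (attribute, value)
    return dict(Counter(name for name, _ in per_value))


if __name__ == "__main__":
    rows = [("25", "1141"), ("25", "1141"), ("31", "1142"), ("47", "1142")]
    print(tuple_frequencies(rows))
    print(distinct_attribute_counts(rows, ("age", "zip")))
```

In this toy input the tuple frequencies sum to the number of records (4), which mirrors the property stated above for the frequency set.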

ReduceByKey Transformation (ReduceByKeyRDD)
The overall aim of the ReduceByKeyRDD is to execute an RDD transformation by applying a generalization level using the information contained in FreqSet and dint_qid-cntSet. The RDD transformation can be interpreted as the changes made to the original data in Table 1 (a), through Table 1 (b) and Table 1 (c), until it reaches the results seen in Table 1 (d). We introduce an "anonymization status" (represented by the variable anonymization_s) to keep track of whether a given QID_Tuple, which contains the latest anonymization results, is fully anonymized or whether further anonymization processing is necessary. Algorithm 2 illustrates the working of the ReduceByKeyRDD algorithm. To start the algorithm, the combination (DGH, K), which contains the taxonomy tree and the K-anonymity constraint, is received via a broadcast sent by the driver node. DGH is then used to retrieve the generalization level (GL) for each quasi-identifiable attribute. This is described in steps 3-4.
The first part of the algorithm (steps 6-18) applies a single generalization level to all quasi-identifiable attribute sets. Applying a generalization level is repeated while the frequency count (freq(qid_tuple)) remains below K and the maximum generalization level (MAX(GL_qid)) has not been exceeded. Generalization is applied to attributes in descending order of their distinct attribute counts, starting from the highest (MAX(dint_qid-cnt)). The anonymization status is set to false while a generalization level is being applied.
The second part of the algorithm (steps 21-26) applies suppression to all attributes of any tuple that still violates the K-anonymity constraint, to ensure that no distinguishable tuple remains. At this point all anonymization, including suppression, is complete, so the anonymization status is set to true. As seen in step 29, the anonymized results are sent back to the FlatMapRDD along with the anonymization status. Upon receiving the updated QID_Tuple, which now contains the anonymized data, the FlatMapRDD recomputes the frequency counts of the distinct tuples and the distinct attributes only if the anonymization status is still set to false.
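Two decisions in this stage lend themselves to a short sketch: which attribute to generalize next, and which tuples to suppress once no further generalization is possible. The plain-Python functions below are a hedged illustration; the names `next_attribute` and `suppress_violations` and the dictionary-based bookkeeping are assumptions made for this example and are not taken from Algorithm 2.

```python
from collections import Counter


def next_attribute(distinct_counts, levels, max_levels):
    """Pick the attribute with the most distinct values that can
    still be generalized; return None when every attribute is at
    its maximum generalization level."""
    candidates = [a for a in distinct_counts if levels[a] < max_levels[a]]
    if not candidates:
        return None
    return max(candidates, key=lambda a: distinct_counts[a])


def suppress_violations(qid_tuples, k):
    """Replace every attribute with '*' in tuples that occur fewer
    than k times; other tuples are returned unchanged."""
    freq = Counter(qid_tuples)
    return [
        t if freq[t] >= k else tuple("*" for _ in t)
        for t in qid_tuples
    ]


if __name__ == "__main__":
    counts = {"age": 3, "zip": 2}
    print(next_attribute(counts, {"age": 0, "zip": 0}, {"age": 2, "zip": 4}))
    rows = [("2*", "114*"), ("2*", "114*"), ("4*", "11**")]
    print(suppress_violations(rows, 2))
```

One caveat of this simplified suppression is that the suppressed tuples form their own group, which may itself contain fewer than K records; handling that case is outside the scope of the sketch.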

Overall SparkDA Scheme
In this section, we describe the overall process of our proposed approach that includes both the data anonymization process by two RDDs we described earlier and how these RDDs interact with other parts of the program.
The overall algorithm for SparkDA is illustrated in Algorithm 3. The algorithm first reads user-defined information such as K (i.e., the K-anonymity constraint) and DGH (i.e., the definition of the generalization hierarchy), as depicted in steps 3-4. K and DGH are used as global variables that are shared across all Spark worker nodes involved in processing the RDDs. Spark supports a broadcast mechanism to send global variables to the worker nodes.
The original data file is read from HDFS and saved into an InputRDD (step 1). The InputRDD pre-processes the input data so that it is easier for other RDDs to process. For example, the input data is divided into two datasets: one contains all quasi-identifiable attributes (QID_Tuple-RDD) while the other contains all sensitive attributes (SA-RDD) (step 6). We cache SA-RDD and QID_Tuple-RDD as they are used in many subsequent processing steps. At this stage, the anonymization status is set to false (step 5).
As depicted in steps 9-14, the two RDDs involved in the data anonymization process, FlatMapRDD and ReduceByKeyRDD, then execute interactively many times. The anonymization process completes when the fully anonymized dataset QID_Tuple is returned from ReduceByKeyRDD with the anonymization status set to true. The anonymized dataset, consisting of the generalized and distinct qid_tuple(s) contained within QID_Tuple, is finally joined with the corresponding SA-RDD (step 16). The details of the Spark execution cycle for the overall SparkDA operations are depicted in Figure 5.
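The control flow of the overall scheme (count, generalize, repeat, then join with the sensitive attributes) can be sketched as a single-machine loop in plain Python. This is a simplified illustration under several assumptions: `mask` is a hypothetical digit-masking stand-in for a DGH lookup, `max_levels` is an assumed per-attribute limit, and no Spark broadcast, caching, or partitioning is modelled.

```python
from collections import Counter


def mask(value, level):
    """Hypothetical DGH: mask the last `level` characters with '*'."""
    if level <= 0:
        return value
    if level >= len(value):
        return "*"
    return value[: len(value) - level] + "*" * level


def anonymize(qid_tuples, k, max_levels):
    """Generalize one attribute per round until every tuple occurs
    at least k times, suppressing leftovers when no level remains."""
    levels = [0] * len(max_levels)
    while True:
        current = [
            tuple(mask(v, lvl) for v, lvl in zip(t, levels))
            for t in qid_tuples
        ]
        freq = Counter(current)
        if all(count >= k for count in freq.values()):
            return current  # anonymization status: true
        open_attrs = [i for i, lvl in enumerate(levels) if lvl < max_levels[i]]
        if not open_attrs:
            # Nothing left to generalize: suppress violating tuples.
            return [t if freq[t] >= k else ("*",) * len(t) for t in current]
        # Generalize the attribute with the most distinct values.
        target = max(open_attrs, key=lambda i: len({t[i] for t in current}))
        levels[target] += 1


if __name__ == "__main__":
    qids = [("25", "1141"), ("27", "1142"), ("31", "1151"), ("38", "1152")]
    sensitive = ["<=50K", ">50K", "<=50K", ">50K"]
    anonymized = anonymize(qids, k=2, max_levels=[2, 4])
    # Final step: re-attach the sensitive attribute to each record.
    for record in zip(anonymized, sensitive):
        print(record)
```

For the four assumed records, two rounds of generalization (first the age, then the code) are enough to reach K = 2 without any suppression.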

Privacy vs. Utility Trade-Offs
We used the following privacy and utility metrics to validate and understand the trade-offs between these two. In the study of understanding the success of a data anonymization technique, a privacy level is measured by identifying the uniqueness of data. With that, a low privacy typically means that it is easy to identify an individual (an attribute, tuple or record) from a group (e.g., many records are unique) while a high privacy indicates that it is (more) difficult to uniquely identify an individual from a group (e.g., there are many records sharing the same values). A utility level is measured by calculating the level of degradation in accuracy of value between the original value (i.e., baseline) and the anonymized value (i.e., sanitized).

Kullback-Leibler Divergence (KLD)
KLD is utilized for understanding the likelihood of the presence of the original attribute in the anonymized attribute for each record [25]. For example, assume that the original attribute of the age 24 is anonymized into a range of 20-59. The KLD can measure how likely it is to guess the original age 24 from the range 20-59. Note that we use the term "likelihood" instead of "probability" to indicate that our calculation is done on past events with known outcomes (i.e., the anonymized dataset). We measure KLD on the fully anonymized dataset by (1) calculating the likelihood of the presence of each attribute, (2) summing up the values of (1) for each attribute within a record, and then repeating steps (1) and (2) for all records. Here, P_InputRDD(r) indicates the sum of the likelihood of the presence of the original attribute within the original data (at a record level). P_InputRDD at this stage has very high data utility and no privacy as no changes have been made. P_AnonymizedRDD(r) indicates the sum of the likelihood of the presence of the original attribute within the anonymized record. P_AnonymizedRDD has usually lost some degree of data utility and gained some degree of privacy because the data in this set has changed from the baseline after an anonymization technique is applied.
The KLD value starts at 0, which indicates that the original record and the anonymized record are the same. An increasing KLD value indicates a higher level of privacy assurance. With a lower KLD value, it is easier to identify the original value from the matching anonymized value (i.e., low privacy).
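Since the equation itself is not reproduced in this text, the following Python sketch uses the standard Kullback-Leibler form, the sum of p * log(p / q), as an assumption about how the two likelihoods are compared. The function name `kl_divergence` is hypothetical, and the uniform-likelihood treatment of the age range in the example is an illustrative assumption, not the authors' stated procedure.

```python
import math


def kl_divergence(p, q):
    """Standard KL divergence: sum of p_i * log(p_i / q_i).

    Terms with p_i == 0 contribute nothing; a zero q_i paired
    with a positive p_i makes the divergence infinite.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue
        if q_i == 0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


if __name__ == "__main__":
    # Identical likelihoods: no change between original and anonymized.
    print(kl_divergence([0.5, 0.5], [0.5, 0.5]))
    # Age 24 generalized to 20-59: assuming each of the 40 ages in the
    # range is equally likely, the original value has likelihood 1/40.
    print(kl_divergence([1.0], [1 / 40]))
```

The first call returns 0, consistent with the statement that a KLD of 0 means the original and anonymized records are the same. The second grows with the width of the generalized range, which matches the reading that a larger KLD indicates more privacy.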

Information Entropy (I_E)
I_E is used to measure the degree of uncertainty in identifying the original value from the anonymized value within the QID attributes [26]. The entropy value of I_E is 1 if all the qid attributes are identical in the anonymized dataset for the same QID. The I_E(QID) value can be calculated by (1) calculating the likelihood of the presence of the original attribute in a record, (2) computing the sum of the values of step (1) for each attribute in a record (denoted as P_AnonymizedRDD(qid)), (3) repeating steps (1) and (2) for each QID, and (4) computing the sum of these values for all records. Note that if all attributes are changed between the original record and the anonymized record, the value of P_AnonymizedRDD is 1.
Equation (2) gives I_E(QID) for a single QID; however, we are interested in the I_E for the whole AnonymizedRDD. Thus, we calculate the I_E for AnonymizedRDD by taking the average over all QIDs. The entropy value of I_E is 0 if there are two identical records between the original dataset and the anonymized dataset for a matching equivalence class. The maximum value of I_E is achieved when the original record set is completely different from the anonymized record set for a given QID. A higher value of I_E represents more uncertainty (i.e., higher privacy).

Discernibility Metric (DM)
DM reports the data quality resulting from the degree of data degradation, as a result of data anonymization, of an individual tuple based on an equivalence class. Let EC be the set of equivalence classes of a K-anonymized dataset, and let EC_i be one of the equivalence classes in EC. The DM metric can be expressed more formally for AnonymizedRDD as follows:
where i represents a qid_tuple within an equivalence class. The data utility is associated with the DM score. A high DM score means that the data utility is low (i.e., the original qid_tuple has lost its original values), while a low DM score means that the data utility is high.

Average Equivalence Class Size Metric (C_AVG)
C_AVG measures the data utility of attributes by calculating the average size of the equivalence classes. A higher data utility is typically achieved when the size of the equivalence class is bigger because it is more difficult to distinguish an attribute when there is a large number of attributes. Therefore, the C_AVG scores are considered to be sensitive to the K group size [27]. We calculate C_AVG according to AnonymizedRDD as follows.
where |AnonymizedRDD| denotes the total number of records within the anonymized set while the total number of equivalence classes is denoted by |EC|.
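Because the DM and C_AVG equations are not reproduced in this text, the sketch below falls back on the commonly used definitions as an assumption: DM as the sum of squared equivalence class sizes, and C_AVG as the number of records divided by the number of equivalence classes, divided again by K. The scores reported later in the paper appear to be normalized, so these functions should not be read as the authors' exact formulas. All function names are hypothetical.

```python
from collections import Counter


def equivalence_class_sizes(qid_tuples):
    """Size of each equivalence class (group of identical tuples)."""
    return list(Counter(qid_tuples).values())


def discernibility(qid_tuples):
    """Assumed standard DM: sum of squared equivalence class sizes."""
    return sum(size * size for size in equivalence_class_sizes(qid_tuples))


def average_class_size(qid_tuples, k):
    """Assumed standard C_AVG: (records / classes) / k."""
    sizes = equivalence_class_sizes(qid_tuples)
    return (len(qid_tuples) / len(sizes)) / k


if __name__ == "__main__":
    rows = [("2*", "114*")] * 2 + [("3*", "115*")] * 3
    print(discernibility(rows))
    print(average_class_size(rows, k=2))
```

For the assumed input with classes of size 2 and 3, the unnormalized DM is 13 and C_AVG with K = 2 is 1.25; a C_AVG of 1 would mean that every class is exactly of size K.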

Minimal Distortion (MD)
MD measures the data utility of every quasi-identifiable attribute (qid) in a tuple (qid_tuple). It defines data utility by measuring how many qid(s) in a qid_tuple have been made indistinguishable. This is done by measuring the level of distortion of each qid with respect to its generalization level [28]. We calculate the distortion of the qid_tuple of AnonymizedRDD in comparison to InputRDD by using the following equation.
where |D| depicts the number of tuples in InputRDD. Equation (5) defines MD for complete dataset. The overall distortions between the anonymized dataset and the original dataset can be minimized by decreasing the K group size.

Precision Metric (PM)
As cited in Reference [24], PM calculates the least distorted combination of attribute and tuples from anonymized records. PM is typically considered to be sensitive to the GL. We define the equation for PM score according to AnonymizedRDD as follows.
where GL represents a generalization level (including suppression) which is defined in the DGH.
Attributes associated with a higher generalization level tend to provide a better precision score than attributes with a lower generalization level.

Experimental Results
This section first describes our experimental setup, including the datasets and the system environment configurations. Then, we discuss the privacy and utility scores we obtained. The comprehensive experimental results on scalability, performance, and the impact of the different cache management strategies of Spark follow.

Datasets
In our study, we used two datasets: the US Census dataset (i.e., the Adult dataset) [29] and the Irish Census dataset [30]. We synthesized these datasets to increase the number of records in order to investigate different aspects of performance. We used "Benerator", a Java-based open-source tool, and the guideline from Reference [31] to generate the synthesized datasets. Table 3 illustrates the details of both datasets, including the quasi-identifiable attributes (QID), the number of distinct values, and the generalization levels. The sensitive attributes are set to "Salary" in the Adult dataset and "Field of Study" in the Irish dataset.

System Environment Configurations
Our experiments were run on two different platforms. The first set of experiments was executed in a distributed processing environment using Spark, while the other set was executed on a standalone desktop. The latter was used to validate the comparability of data privacy and utility. The expectation was that the data privacy and utility scores should stay the same between the two sets of experiments. We used Spark 2.1, where Yarn and the Hadoop Distributed File System (HDFS) were configured using Apache Ambari. HDFS was used to distribute data across a NameNode (which worked as a master node), a secondary NameNode, and six DataNodes (which worked as worker nodes). 3 GB of memory was allocated to the Yarn NodeManager, while 1 GB of memory was configured for each of the ResourceManager, Driver, and Executor. Table 4 provides the details of the Spark cluster and standalone desktop setups. The standalone desktop ran Windows 10. All experiments were run at least 10 times and the average was used to ensure the reliability and consistency of the results.

Privacy and Utility
We discuss the results of the privacy and utility metrics in this section. The experimental details are given in Table 5.

Privacy Results
The results of the KLD metric on the Adult dataset are shown in Figure 6a. The results show that the KLD values stay identical between the Spark and standalone environments, which means that implementing data anonymization in Spark did not affect the privacy level. The KLD values only increased from around K group size 2 to 5. After K-value (i.e., group size) = 5, the KLD values remain the same for the rest of the K group sizes. The visible increase of KLD from K-value 2 to 5 (and the slight changes from 5 onward) is due to generalization levels still being actively applied. At approximately K-value 10, all generalization has been applied and there are no more changes for the remaining K-values, thus the KLD value remains identical.
The results of the KLD metric on the Irish dataset are shown in Figure 6b. In general, the changes in KLD values are similar to those of the Adult dataset. However, we observe that the average KLD values are much higher in the Irish dataset than in the Adult dataset. This is because the Irish dataset has more generalization levels for each QID, which increases the chance that more QIDs share the same value. This increases the privacy level.

Our investigation reveals that the Adult dataset contains a relatively small number of different QIDs which share the same value as the result of anonymization. Smaller K values affect the I_E value more than larger K values because of the number of identical values in the QID attributes. This results in a higher I_E value, as it is easier to identify a unique record within the same equivalence class than in the Irish dataset, which has a larger number of different QIDs that share the same value.

Utility Results
We illustrate the results of data utility metrics, based on the results obtained from Adult dataset Figure 8a,c,e,g and from Irish dataset Figure 8b,d,f,h.
We first discuss the data utility results of Adult dataset. The overall DM scores produced by both Spark and standalone are relatively high at 0.9. Recall that DM measures the data utility of tuples within an equivalent class. It is expected that the increased in the K group size would result in the increase in the equivalent class. As the equivalent class becomes larger, there will be more changes to make tuples to be more indistinguishable which would result in a high DM score-the results represented in Figure 8a. In addition, there is a sudden increase in the DM score approximately around K = 5 both in the Spark and standalone. This illustrates that at K = 5 and onwards the degradation of data has reached the maximum and there is no more generalization/suppression to be applied (i.e., data utility is at the lowest).
The trend of the C_AVG scores is similar to that of DM, as both metrics are calculated from the sizes of the equivalence classes. We observe that the data utility scores decline as the K group size increases, because there are more matched distinct attributes. The average penalty appears to level off at around K = 10, with no further changes in generalization. The rationale is that at this point there are no more generalizations or suppressions to apply to an equivalence class. As a consequence, the average penalty for an equivalence class drops as the K group size grows. This is seen in Figure 8b.
Figure 8c illustrates the results of MD, which measures data utility based on the changes made to tuples between the original dataset and the anonymized dataset. The MD score is expected to increase with the K group size because more attributes in a tuple no longer match between the original dataset and the anonymized dataset. MD tends to be more sensitive to generalization levels because attributes generalized to higher levels undergo more dramatic changes.
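The effect described for the average equivalence class size metric (C_AVG) can be reproduced with a few lines of code. The sketch assumes the commonly used definition, namely the average class size divided by K; the function name is hypothetical and the records are made up.

```python
from collections import Counter


def average_class_size_penalty(anonymized_qids, k):
    """C_AVG = (|D| / number of equivalence classes) / k."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    classes = Counter(anonymized_qids)
    return (len(anonymized_qids) / len(classes)) / k


# Twelve made-up records that are already fully generalized into one class.
fully_generalized = [("*", "*")] * 12

for k in (2, 3, 6):
    print(k, average_class_size_penalty(fully_generalized, k))
```

Once the equivalence classes stop changing, the numerator is fixed, so the penalty can only fall as K grows. This is the drop in average penalty noted above.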
The Precision Metric (PM), shown in Figure 8d, captures the level of distortion at the record level (i.e., the combination of tuples and attributes). The PM score is expected to rise as the K group size increases, since more records have lost their original values. The PM score is highly sensitive to the GL applied to each qid. Figure 8d shows that the PM score increases with the K group size for both Spark and standalone, because the GL applied to each qid is raised towards its highest level as the K group size increases. From K = 25 onward, the qids appear to have been generalized to their highest level, as the PM score stays the same.
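The sensitivity of PM to generalization levels can be sketched as follows. Precision is commonly written as 1 minus the average ratio between the generalization level applied to a cell and the height of that attribute's generalization hierarchy. Because the PM score discussed here rises with K, the sketch assumes that the plotted quantity corresponds to the distortion term itself; this reading, the function names, and the hierarchy heights are assumptions for illustration.

```python
def average_distortion(levels, heights):
    """Mean of (applied level / hierarchy height) over all QID cells.

    `levels[i][j]` is the generalization level applied to QID j of record i;
    `heights[j]` is the height of the generalization hierarchy of QID j.
    """
    total = sum(
        level / heights[j]
        for row in levels
        for j, level in enumerate(row)
    )
    return total / (len(levels) * len(heights))


def precision(levels, heights):
    """Precision as 1 minus the average distortion."""
    return 1.0 - average_distortion(levels, heights)


# Hypothetical hierarchy heights for two QIDs.
heights = [4, 2]
halfway = [[2, 1], [2, 1]]   # every cell generalized halfway up
at_top = [[4, 2], [4, 2]]    # every cell generalized to the top level

print(average_distortion(halfway, heights))
print(average_distortion(at_top, heights))
```

When every qid has reached the top of its hierarchy, the distortion cannot grow any further, which is consistent with the flat PM curve at large K.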

Scalability, Performance and Caching
We ran three sets of experiments to understand scalability, performance, and cache management as shown in Table 6. The execution time for running both FlatMapRDD and ReduceByKeyRDD was measured.

Scalability
In the first set of experiments, we measure the scalability of SparkDA on the Adult and Irish datasets by varying the number of QIDs. Before the scalability test, we ran an experiment that increases the K group size for a fixed number of QIDs to understand the relationship between the execution time and the K group size. The results show that the execution time is not noticeably affected by increasing the K group size. This can be explained as follows. The number of iterations needed to go from the original data to the fully anonymized dataset is determined by the frequency of distinct tuples. Increasing the K group size increases the number of tuples. With a fixed number of QIDs, however, the additional tuples are not necessarily distinct, so the frequency count stays the same. With the frequency count unchanged, the same number of operations is performed irrespective of the K group size, and thus the execution time stays the same.
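The role of the frequency count can be illustrated with plain Python stand-ins for the flatMap and reduceByKey steps. This is a minimal sketch, not SparkDA's implementation: `flat_map`, `reduce_by_key` and `qid_frequencies` are hypothetical helpers that mimic the semantics of the corresponding Spark transformations on an in-memory list, and the records are made up.

```python
from itertools import chain


def flat_map(func, records):
    """Apply `func` to every record and flatten the resulting lists."""
    return list(chain.from_iterable(func(r) for r in records))


def reduce_by_key(func, pairs):
    """Merge the values of each key with the binary function `func`."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc


def qid_frequencies(records, qid_indices):
    """Count how often each distinct QID tuple occurs."""
    pairs = flat_map(
        lambda r: [(tuple(r[i] for i in qid_indices), 1)], records
    )
    return reduce_by_key(lambda a, b: a + b, pairs)


# Made-up records: (age range, sex, diagnosis); the first two are the QIDs.
records = [
    ("30-39", "M", "flu"),
    ("30-39", "M", "cold"),
    ("40-49", "F", "flu"),
]
base = qid_frequencies(records, qid_indices=(0, 1))
doubled = qid_frequencies(records * 2, qid_indices=(0, 1))

# More tuples, but the same set of distinct QID tuples to iterate over.
print(len(base), len(doubled))
```

In this toy run the number of distinct keys is identical for both inputs, so any per-key work that follows the count would be repeated the same number of times.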
In contrast, as soon as we increase the number of QIDs, the execution time starts to increase. This is because processing a QID involves applying generalization levels after counting the number of distinct attribute values, which requires many iterative operations. Adding more QIDs generates more operations, so the execution time grows with the number of QIDs, as shown in Figure 9a. We examined the details of the different QIDs in both datasets. There appears to be a strong relationship between the distinctness of the quasi-identifiers (often referred to as cardinality) and the execution time. For example, the execution time increases sharply between Q4 and Q5 in the Adult dataset. The new attribute "Occupation" in Q5 has a high cardinality, which affects the execution time. In addition, execution times are higher for the Adult dataset, as this dataset appears to have more variation in distinct values.
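The link between cardinality and workload can be checked on any dataset with a couple of helper functions. The sketch below is illustrative: the function names are hypothetical and the five records are made up rather than taken from the Adult dataset.

```python
def cardinality_per_attribute(records, attribute_names):
    """Number of distinct values observed for each attribute."""
    return {
        name: len({r[i] for r in records})
        for i, name in enumerate(attribute_names)
    }


def distinct_combinations(records, n_qids):
    """Number of distinct tuples formed by the first `n_qids` attributes."""
    return len({tuple(r[:n_qids]) for r in records})


attribute_names = ["Age", "Sex", "Occupation"]
records = [
    ("30-39", "Male", "Sales"),
    ("30-39", "Male", "Tech-support"),
    ("30-39", "Female", "Craft-repair"),
    ("40-49", "Female", "Exec-managerial"),
    ("40-49", "Female", "Sales"),
]

print(cardinality_per_attribute(records, attribute_names))
for n in (1, 2, 3):
    print(n, distinct_combinations(records, n))
```

In this toy data, adding the high-cardinality third attribute makes every record distinct, so the number of QID tuples that must be counted and generalized jumps. This is the kind of effect observed between Q4 and Q5.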

Performance
The second set of experiments evaluates the performance of our proposal. We first compare our approach against existing data anonymization approaches: Spark-based multi-dimensional sensitivity-based anonymization (Spark MDSBA) [18], MapReduce-based multi-dimensional sensitivity-based anonymization (MR MDSBA) [15], Apache Spark-based top-down specialization (Spark TDS) [17], and MapReduce-based multi-dimensional top-down specialization (MR MDTDS) [15]. To ensure that the results are comparable across approaches, we used the same workload and matched our configuration to the experimental configurations described in References [15,17,18] as closely as possible. Figure 10 shows the execution times obtained with the different methods. Our proposal outperforms the other approaches, giving the lowest execution time. Spark TDS shows the highest execution time. Our analysis indicates that Spark TDS updates the score of every leaf, which is an expensive additional overhead: a growing number of leaves and associated operations (e.g., applying a generalization level at a leaf) naturally demands more execution time, especially for larger K group sizes. The MapReduce-based approaches, MR MDTDS and MR MDSBA, have higher execution times mainly because of the expensive disk I/O associated with intermediate data. Spark MDSBA performs relatively well compared to the other approaches. We observed that Spark MDSBA uses a memory size that is large relative to the dataset size, which reduces its execution time.
Secondly, we conducted a performance experiment to understand how the execution time changes with a growing number of records for a fixed set of 5 QID attributes. As seen in Figure 11a,b, the execution time remains the same irrespective of the K group size. It appears that some operations (e.g., those involved in QID generalization) are cached in memory and then re-used, so the K group size has little effect on the execution time. However, this changes as soon as the number of records is increased: the execution time increases linearly with the number of records in both datasets.

Caching
Spark offers multiple cache storage levels to speed up the processing of RDDs that are accessed multiple times. The Spark cache strategies can be categorized as follows [14]. During the anonymization process, the two RDD transformations we utilize, FlatMapRDD and ReduceByKeyRDD, are accessed multiple times for generalization from the main application SparkDA. We set up our experiment with the different cache management options. The results are shown in Figure 12a,b. In general, the memory-based strategies, in which the RDD blocks are stored in memory (MEMORY_ONLY and OFF_HEAP), outperformed the disk-based strategy. Understandably, MEMORY_ONLY, which caches inside the JVM, took less execution time than OFF_HEAP, which caches in memory outside the JVM. MEMORY_AND_DISK took more time than the memory-based strategies but less than DISK_ONLY, as expected, since this strategy switches from memory to disk when the allocated memory is fully consumed by RDD blocks. Comparing the overall cache performance, the average execution time for the Irish dataset was lower than for the Adult dataset. The higher generalization levels for different attributes in the Adult dataset contributed to the increase in execution time: there were more ReduceByKeyRDD operations for the generalization levels defined in the DGH, so the attribute updates were more frequent.

Conclusions and Future Work
This work introduces "SparkDA", a novel data anonymization approach designed to take full advantage of the Spark platform to generate privacy-preserving anonymized datasets as efficiently as possible. Our approach is based on two RDD transformations, FlatMapRDD and ReduceByKeyRDD, with better partition control, in-memory processing, and efficient cache management. These innovations reduce many of the performance overheads associated with similar approaches implemented in MapReduce. The experimental results showed that our proposal provides high performance and scalability while supporting the high data privacy and utility required of any data anonymization technique. We also provided insights into the performance of the different memory management strategies offered by Spark and discovered that a side effect can occur when there are excessive demands to save data to the executor's memory.
In the future, we plan to extend our study to implement a data anonymization strategy based on the subtree generalization scheme [1]. This new approach will address the current limitation of the full-domain generalization approach, in which attribute values are generalized equally without considering their respective parent nodes, resulting in some loss of data utility. We also plan to extend our study to implement a more comprehensive data anonymization strategy for multi-dimensional datasets.