Scalable, High-Performance, and Generalized Subtree Data Anonymization Approach for Apache Spark

Abstract: Data anonymization strategies such as subtree generalization have been hailed as techniques that provide a more efficient generalization strategy compared to their full-tree generalization counterparts. Many subtree-based generalization strategies (e.g., top-down, bottom-up, and hybrid) have been implemented on the MapReduce platform to take advantage of its scalability and parallelism. However, MapReduce inherently lacks support for iteration-intensive algorithms such as subtree generalization. This paper proposes a Resilient Distributed Dataset (RDD)-based implementation of a subtree-based data anonymization technique for Apache Spark to address the issues associated with its MapReduce-based counterparts. We describe our RDD-based approach, which offers effective partition management, improved memory usage that caches frequently referenced intermediate values, and enhanced iteration support. Our experimental results show high performance compared with existing state-of-the-art privacy-preserving approaches while ensuring the data utility and privacy levels required of any competitive data anonymization technique.


Introduction
Privacy preservation is an ongoing and challenging issue that impacts people's lives on a daily basis. This has inspired and motivated many computer science researchers to provide information privacy preservation approaches such as access restriction, encryption, noise induction, and data anonymization [1][2][3]. The access restriction approach only allows authorized entities to access data, while the encryption approach uses ciphers to protect data privacy. The noise induction approach modifies the original data with additional noise to protect privacy. In contrast, anonymization approaches such as k-anonymization generalize or suppress sensitive information in the data to provide high utility and stronger privacy.
k-anonymization-based subtree generalization provides high data utility and better privacy strategies for single-dimensional data when compared to full-tree generalization [4][5][6]. The iterative nature of subtree generalization is well suited to finding a more efficient attribute generalization strategy. However, the execution time grows with each additional iteration needed to find the optimal generalization level. The computation cost increases further when other aspects of anonymization are involved, for example, the k-group size, the number of attributes, and the depth of the generalization hierarchy tree.
Many solutions have been proposed for scalable big data anonymization [7][8][9][10]. Existing approaches to subtree data anonymization are mostly based on MapReduce platforms to take advantage of their scalability and cost-efficiency [11][12][13]. The MapReduce paradigm typically relies on two primary functions, map and reduce, where the former works as a sub-unit of data processing while the latter accumulates and produces the final data analytic results. Without appropriate support for algorithms that run extensive iterations, such as subtree generalization, the mappers and reducers need to communicate many times over, often sequentially and while fetching data from disk, which creates tremendous performance overheads [14,15].
An alternative approach, using Spark [16] to address the overheads associated with MapReduce, has been proposed, often with performance comparisons on both platforms [14,15,17,18]. The performance of in-memory Spark has been well documented and proven effective for many iteration-intensive algorithms, as seen in [19], which demonstrated a 10-fold performance gain. Other approaches [14,15] also demonstrate the competitive performance advantage of Spark.
Close to our work, several proposals have emerged to illustrate the use of Spark for data anonymization techniques. For example, Ref. [20] proposed a distributed Top-Down Specialization (TDS) algorithm that can work on Spark, and [21,22] proposed several sensitivity-based multi-dimensional data anonymization strategies to use the Spark platform for subtree anonymization. Anonylitics [23] used Spark's default iteration support to implement data anonymization, and PRIMA [24] proposed a Spark anonymization strategy that defines utility and generalization level rules for limiting data loss. Although these existing proposals offer interesting aspects of k-anonymity-based anonymization, they neither provide guidelines and strategies as to how different types of subtree data anonymization approaches can best be implemented using Spark as a generic framework, nor discuss the implications for privacy and utility measures.
In this paper, we propose a generic framework for implementing subtree-based data anonymization techniques on Apache Spark. The main contributions of this paper are as follows:
• We propose a Resilient Distributed Dataset (RDD)-based subtree generalization implementation strategy for Apache Spark. Our novel approach resolves the existing issues and can provide data anonymization outcomes regardless of the specific subtree implementation approach (e.g., top-down, bottom-up, or hybrid);
• We demonstrate how our proposal can reduce the complexity of operations and improve performance through effective partitioning, improved memory and cache management for different types of intermediate values, and enhanced iteration support;
• We show that the proposed approach offers high scalability and performance through a better selection of the subtree generalization process and data partitioning compared to similar state-of-the-art approaches. We achieve high privacy and appropriate data utility by taking into account the data distribution and by processing data using in-memory computation;
• Our intensive experimental results demonstrate the compatibility and applicability of our proposal on various datasets for privacy protection and high data utility. Our approach also outperforms the existing Spark-based approaches by providing the same privacy with minimum privacy loss.
The rest of this paper is organized as follows. Section 2 provides the related work and highlights the pros and cons of each similar work. Section 3 provides the background and definition used throughout the paper and discusses the details of the issues involved in existing subtree generalizations implemented in MapReduce. Section 4 describes the details of our proposal and clearly illustrates how our proposal can resolve the issues associated with the MapReduce-based approaches. In Section 5, we provide our experimental results including setup, configuration, and discuss the observations of the results. Finally, we conclude the paper in Section 6 and provide some potential future directions.

Related Work
Distributed anonymization methods are used to address anonymization scalability. Most distributed algorithms presented so far aim at meeting k-anonymity-based privacy models using distributed programming frameworks such as MapReduce. This motivated the authors of the present paper to develop a distributed method that satisfies privacy requirements and provides high data utility using subtree-based generalization, offering scalable and high-performance anonymization.
Subtree-based generalization can be broadly categorized into two kinds: Top-Down Specialization (TDS) [11] and Bottom-Up Generalization (BUG) [12]. In the TDS approach, the generalization typically starts from the topmost domain values in the taxonomy trees of attributes and moves towards the bottom as an iterative process. In contrast, the techniques based on BUG generalize data from the bottom of the taxonomy tree towards its top, also iteratively. A hybrid approach that combines both TDS and BUG has also been proposed [13]. The majority of these approaches so far have been implemented as sequential MapReduce jobs where the output of each MapReduce job is used as an input for subsequent steps until the anonymization constraints are met. Such sequential execution of jobs can incur significant performance overheads.
Several Spark-based approaches were proposed to address the concerns associated with MapReduce-based data anonymization strategies. Zaharia et al. [16] illustrated a competitive performance advantage of in-memory Spark operations compared to disk-based MapReduce execution. Their results demonstrated that Spark's implementation of iterative operations was 100 times faster than the same operations implemented on the MapReduce platform, as Spark provides better parallelism by allowing many iterative tasks to run at the same time, often accessing memory instead of disks [14,15]. The authors in [14] provided benchmarking results using Word Count, k-means, and PageRank, where Spark outperformed MapReduce, especially on iterative tasks. Their work stated that the performance gain of Spark was due to Resilient Distributed Dataset (RDD) caching, which reduced the overheads associated with disk and CPU. Maillo et al. [15] demonstrated the performance advantage of Apache Spark on iterative tasks based on K-nearest Neighbour (KNN) using datasets containing 10 million instances.
Sopaoglu and Abul [20] developed a distributed TDS algorithm that provides k-anonymity on Apache Spark. The main focus of their study was to improve the scalability of the original TDS algorithm [11] by offering improved partition management. Using the Adult dataset, they showed that scalability and run-time were significantly improved. Al-zobbi et al. [21,22] proposed several sensitivity-based multi-dimensional anonymization strategies that could produce different levels of information obscurity depending on the access privilege levels of the users (i.e., a more customized data generalization result suitable for each user). To understand the roles and responsibilities of the user accessing the system, the proposal used a Spark User Defined Function (UDF), which allows Spark developers to extend the vocabulary of default Spark SQL. Their proposal also illustrated that it was possible to reduce the data transmission time between memory and disks by serializing data with Spark RDDs.
To address the overheads associated with MapReduce, several Spark-based approaches have been proposed in recent years [18,25,26,27,28]. In [29], the authors proposed the INCOGNITO framework for full-domain generalization using Spark RDDs. Although their experimental results illustrate the improvement in both scalability and execution efficiency, they did not provide any insights into privacy and utility trade-offs. Anonylitics [23] provides a data anonymization implementation based on Spark's default iteration support. The approach provides large-scale data anonymization; however, it does not address the potential memory exhaustion that arises when memory cannot accommodate the growing amount of intermediate data produced as the number of iterations increases. PRIMA [24] proposes a data anonymization strategy for Apache Spark with Optimal Lattice Anonymization (OLA). OLA provides data utility and generalization level rules in order to limit the data utility loss. However, the proposed approach does not provide a performance comparison or privacy validation against existing approaches.
Somewhat similar, but in a different realm of data anonymization that uses differential privacy [30] on Apache Spark, Gao et al. [26,31] proposed several techniques to anonymize the k-means clustering algorithm on the Spark platform. In their approaches, a new optimal partition mechanism is used to determine the dynamic allocation of datasets for fast processing on the Spark platform. Noise based on the Laplace mechanism is then applied, in the reduce phase, to the different partitions containing different classes of datasets. A formal privacy proof meeting the ε-differential privacy requirement is described. Yin et al. [32] also proposed another approach for data anonymization that uses the MapReduce model to control the parallel distribution of k-means clustering and at the same time uses the Laplace mechanism to implement differential privacy protection. In [33], the authors proposed a more holistic approach to produce differentially private datasets using a synthesizing program that can run on data-parallel analytics frameworks such as Apache Spark. Unlike these existing differential privacy-based approaches, whose main focus is on providing a more solid theoretical foundation for the privacy guarantee, our work focuses on a mechanism to provide privacy protection for published data.

Subtree Generalization
In this section, we describe the basic symbols and their descriptions used in this paper (see Table 1) together with the general algorithm involved in a subtree generalization.

Preliminaries
Let us define a dataset $D = \{r_0, r_1, \ldots, r_{n-1}\}$ as a set of data records $r_i$, where $0 \le i < n$ and $|D| = n$ denotes the total number of records in the dataset. A record $r \in D$ is described by a set of attributes $A = \{a_1, a_2, \ldots, a_m\}$, and each record consists of multiple attribute values $r = (av_1, av_2, \ldots, av_m)$, where $a_j$ and $av_j$ denote the $j$-th attribute and the corresponding attribute value of a record, respectively, $0 < j \le m$, and $m$ denotes the number of attributes in the dataset $D$.

Subtree Generalization Algorithm
In this section, we describe a generic subtree generalization algorithm using a sample dataset.
Figure 1 shows example Taxonomy Trees (TT) based on Gender, Age, Job, and Education from the census dataset [34]. Each TT includes roots (parent nodes), middle nodes (which sit between the parent and child nodes but most often act as parent nodes), and leaves (which are mostly child nodes). In a subtree scheme, if any child node is generalized to its parent, all other child nodes of that parent must be generalized as well. For example, in Figure 1b, if the 'Dancer' child node is generalized to its parent node 'Artist', then the other child node 'Writer' also needs to be generalized to 'Artist'. Please note that the 'Engineer' and 'Lawyer' child nodes retain their values, as their parent node 'Professional' is not affected. The root (parent) node of all taxonomy trees is often called 'Any'.
Subtree generalization applies one level of generalization at a time on an attribute by converting a child node to its parent node. The subtree generalization steps are presented in Algorithm 1. The iteration starts from the child level. Then, at each step, a specific value (i.e., child) is generalized to a general value (i.e., parent) for an attribute within a QID. This process is repeated until the highest level of generalization violates the k-anonymity rule [35].
Table 2 shows the original dataset along with the count of each record (i.e., how often the same record appears) in the database. Table 3 is produced from Table 2 as a result of one generalization level applied based on the Taxonomy Trees depicted in Figure 1. After the first level of generalization, we observe that, for the Education attribute, the child nodes '9th' and '10th' are generalized to 'Junior-Secondary'. Similarly, the 'Masters' and 'Doctorate' child nodes are generalized to 'Post-grad', and the other child nodes remain the same in this round. Finally, this iteration process is repeated until all QIDs meet the final required anonymization level, as represented in Table 4.
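The sibling rule of the subtree scheme can be illustrated with a minimal Python sketch. It assumes only the four Job values named in the text; the function name `generalize_siblings` and the flat child-to-parent dictionary are hypothetical choices made for illustration, not part of the original algorithm.

```python
# Child -> parent edges for the Job values mentioned in the text.
JOB_PARENT = {
    "Dancer": "Artist",
    "Writer": "Artist",
    "Engineer": "Professional",
    "Lawyer": "Professional",
}

def generalize_siblings(values, parent_of, trigger):
    """Generalize `trigger` and all of its siblings to their shared parent.

    Values under a different parent are left untouched, which is what
    distinguishes subtree generalization from full-tree generalization.
    """
    parent = parent_of[trigger]
    return [parent if parent_of.get(v) == parent else v for v in values]

jobs = ["Dancer", "Writer", "Engineer", "Lawyer"]
result = generalize_siblings(jobs, JOB_PARENT, "Dancer")
print(result)  # ['Artist', 'Artist', 'Engineer', 'Lawyer']
```

Generalizing 'Dancer' forces 'Writer' up to 'Artist' as well, while 'Engineer' and 'Lawyer' keep their original values.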
Algorithm 1 (steps 4-7):
4 Compare the $Cut_i$ score for each $QID_i$ and select the highest
5 Replace the $QID_i$ child with the $QID_i$ parent in TT for the selected $Cut_i$ score
6 Count the records r with the updated value
7 Repeat steps 3 to 6 until k is greater than the number of anonymized records

The overall subtree generalization algorithm is described in Algorithm 1. Each round of iteration includes four major steps: (i) comparing the k-anonymity level with the number of records generalized, (ii) calculating the data utility and privacy scores based on [6] for all QIDs, (iii) finding the best generalization level by comparing the score values for all QIDs and deciding the next generalization level based on the highest QID score, and (iv) applying the generalization selected by the highest QID score to all QIDs in the same Equivalence Class.
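The score, select, replace, and recount loop can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the scoring of [6]: `stand_in_score` is a hypothetical placeholder (the share of affected records that sit in groups smaller than k), the loop stops once every record group holds at least k records, and the record counts are invented for the example.

```python
from collections import Counter

# Hypothetical child -> parent edges per attribute index (0 = Job, 1 = Education).
PARENT = {
    0: {"Dancer": "Artist", "Writer": "Artist", "Engineer": "Professional",
        "Lawyer": "Professional", "Artist": "Any", "Professional": "Any"},
    1: {"9th": "Junior-Secondary", "10th": "Junior-Secondary",
        "Masters": "Post-grad", "Doctorate": "Post-grad",
        "Junior-Secondary": "Any", "Post-grad": "Any"},
}

def lift(value, target, parent_of):
    """Return `target` if it is an ancestor of `value`, else `value` unchanged."""
    node = value
    while node in parent_of:
        node = parent_of[node]
        if node == target:
            return target
    return value

def apply_cut(groups, attr, target):
    """Generalize one subtree and merge counts of records that become equal."""
    merged = Counter()
    for record, count in groups.items():
        new_value = lift(record[attr], target, PARENT[attr])
        merged[record[:attr] + (new_value,) + record[attr + 1:]] += count
    return dict(merged)

def stand_in_score(groups, attr, target, k):
    """Placeholder score: fraction of affected records in groups smaller than k."""
    touched = [c for r, c in groups.items()
               if r[attr] != target and lift(r[attr], target, PARENT[attr]) == target]
    total = sum(touched)
    return sum(c for c in touched if c < k) / total if total else 0.0

def anonymize(groups, k):
    """Repeat score -> select -> replace -> recount until all groups reach k."""
    while groups and min(groups.values()) < k:
        candidates = sorted({(a, PARENT[a][r[a]])
                             for r in groups for a in PARENT if r[a] in PARENT[a]})
        if not candidates:
            break  # every value is already at the root
        attr, target = max(
            candidates, key=lambda c: stand_in_score(groups, c[0], c[1], k))
        groups = apply_cut(groups, attr, target)
    return groups

groups = {("Dancer", "9th"): 1, ("Writer", "10th"): 1,
          ("Engineer", "Masters"): 2, ("Lawyer", "Masters"): 2}
result = anonymize(groups, k=2)
```

With k = 2, the two singleton records are merged into ('Artist', 'Junior-Secondary') after two rounds, while the 'Engineer' and 'Lawyer' groups, which already satisfy k, are never generalized.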

Review of Subtree Implementation in MapReduce
In this section, we review the subtree implementation based on the MapReduce platform and discuss the main limitations involved. There are four main phases in a typical subtree implementation that uses the MapReduce platform [12] (shown in Figure 1). The MapReduce jobs contained in these four phases are coordinated to accomplish the subtree anonymization. The description of each of the four main phases is as follows: (1) Partition MapReduce Job: this phase involves dividing the original dataset into multiple chunks (i.e., partitions), each of which contains a smaller portion of the original dataset.
However, we identify the following architectural limitations of the MapReduce platform for implementing the subtree anonymization algorithm. These include the issues associated with partition, memory, and iteration management. We argue that these limitations create execution complexity and performance degradation in various stages. We discuss these problems in detail in the following sections.

Partition
Processing data in MapReduce requires a map task to process a portion of the input data by assigning key-value pairs followed by generating intermediate data. The intermediate data is stored in a local disk of each executor node after applying the hash function. The hash applied to each partition ensures that the output of a map task is arranged using the sort and shuffle process. This hash order ensures reducers can access their respective key-value pairs based on intermediate data locality [36].
An uneven hash partitioning of intermediate values may create data skew in multiple places. For instance, a node that contains a proportionally larger number of records than other nodes results in tuple skewness [11]. As a consequence, a reducer coordinating these nodes to process the outcomes has to wait for a significant time until the node containing the larger number of records completes. Similarly, key skewness [37] may happen when there is a big difference in the generalization levels being applied to different groups of attributes, e.g., applying a single generalization level versus multiple generalization levels. This phenomenon most likely happens more often when the k-group size is larger (e.g., there are more attributes).
To put this more formally, let $n$ be the number of tuples and $m$ the number of attributes in a dataset $D$, and let $s$ and $t$ be the numbers of mappers and reducers, respectively. Each mapper produces $m + 1$ key-value pairs, which yields $O(1)$ space and $O(m \cdot n/s)$ time complexity [38]. The reducer yields $O(1)$ space and $O(m/k \cdot n/t)$ time complexity, where $k$ denotes the k-group size. Increasing $s$ reduces the number of tuples handled by each mapper ($n/s$), which shortens the mapper computing time because $m \cdot n$ is divided by the number of mappers [12,39].
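As a small worked example of these bounds, the sketch below evaluates only the leading terms $m \cdot n/s$ and $(m/k) \cdot (n/t)$. The function names and the input sizes are hypothetical, and constant factors hidden by the big-O notation are ignored.

```python
def mapper_cost(n, m, s):
    """Leading term of the mapper time bound O(m * n / s)."""
    return m * n / s

def reducer_cost(n, m, k, t):
    """Leading term of the reducer time bound O(m/k * n/t)."""
    return (m * n) / (k * t)

n, m = 1_000_000, 4          # illustrative tuple and attribute counts
print(mapper_cost(n, m, s=10))          # 400000.0
print(mapper_cost(n, m, s=20))          # 200000.0 (doubling mappers halves the term)
print(reducer_cost(n, m, k=100, t=5))   # 8000.0
```

Doubling the number of mappers halves the per-mapper term, which is the effect described above.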

Memory
As a mapper loads the input data from disk into memory (of the execution node), the results (i.e., intermediate data) are transferred from the memory and stored on the disk (of the same node). The reducer loads the intermediate data from the disk into the memory of the execution node on which it runs, processes it, and subsequently stores the results back on the disk [36]. Without cache support, any values produced in different stages (i.e., input, intermediate, or output) are stored on disk and accessed each time a read or write of these values is required. This architectural design of MapReduce adds excess overhead for I/O operations and demands a larger storage capacity. We will put this more formally. Let the subtree algorithm use $N_{It}$ non-iterative jobs and $It$ iterative jobs to convert a dataset into anonymized data. Let $J$ represent a MapReduce job; every $J$ reads from disk $R$ times and writes to disk $W$ times. $I$ represents the number of iterations needed for each $J$, and it depends on multiple factors including the number of attributes, the k-group size, and the generalization hierarchy. We use the following equation to calculate the total number of $R$ and $W$ operations in MapReduce subtree (ST).
The anonymization process causes both more execution time and complexity especially in the reducer phase while processing intermediate anonymized datasets. The worst case of complexity in the reducer phase can be calculated as: In the meantime, the space complexity of the reducer phase can be formulated as: where GL denotes generalization level in k-group.

Iteration
We argue that there are two architectural design principles of the MapReduce platform that create significant overheads for any iterative task. The first is the I/O principle, where any intermediate results have to be written to disk and subsequently read back into the executor memory, as discussed in Section 3.3.2. The second is the data locality principle, where any data processing must be done on the cluster node that holds the data to be processed. As a consequence, the result of the computation also has to be saved on the same cluster node. The problem arises when other cluster nodes need to read this data. In this case, a message exchange over the network is required, which can cause a noticeable delay that is multiplied by each iteration.
With the disk I/O-based operation and data locality principles, we argue that any algorithm that involves intensive iterations such as the subtree generalization can cause significant overheads at multiple places such as at Disk I/O, Network, and Scheduling [40].
Disk I/O overhead: Significant I/O overheads may occur at many different stages of subtree generalization where the stage involves an intensive iteration such as applying generalization levels for attributes, calculating privacy and utility scores, finding the most optimal generalization level, and re-applying the generalization based on the optimal generalization level.
Network Overhead: The anonymization steps in MapReduce need to use the network to exchange intermediate data among the cluster nodes. Network overhead arises because the various intermediate data generated by the iterative tasks may need to be transferred to other cluster nodes multiple times. This problem is made worse by any network delay and is consequently considered an expensive task that causes a significant delay in the iteration process.
Scheduling Synchronization Overhead: Assume a situation in which there are two mappers with different workloads and one mapper takes a significantly longer time to complete. In this case, the reducer processing the results of these two mappers needs to wait until both mappers complete their jobs. This is referred to as scheduling synchronization. If there are many mappers with differing workloads across iterations, the scheduling synchronization overhead grows with the number of imbalances across mappers.
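The effect of this barrier can be illustrated with a toy model. The sketch assumes that a stage ends only when its slowest mapper finishes; the function names and the timing values are hypothetical and are not measurements from the experiments.

```python
def barrier_wait(mapper_times):
    """Idle time summed over mappers that finish before the slowest one."""
    slowest = max(mapper_times)
    return sum(slowest - t for t in mapper_times)

def stage_time(per_iteration_times):
    """Wall-clock time when every iteration waits for its slowest mapper."""
    return sum(max(times) for times in per_iteration_times)

def total_wait(per_iteration_times):
    """Idle time accumulated over all iterations."""
    return sum(barrier_wait(times) for times in per_iteration_times)

# Two mappers, five iterations, the same total work (20 units) per iteration.
balanced = [[10, 10]] * 5
skewed = [[4, 16]] * 5

print(stage_time(balanced), total_wait(balanced))  # 50 0
print(stage_time(skewed), total_wait(skewed))      # 80 60
```

Although both workloads perform the same total work, the skewed one takes longer and accumulates idle time in every iteration, so the penalty scales with the number of iterations.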

Our Proposal
In this section, we provide a detailed description of our proposed approach for Spark-based subtree anonymization. Our proposal consists of three phases, where the output of each phase is required as an input for the next phase. We adopt the Spark Resilient Distributed Dataset (RDD) design and data partitioning mechanism for our approach and describe the step-by-step data flow to address the concerns discussed in Section 3.3. We make our approach more robust by segregating the computation of data anonymization into phases, as illustrated in Figure 3. We discuss the details of each phase and the specific improvements we have made to resolve the MapReduce-based issues in the following three phases:

Phase 1-Initialization
This phase ensures that each RDD partition contains the optimal number of records without duplication to provide a balanced workload for each partition. The original data records are counted, and each record is then assigned a frequency value based on the number of times it appears in the whole dataset. We increase stability by using this total record count approach to address both the tuple and key skewness problems discussed in Section 3.3.1. In this phase, we provide new partition management that can avoid both tuple skewness and key skewness. With roughly an equal number of records contained in each RDD partition, the partitions execute in parallel and take approximately the same processing time. The following steps detail our partition strategy.

• To avoid tuple skewness, we first count the total number of records in the input data and then divide the records according to the number of partitions so that each partition contains roughly a similar number of records.
• To avoid key skewness, we count the duplicate records that appear in multiple partitions. Their frequency is recorded in one partition and the duplicated records are removed from the other partitions.
• After key skewness is addressed by the above step, we count the number of records in each partition again (as some duplicate records were removed) and move records across partitions so that each partition contains a similar number of records.
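The three steps of this partition strategy can be mimicked on a local list as a minimal Python sketch. It assumes partitions are plain lists rather than RDD partitions; the function names (`split_evenly`, `deduplicate`, `rebalance`) and the sample records are hypothetical and chosen for illustration only.

```python
from collections import Counter

def split_evenly(records, num_partitions):
    """Step 1: divide the records so each partition gets a similar number."""
    size = -(-len(records) // num_partitions)  # ceiling division
    return [records[i * size:(i + 1) * size] for i in range(num_partitions)]

def deduplicate(partitions):
    """Step 2: keep each distinct record once, with its global frequency."""
    freq = Counter(r for part in partitions for r in part)
    seen, result = set(), []
    for part in partitions:
        kept = []
        for record in part:
            if record not in seen:
                seen.add(record)
                kept.append((record, freq[record]))
        result.append(kept)
    return result

def rebalance(partitions):
    """Step 3: redistribute the (record, frequency) pairs round-robin."""
    flat = [pair for part in partitions for pair in part]
    n = len(partitions)
    return [flat[i::n] for i in range(n)]

records = ([("M", "9th")] * 5 + [("F", "10th")] * 2 + [("M", "Masters")]
           + [("F", "Doctorate")] * 3 + [("F", "9th")])
deduped = deduplicate(split_evenly(records, 3))
balanced = rebalance(deduped)
print([len(p) for p in deduped])   # [1, 2, 2]
print([len(p) for p in balanced])  # [2, 2, 1]
```

After deduplication the partition sizes drift apart, and the final round-robin pass brings them back to within one record of each other while the stored frequencies still sum to the original record count.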
We present this initialization step, which involves efficient partitioning, in Algorithm 2. In Step 2, the "partition factor" indicates the variable that contains the number of records and the capability of the node. Step 3 uses Map_RDD to transform the input RDD_in into key-value pairs $(r, C_r)$, where $r$ represents a record and $C_r$ denotes its count in each map. In this phase, the key represents a single record, while the value represents the number of times that key (record) has appeared in the dataset. The ReduceByKey_RDD in Step 4 reads the Map_RDD key-value pairs $(r, C_r)$ and aggregates the values for the same key. The counts of the same $r$ are summed to find $\sum C_r$, which represents the total count of that record across all partitions. Please note that this process requires shuffling data between partitions in the executor nodes to exchange the values for the same key over the network.
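The map and reduce-by-key steps can be simulated in plain Python to show where the shuffle occurs. This is a local sketch under stated assumptions, not Spark code: partitions are lists, `map_partition` and `reduce_by_key` are hypothetical helper names, and the hash-based routing stands in for the shuffle. In PySpark, the corresponding RDD operations would be `map` and `reduceByKey`.

```python
from collections import Counter, defaultdict

def map_partition(partition):
    """Emit (r, C_r): each record with its count inside this partition."""
    return list(Counter(partition).items())

def reduce_by_key(mapped_partitions, num_reducers):
    """Route each pair to a reducer by key hash, then sum counts per key."""
    buckets = [defaultdict(int) for _ in range(num_reducers)]
    for pairs in mapped_partitions:
        for record, count in pairs:
            # All pairs with the same key land in the same bucket (the shuffle).
            buckets[hash(record) % num_reducers][record] += count
    return [dict(bucket) for bucket in buckets]

a, b, c = ("M", "9th"), ("F", "10th"), ("M", "Masters")
partitions = [[a, a, b], [a, c], [b, b]]

mapped = [map_partition(p) for p in partitions]
reduced = reduce_by_key(mapped, num_reducers=2)

# Merge the reducer outputs only to inspect the totals.
totals = {r: n for bucket in reduced for r, n in bucket.items()}
print(totals[a], totals[b], totals[c])  # 3 3 1
```

Each key ends up in exactly one reducer bucket, which is why pairs for the same record have to travel across partitions before their counts can be summed.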

Phase 2-Generalization
This phase calculates the privacy and utility scores for each attribute. The privacy and utility scores are used to find the most optimal generalization level to be applied to a certain attribute. Frequently referenced intermediate values (e.g., the privacy and utility scores and the results of the generalization level being applied) are stored first in memory and then moved to a cache to reduce the potential I/O overhead discussed in Sections 3.3.2 and 3.3.3. The purpose of this phase is to apply the most optimal generalization level according to the privacy and utility scores. The frequent use of memory and cache increases the robustness of our proposal. The memory holds the intermediate results for the computation of the privacy and utility scores of each attribute, while the results of the generalization level are cached to avoid expensive disk access.
This phase takes the results of the previous phase (Section 4.1) as input to compute the score value. We describe the details of this phase in Algorithm 3.
Step 2 assigns $A_v$ as the child $C_A$ in $r$ for the generalization level of the QID, while $P_A$ is assigned to all QIDs based on its $C_A$ in TT. This process applies one level of generalization to one attribute (i.e., one iteration) and holds the results in memory so that they can be used in the subsequent step. Steps 4-7 are used to compute the privacy and utility scores, denoted as $Score_{ILPG}(QID)$, as follows, based on [41,42].
where IL(QID) contains the result of information loss for QID while PG(QID) contains the result of privacy gain for QID. The details of the calculation for IL(QID) and PG(QID) are depicted in Equations (3) and (4), as follows, respectively.
where $|C_A|$ represents the child attribute and $|P_A|$ represents the parent attribute for the given QID. $E_n(C_A)$ and $E_n(P_A)$ denote the entropy values of the child and parent attributes, respectively.
where $A_{P_A}(QID)$ and $A_{C_A}(QID)$ contain the Anonymization Level (AL) of the parent and child QIDs. Steps 7-10 are used to identify and update the best generalization level based on the privacy and utility score calculated in the earlier steps. The process goes through each $r$, iterating over each $A_v$, where any $A_v$ belonging to QID is considered to be qid. The $A_v$ is considered to be $C_A$ when the value is compared in TT. The $C_A$ is compared with the same QID attributes in DOM. Once the $C_A$ is found in TT, the $C_A$ value is replaced by its parent node $P_A$. Then, the $A_v$ values are replaced from $C_A$ to $P_A$ for each $r$ to obtain $r^*$. Finally, the RDD returns the anonymized key-value pairs $(r^*, \sum C_r)$.
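To make the entropy terms concrete, the following sketch computes the entropy of an attribute's value distribution before and after one generalization step. It assumes Shannon entropy in bits, which is an assumption made here for illustration; the exact forms of the information loss and privacy gain equations are not reproduced, and the counts are hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a value -> count distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c)

def generalize_counts(counts, parent_of):
    """Replace each child value with its parent and merge the counts."""
    merged = Counter()
    for value, count in counts.items():
        merged[parent_of.get(value, value)] += count
    return dict(merged)

EDUCATION_PARENT = {"9th": "Junior-Secondary", "10th": "Junior-Secondary",
                    "Masters": "Post-grad", "Doctorate": "Post-grad"}

child_counts = {"9th": 2, "10th": 2, "Masters": 4}   # hypothetical counts
parent_counts = generalize_counts(child_counts, EDUCATION_PARENT)

print(entropy(child_counts))   # 1.5
print(entropy(parent_counts))  # 1.0
```

Merging '9th' and '10th' into 'Junior-Secondary' lowers the entropy from 1.5 to 1.0 bits, which reflects the information given up in exchange for larger, harder-to-distinguish groups.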
It must be noted that most of the data is stored in and fetched from memory rather than disk during any iteration process, which avoids unnecessary disk I/O overhead. We also use the capability of the cache in this generalization phase to avoid the re-computation of the intermediate values (i.e., the privacy and utility score and the results of the generalization level) during the iterations.

Algorithm 3 (final steps):
Update(score) ← update r
update AL ← Update(score)
11 RDD_update $(r^*, \sum C_r)$ ← Update(score)
12 return $(r^*, \sum C_r)$

Phase 3-Validation
This phase validates whether the generalized dataset meets the k-anonymization requirements. We provide a mechanism to deal with frequently referenced intermediate values (e.g., the semi-anonymized dataset) by caching them to reduce the overheads discussed in Sections 3.3.2 and 3.3.3. In this phase, we validate whether full anonymization has been achieved, i.e., whether the optimal generalization levels for all attributes have been applied up to the point where they do not violate the k-anonymization constraint. This phase improves the access time for intermediate results by storing semi-anonymized attributes in a memory cache to avoid expensive disk access and improve memory management.

This phase uses the results of Section 4.3 as input for the final computation of the anonymization process. The details of this phase are depicted in Algorithm 4. Step 2 is used to update the partition based on the Phase 1 strategy. Steps 3-6 are used to check whether $\sum C_r^*$, which contains the total number of generalized records (represented as AL), meets the k-anonymization constraint. If AL has fewer records than the k-anonymization constraint requires, the semi-anonymized records $\sum C_r^*$ are copied to a new partition of a map $\sum C_r$ and the process returns to Phase 1. Steps 7-11 are used in the case where full generalization is achieved, that is, where the number of generalized records meets the k-anonymization constraint. Then, a key is assigned to each (distinct) fully generalized record, where the value of the fully generalized record is used as the value. Finally, Step 12 saves all fully anonymized records to memory.
Based on the proposed algorithm described in this phase, we mitigate the disk I/O, network I/O, and synchronization overheads during the iterations involved in this phase. For instance, by saving the semi-anonymized dataset in memory, we reduce the disk I/O overhead. Moreover, we minimize the chances of a potential network transfer by reducing the size of the dataset, removing duplicate records while still preserving their count, and by performing RDD operations that share the cached intermediate values without expensive message exchanges across multiple network nodes. This significantly reduces the network I/O overhead. Finally, because each partition operates on a near-optimal number of records at this level (as a result of the partition management described in Phase 1), the synchronization overhead is reduced significantly as the number of iterations increases.

Experimental Results
In this section, we first describe our experimental setup, including the details of the datasets and the system environment configurations. Subsequently, we compare our proposal with existing approaches based on record volume and number of records. We then provide the experimental results for our model on the Adult and Irish datasets. We further investigate the impact of our proposal on memory and iteration performance. Finally, we discuss the privacy and utility scores obtained through several privacy and utility measurement metrics.

Datasets
The experiments are carried out using two datasets: the US Census dataset (i.e., the Adult dataset) [34] and the Irish Census dataset [43]. Larger datasets are created for the experiments using an approach similar to the one proposed in [25]. Tables 5 and 6 illustrate each quasi-identifiable attribute (QID) used in our experiments and the generalization level (GL) of each QID obtained from the taxonomy trees for the Adult and Irish datasets. The sensitive attributes are set to "Salary" in the Adult dataset and "Industrial Group" in the Irish dataset.

System Environment Configurations
We configured YARN and the Hadoop Distributed File System (HDFS) using Apache Ambari. HDFS was used to distribute data across a NameNode (acting as the master node), a secondary NameNode, and six DataNodes (acting as worker nodes). We allocated 3 GB of memory to the YARN NodeManager and 1 GB of memory each to the ResourceManager, Driver, and Executor. We used Spark version 2.1 [44] with YARN as the cluster manager. The details of the experimental setup for both the Spark platform and the datasets are given in Table 7.

Performance and Scalability
We ran experiments to understand performance and scalability in terms of memory and iteration management. We ran each experiment 10 times and used the average value to ensure the reliability and consistency of the results. We ensured that the experiments for each dataset use a constant number of partitions (i.e., 24) instead of the default number. Fixing the number of partitions ensures that the data can be processed with an equal number of executors.

Performance Comparison with Existing Subtree Approaches
In this section, we discuss the results of our proposal in comparison with existing subtree MapReduce-based and Spark-based methods. We compare all the approaches based on the volume (size) of data. The increasing volume of data validates the requirements of big data, while the anonymity parameters validate the k-anonymity requirements [45].
We conduct experiments to compare our approach with Spark and MapReduce multi-dimensional sensitivity-based anonymization (MDSBA) [21,22] against a growing data size. Figure 4 compares the execution times of Spark MDSBA [22], MapReduce MDSBA [21], Spark Top-Down Specialization (TDS) [20], and MapReduce TDS [11] with our proposal. The results show that our approach has the lowest execution time of all the approaches. Moreover, we observed that the execution time increases linearly with the data size across the compared approaches. Spark-based approaches such as our proposal and Spark MDSBA have almost the same performance when the data size is less than 10 GB, whereas for data sizes larger than 10 GB the execution time of the other approaches is much higher than that of our approach.

Performance Comparison with Existing Spark-Based k-Anonymity Approaches
In the second set of results, we compare the performance of our approach with state-of-the-art Spark-based k-anonymity approaches using a constant k group size, number of records, and generalization level, as shown in Table 7. Figure 5 compares the execution time across different Spark-based approaches, namely Prima [24] and Anonylitics [23]. The results indicate that our proposal yields the lowest execution time among the compared approaches, while Anonylitics shows the highest execution time. We also identified that our approach uses a smaller number of RDDs and parallelism during the execution of each partition in its respective executor. Our approach measures the score and updates the anonymity in its respective RDDs for all generalization levels of each QID. In contrast, the Prima approach measures and updates the score of each leaf as a single RDD. Thus, an increase in the generalization level increases the number of leaves, which leads to a longer execution time for a given k-group size.

Performance Comparison on Adult and Irish Datasets
In this section, we perform an experiment to identify how different datasets affect the execution behavior of our proposed model. We use the Adult and Irish datasets to understand how the execution time changes with a growing number of records, with the number of QID attributes fixed at 5 and the k group size fixed at 1000. As seen in Figure 6, the execution time increases linearly with the number of records in both datasets. We observe that our approach handles various datasets with different generalization levels and k group sizes; however, an increase in the number of distinct qid values and in the generalization level may increase the execution time, as is the case for the "Field of Study" QID in the Irish dataset. Although both the Adult and Irish datasets are used with the same number of records, k group size, and number of QIDs, the execution time is higher for the Irish dataset because of its larger number of distinct qid values and generalization levels.

Memory Effects on Performance and Scalability
In this section, we discuss our results based on three aspects: (i) performance in terms of the growing number of records, (ii) performance compared to other similar approaches, and (iii) scalability in terms of the increasing number of attributes.
We first analyzed the performance implications of our approach by increasing the number of records for different k-group sizes. The results in Figure 7 show the execution time for record sizes increasing from 0.1 billion (10^8 records) to 1 billion (10^9 records). We observed that the execution time grows linearly with the dataset size. We also did not observe any distortion caused by the k-group size, as the execution times remain almost constant even when the k-group size increases. We identified two reasons for this effect: (i) the privacy and utility scores are measured from the records in the RDD rather than from the complete data records; after each generalization step, identical records are aggregated and represented as key-value pairs, which contain enough information and do not require additional calculation; (ii) our anonymization process uses a broadcast mechanism that shares data across executors, which effectively reduces network I/O, memory usage, and disk I/O. Consequently, the computation time is reduced significantly.
The scalability of the distributed anonymization is benchmarked against an increasing QID size and is presented in Figure 8. We increased the QID size of the Adult dataset along with the number of records. We found that the execution time depends on the size of the QID set and the variety of each qid value (i.e., the level of generalization applied). Thus, a larger QID set and more diverse qid values lead to a higher execution time. We observed that for a larger QID set, larger equivalence classes were needed to satisfy the k-anonymity requirements, as this allowed a greater number of attributes to be grouped/partitioned together and thus reduced the number of required iterations.

Iteration Effects on Scalability
We analyzed the effects of iterative operations in our proposed approach as the number of records increases. In this set of results, we identify the importance of caching for iteration-intensive operations. We compared the execution time of cached and non-cached operations during the anonymization process. Figure 9 compares the number of iterations against the execution time for various dataset sizes. We observe that more iterations lead to a longer execution time. When we increased the record size from 0.1 B to 0.2 B, the executor memory had enough space to accommodate the records while processing the anonymization, so no eviction from memory was triggered by overload. As we increase the record size further, however, the executor memory starts evicting data. Notably, although each RDD is allocated the same or a smaller amount of input data, the anonymization process adds more data to memory during execution. Thus, the larger the record size, the more records need to be evicted to make enough space for execution. Figure 10 compares the effects of cached and non-cached (NC) RDDs as the generalization level increases in each iteration. We observed that the NC RDD has a higher execution time than the cached RDD regardless of the storage level, i.e., DISK_ONLY (D), MEMORY_ONLY (M), or MEMORY_AND_DISK (DM). For smaller datasets, there is more space to hold the cached RDD in memory. As the dataset size increases, however, RDD partitions need to be deleted from memory and computed again for the next transformation. In each iteration, the RDD data needs to be scanned to find the optimal generalization. This requires more frequent visits to the RDD data in memory, which increases the execution time.
While we observed that DISK, MEMORY, and their combination provide similar execution times, these three storage options accommodate cached results in different ways. As described in [14,46], the combination of memory and disk is the most cost-effective option for iterative workloads because it ensures faster computation. We also observed that after each iteration, the read and write times taken by memory and disk increase slightly without recomputing the space and size for the next iteration.
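The benefit of caching for iterative processing can be illustrated with a toy simulation. The sketch below is plain Python, not Spark code: `LazyDataset` is a hypothetical stand-in for a lazily evaluated dataset, and it only counts how often the underlying computation is re-run when the same intermediate result is visited in every iteration. It does not model storage levels, eviction, or any real Spark API.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not an RDD)."""

    def __init__(self, compute, use_cache=False):
        self._compute = compute      # function that rebuilds the data
        self._use_cache = use_cache  # whether to keep the result after first use
        self._data = None
        self.evaluations = 0         # how many times the data was recomputed

    def collect(self):
        # Serve the stored result if caching is enabled and data is available.
        if self._use_cache and self._data is not None:
            return self._data
        self.evaluations += 1
        data = self._compute()
        if self._use_cache:
            self._data = data
        return data


def run_iterations(dataset, iterations):
    """Visit the same intermediate dataset once per iteration."""
    total = 0
    for _ in range(iterations):
        total += sum(dataset.collect())
    return total


def expensive():
    # Placeholder for an expensive transformation chain.
    return [x * x for x in range(1000)]


uncached = LazyDataset(expensive)
cached = LazyDataset(expensive, use_cache=True)
result_uncached = run_iterations(uncached, 5)
result_cached = run_iterations(cached, 5)
```

Both runs produce the same result, but the uncached dataset is recomputed in every iteration while the cached one is computed once, which is the effect the cached versus NC comparison measures.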
Furthermore, we investigated the impact of the number of RDD partitions relative to the number of executors to identify a balance between high parallelism and using the available resources to their maximum capacity. The partition-versus-executor trade-off has been discussed in [20,47]. From the results shown in Figure 11, we observe that increasing the number of partitions does not necessarily improve the execution time: 64 partitions (denoted as P64) take more execution time than only two partitions (P2). This means that the number of partitions and the number of executors need to be balanced to avoid potential latency.

Privacy & Utility Trade-Off
We used privacy and utility metrics to measure and validate our proposed method. Data anonymization techniques use the trade-off between privacy and utility to quantify the success of an anonymization algorithm. A privacy level is estimated by recognizing the uniqueness of information; a low privacy level normally implies that it is easy to distinguish an individual (a tuple or record) from a group (e.g., numerous records). We used two privacy metrics, Kullback-Leibler divergence (KLD) and Information Entropy (I_E), to evaluate the privacy level of our proposal.
In contrast, a utility level is estimated by computing the degree of degradation in accuracy between the baseline (i.e., original) value and the anonymized (i.e., sanitized) value. We use two utility metrics, the Discernibility Metric (DM) and the Average Equivalence Class Size Metric (C_AVG).

Kullback-Leibler-Divergence (KLD)
KLD measures the likelihood of the presence of the original attribute in the anonymized attribute for each record [48]. For example, suppose the original value of the Job attribute is "Writer" and it is anonymized into "Artist". KLD measures the possibility of guessing the original value "Writer" from "Artist".
In our approach, we calculate KLD on the final anonymized dataset by measuring the likelihood of the presence of each attribute, summing the values for all attributes within a record, and repeating this for all records.
KLD can be computed based on the formula represented in Equation (5).
The KLD value starts at 0, which indicates that the original record and the anonymized record are the same. An increase in the KLD value indicates a higher level of privacy assurance. With a lower KLD value, it is easier to identify the original value from the matching anonymized value (i.e., low privacy). Figure 12 presents the KLD values of our proposed subtree generalization implementation on the Adult and Irish datasets. The KLD values increase with the k-group size and are very close to those of the comparative approaches discussed in [25,49].
The results of the KLD metric on the Adult and Irish datasets are shown in Figure 12. The KLD values increase only between k group sizes of around 2 and 5. Beyond a k-value (i.e., group size) of 5, the KLD values remain almost the same for the rest of the k group sizes. The visible increase of KLD from k-value 2 to 5 (and the slight changes from 5 onward) is due to generalization levels still being actively applied. At approximately k-value 10, all generalization levels have been applied, so there are no further changes for the remaining k-values and the KLD value remains identical. The overall pattern of changes in the KLD values of the Irish dataset is similar to that of the Adult dataset. However, we observe that the average KLD values are much higher in the Irish dataset than in the Adult dataset. This is because the Irish dataset has more generalization levels for each QID, which increases the chance that more QIDs share the same value.
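As a rough illustration of how a KLD score of this kind can be computed, the following sketch uses the standard Kullback-Leibler definition, D_KL(P || Q) = sum over x of P(x) log(P(x) / Q(x)), on a single attribute. It is an assumption that this matches Equation (5), which is not reproduced here. The helper names, the uniform spreading of a generalized value over its leaves, and the "Painter" value are also hypothetical and used only to extend the "Writer" to "Artist" example.

```python
import math
from collections import Counter


def distribution(values):
    """Empirical probability of each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}


def anonymized_distribution(original_values, parent_of):
    """Spread the mass of each generalized value uniformly over its leaves.

    parent_of maps each leaf value to its generalized value (assumed hierarchy).
    """
    leaves_under = {}
    for leaf, parent in parent_of.items():
        leaves_under.setdefault(parent, []).append(leaf)
    generalized = distribution([parent_of[v] for v in original_values])
    q = {}
    for parent, mass in generalized.items():
        for leaf in leaves_under[parent]:
            q[leaf] = q.get(leaf, 0.0) + mass / len(leaves_under[parent])
    return q


def kl_divergence(p, q):
    """Standard D_KL(P || Q) with natural log; needs q[x] > 0 where p[x] > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


jobs = ["Writer", "Writer", "Writer", "Painter"]
hierarchy = {"Writer": "Artist", "Painter": "Artist"}  # illustrative taxonomy

p = distribution(jobs)                        # original distribution
q = anonymized_distribution(jobs, hierarchy)  # what "Artist" reveals
kld = kl_divergence(p, q)
```

A value of 0 is obtained when the two distributions coincide, and the value grows as generalization hides more of the original distribution.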

Information Entropy (I_E)
I_E measures how uncertain it is to identify the original value from the anonymized value within a QID set [50]. The entropy value of I_E is 1 if all the qid attributes are identical in the anonymized dataset for the same QID. To compute I_E(QID), we (1) calculate the likelihood of the presence of the original attribute in a record, (2) sum the values of (1) for each attribute in a record (denoted as P_RDD*(qid)), (3) repeat (1) and (2) for each QID, and (4) sum the values of (3) for all records. Note that if all attributes are changed between the original record and the anonymized record, the value of P_RDD* is 1.
Based on Equation (6), we computed I_E(QID) for a single QID. To obtain I_E for the entire anonymized dataset (denoted as RDD*), we took the average over all QIDs. The entropy value of I_E is 0 if the records are identical between the original dataset and the anonymized dataset within the matching equivalence class. The maximum value of I_E is achieved when the original record sets are very different from the anonymized record sets for a given QID. A higher value of I_E represents more uncertainty (i.e., higher privacy). Figure 13 shows the privacy level of our proposed approach in terms of entropy. Generally, the entropy increases with k. Though the I_E score is highest at log_2 k in [51], the I_E score of our proposal is better than that of the scheme proposed by [29]. The performance of our approach is close to the higher end of the standard and achieves much higher privacy levels than those achieved by the scheme proposed by [25]. As the entropy represents the information content of the data change, the entropy after data anonymization should be higher than the entropy before anonymization, which is the phenomenon observed in our results.
The results of the I_E metric on the Adult and Irish datasets are shown in Figure 13. Again, the values for the Adult and Irish datasets follow parallel trends in this study, which indicates that the implementation of our data anonymization technique in Spark did not destroy the privacy level. The average of the I_E values in the Adult dataset is lower than in the Irish dataset. Our investigation reveals that the Adult dataset contains a relatively small number of different QIDs that share the same value as a result of anonymization. A smaller k value affects the I_E value more than a greater k value because of the number of identical values in the QID attributes. This affects the I_E value, as it is easier to identify a unique record within the same equivalence class than in the Irish dataset, which has a larger number of different QIDs that share the same value.
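For intuition about the log_2 k upper bound mentioned above, the following sketch computes the standard Shannon entropy of the original values hidden inside one equivalence class and averages it over the QIDs. This is an assumption-laden illustration: it uses the textbook entropy definition, which may differ from Equation (6), and the function names and sample values are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def average_entropy(columns):
    """Average the per-QID entropy.

    columns maps a QID name to the original values found in one
    equivalence class (hypothetical input layout).
    """
    return sum(shannon_entropy(v) for v in columns.values()) / len(columns)


# One equivalence class of size k = 4 with two QIDs (illustrative values).
equivalence_class = {
    "Job": ["Writer", "Painter", "Writer", "Painter"],
    "Age": [31, 32, 33, 34],
}
score = average_entropy(equivalence_class)
```

With k = 4, the "Age" column reaches the maximum of log_2 4 = 2 bits because all four original values differ, while a column whose original values are all identical contributes 0.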

Discernibility Metric (DM)
DM reports the data quality resulting from the degree of data degradation caused by data anonymization, based on the tuples within an equivalence class (EC). Let EC be the set of equivalence classes of a k-anonymized dataset RDD*, and let EC_i be one of the equivalence classes in EC. The DM metric can be expressed more formally as Equation (7).
where i represents a qid tuple within an equivalence class. The data utility is associated with the DM score. A high DM score means that the data utility is low (i.e., the original qid tuple has lost its original values), while a low DM score means that the data utility is high. The Discernibility Metric (DM) [52] measures the cardinality (i.e., distinctness) of the equivalence class. For a low group size of k, the cardinality of the equivalence classes is small. If the privacy level is high (e.g., a higher group size of k), the discernibility metric increases sharply, which increases the cardinality of an equivalence class. Equivalence classes with a large cardinality tend to group data over a large range, leading to large information loss. Figure 14 presents the discernibility penalty of 30,162 records for the Adult dataset.
We observed that the overall trends of the DM values are similar to those observed in other similar approaches [25,51]. As the k-group size increases, more records become part of an EC, and thus records are less distinguishable from each other. The Irish dataset shows higher DM scores than the Adult dataset because it contains more distinct qid values, which gives more opportunities to group tuples. For both datasets, the DM score then stays steady, which shows low sensitivity to further growth of the k-group size (e.g., no more changes in the ECs).
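A small sketch can make the DM penalty concrete. It assumes the commonly used form of the discernibility metric, the sum of the squared equivalence class sizes, which may or may not be identical to Equation (7). The function name and the sample records are hypothetical.

```python
from collections import Counter


def discernibility_metric(anonymized_records):
    """Sum of squared equivalence class sizes (commonly used DM form).

    Records with identical generalized QID tuples form one equivalence class.
    """
    class_sizes = Counter(anonymized_records).values()
    return sum(size * size for size in class_sizes)


# Two equivalence classes of sizes 3 and 2 (illustrative values).
anonymized = [
    ("Artist", "30-39"),
    ("Artist", "30-39"),
    ("Artist", "30-39"),
    ("Engineer", "40-49"),
    ("Engineer", "40-49"),
]
dm_fine = discernibility_metric(anonymized)

# Generalizing further merges everything into one class of size 5.
dm_coarse = discernibility_metric([("Any", "Any")] * 5)
```

Merging the two classes into one raises the penalty from 13 to 25, which matches the reading that larger equivalence classes mean lower utility.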

Average Equivalence Class Size Metric (C_AVG)
C_AVG measures data utility based on the average size of the equivalence classes. An increase in the equivalence class sizes results in a higher data utility, as it is more difficult to identify an attribute among many identical attributes. In the k-anonymized dataset, the size of each equivalence class is greater than or equal to k. As a result, the quality of the data is lower if the size of all or part of the equivalence classes greatly exceeds the value of k. The C_AVG score is sensitive to the k-group size [53]. C_AVG for RDD* is calculated as in Equation (8).
The total number of records of RDD* is denoted as |RDD*|, whereas |EC| represents the total number of equivalence classes. Figure 15 presents the C_AVG results for an increasing group size k. The score decreases as k increases, indicating that the size of the created ECs is equal to the given k, that is, the ECs contain the number of generalized records that satisfies k-anonymity. As the value of k increases further, the ECs contain more records than k requires due to the higher generalization level, which keeps increasing the C_AVG score.
The trend of the C_AVG scores was similar to that of DM, as both metrics are calculated from the sizes of the equivalence classes and the number of records in the dataset. However, the C_AVG and DM scores are hard to compare directly across the Adult and Irish datasets: C_AVG takes the number of records into account, whereas the DM score does not [54]. As defined earlier, an equivalence class EC contains records with identical QID attributes in a table, so an increase in k increases the number of qid tuples in each EC. In Figure 14, the score for the Irish dataset is significantly different from that of the Adult dataset because the Adult dataset applies the highest generalization to the "Race" QID, whereas the Irish dataset applies it to "Field of Study". The numbers of distinct qid values for the "Race" and "Field of Study" QIDs differ by a large margin, as shown in Tables 5 and 6, respectively. An increase in k increases the GL accordingly; for the "Field of Study" QID in the Irish dataset, this changes a large number of attribute values and enlarges the ECs, which affects the C_AVG and DM scores more than "Race" does in the Adult dataset. In addition, there are 16 GLs in the Irish data and 17 in the Adult data. Once all the GLs are applied for k > 20, both datasets return the same utility, as observed for the C_AVG and DM scores for k > 15 in Figures 14 and 15.
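The following sketch shows one way to compute a score of this kind. It assumes the commonly used normalized form C_AVG = |RDD*| / (|EC| * k), built from the quantities named above (total records, number of equivalence classes, and k). Whether Equation (8) has exactly this form is an assumption, and the function name and sample data are hypothetical.

```python
from collections import Counter


def average_equivalence_class_size(anonymized_records, k):
    """Normalized average class size: |records| / (|classes| * k).

    A value of 1.0 means every equivalence class has exactly k records.
    """
    num_classes = len(Counter(anonymized_records))
    return len(anonymized_records) / (num_classes * k)


# Five records in two equivalence classes, with k = 2 (illustrative values).
anonymized = [
    ("Artist", "30-39"),
    ("Artist", "30-39"),
    ("Artist", "30-39"),
    ("Engineer", "40-49"),
    ("Engineer", "40-49"),
]
c_avg = average_equivalence_class_size(anonymized, k=2)
```

Here the score is 5 / (2 * 2) = 1.25, slightly above the ideal of 1.0 because one class holds more records than k requires.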

Information Loss (IL)
The Information Loss in Equation (9) is calculated with the method used in [18], which computes the amount of information lost after generalizing a specific value to a more general value in the anonymization process. Let r denote the records and qid the attributes; U_n,m and L_n,m are the upper and lower bounds of the n-th qid in the m-th r for the anonymized data, while the maximum and minimum for attribute m in the original dataset are denoted as MAX_m and MIN_m.
Anonymization via generalization and/or suppression is able to protect the privacy of individuals, but at the cost of information loss especially for high-dimensional data. We compare the results of our approach with the existing state-of-the-art approaches for IL such as Prima [24], and Anonylitics [23].
The results in Figure 16 show that our approach incurs less information loss than Prima and Anonylitics. We use a constant k size with a fixed number of records. We identify that, by using the score-and-update step and the semi-anonymized RDD (Section 4), our approach reduces the duplication of records and anonymizes duplicate records by representing their qid as a single record. Applying the same anonymization to identical records normalizes the information loss, thus yielding minimal information loss.
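To make the information loss measure more tangible, the sketch below computes a range-based loss for numeric attributes: for each generalized value, the width of its interval (upper bound minus lower bound) is divided by the full range of that attribute in the original data, and the ratios are averaged over all records and attributes. This follows the quantities described for Equation (9), but the exact normalization is an assumption, and the function name and sample values are hypothetical.

```python
def information_loss(intervals, domain):
    """Average normalized interval width over all records and QIDs.

    intervals: list of records, each a list of (lower, upper) bounds,
               one pair per numeric QID in the anonymized data.
    domain:    list of (minimum, maximum) per QID in the original data.
    """
    total = 0.0
    for record in intervals:
        for (lower, upper), (minimum, maximum) in zip(record, domain):
            # 0.0 means no generalization, 1.0 means the whole range.
            total += (upper - lower) / (maximum - minimum)
    return total / (len(intervals) * len(domain))


# Two QIDs: age in [20, 60] and weekly hours in [0, 100] (illustrative).
domain = [(20, 60), (0, 100)]
anonymized = [
    [(30, 40), (0, 50)],    # partially generalized record
    [(20, 60), (0, 100)],   # fully generalized record
]
loss = information_loss(anonymized, domain)
```

In this example the partially generalized record contributes 0.25 and 0.5, the fully generalized record contributes 1.0 twice, and the average is 0.6875.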

Conclusions and Future Work
This study proposes a generic framework for implementing subtree-based generalizations on Apache Spark. Our proposal is implemented using a series of RDDs that support more efficient partition management, improve memory usage by caching frequently referenced values, and support a better iteration process, which is much better suited to an iteration-intensive algorithm such as subtree generalization. Our proposed approach outperforms existing similar approaches on various data sizes and k group sizes. Our proposal not only reduces the complexity of the operation and improves the performance but also shows high data utility scores while maintaining the competitive level of data privacy required of any data anonymization technique. We plan to extend our study by further exploring the suitability of other data anonymization approaches for the Apache Spark platform. For instance, we plan to investigate a multi-dimensional data anonymization strategy such as Mondrian [53] to examine the support for recursive operations in Apache Spark.

Conflicts of Interest:
The authors declare no conflict of interest.