In this section, we present the system design of CCA, which makes the caching decision based on a computing cost model that takes execution contexts into account. We describe our implementation in Spark, and we present two algorithms through which the caching decision is made.
Table 1 is the glossary of notations used in this paper.
3.1. Cost Model and Caching Benefit
In the previous section, we noted the limitation of existing task-level computing time metrics for building the computing cost model [18]. Existing frameworks do not measure the computing time of individual blocks, and local task-level metrics cannot represent computing time from the perspective of the distributed environment. We therefore establish an operator-level metric that integrates operator times over all tasks to determine the computing cost of an operator in the execution flow. We split each task into individual block computations and measure the computing time of each block. Initially, we tried to estimate the operator computing cost as the maximum computing time over the blocks of the dataset. However, one major challenge with this initial estimate is that multiple tasks can be assigned per executor core. An executor can run as many tasks in parallel as it has cores, so when the number of tasks exceeds the number of available cores, one core may process several tasks. In this case, the cost of computing the dataset by the operator cannot be determined by the maximum block computing time. Our approach is instead to match the sum of the dataset's block computing times proportionally to the stage duration. Assume that the stage contains n tasks. In Equation (1), $T_i$ is the total computing time of the blocks generated by operator $o_i$, where $t_{i,j}$ is the computing time of block $b_{i,j}$:

$$T_i = \sum_{j=1}^{n} t_{i,j} \quad (1)$$
In frameworks that adopt the BSP model [19], such as Spark or Hadoop, a stage finishes only when its last task is completed, so the stage duration can be obtained as the time from the start of the first task to the end of the last task. In Equation (2), $C_i$, the estimated computing cost of $o_i$, is defined by matching the ratio of $T_i$ to the sum of the $T_j$ of all $m$ operators in the stage to the stage execution time $S$:

$$C_i = \frac{T_i}{\sum_{j=1}^{m} T_j} \cdot S \quad (2)$$

For the computing costs of operators reused at multiple stages, the averages of the measured computing costs are used.
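The proportional matching of the operator-level metric to the stage duration can be sketched as follows; the function name, data layout, and sample numbers are illustrative, not part of CCA's implementation:

```python
def estimate_operator_costs(block_times, stage_duration):
    """Estimate per-operator computing costs from measured block times.

    block_times: dict mapping operator id -> list of per-block computing
                 times measured across all tasks of the stage.
    stage_duration: wall-clock time of the stage (first task start to
                    last task end, as in the BSP model).
    """
    # Total computing time of the blocks generated by each operator.
    totals = {op: sum(times) for op, times in block_times.items()}
    grand_total = sum(totals.values())
    # Scale each operator's share of the summed block time to the
    # observed stage duration.
    return {op: stage_duration * t / grand_total for op, t in totals.items()}

costs = estimate_operator_costs(
    {"map": [2.0, 3.0, 1.0], "filter": [1.0, 1.0, 2.0]},  # toy numbers
    stage_duration=50.0,
)
# "map" contributed 6 of the 10 summed seconds, so it is assigned
# 6/10 of the 50 s stage; "filter" is assigned the remaining 4/10.
```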
The computing cost measured by the operator-level metric depends on the input file size and does not generalize to inputs of different sizes. We therefore build a computing cost model over input file sizes from the measured operator-level metrics: we measure the metrics for three representative input sizes and fit a linear trend through the three resulting computing costs. The model then predicts the cost of each operator for a given input size.
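A linear trend over the three measured (input size, cost) points can be fitted with an ordinary least-squares line; this sketch uses plain Python, and the sample sizes and costs are hypothetical:

```python
def fit_linear_trend(sizes, costs):
    """Least-squares fit of cost ~ a * size + b from measured points."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(costs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, costs))
    var = sum((x - mean_x) ** 2 for x in sizes)
    a = cov / var            # slope: cost growth per unit of input size
    b = mean_y - a * mean_x  # intercept: fixed overhead
    return a, b

# Three representative input sizes (GB) and measured operator costs (s).
a, b = fit_linear_trend([1.0, 2.0, 4.0], [10.0, 18.0, 34.0])

def predict(size):
    return a * size + b
```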
Based on the computing cost model, we define the caching benefit as the reduction in execution time obtained by caching the dataset. The caching benefit changes as iterations are performed, so it must be recalculated for each job. Let $a$ denote the nearest cached ancestor in the DAG, and let $N_i$ be the number of iterations of the dataset $d_i$ generated by $o_i$. The benefit $B_i$ from caching $d_i$ is calculated in Equation (3) as the computing cost along the path from $a$ to $o_i$, which is saved on every reuse after the first:

$$B_i = (N_i - 1) \cdot \sum_{o_k \in \mathrm{path}(a,\, o_i)} C_k \quad (3)$$
Most applications running in distributed environments are recurring applications [20]. Our approach therefore obtains the block computing times and the sizes of the datasets from a previous run.
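A sketch of this benefit computation, under the assumption that the benefit is the recomputation cost of the path from the nearest cached ancestor multiplied by the reuses after the first (function and variable names are illustrative):

```python
def caching_benefit(costs, path_from_ancestor, iterations):
    """Benefit of caching a dataset: the recomputation avoided on each
    reuse after the first.

    costs:              dict operator -> estimated computing cost
    path_from_ancestor: operators between the nearest cached ancestor
                        and the dataset's operator (inclusive)
    iterations:         number of jobs that reference the dataset
    """
    recompute_cost = sum(costs[op] for op in path_from_ancestor)
    return (iterations - 1) * recompute_cost

benefit = caching_benefit(
    {"map": 30.0, "filter": 20.0},        # costs from the cost model
    path_from_ancestor=["map", "filter"],  # hypothetical lineage
    iterations=3,                          # referenced by three jobs
)
# Two of the three uses avoid recomputing the 50 s path.
```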
To make a caching decision that maximizes the caching benefit, the benefits of all possible decisions should be compared. With k operators in the execution flow, there are $2^k$ possible caching decisions, so the cost of comparing all of them grows exponentially with k. Even if the caching decision selected by our approach is sub-optimal, the decision procedure must remain tractable and always complete. To address this problem, we propose a DAG clustering method, which groups nodes with the same iteration count in the job DAGs. Each node represents a dataset in the execution flow of an application. The iteration count of a dataset is defined as the number of job DAGs that reference the dataset; the operator that creates the dataset in the execution flow is executed as many times as this count.
Considering the execution process of the analytics framework, only one dataset per cluster needs to be cached. When two nodes in the job DAG are adjacent, the child node is always created from the parent node, and if both nodes have the same iteration count, they are referenced in the same job DAGs. Therefore, all datasets in a cluster are referenced in the same job DAGs. If a cluster contains a cached node, only the descendants of that node within the cluster need computation; given the characteristics of the DAG, only nodes below the nearest cached node are recomputed, so a single cached node per cluster suffices. Caching can thus be specified per cluster, with the entire DAG divided into subgroups. DAG clustering thereby narrows down the candidates and reduces the cost of selecting the caching decision. The dataset with the highest caching benefit in a cluster is selected as the dataset to be cached, and the caching benefit of the cluster is defined as the caching benefit of this selected dataset.
Figure 4 shows part of the KMeans workload's job DAGs. Our clustering method starts from the DAG's root node, and nodes with the same iteration count are grouped together. In the example, the first operator sequence is used from job 0 to job 2, and the dataset it generates is referenced three times; the second sequence is used from job 1 to job 2, and its dataset is referenced twice; the third sequence is used only in job 2. The datasets in all job DAGs are thus clustered into three clusters according to their number of uses.
3.2. Spark Implementation
Figure 5 gives the overall architecture of CCA. We implemented CCA in Spark, and the shaded components in Figure 5 are the main implementations. AppProfiler and CCA-CachingManager are implemented on the master node of distributed Spark, and TaskMonitor is implemented on each worker node. The other components, DAGScheduler, SparkContext, BlockManagerMaster, and BlockManager, are default components of Spark.
Before the application runs for the first time, the AppProfiler collects the information needed to build the computing cost model: the DAGs, the iteration count of each dataset, the size of each dataset, and the computing costs of blocks. Iteration counts and DAG information are obtained from the DAGScheduler. The distributed TaskMonitor collects the computing times of the data blocks of each task from the BlockManager and sends them to the BlockManagerMaster, which uses the collected information to determine the operators' computing costs and sends them to the AppProfiler. After profiling, the AppProfiler sends the application's cost model to the CCA-CachingManager.
The main algorithm for making a caching decision is implemented in CCA-CachingManager. When an application is submitted through the spark-submit script, Spark launches the driver with an object called SparkContext, which provides access to the various components of distributed Spark. One such component is SparkConf, which provides information such as the number of executors and each executor's memory capacity. CCA-CachingManager makes the caching decision by using the profiled results received from the AppProfiler and the configuration information from SparkConf.
3.3. Caching Decision Algorithm
In the previous section, we proposed a clustering method for the caching decision. The pseudocode for DAG clustering and for making the caching decision in CCA is given in Algorithms 1 and 2; both algorithms are implemented in the CCA-CachingManager component.
We formalize the procedure of clustering the DAG in Algorithm 1. As briefly described above, nodes with the same iteration count are clustered. The method recursively traverses the nodes of the DAG starting from the root. clusters is the set of clusters partially grouped from the DAG, and descs is a queue of nodes whose iteration counts must be compared with the nodes stored in cluster. An empty descs means that all nodes in the job DAG have been clustered and there are no more nodes to traverse. If descs is not empty, we consider whether to include the next node desc in cluster. The map iter stores the iteration count of every node, and a cluster's iteration count equals that of the nodes it contains. If the iteration counts of desc and cluster are the same, the child nodes of desc are added to descs and desc is included in cluster. Otherwise, clustering starts recursively on the sub-graph rooted at desc. Finally, the clustering results from all job DAGs are integrated to obtain the cluster set of the entire execution flow. Clustering is performed once after the application is launched.
We describe the procedure of making a caching decision in Algorithm 2. Clustering the nodes of the DAG and extracting the cluster set of the application take place before the first job starts. The map benefit stores the caching benefits of all clusters. cluster.dataset, the dataset to be cached in the cluster, is the dataset with the highest caching benefit in the cluster, and the caching benefit of the cluster is defined as that of cluster.dataset. Initially, the algorithm updates benefit according to the cost model and the remaining iteration counts of the datasets. All clusters are candidates for caching, and clusters are included in the caching decision in decreasing order of their caching benefit. If there is enough free memory to store the selected dataset, the dataset is included in caches, and whenever a new dataset is added to caches, the caching benefits of the clusters are updated.
CCA updates the remaining iteration counts of the datasets and performs the decision procedure before every job starts. A dataset included in caches is stored in the cache when it is first used in the job, while a dataset not included in caches is removed from the cache. The caching decision is made on the master node of distributed Spark while the worker nodes are running jobs, so the decision time for the next job overlaps with the running time of the previous job. The decision time for the first job overlaps with the interval between application launch and the submission of the first job.
Algorithm 1 A recursive algorithm for DAG clustering
Input: iter—map that stores the iteration count of the corresponding dataset or cluster
       root—top node in DAG
Output: clusters—a set of clustered nodes
1:  function clustering(iter, root)
2:      ▹ Recursively traverse all nodes in DAG
3:      cluster ← {root}
4:      descs ← children of root
5:      while descs ≠ ∅ do
6:          desc ← descs.pop()
7:          if iter[desc] == iter[cluster] then
8:              descs ← descs ∪ children of desc
9:              cluster ← cluster ∪ {desc}
10:         else
11:             ▹ Recurse on the sub-graph rooted at desc
12:             clusters ← clusters ∪ clustering(iter, desc)
13:         end if
14:     end while
15:     clusters ← clusters ∪ {cluster}
16:     return clusters
17: end function
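The recursive clustering described above might be rendered in plain Python as follows; the adjacency-map encoding of the DAG and the toy example are illustrative assumptions, and a cluster's iteration count is read from its first node, since all nodes in a cluster share one count:

```python
def clustering(iter_count, children, root):
    """Group DAG nodes that share the same iteration count.

    iter_count: dict node -> iteration count
    children:   dict node -> list of child nodes (the DAG)
    root:       node at which clustering starts
    """
    clusters = []
    cluster = [root]                       # current cluster under construction
    descs = list(children.get(root, []))   # nodes still to compare
    while descs:
        desc = descs.pop()
        if iter_count[desc] == iter_count[cluster[0]]:
            # Same count: absorb desc and keep traversing its children.
            descs.extend(children.get(desc, []))
            cluster.append(desc)
        else:
            # Different count: recurse on the sub-graph rooted at desc.
            clusters.extend(clustering(iter_count, children, desc))
    clusters.append(cluster)
    return clusters

# A chain a -> b -> c where a and b are referenced three times, c twice.
dag = {"a": ["b"], "b": ["c"], "c": []}
counts = {"a": 3, "b": 3, "c": 2}
result = clustering(counts, dag, "a")
# Expect two clusters: one with a and b, one with c alone.
```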
Algorithm 2 A baseline algorithm for making a caching decision
Input: M—size of total cache memory
       U—size of used cache memory
       benefit—map that stores caching benefit of the corresponding cluster
       clusters—a set of clustered nodes
Output: caches—a set of candidates to cache
1:  function make_decision(M, U, benefit, clusters)
2:      update(benefit)
3:      caches ← ∅
4:      for all cluster in clusters, in decreasing order of benefit[cluster] do
5:          if U + size(cluster.dataset) ≤ M then
6:              caches ← caches ∪ {cluster.dataset}
7:              U ← U + size(cluster.dataset)
8:              update(benefit)
9:          end if
10:     end for
11:     return caches
12: end function
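The greedy selection in the caching decision could be sketched like this; the dataset sizes, the flat benefit map, and the example numbers are illustrative, and the per-selection benefit update is simplified to a comment:

```python
def make_decision(memory, used, benefits, sizes):
    """Greedy caching decision under a memory budget.

    memory:   total cache capacity
    used:     cache space already occupied
    benefits: dict dataset -> current caching benefit of its cluster
    sizes:    dict dataset -> dataset size
    """
    caches = set()
    # Visit candidate datasets in decreasing order of caching benefit.
    for ds in sorted(benefits, key=benefits.get, reverse=True):
        if used + sizes[ds] <= memory:
            caches.add(ds)
            used += sizes[ds]
            # In CCA, the benefits of the remaining clusters would be
            # recomputed here after each selection.
    return caches

chosen = make_decision(
    memory=10, used=0,
    benefits={"d1": 100.0, "d2": 60.0, "d3": 40.0},
    sizes={"d1": 6, "d2": 6, "d3": 3},
)
# d1 fits (6), d2 does not (6 + 6 > 10), d3 still fits (6 + 3 <= 10).
```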