Exploiting Machine Learning For Improving In-memory Execution of Data-intensive Workflows on Parallel Machines

Abstract: Workflows are largely used to orchestrate complex sets of operations required to handle and process huge amounts of data. Parallel processing is often vital to reduce execution time when complex data-intensive workflows must be run efficiently, and at the same time in-memory processing can bring important benefits to accelerate execution. However, optimization techniques are necessary to fully exploit in-memory processing, avoiding performance drops due to memory saturation events. This paper proposes a novel solution, called Intelligent In-memory Workflow Manager (IIWM), for optimizing the in-memory execution of data-intensive workflows on parallel machines. IIWM is based on two complementary strategies: 1) a machine learning strategy for predicting the memory occupancy and execution time of workflow tasks; 2) a scheduling strategy that allocates tasks to a computing node, taking into account the (predicted) memory occupancy and execution time of each task, and the memory available on that node. The effectiveness of the machine learning-based predictor and the scheduling strategy is demonstrated experimentally using as a testbed Spark, a high-performance Big Data processing framework that exploits in-memory computing to speed up the execution of large-scale applications. In particular, two synthetic workflows have been prepared for testing the robustness of IIWM in scenarios characterized by a high level of parallelism and a limited amount of memory reserved for execution. Furthermore, a real data analysis workflow has been used as a case study, for better assessing the benefits of the proposed approach. Thanks to a high accuracy in predicting the resources used at runtime, IIWM was able to avoid disk writes caused by memory saturation, outperforming a traditional strategy in which only dependencies among tasks are taken into account. Specifically, IIWM achieved up to a 31% and a 40% reduction of makespan, and a performance improvement up to 1.45x and 1.66x, on the synthetic workflows and the real case study, respectively.

Specific scenarios, related to the presence of a high level of parallelism and a limited amount of memory reserved for execution, have been considered. The effectiveness of the proposed approach has been further confirmed through the execution of a real data mining workflow as a case study. We carried out an in-depth comparison between IIWM and a traditional blind scheduling strategy, which only considers workflow dependencies for the parallel execution of tasks. The proposed approach showed to be the most suitable solution in all evaluated scenarios, outperforming the blind strategy thanks to a high accuracy in predicting the resources used at runtime, which leads to the minimization of disk writes caused by memory saturation.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 describes the proposed system. Section 4 presents and discusses the experimental results. Section 5 concludes the paper.

The problem addressed in this study consists in the optimization of the in-memory execution of data-intensive workflows, evaluated in terms of makespan (i.e., the total time required to process all given tasks) and application performance. The main reason behind the drop in performance in such workflows is the necessity of swapping/spilling data to disk when memory saturation events occur. To cope with this issue, we propose an effective way of scheduling a workflow that minimizes the probability of memory saturation, while maximizing in-memory computing and thus performance.

A workflow W can be represented using a DAG, described by a set of tasks T = {t_1, t_2, ..., t_n} (i.e., vertices) and a set of dependencies among them A ⊆ (T × T) = {a_1, ..., a_m}, with a_i = (t_i, t_j), t_i ∈ T, t_j ∈ T (i.e., directed edges). Specifically, data dependencies (i.e., all the input data of a task have already been made available) have to be considered rather than control dependencies (i.e., all predecessors of a task must be terminated before it can be executed), as we refer to data-intensive workflows [12].

Formally, given a set of q computing resources R = {r_1, ..., r_q}, workflow scheduling can be defined as the mapping T → R from each task t ∈ T to a resource r ∈ R, so as to meet a set of specified constraints which influence the choice of an appropriate scheduling strategy [13]. Workflow scheduling techniques are often aimed at optimizing several factors, including makespan and overall cost, which in turn depend on data transfer and compute costs [14]. In this study, a multi-objective optimization has been applied, jointly minimizing execution time and memory saturation. This is achieved by using a scheduling strategy that exploits a regression model aimed at predicting the behavior of a given workflow, in terms of resource demand and execution time (see Section 3). For the reader's convenience, Table 1 shows the meaning of the main symbols used in the paper.

Table 1. Meaning of the main symbols used in the paper.

T = {t_1, ..., t_n}                           Set of workflow tasks.
A ⊆ (T × T) = {a_1, ..., a_m}                 Dependencies: a_i = (t_i, t_j), t_i ∈ T, t_j ∈ T.
d_t                                           Description of the dataset processed by task t.
W = (T, A)                                    Workflow.
N_in(t) = {t' ∈ T | (t', t) ∈ A}              In-neighbourhood of task t.
N_out(t) = {t' ∈ T | (t, t') ∈ A}             Out-neighbourhood of task t.
M                                             Regression prediction model.
S = <s_1, ..., s_k>                           List of stages, with s_i ⊆ T such that t_x ∥ t_y (i.e., they can run concurrently) for all t_x, t_y ∈ s_i.
C                                             Maximum amount of memory available on a computing node.
C_s = C − Σ_{t ∈ s} M.predict_mem(t, d_t)     Residual capacity of a stage s.
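To make the notation of Table 1 concrete, the following minimal Python sketch shows one possible encoding of the workflow model. It is ours, not the paper's code: the predictor M is stubbed with fixed values, since the actual regression model is described in Section 3.

```python
from dataclasses import dataclass, field

# Minimal sketch (ours, not the paper's code) of the workflow model W = (T, A)
# and of the symbols in Table 1.

@dataclass
class Task:
    name: str
    dataset: dict = field(default_factory=dict)  # d_t: description of the processed dataset

@dataclass
class Workflow:
    tasks: list[Task]                  # T = {t_1, ..., t_n}
    deps: set[tuple]                   # A: data dependencies (t_i, t_j)

    def in_neighbourhood(self, t):     # N_in(t) = {t' | (t', t) in A}
        return {u for (u, v) in self.deps if v == t}

    def out_neighbourhood(self, t):    # N_out(t) = {t' | (t, t') in A}
        return {v for (u, v) in self.deps if u == t}

class StubPredictor:                   # stands in for the regression model M
    def predict_mem(self, t, d_t):     # predicted peak memory occupancy (MB)
        return 1024.0
    def predict_time(self, t, d_t):    # predicted execution time (s)
        return 60.0

# residual capacity C_s of a stage s (a set of tasks), given node capacity C
def residual_capacity(C, stage, M):
    return C - sum(M.predict_mem(t, t.dataset) for t in stage)
```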

2. Related work

State-of-the-art techniques aimed at improving the performance of data-intensive applications can be divided into two main categories: analytical-based and machine learning-based. For each category, the main proposed solutions and their differences with respect to our technique are discussed.

2.1. Analytical-based techniques

Techniques in this category use information collected at runtime and statistics in order to tune a Spark application, improving its performance as follows:

• Choosing the serialization strategy for caching RDDs in RAM, based on previous statistics collected on different working sets, such as memory footprint, CPU usage, RDD size, serialization costs, etc. [15,16].
• Dynamically adapting resources to data storage, using a feedback-based mechanism with real-time monitoring of the memory usage of the application [17].
• Scheduling jobs by dynamically adjusting concurrency through a feedback-based strategy: taking into account memory usage via garbage collection, network I/O, and Spark RDD lineage information, it is possible to choose the number of tasks to assign to an executor [18,19].

The aforementioned works use different strategies for improving the in-memory computing of Spark, exploiting static or dynamic techniques that introduce some knowledge into the choice of configuration parameters. However, no prediction models are employed, which may lead to unpredictable behaviors.

IIWM, instead, uses a regression model to estimate a set of information about a running Spark application, exploiting it to optimize in-memory execution. Moreover, unlike real-time adapting strategies, which use a feedback-based mechanism that continuously monitors the execution, the IIWM model is trained offline, achieving fast and accurate predictions when used for inferring the resource demand of each task in a given workflow.

2.2. Machine learning-based techniques

Techniques in this category exploit machine learning models for an informed performance improvement, which can be beneficial for the execution of data-intensive applications, especially in the context of HPC systems [8].

Several techniques use collaborative filtering to identify how well an application will run on a computing node. For instance, Quasar [8] uses classification techniques based on collaborative filtering to determine the characteristics of the running application when allocating resources and assigning tasks.

When submitted, a new application is shortly profiled, and the collected information is combined with previously collected data to perform resource allocation and assignment if required, using a single model for the estimation. Adapting this technique to Spark can help to assign tasks to computing nodes within the memory constraints, avoiding exceeding the capacity and thus causing the spilling of data to disk. Another approach based on collaborative filtering has been proposed by Llull et al. [9]. In this case, the task co-location problem is modelled as a cooperative game, and a game-theoretic framework, namely Cooper, is proposed for improving resource usage. The algorithm builds pairwise coalitions, as stable marriages, to assign an additional task to a host based on its available memory, and the Spark default scheduler is adopted to assign tasks. In particular, a predictor receives performance information collected offline and estimates which co-runner is better, in order to find stable co-locations.

Moving away from collaborative filtering, Marco et al. [10] present a mixture-of-experts approach to model the memory behavior of Spark applications. It is based on a set of memory models (i.e., linear regression, exponential regression, Napierian logarithmic regression) trained on a wide variety of applications. At runtime, an expert selector based on k-nearest neighbours (kNN) is used to choose the model that best describes memory behavior, in order to determine which tasks can be assigned to the same host for improving throughput. The memory models and the expert selector are trained offline on different working sets, recording the memory used by a Spark executor through the Linux "/proc" interface. Finally, the scheduler uses the selected model to determine how much memory is required for an incoming application, for improving server usage and system throughput.

IIWM differs from this approach in several respects:

• It focuses on data-intensive workflows, while in reference [10] general workloads are addressed.

• It uses high-level information for describing an application (e.g., task and dataset features), while in reference [10] low-level system features are exploited, such as the cache miss rate and the number of blocks sent, collected by running the application on a small portion (100 MB) of the input data.

• It proposes a more general approach, since the approach proposed in [10] is only applicable to applications whose memory usage is a function of the input size.

3. The proposed system

The Intelligent In-memory Workflow Manager (IIWM) is based on three main steps: i) execution monitoring and dataset creation; ii) training of a regression model for predicting the memory occupancy and execution time of workflow tasks; iii) workflow scheduling. In this way, efficient execution can be ensured, while minimizing the risk of memory saturation.

In the following sections, a detailed description of each step is provided.

3.1. Execution monitoring and dataset creation
The first step in IIWM consists of monitoring the execution of different tasks on several input datasets with variable characteristics, in order to build a transactional dataset for training the regression model. The proposed solution was specifically designed for supporting the efficient execution of data analysis tasks, which are used in a wide range of data-intensive workflows. Specifically, it focuses on three classes of data mining tasks: classification tasks for supervised learning, clustering tasks for unsupervised learning, and association rule discovery. Using Spark as a testbed, data mining algorithms from the MLlib library have been used, including Decision Tree, Naive Bayes, and Support Vector Machines. Using the Spark monitoring APIs, we monitored the execution of several MLlib algorithms on different input datasets, covering the main data mining tasks, i.e., classification, clustering, and association rules (a minimal monitoring sketch is given after the following list). The goal of this process is the creation of a transactional dataset for training the regression model, which contains the following information:

• The description of the input dataset, in terms of the number of rows, columns, categorical columns, and the overall dataset size.

• The peak memory usage (both execution and storage) and the execution time, which represent the three target variables to be predicted by the regressor. In order to obtain more significant data, the metrics were aggregated on median values by performing ten executions per task.
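As an illustration of this monitoring step, the following sketch polls Spark's documented monitoring REST API to record the peak storage memory used by the executors during a run. It is our own reconstruction, not the paper's tooling: the polling loop, the aggregation, and the is_running callback are illustrative assumptions.

```python
import time
import requests

# Hedged sketch: record the peak storage memory used by Spark executors during
# a task run by polling the Spark monitoring REST API. The
# /applications/{app-id}/executors endpoint and its "memoryUsed" field belong
# to Spark's documented REST API; the rest of this function is illustrative.

def monitor_storage_peak(app_id, is_running, ui="http://localhost:4040", interval=0.5):
    peak = 0
    while is_running():
        executors = requests.get(f"{ui}/api/v1/applications/{app_id}/executors").json()
        used = sum(e.get("memoryUsed", 0) for e in executors)  # bytes used for caching
        peak = max(peak, used)
        time.sleep(interval)
    return peak  # peak storage memory observed for this run
```

Peak execution memory can be collected analogously, e.g., from the per-task peakExecutionMemory metric exposed by the stages endpoint of the same API.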

For the sake of clarity, Table 3 shows a sample of the dataset described above. Starting from 20 available datasets, we divided them into two partitions, used for training and testing respectively. Afterwards, an oversampling procedure was performed, aimed at increasing the number of available datasets. In particular, features were selected as follows (a minimal selection sketch is given after the list):

• For classification datasets, we selected the k best features using univariate tests: one suited to categorical labels (classification problems), and a correlation-based univariate linear regression test for real labels (regression problems).

• For clustering datasets, we used a correlation-based test to maintain the k features with the smallest probability of being correlated with the others.

• For association rule discovery datasets, no feature selection is required, as the number of columns refers to the average number of items in the different transactions.
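As an example of the univariate selection mentioned above, the following sketch uses scikit-learn's SelectKBest with the f_regression score (a correlation-based univariate linear regression test); the synthetic data and the value of k are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Hedged sketch: univariate, correlation-based feature selection, as could be
# used in the dataset-preparation step described above.

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 30))             # 500 rows, 30 columns
y = X[:, 0] * 2.0 + rng.normal(size=500)   # real-valued labels (regression)

selector = SelectKBest(score_func=f_regression, k=10)
X_reduced = selector.fit_transform(X, y)   # keep the 10 most correlated features
print(X_reduced.shape)                     # (500, 10)
```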

The described procedure has been applied separately to the training and test partitions, so as to avoid the introduction of bias into the evaluation process. Specifically, the number of datasets in the training and test partitions increased from 15 to 260 and from 5 to 86, respectively.

Subsequently, we fed these datasets to the MLlib algorithms, obtaining two final transactional datasets of 1309 and 309 monitored executions, used for training and testing the regressor, respectively.

3.2. Prediction model

Among 20 trained models, initialized with different random states, we selected the best one by maximizing the following objective function:

O = R̄² − MAE_norm

whose goal is to choose the model that best explains the variance of the data, while minimizing the forecasting error. This function jointly considers the adjusted determination coefficient (R̄²), which guarantees robustness with respect to the addition of useless variables to the model compared to the classical R² score, and the mean absolute error (MAE), normalized with respect to its maximum (MAE_norm).

The described model has been developed in Python using the scikit-learn library, and evaluated against the test set of 309 unseen executions obtained as described in Section 3.1.2. Thanks to the combination of different models, the ensemble technique showed to be very well suited for this task, leading to good robustness against outliers and a high forecasting accuracy, as shown in Figure 2.
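A minimal sketch of this selection step is shown below. The ensemble class (a random forest) and the normalization of the MAE by the maximum target value are our assumptions; the criterion itself follows the objective function defined above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor  # illustrative model family
from sklearn.metrics import r2_score, mean_absolute_error

# Hedged sketch of the model-selection criterion O = adjusted R^2 minus the
# normalized MAE, applied to 20 models trained with different random states.

def objective(y_true, y_pred, n_features):
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)  # adjusted R^2
    mae = mean_absolute_error(y_true, y_pred)
    return adj_r2 - mae / np.max(np.abs(y_true))  # MAE normalized by the maximum

def select_best(X_train, y_train, X_val, y_val, n_models=20):
    best, best_score = None, -np.inf
    for seed in range(n_models):
        model = RandomForestRegressor(random_state=seed).fit(X_train, y_train)
        score = objective(y_val, model.predict(X_val), X_train.shape[1])
        if score > best_score:
            best, best_score = model, score
    return best
```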

3.3. Workflow scheduling
The prediction model described in Section 3.2 can be exploited to forecast the amount of memory that will be needed to execute a given task on a target computing node, as well as its duration, based on the task features listed in Section 3.1. These predictions are then used within the scheduling strategy described in the following, whose goal is to avoid swapping to disk due to memory saturation, in order to improve application performance and makespan through a better use of in-memory computing.

The results discussed below refer to a static scheduling problem, as the scheduling plan is generated before the execution. In typical static scheduling, the workflow system has to predict the execution load of each task accurately, using heuristic-based methods [21]. Likewise, in the proposed method, the execution load of each task of a given workflow is predicted by the model trained on past executions.

Moreover, we investigated how workflow tasks can be scheduled and run on a single computing node, but this approach can be easily generalized to a multi-node scenario. For example, a data-intensive workflow can be decomposed into multiple sub-workflows to be run on different computing nodes, according to their features and data locality. Each sub-workflow is then scheduled locally to the assigned node using the proposed strategy.

In IIWM, we modelled the scheduling problem as an offline Bin Packing (BP) problem. This is a well-known combinatorial optimization problem, in which a set of weighted items must be assigned to bins without exceeding the capacity c of each bin, while minimizing the number of used bins. The problem is NP-complete, and a lot of effort went into finding fast algorithms with near-optimal solutions. We adapted the classical problem to our purposes as follows:

• An item is a task to be executed.

• A bin identifies a stage, i.e., a set of tasks that can be run in parallel.

• The capacity of a bin is the maximum amount C of memory available on a computing node. When assigning a task to a stage s ∈ S, the residual available memory of s is indicated with C_s.

• The weight of an item is the memory occupancy estimated by the prediction model. In the case of the Spark testbed, it is the maximum of the execution and storage memory, in order to model the peak memory requirement. The predicted execution time is instead used for selecting the stage to be assigned when memory constraints hold for multiple stages.

With respect to the classical BP problem, two changes were introduced:

• All workflow tasks have to be executed, so the capacity of a stage may still be exceeded if a task takes up more memory than the available one.

• The assignment of a task t to a stage s is subject to dependency constraints: if a dependency exists between t_i and t_j, then the stage of t_i has to be executed before that of t_j.

To solve the BP problem modelled as described above and produce the final scheduling plan, IIWM uses a heuristic algorithm (Algorithm 1), organized in two parts. The first part (lines 1-23) starts with the initialization of an empty list of stages S, which is filled according to a dictionary Q that stores the residual in-degree of each task in the DAG; Q is used to maintain the set T_free of schedulable tasks, i.e., tasks whose dependencies are all satisfied. Given the DAG-based workflow representation, there will always exist a task t ∈ T with a zero in-degree not yet scheduled, unless the set T is empty. Afterwards, the task with the highest predicted memory occupancy is selected from T_free in order to be scheduled (line 8). At this point, a list of candidate stages (S_sel) for the selected task is identified, according to the peak memory occupancy forecasted by the prediction model M (lines 9-10). In particular, a stage s_i belongs to S_sel if it satisfies the following conditions:

• The residual capacity C_{s_i} of the selected stage s_i is not exceeded by the addition of the task t.

• There does not exist a dependency between t and any task t' belonging to s_i or to any subsequent stage (s_{i+1} ∪ · · · ∪ s_k), where a dependency (t', t)^n is identified by a path of length n > 0.

If one or more candidate stages S_sel exist (line 11), the best one is chosen based on the minimum marginal increase. Specifically, for each of these stages, the expected increase of the execution time is estimated (lines 12-13), assigning the task t to the stage s with the lowest value (lines 14-16).

Otherwise (line 17), a newly created stage is allocated for t and added to the list S (lines 18-21). Once the task t is assigned to the stage s, the residual capacity C_s is updated (lines 15, 20). Then, the residual in-degree of every task in the out-neighbourhood of t is decremented by updating the dictionary Q (line 22), so as to allow the assignment of these tasks in the next iterations. Finally, the assigned task t is removed from the set of workflow nodes T (line 23).

ALGORITHM 1: IIWM SCHEDULER

S_sel ← {s_i ∈ S | mem_t ≤ C_{s_i} and ∄ (t', t)^n ∈ A, n > 0, ∀ t' ∈ s_i ∪ s_{i+1} ∪ · · · ∪ s_k}

The second part of the algorithm consists of a consolidation phase, which tries to merge each stage with subsequent stages if the available capacity C is not exceeded. For each movable stage s_i (line 27), another stage s_j from S is searched among the subsequent ones, such that its residual capacity is enough to enable the merging with s_i (lines 28-30). The merging between s_i and s_j is performed by assigning to s_j each task of s_i (line 31), finally removing s_i from S (line 32). In the end, the list of stages S built by the scheduler is returned as output. Given this scheduling plan, the obtained stages will be executed in sequential order, while all the tasks in a stage will run concurrently.
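The scheduling logic of Algorithm 1 can be summarized by the following Python sketch. It is a re-implementation from the textual description above, not the authors' code: the predictor M is replaced by dictionaries of predicted peak memory and execution time, a stage's execution time is assumed to be the maximum predicted time among its concurrent tasks, and the movability condition of the consolidation phase is inferred from the dependency rules.

```python
from collections import defaultdict

# Hedged re-implementation sketch of Algorithm 1 (not the authors' code).

def iiwm_schedule(tasks, deps, mem, time, C):
    succ, indeg = defaultdict(set), {t: 0 for t in tasks}
    for u, v in deps:
        succ[u].add(v)
        indeg[v] += 1

    def descendants(u):                  # tasks reachable from u through A
        seen, stack = set(), [u]
        while stack:
            for w in succ[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    stages, residual, T = [], [], set(tasks)
    while T:
        free = [t for t in T if indeg[t] == 0]              # T_free
        t = max(free, key=lambda x: mem[x])                 # heaviest task first
        cand = [i for i in range(len(stages))
                if mem[t] <= residual[i]                    # capacity condition
                and not any(t in descendants(u)             # no predecessor of t
                            for j in range(i, len(stages))  # in s_i or later
                            for u in stages[j])]
        if cand:
            # minimum marginal increase of the stage execution time
            i = min(cand, key=lambda c: max(0, time[t] - max(time[u] for u in stages[c])))
            stages[i].append(t)
            residual[i] -= mem[t]
        else:                                               # allocate a new stage;
            stages.append([t])                              # residual may become
            residual.append(C - mem[t])                     # negative (modified BP)
        for w in succ[t]:                                   # unlock successors in Q
            indeg[w] -= 1
        T.remove(t)

    # consolidation: merge a stage into a later one when capacity and
    # dependencies allow, reducing the number of sequential stages
    i = 0
    while i < len(stages):
        merged = False
        for j in range(i + 1, len(stages)):
            dep_ok = not any(w in stages[k] for u in stages[i]
                             for w in descendants(u) for k in range(i + 1, j + 1))
            if dep_ok and sum(mem[u] for u in stages[i]) <= residual[j]:
                stages[j].extend(stages[i])
                residual[j] -= sum(mem[u] for u in stages[i])
                del stages[i], residual[i]
                merged = True
                break
        if not merged:
            i += 1
    return stages

# toy usage: t0 -> t2, node capacity C = 12 (MB)
print(iiwm_schedule(["t0", "t1", "t2"], {("t0", "t2")},
                    {"t0": 5, "t1": 4, "t2": 6},
                    {"t0": 10, "t1": 3, "t2": 7}, 12))
# -> [['t0', 't1'], ['t2']]
```

In the toy run, t1 joins the stage of t0 (their combined weight fits within C), while t2 must wait for its predecessor t0 and is placed in a second stage.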

Compared to a blind strategy in which maximum parallelism is achieved by running in parallel all the tasks not subjected to dependencies (referred to as Full-Parallel in our experiments), IIWM can reduce both the delay of parallelization (ε_p), due to context switching and process synchronization, and the delay of swapping/spilling to disk (ε_s), due to I/O operations. The delay ε_p is present in all scheduling strategies whenever two or more tasks are run concurrently, while ε_s is present only when a memory saturation event occurs. Given ε = ε_p + ε_s, IIWM mainly reduces ε_s, which is the main factor behind the drop in performance in terms of execution time, due to the slowness of accessing secondary storage.

As far as the Spark framework is concerned, the proposed strategy is effective for making the most of the default storage level, i.e., MEMORY_AND_DISK: at each internal call of the cache() method, data is kept in memory as long as this resource is available, using the disk otherwise. In this respect, IIWM can reduce the actual persistence of data on disk by better exploiting in-memory computing.
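For reference, the behavior discussed above corresponds to the following minimal PySpark snippet, which persists a DataFrame with the MEMORY_AND_DISK storage level; the application name and data size are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

# Minimal sketch: partitions that do not fit in the unified memory region are
# spilled to disk instead of being dropped and recomputed.

spark = SparkSession.builder.appName("iiwm-demo").getOrCreate()
df = spark.range(10_000_000).withColumnRenamed("id", "value")

df.persist(StorageLevel.MEMORY_AND_DISK)  # explicit; cache() uses the default level
df.count()  # materializes the cache; the "Storage" tab of the Spark UI shows disk use
```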

4. Experimental evaluation

This section presents an experimental evaluation of the proposed system, aimed at optimizing the in-memory execution of data-intensive workflows. We experimentally assessed the effectiveness of IIWM using Apache Spark 3.0.1 as a testbed. In particular, we generated two synthetic workflows for analyzing different scenarios, and also assessed the benefits coming from the use of IIWM through a real data mining workflow used as a case study.

In order to provide significant results, each experiment was executed ten times, and the average metrics with standard deviations are reported. In particular, for each experiment, we evaluated the accuracy of the regression model in predicting memory occupancy and execution time. We also evaluated the ability of IIWM to improve application performance, taking into account two different aspects:

• Execution time: let m_1 and m_2 be the makespans of two different executions. If m_2 < m_1, we can compute the improvement on makespan (m_imp) and on application performance (p_imp) as follows: m_imp = (m_1 − m_2)/m_1 and p_imp = m_1/m_2.

• Disk usage: we used the on-disk usage metric, which measures the amount of disk usage by jointly considering the volume and the duration of disk writes. Formally, given a sequence of disk writes w_1, ..., w_k, let τ_i^s, τ_i^e ∈ T be the start and end time of the write w_i, respectively. Let also W : T → R be a function representing the amount of megabytes written to disk over time T. We define the on-disk usage as Σ_{i=1}^{k} ∫_{τ_i^s}^{τ_i^e} W(τ) dτ.
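A small sketch of these two metrics is given below; the trapezoidal approximation of the on-disk usage integral is our choice, as the text only defines the integral itself.

```python
import numpy as np

# Hedged sketch of the evaluation metrics defined above.

def makespan_improvement(m1, m2):
    return (m1 - m2) / m1       # m_imp, e.g. 0.31 -> 31% reduction

def performance_improvement(m1, m2):
    return m1 / m2              # p_imp, e.g. 1.45 -> 1.45x speedup

def on_disk_usage(writes):
    """writes: list of (timestamps, megabytes) pairs, one per disk write,
    where `megabytes` samples W(tau) at the corresponding timestamps."""
    return sum(np.trapz(mb, ts) for ts, mb in writes)

print(makespan_improvement(10.0, 6.9))     # ~0.31
print(performance_improvement(10.0, 6.9))  # ~1.45
```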

We firstly evaluated our approach on two complex synthetic data analysis workflows, where the Full-Parallel approach showed its limitations due to a high degree of parallelism. The dependencies in these workflows should be understood as execution constraints. For instance, clustering has to be performed before classification for adding labels to an unlabelled dataset, or a classification task is performed after the discovery of association rules for user classification purposes.

The first test has been carried out on a synthetic workflow with 42 nodes. Table 6 provides a detailed description of each task in the workflow, while their dependencies are shown in Figure 4.
The first step is to predict the memory occupancy and execution time of each task of the workflow: the regression model was able to accurately estimate the peaks of storage and execution memory, as well as the duration, as shown in Table 7.
We firstly considered a configuration characterized by 14 GB available for running the workflow, of which up to 60% is usable due to the Spark unified memory model. Table 8 shows an execution example with IIWM, focusing on its main steps: i) the scheduling of tasks based on their decreasing memory weight; ii) the allocation of new stages; iii) the exploitation of the estimated execution time while computing the marginal increase. This last aspect can be clearly observed at iteration 17, where task t_17 is assigned to stage s_7, which presents a marginal increase equal to zero. This is the best choice compared to the other candidate stage (s_6), whose execution time would be increased by 12,496 milliseconds by the assignment of t_17, with a degradation of the overall makespan.
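The marginal-increase rule behind this choice can be made explicit. Under the assumption (also used in the scheduler sketch above) that the execution time of a stage is the maximum predicted time among its concurrent tasks, the increase caused by adding a task t to a stage s is:

Δ(t, s) = max(0, M.predict_time(t, d_t) − max_{t' ∈ s} M.predict_time(t', d_{t'}))

At iteration 17, Δ(t_17, s_7) = 0 because s_7 already contains a task with a longer predicted duration, while Δ(t_17, s_6) = 12,496 ms.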

Table 8 (excerpt), iteration 17: assign t_17 to s_7, unlocking {t_25}.

At the end of the process, a consolidation step is exploited for optimizing throughput. These results can also be clearly seen in Table 9, which shows the scheduling plan produced by the IIWM scheduler, together with some statistics about execution times and disk usage.

In particular, the curves representing disk writes over time are shown in Figure 5. With different sizes of available memory, the Full-Parallel approach showed higher and higher execution times and disk writes as memory decreased, while IIWM was able to adapt the execution to the available resources, as shown in Figure 6, finding a good trade-off between the maximization of parallelism and the minimization of the memory saturation probability. At the extremes, with unlimited available memory (or at least more than that required to run the workflow), IIWM performs as a fully concurrent strategy, producing the same scheduling as Full-Parallel.

The second synthetic workflow consists of the 27 tasks described in Table 10 and their dependencies, shown in Figure 7. This scenario is characterized by highly heavy tasks and very low resources, where the execution of a single task can exceed the available memory. In particular, task t_18 has an estimated peak memory occupancy higher than the available Spark unified memory of 5413.8 MB (corresponding to a heap size of 9.5 GB): this brings the IIWM scheduling algorithm to allocate the task to a new stage, but memory will be saturated anyway.

In such a situation, data spilling to disk cannot be avoided, but IIWM tries to minimize the number of bytes written and the duration of I/O operations. Even in this scenario, the prediction model achieved very accurate results, shown in Table 11, confirming its forecasting abilities. Figure 8 shows disk occupancy during the execution. As can be seen, even IIWM cannot avoid data spilling, although its disk usage was much lower than that of Full-Parallel, considering both the peak value and the duration of writes. Finally, Table 12 shows the statistics about disk usage and execution times.

Table 12 (excerpt). Full-Parallel scheduling plan: (t_0), (t_1 t_2 t_3 t_4), (t_5 t_6 t_7 t_8 t_9 t_10 t_11 t_12), ...

The real case study is a data mining workflow operating on an unlabelled dataset. Figure 9 shows a representation of the workflow, designed by a visual language, together with the tasks of the workflow that will be analyzed. The results are detailed in Table 13, which shows a boost in execution time of almost 1.66x (p_imp) and a 40% time reduction (m_imp) with respect to Full-Parallel.

Table 13 (excerpt). IIWM scheduling plan: (t_10 t_12 t_14 t_16 t_20 t_22 t_24 t_26 t_28), (t_18 t_0 t_2 t_4 t_6 t_8 t_11 t_13 t_15 t_17 t_21 t_23 t_25 t_27 t_29), (t_19 t_1 t_3 t_5 t_7 t_9) — 3 stages, makespan 6.88 ± 0.1, with all disk-usage statistics equal to zero.

The general trends observed when varying the amount of available resources are also confirmed with respect to the previous examples, as shown in Figure 11.

5. Conclusions

Nowadays, data-intensive workflows are widely used in several application domains, such as bioinformatics, astronomy, and engineering. This paper introduced and evaluated a system, named Intelligent In-memory Workflow Manager (IIWM), aimed at optimizing the in-memory execution of data-intensive workflows on high-performance computing systems. Experimental results suggest that, by jointly using a machine learning model for performance estimation and a suitable scheduling strategy, the execution of data-intensive workflows can be significantly improved with respect to state-of-the-art blind strategies. In particular, the main benefits of IIWM emerged when it was applied to workflows with a high level of parallelism, where a significant reduction of memory saturation was obtained. Therefore, it can be used effectively when multiple tasks have to be executed on the same computing node, for example when they need to be run on multiple immovable datasets located on a single node, or due to other hardware constraints. In these cases, an uninformed scheduling strategy will likely exceed the available memory, causing disk writes and therefore a drop in performance. The proposed approach also showed to be a very suitable solution in scenarios characterized by a limited amount of memory reserved for execution, thus finding possible applications in data-intensive IoT workflows, where data processing is performed on constrained devices located at the network edge. IIWM has been evaluated in different scenarios concerning both synthetic and real data mining workflows, using Apache Spark as a testbed. Specifically, by accurately predicting the resources used at runtime, our approach achieved up to a 31% and a 40% reduction of makespan, and a performance improvement up to 1.45x and 1.66x, for the synthetic workflows and the real case study, respectively.

In future work, additional aspects of performance estimation will be investigated. For example, the IIWM prediction model can be extended to consider other common stages in workflows besides data analysis, such as data acquisition, integration, and reduction, and further information about tasks, input data, and hardware platform features can be exploited in the scheduling strategy.