# Exploiting Machine Learning for Improving In-Memory Execution of Data-Intensive Workflows on Parallel Machines


## Abstract


## 1. Introduction

- Workflow structure, in terms of tasks and data dependencies.
- Input format, such as the number of rows, dimensionality, and all other features required to describe the complexity of input data.
- The types of tasks, i.e., the computation performed by a given node of the workflow. For example, in the case of data analysis workflows, we can distinguish among supervised learning, unsupervised learning, and association rule discovery tasks, as well as between learning and prediction tasks.

#### Problem Statement

## 2. Related Work

#### 2.1. Analytical-Based

- Dynamically adapting resources to data storage, using a feedback-based mechanism with real-time monitoring of the memory usage of the application [17].

#### 2.2. Machine Learning-Based

- It focuses on data-intensive workflows, while in [10], general workloads were addressed.
- It uses high-level information for describing an application (e.g., task and dataset features), while in [10], low-level system features were exploited, such as the cache miss rate and the number of blocks sent, collected by running the application on a small portion (100 MB) of the input data.
- It proposes a more general approach, since the approach proposed in [10] is only applicable to applications whose memory usage is a function of the input size.

## 3. Materials and Methods

- Execution monitoring and dataset creation: starting from a given set of workflows, a transactional dataset is generated by monitoring the memory usage and execution time of each task, together with a description of how the task is designed and concise information about its input.
- Prediction model training: from the transactional dataset of executions, a regression model is trained to fit the distribution of memory occupancy and execution time, according to the features that represent the different tasks of a workflow.
- Workflow scheduling: taking into account the predicted memory occupancy and execution time of each task, provided by the trained model, and the available memory of the computing node, tasks are scheduled using an informed strategy. In this way, a controlled degree of parallelism can be ensured while minimizing the risk of memory saturation.
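The three phases above compose into a simple pipeline. The following minimal driver sketches that data flow; all function names and structures are illustrative assumptions, not the authors' implementation:

```python
# Illustrative driver for the three-phase methodology; every callable
# here is a stub standing in for the real component.

def iiwm_pipeline(training_workflows, target_workflow, capacity_mb,
                  monitor, train_regressor, scheduler):
    # 1. Execution monitoring: one record per observed task execution.
    dataset = [monitor(task) for wf in training_workflows for task in wf]
    # 2. Prediction model training on the transactional dataset.
    model = train_regressor(dataset)
    # 3. Informed scheduling of a new workflow under the memory budget.
    return scheduler(target_workflow, model, capacity_mb)

# Toy run with stub callables, just to show how the pieces connect.
plan = iiwm_pipeline(
    training_workflows=[["taskA", "taskB"]],
    target_workflow=["taskC"],
    capacity_mb=1000,
    monitor=lambda t: {"task": t, "mem_mb": 100, "time_ms": 10},
    train_regressor=lambda ds: {"n_records": len(ds)},
    scheduler=lambda wf, model, cap: [wf],  # one stage with all tasks
)
```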

#### 3.1. Execution Monitoring and Dataset Creation

#### 3.1.1. Execution Monitoring within the Spark Unified Memory Model

#### 3.1.2. Dataset Creation

- The description of the task, such as its class (e.g., classification, clustering, etc.), type (fitting or predicting task), and algorithm (e.g., SVM, K-means, etc.).
- The description of the input dataset in terms of the number of rows, columns, categorical columns, and overall dataset size.
- Peak memory usage (both execution and storage) and execution time, which represent the three target variables to be predicted by the regressor. To obtain more robust measurements, these metrics were aggregated as median values over ten executions per task.
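A single row of the transactional dataset can thus be pictured as follows (field names are illustrative; the values come from the sample GMM estimator run reported in the execution table later in the paper):

```python
# One transactional record, mirroring the three groups of fields
# listed above. Field names are assumptions for the sketch.
record = {
    # Task description
    "task_name": "GMM",
    "task_type": "Estimator",        # fitting task
    "task_class": "Clustering",
    # Input description
    "rows": 1_474_971,
    "columns": 28,
    "categorical_columns": 0,
    "dataset_size_mb": 87.00,
    # Target variables (medians over ten executions)
    "peak_storage_mb": 433.37,
    "peak_execution_mb": 1413.50,
    "duration_ms": 108_204.00,
}
```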

- For datasets used in classification or regression tasks, we considered only the k highest-scoring features based on:
  - the analysis of variance (F-value) for integer labels (classification problems);
  - the correlation-based univariate linear regression test for real labels (regression problems).

- For clustering datasets, we used a correlation-based test to retain the k features with the lowest probability of being correlated with the others.
- For association rule discovery datasets, no feature selection is required, as the number of columns refers to the average number of items in the different transactions.
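The two univariate tests named above correspond, for instance, to scikit-learn's `f_classif` (ANOVA F-value) and `f_regression` (correlation-based) scoring functions. A sketch under the assumption that this library is used:

```python
# Feature selection as described above, sketched with scikit-learn's
# SelectKBest (the library choice is an assumption of this example).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # 200 rows, 10 candidate features
k = 5

# Classification dataset (integer labels): ANOVA F-value test.
y_cls = rng.integers(0, 2, size=200)
X_cls = SelectKBest(f_classif, k=k).fit_transform(X, y_cls)

# Regression dataset (real labels): correlation-based univariate test.
y_reg = rng.normal(size=200)
X_reg = SelectKBest(f_regression, k=k).fit_transform(X, y_reg)
```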

#### 3.2. Prediction Model Training

#### 3.3. Workflow Scheduling

- An item is a task to be executed.
- A bin identifies a stage, i.e., a set of tasks that can be run in parallel.
- The capacity of a bin is the maximum amount C of available memory in a computing node. When assigning a task to a stage $s\in \mathcal{S}$, its residual available memory is indicated with ${C}_{s}$.
- The weight of an item is the memory occupancy estimated by the prediction model. In the case of the Spark testbed, it is the maximum of the execution and storage memory, in order to model a peak in the unified memory. The estimated execution time is used to select which stage a task is assigned to when the memory constraints hold for multiple stages.

- All workflow tasks have to be executed, so the capacity of a stage may still be exceeded if a single task takes up more memory than is available.
- The assignment of a task t to a stage s is subject to dependency constraints. Hence, if a dependency exists between ${t}_{i}$ and ${t}_{j}$, then the stage of ${t}_{i}$ has to be executed before the one of ${t}_{j}$.

- The residual capacity ${C}_{{s}_{i}}$ of the selected stage ${s}_{i}$ is not exceeded by the addition of the task t.
- There must not exist a dependency between t and any task ${t}^{\prime}$ belonging to ${s}_{i}$ or any subsequent stage (${s}_{i+1}\cup \cdots \cup {s}_{k}$), where a dependency ${({t}^{\prime},t)}^{n}$ is identified by a path of length $n>0$.

**Algorithm 1:** The IIWM scheduler.
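The stage-building logic just described can be sketched as a first-fit assignment under a memory capacity, with the constraint that a task is never placed in (or before) the stage of one of its ancestors. This is a simplified illustration, not the authors' Algorithm 1; all names are assumptions:

```python
# Simplified sketch of the IIWM stage building: first-fit bin packing
# under predicted memory, respecting dependency ordering.

def build_stages(tasks, deps, predicted_mem, capacity):
    """tasks: topologically ordered ids; deps: task -> set of parents;
    predicted_mem: task -> estimated memory (MB); capacity: node memory."""
    stages = []        # each stage: {"tasks": set, "residual": float}
    placed_in = {}     # task -> index of its assigned stage

    def earliest_stage(task):
        # A task must land strictly after the stages of all its parents;
        # with a topological order this also covers transitive ancestors.
        parents = deps.get(task, set())
        return max((placed_in[p] + 1 for p in parents), default=0)

    for t in tasks:
        need = predicted_mem[t]
        lo = earliest_stage(t)
        for i in range(lo, len(stages)):
            if need <= stages[i]["residual"]:      # residual capacity check
                stages[i]["tasks"].add(t)
                stages[i]["residual"] -= need
                placed_in[t] = i
                break
        else:
            # No feasible stage: open a new one. Its capacity may be
            # exceeded by a single oversized task, as noted above.
            stages.append({"tasks": {t}, "residual": capacity - need})
            placed_in[t] = len(stages) - 1
    return [s["tasks"] for s in stages]

# Toy workflow: t0 -> {t1, t2}, t1 -> t3, on a 1000 MB node.
plan = build_stages(
    tasks=["t0", "t1", "t2", "t3"],
    deps={"t1": {"t0"}, "t2": {"t0"}, "t3": {"t1"}},
    predicted_mem={"t0": 300, "t1": 600, "t2": 600, "t3": 200},
    capacity=1000,
)
```

Note how t2 cannot join t1's stage (600 + 600 MB would exceed the 1000 MB budget), while t3 can share a stage with t2 because they are independent and fit together.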

## 4. Results and Discussion

- Execution time: Let ${m}_{1}$ and ${m}_{2}$ be the makespans of two different executions. If ${m}_{2}<{m}_{1}$, we can compute the improvement in makespan (${m}_{imp}$) and application performance (${p}_{imp}$) as follows:$${m}_{imp}=\frac{{m}_{1}-{m}_{2}}{{m}_{1}}\times 100\%\qquad {p}_{imp}=\frac{{m}_{1}}{{m}_{2}}$$
- Disk usage: We used the on-disk usage metric, which measures the amount of disk usage by jointly considering the volume and duration of disk writes. Formally, given a sequence of disk writes ${w}_{1},\dots ,{w}_{k}$, let ${\tau}_{i}^{\prime},{\tau}_{i}^{\prime\prime}\in \mathbb{T}$ be the start and end times of write ${w}_{i}$, respectively. Let also $W:\mathbb{T}\to \mathbb{R}$ be a function representing the amount of megabytes written to disk over time $\mathbb{T}$. We define on-disk usage as:$$on\text{-}disk\ usage=\sum_{i=1}^{k}\frac{1}{{\tau}_{i}^{\prime\prime}-{\tau}_{i}^{\prime}}{\int}_{{\tau}_{i}^{\prime}}^{{\tau}_{i}^{\prime\prime}}W(\tau)\,d\tau$$
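In practice the integral can be approximated numerically from sampled disk measurements. A minimal sketch, assuming $W$ is available as a callable and using a midpoint sum (not the authors' measurement tooling):

```python
# Discrete approximation of the on-disk usage metric defined above:
# for each write interval, the time-average of W is computed and summed.

def on_disk_usage(writes, W, steps=1000):
    """writes: list of (start, end) times; W(tau): MB on disk at time tau."""
    total = 0.0
    for start, end in writes:
        dt = (end - start) / steps
        # Midpoint-rule approximation of the integral of W over [start, end].
        integral = sum(W(start + (i + 0.5) * dt) for i in range(steps)) * dt
        total += integral / (end - start)   # time-average for this write
    return total

# Toy example: a single write holding a constant 100 MB for 5 s
# contributes exactly its average volume, 100 MB.
usage = on_disk_usage([(0.0, 5.0)], lambda tau: 100.0)
```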

#### 4.1. Synthetic Workflows

#### 4.2. Real Case Study

## 5. Conclusions and Final Remarks

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

1. Talia, D.; Trunfio, P.; Marozzo, F. Data Analysis in the Cloud; Elsevier: Amsterdam, The Netherlands, 2015; ISBN 978-0-12-802881-0.
2. Da Costa, G.; Fahringer, T.; Rico-Gallego, J.A.; Grasso, I.; Hristov, A.; Karatza, H.D.; Lastovetsky, A.; Marozzo, F.; Petcu, D.; Stavrinides, G.L.; et al. Exascale machines require new programming paradigms and runtimes. Supercomput. Front. Innov. 2015, 2, 6–27.
3. Li, M.; Tan, J.; Wang, Y.; Zhang, L.; Salapura, V. SparkBench: A Comprehensive Benchmarking Suite for in Memory Data Analytic Platform Spark. In Proceedings of the 12th ACM International Conference on Computing Frontiers, Ischia, Italy, 18–21 May 2015; Association for Computing Machinery: New York, NY, USA, 2015.
4. De Oliveira, D.C.; Liu, J.; Pacitti, E. Data-intensive workflow management: For clouds and data-intensive and scalable computing environments. Synth. Lect. Data Manag. 2019, 14, 1–179.
5. Verma, A.; Mansuri, A.H.; Jain, N. Big data management processing with Hadoop MapReduce and Spark technology: A comparison. In Proceedings of the 2016 Symposium on Colossal Data Analysis and Networking (CDAN), Indore, India, 18–19 March 2016; pp. 1–4.
6. Zaharia, M.; Chowdhury, M.; Das, T.; Dave, A.; Ma, J.; McCauly, M.; Franklin, M.J.; Shenker, S.; Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), San Jose, CA, USA, 25–27 April 2012; pp. 15–28.
7. Samadi, Y.; Zbakh, M.; Tadonki, C. Performance comparison between Hadoop and Spark frameworks using HiBench benchmarks. Concurr. Comput. Pract. Exp. 2018, 30, e4367.
8. Delimitrou, C.; Kozyrakis, C. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, Salt Lake City, UT, USA, 1–5 March 2014; Association for Computing Machinery: New York, NY, USA, 2014; pp. 127–144.
9. Llull, Q.; Fan, S.; Zahedi, S.M.; Lee, B.C. Cooper: Task Colocation with Cooperative Games. In Proceedings of the 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Austin, TX, USA, 4–8 February 2017; pp. 421–432.
10. Marco, V.S.; Taylor, B.; Porter, B.; Wang, Z. Improving Spark Application Throughput via Memory Aware Task Co-Location: A Mixture of Experts Approach. In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference (Middleware ’17), Las Vegas, NV, USA, 11–15 December 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 95–108.
11. Maros, A.; Murai, F.; Couto da Silva, A.P.; Almeida, J.M.; Lattuada, M.; Gianniti, E.; Hosseini, M.; Ardagna, D. Machine Learning for Performance Prediction of Spark Cloud Applications. In Proceedings of the 2019 IEEE 12th International Conference on Cloud Computing (CLOUD), Milan, Italy, 8–13 July 2019; pp. 99–106.
12. Talia, D. Workflow Systems for Science: Concepts and Tools. Int. Sch. Res. Not. 2013, 2013, 404525.
13. Smanchat, S.; Viriyapant, K. Taxonomies of workflow scheduling problem and techniques in the cloud. Future Gener. Comput. Syst. 2015, 52, 1–12.
14. Bittencourt, L.F.; Madeira, E.R.M.; Da Fonseca, N.L.S. Scheduling in hybrid clouds. IEEE Commun. Mag. 2012, 50, 42–47.
15. Zhao, Y.; Hu, F.; Chen, H. An adaptive tuning strategy on Spark based on in-memory computation characteristics. In Proceedings of the 2016 18th International Conference on Advanced Communication Technology (ICACT), Pyeongchang, Korea, 31 January–3 February 2016; pp. 484–488.
16. Chen, D.; Chen, H.; Jiang, Z.; Zhao, Y. An adaptive memory tuning strategy with high performance for Spark. Int. J. Big Data Intell. 2017, 4, 276–286.
17. Xuan, P.; Luo, F.; Ge, R.; Srimani, P.K. Dynamic Management of In-Memory Storage for Efficiently Integrating Compute- and Data-Intensive Computing on HPC Systems. In Proceedings of the 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Madrid, Spain, 14–17 May 2017; pp. 549–558.
18. Tang, Z.; Zeng, A.; Zhang, X.; Yang, L.; Li, K. Dynamic memory-aware scheduling in Spark computing environment. J. Parallel Distrib. Comput. 2020, 141, 10–22.
19. Bae, J.; Jang, H.; Jin, W.; Heo, J.; Jang, J.; Hwang, J.; Cho, S.; Lee, J.W. Jointly optimizing task granularity and concurrency for in-memory MapReduce frameworks. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 11–14 December 2017; pp. 130–140.
20. Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259.
21. Liu, J.; Pacitti, E.; Valduriez, P.; Mattoso, M. A Survey of Data-Intensive Scientific Workflow Management. J. Grid Comput. 2015, 13, 457–493.
22. Raj, P.H.; Kumar, P.R.; Jelciana, P. Load Balancing in Mobile Cloud Computing using Bin Packing’s First Fit Decreasing Method. In Proceedings of the International Conference on Computational Intelligence in Information System, Brunei, 16–18 November 2018; Springer: Berlin, Germany, 2018; pp. 97–106.
23. Baker, T.; Aldawsari, B.; Asim, M.; Tawfik, H.; Maamar, Z.; Buyya, R. Cloud-SEnergy: A bin-packing based multi-cloud service broker for energy efficient composition and execution of data-intensive applications. Sustain. Comput. Informatics Syst. 2018, 19, 242–252.
24. Stavrinides, G.L.; Karatza, H.D. Scheduling real-time DAGs in heterogeneous clusters by combining imprecise computations and bin packing techniques for the exploitation of schedule holes. Future Gener. Comput. Syst. 2012, 28, 977–988.
25. Coffman, E.G., Jr.; Garey, M.R.; Johnson, D.S. An application of bin-packing to multiprocessor scheduling. SIAM J. Comput. 1978, 7, 1–17.
26. Darapuneni, Y.J. A Survey of Classical and Recent Results in Bin Packing Problem. UNLV Theses, Dissertations, Professional Papers, and Capstones, 2012. Available online: https://digitalscholarship.unlv.edu/cgi/viewcontent.cgi?article=2664&context=thesesdissertations (accessed on 29 April 2021).
27. Marozzo, F.; Rodrigo Duro, F.; Garcia Blas, J.; Carretero, J.; Talia, D.; Trunfio, P. A data-aware scheduling strategy for workflow execution in clouds. Concurr. Comput. Pract. Exp. 2017, 29, e4229.
28. Marozzo, F.; Talia, D.; Trunfio, P. A Workflow Management System for Scalable Data Mining on Clouds. IEEE Trans. Serv. Comput. 2018, 11, 480–492.
29. Aseeri, A.O.; Zhuang, Y.; Alkatheiri, M.S. A Machine Learning-Based Security Vulnerability Study on XOR PUFs for Resource-Constraint Internet of Things. In Proceedings of the 2018 IEEE International Congress on Internet of Things (ICIOT), San Francisco, CA, USA, 2–7 July 2018; pp. 49–56.

**Figure 3.** Execution flow of the IIWM scheduler. Given a workflow and a prediction model as the input, a scheduling plan is generated in two steps: (i) building of the stages and task assignment; (ii) stage consolidation.

Symbol | Meaning |
---|---|
$\mathcal{T}=\{{t}_{1},{t}_{2},\dots ,{t}_{n}\}$ | Set of tasks. |
$\mathcal{A}\subseteq (\mathcal{T}\times \mathcal{T})=\{{a}_{1},\dots ,{a}_{m}\}$ | Set of dependencies, with ${a}_{i}=({t}_{i},{t}_{j})$, ${t}_{i},{t}_{j}\in \mathcal{T}$. |
${d}_{t}$ | Description of the dataset processed by task $t$. |
$\mathcal{W}=(\mathcal{T},\mathcal{A})$ | Workflow. |
${\mathcal{N}}^{in}\left(t\right)=\{{t}^{\prime}\in \mathcal{T}\mid ({t}^{\prime},t)\in \mathcal{A}\}$ | In-neighborhood of task $t$. |
${\mathcal{N}}^{out}\left(t\right)=\{{t}^{\prime}\in \mathcal{T}\mid (t,{t}^{\prime})\in \mathcal{A}\}$ | Out-neighborhood of task $t$. |
$\mathcal{M}$ | Regression prediction model. |
$\mathcal{S}=\langle {s}_{1},\dots ,{s}_{k}\rangle $ | List of stages. ${s}_{i}\subseteq \mathcal{T}\mid ({t}_{x}\Vert {t}_{y})\ \forall {t}_{x},{t}_{y}\in {s}_{i}$. |
$C$ | Maximum amount of memory available on a computing node. |
${C}_{s}=C-\sum_{t\in s}\mathcal{M}.predict\_mem(t,{d}_{t})$ | Residual capacity of a stage $s$. |
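The DAG notation above maps directly onto simple set operations. A minimal illustration, using plain Python sets for $\mathcal{T}$ and $\mathcal{A}$:

```python
# Minimal rendering of the workflow notation: a DAG W = (T, A),
# with in-/out-neighborhoods derived from the dependency set A.

T = {"t0", "t1", "t2"}
A = {("t0", "t1"), ("t0", "t2")}    # t0 must precede t1 and t2

def n_in(t):
    """N^in(t): tasks that t directly depends on."""
    return {u for (u, v) in A if v == t}

def n_out(t):
    """N^out(t): tasks that directly depend on t."""
    return {v for (u, v) in A if u == t}
```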

MLlib Algorithm | Persist Call |
---|---|
K-Means | `// Compute squared norms and cache them` `norms.cache()` |
Decision Tree | `// Cache input RDD for speedup during multiple passes` `BaggedPoint.convertToBaggedRDD(treeInput, …).cache()` |
GMM | `instances.cache()` … `data.map(_.asBreeze).cache()` |
FPGrowth | `items.cache()` |
SVM | `InstanceBlock.blockifyWithMaxMemUsage(…).cache()` |

Task Name | Task Type | Task Class | Dataset Rows | Dataset Columns | Categorical Columns | Dataset Size (MB) | Peak Storage Memory (MB) | Peak Execution Memory (MB) | Duration (ms) |
---|---|---|---|---|---|---|---|---|---|
GMM | Estimator | Clustering | 1,474,971 | 28 | 0 | 87.00 | 433.37 | 1413.50 | 108,204.00 |
K-Means | Estimator | Clustering | 5,000,000 | 104 | 0 | 1239.78 | 4624.52 | 4112.00 | 56,233.50 |
Decision Tree | Estimator | Classification | 9606 | 1921 | 0 | 84.91 | 730.09 | 297.90 | 39,292.00 |
Naive Bayes | Estimator | Classification | 260,924 | 4 | 0 | 13.50 | 340.92 | 6982.80 | 16,531.50 |
SVM | Estimator | Classification | 5,000,000 | 129 | 0 | 1542.58 | 6199.11 | 106.60 | 238,594.50 |
FPGrowth | Estimator | Association Rules | 823,593 | 180 | 180 | 697.00 | 9493.85 | 1371.03 | 96,071.50 |
GMM | Transformer | Clustering | 165,474 | 14 | 1 | 6.37 | 2.34 | $1\times 10^{-6}$ | 62.50 |
K-Means | Transformer | Clustering | 4,898,431 | 42 | 3 | 648.89 | 3.23 | $1\times 10^{-6}$ | 35.00 |
Decision Tree | Transformer | Classification | 1,959,372 | 42 | 4 | 257.69 | 3.68 | $1\times 10^{-6}$ | 65.50 |
Naive Bayes | Transformer | Classification | 347,899 | 4 | 0 | 17.99 | 4.26 | $1\times 10^{-6}$ | 92.50 |
SVM | Transformer | Classification | 5,000,000 | 129 | 0 | 1542.58 | 2.36 | $1\times 10^{-6}$ | 55.50 |
FPGrowth | Transformer | Association Rules | 136,073 | 34 | 34 | 13.55 | 1229.95 | 633.50 | 52,429.00 |
… | … | … | … | … | … | … | … | … | … |

Hyperparameter | Value |
---|---|
n_estimators | 500 |
learning_rate | 0.01 |
max_depth | 7 |
loss | least squares |

Target | RMSE | MAE | Adjusted ${R}^{2}$ | Pearson Correlation |
---|---|---|---|---|
Storage Memory | 108.23 | 26.66 | 0.96 | 0.98 |
Execution Memory | 312.60 | 26.30 | 0.91 | 0.95 |
Duration | 4443.17 | 2003.70 | 0.95 | 0.98 |
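The four evaluation metrics in the table can be computed with the standard formulas; a sketch using NumPy (with $n$ samples and $p$ features for the adjusted ${R}^{2}$):

```python
# Standard regression metrics as used in the evaluation tables:
# RMSE, MAE, adjusted R^2, and Pearson correlation.
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    mae = float(np.mean(np.abs(y_true - y_pred)))
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    pearson = float(np.corrcoef(y_true, y_pred)[0, 1])
    return rmse, mae, adj_r2, pearson

rmse, mae, adj_r2, r = regression_metrics([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8], 1)
```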

Node | Task Name | Task Type | Task Class | Rows | Columns | Categorical Columns | Dataset Size (MB) |
---|---|---|---|---|---|---|---|
${t}_{0}$ | Naive Bayes | Estimator | Classification | 2,939,059 | 18 | 4 | 198.94 |
${t}_{1}$ | FPGrowth | Estimator | Association Rules | 494,156 | 180 | 180 | 417.01 |
${t}_{2}$ | Naive Bayes | Estimator | Classification | 5,000,000 | 27 | 0 | 321.86 |
${t}_{3}$ | K-Means | Estimator | Clustering | 1,000,000 | 104 | 0 | 247.96 |
${t}_{4}$ | Decision Tree | Estimator | Classification | 4,000,000 | 53 | 0 | 505.45 |
${t}_{5}$ | Decision Tree | Estimator | Classification | 4,000,000 | 27 | 0 | 257.49 |
${t}_{6}$ | Decision Tree | Estimator | Classification | 5,000,000 | 129 | 0 | 1542.58 |
${t}_{7}$ | K-Means | Estimator | Clustering | 2,000,000 | 53 | 0 | 252.73 |
${t}_{8}$ | Naive Bayes | Estimator | Classification | 2,000,000 | 104 | 0 | 495.90 |
${t}_{9}$ | Naive Bayes | Estimator | Classification | 1,000,000 | 129 | 0 | 307.57 |
${t}_{10}$ | SVM | Estimator | Classification | 2,000,000 | 53 | 0 | 252.72 |
${t}_{11}$ | K-Means | Estimator | Clustering | 2,049,280 | 9 | 2 | 122.03 |
${t}_{12}$ | GMM | Estimator | Clustering | 2,458,285 | 28 | 0 | 145.01 |
${t}_{13}$ | K-Means | Estimator | Clustering | 9169 | 5812 | 1 | 101.89 |
${t}_{14}$ | SVM | Estimator | Classification | 2,000,000 | 27 | 0 | 128.75 |
${t}_{15}$ | K-Means | Estimator | Clustering | 3,000,000 | 104 | 0 | 743.87 |
${t}_{16}$ | SVM | Estimator | Classification | 3,000,000 | 53 | 0 | 379.09 |
${t}_{17}$ | SVM | Estimator | Classification | 14,410 | 1921 | 0 | 127.38 |
${t}_{18}$ | K-Means | Estimator | Clustering | 5,000,000 | 53 | 0 | 631.81 |
${t}_{19}$ | K-Means | Estimator | Clustering | 5,000,000 | 104 | 0 | 1239.78 |
${t}_{20}$ | K-Means | Estimator | Clustering | 2,000,000 | 78 | 0 | 371.93 |
${t}_{21}$ | SVM | Estimator | Classification | 3,000,000 | 104 | 0 | 743.87 |
${t}_{22}$ | K-Means | Estimator | Clustering | 2,939,059 | 18 | 4 | 198.94 |
${t}_{23}$ | SVM | Estimator | Classification | 19,213 | 1442 | 0 | 123.28 |
${t}_{24}$ | Decision Tree | Estimator | Classification | 3,000,000 | 129 | 0 | 922.69 |
${t}_{25}$ | K-Means | Estimator | Clustering | 1,959,372 | 26 | 4 | 189.55 |
${t}_{26}$ | Decision Tree | Estimator | Classification | 4,898,431 | 18 | 4 | 331.57 |
${t}_{27}$ | Naive Bayes | Estimator | Classification | 4,898,431 | 18 | 4 | 331.57 |
${t}_{28}$ | K-Means | Estimator | Clustering | 2,939,059 | 34 | 4 | 334.91 |
${t}_{29}$ | K-Means | Estimator | Clustering | 4,898,431 | 18 | 4 | 331.57 |
${t}_{30}$ | K-Means | Estimator | Clustering | 1,966,628 | 42 | 0 | 170.49 |
${t}_{31}$ | Naive Bayes | Estimator | Classification | 1,959,372 | 18 | 4 | 132.62 |
${t}_{32}$ | K-Means | Estimator | Clustering | 3,000,000 | 78 | 0 | 557.91 |
${t}_{33}$ | Decision Tree | Estimator | Classification | 3,000,000 | 53 | 0 | 379.09 |
${t}_{34}$ | Decision Tree | Estimator | Classification | 14,410 | 2401 | 0 | 159.71 |
${t}_{35}$ | K-Means | Estimator | Clustering | 2,939,059 | 42 | 4 | 386.53 |
${t}_{36}$ | Decision Tree | Estimator | Classification | 2,939,059 | 34 | 4 | 334.91 |
${t}_{37}$ | Decision Tree | Estimator | Classification | 4,000,000 | 129 | 0 | 1230.24 |
${t}_{38}$ | Naive Bayes | Estimator | Classification | 1,000,000 | 53 | 0 | 126.36 |
${t}_{39}$ | GMM | Estimator | Clustering | 1,000,000 | 53 | 0 | 126.36 |
${t}_{40}$ | Decision Tree | Estimator | Classification | 2,939,059 | 18 | 4 | 198.94 |
${t}_{41}$ | K-Means | Estimator | Clustering | 4,898,431 | 18 | 4 | 331.57 |

Target | RMSE | MAE | Adjusted ${R}^{2}$ | Pearson Correlation |
---|---|---|---|---|
Storage Memory | 246.63 | 95.60 | 0.96 | 0.98 |
Execution Memory | 4.70 | 1.60 | 0.99 | 0.99 |
Duration | 20,354.38 | 7877.72 | 0.80 | 0.91 |

Iteration | State | Stages |
---|---|---|
It. 0 | ${\mathcal{T}}_{free}^{0}=\left\{{t}_{0}\right\}$; create ${s}_{0}$ and assign ${t}_{0}$; unlock $\{{t}_{1},{t}_{2},{t}_{3}\}$ | ${s}_{0}=\left\{{t}_{0}\right\}$ |
It. 1 | ${\mathcal{T}}_{free}^{1}=\{{t}_{1},{t}_{3},{t}_{2}\}$; create ${s}_{1}$ and assign ${t}_{1}$; unlock $\{{t}_{4}\}$ | ${s}_{0}=\left\{{t}_{0}\right\}$, ${s}_{1}=\left\{{t}_{1}\right\}$ |
It. 2 | ${\mathcal{T}}_{free}^{2}=\{{t}_{3},{t}_{4},{t}_{2}\}$; create ${s}_{2}$ and assign ${t}_{3}$; unlock $\{{t}_{7},{t}_{8},{t}_{9},{t}_{10}\}$ | ${s}_{0}=\left\{{t}_{0}\right\}$, ${s}_{1}=\left\{{t}_{1}\right\}$, ${s}_{2}=\left\{{t}_{3}\right\}$ |
It. 3 | ${\mathcal{T}}_{free}^{3}=\{{t}_{7},{t}_{4},{t}_{10},{t}_{2},{t}_{8},{t}_{9}\}$; create ${s}_{3}$ and assign ${t}_{7}$ | ${s}_{0}=\left\{{t}_{0}\right\}$, ${s}_{1}=\left\{{t}_{1}\right\}$, ${s}_{2}=\left\{{t}_{3}\right\}$, ${s}_{3}=\left\{{t}_{7}\right\}$ |
It. 4 | ${\mathcal{T}}_{free}^{4}=\{{t}_{4},{t}_{10},{t}_{2},{t}_{8},{t}_{9}\}$; ${\mathcal{S}}_{sel}=\{{s}_{2},{s}_{3}\}$; $increase=\{0,0\}$; assign ${t}_{4}$ to ${s}_{2}$ | ${s}_{0}=\left\{{t}_{0}\right\}$, ${s}_{1}=\left\{{t}_{1}\right\}$, ${s}_{2}=\{{t}_{3},{t}_{4}\}$, ${s}_{3}=\left\{{t}_{7}\right\}$ |
⋯ | ⋯ | ⋯ |
It. 17 | ${\mathcal{T}}_{free}^{17}=\{{t}_{17},{t}_{23},{t}_{8},{t}_{9}\}$; ${\mathcal{S}}_{sel}=\{{s}_{6},{s}_{7}\}$; $increase=\{12{,}496.36,\ 0\}$; assign ${t}_{17}$ to ${s}_{7}$; unlock $\{{t}_{25}\}$ | ${s}_{0}=\left\{{t}_{0}\right\}$, ${s}_{1}=\{{t}_{1},{t}_{2}\}$, ${s}_{2}=\{{t}_{3},{t}_{4},{t}_{5}\}$, ${s}_{3}=\{{t}_{7},{t}_{10},{t}_{6}\}$, ${s}_{4}=\{{t}_{12},{t}_{11}\}$, ${s}_{5}=\{{t}_{15},{t}_{18}\}$, ${s}_{6}=\{{t}_{19},{t}_{22}\}$, ${s}_{7}=\{{t}_{24},{t}_{16},{t}_{17}\}$ |
⋯ | ⋯ | ⋯ |

Strategy | Task-Scheduling Plan | Number of Stages | Time (min) | Peak Disk Usage (MB) | Writes Duration (min) | On-Disk Usage (MB) |
---|---|---|---|---|---|---|
Full-Parallel | (${t}_{0}$), (${t}_{1}$ ‖ ${t}_{2}$ ‖ ${t}_{3}$), (${t}_{4}$ ‖ ${t}_{5}$ ‖ ${t}_{6}$ ‖ ${t}_{7}$ ‖ ${t}_{8}$ ‖ ${t}_{9}$ ‖ ${t}_{10}$), (${t}_{11}$ ‖ ${t}_{12}$ ‖ ${t}_{13}$ ‖ ${t}_{14}$), (${t}_{15}$ ‖ ${t}_{16}$ ‖ ${t}_{17}$ ‖ ${t}_{18}$ ‖ ${t}_{19}$ ‖ ${t}_{20}$ ‖ ${t}_{21}$), (${t}_{22}$ ‖ ${t}_{23}$ ‖ ${t}_{24}$ ‖ ${t}_{25}$ ‖ ${t}_{26}$ ‖ ${t}_{27}$ ‖ ${t}_{28}$), (${t}_{29}$ ‖ ${t}_{30}$ ‖ ${t}_{31}$ ‖ ${t}_{32}$ ‖ ${t}_{33}$ ‖ ${t}_{34}$ ‖ ${t}_{41}$), (${t}_{35}$ ‖ ${t}_{36}$ ‖ ${t}_{37}$), (${t}_{38}$ ‖ ${t}_{39}$), (${t}_{40}$) | 10 | $31.52\pm 0.6$ | 356,106.60 | 11.56 | 126,867.06 |
IIWM | (${t}_{0}$), (${t}_{1}$ ‖ ${t}_{2}$), (${t}_{3}$ ‖ ${t}_{4}$ ‖ ${t}_{5}$), (${t}_{7}$ ‖ ${t}_{10}$ ‖ ${t}_{6}$ ‖ ${t}_{8}$ ‖ ${t}_{9}$), (${t}_{12}$ ‖ ${t}_{11}$ ‖ ${t}_{13}$), (${t}_{15}$ ‖ ${t}_{18}$), (${t}_{19}$ ‖ ${t}_{22}$ ‖ ${t}_{23}$), (${t}_{24}$ ‖ ${t}_{16}$ ‖ ${t}_{17}$ ‖ ${t}_{29}$), (${t}_{25}$ ‖ ${t}_{41}$), (${t}_{30}$ ‖ ${t}_{20}$), (${t}_{35}$ ‖ ${t}_{21}$ ‖ ${t}_{14}$), (${t}_{28}$ ‖ ${t}_{26}$ ‖ ${t}_{27}$), (${t}_{33}$ ‖ ${t}_{32}$ ‖ ${t}_{34}$ ‖ ${t}_{31}$), (${t}_{37}$ ‖ ${t}_{36}$), (${t}_{39}$ ‖ ${t}_{38}$), (${t}_{40}$) | 16 | $21.70\pm 0.63$ | 0 | 0 | 0 |
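Plugging the measured makespans of this experiment (31.52 min for Full-Parallel vs. 21.70 min for IIWM) into the formulas defined at the beginning of Section 4 gives the reported improvement:

```python
# Makespan-improvement formulas applied to this experiment's times.
m1, m2 = 31.52, 21.70             # Full-Parallel vs. IIWM makespan (min)
m_imp = (m1 - m2) / m1 * 100      # percentage improvement in makespan
p_imp = m1 / m2                   # application performance (speedup)
# m_imp ~ 31.2%, p_imp ~ 1.45x
```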

Node | Task Name | Task Type | Task Class | Rows | Columns | Categorical Columns | Dataset Size (MB) |
---|---|---|---|---|---|---|---|
${t}_{0}$ | K-Means | Estimator | Clustering | 3,918,745 | 34 | 4 | 446.55 |
${t}_{1}$ | Decision Tree | Estimator | Classification | 4,000,000 | 27 | 0 | 257.49 |
${t}_{2}$ | GMM | Estimator | Clustering | 2,458,285 | 28 | 0 | 145.01 |
${t}_{3}$ | Decision Tree | Estimator | Classification | 3,000,000 | 53 | 0 | 379.09 |
${t}_{4}$ | Decision Tree | Estimator | Classification | 4,000,000 | 129 | 0 | 1230.24 |
${t}_{5}$ | Decision Tree | Estimator | Classification | 3,918,745 | 18 | 4 | 265.25 |
${t}_{6}$ | Decision Tree | Estimator | Classification | 4,898,431 | 42 | 3 | 648.89 |
${t}_{7}$ | Decision Tree | Estimator | Classification | 2,939,059 | 42 | 4 | 386.53 |
${t}_{8}$ | K-Means | Estimator | Clustering | 2,458,285 | 56 | 0 | 278.75 |
${t}_{9}$ | GMM | Estimator | Clustering | 3,000,000 | 53 | 0 | 379.09 |
${t}_{10}$ | SVM | Estimator | Classification | 4,000,000 | 53 | 0 | 505.45 |
${t}_{11}$ | K-Means | Estimator | Clustering | 2,939,059 | 42 | 4 | 386.53 |
${t}_{12}$ | SVM | Estimator | Classification | 2,000,000 | 53 | 0 | 252.72 |
${t}_{13}$ | K-Means | Estimator | Clustering | 1,639,424 | 9 | 2 | 93.70 |
${t}_{14}$ | Naive Bayes | Estimator | Classification | 260,924 | 3 | 0 | 10.33 |
${t}_{15}$ | K-Means | Estimator | Clustering | 2,000,000 | 78 | 0 | 371.93 |
${t}_{16}$ | Decision Tree | Estimator | Classification | 3,918,745 | 26 | 4 | 379.11 |
${t}_{17}$ | Decision Tree | Estimator | Classification | 3,918,745 | 34 | 4 | 446.55 |
${t}_{18}$ | FPGrowth | Estimator | Association Rules | 823,593 | 180 | 180 | 697.00 |
${t}_{19}$ | Decision Tree | Estimator | Classification | 2,939,059 | 26 | 4 | 284.33 |
${t}_{20}$ | SVM | Estimator | Classification | 5,000,000 | 27 | 0 | 321.86 |
${t}_{21}$ | FPGrowth | Estimator | Association Rules | 164,719 | 180 | 180 | 139.87 |
${t}_{22}$ | GMM | Estimator | Clustering | 3,000,000 | 27 | 0 | 193.12 |
${t}_{23}$ | K-Means | Estimator | Clustering | 4,898,431 | 26 | 4 | 473.88 |
${t}_{24}$ | Decision Tree | Estimator | Classification | 2,000,000 | 104 | 0 | 495.90 |
${t}_{25}$ | K-Means | Estimator | Clustering | 2,458,285 | 69 | 0 | 344.60 |
${t}_{26}$ | FPGrowth | Estimator | Association Rules | 494,156 | 180 | 180 | 417.01 |

Target | RMSE | MAE | Adjusted ${R}^{2}$ | Pearson Correlation |
---|---|---|---|---|
Storage Memory | 213.81 | 78.92 | 0.98 | 0.99 |
Execution Memory | 29.86 | 11.56 | 0.98 | 0.99 |
Duration | 20,086.80 | 9925.13 | 0.82 | 0.94 |

Strategy | Task-Scheduling Plan | Number of Stages | Time (min) | Peak Disk Usage (MB) | Writes Duration (min) | On-Disk Usage (MB) |
---|---|---|---|---|---|---|
Full-Parallel | (${t}_{0}$), (${t}_{1}$ ‖ ${t}_{2}$ ‖ ${t}_{3}$ ‖ ${t}_{4}$), (${t}_{5}$ ‖ ${t}_{6}$ ‖ ${t}_{7}$ ‖ ${t}_{8}$ ‖ ${t}_{9}$ ‖ ${t}_{10}$ ‖ ${t}_{11}$ ‖ ${t}_{12}$), (${t}_{13}$ ‖ ${t}_{14}$ ‖ ${t}_{15}$ ‖ ${t}_{16}$ ‖ ${t}_{17}$ ‖ ${t}_{18}$ ‖ ${t}_{19}$ ‖ ${t}_{20}$), (${t}_{21}$), (${t}_{22}$ ‖ ${t}_{23}$ ‖ ${t}_{24}$ ‖ ${t}_{25}$), (${t}_{26}$) | 7 | $29.42\pm 1.88$ | 27,095.84 | 20.6 | 10,593.79 |
IIWM | (${t}_{0}$), (${t}_{4}$ ‖ ${t}_{2}$), (${t}_{11}$ ‖ ${t}_{7}$), (${t}_{8}$ ‖ ${t}_{3}$), (${t}_{15}$ ‖ ${t}_{10}$ ‖ ${t}_{9}$ ‖ ${t}_{16}$), (${t}_{18}$), (${t}_{17}$ ‖ ${t}_{1}$ ‖ ${t}_{12}$), (${t}_{6}$ ‖ ${t}_{5}$), (${t}_{14}$), (${t}_{13}$ ‖ ${t}_{20}$ ‖ ${t}_{19}$), (${t}_{21}$), (${t}_{23}$ ‖ ${t}_{24}$), (${t}_{25}$), (${t}_{22}$), (${t}_{26}$) | 15 | $22.68\pm 1.65$ | 304.5 | 3.6 | 60.82 |

Strategy | Task-Scheduling Plan | Number of Stages | Time (min) | Peak Disk Usage (MB) | Writes Duration (min) | On-Disk Usage (MB) |
---|---|---|---|---|---|---|
Full-Parallel | (${t}_{0}$ ‖ ${t}_{2}$ ‖ ${t}_{4}$ ‖ ${t}_{6}$ ‖ ${t}_{8}$ ‖ ${t}_{10}$ ‖ ${t}_{12}$ ‖ ${t}_{14}$ ‖ ${t}_{16}$ ‖ ${t}_{18}$ ‖ ${t}_{20}$ ‖ ${t}_{22}$ ‖ ${t}_{24}$ ‖ ${t}_{26}$ ‖ ${t}_{28}$), (${t}_{1}$ ‖ ${t}_{3}$ ‖ ${t}_{5}$ ‖ ${t}_{7}$ ‖ ${t}_{9}$ ‖ ${t}_{11}$ ‖ ${t}_{13}$ ‖ ${t}_{15}$ ‖ ${t}_{17}$ ‖ ${t}_{19}$ ‖ ${t}_{21}$ ‖ ${t}_{23}$ ‖ ${t}_{25}$ ‖ ${t}_{27}$ ‖ ${t}_{29}$) | 2 | $11.42\pm 0.27$ | 124,730.87 | 9.6 | 54,443.19 |
IIWM | (${t}_{10}$ ‖ ${t}_{12}$ ‖ ${t}_{14}$ ‖ ${t}_{16}$ ‖ ${t}_{20}$ ‖ ${t}_{22}$ ‖ ${t}_{24}$ ‖ ${t}_{26}$ ‖ ${t}_{28}$), (${t}_{18}$ ‖ ${t}_{0}$ ‖ ${t}_{2}$ ‖ ${t}_{4}$ ‖ ${t}_{6}$ ‖ ${t}_{8}$ ‖ ${t}_{11}$ ‖ ${t}_{13}$ ‖ ${t}_{15}$ ‖ ${t}_{17}$ ‖ ${t}_{21}$ ‖ ${t}_{23}$ ‖ ${t}_{25}$ ‖ ${t}_{27}$ ‖ ${t}_{29}$), (${t}_{19}$ ‖ ${t}_{1}$ ‖ ${t}_{3}$ ‖ ${t}_{5}$ ‖ ${t}_{7}$ ‖ ${t}_{9}$) | 3 | $6.88\pm 0.1$ | 0 | 0 | 0 |


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Cantini, R.; Marozzo, F.; Orsino, A.; Talia, D.; Trunfio, P.
Exploiting Machine Learning for Improving In-Memory Execution of Data-Intensive Workflows on Parallel Machines. *Future Internet* **2021**, *13*, 121.
https://doi.org/10.3390/fi13050121
