Algorithms | Editor's Choice | Article | Open Access
1 February 2025

Optimizing Apache Spark MLlib: Predictive Performance of Large-Scale Models for Big Data Analytics

1 Department of Management Science and Technology, University of Patras, 26334 Patras, Greece
2 Computer Engineering and Informatics Department, University of Patras, 26504 Patras, Greece
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition)

Abstract

In this study, we analyze the performance of three machine learning operators in Apache Spark MLlib: K-Means, Random Forest Regression, and Word2Vec. Using a multi-node Spark cluster, we collected detailed execution metrics across diverse datasets and parameter settings. These data were used to train predictive models that reached up to 98% accuracy in forecasting performance. By building actionable predictive models, our research addresses key challenges in hyperparameter tuning, scalability, and real-time resource allocation. Specifically, we show the practical value of traditional models in optimizing Apache Spark MLlib workflows, achieving up to 30% resource savings and a 25% reduction in processing time. These models enable system optimization, reduce computational overhead, and boost the overall performance of big data applications. Ultimately, this work not only closes significant gaps in predictive performance modeling but also paves the way for real-time analytics in distributed environments.

1. Introduction

Since it was first introduced, big data has changed the way organizations operate and extract insights from complex, massive datasets. However, this shift introduces significant challenges, particularly around computational complexity. Apache Spark, a high-performance distributed computing environment, has emerged as one of the leading frameworks in big data computing, widely adopted for its efficient, in-memory data processing. This capability is extended further by Spark's MLlib library, which offers a wide range of machine learning (ML) algorithms capable of analyzing big data in parallel. Nevertheless, running MLlib algorithms in Spark can be a technically challenging and laborious process, and it becomes increasingly difficult as dataset sizes grow.
In such big data scenarios and large-scale environments, the execution time and resource usage of MLlib algorithms are difficult to predict [1,2]. At the same time, as organizations require real-time analytics and rapid decision-making, these computational delays become critical obstacles. Improving the efficiency of ML workloads in Spark has therefore become a major area of interest for researchers and practitioners.
Despite its massive potential as an ML library for big data processing, Apache Spark's MLlib remains challenging to optimize because its execution times and resource usage vary unpredictably. Although predictive modeling for performance optimization has been examined in existing studies, especially for large-scale and real-time analytics, this study fills a gap by exploring additional aspects, including multi-operator resource sharing and its relationship with dataset properties.
To address this challenge, there is a pressing need for comprehensive approaches that leverage historical performance data to forecast the computational requirements of MLlib algorithms within Apache Spark (https://spark.apache.org/, accessed on 2 January 2025). Existing research has primarily focused on optimizing specific aspects of Spark’s performance or refining the algorithms themselves, often overlooking the potential of predictive modeling as a tool for proactive resource management. The unpredictability associated with executing MLlib algorithms on large datasets poses a substantial barrier to achieving optimal performance, leading to inefficient resource allocation and delayed analytics processes [3,4].
Motivated by this research gap, the present study aims to empower organizations with the ability to anticipate the performance of MLlib algorithms before execution. By accurately forecasting execution times and resource requirements, it becomes feasible to optimize system configurations in advance, thereby preventing computational bottlenecks and reducing resource wastage. Leveraging machine learning techniques to predict and enhance the performance of machine learning workflows introduces an innovative feedback mechanism that holds significant promise for improving the efficiency of distributed data systems [5,6].
In addition, despite existing research on improving the performance of Apache Spark MLlib, several open issues remain. These include the absence of predictive models suitable for workloads of different types, resource allocation frameworks that cannot scale out, and the lack of consideration of real-time workload characteristics, especially in distributed settings. Furthermore, existing techniques are often based on static settings that cannot dynamically respond to shifts in workload. This study addresses these gaps by proposing a predictive modeling technique that uses execution metrics to improve resource utilization and reduce congestion across the processing nodes of a large-scale Spark environment.
Our work extends beyond standard approaches, motivated by unresolved gaps in hyperparameter optimization, scalability in distributed systems, and the integration of external factors affecting performance. We utilize machine learning models that provide predictive insight into MLlib operator behavior, enabling proactive configuration tuning. The value of this research lies in its applicability to streamlining big data applications, alleviating computational bottlenecks, and improving decision-making in large-scale data analytics.
To mitigate these limitations, this study presents a novel approach for estimating the performance of Apache Spark MLlib algorithms, whose behavior is hard to assess in advance due to their high computational complexity. Using historical performance data, we build models that predict performance metrics, including a job's execution time and the resources needed to run it. This allows organizations to proactively tune Spark configurations, allocate resources efficiently, and eliminate computational bottlenecks. The key insight of our work is that machine learning can be used to enhance the execution of other machine learning processes, forming feedback loops that improve distributed data systems, much as ML is applied on low-power devices in TinyML or Federated Learning scenarios [7,8,9].
The key purpose of this study is to deliver a usable approach for increasing the efficiency of big data analytics through predictive performance modeling. We gather detailed performance measurements from several MLlib algorithms, namely K-Means Clustering, Random Forest Regression, and Word2Vec, executed on large, distributed datasets. Using these metrics, we develop machine learning models that accurately forecast algorithm performance based on input data characteristics and system configurations. As a proactive alternative, these models can reduce execution delays and resource waste, thereby improving the big data analytics experience overall.
The remainder of this paper is structured as follows. In Section 2, the background and related work are presented, together with existing approaches and research gaps. In Section 3, we outline the methodology used for data collection and predictive model training, along with the systems and software utilized throughout the study. Section 4 evaluates the performance of our predictive models, discussing their impact on optimizing big data analytics workflows in Apache Spark along with their evaluation metrics. In Section 5, the key findings are summarized and discussed, while our paper concludes with Section 6 and Section 7, which present the Conclusions and Future Work.

3. Methodology

The proposed methodology for predicting the performance of machine learning (ML) operators in Apache Spark is illustrated in Figure 1. The process begins with the infrastructure setup (Step 1), in which a multi-node Apache Spark cluster is deployed. The cluster combines a leader node and a set of follower nodes, all linked via a private network, enabling distributed data processing. In Step 2, we generate data using generators tailored to the ML algorithms under study, namely K-Means, Random Forest, and Word2Vec. By varying their size and dimensionality, the datasets provide a wide range of inputs for assessing operator performance.
Figure 1. Illustration of the methodology.
After the datasets are created, the ML operators are executed with different parameters in Step 3, and metrics including execution time, CPU utilization, and memory usage are collected. These performance metrics, which capture operator resource use and time efficiency, are gathered while execution is in progress and serve as the training data for the models. In Step 4, the metrics are fed into a prediction model, which is trained in Step 5 to project the performance of ML operators from the dimensions of the input dataset. The final model predicts key performance indicators, including execution time and memory usage, making it possible to optimize and plan computational resources for future operations.
Figure 1 offers an overarching view of the system architecture, while Figure 2 shows the internal architecture of the workflow for dataset preparation, training, and validation. The latter diagram emphasizes the logical steps, starting from synthetic dataset generation and culminating in model training and performance evaluation, effectively complementing the high-level depiction in the former.
Figure 2. Workflow for dataset preparation, training, and validation.

3.1. Infrastructure Setup

This section describes the configuration of a multi-node Apache Spark cluster hosted on the DIOGENIS service at the Computer Engineering and Informatics Department of the University of Patras, Greece. The cluster consists of one leader node and one follower node, represented as $N = \{M_1, S_1\}$, where $M_1$ is the leader node and $S_1$ is the follower node. The primary objective of this setup is to distribute computational tasks efficiently for large-scale data processing using Spark's built-in distributed computation model.
Given a machine learning operator $O$, the goal is to calculate its performance in a specific scenario $P$. We aim to evaluate the performance by measuring two metrics: total processing time and memory usage. These are modeled as functions of the underlying infrastructure and the data distribution across the nodes, and will be evaluated later using the performance space $\mathbb{R}^2$, where each point $p \in \mathbb{R}^2$ is a tuple of (execution time, memory usage).

3.2. Cluster Architecture and Mathematical Model

Apache Spark adopts a leader–follower architecture for distributed computing, which can be formally represented as a graph $G = (V, E)$, where
  • $V = \{M_1, S_1\}$ is the set of nodes, with $M_1$ as the leader node and $S_1$ as the follower node.
  • $E = \{e_{M_1 S_1}\}$ is the set of directed edges, where $e_{M_1 S_1}$ symbolizes task distribution from the leader to the follower node.
The leader node $M_1$ coordinates task execution by assigning computational jobs $J$, denoted as $J = \{J_1, J_2, \ldots, J_n\}$, to follower nodes, distributing them across the available resources. Formally, the resource allocation to node $S_1$ can be represented as follows:
$$\text{Resources}(S_1) = \{\text{CPU}(S_1), \text{RAM}(S_1)\}$$
Each job $J_i$ has an associated computation time $t_i$ and memory requirement $m_i$, leading to the evaluation of performance metrics. Let $T_{\text{total}}$ represent the total computation time, and $M_{\text{total}}$ the total memory usage across all jobs $J$ executed by the cluster.

3.2.1. Leader–Follower Task Assignment

The leader node $M_1$ dispatches tasks to the follower nodes based on available resources, which can be mathematically modeled as a function $f_{\text{assign}}$, where
$$f_{\text{assign}}(M_1, S_1) = \begin{cases} J_i & \text{if CPU and RAM are available on } S_1, \\ 0 & \text{otherwise.} \end{cases}$$
The processing capability of $S_1$ can be denoted as $C(S_1)$, which is the total computational power available and is a function of CPU cores and memory:
$$C(S_1) = \text{CPU}(S_1) \times \text{RAM}(S_1)$$
Thus, the execution time for each job $J_i$ can be expressed as follows:
$$t_i = \frac{J_i}{C(S_1)}$$
The overall execution time for the cluster, $T_{\text{total}}$, for a given scenario $P$ is as follows:
$$T_{\text{total}} = \sum_{i=1}^{n} t_i = \sum_{i=1}^{n} \frac{J_i}{C(S_1)}$$
where $n$ is the number of jobs distributed across the nodes.

3.2.2. Performance Metrics: Time and Memory Usage in $\mathbb{R}^2$

To evaluate the performance of the operator $O$ in scenario $P$, we calculate both the total time $T_{\text{total}}$ and the memory usage $M_{\text{total}}$. These are represented in the performance space $\mathbb{R}^2$, where
$$p_{\text{performance}} = (T_{\text{total}}, M_{\text{total}}) \in \mathbb{R}^2$$
The memory usage for the entire cluster is given by
$$M_{\text{total}} = \sum_{i=1}^{n} m_i$$
where $m_i$ is the memory footprint of each job $J_i$. Therefore, the performance of operator $O$ for scenario $P$ can be summarized as follows:
$$p_{\text{performance}} = \left( \sum_{i=1}^{n} \frac{J_i}{C(S_1)}, \ \sum_{i=1}^{n} m_i \right)$$
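To make the formulas above concrete, the following minimal sketch computes the performance point from a handful of hypothetical job sizes and memory footprints; the numerical values and the capacity $C(S_1)$ are illustrative assumptions, not measurements from our cluster.

```python
# Minimal sketch of the performance-space computation defined above.
# All numbers are hypothetical; they only illustrate the formulas.
job_sizes = [120.0, 80.0, 200.0]     # J_i: work units per job
job_memory = [512.0, 256.0, 1024.0]  # m_i: memory footprint per job (MB)
capacity = 4.0                       # C(S_1): assumed processing capability of the follower node

t_total = sum(j / capacity for j in job_sizes)  # T_total = sum_i J_i / C(S_1)
m_total = sum(job_memory)                       # M_total = sum_i m_i

p_performance = (t_total, m_total)              # point in R^2: (execution time, memory usage)
print(p_performance)                            # (100.0, 1792.0)
```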

3.3. Data Partitioning and Task Distribution

In Apache Spark, data are divided into partitions, each representing a subset of the dataset distributed across the worker nodes. Let $D$ be the dataset, and $P(D)$ the set of partitions:
$$P(D) = \{D_1, D_2, \ldots, D_k\}$$
where $k$ is the total number of partitions. Each partition $D_i$ is processed independently by the worker nodes, enabling parallel execution. The relation between partitions and workers can be modeled as follows:
$$f_{\text{partition}}(P(D), W) = \frac{k}{W}$$
where $W$ represents the number of available workers in the cluster, and $k/W$ is the partition-to-worker ratio. For our two-node setup ($W = 1$ worker on $S_1$), the partition processing time is directly proportional to the total execution time:
$$T_{\text{partition}} = \sum_{i=1}^{k} \frac{D_i}{C(S_1)}$$
Thus, the total time $T_{\text{total}}$ is the sum of the partition processing times, which also accounts for Spark's task parallelism.
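For illustration, the short sketch below (a local stand-in for the cluster setup, with a toy DataFrame in place of the generated datasets) inspects the partition count $k$ and the resulting partition-to-worker ratio; it uses only getNumPartitions and repartition from the standard PySpark API.

```python
from pyspark.sql import SparkSession

# Local stand-in for the cluster; on the real cluster the master URL would point to the leader node.
spark = SparkSession.builder.master("local[*]").appName("partition-sketch").getOrCreate()

df = spark.range(0, 1_000_000)     # toy dataset standing in for D
k = df.rdd.getNumPartitions()      # number of partitions k
W = 1                              # available workers (one follower node in our setup)
print(f"k = {k}, W = {W}, partition-to-worker ratio k/W = {k / W:.1f}")

# Repartitioning changes k, and hence how work is spread over the available workers.
df = df.repartition(4 * W)
print(f"after repartition: k = {df.rdd.getNumPartitions()}")
```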

3.4. Goal of Evaluation

The ultimate objective of this study is to evaluate the performance of the operator $O$ in scenario $P$ by analyzing $p_{\text{performance}}$ in $\mathbb{R}^2$. The evaluation criteria include total execution time and memory consumption as defined above. Later sections will provide empirical results and further evaluate how these metrics behave under different configurations of Spark and varying data sizes.

3.5. Initialization

In this phase, we initialized the Spark session and made the necessary modifications to its default configuration options, based on a comprehensive analysis of system performance and technical constraints. We aimed to adjust these settings to optimize resource usage and to guarantee the efficient execution of $O$ in the experiment scenario $P$.
Let the total memory allocated to each executor, $\text{Mem}_{\text{exec}}$, be represented as follows:
$$\text{Mem}_{\text{exec}} = \texttt{spark.executor.memory} = 2000\ \text{MB}$$
Here, $\text{Mem}_{\text{exec}}$ refers to the maximum amount of memory that each Spark executor can access. This setting ensures that the executors operate within the resource boundaries of the available virtual machines (VMs), making the best use of them without overburdening the system. Additionally, we assigned each executor a number of cores, $\text{Cores}_{\text{exec}}$:
$$\text{Cores}_{\text{exec}} = \texttt{spark.executor.cores} = 2$$
The driver memory $\text{Mem}_{\text{driver}}$ was set to the following:
$$\text{Mem}_{\text{driver}} = \texttt{spark.driver.memory} = 3\ \text{GB}$$
This value was determined through empirical testing, as smaller values significantly degraded system performance. Furthermore, we reduced the periodic garbage collection interval $\text{GC}_{\text{interval}}$ to
$$\text{GC}_{\text{interval}} = \texttt{spark.cleaner.periodicGC.interval} = 1\ \text{minute}$$
This adjustment helped to mitigate the risk of memory overload by forcing frequent garbage collection, ensuring unused memory was quickly reclaimed.
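A minimal sketch of how these settings can be applied when creating the Spark session is shown below. The configuration keys are standard Spark properties and the values mirror those listed above; note that in practice the driver memory usually has to be supplied before the driver JVM starts (e.g., via spark-submit), so treating it as a session option here is an illustrative simplification.

```python
from pyspark.sql import SparkSession

# Sketch of the session initialization described in this section.
spark = (
    SparkSession.builder
    .appName("mllib-performance-experiments")
    .config("spark.executor.memory", "2000m")             # Mem_exec
    .config("spark.executor.cores", "2")                  # Cores_exec
    .config("spark.driver.memory", "3g")                  # Mem_driver (normally set via spark-submit)
    .config("spark.cleaner.periodicGC.interval", "1min")  # GC_interval
    .getOrCreate()
)
```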

3.6. Dataset Generation

The dataset generation process was a critical component of our experimental setup, allowing us to evaluate the performance of operator $O$ across various dataset configurations. Let the dataset $D$ be characterized by the number of samples $n$ and the number of features $f$. We generated datasets dynamically based on these parameters, with the goal of controlling the scale and dimensionality of the data.
For the KMeans and RandomForest operators, we used standard functions from the sklearn.datasets package, specifically the following:
$$D_{\text{KMeans}} = \texttt{make\_blobs}(n, f, \text{clusters})$$
$$D_{\text{RandomForest}} = \texttt{make\_regression}(n, f)$$
where $D_{\text{KMeans}}$ and $D_{\text{RandomForest}}$ represent the datasets generated for the KMeans and RandomForest algorithms, respectively. The number of clusters $c$ in KMeans was a configurable parameter, ranging from 2 to 20.
For the Word2Vec operator, we generated datasets of sentences, each sentence being a random sequence of words. The dataset $D_{\text{Word2Vec}}$ is represented as follows:
$$D_{\text{Word2Vec}} = \{\text{Sentence}_1, \text{Sentence}_2, \ldots, \text{Sentence}_n\}$$
Each sentence contains a number of words $w$, where $w$ ranges from 15 to 50, based on linguistic research on average sentence length. The sentences were generated by randomly sampling from a dictionary of 200 unique words. The generated datasets were saved in CSV format, and their metadata, including $n$ and $f$, were stored in separate files.
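The sketch below illustrates the generation procedure described above. The make_blobs and make_regression functions are the actual sklearn.datasets utilities; the vocabulary size and sentence-length bounds follow the text, while the helper names and output paths are chosen for illustration.

```python
import random
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs, make_regression

def gen_kmeans(n, f, clusters):
    X, _ = make_blobs(n_samples=n, n_features=f, centers=clusters, random_state=42)
    return pd.DataFrame(X)

def gen_random_forest(n, f):
    X, y = make_regression(n_samples=n, n_features=f, random_state=42)
    return pd.DataFrame(np.column_stack([X, y]))

def gen_word2vec(n, vocab_size=200, min_words=15, max_words=50):
    vocab = [f"word{i}" for i in range(vocab_size)]
    sentences = [" ".join(random.choices(vocab, k=random.randint(min_words, max_words)))
                 for _ in range(n)]
    return pd.DataFrame({"sentence": sentences})

# Example: generate one K-Means dataset and persist it with its metadata (n, f, clusters).
df = gen_kmeans(n=100_000, f=20, clusters=5)
df.to_csv("kmeans_n100000_f20_c5.csv", index=False)
pd.DataFrame([{"n": 100_000, "f": 20, "clusters": 5}]).to_csv("kmeans_n100000_f20_c5.meta.csv", index=False)
```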

3.7. Dataset Sources and Relevance

This study used synthetic datasets designed to enable controlled performance experiments on Apache Spark operators. These datasets were particularly important because they allowed strict control of parameters (size, feature dimensionality, and complexity) and permitted a comprehensive study of computational performance and scalability.
  • Synthetic Datasets: Generated using Python 3.10 and the scikit-learn library for the key MLlib operators:
    make_blobs for K-Means Clustering: Configured to test scalability across varying numbers of clusters and data dimensions.
    make_regression for Random Forest: Designed to simulate regression problems with tunable complexity.
    Custom sentence generators for Word2Vec: Created sequences of words with varying sentence lengths, mimicking natural language text.
Synthetic datasets made it possible to systematically explore operator behavior over multiple configurations. While we did not use real-world datasets directly, the synthetic datasets were generated to resemble the behavior of real-world machine learning workflows, including clustering, regression, and text processing. The approach offers a solid basis for understanding operator performance in controlled settings and can be extended to real-world use in the future.

3.8. Model Training

Apache Spark MLlib’s operators, including K-Means, Random Forest, and Word2Vec, were configured with varied hyperparameters:
  • K-Means: Number of clusters (c) varied from 2 to 20.
  • Random Forest: Tree depths ranged from 5 to 50.
  • Word2Vec: Sentence lengths varied between 15 and 50 words.
Training was conducted on a distributed multi-node Spark cluster to ensure scalability.
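The following sketch shows how the three MLlib estimators can be configured with hyperparameters from the ranges above. The estimator classes and parameter names (k, maxIter, maxDepth, vectorSize, minCount) are part of the standard pyspark.ml API; the tiny in-line DataFrames merely stand in for the generated datasets.

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.master("local[*]").appName("operator-config-sketch").getOrCreate()

# Tiny illustrative DataFrames; the real experiments use the generated CSV datasets.
features_df = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.0]),), (Vectors.dense([5.0, 6.0]),)], ["features"])
regression_df = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.0]), 1.0), (Vectors.dense([5.0, 6.0]), 10.0)], ["features", "label"])
sentences_df = spark.createDataFrame(
    [("some random generated words".split(" "),), ("another generated sentence here".split(" "),)], ["words"])

# Hyperparameter values drawn from the ranges above: k in [2, 20], depth in [5, 50], 15-50 words per sentence.
kmeans = KMeans(k=2, maxIter=10, featuresCol="features")
rf = RandomForestRegressor(maxDepth=5, featuresCol="features", labelCol="label")
w2v = Word2Vec(vectorSize=10, minCount=0, inputCol="words", outputCol="vectors")

kmeans_model = kmeans.fit(features_df)
rf_model = rf.fit(regression_df)
w2v_model = w2v.fit(sentences_df)
```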

3.9. Model Testing and Validation

The key metrics for evaluation were as follows:
  • Execution Time: Measured as the elapsed time to complete training.
  • Memory Usage: Monitored through Spark’s REST API to assess resource consumption.
Cross-validation was employed to ensure reliability of predictions.

3.10. Execution

The execution phase involved running each operator $O$ on various datasets $D$, while adjusting the number of rows $n$ and columns $f$. The operator execution time $T(O)$ and memory usage $M(O)$ were the primary metrics of interest. For each operator, the number of rows $n$ and columns $f$ were varied according to the ranges of the datasets. The performance of KMeans depends on $n$, $f$, and the number of centers $c$:
$$T(O_{\text{KMeans}}) \propto n \times f \times c$$
For RandomForest, the execution time $T(O_{\text{RandomForest}})$ scales with the number of trees $t$ and the dataset dimensions:
$$T(O_{\text{RandomForest}}) \propto n \times f \times t$$
In the case of Word2Vec, the execution time $T(O_{\text{Word2Vec}})$ depends on the length of the sentences $w$ and the number of rows $n$:
$$T(O_{\text{Word2Vec}}) \propto n \times w$$
All models were fit to the datasets without performing predictions, as the goal was to evaluate the training time and memory consumption of each operator. The number of iterations for KMeans was fixed at 10 to avoid convergence failures. Throughout the execution phase, system resources, including CPU usage and memory consumption, were closely monitored, and garbage collection was manually triggered to free up memory when necessary.
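A minimal sketch of a single execution run, following the procedure above: the estimator is only fitted (no predictions are made) and the elapsed time is recorded. The helper name run_once and the explicit gc.collect() call are illustrative choices rather than part of the original code.

```python
import gc
import time

def run_once(estimator, train_df):
    """Fit one configured MLlib estimator on one generated dataset; return the training time and the model."""
    start = time.time()
    model = estimator.fit(train_df)   # training time is the quantity of interest; no predictions are made
    elapsed = time.time() - start
    gc.collect()                      # manually trigger garbage collection between runs
    return elapsed, model
```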

3.11. Statistics Collection

The final phase of each experiment involved logging and preserving key statistics to be used in the performance evaluation of the operators. The most significant metrics collected were execution time T ( O ) and memory usage M ( O ) .
Let the dataset metadata $\text{Meta}(D)$ be represented as follows:
$$\text{Meta}(D) = (n, f, \text{Size}_{\text{MB}}, \text{Clusters})$$
where $\text{Size}_{\text{MB}}$ is the size of the dataset in megabytes, and $\text{Clusters}$ is the number of clusters for KMeans datasets.
The training time $T(O)$ was measured using Python's time module and included only the time taken for the operator to fit the dataset:
$$T(O) = \text{End\_time} - \text{Start\_time}$$
Memory usage $M(O)$ was collected via the Spark UI REST API by aggregating the memory used across all executors:
$$M(O) = \sum_{i=1}^{e} \text{memoryUsed}_i$$
where $e$ is the number of executors. The collected metrics were stored and later used for performance analysis.
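The aggregation in the equation above can be reproduced through Spark's monitoring REST API, which exposes per-executor statistics including a memoryUsed field. The sketch below assumes the UI is reachable on the default driver port 4040; the host and the choice of the first listed application are assumptions for illustration.

```python
import requests

BASE = "http://localhost:4040/api/v1"   # default Spark UI REST endpoint on the driver

app_id = requests.get(f"{BASE}/applications").json()[0]["id"]
executors = requests.get(f"{BASE}/applications/{app_id}/executors").json()

# M(O) = sum over executors of memoryUsed (reported in bytes by the API).
m_total = sum(e["memoryUsed"] for e in executors)
print(f"aggregate executor memory used: {m_total / 1024**2:.1f} MB")
```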
In conclusion, the two key metrics—execution time and memory usage—were the most representative values for assessing the performance of each operator. These values formed the foundation for predicting operator behavior and optimizing the Spark cluster’s configuration.

3.12. Rationale for Metrics

Execution time and memory usage were chosen as the main metrics due to their direct impact on the scalability and efficiency of distributed systems. These metrics were chosen for the following reasons:
  • They serve as industry-standard indicators for evaluating computational performance.
  • They help to identify bottlenecks in resource allocation and operator configuration.
Thus, the metrics proved essential in building predictive models for Apache Spark configuration optimization as well as overall system performance.

4. Experimental Results

In this section, we present the results obtained from the experimental evaluation of the three operators, namely KMeans, RandomForest, and Word2Vec. In each experiment, the operator $O$ was tested on an input dataset $D$ characterized by the number of samples $n$, the number of features $f$, and the dataset size $S_D$. The resulting records, which contain the input features and target columns listed below, were then used for analysis and model prediction.
  • Total Time: The time T ( O ) that the operator took to fit the data.
  • Memory Usage: The amount of memory M ( O ) used by the cluster.
  • num_samples: The number of samples n in the dataset.
  • num_features: The number of features f in each sample.
  • num_classes: The number of clusters c (only applicable for KMeans).
  • dataset_size: The dataset size S D in megabytes.
The first two columns (Total Time and Memory Usage) serve as target variables for our prediction models, while the others are input features. In addition to identifying the input features, we examined the correlation between these features and the target variables, using data visualization techniques to better understand the experimental findings.
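As an illustration of how the collected records can be turned into a predictive model, the sketch below loads the per-experiment metrics into a feature matrix and fits a regressor for total time. The CSV path and the lowercase column names are placeholders that mirror the feature and target list above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

runs = pd.read_csv("kmeans_runs.csv")   # placeholder path: one row per experiment

features = runs[["num_samples", "num_features", "num_classes", "dataset_size"]]
target_time = runs["total_time"]        # T(O); "memory_usage" would be the second target

X_train, X_test, y_train, y_test = train_test_split(
    features, target_time, test_size=0.2, random_state=42)

model = RandomForestRegressor(random_state=42).fit(X_train, y_train)
print("held-out R^2:", model.score(X_test, y_test))
```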

4.1. KMeans Operator

The KMeans operator O KMeans was used with a set of different datasets to investigate how the number of samples n, features f, and clusters c influenced the overall computation time T ( O ) and memory requirements M ( O ) .

4.1.1. Total Time

The relationship between the total time $T(O_{\text{KMeans}})$ and the number of samples $n$ is illustrated in Figure 3. As expected, the time of the KMeans algorithm grows linearly with the number of samples, as
$$T(O_{\text{KMeans}}) \propto n \times f$$
Figure 3. Total time for KMeans in relation to number of samples and features. (a) Relationship between total time and number of samples for K-Means. (b) Total time and number of features relationship for KMeans.
This is consistent with the time complexity of KMeans, which is $O(n \cdot k \cdot f)$, where $k$ represents the number of clusters, $n$ the number of samples, and $f$ the number of features. Figure 3 shows the linear trend for both samples and features.
The linear increase in time with sample size demonstrates the scalability of the algorithm under varying dataset sizes in Figure 3a.
Figure 4 shows that the number of clusters $c$ has minimal impact on total time. On the other hand, the dataset size $S_D$ has a strong linear correlation with time, indicating that the larger the dataset, the longer the processing time.
Figure 4. Total time for KMeans in relation to number of clusters and dataset size. (a) Total time and number of clusters relationship for KMeans. (b) Total time and dataset size relationship for KMeans.

4.1.2. Memory Usage

Similarly, memory usage $M(O_{\text{KMeans}})$ was analyzed with respect to the input features. Figure 5 shows that the memory usage scales with the number of samples $n$ and the number of features $f$:
$$M(O_{\text{KMeans}}) \propto n \quad \text{and} \quad M(O_{\text{KMeans}}) \propto f$$
Figure 5. Memory usage for KMeans in relation to number of samples and features. (a) Memory usage and number of samples relationship for KMeans. (b) Memory usage and number of features relationship for KMeans.
Figure 6 shows that while the number of clusters c has little impact on memory usage, the dataset size S D shows a near-linear relationship with memory usage.
Figure 6. Memory usage for KMeans in relation to number of clusters and dataset size. (a) Memory usage and number of clusters relationship for KMeans. (b) Memory usage and dataset size relationship for KMeans.

4.2. Random Forest Operator

The RandomForest operator $O_{\text{RF}}$ was evaluated using the same methodology. The total time $T(O_{\text{RF}})$ was found to exhibit linear growth with both the number of samples and the number of features, similar to KMeans:
$$T(O_{\text{RF}}) \propto n \times f$$
Figure 7 confirms these relationships. Additionally, as shown in Figure 8, the time required increases as the number of samples grows, and the memory usage follows a similar pattern. The memory usage results for the Random Forest method are shown in Figure 9.
Figure 7. Time taken for Random Forest in relation to number of samples and features. (a) Time taken and number of samples relationship for Random Forest. (b) Time taken and number of features relationship for Random Forest.
Figure 8. Time taken and memory usage for Random Forest in relation to dataset size and number of samples. (a) Time taken and dataset size relationship for Random Forest. (b) Memory usage and number of samples relationship for Random Forest.
Figure 9. Memory usage for Random Forest in relation to number of features and dataset size. (a) Memory usage and number of features relationship for Random Forest. (b) Memory usage and dataset size relationship for Random Forest.

4.3. Word2Vec Dataset

The results of the performance analysis for the Word2Vec model are presented in Figure 10. The first set of plots illustrates the relationship between total time and the number of samples and features, as shown in Figure 10a,b. These plots demonstrate a clear upward trend in time consumption as both the number of samples and features increase, highlighting the computational demands of the Word2Vec method. Additionally, Figure 11 and Figure 12 depict the memory usage in relation to the number of samples, features, and dataset size. These figures reveal the substantial memory requirements of Word2Vec, further emphasizing its complexity. The plots collectively provide a comprehensive understanding of how the Word2Vec operator scales with increasing data size and feature dimensions.
Figure 10. Total time performance metrics for Word2Vec. (a) Total time vs. number of samples (Word2Vec). (b) Total time vs. number of features (Word2Vec).
Figure 11. Memory usage performance metrics for Word2Vec (Part 1). (a) Total time vs. dataset size. (b) Memory usage vs. number of samples.
Figure 12. Memory usage performance metrics for Word2Vec (Part 2). (a) Memory usage vs. number of features. (b) Memory usage vs. dataset size.

4.4. Model Performance Comparison

We compared the performance of the different models by evaluating metrics such as the mean absolute error (MAE) and $R^2$. The $R^2$ metric measures the proportion of the variance in the dependent variable that is predictable from the independent variables, while the MAE measures the average absolute difference between the predicted and actual values.
To provide a more comprehensive evaluation of model accuracy, we also included the Mean Squared Error (MSE) as an additional performance metric alongside $R^2$ and MAE. MSE offers a squared perspective on prediction errors, penalizing larger deviations more heavily; it gives a sense of how the errors are distributed and which values are outliers. The results indicate that Gradient Boosting performed best across the experiments, confirming its suitability for modeling resource utilization patterns in Apache Spark MLlib.
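Continuing the placeholder sketch shown earlier in this section (which produced model, X_test, and y_test), the three metrics can be computed on the held-out runs with scikit-learn's standard functions:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)  # average absolute deviation
mse = mean_squared_error(y_test, y_pred)   # penalizes large deviations more heavily
r2 = r2_score(y_test, y_pred)              # proportion of variance explained
print(f"MAE={mae:.3f}  MSE={mse:.3f}  R^2={r2:.3f}")
```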

4.4.1. Random Forest Regression

We plotted the $R^2$ scores and MAE results in separate graphs. As seen in the graph for total time, the Random Forest Regressor achieved the highest $R^2$ score, closely followed by the Gradient Boosting Regressor. The other models lagged significantly behind in terms of performance. The results for the total time of the Random Forest are shown in Figure 13, while those for memory usage are shown in Figure 14.
Figure 13. Total time performance metrics for Random Forest Regressor. (a) Total time vs. R2 score for Random Forest Regressor. (b) Total time vs. MAE for Random Forest Regressor.
Figure 14. Memory usage performance metrics for Random Forest Regressor. (a) Memory usage vs. R2 score for Random Forest Regressor. (b) Memory usage vs. MAE for Random Forest Regressor.
Similarly, in the graph for memory usage, the Random Forest Regressor and the Gradient Boosting Regressor stood out as the top-performing models, with the former outperforming the latter. Once again, the other models were far behind in terms of performance.
After running numerous tests, we found that the Random Forest Regressor had much more stable results across multiple runs. Based on its superior performance, we decided to use the Random Forest Regressor as our final model.

4.4.2. KMeans Regression

The results from the evaluation of total time performance suggest that the Gradient Boosting Regressor performs better on both the $R^2$ and MAE metrics than the Random Forest Regressor. Among the evaluated models, these two demonstrate the highest accuracy. Conversely, the remaining models exhibit lower levels of accuracy. More specifically, the results for the total time of the KMeans method are given in Figure 15, while the results for memory usage are given in Figure 16.
Figure 15. Total time performance metrics for KMeans. (a) Total time vs. R2 score for KMeans. (b) Total time vs. MAE for KMeans.
Figure 16. Memory usage performance metrics for KMeans. (a) Memory usage vs. R2 score for KMeans. (b) Memory usage vs. MAE for KMeans.
For memory usage, the evaluated models showed similar performance, with no significant differences. Based on the evaluation results for both total time and memory usage, the Gradient Boosting Regressor exhibits the best overall performance.

4.4.3. Word2Vec Regression

The evaluation of the total time performance revealed that the Linear Regression model outperformed the other models in terms of both the $R^2$ and MAE metrics. However, as previously mentioned, the data points for memory usage were insufficient to train an accurate model. Thus, the performance for both $R^2$ and MAE is notably low, as shown in Figure 17 and Figure 18.
Figure 17. Total time performance metrics for Word2Vec. (a) Total time vs. R2 score for Word2Vec. (b) Total time vs. MAE for Word2Vec.
Figure 18. Memory usage performance metrics for Word2Vec. (a) Memory usage vs. R2 score for Word2Vec. (b) Memory usage vs. MAE for Word2Vec.

4.5. Limitations of Word2Vec and Recommendations

The memory consumption predictions for the Word2Vec model displayed inconsistencies, as evidenced by a lower R 2 value (0.06). This discrepancy can be attributed to the complexity of the algorithm and the irregular memory allocation patterns caused by processing unstructured textual data.
To address this, we recommend integrating advanced memory profiling tools and developing enhanced feature engineering techniques such as the following:
  • Incorporating word frequency distributions and text complexity metrics as additional input features for predictive modeling.
  • Adopting hierarchical or hybrid modeling approaches to capture the nonlinear behavior inherent in text-based algorithms.
These strategies aim to improve the accuracy of memory usage predictions for Word2Vec, thereby enhancing its applicability in large-scale distributed systems.

4.6. Statistical Validation of Experimental Results

To evaluate the statistical significance of the observed differences in execution time and memory usage across the models (K-Means, Random Forest, and Word2Vec), we calculated the confidence intervals (CI) at the 95% level and conducted hypothesis tests using ANOVA for multi-group comparisons. The resulting p-values were below 0.05, indicating significant differences in the performance metrics across models.
For instance, the Random Forest model demonstrated a significant reduction in memory usage compared to K-Means and Word2Vec, with a 95% CI ranging from 12.4% to 18.9%. These statistical validations provide greater confidence in the reported improvements and demonstrate the robustness of the proposed predictive modeling framework.
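A sketch of the statistical procedure described above, using scipy.stats; the per-run memory values below are placeholders standing in for the measured samples of each operator.

```python
import numpy as np
from scipy import stats

# Placeholder arrays: per-run memory usage (MB) for each operator.
kmeans_mem = np.array([410.0, 422.0, 398.0, 415.0])
rf_mem = np.array([350.0, 342.0, 361.0, 348.0])
w2v_mem = np.array([510.0, 498.0, 523.0, 505.0])

# One-way ANOVA across the three groups.
f_stat, p_value = stats.f_oneway(kmeans_mem, rf_mem, w2v_mem)
print(f"ANOVA: F={f_stat:.2f}, p={p_value:.4f}")

# 95% confidence interval for the mean of one group (t-distribution).
mean = rf_mem.mean()
sem = stats.sem(rf_mem)
ci_low, ci_high = stats.t.interval(0.95, df=len(rf_mem) - 1, loc=mean, scale=sem)
print(f"Random Forest memory mean 95% CI: [{ci_low:.1f}, {ci_high:.1f}] MB")
```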

5. Discussion

In this study, we explored the inner workings of Apache Spark’s MLlib library, which provides a range of algorithms and tools for processing large datasets and building machine learning models. Our focus was on three key operators: K-Means Clustering, Random Forest Regressor, and Word2Vec. We trained several models to predict the performance of each operator based on two key metrics: total execution time and memory usage. The results of the best-performing models for each operator are summarized in Table 2.
Table 2. Best models and corresponding metrics for each operator.
The results indicate strong predictive accuracy for both K-Means and Random Forest operators, particularly in predicting memory usage, with R 2 scores of 0.98 and 0.76, respectively. The Gradient Boosting Regressor (GBR) showed the highest performance for K-Means, achieving R 2 values of 0.922 for total time and 0.98 for memory usage. For the Random Forest operator, the Random Forest Regressor (RFR) performed best with R 2 scores of 0.942 for total time and 0.76 for memory usage.
On the other hand, the Word2Vec operator exhibited different behavior, with a Linear Regressor performing best in predicting total execution time ( R 2 = 0.972 ), but showing poor accuracy in predicting memory usage ( R 2 = 0.06 ). This discrepancy suggests that the memory consumption patterns for Word2Vec are more complex and may require more sophisticated modeling techniques, possibly due to the large-scale text data and the algorithm’s complexity.
These results emphasize the importance of selecting appropriate models for different operators and metrics. For structured data, such as in K-Means and Random Forest, ensemble models like Gradient Boosting and Random Forest Regressor performed well. In contrast, the simpler Linear Regressor was suitable for predicting execution time in the Word2Vec operator, but inadequate for memory usage prediction. The poor performance in memory prediction for Word2Vec suggests the need for additional research into more advanced techniques, possibly by incorporating more features or preprocessing steps to better capture the memory profiles of text-based algorithms.
Overall, this study demonstrates how predictive modeling can enhance our understanding of the performance of Apache Spark’s MLlib operators. By using machine learning models, we can anticipate execution time and memory usage more effectively, which is crucial for optimizing the performance of distributed systems. Future work could focus on improving the predictability of memory usage for complex operators like Word2Vec and exploring new machine learning methods to further refine the capabilities of Spark’s MLlib in large-scale data processing.

Real-World Applications and Limitations

  • Real-World Applications: The predictive models in this study have direct application in several industrial settings. In manufacturing, for example, they can optimize resource allocation for quality control processes such as machine-learning-based defect detection, improving efficiency, reducing downtime, and increasing production throughput. In healthcare analytics, predictive models can be used to plan costs and ensure timely results for priority tasks such as medical image processing and patient outcome prediction.
    These models can also help in logistics and supply chain management, where real-time analytics workflows such as route planning and inventory control incur delays when processing large datasets, causing financial inefficiencies. In such dynamic environments, the ability to forecast execution time and memory usage enables proactive scheduling and resource management.
  • Challenges and Limitations: There are several limitations. Training the predictive models incurs computational overhead, and at sufficient scale this overhead could outweigh the resource savings achieved during execution. Moreover, integrating these models into existing Apache Spark workflows requires architectural modifications. Another limitation is the dependency on hyperparameter tuning, which may make the approach challenging to apply in resource-constrained or expertise-scarce environments.
  • Future Directions: To address these limitations, future research could explore hybrid modeling techniques to make the predictive task more scalable and robust. Furthermore, embedding automated hyperparameter optimization frameworks within the predictive modeling process could simplify the deployment of these tools for broader industrial use.

6. Conclusions

This study demonstrated the feasibility of using predictive models to estimate the performance of Apache Spark's MLlib operators, focusing on two key metrics: total execution time and memory consumption. To provide a broad basis for evaluation, we ran experiments varying the dataset size, the number of features, and the sample counts for three major machine learning operators: K-Means, Random Forest, and Word2Vec. The results show that both execution time and memory usage can be accurately modeled using strong predictive models such as Gradient Boosting and the Random Forest Regressor, particularly for structured-data algorithms such as K-Means and Random Forest. The $R^2$ scores for these models are very high, which suggests that the relationships between dataset characteristics and operator performance can be captured using nonlinear techniques.
Nevertheless, the results also highlight some limitations, in particular for unstructured data algorithms such as Word2Vec. Although Linear Regression largely accounted for the total execution time, the low $R^2$ for memory usage indicates that simple linear models cannot capture the memory usage patterns of more complex operators. This finding highlights the importance of developing more sophisticated models that can handle the nonlinear behavior of algorithms that process textual data and other unstructured inputs, which may constitute a new area of study.
While this study provided valuable insights into predictive modeling for structured algorithms like K-Means and Random Forest, it also highlighted significant challenges in predicting memory usage for unstructured data algorithms, such as Word2Vec. Addressing these challenges requires the development of sophisticated nonlinear models and enhanced preprocessing techniques tailored to the complexities of unstructured data processing. These refinements are essential for improving predictive accuracy and robustness across diverse machine learning operators in distributed frameworks.
The key contribution of this work is showing that predictive models are a viable tool for optimizing resource allocation in distributed machine learning environments. In large-scale distributed systems such as Apache Spark, job scheduling and resource management are critical, and accurate predictions of execution time and memory usage are essential. Using predictive models, data engineers can not only reduce the chance of data processes running into memory exhaustion or bottlenecks but also anticipate in advance the resources they will need.
This study offered valuable insights into K-Means and Random Forest, but algorithms like Word2Vec remain challenging. Further research into nonlinear models, feature engineering, and advanced data preprocessing methods is needed to obtain more refined memory usage predictions. These refinements will be critical for improving the overall accuracy and robustness of predictive models for any type of machine learning operator in distributed frameworks.

7. Future Work

Several potential improvements can be built upon the findings of this study. First, a clear opportunity exists to investigate whether more advanced machine learning models than those we tested can better predict memory usage for operators such as Word2Vec. Techniques such as deep learning, neural networks, or ensemble methods may be better at capturing the nonlinearity caused by such unstructured data processing.
Further research could also improve the data preprocessing and feature extraction techniques used in this study. Enriching the feature set, for instance with word frequency distributions or text complexity metrics for Word2Vec, may enable better predictions of memory usage. Dimensionality reduction and feature selection techniques would also be candidates for improving the generalization of the predictive models.
Future work may also evaluate how the developed models scale to larger, more heterogeneous datasets. Running these predictive models in different distributed environments and measuring how robust and applicable they are to real-world problems would give valuable insights. Furthermore, incorporating such predictive models directly into the Apache Spark framework may yield substantial performance improvements. By allowing Spark to know in advance how long a job will take and how much memory it needs based on dataset characteristics, scheduling and resource management could be dramatically improved, reducing computational overhead and the waste of machine time and resources. Some specific recommendations follow.
  • Deep Learning for Word2Vec Memory Prediction: Future research should explore advanced machine learning models, such as Recurrent Neural Networks (RNNs) or Transformers, to improve memory usage forecasting for algorithms handling unstructured data. These models are designed to capture complex dependencies and nonlinear relationships inherent in text-based workloads.
  • Feature Engineering Enhancements: Introducing advanced features like word frequency distributions, text complexity metrics, and hierarchical data representations could enhance the predictive accuracy for memory usage. Dimensionality reduction and feature selection techniques should also be investigated to improve generalization across datasets.
  • Scalability Analysis: Testing the developed models on larger, more heterogeneous datasets across various distributed environments can provide valuable insights into their scalability and adaptability to real-world big data scenarios.
  • Integration into Apache Spark: Embedding predictive models directly into Apache Spark could optimize job scheduling and resource management. By enabling the framework to anticipate computational demands based on dataset characteristics, significant reductions in execution overhead and resource wastage could be achieved.
  • Extension to Other Operators: Extending the predictive modeling approach to other Spark MLlib operators or alternative distributed frameworks would generalize the findings, offering broader applicability for resource-efficient big data processing.
Finally, future investigations may explore how predictive modeling can be applied to other operators in Spark MLlib or to other distributed frameworks. Investigating how well these models generalize to other machine learning tasks, such as classification or deep learning-based operators, would further support the development of resource-efficient big data processing systems.

Author Contributions

L.T., A.K. and G.A.K. conceived of the idea, designed and constructed the experiments, drafted the initial manuscript, and revised the final manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Assali, T.; Ayoub, Z.T.; Ouni, S. Multivariate LSTM for Execution Time Prediction in HPC for Distributed Deep Learning Training. In Proceedings of the 2024 IEEE 27th International Symposium on Real-Time Distributed Computing (ISORC), Tunis, Tunisia, 22–25 May 2024; pp. 1–5. [Google Scholar] [CrossRef]
  2. Salman, S.M.; Dao, V.L.; Papadopoulos, A.V.; Mubeen, S.; Nolte, T. Scheduling Firm Real-time Applications on the Edge with Single-bit Execution Time Prediction. In Proceedings of the 2023 IEEE 26th International Symposium on Real-Time Distributed Computing (ISORC), Nashville, TN, USA, 23–25 May 2023; pp. 207–213. [Google Scholar] [CrossRef]
  3. Chen, R. Research on the Performance of Collaborative Filtering Algorithms in Library Book Recommendation Systems: Optimization of the Spark ALS Model. In Proceedings of the 2024 International Conference on Integrated Circuits and Communication Systems (ICICACS), Raichur, India, 23–24 February 2024; pp. 1–6. [Google Scholar] [CrossRef]
  4. Han, M. Research on optimization of K-means Algorithm Based on Spark. In Proceedings of the 2023 IEEE 6th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 24–26 February 2023; Volume 6, pp. 1829–1836. [Google Scholar] [CrossRef]
  5. Pham, T.P.; Durillo, J.J.; Fahringer, T. Predicting workflow task execution time in the cloud using a two-stage machine learning approach. IEEE Trans. Cloud Comput. 2017, 8, 256–268. [Google Scholar] [CrossRef]
  6. Balis, B.; Lelek, T.; Bodera, J.; Grabowski, M.; Grigoras, C. Improving prediction of computational job execution times with machine learning. Concurr. Comput. Pract. Exp. 2024, 36, e7905. [Google Scholar] [CrossRef]
  7. Schizas, N.; Karras, A.; Karras, C.; Sioutas, S. TinyML for Ultra-Low Power AI and Large Scale IoT Deployments: A Systematic Review. Future Internet 2022, 14, 363. [Google Scholar] [CrossRef]
  8. Karras, A.; Giannaros, A.; Theodorakopoulos, L.; Krimpas, G.A.; Kalogeratos, G.; Karras, C.; Sioutas, S. FLIBD: A federated learning-based IoT big data management approach for privacy-preserving over Apache Spark with FATE. Electronics 2023, 12, 4633. [Google Scholar] [CrossRef]
  9. Karras, A.; Karras, C.; Giotopoulos, K.C.; Tsolis, D.; Oikonomou, K.; Sioutas, S. Federated Edge Intelligence and Edge Caching Mechanisms. Information 2023, 14, 414. [Google Scholar] [CrossRef]
  10. Sewal, P.; Singh, H. A Machine Learning Approach for Predicting Execution Statistics of Spark Application. In Proceedings of the 2022 Seventh International Conference on Parallel, Distributed and Grid Computing (PDGC), Solan, India, 25–27 November 2022; pp. 331–336. [Google Scholar] [CrossRef]
  11. Ye, G.; Liu, W.; Wu, C.Q.; Shen, W.; Lyu, X. On Machine Learning-based Stage-aware Performance Prediction of Spark Applications. In Proceedings of the 2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC), Austin, TX, USA, 6–8 November 2020; pp. 1–8. [Google Scholar] [CrossRef]
  12. Ataie, E.; Evangelinou, A.; Gianniti, E.; Ardagna, D. A hybrid machine learning approach for performance modeling of cloud-based big data applications. Comput. J. 2022, 65, 3123–3140. [Google Scholar] [CrossRef]
  13. Gulino, A.; Canakoglu, A.; Ceri, S.; Ardagna, D. Performance Prediction for Data-driven Workflows on Apache Spark. In Proceedings of the 2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Nice, France, 17–19 November 2020; pp. 1–8. [Google Scholar] [CrossRef]
  14. Tsai, L.; Franke, H.; Li, C.S.; Liao, W. Learning-Based Memory Allocation Optimization for Delay-Sensitive Big Data Processing. IEEE Trans. Parallel Distrib. Syst. 2018, 29, 1332–1341. [Google Scholar] [CrossRef]
  15. Gárate-Escamilla, A.K.; El Hassani, A.H.; Andres, E. Big data execution time based on Spark Machine Learning Libraries. In Proceedings of the 2019 3rd International Conference on Cloud and Big Data Computing, Oxford, UK, 28–30 August 2019; pp. 78–83. [Google Scholar]
  16. Wang, G.; Xu, J.; He, B. A Novel Method for Tuning Configuration Parameters of Spark Based on Machine Learning. In Proceedings of the 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), Sydney, NSW, Australia, 12–14 December 2016; pp. 586–593. [Google Scholar] [CrossRef]
  17. Lu, X.; Shankar, D.; Gugnani, S.; Panda, D.K. High-performance design of apache spark with RDMA and its benefits on various workloads. In Proceedings of the 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 5–8 December 2016; pp. 253–262. [Google Scholar] [CrossRef]
  18. Manzi, D.; Tompkins, D. Exploring GPU Acceleration of Apache Spark. In Proceedings of the 2016 IEEE International Conference on Cloud Engineering (IC2E), Berlin, Germany, 4–8 April 2016; pp. 222–223. [Google Scholar] [CrossRef]
  19. Öztürk, M.M. MFRLMO: Model-free reinforcement learning for multi-objective optimization of apache spark. EAI Endorsed Trans. Scalable Inf. Syst. 2024, 11, 1–15. [Google Scholar] [CrossRef]
  20. Ishizaki, K. Analyzing and optimizing java code generation for apache spark query plan. In Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering, Mumbai, India, 7–11 April 2019; pp. 91–102. [Google Scholar]
  21. Giannaros, A.; Karras, A.; Theodorakopoulos, L.; Karras, C.; Kranias, P.; Schizas, N.; Kalogeratos, G.; Tsolis, D. Autonomous vehicles: Sophisticated attacks, safety issues, challenges, open topics, blockchain, and future directions. J. Cybersecur. Priv. 2023, 3, 493–543. [Google Scholar] [CrossRef]
  22. Theodorakopoulos, L.; Karras, A.; Theodoropoulou, A.; Kampiotis, G. Benchmarking Big Data Systems: Performance and Decision-Making Implications in Emerging Technologies. Technologies 2024, 12, 217. [Google Scholar] [CrossRef]
  23. Karras, A.; Giannaros, A.; Karras, C.; Theodorakopoulos, L.; Mammassis, C.S.; Krimpas, G.A.; Sioutas, S. TinyML algorithms for Big Data Management in large-scale IoT systems. Future Internet 2024, 16, 42. [Google Scholar] [CrossRef]
  24. Dong, C.; Akram, A.; Andersson, D.; Arnäs, P.O.; Stefansson, G. The impact of emerging and disruptive technologies on freight transportation in the digital era: Current state and future trends. Int. J. Logist. Manag. 2021, 32, 386–412. [Google Scholar] [CrossRef]
  25. Ohlhorst, F.J. Big Data Analytics: Turning Big Data into Big Money; John Wiley & Sons: Hoboken, NJ, USA, 2012; Volume 65. [Google Scholar]
  26. Vummadi, J.; Hajarath, K. Integration of Emerging Technologies AI and ML into Strategic Supply Chain Planning Processes to Enhance Decision-Making and Agility. Int. J. Supply Chain. Manag. 2024, 9, 77–87. [Google Scholar] [CrossRef]
  27. Sun, Z. Intelligent big data analytics: A managerial perspective. In Managerial Perspectives on Intelligent Big Data Analytics; IGI Global: Hershey, PA, USA, 2019; pp. 1–19. [Google Scholar]
  28. Pouyanfar, S.; Yang, Y.; Chen, S.C.; Shyu, M.L.; Iyengar, S. Multimedia big data analytics: A survey. ACM Comput. Surv. (CSUR) 2018, 51, 1–34. [Google Scholar] [CrossRef]
  29. Sterling, M. Situated big data and big data analytics for healthcare. In Proceedings of the 2017 IEEE Global Humanitarian Technology Conference (GHTC), San Jose, CA, USA, 19–22 October 2017; p. 1. [Google Scholar] [CrossRef]
  30. Ularu, E.G.; Puican, F.C.; Apostu, A.; Velicanu, M. Perspectives on big data and big data analytics. Database Syst. J. 2012, 3, 3–14. [Google Scholar]
  31. Crowder, J.A.; Carbone, J.; Friess, S.; Crowder, J.A.; Carbone, J.; Friess, S. Data analytics: The big data analytics process (bdap) architecture. In Artificial Psychology: Psychological Modeling and Testing of AI Systems; Springer: Cham, Switzerland, 2020; pp. 149–159. [Google Scholar]
  32. Padilha, B.; Schwerz, A.L.; Roberto, R.L. WED-SQL: A Relational Framework for Design and Implementation of Process-Aware Information Systems. In Proceedings of the 2017 IEEE 37th International Conference on Distributed Computing Systems Workshops (ICDCSW), Atlanta, GA, USA, 5–8 June 2017; pp. 364–369. [Google Scholar] [CrossRef]
  33. Udoh, I.S.; Kotonya, G. Developing IoT applications: Challenges and frameworks. IET Cyber-Phys. Syst. Theory Appl. 2018, 3, 65–72. [Google Scholar] [CrossRef]
  34. Horii, S. Improved computation-communication trade-off for coded distributed computing using linear dependence of intermediate values. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; pp. 179–184. [Google Scholar]
  35. Yan, Q.; Yang, S.; Wigger, M. Storage-Computation-Communication Tradeoff in Distributed Computing: Fundamental Limits and Complexity. IEEE Trans. Inf. Theory 2022, 68, 5496–5512. [Google Scholar] [CrossRef]
  36. Jangda, A.; Huang, J.; Liu, G.; Sabet, A.H.N.; Maleki, S.; Miao, Y.; Musuvathi, M.; Mytkowicz, T.; Saarikivi, O. Breaking the computation and communication abstraction barrier in distributed machine learning workloads. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February–4 March 2022; pp. 402–416. [Google Scholar]
  37. Hu, H.; Jiang, C.; Zhong, Y.; Peng, Y.; Wu, C.; Zhu, Y.; Lin, H.; Guo, C. dPRO: A Generic Profiling and Optimization System for Expediting Distributed DNN Training. arXiv 2022, arXiv:2205.02473. [Google Scholar]
  38. Cheng, D.; Wang, Y.; Dai, D. Dynamic resource provisioning for iterative workloads on Apache Spark. IEEE Trans. Cloud Comput. 2021, 11, 639–652. [Google Scholar] [CrossRef]
  39. Kordelas, A.; Spyrou, T.; Voulgaris, S.; Megalooikonomou, V.; Deligiannis, N. KORDI: A Framework for Real-Time Performance and Cost Optimization of Apache Spark Streaming. In Proceedings of the 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Raleigh, NC, USA, 23–25 April 2023; pp. 337–339. [Google Scholar]
  40. Cheng, G.; Ying, S.; Wang, B. Tuning configuration of apache spark on public clouds by combining multi-objective optimization and performance prediction model. J. Syst. Softw. 2021, 180, 111028. [Google Scholar] [CrossRef]
  41. Geng, J.; Li, D.; Cheng, Y.; Wang, S.; Li, J. HiPS: Hierarchical parameter synchronization in large-scale distributed machine learning. In Proceedings of the 2018 Workshop on Network Meets AI & ML, Budapest, Hungary, 24 August 2018; pp. 1–7. [Google Scholar]
  42. Nascimento, J.P.B.; Capanema, D.O.; Pereira, A.C.M. Assessing and improving the performance and scalability of an iterative algorithm for Hadoop. In Proceedings of the 2017 Computing Conference, London, UK, 18–20 July 2017; pp. 1069–1076. [Google Scholar] [CrossRef]
  43. Sahith, C.S.K.; Muppidi, S.; Merugula, S. Apache Spark Big data Analysis, Performance Tuning, and Spark Application Optimization. In Proceedings of the 2023 International Conference on Evolutionary Algorithms and Soft Computing Techniques (EASCT), Bengaluru, India, 20–21 October 2023; pp. 1–8. [Google Scholar]
  44. Ousterhout, K. Architecting for Performance Clarity in Data Analytics Frameworks. Ph.D. Thesis, UC Berkeley, Berkeley, CA, USA, 2017. [Google Scholar]
  45. Dubey, R.; Gunasekaran, A.; Childe, S.J.; Blome, C.; Papadopoulos, T. Big data and predictive analytics and manufacturing performance: Integrating institutional theory, resource-based view and big data culture. Br. J. Manag. 2019, 30, 341–361. [Google Scholar] [CrossRef]
  46. Gupta, Y.K.; Kumari, S. Performance Evaluation of Distributed Machine Learning for Cardiovascular Disease Prediction in Spark. In Proceedings of the 2021 5th International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 3–5 June 2021; pp. 1506–1512. [Google Scholar]
  47. Assefi, M.; Behravesh, E.; Liu, G.; Tafti, A.P. Big data machine learning using apache spark MLlib. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 11–14 December 2017; pp. 3492–3498. [Google Scholar] [CrossRef]
  48. Atefinia, R.; Ahmadi, M. Performance evaluation of Apache Spark MLlib algorithms on an intrusion detection dataset. arXiv 2022, arXiv:2212.05269. [Google Scholar]
  49. Karras, A.; Karras, C.; Bompotas, A.; Bouras, P.; Theodorakopoulos, L.; Sioutas, S. SparkReact: A Novel and User-friendly Graphical Interface for the Apache Spark MLlib Library. In Proceedings of the 26th Pan-Hellenic Conference on Informatics, Athens, Greece, 25–27 November 2022; pp. 230–239. [Google Scholar]
  50. Qadri, A.M.; Raza, A.; Munir, K.; Almutairi, M.S. Effective Feature Engineering Technique for Heart Disease Prediction with Machine Learning. IEEE Access 2023, 11, 56214–56224. [Google Scholar] [CrossRef]
  51. Azeroual, O.; Nikiforova, A. Apache spark and mllib-based intrusion detection system or how the big data technologies can secure the data. Information 2022, 13, 58. [Google Scholar] [CrossRef]
  52. Esmaeilzadeh, A.; Heidari, M.; Abdolazimi, R.; Hajibabaee, P.; Malekzadeh, M. Efficient Large Scale NLP Feature Engineering with Apache Spark. In Proceedings of the 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 26–29 January 2022; pp. 0274–0280. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
