Electronics
  • Article
  • Open Access

9 May 2024

Adaptive Multi-Criteria Selection for Efficient Resource Allocation in Frugal Heterogeneous Hadoop Clusters

Department of Computer Science, Prince Sultan University, Riyadh 11586, Saudi Arabia

Abstract

Efficient resource allocation is crucial in clusters with frugal Single-Board Computers (SBCs) possessing limited computational resources. These clusters are increasingly being deployed in edge computing environments in resource-constrained settings where energy efficiency and cost-effectiveness are paramount. A major challenge in Hadoop scheduling is load balancing, as frugal nodes within the cluster can become overwhelmed, resulting in degraded performance and frequent out-of-memory errors, ultimately leading to job failures. In this study, we introduce Adaptive Multi-criteria Selection for Efficient Resource Allocation (AMS-ERA) in Frugal Heterogeneous Hadoop Clusters. Our selection criteria consider the CPU, memory, and disk requirements of jobs and align them with the resources available in the cluster for optimal resource allocation. To validate our approach, we deploy a heterogeneous SBC-based cluster consisting of 11 SBC nodes and conduct several experiments using the Hadoop wordcount and terasort benchmarks under various workload settings. The results are compared to the Hadoop-Fair, FOG, and IDaPS scheduling strategies. Our results demonstrate a significant improvement in performance with the proposed AMS-ERA, reducing execution time by 27.2%, 17.4%, and 7.6%, respectively, on the terasort and wordcount benchmarks.

1. Introduction

Frugal computing refers to the practice of designing, building, and deploying computing systems with a focus on cost-effectiveness, resource efficiency, and sustainability. The term “frugal” implies simplicity, economy, and minimalism, where the goal is to meet computing needs with the fewest resources, both in terms of hardware and energy [1]. Frugal clusters are an innovative solution at the intersection of sustainability and digital transformation [2]. By leveraging energy-efficient hardware components like Single-Board Computers (SBCs) [3], these clusters reduce energy consumption and minimize environmental impact [4], aligning with broader sustainability goals. Moreover, their cost-efficient nature makes them accessible to organizations with limited budgets, democratizing access to big data processing capabilities and fostering inclusivity in digital transformation initiatives [1]. Frugal clusters prioritize resource optimization through adaptive resource allocation and workload-aware scheduling, ensuring efficient resource utilization and maximizing performance.
Hadoop, an open-source framework, facilitates the distributed processing of large datasets across computer clusters using simple programming models. A key distinction of Hadoop is its integration of both storage and computation within the same framework. Unlike traditional methods, Hadoop allows for the flexible movement of computation, primarily MapReduce jobs, to the location of the data, managed by the Hadoop Distributed File System (HDFS). Consequently, efficient data placement within compute nodes becomes essential for effective big data processing [5]. Hadoop’s default approach to data locality relies heavily on the physical proximity of data to computation nodes, which does not always guarantee optimal performance: it overlooks other important factors such as network congestion, node availability, and load balancing, which can significantly impact data access latency and overall job execution time [6]. Additionally, Hadoop’s default data locality mechanism does not take into account the heterogeneity of cluster nodes, including variations in processing power, memory capacity, and disk I/O capabilities [7,8]. As a result, tasks may be assigned to nodes that are ill-suited for processing them efficiently, leading to resource contention and reduced performance. Furthermore, the default data locality mechanism may not dynamically adapt to changing cluster conditions or workload patterns, resulting in suboptimal resource utilization and wasted computational resources.
In recent times, researchers have addressed optimal resource allocation for scheduling in heterogeneous Hadoop clusters. Guo and Fox [9] introduced techniques like speculative execution to mitigate the impact of slow nodes, thereby optimizing resource utilization and job completion times. The study emphasizes the importance of efficient resource management and scheduling algorithms to improve overall performance in environments with varying computational capabilities. In [10], Bae et al. note that Hadoop performs poorly in heterogeneous environments because blocks are allocated equally across the nodes in the cluster. They proposed a new data placement scheme aimed at improving Hadoop’s data locality while minimizing replicated data by selecting and replicating only blocks with the highest likelihood of remote access. In [11], Bawankule et al. present a historical data-based data placement (HDBDP) policy to balance the workload among heterogeneous nodes. Their approach uses the nodes’ computing capabilities to improve the map tasks’ data locality and to reduce job turnaround time in a heterogeneous Hadoop environment. A Resource- and Network-aware Data Placement Algorithm (RENDA) for Hadoop is presented in [12]. RENDA reduces the duration of the data distribution and data processing stages by estimating the heterogeneous performance of the nodes in real time and carefully allocating data blocks to participating nodes in several installments.
The researchers in [13] developed a novel job scheduler, CLQLMRS, which uses reinforcement learning to improve data and cache locality in MapReduce job scheduling, highlighting the importance of reducing job execution time for enhancing Hadoop performance. In [14], the authors propose the DQ-DCWS algorithm to balance data locality and delays in Hadoop while considering five Quality of Service factors. DQ-DCWS uses dynamic programming to calculate edge lengths in the DAG and schedules tasks along the optimal path. In [15], Postoaca et al. presented a deadline-aware fog scheduler (FOG) for cloud-edge applications; the job queue is ordered based on deadlines, the nodes in the cluster are ordered using a similarity index, and the highest-ordered jobs are sorted and assigned to the appropriate clusters. The authors in [16] propose an Improved Data Placement Strategy (IDaPS) based on intra-dependency among data blocks to enhance performance and reduce data transfer overheads. IDaPS uses the Markov clustering algorithm to characterize MapReduce task execution based on intra-dependency and task execution frequency.
This paper addresses the challenge of efficient resource allocation in frugal Hadoop clusters. We propose Adaptive Multi-criteria Selection for Efficient Resource Allocation (AMS-ERA) in Frugal Heterogeneous Hadoop Clusters. Our selection criteria consider the CPU, memory, and disk requirements of jobs and align them with the resources available in the cluster for optimal resource allocation. Resources available in the cluster are profiled and ranked based on similarity and proximity using the K-means clustering method. A dynamic Analytical Hierarchy Process (dAHP) determines the optimal placement of a job using a score vector to identify the best possible node for that job. The process refines the AHP model’s accuracy by integrating historical information obtained through Hadoop APIs to assign weights to jobs based on their resource requirements. Finally, the jobs are assigned to the most appropriate nodes, ensuring load balancing in the heterogeneous cluster. These strategies aim to optimize data layout in Hadoop by maximizing parallelism while accommodating the resource constraints of frugal SBC nodes in the Hadoop cluster. To validate the proposed AMS-ERA, we deploy a heterogeneous SBC-based cluster consisting of 11 physical nodes and execute Hadoop benchmark tests to analyze the performance of the proposed technique against the Hadoop-Fair, FOG [15], and IDaPS [16] scheduling strategies. Our results demonstrate a significant improvement in performance with the proposed AMS-ERA, reducing execution time by 27.2%, 17.4%, and 7.6%, respectively, on the terasort and wordcount benchmarks. The contributions of this work are threefold:
  • We introduce the AMS-ERA approach to optimize resource allocation in frugal Hadoop clusters with Single-Board Computers (SBCs). By considering CPU, memory, and disk requirements for jobs, and aligning these with available resources, AMS-ERA enhances resource allocation to improve performance and efficiency.
  • The proposed method involves profiling available resources in the cluster using K-means clustering and dynamically placing jobs based on a refined Analytical Hierarchy Process (AHP). This dynamic placement ensures optimal resource utilization and load balancing in heterogeneous clusters.
  • We construct a heterogeneous 11-node Hadoop cluster using popular SBC devices to validate our approach. The work demonstrates that AMS-ERA achieves significant performance improvements compared to other scheduling strategies like Hadoop-Fair, FOG, and IDaPS using various IO-intensive and CPU-intensive Hadoop microbenchmarks such as terasort and wordcount.
AMS-ERA adapts to changing conditions, improving load balancing and data locality in a way that traditional Hadoop resource allocation strategies, which tend to rely heavily on physical proximity, often fail to achieve. By dynamically selecting the best-suited nodes for each job, AMS-ERA reduces execution time and avoids resource contention. This innovative approach directly addresses the challenges of frugal clusters, where energy efficiency and resource constraints are paramount.
The rest of the paper is organized as follows. Section 2 presents relevant work and background. Section 3 details the proposed strategies and algorithms. Section 4 presents the extensive performance evaluation of the SBC cluster followed by Section 5, concluding this work.

3. Adaptive Multi-Criteria Selection for Efficient Resource Allocation

3.1. Motivation

The native Hadoop framework lacks a built-in mechanism to distinguish the specific capacities of individual nodes, including CPU processing power, available physical memory, and storage capacity. Such characteristics are crucial in edge clusters composed of resource-frugal devices, as they significantly influence the performance of concurrently executing MapReduce tasks.
Consider a scenario in which a cluster consisting of N nodes must process M MapReduce tasks across D data blocks. According to the default configuration of Hadoop’s InputSplit, the number of MapReduce tasks corresponds to the number of data blocks, meaning each task operates on one data block per node. However, this default approach overlooks the current utilization of resources within the cluster, leading to suboptimal resource allocation. Typically, a node has multiple CPU cores available, and an optimal resource allocation strategy can leverage these resources to execute multiple MapReduce jobs simultaneously, improving parallelism and thereby the overall efficiency of the cluster.
Single-Board Computers (SBCs), exemplified by the Raspberry Pi computers, typically feature quad-core processors, with more advanced models boasting hexa- or octa-core processors. Leveraging these resources effectively for optimal resource allocation is crucial. Additionally, SBCs have limited onboard memory and disk capacity. In many instances, the default Hadoop input split may not allocate data blocks optimally on these SBC-based nodes, resulting in various out-of-memory errors [3,23]. Consequently, the MapReduce jobs fail and need to restart, which can be expensive. In Table 2, we present a matrix listing the various features of popular SBCs.
Table 2. A comparison of popular Single Board Computers.
Moreover, the positioning of data blocks on nodes where MapReduce tasks are executed is crucial for efficient processing, aiming to minimize latency in data transfers between different nodes within the cluster. Given the limited available resources on the frugal SBC-based Hadoop clusters, it is essential to develop optimal resource allocation strategies tailored to frugal Hadoop clusters, considering the unique resource constraints of SBCs. Table 3 lists the main symbol notations and their meanings used in this paper.
Table 3. Symbols used and their meanings.

3.2. Problem Definition

We define a few terms to quantify the proposed research. We assume that a set of k jobs $J = \{j_1, \dots, j_k\}$ is submitted to a heterogeneous Hadoop cluster consisting of x nodes $C = \{n_1, \dots, n_x\}$.
As each job may have unique CPU, memory, disk, and I/O requirements, we model a vector of these parameters for a job $j_i$:
$j_i = \langle id_i,\ cpu_{r_i},\ disk_{r_i},\ mem_{r_i} \rangle$ (1)
where $cpu_{r_i}$ is the CPU, $disk_{r_i}$ the disk, and $mem_{r_i}$ the memory requirement of the job $j_i$ with a unique identifier $id_i$.
To define the utilization $U = \{U(cpu), U(mem), U(disk)\}$ of the resources available in a node $n_i$ in cluster C at time t, we give
$U(cpu_i) = \frac{100 - \%\ \mathrm{idle\ time}}{100}$ (2)
where $U(cpu_i)$ is the CPU utilization of the ith node. The memory utilization $U(mem_i)$ of a node $n_i$ is given as
$U(mem_i) = \frac{mem_k}{mem_{total}}$ (3)
where $mem_k$ is the sum of the memory usage of all jobs running in node $n_i$ and $mem_{total}$ is the total memory available on the node. The disk utilization $U(disk_i)$ of the ith node is given as
$U(disk_i) = \frac{disk_{used}}{disk_{total}}$ (4)
where $disk_{used}$ is the used capacity and $disk_{total}$ is the total disk capacity of the ith node. The utilization values lie within the range [0, 1].
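To make the utilization model concrete, the following minimal Python sketch (not part of the original paper) computes a node's utilization vector from raw readings following Equations (2)–(4); the input values are hypothetical.

```python
# Illustrative sketch: per-node utilization vector U = {U(cpu), U(mem), U(disk)}
# following Equations (2)-(4). The readings below are hypothetical.

def node_utilization(cpu_idle_pct, mem_used_mb, mem_total_mb,
                     disk_used_gb, disk_total_gb):
    """Return (U_cpu, U_mem, U_disk), each in the range [0, 1]."""
    u_cpu = (100.0 - cpu_idle_pct) / 100.0    # Equation (2)
    u_mem = mem_used_mb / mem_total_mb        # Equation (3)
    u_disk = disk_used_gb / disk_total_gb     # Equation (4)
    return u_cpu, u_mem, u_disk

# Example: a 4 GB worker that is 35% idle, using 2.1 GB of RAM and 18 GB of a
# 64 GB SD card -> roughly (0.65, 0.51, 0.28).
print(node_utilization(35.0, 2100, 4096, 18, 64))
```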
In the Hadoop YARN cluster architecture, the NodeManagers (NMs) within cluster C regularly transmit status updates as heartbeat messages to the ResourceManager (RM). These messages convey crucial information regarding resource availability, including CPU utilization, memory usage, and disk I/O activity for the data node managed by the corresponding NM.
The Hadoop cluster’s execution traces can be obtained using the Starfish Hadoop log analysis tool [42], serving as crucial input for refining data placement decisions. Throughout the execution of each job within the cluster, essential details such as the Job ID and job timestamp are captured and stored as job status files. These execution traces are typically located in the configuration directory of the name node, and their location is recorded in the Hadoop name node’s job history folder.
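As an illustration of how such per-node information could be gathered programmatically, the sketch below polls the YARN ResourceManager REST API (the /ws/v1/cluster/nodes endpoint). This collection path is an assumption for illustration only, since the paper relies on NM heartbeats and Starfish traces, and the exact response field names may vary across Hadoop versions.

```python
# Hedged sketch: pull node-level resource figures from the YARN RM REST API.
# The endpoint is part of the standard RM web services; the field names below
# are assumptions and may differ between Hadoop releases.
import requests

RM_URL = "http://master:8088"   # hypothetical master-node address

def fetch_node_metrics():
    resp = requests.get(f"{RM_URL}/ws/v1/cluster/nodes", timeout=10).json()
    metrics = {}
    for node in resp.get("nodes", {}).get("node", []):
        if node.get("state") != "RUNNING":      # skip failed/decommissioned nodes
            continue
        used, avail = node["usedMemoryMB"], node["availMemoryMB"]
        metrics[node["nodeHostName"]] = {
            "mem_utilization": used / (used + avail) if (used + avail) else 0.0,
            "used_vcores": node["usedVirtualCores"],
            "avail_vcores": node["availableVirtualCores"],
        }
    return metrics

if __name__ == "__main__":
    print(fetch_node_metrics())
```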
The proposed AMS-ERA leverages K-means clustering to group nodes based on their similarity or proximity to each other. We use this technique to classify nodes in the cluster based on the similarity of their utilized resources (CPU, mem, disk), initializing a resource_list that is subsequently used by the dAHP. A schematic diagram of the various steps in the AMS-ERA process can be seen in Figure 1.
Figure 1. Workflow of the proposed AMS-ERA process for optimal job scheduling.

3.3. K-Means with Elbow Clustering

K-means clustering is a popular unsupervised machine learning algorithm used in various domains to analyze, optimize, and manage resources, among other applications [37]. K-means partitions a dataset into k clusters based on their features: it selects k centroids, and each datapoint is assigned to the closest centroid. The elbow method [43] is used to determine the optimal number of clusters $k_{optimal}$. It involves plotting the within-cluster sum of squares (WCSS) for different values of k and identifying the “elbow” point where increasing the number of clusters no longer significantly reduces the WCSS, indicating the optimal k value.
Algorithm 1 presents the proposed K-means clustering with elbow optimization. It starts by obtaining the RM listing for the n nodes. Next, we use Min–Max normalization [44] to rescale the numerical data obtained from the Hadoop RM to a fixed range. This normalization preserves the relative relationships between datapoints while ensuring that all features have the same scale. We define the dataset $D = \{x_1, x_2, \dots, x_n\}$, where each $x_i$ is a datapoint. We verify the status of all nodes to remove any node in a failed state. In the node identification phase, based on the acquired parameters ($U(cpu)$, $U(mem)$, $U(disk)$), the proposed approach organizes nodes into clusters characterized by similar performance attributes.
Next, we determine $k_{optimal}$, where k initially ranges from 1 to $k_{max}$. First, we select k initial centroids $\mu_1, \mu_2, \dots, \mu_k$ via random selection. Next, we assign each datapoint $x_i$ to its nearest centroid. We then define $C_j$ as the set of datapoints assigned to the jth cluster:
$C_j = \{ x_i : \lVert x_i - \mu_j \rVert^2 \le \lVert x_i - \mu_p \rVert^2 \ \text{for all}\ p = 1, 2, \dots, k \}$ (5)
where $\lVert x_i - \mu_j \rVert^2$ represents the squared Euclidean distance. Next, we recalculate each centroid $\mu_j$ as the mean of the datapoints in its cluster:
$\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x \in C_j} x$ (6)
We repeat the steps in Equations (5) and (6) until the centroids no longer change significantly or a predetermined number of iterations is reached. The WCSS for a clustering with k clusters is given as
$WCSS = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$ (7)
Next, we plot the WCSS against k and identify the “elbow” point, where the reduction in WCSS starts to plateau. This elbow point suggests the optimal number of clusters $k_{optimal}$ for the dataset D.
Once $k_{optimal}$ is determined, we use the K-means clustering algorithm to cluster the nodes based on resource utilization. We calculate the Euclidean distance of each datapoint $x_i$ to each centroid $\mu_p$ and assign the datapoint to the nearest centroid. After all datapoints have been assigned, we recalculate each centroid as the mean of all datapoints assigned to that cluster. The recalculation of centroids is repeated $k_{optimal}$ times. Once the node similarity clusters are established, our strategy orders the groups based on the three selection attributes CPU, mem, and disk, with higher-performing nodes belonging to higher-ranked clusters. The resulting data are written to the resource_list for further processing. The runtime of Algorithm 1 is $O(k \times x)$, where x is the number of servers/nodes in the cluster C and k is the number of clusters.
Algorithm 1: K-means clustering with elbow
1: Start: Obtain RM listing for n nodes
2:   apply Min–Max normalization to rescale the dataset
3:   initialize resource_list ← {id_i, U(cpu_i), U(mem_i), U(disk_i)}
4:   let D = {x_1, x_2, …, x_n} be the dataset, where each x_i is a datapoint
5:   determine k_optimal for K-means using Equations (5)–(7)
6:   foreach k in {1, 2, …, k_optimal}
7:     calculate the distance of each datapoint x_i to each centroid μ_p
8:     assign each datapoint x_i to the closest centroid
9:     recalculate each centroid as in Equation (6)
10:  end for
11:  return resource_list
12: end
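For illustration, a minimal Python sketch of the profiling step in Algorithm 1 is given below. It assumes the resource_list rows are (U(cpu), U(mem), U(disk)) readings per worker, uses scikit-learn's KMeans, and picks the elbow with a simple second-difference heuristic instead of visual inspection, so it is a sketch of the idea rather than the paper's implementation.

```python
# Sketch of Algorithm 1: Min-Max normalization, WCSS-based elbow search, and
# final K-means grouping of worker nodes by resource utilization.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

utilization = np.array([          # hypothetical (U_cpu, U_mem, U_disk) per worker
    [0.65, 0.51, 0.28], [0.80, 0.72, 0.55], [0.35, 0.40, 0.20],
    [0.90, 0.85, 0.70], [0.35, 0.45, 0.25], [0.20, 0.30, 0.15],
    [0.85, 0.80, 0.60], [0.95, 0.90, 0.75], [0.25, 0.35, 0.18],
    [0.40, 0.42, 0.22],
])

X = MinMaxScaler().fit_transform(utilization)     # Min-Max normalization

k_max = 6
wcss = []
for k in range(1, k_max + 1):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)                      # WCSS, Equation (7)

# Elbow heuristic: the k where the drop in WCSS flattens the most.
k_optimal = int(np.argmax(np.diff(wcss, 2))) + 2

labels = KMeans(n_clusters=k_optimal, n_init=10, random_state=0).fit_predict(X)
print("k_optimal =", k_optimal, "cluster labels =", labels)
```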

3.4. Dynamic AHP-Based Job Scoring

In this section, we detail the dynamic AHP-based scoring mechanism for optimal resource allocation to jobs. We develop an algorithm based on the AHP [45], where the goal is to find the optimal placement of a job using a score vector to determine the best possible node. The process involves refining the AHP model’s accuracy by integrating historical information obtained through Hadoop APIs to assign weights to jobs based on their resource requirements, including CPU, memory, and disk.
We define criteria considering the CPU, memory, and disk requirements of a job. We also define alternate criteria for selecting the best possible node in the frugal cluster. The criteria are pairwise compared based on the importance of the criteria. The alternatives are compared against each of the criteria. Figure 2 shows the selection framework of the dAHP for an example of n heterogeneous nodes, where n = 6 .
Figure 2. dAHP 3-level criteria for node selection.
To select the optimal node for job allocation, we look at the job requirements, assuming that a job requires a large amount of processing power and memory to complete, whereas the storage requirement is not equally important. Based on these requirements, we develop the pairwise comparison matrix Cji for this job. The criteria are prioritized based on their importance. We assume that the CPU and memory requirements are equally important for a job and that they are moderately more important than the disk requirement. Based on these criteria, Cji is given in Table 4.
Table 4. Pairwise decision criteria matrix Cji.
Next, the alternate criteria are considered based on the node utilization requirements. For instance, if a job requires a faster node, it should be assigned $n_2$. If it requires more memory, it can be assigned $n_6$. Similarly, if more disk space is required, it can be assigned $n_6$ based on the node capabilities. The weights are determined by the magnitude of the difference in node properties. For instance, if $n_2$ and $n_5$ have faster or more cores compared to $n_1$, they are assigned weight 4. This yields three matrices presenting pairwise comparisons of the CPU, MEM, and DISK requirements, given in Table 5, Table 6 and Table 7, respectively.
Table 5. Pairwise CPU alternate criteria decision matrix CPU.
Table 6. Pairwise mem alternate criteria decision matrix MEM.
Table 7. Pairwise disk alternate criteria decision matrix DISK.
To ensure consistency and accuracy, the pairwise comparison matrices are normalized to determine the Consistency Index (CI). A matrix is considered to be consistent if the transitivity rule is valid for all pairwise comparisons [39]. The CI is determined via
$CI = \frac{\lambda_{max} - n}{n - 1}$ (8)
where $\lambda_{max}$ is the maximal eigenvalue, obtained via the summation of products between each element of the eigenvector and the sum of the columns of the matrix, and n is the number of nodes. In this case, since the size of each matrix is 6 × 6 with n = 6, the CI values for the CPU, MEM, and DISK matrices are 0.1477, 0.0178, and 0.0022, respectively. Using the Random Index (RI) for n = 6, the Consistency Ratio (CR) for each of these matrices is 0.0119, 0.0143, and 0.0017, respectively. For reliable results, the CR values must be less than 0.1, ensuring that the matrices are consistent. The score vector $Score_i$ is determined for the job $j_i$ using the M matrix, which consolidates the CPU, MEM, and DISK matrices, as given in Equation (9).
$Score_i = \max_{j=1}^{n} \left( weight_i \cdot m_{ij} \right)$ (9)
where $weight_i$ is the weight of CPU, mem, or disk obtained from the corresponding matrix for the defined criteria and $m_{ij}$ is the normalized score for each value in the matrix. The score vector $Score_i$ computed in Equation (9) is given in Table 8. In this particular case, $n_6$ has the highest score (0.280), indicating that it is the most suitable node for the given requirements of job $j_i$.
Table 8. The score vector determined from the M matrix for every alternative.
Similar to this example, each job’s score is determined using these criteria. After computing the scores for all jobs, the system selects the job with the highest score, indicating the greatest resource demand. The resulting job priority list, sorted in descending order of score, is forwarded to the RM for resource allocation.
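The following Python sketch illustrates the dAHP scoring step under stated assumptions: the pairwise criteria matrix and the per-node scores are hypothetical, the weights are taken from the principal eigenvector, and consistency is checked with Equation (8) using the standard Saaty Random Index values.

```python
# Hedged sketch of dAHP job scoring: criteria weights, consistency check, and
# consolidated node scores. All matrix values below are illustrative.
import numpy as np

RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24}   # Saaty RI (assumed)

def ahp_weights(pairwise):
    """Principal-eigenvector weights and consistency ratio of a pairwise matrix."""
    vals, vecs = np.linalg.eig(pairwise)
    idx = np.argmax(vals.real)
    w = np.abs(vecs[:, idx].real)
    w /= w.sum()
    n = pairwise.shape[0]
    ci = (vals.real[idx] - n) / (n - 1)            # Equation (8)
    cr = ci / RI[n] if RI[n] > 0 else 0.0
    return w, cr

# Criteria matrix: CPU and memory equally important, both moderately more
# important than disk (in the spirit of Table 4; values illustrative).
criteria = np.array([[1, 1, 3],
                     [1, 1, 3],
                     [1/3, 1/3, 1]], dtype=float)
crit_w, crit_cr = ahp_weights(criteria)
assert crit_cr < 0.1, "inconsistent judgements, re-elicit the matrix"

# Normalized per-criterion node scores (rows = nodes n1..n6), e.g. derived from
# pairwise alternative matrices such as Tables 5-7; values are hypothetical.
node_scores = np.array([
    # cpu   mem   disk
    [0.10, 0.12, 0.15],
    [0.30, 0.15, 0.10],
    [0.15, 0.18, 0.20],
    [0.10, 0.10, 0.10],
    [0.20, 0.15, 0.15],
    [0.15, 0.30, 0.30],
])

score = node_scores @ crit_w                       # consolidated M-matrix scores
best = int(np.argmax(score)) + 1
print("node scores:", np.round(score, 3), f"-> assign the job to n{best}")
```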

3.5. Efficient Resource Allocation

The resource allocation takes place in the RM once the job priority listing is available. By integrating the job listing information obtained from the previous phase, the RM ensures an optimal match between job demand and available resources. The jobs are arranged in descending order of resource requirements, and the most demanding jobs are prioritized to utilize the most powerful nodes with the maximum available resources. This load-balancing strategy ensures that less resource-intensive jobs do not hinder the utilization of the high-resource nodes.
Algorithm 2 presents the AMS-ERA resource allocation process. The RM maintains a resource_list of the current resource utilization in the cluster for each node $n_i$, {$id_i$, $U(cpu_i)$, $U(mem_i)$, $U(disk_i)$}, as determined in Equations (2)–(4). To assign a job to a node, considering the job’s score vector {cpu, mem, disk} obtained in Table 8, the most under-utilized node is sought. Once a job is assigned to a node, the utilization values in the resource_list for the corresponding node are updated. This ensures a well-balanced strategy that maximizes the utilization of resources across the cluster while considering the capabilities of the heterogeneous Hadoop cluster nodes. This enables our system to prevent resource-intensive jobs from being allocated to lower-performing nodes in the cluster. Once the mapping is complete, the jobs are sent for execution in newly allocated containers by YARN.
To derive the runtime of the algorithm, we look at the three computation-intensive operations. Step 5 requires the computation of pairwise decision matrices for each job, as detailed in Table 5, Table 6 and Table 7. Assuming that there are m jobs in a cluster of n nodes, the cost of forming the pairwise comparison matrices for the three selection criteria CPU, mem, and disk is $\frac{n(n-1)}{2}$, giving a complexity of $O(n^2)$. In steps 6 and 7, we evaluate Equation (8) and the consistency ratio for the normalization and consistency check; these require a runtime of $O(n)$. Steps 10 and 11 compute the M matrix and the score vector; the total number of pairwise comparisons required is $O(m \times n^2)$, where each matrix is of size $n \times n$. Finally, the best values are written to the resource_list. Overall, the complexity of both algorithms is $O(m \times n^2)$.
Algorithm 2: AMS-ERA resource allocation
1: Start: Obtain RM listing for n nodes
2:   obtain resource_list
3:   obtain RM listing for m jobs; initialize job_priority_listing
4:   foreach m ∈ jobs
5:     determine pairwise decision matrices CPU, MEM, DISK for m
6:     determine consistency CI = (λ_max − n) / (n − 1)
7:     if CR = CI / RI < 0.1 then continue
8:     else re-compute
9:     end if
10:    determine M matrix
11:    compute Score_m → job_priority_listing
12:  end for
13:  foreach i ∈ job_priority_listing
14:    assign best (j_i, cpu, mem, disk) ← resource_list
15:    update resource_list
16:  end for
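A simplified Python sketch of the allocation loop in Algorithm 2 is given below; it is not the YARN-level implementation, and the job demands and node utilizations are hypothetical fractions used only to show how the resource_list is updated as jobs are assigned.

```python
# Sketch of the AMS-ERA allocation loop: jobs sorted by descending dAHP score
# are mapped to the currently least-loaded node; resource_list is updated after
# each placement. Values are hypothetical.

resource_list = {                 # node -> current (cpu, mem, disk) utilization
    "w1": [0.70, 0.65, 0.50], "w2": [0.30, 0.25, 0.20],
    "w3": [0.45, 0.40, 0.35], "w4": [0.20, 0.30, 0.25],
}

job_priority_listing = [          # (job_id, score, (cpu, mem, disk) demand)
    ("j3", 0.280, (0.30, 0.25, 0.10)),
    ("j1", 0.210, (0.20, 0.10, 0.05)),
    ("j2", 0.150, (0.10, 0.10, 0.10)),
]

def least_loaded_node(demand):
    """Pick the node with the lowest projected total utilization after placement."""
    def projected(item):
        _, util = item
        return sum(u + d for u, d in zip(util, demand))
    return min(resource_list.items(), key=projected)[0]

for job_id, score, demand in sorted(job_priority_listing, key=lambda j: -j[1]):
    node = least_loaded_node(demand)
    resource_list[node] = [u + d for u, d in zip(resource_list[node], demand)]
    print(f"{job_id} (score {score}) -> {node}, utilization now {resource_list[node]}")
```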

4. Experimental Evaluation

In this section, we present the experimental setup and conduct various experiments to compare and analyze the performance of the proposed AMS-ERA against Hadoop-Fair, FOG, and IDaPS Schedulers.

4.1. Experiment Setup

For experimentation, we construct an SBC-based heterogeneous Hadoop cluster with 11 SBC nodes configured in two racks of 5 SBCs each, connected via Gigabit Ethernet. Ten of these SBCs run as Hadoop worker nodes, whereas one serves as the master node. As the master node runs the RM, which requires a large amount of memory, a Raspberry Pi 5 device is dedicated to this role. On each device, we install a compatible Linux distribution: Armbian 23.1 Jammy (Gnome) on the RockPro64, Debian Bullseye 11 on the Odroid XU4, and Raspberry Pi OS Lite 11 on all Raspberry Pi devices. Each device runs Java 8 (ARM64) and Hadoop 3.3.6. Each SBC is equipped with a bootable SD Card; to better observe the placement of jobs with different disk requirements, we varied the SD Card capacity across the SBCs. A 4 GB swap space was reserved on each SD Card, which is essential for virtual memory management on SBCs with low RAM.
To simulate a small cluster, we created two racks, each consisting of five SBC nodes. Each rack has a Gigabit Ethernet switch connecting its SBCs to a router, and the master node running on the RPi 5 connects to the same router. A schematic diagram of the experimental setup is shown in Figure 3, and Table 9 lists the configuration of the worker nodes in the cluster. We used Hadoop YARN 3.3.6 to run our experiments. The HDFS block size was set to 128 MB, block replication was set to 2, and the InputSplit size was set to 128 MB. To avoid out-of-memory errors during Hadoop runs, we modified the mapred-site.xml and yarn-site.xml files; the details are provided in Table 10.
Figure 3. Cluster configuration with 10 worker nodes placed in two racks with a master node.
Table 9. Worker node configuration in the Hadoop cluster.
Table 10. Hadoop YARN configuration properties.
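Since Table 10 is not reproduced here, the snippet below lists, as a Python dictionary, the kind of memory-related YARN and MapReduce properties that are typically tuned in yarn-site.xml and mapred-site.xml for low-RAM SBC workers. The property names are standard Hadoop settings, but the values are illustrative assumptions rather than the exact configuration used in these experiments.

```python
# Illustrative Hadoop/YARN properties for low-memory SBC workers (values assumed).
example_properties = {
    # yarn-site.xml
    "yarn.nodemanager.resource.memory-mb": 3072,    # RAM a 4 GB worker offers to YARN
    "yarn.scheduler.minimum-allocation-mb": 256,
    "yarn.scheduler.maximum-allocation-mb": 3072,
    "yarn.nodemanager.vmem-check-enabled": False,   # avoid spurious container kills
    # mapred-site.xml
    "mapreduce.map.memory.mb": 768,
    "mapreduce.reduce.memory.mb": 1024,
    "mapreduce.map.java.opts": "-Xmx614m",          # roughly 80% of the map container
    # hdfs-site.xml
    "dfs.blocksize": "128m",
    "dfs.replication": 2,
}
```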

4.2. Generating Job Workloads for Validation

Taking inspiration from previous benchmark studies [10,12,16,18,27,29,30,38], we select wordcount and terasort workloads for the evaluation of AMS-ERA.
  • The Hadoop wordcount benchmark is a CPU-intensive task because it involves processing large volumes of text data to count the occurrences of each word. This process requires significant computational resources, particularly for tasks like tokenization, sorting, and aggregation, which are essential steps in the word-counting process. As a result, the benchmark primarily stresses the CPU’s processing capabilities rather than other system resources such as memory or disk I/O. These 10 jobs are posted to the cluster simultaneously.
  • The Hadoop terasort benchmark is an IO-intensive task because it involves sorting a large volume of data. This process requires substantial input/output (IO) operations as it reads and writes data to and from storage extensively during the sorting process. The benchmark stresses the system’s IO subsystem, including disk read and write speeds, as well as network bandwidth if the data are distributed across multiple nodes in a cluster.
In order to observe the effectiveness of the proposed AMS-ERA scheduling in clustering the jobs based on the CPU, mem, and disk criteria, we generate five job workloads {$l_1$, $l_2$, $l_3$, $l_4$, $l_5$}, each with varying resource requirements, resulting in highly heterogeneous container sizes for the map and reduce tasks across different jobs. Each workload is given a different dataset, with sizes of 2, 4, 8, 15.1, and 19.5 GB, respectively. The datasets are generated from text files available at Project Gutenberg. These dataset sizes represent a range of small to large workloads, allowing us to evaluate the scheduling algorithm’s performance across different job scenarios. By including a range of dataset sizes, we can determine how the proposed AMS-ERA scheduling algorithm handles different resource requirements.
The default InputSplit size of 128 MB is used to distribute the datafiles across the HDFS, with a replication factor of 2. Based on the dataset size and the InputSplit size, the numbers of map and reduce tasks, <map, reduce>, are <16, 1>, <32, 2>, <64, 4>, <128, 4>, and <160, 8>, respectively.
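As a quick sanity check (not taken from the paper), the number of map tasks can be approximated as the dataset size divided by the 128 MB InputSplit size; the short snippet below reproduces the counts for the three smaller workloads exactly, while the two largest workloads were configured with slightly larger values (128 and 160).

```python
# Approximate map-task counts: roughly ceil(dataset_size / InputSplit size).
import math

split_mb = 128
for size_gb in [2, 4, 8, 15.1, 19.5]:
    maps = math.ceil(size_gb * 1024 / split_mb)
    print(f"{size_gb:>5} GB -> ~{maps} map tasks")
# Output: ~16, ~32, ~64, ~121, ~156
```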
We execute wordcount and terasort on these workloads with these parameters and observe job placement, resource utilization, and the overall job execution time in the cluster. To ensure the reliability and robustness of our experimental study, we conducted multiple experimental runs for each benchmark and workload. Specifically, for each of the workloads ($l_1$ through $l_5$), we performed at least three experimental repetitions to gather consistent data. This repetition allowed us to account for any variability in cluster performance and ensure that our conclusions were statistically valid. Each experiment was run under the same conditions to maintain consistency, providing a strong basis for comparison across different configurations.

4.3. Node Clustering Based on Intra-Node Similarity Metrics

The AMS-ERA profiles the nodes available in the cluster based on available resources. We visually determined the elbow point from three experimental runs of workload $l_3$ using wordcount and terasort. Workload $l_3$, with a dataset size of 8 GB, served as a suitable test case for determining the optimal value of k: it is large enough to offer significant insight into node resource clustering while not being so large as to skew results due to extreme data processing demands. Moreover, by establishing $k_{optimal}$ for this workload, the same methodology can be applied to smaller workloads ($l_1$ and $l_2$) or larger workloads ($l_4$ and $l_5$), ensuring that the clustering approach scales with the size and complexity of the data being processed. Based on these experiments, we determine $k_{optimal} = 3$.
Figure 4a shows the result of AMS-ERA node grouping based on available resources during the execution of the wordcount benchmark with workload $l_3$. A high-performance group of nodes is highlighted in green, consisting of nodes w6 and w9; a medium-performance group is shown in yellow, comprising w2, w3, w4, w5, w7, and w10; and the low-performance nodes are indicated in orange, namely w1 and w8.
Figure 4. Worker node profiling based on CPU, mem, and disk resource utilization for workload $l_3$. Intra-node similarity reveals the nodes clustered into high-, medium-, and low-performance groups for wordcount jobs. The size of each bubble reflects the percentage of disk utilization. (a) Wordcount and (b) terasort.
Similarly, Figure 4b shows the results for terasort jobs run with workload $l_3$; comparatively, terasort requires large disk IO. The proposed clustering algorithm successfully groups the nodes based on the utilization of CPU, mem, and disk resources for terasort jobs with different requirements. Worker nodes w1, w4, and w7 are fitted with relatively small 32 GB SD Cards with slower disk IO read/write speeds; as terasort progresses, more of their onboard storage is consumed. Nodes w6, w9, and w10 exhibit similar disk IO owing to the use of faster hardware. The effect of larger disk IO can be seen in Figure 4b, where a larger bubble area indicates increased disk IO.

4.4. Workload Execution Time

The Hadoop Fair Scheduler (Fair), the default scheduler in Hadoop, lacks a data locality implementation. It allocates resources to jobs so that each job receives an equal share of resources over time, organizing jobs into pools and distributing resources equitably among these pools.
The FOG-scheduler presented in [15] considers ordering the scheduling queue based on deadlines. The nodes in the cluster are ordered using a similarity index. The highest-ordered jobs are sorted and assigned to the appropriate clusters.
The IDaPS presented in [16] uses the Markov clustering algorithm to characterize MapReduce task execution based on intra-dependency and task execution frequency. The scheduling algorithm orders the tasks based on execution frequency to achieve maximum parallelism. In this section, we compare these recent works against the proposed AMS-ERA for various workloads of wordcount and terasort.
Figure 5a shows a comparison of the execution times of the five wordcount workloads using the Hadoop default Fair scheduler, FOG, IDaPS, and the proposed AMS-ERA. For the smaller workloads $l_1$ and $l_2$, a total of 16 and 32 map tasks are created; with these workloads, the execution runtimes of the proposed AMS-ERA are 27.2%, 17.4%, and 7.6% and 24.5%, 14.1%, and 8.1% faster than Fair, FOG, and IDaPS, respectively.
Figure 5. (a) A comparison of the execution times in seconds for wordcount jobs with workloads {$l_1$, $l_2$, $l_3$, $l_4$, $l_5$} between Hadoop-Fair, FOG, IDaPS, and AMS-ERA. (b) The execution times for terasort with the same workloads.
For workloads $l_3$ and $l_4$, a total of 64 and 128 map tasks were created; the execution times for AMS-ERA were 16.2%, 12.7%, and 6.8% and 14%, 8%, and 2% faster. For the larger workload $l_5$, AMS-ERA was 11.5%, 4.5%, and 0.2% faster. For larger workloads, both AMS-ERA and IDaPS exhibit similar performance.
We note that as the workload increases, the comparative runtime advantage of AMS-ERA over the compared schedulers decreases. We assert that this is due to the large number of disk IO reads and writes required by the wordcount algorithm: as the frugal cluster uses SD Cards with slower read/write speeds, the runtime is inevitably limited by the available hardware speeds.
Our observation is confirmed when we compare the wordcount execution runtimes with the terasort execution runtimes. As mentioned earlier, terasort requires far fewer IO-intensive read/write operations to the disk; therefore, the expected runtime would be lower. For the smaller workloads $l_1$ and $l_2$, the terasort execution runtimes of the proposed AMS-ERA are 38.4%, 25.9%, and 20.5% and 34.9%, 20.7%, and 17.8% faster than Fair, FOG, and IDaPS, respectively.
For workloads $l_3$ and $l_4$, with a total of 64 and 128 map tasks, the execution times for AMS-ERA were 31%, 18%, and 13% and 26%, 14.4%, and 11% faster. For the larger workload $l_5$, AMS-ERA was 18.7%, 12.1%, and 7.8% faster.
Figure 5b shows the comparison of the execution times of the five terasort workloads. As terasort is a comparatively less disk IO-intensive application, AMS-ERA compares well with Fair, FOG, and IDaPS across all ranges of workloads.
From these results, we observe that in the worst-case scenario for workload $l_5$, where the entire dataset (approx. 20 GB) is required for execution, both AMS-ERA and IDaPS exhibit similar performance. For smaller datasets and workloads, the proposed AMS-ERA performs significantly better. Figure 6a,b show the comparison of the AMS-ERA performance percentage against the compared schedulers for the various workloads using the wordcount and terasort benchmarks.
Figure 6. (a) A comparison of the performance percentage of AMS-ERA execution times against Hadoop-Fair, FOG, and IDaPS for wordcount jobs with workloads {$l_1$, $l_2$, $l_3$, $l_4$, $l_5$}. (b) Performance percentage of AMS-ERA against Hadoop-Fair, FOG, and IDaPS for terasort.

4.5. Local Job Placement and Resource Utilization

The default Hadoop Fair scheduling scheme distributes data without considering the computing capacity of nodes or network delay, resulting in poor performance. This lack of optimization leads to a higher percentage of non-local task executions and data transfer overhead compared to alternative schemes. Moreover, it overlooks the heterogeneity of the available nodes in the cluster. Consequently, the failure to account for these differences results in the suboptimal placement of map tasks in the cluster, thereby leading to poor performance.
The proposed AMS-ERA assigns resource-intensive jobs to high-performance nodes within each group, sorting nodes in descending order based on their capacity range, including CPU, mem, and disk. Additionally, with higher cluster utilization, more jobs complete their execution quickly, enabling YARN to release resources sooner. Less demanding jobs are allocated to nodes that best match their resource requirements. Consequently, the system minimizes resource wastage and improves load balancing both between and within groups of heterogeneous nodes.
Figure 7a shows the percentage of locally assigned map tasks. All three schedulers outperform the Hadoop Fair scheduler. For the wordcount workloads, as the number of map tasks increases, the resource utilization of the proposed AMS-ERA also improves. For the smaller workload $l_1$, AMS-ERA leads the comparison with 52% local placement, compared to 20% for IDaPS and 14% for FOG. With the larger workload $l_5$, the locality improves up to 79% for AMS-ERA compared to 76% for IDaPS and 61% for FOG. We observe a similar task placement rate (percentage) for the terasort workloads, as can be seen in Figure 7b. This shows that AMS-ERA optimizes task locality based on resource availability in the cluster.
Figure 7. A comparison of the local task allocation rate (percentage) for AMS-ERA, Hadoop-Fair, FOG, and IDaPS for (a) wordcount jobs with workloads {$l_1$, $l_2$, $l_3$, $l_4$, $l_5$} and (b) terasort jobs.
Figure 8a shows a comparison of the percentage of resource utilization for the wordcount workload $l_3$. AMS-ERA utilizes the most resources, using them effectively to complete the workload at the earliest time. This shows that AMS-ERA job placement in the cluster effectively outperforms Hadoop-Fair, FOG, and IDaPS. Figure 8b shows similar results for a terasort workload. As terasort is not disk-intensive, AMS-ERA has the highest average CPU and memory utilization, while its disk utilization is slightly lower. Given these results, we can assume that AMS-ERA successfully considered the availability of resources in the cluster when placing the jobs. As a consequence of the lower disk utilization, the map tasks for terasort were placed on high-performing nodes such as w6 and w9, which resulted in faster execution times.
Figure 8. A comparison of the percentage of CPU, mem, and disk resource utilization for (a) a wordcount workload $l_3$ and (b) a terasort workload.

4.6. Cost of Frugal Hadoop Cluster Setup

Building a Hadoop cluster with a diverse range of SBC models, each offering different CPU, memory, and storage resources, allowed us to diversify the hardware. This approach facilitates cost optimization by selecting models based on their price–performance ratio and the specific demands of the workload. The cost of our cluster setup was USD 822 for the 11 SBC devices, the networking essentials (cables, 2× Gigabit switches, a router), and the SD Card storage media.
During our experimental investigations, we observed a notable performance gap between the previous-generation RPi 3B nodes and traditional PC setups, with the former exhibiting suboptimal performance levels. However, with the introduction of AMS-ERA, which accounts for the heterogeneous nature of resources within the cluster, we observed significant improvements in execution times. Looking forward, we anticipate even greater performance enhancements with the latest RPi 5 nodes, which boast improved onboard resources compared to their predecessors.
This evolution in hardware capabilities underscores the potential for frugal SBC-based edge devices to not only enhance performance but also contribute to sustainability and cost-effectiveness in data processing applications. With the anticipated decrease in the cost of RPi 5 devices and their promising performance metrics, they present a compelling option for achieving both sustainability and cost-effectiveness in edge computing environments.

5. Conclusions and Future Work

In this work, we proposed Adaptive Multi-criteria Selection for Efficient Resource Allocation (AMS-ERA) in Frugal Heterogeneous Hadoop Clusters, addressing the critical challenge of resource allocation in clusters built from frugal Single-Board Computers (SBCs). By considering the CPU, memory, and disk requirements of jobs and aligning them with the resources available in the cluster, AMS-ERA optimizes resource allocation for optimal performance. Through K-means clustering, the available resources are profiled and ranked based on similarity and proximity, enabling dynamic job placement. A dAHP refines the selection process by integrating historical data through Hadoop APIs. Jobs are then assigned to the most suitable nodes, ensuring load balancing in the heterogeneous cluster. Compared to the Hadoop-Fair, FOG, and IDaPS scheduling strategies, AMS-ERA demonstrates superior performance, reducing execution time by 27.2%, 17.4%, and 7.6%, respectively, in the terasort and wordcount benchmarks. The results show that AMS-ERA is robust and performs consistently well across diverse MapReduce-based applications with various workload sizes. Furthermore, the results demonstrate that AMS-ERA ensures reduced execution time and improved data locality compared to Hadoop-Fair, FOG, and IDaPS. This study underscores the effectiveness of AMS-ERA in optimizing data layout, maximizing parallelism, and accommodating resource constraints in frugal SBC-based Hadoop clusters, paving the way for enhanced big data processing performance in resource-constrained environments.
AMS-ERA introduces a dynamic and adaptive approach to resource allocation, which could revolutionize how operational tasks are managed in Hadoop clusters. By profiling and ranking available resources and then aligning them with the job requirements, operational practices would become more efficient and responsive to workload demands. The capability to profile resources using K-means clustering and assign jobs based on a dAHP provides a flexible mechanism for job scheduling. This flexibility can lead to a more balanced workload, reducing bottlenecks, and potentially allowing operations teams to focus on other critical aspects of cluster management.
Since AMS-ERA is designed for frugal clusters with SBCs, its adaptive resource allocation mechanism could significantly impact edge computing. It could allow edge devices to participate in larger Hadoop clusters more effectively, opening new possibilities for data processing closer to the data source. The AMS-ERA framework could facilitate the deployment of Hadoop clusters in more constrained environments, like IoT applications or remote sites with limited infrastructure. By optimizing resource allocation and reducing execution time, AMS-ERA can indirectly lead to reduced energy consumption and operational costs. This is particularly relevant in SBC-based clusters, where energy efficiency is crucial.
At the moment, AMS-ERA is limited to CPU, memory, and disk utilization when considering job placement in the cluster. This scope of criteria might not cover all aspects of resource allocation efficiency, such as network bandwidth or I/O throughput. While the current version of AMS-ERA does not explicitly incorporate these factors, they are indirectly addressed through load balancing and job placement. Although AMS-ERA uses a dAHP to adapt to changes, there could be limitations in handling extreme fluctuations or sudden spikes in demand. This may lead to suboptimal resource utilization or load balancing in some scenarios. In the future, we intend to extend it to consider network localization, network capacities, and node/rack-based job placement. Furthermore, we intend to test AMS-ERA integrating real-time workflow datasets, enabling more robust and efficient performance evaluations. This could enhance its application in environments where real-time data processing is critical, such as stream processing or online analytics.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The author would like to acknowledge the support of Prince Sultan University for the payment of the article processing charges.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Awaysheh, F.M.; Tommasini, R.; Awad, A. Big Data Analytics from the Rich Cloud to the Frugal Edge. In Proceedings of the 2023 IEEE International Conference on Edge Computing and Communications (EDGE), Chicago, IL, USA, 2–8 July 2023; pp. 319–329. [Google Scholar]
  2. Qin, W. How to Unleash Frugal Innovation through Internet of Things and Artificial Intelligence: Moderating Role of Entrepreneurial Knowledge and Future Challenges. Technol. Forecast. Soc. Chang. 2024, 202, 123286. [Google Scholar] [CrossRef]
  3. Neto, A.J.A.; Neto, J.A.C.; Moreno, E.D. The Development of a Low-Cost Big Data Cluster Using Apache Hadoop and Raspberry Pi. A Complete Guide. Comput. Electr. Eng. 2022, 104, 108403. [Google Scholar] [CrossRef]
  4. Vanderbauwhede, W. Frugal Computing—On the Need for Low-Carbon and Sustainable Computing and the Path towards Zero-Carbon Computing. arXiv 2023, arXiv:2303.06642. [Google Scholar]
  5. Chandramouli, H.; Shwetha, K.S. Integrated Data, Task and Resource Management to Speed Up Processing Small Files in Hadoop Cluster. Int. J. Intell. Eng. Syst. 2024, 17, 572–584. [Google Scholar] [CrossRef]
  6. Han, T.; Yu, W. A Review of Hadoop Resource Scheduling Research. In Proceedings of the 2023 8th International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), Okinawa, Japan, 23–25 November 2023; pp. 26–30. [Google Scholar]
  7. Jeyaraj, R.; Paul, A. Optimizing MapReduce Task Scheduling on Virtualized Heterogeneous Environments Using Ant Colony Optimization. IEEE Access 2022, 10, 55842–55855. [Google Scholar] [CrossRef]
  8. Saba, T.; Rehman, A.; Haseeb, K.; Alam, T.; Jeon, G. Cloud-Edge Load Balancing Distributed Protocol for IoE Services Using Swarm Intelligence. Clust. Comput. 2023, 26, 2921–2931. [Google Scholar] [CrossRef]
  9. Guo, Z.; Fox, G. Improving MapReduce Performance in Heterogeneous Network Environments and Resource Utilization. In Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012), Ottawa, ON, Canada, 13–16 May 2012; pp. 714–716. [Google Scholar]
  10. Bae, M.; Yeo, S.; Park, G.; Oh, S. Novel Data-placement Scheme for Improving the Data Locality of Hadoop in Heterogeneous Environments. Concurr. Comput. 2021, 33, e5752. [Google Scholar] [CrossRef]
  11. Bawankule, K.L.; Dewang, R.K.; Singh, A.K. Historical Data Based Approach for Straggler Avoidance in a Heterogeneous Hadoop Cluster. J. Ambient Intell. Humaniz. Comput. 2021, 12, 9573–9589. [Google Scholar] [CrossRef]
  12. Thakkar, H.K.; Sahoo, P.K.; Veeravalli, B. RENDA: Resource and Network Aware Data Placement Algorithm for Periodic Workloads in Cloud. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 2906–2920. [Google Scholar] [CrossRef]
  13. Ghazali, R.; Adabi, S.; Rezaee, A.; Down, D.G.; Movaghar, A. CLQLMRS: Improving Cache Locality in MapReduce Job Scheduling Using Q-Learning. J. Cloud Comput. 2022, 11, 45. [Google Scholar] [CrossRef]
  14. Ding, F.; Ma, M. Data Locality-Aware and QoS-Aware Dynamic Cloud Workflow Scheduling in Hadoop for Heterogeneous Environment. Int. J. Web Grid Serv. 2023, 19, 113–135. [Google Scholar] [CrossRef]
  15. Postoaca, A.-V.; Negru, C.; Pop, F. Deadline-Aware Scheduling in Cloud-Fog-Edge Systems. In Proceedings of the 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), Melbourne, Australia, 11–14 May 2020; pp. 691–698. [Google Scholar]
  16. Vengadeswaran, S.; Balasundaram, S.R.; Dhavakumar, P. IDaPS—Improved Data-Locality Aware Data Placement Strategy Based on Markov Clustering to Enhance MapReduce Performance on Hadoop. J. King Saud Univ. Comput. Inf. Sci. 2024, 36, 101973. [Google Scholar] [CrossRef]
  17. Adnan, A.; Tahir, Z.; Asis, M.A. Performance Evaluation of Single Board Computer for Hadoop Distributed File System (HDFS). In Proceedings of the 2019 International Conference on Information and Communications Technology (ICOIACT), Yogyakarta, Indonesia, 24–25 July 2019; pp. 624–627. [Google Scholar]
  18. Qureshi, B.; Koubaa, A. On Energy Efficiency and Performance Evaluation of Single Board Computer Based Clusters: A Hadoop Case Study. Electronics 2019, 8, 182. [Google Scholar] [CrossRef]
  19. Fati, S.M.; Jaradat, A.K.; Abunadi, I.; Mohammed, A.S. Modelling Virtual Machine Workload in Heterogeneous Cloud Computing Platforms. J. Inf. Technol. Res. 2020, 13, 156–170. [Google Scholar] [CrossRef]
  20. Sebbio, S.; Morabito, G.; Catalfamo, A.; Carnevale, L.; Fazio, M. Federated Learning on Raspberry Pi 4: A Comprehensive Power Consumption Analysis. In Proceedings of the IEEE/ACM 16th International Conference on Utility and Cloud Computing, Taormina, Italy, 4–7 December 2023; ACM: New York, NY, USA, 2023; pp. 1–6. [Google Scholar]
  21. Shwe, T.; Aritsugi, M. Optimizing Data Processing: A Comparative Study of Big Data Platforms in Edge, Fog, and Cloud Layers. Appl. Sci. 2024, 14, 452. [Google Scholar] [CrossRef]
  22. Raspberry Pi. Available online: https://www.raspberrypi.com/ (accessed on 7 May 2024).
  23. Lee, E.; Oh, H.; Park, D. Big Data Processing on Single Board Computer Clusters: Exploring Challenges and Possibilities. IEEE Access 2021, 9, 142551–142565. [Google Scholar] [CrossRef]
  24. Lambropoulos, G.; Mitropoulos, S.; Douligeris, C.; Maglaras, L. Implementing Virtualization on Single-Board Computers: A Case Study on Edge Computing. Computers 2024, 13, 54. [Google Scholar] [CrossRef]
  25. Mills, J.; Hu, J.; Min, G. Communication-Efficient Federated Learning for Wireless Edge Intelligence in IoT. IEEE Internet Things J. 2020, 7, 5986–5994. [Google Scholar] [CrossRef]
  26. Krpic, Z.; Loina, L.; Galba, T. Evaluating Performance of SBC Clusters for HPC Workloads. In Proceedings of the 2022 International Conference on Smart Systems and Technologies (SST), Osijek, Croatia, 19–21 October 2022; pp. 173–178. [Google Scholar]
  27. Lim, S.; Park, D. Improving Hadoop Mapreduce Performance on Heterogeneous Single Board Computer Clusters. SSRN Preprint 2023. [Google Scholar] [CrossRef]
  28. Srinivasan, K.; Chang, C.Y.; Huang, C.H.; Chang, M.H.; Sharma, A.; Ankur, A. An Efficient Implementation of Mobile Raspberry Pi Hadoop Clusters for Robust and Augmented Computing Performance. J. Inf. Process. Syst. 2018, 14, 989–1009. [Google Scholar] [CrossRef]
  29. Fu, W.; Wang, L. Load Balancing Algorithms for Hadoop Cluster in Unbalanced Environment. Comput. Intell. Neurosci. 2022, 2022, 1545024. [Google Scholar] [CrossRef]
  30. Yao, Y.; Gao, H.; Wang, J.; Sheng, B.; Mi, N. New Scheduling Algorithms for Improving Performance and Resource Utilization in Hadoop YARN Clusters. IEEE Trans. Cloud Comput. 2021, 9, 1158–1171. [Google Scholar] [CrossRef]
  31. Javanmardi, A.K.; Yaghoubyan, S.H.; Bagherifard, K.; Nejatian, S.; Parvin, H. A Unit-Based, Cost-Efficient Scheduler for Heterogeneous Hadoop Systems. J. Supercomput. 2021, 77, 1–22. [Google Scholar] [CrossRef]
  32. Ullah, I.; Khan, M.S.; Amir, M.; Kim, J.; Kim, S.M. LSTPD: Least Slack Time-Based Preemptive Deadline Constraint Scheduler for Hadoop Clusters. IEEE Access 2020, 8, 111751–111762. [Google Scholar] [CrossRef]
  33. Zhou, R.; Li, Z.; Wu, C. An Efficient Online Placement Scheme for Cloud Container Clusters. IEEE J. Sel. Areas Commun. 2019, 37, 1046–1058. [Google Scholar] [CrossRef]
  34. Zhou, Z.; Shojafar, M.; Alazab, M.; Abawajy, J.; Li, F. AFED-EF: An Energy-Efficient VM Allocation Algorithm for IoT Applications in a Cloud Data Center. IEEE Trans. Green Commun. Netw. 2021, 5, 658–669. [Google Scholar] [CrossRef]
  35. Zhou, Z.; Abawajy, J.; Chowdhury, M.; Hu, Z.; Li, K.; Cheng, H.; Alelaiwi, A.A.; Li, F. Minimizing SLA Violation and Power Consumption in Cloud Data Centers Using Adaptive Energy-Aware Algorithms. Future Gener. Comput. Syst. 2018, 86, 836–850. [Google Scholar] [CrossRef]
  36. Banerjee, P.; Roy, S.; Sinha, A.; Hassan, M.; Burje, S.; Agrawal, A.; Bairagi, A.K.; Alshathri, S.; El-Shafai, W. MTD-DHJS: Makespan-Optimized Task Scheduling Algorithm for Cloud Computing With Dynamic Computational Time Prediction. IEEE Access 2023, 11, 105578–105618. [Google Scholar] [CrossRef]
  37. Zhang, L. Research on K-Means Clustering Algorithm Based on MapReduce Distributed Programming Framework. Procedia Comput. Sci. 2023, 228, 262–270. [Google Scholar] [CrossRef]
  38. Postoaca, A.V.; Pop, F.; Prodan, R. H-Fair: Asymptotic Scheduling of Heavy Workloads in Heterogeneous Data Centers. In Proceedings of the 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Washington, DC, USA, 1–4 May 2018; pp. 366–369. [Google Scholar]
  39. Guo, T.; Bahsoon, R.; Chen, T.; Elhabbash, A.; Samreen, F.; Elkhatib, Y. Cloud Instance Selection Using Parallel K-Means and AHP. In Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing Companion, Auckland, New Zealand, 2–5 December 2019; ACM: New York, NY, USA, 2019; pp. 71–76. [Google Scholar]
  40. Odroid Xu4. Available online: https://www.hardkernel.com/shop/odroid-xu4-special-price/ (accessed on 7 May 2024).
  41. RockPro64. Available online: https://pine64.com/product/rockpro64-4gb-single-board-computer/ (accessed on 7 May 2024).
  42. Herodotou, H.; Lim, H.; Luo, G.; Borisov, N.; Dong, L.; Cetin, F.; Babu, S. Starfish: A Self-Tuning System for Big Data Analytics. In Proceedings of the CIDR 2011—5th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, 9–12 January 2011; Conference Proceedings. pp. 261–272. [Google Scholar]
  43. Syakur, M.A.; Khotimah, B.K.; Rochman, E.M.S.; Satoto, B.D. Integration K-Means Clustering Method and Elbow Method For Identification of The Best Customer Profile Cluster. IOP Conf. Ser. Mater. Sci. Eng. 2018, 336, 012017. [Google Scholar] [CrossRef]
  44. Kim, H.-J.; Baek, J.-W.; Chung, K. Associative Knowledge Graph Using Fuzzy Clustering and Min-Max Normalization in Video Contents. IEEE Access 2021, 9, 74802–74816. [Google Scholar] [CrossRef]
  45. Singh, A.; Das, A.; Bera, U.K.; Lee, G.M. Prediction of Transportation Costs Using Trapezoidal Neutrosophic Fuzzy Analytic Hierarchy Process and Artificial Neural Networks. IEEE Access 2021, 9, 103497–103512. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
