The Application of Fractal Transform and Entropy for Improving Fault Tolerance and Load Balancing in Grid Computing Environments

This paper applies an entropy-based fractal indexing scheme that enables fast indexing and querying in a grid environment. It addresses fault tolerance and load balancing through fractal-based management to make computational grids more effective and reliable. The fractal dimension of a cloud of points gives an estimate of the intrinsic dimensionality of the data in that space; the main drawback of this technique is its long computing time. The main contribution of the proposed work is to investigate the effect of the fractal transform, by adding an entropy-based R-tree index structure to existing grid computing models, in order to obtain a balanced infrastructure with minimal faults. To this end, the presented work extends common scheduling algorithms, which are built on the physical grid structure, to a reduced logical network. The objective of this logical network is to reduce searching over grid paths according to the job arrival rate and the path bandwidth, with respect to load balance and fault tolerance, respectively. Furthermore, an optimization search technique is utilized to enhance grid performance by finding the optimum number of nodes extracted from the logical grid. The experimental results indicate that the proposed model achieves better execution time, throughput, makespan, latency, load balancing, and success rate.


Introduction
Fractals are rough or fragmented geometric shapes that can be subdivided into parts, each of which is a reduced copy of the whole. They are crinkly objects that defy conventional measures, such as length, and are most often characterized by their fractal dimension. They are mathematical sets with a high degree of geometrical complexity that can model many natural phenomena; almost all natural objects can be observed as fractals [1]. Concepts from fractal theory have been applied to several tasks in data mining and data analysis, such as selectivity estimation, clustering, time series forecasting, correlation detection, and data distribution analysis [2]. Fractal tree indexes maintain log(N) arrays, one array for each power of two; they can use as little as 1/100th the power of B-trees and ride the right technology trends, and it has been argued that in the future all storage systems will use fractal tree indexes [3-7]. Cao (2005) [14] suggested a grid load balancing approach utilizing artificial intelligence to accomplish efficient workload and resource management. A mixture of smart agents and multi-agent methods is implemented for local grid resource scheduling and global grid load balancing. Each agent represents a local grid resource and uses predicted resource performance data with incremental heuristic algorithms to maintain local load balance across different servers. Yagoubi and Slimani (2006) [15] presented a layered balancing algorithm based on a tree representation. This model turns every grid design into a specific four-level tree. It generates a two-level sub-tree for each site: the leaves of this sub-tree represent a site's computational elements, and the root represents a site-related virtual node. These sub-trees, corresponding to cluster locations, are grouped together to create a three-level sub-tree. Finally, such sub-trees are grouped together, creating a four-level tree called the generic load balancing model.
Yan (2009) [16] offered a hybrid network load policy integrating static and dynamic task scheduling architectures. The static load balancing policy is used to choose an appropriate and convenient node set. When a node shows potential failure to continue supplying services, the dynamic load balancing strategy decides whether or not the node involved can still provide load distribution; within a short period, the system can obtain a substitute node to preserve system efficiency. Hao et al. (2012) [17] recommended a load-balanced Min-Min algorithm, mainly implemented to reduce the makespan and increase resource utilization in heterogeneous networks. It proceeds in two steps. The Min-Min method is adopted in the first phase to schedule tasks. In the second stage, tasks on crowded resources are postponed to make effective use of underused resources.

Related Work
The authors in [18] introduced an enhanced hierarchical load balancing. The possible deviation of the mean system load from the cluster load is calculated and checked for containment within a specified range from 0 to 1. The best resources are allocated to jobs by matching the predicted work computing capacity with the cluster's average computing power. The authors of [19] provided a grouping-based schedule of data storage for fine-grained work. It groups fine-grained jobs to form coarse-grained jobs based on resource processing capability and the grouped jobs' processing requirements. The scheduler selects a resource and uses the product of its MIPS rating and the granularity time to estimate the overall number of jobs that could be completed within the specified timeframe; it then selects the resource with the required number of waiting jobs [19]. Balasangameshwara (2012) [20] addressed numerous fault recovery processes, including checkpointing, replication, and rescheduling. Checkpointing is the process of regularly saving the status of a process to permanent storage. This allows a process that fails to restart from the point of its last saved state, or checkpoint, on another resource. Replication involves maintaining a sufficient number of replicas or copies of parallel systems on various resources such that at least one copy succeeds. The rescheduling procedure finds other resources on which to reschedule failed jobs.
The authors in [21] built a grid fault-tolerant scheduling method to plan backups and minimize job response time. In this approach, jobs are modeled as directed acyclic graphs. Jobs are scheduled with delays to avoid execution failures even in the presence of processor faults. Initially, a communication framework is designed that defines when contact between a backup and its successor backups is needed. Any processor malfunction will then initiate the backup, which minimizes response time and expense. The authors in [22] introduced a fault tolerance checkpointing mechanism. The checkpointing method regularly saves the state of a process running on a computational resource so it can restart on another resource in case of resource loss. If any resource faults occur, it invokes the required replicas to meet user application capacity needs. Lee et al. (2011) [23] developed a bi-criteria task scheduling that considers user satisfaction and fault tolerance. It concentrates on a pro-active fault-tolerant mechanism that considers resource failure history while scheduling jobs. It considers the user deadline and the job completion period on all resources, measures a fitness value, and then schedules jobs depending on that fitness value.
The heuristic schedulers discussed here each have benefits and certain drawbacks. Opportunistic load balancing does not consider the expected execution time and is therefore weak. The minimum execution time (MET) heuristic does not consider the completion time of jobs, resulting in extreme load imbalance. Minimum completion time (MCT) often yields a low makespan. The Max-Min heuristic is stronger than all these algorithms, but only for the shortest jobs. Of all the methods discussed, the Min-Min algorithm is fast and performs well on machine efficiency through a reduced makespan, though without considering user satisfaction. The application-demand-aware approach works best when taking user satisfaction into account. Many schedulers handle user deadlines and task scheduling separately, but no scheduler considers both the user deadline and the resource load; there is therefore wide potential for approaches focusing on both factors. Expanding this related work is the scope of this article: improving both load balancing and fault tolerance with a new framework based on the theory of fractal transformation and entropy.
The current methods address load balancing and fault tolerance in grid environments separately [24][25][26][27][28][29][30][31][32][33]. The method proposed in this paper deals with both of them within one framework by combining some classical methods based on fractal transform and entropy computation, reducing the complexity of the data to obtain an optimal method for rendering the grid structure. The study proposes scheduling algorithms that resolve numerous issues, like customer satisfaction, data aggregation, and fault tolerance, by considering criteria such as failure rate, load status, user deadline, and resource usage of scheduling services. In fact, for the first time, we use the fractal transform and entropy to reduce the complexity of the grid. Figure 1 shows the complete proposed model for improving fault tolerance and load balance based on the fractal transform. The first step is to estimate the grid computing service (GCS) parameters, including cost, job queue, task schedule, cluster size, grid size, and the number of resources, and then map this grid structure into a distributed R-tree index structure enhanced by the entropy method to reduce the completion time of the decision-maker. Finally, the threshold device is used to choose the route path based on both load balance and fault tolerance. The migration controller is used to increase fault tolerance, and self-stabilizing control is used to increase cumulative state load balancing. Each consumer submits their computational and hardware specifications to the GCS, and the GCS responds by submitting the results when the job execution is done.

Materials and Method
In the GCS, jobs pass through four stages, which can be summarized as follows: (1) task submission phase: grid users can submit their jobs through the available web browsers, which makes the submission process simple and open to any number of clients. (2) Task allocation phase: when the GCS receives work, it searches for the available resources (computers or processors) and assigns the necessary resources to the task. (3) Task execution phase: once the available resources are committed to the assignment, the task is scheduled to be carried out on that computing site. (4) Results collection phase: when the jobs are done, the GCS alerts the users with the results of their work. The proposed model examines network parameters; the GCS estimates them by evaluating the top-down view of the grid model, which includes the local grid manager (LGM), site manager (SM), and processing elements (PEs) [34]. In this hierarchy, incorporating or deleting SMs or PEs is very flexible, which renders the proposed grid computing service model open and scalable. The LGM's task is to collect information regarding active resources from its SMs.
LGMs also participate in grid-specific tasks and load balancing. New SMs can enter the GCS by sending a join message to the nearest LGM parent. Each SM is capable of managing a dynamically configured pool of processing elements (computers as well as processors), i.e., processing elements can join the pool anytime. A new computing element should be registered with the SM. The SM's function is to gather information regarding the active nodes in its pool; the details gathered contain CPU speed and other hardware measurements. Each SM is also responsible for allocating incoming jobs to processor cores in its pool using a defined load balancing policy. Any public or private PC or workstation can enter the grid system by signing up with any SM and offering grid users its computing resources. As a computational unit enters the grid, it starts the GCS framework, which submits details about its capabilities, such as processor power, to the SM. Any LGM is a web server of the grid model. Using the web browser, customers assign their computing jobs to an associated LGM. According to the load balancing relevant data, the LGM passes the submitted jobs to a suitable SM. The SM, in turn, distributes these computing jobs, according to the available site's high availability information, to a selected execution processing element.

Building Distributed R-Tree (DR-Tree) Using Fractal Transform
The self-similarity property can define a fractal, i.e., an object with roughly the same features over a wide range of scales [1]. Accordingly, a real dataset exhibiting fractal behavior is exactly or statistically self-similar, so parts of the data at any size present the same characteristics as the whole dataset. The fractal dimension [2][3][4] is especially useful for data analysis, as it offers an estimate of the intrinsic dimension D of a dataset. The intrinsic dimension captures the complexity of the entity described by the data independently of the dimension E of the domain in which it is embedded; that is, D measures the non-uniform behavior of real data. For example, a set of points representing a plane immersed in a 3-dimensional space (E = 3) has two independent attributes and a third that depends on the other two, resulting in D = 2. The correlation fractal dimension D2 can determine the spatial pattern of real datasets. The box-counting technique defines D2, an efficient tool for measuring the spatial pattern of datasets embedded in E-dimensional spaces, as:

D2 = ∂ log(Σ_i C_{r,i}^2) / ∂ log(r)

where r is the side of the cells in a (hyper)cubic grid dividing the dataset's address space, and C_{r,i} is the count of points in the ith cell. Thus, D2 can be a valuable method for estimating a real dataset's intrinsic dimension D with feasible computing expense. First, the tree model of grid computing nodes is converted into a DR-tree to reduce the complexity of the grid computing network, owing to its similarity properties with the tolerant and balanced R-tree index structure, which can be searched in logarithmic time. The idea is to promote the deferred splitting strategy in R-trees. This is achieved by imposing an ordering on the R-tree nodes. This ordering must be "nice", in that it must group "similar" data nodes, whose representation can be contained in compact rectangles.
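The box-counting estimate of D2 can be sketched numerically. The following is a minimal illustration, not the paper's implementation: it normalizes a point cloud into the unit square, counts points per grid cell at several scales r, and fits the slope of log Σ_i C_{r,i}² against log r; the function name and parameters are our own.

```python
import numpy as np

def correlation_dimension(points, min_exp=1, max_exp=6):
    """Estimate the correlation fractal dimension D2 by box counting:
    D2 is the slope of log(sum_i C_{r,i}^2) versus log(r)."""
    points = np.asarray(points, dtype=float)
    # normalize the cloud into the unit hypercube
    lo, hi = points.min(axis=0), points.max(axis=0)
    points = (points - lo) / np.where(hi > lo, hi - lo, 1.0)
    log_r, log_s = [], []
    for e in range(min_exp, max_exp + 1):
        r = 2.0 ** -e                        # grid cell side
        cells = np.floor(points / r).astype(int)
        _, counts = np.unique(cells, axis=0, return_counts=True)
        log_r.append(np.log(r))
        log_s.append(np.log(np.sum(counts.astype(float) ** 2)))
    slope, _ = np.polyfit(log_r, log_s, 1)   # D2 = d log S / d log r
    return slope

# points on a line embedded in 2-D space (E = 2): intrinsic dimension close to 1
rng = np.random.default_rng(0)
t = rng.random(5000)
d2 = correlation_dimension(np.column_stack([t, t]))
print(round(d2, 2))
```

A uniformly filled square under the same estimator would instead give D2 close to 2, matching the embedding dimension.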
Given this ordering, each node has a well-defined set of sibling nodes, so we may use deferred splitting. By adjusting the split policy, the DR-tree can achieve the required maximum utilization.
The efficiency of R-trees depends on how well the algorithm clusters data rectangles into a node. We use space-filling curves (fractals) here, specifically the Hilbert curve, to impose a linear ordering on the data rectangles. A space-filling curve visits every point in a k-dimensional grid exactly once and never crosses itself. To derive the curve of order i, each vertex of the curve of order i − 1 is replaced by the curve of order i − 1, rotated and/or reflected as needed. When the curve order tends to infinity, the resulting curve is a fractal with a fractal dimension of 2 [5][6][7]. The main concept is to construct a tree structure that behaves like an R-tree on search and supports deferred splitting on insertion, using the Hilbert value of the inserted data rectangle as the primary key. These objectives can be accomplished as follows: for each node n of the physical tree, store (a) its cluster region and (b) the largest Hilbert value (LHV) of the data rectangles in the sub-tree rooted at n.
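The Hilbert ordering described above can be illustrated with the standard 2-D coordinate-to-Hilbert-value conversion (the classic bit-manipulation algorithm); `hilbert_index` and the grid-order parameter are illustrative names, not the paper's code:

```python
def hilbert_index(order, x, y):
    """Map grid cell (x, y) on a 2**order x 2**order grid to its
    position along the Hilbert curve (its Hilbert value)."""
    d = 0
    s = 2 ** (order - 1)
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/reflect the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

# order the four cells of a 2 x 2 grid along the curve
cells = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(sorted(cells, key=lambda c: hilbert_index(1, *c)))
# [(0, 0), (0, 1), (1, 1), (1, 0)]
```

Sorting data rectangles by the Hilbert value of, e.g., their centers yields the linear key on which deferred splitting operates, as in Hilbert R-trees.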
DR-trees extend the R-tree index architecture, where related nodes self-organize in a synthetic balanced tree overlay centered on semantic relationships. The framework preserves the index properties of R-trees: logarithmic search time in network size and bounded degree per node. Atomic devices connected to the system are called p-nodes (short for physical nodes). A DR-tree is a virtual structure spread over a collection of p-nodes. Terms related to the DR-tree carry the prefix "v-"; thus, DR-tree nodes are called v-nodes (short for virtual nodes). The key points in the DR-tree composition are the join/leave procedures. When a p-node connects, it generates a v-leaf. Another p-node is then contacted to insert the v-leaf into the current DR-tree. v-nodes can split during this insertion; see Algorithm 1 [35].

Entropy Estimation
Given the conceptual (logical) DR-tree, this stage aims to eliminate unnecessary nodes by calculating their entropy. Entropy is often used as an evaluation metric that represents the consistency of a scheduling choice based on ambiguous state-of-service capacity information [36]. The approach is to omit domain blocks (nodes) with high entropy from the domain pool. Thus, all useless domains are eliminated from the pool, achieving a more efficient domain pool. This minimizes network overhead by decreasing the number of search nodes and improves grid computing system efficiency. At the GCS estimation step, the grid manager initializes Algorithm 1 by choosing some parameter ξ, and then Algorithm 2 is executed. Herein, the stability of the logical DR-tree relies on the ξ value for each node; these values are determined empirically (by trial and error). Stability here means that the node will not fail. As ξ increases, the uncertain information of the node increases and the Quality of Service (QoS) decreases.
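The pruning step can be sketched as follows, assuming Shannon entropy over each node's observed load-state distribution; the node names, state counts, and the ξ value are hypothetical:

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy H = -sum p * log2(p) over a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def prune_domain_pool(nodes, xi):
    """Keep only nodes whose state entropy does not exceed the threshold xi;
    high-entropy (uncertain) nodes are dropped from the domain pool."""
    kept = []
    for name, observations in nodes.items():
        total = sum(observations)
        h = shannon_entropy([c / total for c in observations])
        if h <= xi:
            kept.append(name)
    return kept

# hypothetical nodes: observed counts of load states (idle / busy / overloaded)
pool = {
    "node-A": [18, 1, 1],   # almost always idle: low entropy, predictable
    "node-B": [7, 7, 6],    # evenly spread states: high entropy, uncertain
}
print(prune_domain_pool(pool, xi=1.0))   # ['node-A']
```

Raising ξ keeps more (but less predictable) nodes in the pool, matching the QoS trade-off described above.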

Fault Tolerance
Given the reduced set of logical nodes from the previous step, fault tolerance is estimated for each path constituted from the combination of all nodes, depending on the typical DR-tree performance [13]. If a p-node fails, all its sub-trees are reinserted in a non-default configuration (p-node by p-node) to maintain an invariant DR-tree. The proposed model maintains fault tolerance and retains the DR-tree architecture with its invariants by utilizing non-leaf v-node replication. A fault-tolerant solution that utilizes non-leaf node replication preserves tree connectivity when crashes arise. The replication pattern for v-nodes is: the p-root has no replica, and each p-node has a replica of the v-father of its top v-node.

(a) Fault Tolerance Estimation
Replicas are generated at join operations and updated at both join and split operations. When a v-node changes, its holder alerts the p-nodes containing its replicas. Consider a DR-tree of N v-nodes with degree between m and M. Assume that updating a replica costs one message. To calculate the cost of replica changes, it is necessary to determine how many v-nodes are updated during a split. Since each v-node has M − 1 replicas and each update requires one message, updating the replicas of one v-node costs:

cost_update = M − 1 messages

A p-node joining the system may trigger between 0 and log_m(N) splits when adding its v-leaf to the v-children of another v-node. The latter has between m and M v-children, log_m(N) − 1 v-ancestors, and between m − 1 and M − 1 replicas. A v-node can have m to M v-children, so there are M − m + 1 possible numbers of v-children, and a split happens only when it has exactly M v-children. The probability p for a v-node to split is therefore:

p = 1/(M − m + 1)

The likelihood of a p-node producing k splits is the probability that the corresponding v-node and its k − 1 first v-ancestors each have exactly M v-children while the k-th ancestor does not split, therefore:

P(k splits) = p^k (1 − p)

Let cost_r denote the expected cost of updating replicas when associating a p-node with a DR-tree:

cost_r = (1 − p)(M − 1) + Σ_{k=1}^{log_m(N)} p^k (1 − p) k(M − 1)

Entropy 2020, 22, 1410

The first term describes the case in which no splits occur, i.e., only the M − 1 replicas of the joined v-node must be updated, whereas the second corresponds to the other cases. The case where the v-root splits should, strictly, be treated separately, since it has M − 1 v-children; even so, for m > 2, this likelihood is lower than p, so the above is taken as an upper bound to simplify the estimation.
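Under the quantities above (p = 1/(M − m + 1), P(k splits) = p^k(1 − p), one message per replica update), the expected replica-update cost can be evaluated numerically. This is a sketch of our reading of the cost model; in particular, the per-split update term k(M − 1) is an assumption, not taken verbatim from the text:

```python
import math

def expected_replica_cost(m, M, N):
    """Expected number of replica-update messages when a p-node joins
    a DR-tree of N v-nodes whose degree lies in [m, M] (assumed model)."""
    p = 1.0 / (M - m + 1)                       # split only at exactly M v-children
    max_splits = max(1, int(math.log(N, m) + 1e-9))
    # no split: update the M - 1 replicas of the joined v-node
    cost = (1 - p) * (M - 1)
    # k splits: assume each split updates M - 1 replicas (our reading)
    for k in range(1, max_splits + 1):
        cost += (p ** k) * (1 - p) * k * (M - 1)
    return cost

print(expected_replica_cost(2, 4, 1024))   # ≈ 3.5 messages for m=2, M=4, N=1024
```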

(b) Migration Controller
Reinsertion and replication policies, as in [12], were used to test DR-tree reintegration operations involving internal v-nodes. After a non-leaf p-node crash in a DR-tree, the cost of system reconstruction, in terms of the number of messages and the stabilization period, is determined under both the reinsertion and the replication policies.
(1) Stabilization time: with the reinsertion mechanism, the system stabilizes in several rounds starting at the crashed p-node. Since the stabilization time is dominated by the longest reinsertion, it is proportional to log_m(N). (2) Message recovery cost: calculated as the number of messages required to stabilize the system after a non-leaf p-node crash. Costs vary with tree size, but with the reinsertion policy the amount of message propagation is highly variable, resulting in a large standard deviation.

Load Balance
As mentioned in [11], each SM tracks every PE that enters or leaves the grid system and then communicates this to its parent LGM. This means a processing element only needs to communicate when it joins or leaves its site. The workload can be matched between computing elements by using the gathered information to use all system resources effectively and reduce the response time of user jobs. This strategy boosts connectivity and system efficiency by minimizing the communication overhead required to capture system details before making a task scheduling decision. The process measures for the GCS model used to formalize the load balancing policy are defined as follows:

1. Job: all jobs in the system are represented by a job ID, job size in bytes, and a number of job instructions.

2. Processing element capacity (PEC_ij): the number of jobs that can be processed by the jth PE in the ith site per second at full workload. The PEC can be measured from the PE's CPU speed, assuming an average number of instructions per job.

3. Site processing capacity (SPC_i): the number of jobs that can be processed by the ith site per second. The SPC_i is measured by summing the PECs of all the PEs managed by the ith site.

4. Local grid manager processing capacity (LPC): the number of jobs that can be processed by the LGM per second. The LPC is measured by summing the SPCs of all the sites managed by that LGM.
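The capacity measures above compose by simple summation, which a minimal sketch with hypothetical numbers makes concrete (the site names and PEC values are illustrative, not from the paper):

```python
# Hypothetical grid under one LGM: PEC_ij in jobs/s for each processing element
pec = {
    "site-1": [4.0, 6.0],        # two PEs
    "site-2": [3.0, 3.0, 4.0],   # three PEs
}

# SPC_i: jobs/s a site can process = sum of the PECs of its PEs
spc = {site: sum(pes) for site, pes in pec.items()}

# LPC: jobs/s the LGM can process = sum of the SPCs of its sites
lpc = sum(spc.values())

print(spc)   # {'site-1': 10.0, 'site-2': 10.0}
print(lpc)   # 20.0
```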
(a) Load Balance Estimation

As described in [11], the load balancing strategy is multi-level and can be clarified at each grid design level as follows:

1. Load balancing at level 0 (local grid manager): as stated earlier, the LGM retains information regarding all its responsible SMs in terms of their processing capacities (SPCs). The LPC is the total LGM processing capacity, measured as the sum of the SPCs of all the LGM's sites. Based on each site's overall computing power, the LGM scheduler balances the workload over all its site managers (SMs). Where N_j specifies the number of jobs at an LGM in a steady state, the ith site workload (S_iWL) is the number of jobs to be delegated to that site manager, measured as follows:

S_iWL = N_j × (SPC_i/LPC)

2. Load balancing at level 1 (site manager): each SM has PEC information on all the processing elements in its pool. The overall site processing capacity SPC is measured as the sum of the PECs of all processing elements in the site. Where M_j is the number of jobs in a steady state at an SM, the SM scheduler uses the same strategy as the LGM scheduler to distribute the load. Sharing the site workload among the PEs based on their processing power optimizes the productivity of each PE and boosts resource usage. The number of jobs for the ith PE is specified as the ith PE workload (PE_iWL), calculated as follows:

PE_iWL = M_j × (PEC_i/SPC)

To calculate the mean job response time, a single-LGM scenario is assumed to streamline the grid model. This scenario focuses on the time a job spends in the processing elements. Algorithm 3 is used to measure the traffic intensity and the expected mean response time.
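The capacity-proportional split used at both levels (S_iWL = N_j × SPC_i/LPC at the LGM, and the analogous PEC-proportional split at each SM) can be sketched as follows. The largest-remainder rounding is our own addition so that integer job counts sum exactly; the names and numbers are illustrative:

```python
def split_workload(total_jobs, capacities):
    """Distribute total_jobs over resources in proportion to their
    processing capacities, with largest-remainder rounding."""
    total_cap = sum(capacities.values())
    shares = {k: total_jobs * c / total_cap for k, c in capacities.items()}
    floors = {k: int(s) for k, s in shares.items()}
    # hand leftover jobs to the resources with the largest fractional share
    leftover = total_jobs - sum(floors.values())
    for k in sorted(shares, key=lambda k: shares[k] - floors[k], reverse=True)[:leftover]:
        floors[k] += 1
    return floors

# level 0: the LGM splits N_j = 100 jobs over sites by SPC_i / LPC
site_wl = split_workload(100, {"site-1": 10.0, "site-2": 30.0})
print(site_wl)   # {'site-1': 25, 'site-2': 75}

# level 1: site-2's SM splits its share over PEs by PEC_ij / SPC_i
pe_wl = split_workload(site_wl["site-2"], {"pe-1": 10.0, "pe-2": 20.0})
print(pe_wl)     # {'pe-1': 25, 'pe-2': 50}
```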
Algorithm 3: Expected mean job response time
1: Obtain λ, µ, where λ is the external job arrival rate from grid clients to the LGM and µ is the LGM processing capacity.
2: Calculate ρ = λ/µ, the system traffic intensity. For the system to be stable, ρ must be less than 1.
3: For i = 1 to m
4: Calculate λ_i, µ_i, where λ_i is the job flow rate from the LGM to the ith SM managed by that LGM and µ_i is the processing capacity of the ith SM.
5: Calculate ρ_i = λ_i/µ_i, the traffic intensity of the ith SM.
6: For j = 1 to n
7: Calculate λ_ij, µ_ij, where λ_ij is the job flow rate from the ith SM to the jth PE managed by that SM and µ_ij is the processing capacity of the jth PE managed by the ith SM.
8: Calculate ρ_ij = λ_ij/µ_ij, the traffic intensity of the jth PE managed by the ith SM.
9: Calculate the expected mean job response time, E[T_g].
10: End for
11: End for

Jobs arrive sequentially from clients to the LGM according to a time-invariant Poisson process, with inter-arrival times independently, identically, and exponentially distributed at an arrival rate of λ jobs/s, excluding simultaneous arrivals. Each PE in a site's pool is modeled as an M/M/1 queue. Jobs arriving at the LGM are immediately distributed over the sites managed by that LGM, following the load balancing policy (LBP), with routing probability

P_i = SPC_i/LPC

where i is the site number. Under the same scenario, the arrivals at site i are likewise distributed over the PEs managed by that site with routing probability:
P_ij = PEC_ij/SPC_i

based on the LBP, where j is the PE number and i is the site number. As LGM arrivals follow a Poisson process, PE arrivals also follow a Poisson distribution. Assume that service times at the jth PE in the ith SM are exponentially distributed with a fixed service rate of µ_ij jobs/s, reflecting the processing capacity of the PE (PEC) under the high availability policy. The service discipline is first-come, first-served.
To calculate the expected mean job response time, let E[T_g] denote the mean time spent by a job in the grid at arrival rate λ, and let E[N_g] denote the mean number of jobs in the system. By Little's law, the mean time spent by a job in the grid satisfies:

E[N_g] = λ × E[T_g] (12)

E[N_g] can be determined by summing the mean number of jobs in each PE over all grid sites:

E[N_g] = Σ_{i=1}^{m} Σ_{j=1}^{n} E[N_ij^PE] (13)

where i = 1, 2, . . . , m indexes the site managers handled by an LGM, j = 1, 2, . . . , n indexes the processing elements handled by an SM, and E[N_ij^PE] is the mean number of jobs in processing element number j at site number i. As every PE is modeled as an M/M/1 queue with µ_ij = PEC_ij for PE number j at site number i, E[N_ij^PE] = ρ_ij/(1 − ρ_ij). From Equation (12), the expected mean job response time is given by:

E[T_g] = (1/λ) Σ_{i=1}^{m} Σ_{j=1}^{n} ρ_ij/(1 − ρ_ij)

Notice that the stability condition for PE_ij is ρ_ij < 1.
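The derivation above can be sketched end to end under the stated assumptions (capacity-proportional routing, one M/M/1 queue per PE, Little's law); the function and variable names are illustrative, not the paper's code:

```python
def mean_job_response_time(lam, pec):
    """Expected mean job response time E[T_g] for a single LGM.

    lam : external arrival rate at the LGM (jobs/s)
    pec : pec[i][j] = capacity of PE j at site i (jobs/s)
    Jobs are routed in proportion to capacity, so every PE sees the
    same traffic intensity rho = lam / LPC; each PE is an M/M/1 queue.
    """
    spc = [sum(site) for site in pec]   # SPC_i per site
    lpc = sum(spc)                      # LPC for the LGM
    assert lam / lpc < 1, "system must be stable (rho < 1)"
    e_n = 0.0
    for i, site in enumerate(pec):
        lam_i = lam * spc[i] / lpc          # flow into site i (P_i = SPC_i/LPC)
        for mu_ij in site:
            lam_ij = lam_i * mu_ij / spc[i] # flow into PE j (P_ij = PEC_ij/SPC_i)
            rho_ij = lam_ij / mu_ij
            e_n += rho_ij / (1.0 - rho_ij)  # M/M/1 mean number in system
    return e_n / lam                        # Little's law: E[T_g] = E[N_g]/lam

# two hypothetical sites, LPC = 20 jobs/s, arrival rate 10 jobs/s
print(mean_job_response_time(10.0, [[4.0, 6.0], [3.0, 3.0, 4.0]]))  # 0.5
```

With capacity-proportional routing every PE ends up with ρ_ij = λ/LPC, so here each of the five PEs holds one job on average and E[T_g] = 5/λ = 0.5 s.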

Threshold Device
The suggested approach uses a novel 2-D figure of merit to test the network performance. A 2-D figure of merit can be seen in Figure 2. It divides the load balance (LB) and fault tolerance (FT) space into different development conditions. Figure 2 can be divided into nine regions corresponding to the intervals discussed:
GG: good for both FT and LB estimation.
GM: good for FT estimation and medium for LB estimation.
GB: good for FT estimation and bad for LB estimation.
MG: medium for FT estimation and good for LB estimation.
MM: medium for both FT and LB estimation.
MB: medium for FT estimation and bad for LB estimation.
BG: bad for FT estimation and good for LB estimation.
BM: bad for FT estimation and medium for LB estimation.
BB: bad for both FT and LB estimation.
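The nine-region classification can be sketched as a simple threshold mapping over normalized FT and LB scores; the scores and the G/M/B cut-off values below are hypothetical, since no numeric thresholds are stated here:

```python
def grade(score, good=0.66, medium=0.33):
    """Map a normalized score in [0, 1] to G / M / B (thresholds hypothetical)."""
    if score >= good:
        return "G"
    if score >= medium:
        return "M"
    return "B"

def figure_of_merit(ft_score, lb_score):
    """Place a configuration into one of the nine regions, FT grade first."""
    return grade(ft_score) + grade(lb_score)

print(figure_of_merit(0.9, 0.8))   # GG: good fault tolerance and load balance
print(figure_of_merit(0.5, 0.1))   # MB: medium fault tolerance, bad load balance
```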
Finally, to improve the fault tolerance estimate, the replication time and message cost must be minimized, which raises the probability of completed work. On the other hand, the mean job response time should be reduced to boost the load balancing measurement, which improves the number of jobs/s. A novel two-dimensional figure of merit is suggested to describe the network effects on load balance and fault tolerance estimation. The suggested model is further improved by using optimization techniques to approximate the optimum replication time and mean job response time so as to reach the GG region of the 2-D figure of merit. Three separate optimization methods are used to find the optimal approach, namely the genetic algorithm, ant colony optimization, and particle swarm optimization (GA, ACO, and PSO), and the aim is to compare them. In the first step, each user submits their computing jobs to the GCS with their hardware specifications. The GCS answers the user by sending the results after completing the job processing. This model follows the same measures as model one, but it includes extra modules called "optimization strategies" that take their input from the entropy estimation module (logical network) and send their output to the fault tolerance and load balancing estimation modules (outputting the replication time and the mean job response time, respectively).

Results
All experiments were conducted on a dataset collected from http://strehl.com/. A sample of 500 records is generated for the nodes' entropy measurement according to five distributions: random, exponential, normal, uniform, and Poisson. All of these measurements follow Gaussian clusters with means of (−0.227, 0.077) and (0.095, 0.323) and an equal variance of 0.1. Table 1 illustrates a sample of the records and the attributes of each node, which include queue size, task time, CPU speed, and memory size. The simulation model was implemented on: CPU: Intel(R) Core(TM) i3-243M CPU @ 2.40 GHz; RAM: 4.00 GB; system type: 64-bit operating system; operating system: Microsoft Windows 7 Professional. A simulation model is constructed using a MATLAB simulator to assess the performance of the grid computing model. This simulation model consists of one local grid manager who manages several site managers. The MATLAB statistics toolbox is utilized to compute the entropy. The accuracy of the proposed model was evaluated by four well-known measures [37]:
1. Load balance estimation: evaluates the mean job response time, the period a job spends in the grid, under varying arrival rates. The objective is to decrease the mean job response time, which increases the number of jobs/s.
2. Fault tolerance estimation: evaluates the replication cost of the DR-tree, assessed as the sum, over all virtual nodes, of the probability of not splitting times the replica update message cost. The objective is to decrease the replication cost.
3. Gain: evaluates the improvement ratio of the load balance with respect to the traditional load balance model, calculated as gain = (traditional mean job response time − proposed mean job response time)/traditional mean job response time.
4. System utilization: evaluates the number of required resources with respect to the grid size.
The objective is to decrease the grid size and consequently decrease the number of resources.
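Two of the measures above are simple to state in code. The sketch below gives a histogram-based Shannon entropy estimator for a node's attribute sample and the gain formula exactly as defined; the bin count and input format are assumptions of this illustration, and the paper itself uses the MATLAB statistics toolbox rather than this code.

```python
import math
from collections import Counter

def shannon_entropy(values, bins=10):
    """Shannon entropy (bits) of a numeric sample, estimated via a histogram.
    The binning scheme is an assumption of this sketch."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant sample
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain(traditional_mjrt, proposed_mjrt):
    """Improvement ratio of the proposed model's mean job response time
    over the traditional model, as defined in the measures above."""
    return (traditional_mjrt - proposed_mjrt) / traditional_mjrt

print(gain(100.0, 74.0))  # → 0.26, i.e., a 26% gain
```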

Experiment One: Test the Performance of the Proposed Model in Terms of Load Balancing
Objective: To validate the benefits of implementing the proposed DR-tree model for grid computing networks, this experiment compares it with the related load balancing algorithms discussed in [11]. The aim is to decrease the mean job response time and consequently increase the number of jobs/s.
Observation: Figure 3 shows the load balance of the grid network at different arrival rates. The mean job response time is calculated for different random distributions, namely random (Rand-Dist), exponential (Exp-Dist), normal (Norm-Dist), uniform (Unif-Dist), and Poisson (Poiss-Dist); after many trials, the same results were obtained for all distributions. The results confirm that the mean job response time increases approximately linearly as the arrival rate increases. Figure 4 compares the load balance of the proposed model with the traditional algorithm of [11]; the results prove the superiority of the proposed model, which decreases the mean job response time by a ratio of 26% (gain) compared to the traditional one. Additionally, it can be observed that the different distributions of mean job response times do not affect the stability of the model.
Discussion: One possible explanation of these results is that the load balancing achieved by the proposed model is asymptotically optimal, because its saturation point (intensity ≈ 1) is very close to the saturation level of the grid computing model [11]. Furthermore, the suggested model is more stable under different arrival rate distributions because of the DR-tree properties, which mainly depend on the reduced version of the tree. This reduced version of the grid network tree improves the load balancing policy compared with the alternative model in [11], which depends on the whole grid network tree. Moreover, within the suggested model, the information of any processing element joining or leaving the grid system is collected at the associated site manager, which in turn transmits it to its parent local grid manager. This means that communication is needed only when a processing element joins or leaves its site. All of the collected information is used to balance the system workload among the processing elements so as to efficiently utilize the whole system's resources, aiming to minimize user job response time. This policy minimizes the communication overhead involved in capturing system information before making a load balancing decision, which improves the system performance.
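The dispatch policy described in the discussion above, where a site manager routes each job to the least-loaded processing element and updates its view only when an element joins or leaves, can be sketched as follows. The class and method names are illustrative, not from the paper.

```python
import heapq

class SiteManager:
    """Minimal sketch of a site manager dispatching to the least-loaded
    processing element; a (load, element id) min-heap tracks the loads."""

    def __init__(self, element_ids):
        self.heap = [(0.0, pe) for pe in element_ids]
        heapq.heapify(self.heap)

    def dispatch(self, job_time):
        load, pe = heapq.heappop(self.heap)   # least-loaded element
        heapq.heappush(self.heap, (load + job_time, pe))
        return pe

    def join(self, pe):
        heapq.heappush(self.heap, (0.0, pe))  # new element starts idle

manager = SiteManager(["pe1", "pe2"])
assignments = [manager.dispatch(1.0) for _ in range(4)]
print(assignments)  # → ['pe1', 'pe2', 'pe1', 'pe2']
```

Because load updates happen locally on dispatch and membership messages flow only on join/leave, the communication overhead stays proportional to churn rather than to job traffic, matching the discussion's point.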


Experiment Two: Test the Performance of the Proposed Model in Terms of Fault Tolerance
Objective: The second set of experiments was conducted to confirm the efficiency of the suggested DR-tree-based model in terms of replication time and message cost for grid computing networks; the aim is to decrease replication time and message cost and this will cause an increase in the probability of a job being completed. In general, after building the logical grid computing model and executing the load balancing stage, the grid network size influences the time taken for jobs to be completed. So, if the grid size of the network is minimized, the probability of the job being completed is maximized.
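The replication cost measure defined in the Results section (a summation over virtual nodes involving the probability of not splitting and the update-replica message cost) can be sketched as below. The product form and the input format are one plausible reading of that definition, not the paper's exact formula.

```python
def replication_cost(virtual_nodes):
    """Replication cost of the DR-tree, read here as the sum over virtual
    nodes of P(not split) times the update-replica message cost."""
    return sum(p_not_split * msg_cost for p_not_split, msg_cost in virtual_nodes)

# Hypothetical three-node example: (P(not split), message cost) pairs.
nodes = [(0.9, 2.0), (0.7, 1.5), (0.95, 3.0)]
print(replication_cost(nodes))  # 0.9*2.0 + 0.7*1.5 + 0.95*3.0 = 5.7
```

Minimizing this sum, as the objective above states, directly lowers the replication overhead and hence raises the probability of job completion.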
Observation: Figure 6 shows the relationship between grid size and arrival rate for the different random distributions, namely random (Rand-Dist), exponential (Exp-Dist), normal (Norm-Dist), uniform (Unif-Dist), and Poisson (Poiss-Dist). The results reveal that, for different arrival rates, the Poisson distribution yields the minimum grid size compared with the other distributions, with an improvement ratio of about 4.6%. Furthermore, Figure 7 shows a 2-D figure of merit that depicts the relationship between the mean load balance estimation error and the mean fault tolerance estimation error. It can be inferred that the Poisson distribution gives the most stable system compared with all the other distributions and the job response time algorithm. As shown in Figure 7, given the best result for the Poisson distribution, the load balance is still the same, which confirms the stability condition of the DR-tree model. The next step tries to enhance the load balance and fault tolerance of the suggested model using different entropy values, with the best result obtained at an entropy value of 80%. Figure 9 shows the system stability in terms of load balance and fault tolerance estimation errors after this enhancement. The results reveal that when the entropy threshold value ε is decreased below 80%, the stability of the system decreases. The suggested model yields an improvement ratio of 98% for load balance and 33% for fault tolerance compared with the initial conditions.
Discussion: The proposed model depends on the entropy-based DR-tree index structure, through which nodes are selected so as to reduce the estimation error for both load balance and fault tolerance. These selected nodes represent the logical structure and reduce the network size; reducing the network size, in turn, reduces the mean job response time and the replication time.
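The entropy-threshold selection discussed above, where only nodes meeting the threshold ε survive into the reduced logical network, can be sketched as follows. The normalization to [0, 1] and the dictionary input format are assumptions of this illustration.

```python
def select_logical_nodes(node_entropies, epsilon=0.8):
    """Keep only nodes whose normalized entropy meets the threshold ε;
    the survivors form the reduced logical network (sketch, not the
    paper's exact selection rule)."""
    peak = max(node_entropies.values()) or 1.0
    return [n for n, h in node_entropies.items() if h / peak >= epsilon]

# Hypothetical entropy values for four grid nodes.
entropies = {"n1": 3.1, "n2": 1.2, "n3": 2.9, "n4": 2.0}
print(select_logical_nodes(entropies))  # → ['n1', 'n3']
```

Lowering ε admits more nodes and enlarges the logical network, which is consistent with the observation that stability degrades when ε drops below 80%.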

Experiment Three: Test the Performance of the Proposed Model in Terms of System Utilization
Objective: To measure the optimum solution for the resources available within the grid network, this set of experiments illustrates the effect of three different optimization techniques for minimizing the number of resources in the grid. The aim is to decrease the number of resources with respect to grid size, so as to enhance grid utilization. Table 2 lists the factors and levels used to set up the different optimization techniques; the error estimation parameter ε represents the fitness function.
Observation: The results shown in Figures 10-12 reveal that as the number of iterations increases, the ξ value increases for GA, ACO, and PSO, respectively. The curves tend to become stable approximately after the 8th iteration for both PSO and ACO, while GA reaches stability only at the final iteration. It can also be observed that PSO yields the best ξ value, as it reaches a stable value faster than ACO. Figure 13 shows the system resource utilization as a function of the optimization size (the number of iterations for each optimization technique): as the optimization size increases, the system utilization decreases. When comparing the three optimization algorithms with respect to system utilization, the results confirm that they give almost the same results (about 75%), with the PSO algorithm giving the best performance.
Discussion: Several conclusions were derived from the results: (i) in terms of effectiveness, the three algorithms performed equally well in searching for the optimal solution; (ii) in terms of efficiency, PSO is the fastest of the three algorithms at finding the optimum, followed by ACO and then GA; (iii) in terms of consistency, all methods proved consistent in solving this resource optimization problem. This study thus contributes to the decision of determining an appropriate solution algorithm for the problem; the population size in the parameter design has a significant effect on the objective solution. Although all methods performed equally well in terms of effectiveness, PSO achieves the minimum mean cost compared to GA and ACO. This is due to the craziness concept implemented in the PSO mechanism, i.e., the advantage of randomly reinitializing particles. PSO is shown to be the fastest algorithm to converge on the minimum cost. As for consistency, all methods proved consistent in finding solutions; this is due to the diversification components that prevent the algorithms from becoming trapped in local optima and allow them to explore the search space until they finally converge on the best objectives.
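The "craziness" mechanism credited above for PSO's performance can be sketched as a small random-reinitialization step inside an otherwise standard PSO loop. The sphere function stands in for the paper's fitness, and all coefficients, bounds, and names are illustrative assumptions.

```python
import random

def pso(fitness, dim=2, particles=20, iters=50, craziness=0.05, seed=1):
    """Standard PSO with a 'craziness' step: with small probability a
    particle is randomly re-seeded, diversifying the search (sketch)."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(particles)]
    vs = [[0.0] * dim for _ in range(particles)]
    pbest = [list(x) for x in xs]
    gbest = min(pbest, key=fitness)
    for _ in range(iters):
        for i in range(particles):
            if rng.random() < craziness:          # craziness: random re-seed
                xs[i] = [rng.uniform(-1, 1) for _ in range(dim)]
                vs[i] = [0.0] * dim
            for d in range(dim):
                vs[i][d] = (0.7 * vs[i][d]
                            + 1.4 * rng.random() * (pbest[i][d] - xs[i][d])
                            + 1.4 * rng.random() * (gbest[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            if fitness(xs[i]) < fitness(pbest[i]):
                pbest[i] = list(xs[i])
        gbest = min(pbest + [gbest], key=fitness)
    return gbest

sphere = lambda x: sum(v * v for v in x)
best = pso(sphere)
print(sphere(best))  # close to 0 after convergence
```

The craziness branch is the diversification component the discussion refers to: without it, a swarm that collapses early around a local optimum has no mechanism to escape.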
The experimental results obtained through the simulator confirmed that the suggested model improves load balance by up to 98% from the initial condition; furthermore, it outperforms the related work by an average of 26% for job response time and 33% for fault tolerance. In addition, by utilizing different optimization algorithms to find the optimal number of resources (system utilization), the suggested model decreases the number of resources by an average of 75%. In general, fault-tolerant load balancing is a critical issue for the efficient operation of grid computing environments in distributing jobs. These results show that passive replication can be combined with distributed load balancing in the grid and suggest a new way to control the stability of grid networks. Message exchanges between resources in this model are simple and small, thereby preventing network congestion even under heavy job arrival rates. The model integrates static and dynamic load balancing techniques to locate effective sites, identifies system imbalance in the shortest time when any site becomes ineffective, and fills the imbalance with a new site.

Conclusions
In a grid environment, many researchers have proposed various scheduling algorithms for improving the performance of the grid system. This paper began by studying several aspects of grid computing; in the literature survey, various algorithms and methods were identified and studied. Although many scheduling algorithms have been proposed, no single efficient and effective algorithm offers a combined solution for the many open issues. This research proposed efficient and effective scheduling algorithms in which various issues, such as user satisfaction, load balancing, and fault tolerance, are addressed by considering parameters such as failure rate, load state, user deadline, and resource utilization for scheduling.
In general, the main contributions of this work can be highlighted as follows: (1) proposing a new adaptive model to improve fault tolerance and load balancing for the grid computing environment. This model depends on an advanced fractal transform that enhances the tree-model structure of the grid computing environment, improving the network performance parameters affected by fault tolerance and load balance equally; an estimate of the fault tolerance and load balance for the network parameters was calculated based on the fractal transform. (2) The grid computing routing protocol is enhanced by combining fault tolerance with load balance estimation in a novel 2-D figure of merit. The fault tolerance estimation is improved by reducing replication time and message cost, which increases the probability of job completion; on the other hand, reducing the mean job response time enhances the load balance estimation, which in turn increases the number of jobs/s. The experimental results obtained through the simulator confirmed that the suggested models can be improved by up to 98% for load balance and outperform the related work by an average of 26% for job response time and 33% for fault tolerance. Furthermore, by utilizing different optimization algorithms to find the optimal number of resources (system utilization), the suggested model decreases the number of resources by an average of 75%.
Future work may include: (1) applying the proposed model to a real-time environment. (2) Studying the security aspects of this work, which have not been considered here. (3) Considering other user requirements, such as execution cost, as well as other passive failure-handling mechanisms, such as checkpointing. (4) Testing the suggested models under dynamic, randomly arriving jobs. (5) Scaling up the evaluation: the proposed model was tested with 64 resources and up to 1000 jobs, and larger numbers of resources and jobs may be tested as an extension of the proposed models. (6) Modeling other characteristics, such as input/output (I/O) behavior, memory access patterns, and cache effects, and building corresponding scheduling strategies that utilize these parameters.