Multi-Dependency and Time-Based Resource Scheduling Algorithm for Scientific Applications in Cloud Computing

Abstract: Workflow scheduling is one of the significant issues for scientific applications, alongside virtual machine migration, database management, security, performance, fault tolerance, and server consolidation. In this paper, existing time-based scheduling algorithms, such as first come first serve (FCFS), min-min, max-min, and minimum completion time (MCT), along with the dependency-based scheduling algorithm MaxChild, have been considered. These time-based scheduling algorithms only compare the burst time of tasks. Based on the burst time, these schedulers map the sub-tasks of the application onto suitable virtual machines according to the scheduling criteria. During this process, little attention is given to the proper utilization of the resources. A novel dependency- and time-based scheduling algorithm is proposed that considers the parent to child (P2C) node dependencies, the child to parent node dependencies, and the time of the different tasks in the workflows. The proposed P2C algorithm emphasizes proper utilization of the resources and overcomes the limitations of these time-based schedulers. Scientific applications, such as CyberShake, Montage, Epigenomics, Inspiral, and SIPHT, are represented in terms of workflows. The tasks are represented as the nodes, and the relationships between the tasks are represented as the dependencies in the workflows. All the results have been validated using a simulation-based environment created with the help of the WorkflowSim simulator for the cloud environment. It has been observed that the proposed approach outperforms the mentioned time- and dependency-based scheduling algorithms in terms of total execution time by efficiently utilizing the resources.


Introduction
Most of the business processes [1] can be represented in terms of a workflow. A workflow can be defined as a directed acyclic graph (DAG) [2] based structure having a group of connected tasks in a parent to child relationship. There is no parent to parent or child to child relationship in a workflow. It means that the tasks on the same level are not connected to each other, and the connectivity of tasks can be done from higher levels to lower levels only. The task invocation, synchronization, and information flow between the different tasks can be represented in a specific order described by the workflow management [3]. A scientific workflow management system (WMS) [4] is used to specify and execute the processing of complex data. The biggest problem for a WMS is scheduling because it is very difficult to identify the resource availability in the central pool of resources at the time of execution. Workflow scheduling is a challenging job, as a proper sequence of the workflow tasks for execution needs to be created. The mapping and management of a workflow's tasks on shared resources are done with the help of scheduling [5,6].
Workflow management systems [3,7] are basically used for the appropriate management and execution of the workflow tasks. The WMS mainly consists of five major entities, including workflow design, information retrieval techniques, scheduling of workflow tasks, fault tolerance, and data movement, as depicted in Figure 1. A workflow structure [8] mainly describes the connection between the different tasks of a workflow. It can be of two types: DAG-based and non-DAG-based structures. The DAG-based workflow structure can further be classified into three main categories: sequence, parallelism, and choice.
Non-DAG structures are a superset of DAG structures and include the iteration pattern [9], in which the tasks can be executed iteratively. The overall design of a workflow can be explained with the help of the workflow structure, its model, and the composition system, as shown in Figure 2. The workflow model [10] is used to define the workflow structure and is of two types: abstract and concrete [11]. The workflow composition system helps users to add different components to the workflow [2,12,13].
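As an illustration of this structure, the sketch below encodes a small hypothetical workflow as a dictionary of tasks with parent-to-child edges and derives, for any task, its parents and whether it is ready to run. The task names and runtimes are invented for the example and are not taken from any of the workflows discussed later.

```python
# Minimal sketch: a workflow as a directed acyclic graph (DAG) of tasks.
# Task names and runtimes are illustrative, not taken from any real dataset.

workflow = {
    # task_id: {"runtime": burst time, "children": tasks that depend on this task}
    "t1": {"runtime": 12.0, "children": ["t3", "t4"]},
    "t2": {"runtime": 7.5, "children": ["t4"]},
    "t3": {"runtime": 3.2, "children": ["t5"]},
    "t4": {"runtime": 9.1, "children": ["t5"]},
    "t5": {"runtime": 4.4, "children": []},   # exit task
}

def parents_of(task_id, wf):
    """Derive child-to-parent dependencies from the parent-to-child lists."""
    return [p for p, spec in wf.items() if task_id in spec["children"]]

# A task is ready only when all of its parents have finished (no same-level edges).
finished = {"t1", "t2"}
ready = [t for t in workflow
         if t not in finished and all(p in finished for p in parents_of(t, workflow))]
print(ready)  # ['t3', 't4']
```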

Existing Challenges
Scheduling a workflow's tasks to appropriate cloud resources is a challenging job, as it depends on the QoS requirements of the cloud applications. In cloud computing, due to heterogeneity, uncertainty, and resource mobilization, resource scheduling is a current area of research in demand. Different criteria for scheduling various resources and parameters require different categories of resource scheduling techniques [14]. Further, the major challenges in workflow scheduling are: (1) how to assign appropriate cloud services to each workflow task; (2) how to deal with cloud infrastructure variability; (3) how to consider the limits of concurrency in the case of multiple tasks running in parallel; (4) how to solve the problem of data transfer between different workflow tasks, etc. [15]. According to the authors [11], the challenges also include: (5) resource management and economic barriers, such as the costs of migrating a workflow's task from one cloud service provider (CSP) to another; (6) legal issues, such as whether the data location or destination can be defined in advance; and (7) security issues, such as single sign-on authentication in inter-cloud environments, monitoring cloud resources, portability to move them from one cloud to another, and service-level agreements (SLA), with global SLAs between a federation and its customers being common issues.

Research Contribution
The major contribution toward resource scheduling in cloud computing is the proposed P2C algorithm. The proposed P2C algorithm is used to overcome the limitations of the existing time and dependency-based schedulers for scientific applications. The proposed algorithm considers the following parameters at a time:

1. Parent to child node dependencies (which parent has the maximum number of child nodes);
2. Child to parent node dependencies (which child has the minimum number of parent nodes);
3. Time of different tasks (in case of a tie in the second condition, as mentioned above) present at different levels in the workflow. The task having the maximum time in the ready queue will be scheduled in case of the same dependency ratio.
The key objectives of the proposed algorithm are: (1) to reduce the overall execution time and (2) to utilize the resources efficiently. The proposed P2C algorithm has been validated by using the simulation-based environment created with the help of the WorkflowSim simulator. It has been observed that the proposed P2C algorithm outperforms the existing time- and dependency-based schedulers in terms of total execution time by utilizing the resources efficiently for the existing scientific applications.

Paper Organization
The rest of the paper is divided into six sections. In Section 2, the existing literature is reviewed and all the components of workflow scheduling along with their execution environment are discussed. Section 3 describes in detail various scientific applications and their architectures. In Section 4, the problem formulation is discussed. A detailed description of the methodology used to solve the problem, along with the working of the proposed P2C algorithm, is given in Section 5. Section 6 describes the experimental setup and the workflows in detail with the statistical result analysis. Finally, Section 7 concludes the overall work with the future scope of the paper.

Workflow Scheduling
One of the main issues of cloud computing is resource scheduling. Users can use the cloud services anytime from anywhere with the help of a stable internet connection. However, users do not have direct access to the cloud resources; instead, special application programming interfaces (APIs) provide the resources on demand. Allocating cloud resources to significantly changing requests based on the end-user application's usage pattern is a big challenge. The primary purpose is not only to optimize the resource allocation for applications but also to improve resource utilization. There are numerous resource allocation algorithms, models, frameworks, and policies. These assist in allocating or transferring the resources and have proven to be helpful for both cloud service consumers (CSC) and cloud service providers (CSP). However, there are some conditions under which resource scheduling is not appropriate for resource allocation to customers. These conditions can be highlighted in terms of factors such as: (1) if the cloud has a limited number of resources, then a resource shortage may occur; (2) if CSPs have fewer resources for the end-user based on the policies and procedures; (3) allocation of additional resources based on a customer's ongoing applications may violate processing strategies; (4) if more than two end-users try to get the same resource simultaneously, then resource congestion may occur; (5) if substantial resources are available within the cloud, but the cloud applications do not assign them to the relevant CSCs' requests, then resources may be lost [16].
Workflow scheduling [17] finds a correct sequence of task execution by following the scheduling criteria. The components of workflow scheduling are shown in Figure 3. Scheduling architecture is very important when it comes to the quality, efficiency, and effectiveness of a project. The layout of scheduling architecture is organized into three categories: centralized, decentralized, and hierarchical. In the centralized environment, one central scheduler takes all scheduling decisions for all the workflow-based activities. In a decentralized approach, there are multiple schedulers but no central controller that helps these different schedulers communicate with each other. However, in the case of hierarchical scheduling, there is a central controller which is used to control not only the workflow execution but also the sub-workflows assigned to the lower-level schedulers [18,19]. Scheduling decisions [21] are divided into two types: local and global. Decisions made based on a single job or sub-job are known as local decisions, whereas decisions based on all jobs are called global decisions. Global decision-making processes provide better overall results because only one job or a few tasks in the workflow are considered in a local decision-making process [22].
Abstract workflow models can be transformed into concrete models by using two schemes: static and dynamic. In the static scheme, concrete models are generated before the execution of the workflow with the help of static information, such as the execution environment, and dynamic changes in resources are not considered [23]. Static schemes can further be categorized into two types: user-directed and simulation-based [24]. The dynamic scheme can further be divided into prediction-based and just-in-time; it considers both static and dynamic information about resources for making scheduling decisions at run-time. The prediction-based scheme considers dynamic information along with some prediction results, whereas the just-in-time scheme makes decisions at the execution time only [8,25,26].
Performance-driven scheduling strategies [27] achieve the highest performance for the user-defined QoS parameters, as the workflow's tasks are mapped to the resources that give the optimal performance [28]. Most operational scheduling strategies focus on increasing the scope of the workflow. Market-driven scheduling strategies focus on resource availability, cost allocation, quality, budget, and time frames; the market model used to organize the workflow treats it as an available resource, which leads to lower costs. Trust-driven scheduling strategies focus on the security and reputation of the resources [29,30].
Effective resource scheduling techniques are used to reduce the execution cost, the execution time, and the power consumption, and to deal with other QoS requirements [31]. The QoS requirements [32] may consist of quality attributes such as reliability, security, availability, and scalability. Resource scheduling is seen as more challenging because neither the CSCs nor the CSPs are ready to share information. The first objective of resource scheduling is to identify adequate resource requirements for appropriate and effective utilization of resources. The second objective of resource scheduling is to identify a sufficient and proper mapping of workloads and resources to support scheduling multiple workloads while meeting many QoS requirements. Therefore, resource scheduling considers the execution time of different workloads. The overall performance depends on the different types of workloads (heterogeneous) and on QoS requirements with similar workloads (homogeneous) [14].
The authors [33] considered task scheduling on clouds as a three-queue (TQ) process based on three queues and dynamic preferences. First of all, scheduling is decided for all the tasks in the available queue based on the priority of the tasks. Secondly, subsequent tasks are divided into different groups based on the input and output data volume, the number of nodes currently running, the task completion time, the disk I/O rate, etc. Finally, the tasks are rearranged into the ready queue by considering all the parameters defined in the second queue.
The authors [34] proposed a solution to maintain a random cloud computing network. The goal is to satisfy QoS parameters by maximizing resource utilization. The proposed scheme increases the resource utilization, as well as reduces the resource consumption and execution time of the applications. However, there is a need for a mapping mechanism between the already allocated resources and the minimum requirement of resources to complete the execution process.
The authors [35] proposed a game-theoretical framework for real-time task scheduling in cloud computing. In the proposed model, a task acts as a player, a VM acts as a tactic, and the player's payoff indicates the player's completion time and waiting time. The proposed model is very effective in reducing the total time and waiting time of all the tasks in the workflow.
A new, convenient, and dynamic scheduling algorithm is required in the cloud environment for efficient resource utilization. Some of the existing static algorithms include first in first out (FIFO) [36], shortest job first (SJF) [37], and round robin (RR) [38]. In contrast, dynamic algorithms do not require any advance information about the VMs and tasks; however, the VMs require constant monitoring. These algorithms are more accurate, efficient, and appropriate in cloud environments. When any VM is overloaded, the work being done on this VM can be transferred to an under-loaded VM [39]. Dynamic RR [40] and heterogeneous earliest finish time (HEFT) [41] are a few examples of dynamic scheduling algorithms widely applied in cloud environments.
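To make the contrast concrete, the short sketch below (with hypothetical burst times) shows how two of the static policies named above order the same batch of tasks; it is only an illustration of the ordering rule, not of any simulator implementation.

```python
# Illustrative sketch (hypothetical burst times): how two static policies order the
# same batch of tasks before any of them is dispatched to a VM.
tasks = [("t1", 8.0), ("t2", 3.0), ("t3", 5.0), ("t4", 1.0)]  # (task id, burst time)

fifo_order = [t for t, _ in tasks]                             # FIFO: plain arrival order
sjf_order = [t for t, _ in sorted(tasks, key=lambda x: x[1])]  # SJF: shortest burst first

print(fifo_order)  # ['t1', 't2', 't3', 't4']
print(sjf_order)   # ['t4', 't2', 't3', 't1']
```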
A novel multi-objective workflow scheduling approach in IaaS clouds [42] is proposed to optimize the cost and the workflow's makespan for real-world scientific applications. The authors [43] proposed a multi-purpose workflow-scheduling algorithm based on decomposition. The proposed model uses Pareto front solutions to achieve at-least-as-good scheduling results instead of repeatedly implementing a single-objective scheduling algorithm with multiple constraints. Furthermore, the authors [44] proposed a dynamic fault-tolerant workflow scheduling (DFTWS) approach with hybrid spatial and temporal re-execution schemes. Firstly, DFTWS calculates the time characteristics of each task and predicts the critical path of the workflow. Secondly, DFTWS identifies the appropriate virtual machine (VM) for each task according to the task requirement and budget quota in the initial resource allocation phase. Finally, DFTWS performs online scheduling, making real-time fault-tolerant decisions based on the failure type and task criticality during the workflow execution.
The authors [45] proposed a new budget-deadline aware scheduling (BDAS) algorithm that addresses the scheduling of scientific workflows under budget and deadline constraints in the IaaS cloud. The proposed heuristic satisfies the budget and time constraints while introducing a cost-time trade-off over heterogeneous cases.
The authors [46] proposed a novel resource prediction-based scheduling technique that automates the allocation of resources for scientific applications in a virtualized cloud environment. Firstly, the proposed predictive model is trained on datasets that simultaneously generate the functions of a scientific application in the cloud. Secondly, resources are determined based on the output of the proposed estimation model. The main goal of the resource prediction-based scheduling technique is to efficiently manage resources for virtual machines, reduce execution time, and improve error rates and accuracy. Additionally, to manage fluctuating demand for resources, resources need to be managed efficiently. Furthermore, the authors [47] focus on the design of a prediction-based scheduling approach that maps the functions of a scientific application to appropriate VMs by combining the features of swarm intelligence and a multi-objective decision-making process. The proposed approach is used to improve the accuracy rate, execution time, cost, and SLA breach rate.
The initial idea of Spring scheduling in terms of computer architecture was proposed by the authors of [48,49]. According to the authors, the Spring scheduling co-processor or multiprocessor has been considered a very large-scale integration (VLSI) accelerator for real-time systems. The co-processor can be used for both static and dynamic scheduling. Many different approaches and their combinations can be used with Spring scheduling, such as highest value first, earliest deadline first, and earliest available time first. The Spring scheduler works on a parallel structure for standard scheduling of several tasks, a number of resources, and the internal criteria of scheduling. Trakadas et al. [50] defined the basic building blocks and levels of a decentralized hybrid cloud MEC architecture that results in a platform-as-a-service (PaaS). The stakeholder ecosystem is also examined in order to provide a wide view of the business prospects of the platform.

Scientific Applications
Scientific applications are used to simulate real-world activities using mathematics. Real-world objects are turned into mathematical models, and their actions are simulated by executing the formulas. In this section, the various scientific applications used for the evaluation purpose are described in detail. These scientific applications are represented in terms of the workflows.

CyberShake
The CyberShake workflow [51] is used to identify earthquake hazards by identifying earthquake ruptures having a moment magnitude value greater than 6. The CyberShake workflow has a parallel structure and can be represented in five levels, as shown in Figure 4.

Montage
The Montage application [52,53] combines many input images in the flexible image transport system (FITS) format to create custom mosaics of the sky. The Montage application has a pipeline structure comprising nine levels, as shown in Figure 5.

Epigenomics
The Epigenomics workflow [52,53] is used to automate various operations in genome sequence processing. The Epigenomics workflow also has a pipeline structure and has eight levels, as shown in Figure 6. The overall input to the workflow is sequence data obtained for multiple "lanes" from the genetic analysis process.

Inspiral
The Inspiral workflow [52] is used to generate and analyze gravitational waveforms from data collected during the coalescing of compact binary systems. The Inspiral workflow has a parallel and pipeline structure and has six levels, as shown in Figure 7.

SIPHT
The SIPHT workflow [52,53] is used to automate the search for small untranslated RNAs (sRNAs) for bacterial replicons in the National Center for Biotechnology Information (NCBI) database. All SIPHT workflows have almost identical structures, and larger workflows can be created by combining smaller independent workflows; the only structural difference between any two SIPHT workflows is the number of Patser jobs. The SIPHT workflow has a pipeline structure, as shown in Figure 8.

Problem Formulation
In this section, the research problem has been formulated using the following notations. Let $W$ denote the list of workflows, where the individual $i$th workflow is denoted by $w_i$. Each workflow consists of a group of jobs, where the $i$th job at the $k$th level is denoted by $j_i^k$.
Let the list of virtual machines (resources) be denoted by $VM$, where the individual $i$th virtual machine is denoted by $vm_i$.
Let $C$ be the category list of workflows and jobs, where $c_i$ denotes an individual category.

Definition 1. Let $f_1$ denote a one-to-one mapping function between a job and its burst time, $f_1(j_i) = \Delta t_b$, where $\Delta t_b$ is the burst time of job $j_i$. There exist dependencies between the different jobs present in a workflow.

Definition 2. Let $f_2$ be a one-to-many mapping function between jobs at different levels.

Definition 3. Let $f_3$ be a one-to-one mapping function between a workflow and its execution time, $f_3(w_i) = \Delta t_e^w$, where $\Delta t_e^w$ is the total execution time of the workflow $w_i$.

Definition 4. Let $f_4$ be a one-to-one mapping function between a job and its execution time, $f_4(j_i) = \Delta t_e^j$, where $\Delta t_e^j$ is the total execution time of the job $j_i$.

Definition 5. Let $f_5$ be a one-to-one mapping function between a job and its execution status at time $t$.

Definition 6. Similarly, $f_6$ is a one-to-one mapping function between a workflow and its execution status at time $t$, where '0' indicates not started yet, '1' indicates pending, and '2' indicates executed successfully.

Definition 7. Let $f_7$ be a one-to-one mapping function between the $i$th virtual machine and its computational power.

Definition 8. Let $f_8$ be a one-to-one mapping function between a workflow and its category $C$.

Definition 9. Similarly, let $f_9$ be a one-to-one mapping function between a job and its category $C$.

Definition 10. Let $f_{10}$ be a one-to-one mapping function between a virtual machine and its occupancy status, where '0' indicates not occupied and '1' indicates occupied.

Definition 11. Let $f_{11}$ be a one-to-many mapping function between a workflow and its jobs.

Definition 12. Similarly, $f_{12}$ is a one-to-one mapping function between a workflow and its job, where the job is the starting job.

Definition 13. Similarly, $f_{13}$ is a one-to-one mapping function between a job and a virtual machine, where the job is assigned to the virtual machine for its execution.

Definition 14. Similarly, $f_{14}$ is a one-to-many mapping function between workflow jobs.

The objective function of the research work, given in Equation (15), minimizes the overall execution time for all levels $k$, subject to the constraints. In this section, the problem formulation has been described with the main objectives of the paper, i.e., (1) to reduce the overall execution time of any scientific application and (2) to utilize the resources efficiently, under constraints such as the computational power of each virtual machine being the same, as represented in Equation (16). Furthermore, any job represented as a node in the workflow can only be scheduled for execution when all the predecessors of that node have been executed successfully, as represented with the help of Equations (18) and (19).
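The small sketch below illustrates, with assumed toy data, how a few of these mappings (the burst-time function, the job-status function, the VM-occupancy function, and the job-to-VM assignment) and the predecessor constraint of Equations (18) and (19) could be encoded. The variable names are readability-oriented paraphrases, not the paper's notation.

```python
# Sketch of the formulation's bookkeeping (names paraphrase f1, f5, f10, and f13;
# the exact equations are given in the paper, and the data here is invented).
burst_time = {"j1": 10.0, "j2": 6.0, "j3": 8.0}   # f1: job -> burst time
job_status = {"j1": 2, "j2": 1, "j3": 0}          # f5: 0 = not started, 1 = pending, 2 = executed
vm_busy = {"vm1": 1, "vm2": 0}                    # f10: 0 = free, 1 = occupied
assignment = {"j1": "vm1"}                        # f13: job -> virtual machine

parents = {"j1": [], "j2": ["j1"], "j3": ["j1", "j2"]}  # dependencies between jobs

def schedulable(job):
    """A job may be scheduled only when every predecessor has executed successfully
    (the constraint expressed by Equations (18) and (19))."""
    return all(job_status[p] == 2 for p in parents[job])

print(schedulable("j2"))  # True  -> j2's only parent j1 has finished
print(schedulable("j3"))  # False -> j2 is still pending
```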

Proposed Solution
In this paper, mainly time-based scheduling algorithms, such as FCFS, min-min, max-min [54], and minimum completion time (MCT) [55], along with the dependency-based scheduler MaxChild [56], have been considered. The MaxChild [56] algorithm is a one-way dependency-based scheduling algorithm that considers the dependency between the parent and child nodes only; no consideration is given to the dependency between a child and a parent node. To overcome the limitations of these time-based schedulers, a multi-dependency and time-based scheduling algorithm is proposed and compared with the existing approaches. The proposed P2C approach considers the parent to child node relationships (dependencies), the child to parent node relationships (dependencies), and the burst time of different tasks present at different levels in the workflow. The main objective of the proposed approach is to reduce the overall execution time of scientific applications, such as CyberShake [51], montage, epigenomics, inspiral, and SIPHT [52], by efficiently utilizing the resources.
The proposed P2C algorithm is used to overcome the limitations of the existing time and dependency-based schedulers for scientific applications. The proposed approach considers the following parameters at a time:

1. Parent to child node dependencies (which parent has the maximum number of child nodes);
2. Child to parent node dependencies (which child has the minimum number of parent nodes);
3. Time of different tasks (in case of a tie in the second condition, as mentioned above) present at different levels in the workflow. The task having the maximum time will be scheduled in case of the same dependency ratio.
The proposed P2C algorithm initially checks whether the number of tasks in the ready queue is less than, or much greater than, the number of available resources. In such cases, the proposed approach executes the tasks in decreasing order of time. The P2C algorithm then schedules those child nodes whose parents have already completed successfully. The maximum number of dependencies is considered from parent to child nodes, whereas the minimum number of dependencies is considered from child to parent nodes.
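One plausible reading of these three criteria is sketched below as a single sort key: tasks releasing the most children come first, tasks with the fewest parents come next, and the longest burst time breaks remaining ties. The helper name and the exact composition of the key are assumptions; Algorithm 1 in the paper remains the authoritative definition.

```python
# Sketch of one possible P2C-style ordering of the ready queue (illustrative only).

def p2c_order(ready, children, parents, burst):
    """Order ready tasks so that tasks whose completion releases the most children come
    first, tasks with fewer parent nodes come next, and longer tasks win remaining ties
    (scheduled earlier when the dependency ratio is the same)."""
    return sorted(
        ready,
        key=lambda t: (-len(children[t]),   # 1. parent-to-child: most child nodes first
                       len(parents[t]),     # 2. child-to-parent: fewest parent nodes first
                       -burst[t]))          # 3. tie-break: longest burst time first

children = {"t1": ["t6"], "t2": ["t6", "t7"], "t3": ["t7"]}
parents = {"t1": [], "t2": [], "t3": []}
burst = {"t1": 4.0, "t2": 9.0, "t3": 6.0}
print(p2c_order(["t1", "t2", "t3"], children, parents, burst))  # ['t2', 't3', 't1']
```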

Working of the Proposed P2C Algorithm
The step by step working of the proposed P2C algorithm is explained with the help of Figure 9 and according to the steps given in Algorithm 1. Further, the procedure READYQUEUESTATUS, used to check the status of the ready queue, is described in Algorithm 2. The Montage workflow has been considered with 25 nodes from the dataset file Montage_25. Further, an execution overhead of 0.21 ms has been added to execute the root node, represented by R. The tasks numbered from 22 to 25 have been combined into a single task due to the pipeline structure of the Montage workflow shown in Figure 5. The step by step description of the P2C algorithm for the execution process is given as follows:

Execution 1:
Step 1: Creation of the parent-child table to find the child nodes of each task in the workflow at level 1.
First of all, the proposed P2C algorithm will check the parent to child dependencies from the first level. In level 1, there are five tasks, numbered from 1 to 5. Therefore, the parent-child table (Table 1) has been created to represent the parent-child dependencies.
Step 2: Sorting the ready queue based on child to parent dependencies.
For the Montage_25 dataset file, there are 5 VMs available at a time to execute the tasks of a workflow. In level 1, there are only five tasks to be scheduled on the 5 VMs. Further, there is a need to sort the tasks according to the child-parent node dependencies. In the case of more than one task with the same number of parents, the tasks are sorted based on the burst time in decreasing order, as depicted in Table 2. The correct sequence of tasks in the ready queue becomes {2, 4, 1, 3, 5} after sorting the tasks based on the number of parents as well as the decreasing order of time.
Step 3: Scheduling of tasks to appropriate VM.
With the help of Step 1 and Step 2, the ready queue of tasks has been determined for scheduling. At the same time, the final list of VMs has also been finalized. The next step is to schedule task 2 → VM1, task 4 → VM2, task 1 → VM3, task 3 → VM4, and task 5 → VM5. After the first run, tasks numbered from 1 to 5 have been executed, and tasks 6-14 are in the available queue. Now, all the steps of the algorithm are executed until all the tasks have finished their execution successfully.
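The assignment step itself can be pictured as pairing the sorted ready queue with the list of free VMs, as in the illustrative snippet below, which reproduces the mapping used in this run under the assumption that there are at least as many free VMs as ready tasks.

```python
# Sketch of the assignment step: pair the sorted ready queue with the free VMs.
# The ordering {2, 4, 1, 3, 5} comes from the walkthrough above; the helper is illustrative.
ready_queue = [2, 4, 1, 3, 5]
free_vms = ["VM1", "VM2", "VM3", "VM4", "VM5"]

schedule = dict(zip(ready_queue, free_vms))
print(schedule)  # {2: 'VM1', 4: 'VM2', 1: 'VM3', 3: 'VM4', 5: 'VM5'}
```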

Execution 2:
Step 1-2: Creation of the parent-child table to find the child nodes of each task in the workflow at level 2.
Each task has only one child, i.e., task 15. Therefore, there is a need to sort the tasks numbered from 6 to 14 in decreasing order of time. Hence, the correct sequence of the tasks in the ready queue becomes {8, 9, 13, 6, 7, 12, 10, 14, 11}.
Step 3: Scheduling of tasks to appropriate VMs.
Schedule task 8 → VM5, task 9 → VM4, task 13 → VM3, task 6 → VM2, and task 7 → VM1. In this step, the VM allocation sequence has been reversed due to the availability of both free VMs and unallocated tasks in the same order. After the second run, tasks numbered from 1 to 9 and task 13 have been executed successfully, and tasks numbered from 10 to 12 and task 14 are in the available queue. Hence, the correct sequence of tasks in the ready queue becomes {12, 10, 14, 11}.

Execution 3:
Step 1-3: Schedule task 12 → VM1, task 10 → VM2, task 14 → VM3, and task 11 → VM4. After the third execution, tasks numbered from 1 to 14 have been completed successfully, and task 15 is available in both the available and the ready queue. There is a need to schedule task 15 on VM1.

Execution 4:
Step 1: Creation of the parent-child table to find the child nodes of each task in the workflow at level 4.
After the fourth execution, tasks numbered from 1 to 15 have been executed, and task 16 is available in both the available and the ready queue. The initial parent-child table for level 4 is shown in Table 3.
Step 2: Sorting the ready queue based on child to parent dependencies.
After sorting the tasks based on the number of parents and the decreasing order of time, the updated table is represented in Table 4.

Execution 5:
After the fifth execution, tasks numbered from 1 to 16 have been executed, and tasks numbered from 17 to 21 are available for execution. Hence, the correct sequence of the tasks in the ready queue becomes {20, 19, 21, 17, 18}.
Step 1-3: Scheduling of tasks to appropriate VM.
Schedule task 20 → VM1, task 19 → VM2, task 21 → VM3, task 17 → VM4, and task 18 → VM5. After the sixth execution, tasks numbered 1-21 have been executed, and task 22 is available in both the available and the ready queue. Further, there is a need to schedule task 22 on any of the free VMs.
After the seventh execution, tasks numbered from 1 to 22 have been executed, and the P2C algorithm stops because all the tasks have completed their execution. Hence, the overall execution time is the sum of the execution time consumed at each step of the algorithm. The simplified step by step description of the algorithm is represented with the help of the flowchart shown in Figure 10.
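Assuming identical VMs and tasks within a round running fully in parallel, the per-round time is governed by the longest task of that round, and the overall time accumulates over rounds plus the stated root-node overhead; the sketch below illustrates this accumulation with placeholder values rather than measured ones.

```python
# Hedged sketch: with identical VMs and one task per VM per round, a round finishes
# when its longest task finishes, so the total time is the sum over rounds plus the
# 0.21 ms root-node overhead mentioned above. Burst times below are placeholders.
ROOT_OVERHEAD = 0.21  # ms, overhead stated for the root node R

rounds = [                        # burst times (ms) of the tasks scheduled in each execution
    [12.0, 9.5, 11.2, 8.7, 10.1],  # execution 1
    [14.3, 13.0, 12.8, 9.9, 8.4],  # execution 2
    [7.6, 6.1, 5.9, 5.5],          # execution 3
]

total = ROOT_OVERHEAD + sum(max(r) for r in rounds)
print(round(total, 2))
```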

Experimental Setup and Results
Cloud applications have different requirements for configuration and deployment [57]. All the experiments have been conducted on a workstation having a 64-bit Windows 10 operating system, 6 GB memory, and an Intel (R) Core (TM) i5-3367U @ 1.8 GHz CPU. This section describes the details related to the cloud resources, simulation environment, workflows dataset, and results analysis.

Cloud Resources
A resource can be defined as a physical or logical component that is connected to a computer system. Cloud resources are mainly classified as fast computing, storage, communication, power, and security resources. Every device which is connected to a computer system is considered to be a resource. Figure 11 represents the detailed view of the various cloud resources which are provided on the user's request through the internet. These resources are mainly classified into two categories: physical resources and logical resources. Physical resources consist of the system's hardware, such as the processor, memory, and other peripheral devices connected to the system. Logical resources control the physical resources on a temporary basis and mainly consist of the operating system, power sources, APIs, databases, networking bandwidth, and protocols [58,59]. A cloud framework can be heterogeneous, enabling the proper utilization of homogeneous as well as heterogeneous resources. Further, quality of service (QoS), specific runtimes, and virtualization technology can be utilized to meet the application's objectives.

Fast Computing Utility
These resources provide the computational power to the users for the execution of their applications. Cloud computing provides computation as a service (CaaS) as a utility to cloud users. These resources can be treated as the collection of physical machines, mainly processing power, memory, algorithms, operating systems, and APIs. To provide these resources to users, a virtual environment is deployed on the physical machines, and each provisioned instance is considered a virtual machine [58,59].

Storage Utility
One of the main issues for end-users is to store their data on convenient storage media, such as hard disks, floppy disks, and flash drives. However, instead of purchasing their own storage space, it is often better to take storage space from CSPs on rent. Therefore, cloud users store their data and information on external remote database servers. Data can be accessed and transferred from the database servers to the user's system through internet connectivity [58,59].

Communication Utility
It is also known as network utility or network as a service (NaaS). Communication utility consists of physical resources, such as intermediate devices, sensors, workstations, and logical resources, such as bandwidth, delay, protocols, and communication links. From the networking perspective, intermediate devices and communicating devices consist of modems, hubs, routers, and switches, etc. Further, the networking resources installed on physical machines within a data center are mainly organized in clusters [58,59].

Power Utility
Cloud computing consists of thousands of data servers and various types of other interconnecting devices. Energy efficiency is one of the main issues due to a lot of power consumption for the overall functioning of these service utilities. The power is consumed by the data servers and by cooling and supporting infrastructure, power distribution equipment, and networking equipment. Data centers usually take power from power utility providers, such as local power storage, especially from renewable energy sources including wind and solar energy [58,59].

Security
Cloud users demand various service utilities, such as IaaS, PaaS, and SaaS, from the cloud provider. Therefore, these services must be highly secure, reliable, and available. Trust, integrity, privacy, authentication, and availability must be considered from the perspective of security issues [58,59].

Workflows Dataset
There is no cost of infrastructure and services needed to test these applications in a repeatable and controlled simulation environment. Initially, a smaller number of virtual machines is created for the execution of the smaller workflows. Later on, the number of virtual machines can be increased dynamically according to the execution of the larger workflows. The different test cases are created by using the smaller as well as the larger workflows, as listed in Table 5.

Simulation Environment
The simulation environment evaluates different kinds of resource leasing on the provider's side under different conditions with different load distributions. The implementation of the existing and proposed scheduling algorithms has been done on the WorkflowSim [60] simulator by setting it up in the NetBeans IDE. The homogeneous virtual machines used in the simulation process have 512 MB memory, a CPU with 1000 MIPS, a bandwidth of 1000 BPS, and 10 GB of image size. WorkflowSim extends the existing CloudSim [61] simulator; CloudSim allows modeling of the cloud environment, data centers, virtual machines, and cloudlets, but it handles only individual tasks and is not suitable for scheduling workflows, in which many dependent tasks need to be managed together. WorkflowSim, therefore, provides a higher level of workflow management over the CloudSim layer.

Statistical Analysis
Statistical analysis is the collection and interpretation of data to uncover patterns and trends. It is a part of data analytics. Statistical analysis can be used in collecting research descriptions, statistical modeling, or designing surveys and studies. There are two main types of statistical analysis, i.e., descriptive and inferential. After collecting the data, there is a need to analyze it by summarizing it, for example by creating a pie chart, line graph, bar graph, or histogram. All of these come down to using the right methods for statistical analysis, i.e., processing and collecting data samples to uncover patterns and trends. For this analysis, there are five different methods: average, standard deviation, regression, hypothesis testing, and sample size determination [62]. The following terms are used in the descriptive analysis for this research:
Definition 15. Mean, Standard Deviation, and Variance
The mean (average) of a dataset is obtained by adding all the numbers in the dataset and dividing by the number of values in the dataset. The standard deviation is a measure of the dispersion of a dataset from its mean. It measures the absolute variability of a distribution; the higher the dispersion or variance, the higher the standard deviation and the greater the magnitude of deviation of the values from the mean. The variance measures how widely the dataset is distributed. A variance of zero indicates that all data values are the same. However, a high variance indicates that data points are widely spread out from the mean and from each other. The variance can be defined as the average of the squared distances from each point to the mean.
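For instance, these three measures can be computed directly with Python's statistics module on a hypothetical sample of execution times, as shown below.

```python
# Quick illustration of the descriptive measures defined above, using Python's
# statistics module on a hypothetical set of execution times (ms).
import statistics

times = [41.2, 39.8, 44.5, 40.1, 43.0]

mean = statistics.mean(times)        # average of the dataset
var = statistics.pvariance(times)    # average squared distance from the mean
std = statistics.pstdev(times)       # square root of the variance

print(mean, round(var, 3), round(std, 3))
```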

Definition 16. Confidence Interval and Margin of Error
The confidence interval indicates that the parameter is likely to fall between a pair of values around the mean. A confidence interval measures the level of uncertainty or certainty in a sampling method. Confidence intervals are usually built using 95% or 99% confidence levels. The margin of error tells about the percentage of points in the results that differ from the actual population value. For example, a 95% confidence interval with a 5% margin of error indicates that the statistic will be within 5 percentage points of the real population value 95% of the time. The following null hypothesis is considered for this research: H0 = There is a significant difference of size, shape, and structure of the workflow on the proposed scheduling algorithm.
To verify the validity of the proposed algorithm and the null hypothesis, the statistic "two-factor ANOVA without replication" has been selected and executed successfully at a confidence interval of 95% with a 5% margin of error. The basic difference between the two-factor ANOVA with and without replication is that the sample size is different [63]. In the replication technique, all the samples are mostly unique, and if this happens, there is a need to calculate the mean independently. In this experiment as well, we have different-sized workflows, as depicted in Table 5.
Definition 17. Two-Factor ANOVA without Replication
A factor is an independent variable. A level is some aspect of a factor; levels are also called groups or treatments. The two factors considered are factor A (Algorithms) and factor B (Workflow Size). The factor 'Algorithms' has six levels (FCFS, min-min, max-min, MCT, MaxChild, and P2C). The factor 'Workflow Size' has four levels (e.g., for the montage workflow: Montage_25, Montage_50, Montage_100, and Montage_1000), as depicted in Table 5. The levels for the factor 'Algorithms' are organized as rows, and the levels for the factor 'Workflow Size' are organized as columns. The two-factor ANOVA tests for the main effects of the factor 'Algorithms' and the factor 'Workflow Size', as represented in Equation (22) or Equation (23) and depicted in Table 6.
$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_m$ (factor Algorithms) (22)
or
$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_n$ (factor Workflow Size) (23)
Table 6 indicates that the total (sum over all the tasks of the CyberShake workflow in all four variants) execution time and the average execution time for the proposed P2C algorithm are less than those of the other comparative scheduling algorithms. Further, the value of the variance for the proposed approach is higher than that of the other scheduling algorithms. The descriptive results of the two-factor ANOVA test are represented in Table 7. Since the p-value for the factor Algorithms is 0.000 < 0.05 (and F = 7.996 > 2.901 = F-crit), as depicted in Table 7, the null hypothesis is rejected, and there is no significant difference of size, shape, and structure of the CyberShake workflow on the proposed P2C algorithm. Similarly, the summary as well as the descriptive results of the two-factor ANOVA for the montage workflow are depicted in Tables 8 and 9.
Table 8 indicates that the total and average execution time for the montage workflow of the proposed P2C algorithm is less than that of the other comparative scheduling algorithms. Since the p-value for the factor Algorithms is 0.000 < 0.05 (and F = 12.752 > 2.901 = F-crit), as depicted in Table 9, the null hypothesis is rejected, and there is no significant difference of size, shape, and structure of the montage workflow on the proposed P2C algorithm.
Table 10 indicates that the total and average execution time for the epigenomics workflow of the proposed P2C algorithm is less than or equal to that of the other comparative scheduling algorithms. The p-value for the factor Algorithms is 0.024 < 0.05, but F = 2.526 < 2.901 = F-crit, as depicted in Table 11. Therefore, the null hypothesis for the comparative scheduling algorithms with respect to the proposed P2C algorithm is rejected in some cases and accepted in a majority of the remaining cases. Due to this variable behavior, a significant difference of size, shape, and structure of the epigenomics workflow on the proposed scheduling algorithm cannot be determined. Further, by carefully examining the total and average execution time for the epigenomics workflow, it has been observed that the proposed P2C algorithm either outperforms or behaves similarly to some existing algorithms. This could be due to the parallel as well as pipeline structure of the epigenomics workflow.
Table 12 clearly indicates that the total and average execution time for the inspiral workflow of the proposed P2C algorithm is less than that of the other comparative scheduling algorithms. Since the p-value for the factor Algorithms is 0.000 < 0.05 (and F = 40.834 > 2.901 = F-crit), as depicted in Table 13, the null hypothesis is rejected, and there is no significant difference of size, shape, and structure of the inspiral workflow on the proposed scheduling algorithm.
Table 14 indicates that the total and average execution time of the SIPHT workflow for the proposed P2C algorithm is less than that of the other comparative scheduling algorithms. Since the p-value for the factor Algorithms is 0.025 < 0.05 (and F = 3.471 > 2.901 = F-crit), as depicted in Table 15, the null hypothesis is rejected, and there is no significant difference of size, shape, and structure of the SIPHT workflow on the proposed scheduling algorithm.
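For reference, a two-factor ANOVA without replication of this kind can be reproduced in Python with an additive OLS model (main effects only, since a single observation per algorithm/size cell leaves no room for an interaction term). The sketch below uses invented execution times purely to show the mechanics, not the paper's measurements.

```python
# Sketch of a two-factor ANOVA without replication: one observation per
# algorithm/workflow-size cell, so only the two main effects are modeled.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

data = pd.DataFrame({
    "algorithm": ["FCFS", "FCFS", "MinMin", "MinMin", "P2C", "P2C"],
    "size":      ["25",   "50",   "25",     "50",     "25",  "50"],
    "time":      [120.4,  250.7,  115.2,    240.9,    101.3, 210.8],  # hypothetical values
})

# No interaction term: with a single observation per cell it cannot be estimated.
model = smf.ols("time ~ C(algorithm) + C(size)", data=data).fit()
print(anova_lm(model))  # F statistics and p-values for the two main effects
```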

Results
Initially, all the scheduling algorithms have been implemented for the existing scientific applications, such as CyberShake, montage, epigenomics, inspiral, and SIPHT, with different numbers of tasks and different numbers of VMs but with the same structure. The experimental results have been compared for different numbers of virtual machines and different sizes of workflows. The existing as well as the proposed algorithms have been validated on all the test cases listed in Table 5, and the results are shown in Figures 12-16. It is clearly visible in Figure 12 that, for the CyberShake workflow, the proposed approach outperforms the existing approaches in terms of total execution time for a smaller number of tasks. However, due to its parallel nature, when a larger number of tasks is available, the proposed approach behaves similarly to the max-min scheduler and takes the same time as the max-min scheduler. Similarly, in some cases of the inspiral and epigenomics workflows, the proposed algorithm behaves similarly to the max-min algorithm. It is clearly visible in Figure 13 for the montage workflow and Figure 16 for the SIPHT workflow that the proposed P2C algorithm outperforms the existing scheduling algorithms in terms of total execution time.

Complexity Analysis
The task scheduling problem in the cloud environment has a large solution space. Moreover, the scheduling of n tasks onto m resources has been considered an NP-hard problem. Thus, it will take $O(n^m)$ time, which is non-polynomial, since no algorithm exists that can find the optimal solution in polynomial run time. The proposed P2C algorithm is based on a priority queue structure. In general, a priority queue ranks its tasks by a particular key with an order relation. Here, each element has its key, and such keys are not necessarily unique. Some major operations are associated with the priority queue, such as TaskList, ReadyQueue, AvailableQueue, and READYQUEUESTATUS(). The priority-based task scheduling has the following properties:

1. Each task has a priority associated with it;
2. A task with a high priority must precede the low-priority tasks;
3. Two tasks can have the same priority; however, such tasks will be scheduled as per their order in the queue.
In the case of linear (unsorted) ordering, each removal() takes $O(n)$ time, so the total computational complexity for removal() over n tasks is $O(n^2)$.
In the case of non-linear sorting, if the priority queue uses a heap, then insert() and remove() each take $O(\log n)$, where n is the number of tasks. Here, a full pass over the n tasks takes $O(n \log n)$ time for insert(), as well as for remove().
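A heap-backed ready queue with exactly these properties can be sketched as follows; the tie-breaking counter is an implementation detail assumed here to realize property 3, and a smaller numeric key stands for a higher priority.

```python
# Sketch of a heap-backed ready queue: insert() and remove() are each O(log n),
# so processing n tasks costs O(n log n) overall. Priorities are illustrative.
import heapq
from itertools import count

heap, order = [], count()

def insert(task, priority):
    # heapq is a min-heap, so a smaller number means a higher priority; the counter
    # preserves insertion order when two tasks share the same priority (property 3).
    heapq.heappush(heap, (priority, next(order), task))

def remove():
    priority, _, task = heapq.heappop(heap)
    return task

insert("t1", priority=2)
insert("t2", priority=1)
insert("t3", priority=1)   # same priority as t2 -> dequeued after t2 (FIFO on ties)

print(remove(), remove(), remove())  # t2 t3 t1
```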

Conclusions and Future Scope
Workflow scheduling is one of the significant issues in cloud computing, among other popular issues such as virtual machine migration, database management, security, performance, fault tolerance, and server consolidation. Workflow scheduling is a challenging job, as a proper sequence of workflow tasks for execution has to be found, and it further depends upon the QoS requirements of the cloud applications. Due to heterogeneity, uncertainty, and resource mobilization, resource scheduling is one of the hotspot areas of research in demand. Different criteria for scheduling various resources and parameters require different categories of resource scheduling techniques. In this paper, existing time-based scheduling algorithms, such as first come first serve (FCFS), min-min, max-min, and minimum completion time (MCT), along with the dependency-based scheduling algorithm MaxChild, have been considered. The main objective of the existing time-based schedulers is to reduce the workflow's execution time, but no importance is given to resource utilization. The proposed P2C algorithm mainly focuses on utilizing the resources efficiently. Further, the proposed P2C algorithm outperforms the existing time- and dependency-based scheduling algorithms in terms of total execution time. The experimental results conclude that the existing schedulers have varying execution times based on the size and shape of the workflow and the number of resources and virtual machines available. From the statistical analysis, it has been found that the size, shape, and structure of the workflow have no significant effect on the proposed algorithm. There are still many challenges that need to be overcome to achieve more effective and comprehensive results. At present, the results have been computed for the standard CyberShake, montage, epigenomics, inspiral, and SIPHT workflows with varying numbers of tasks and virtual machines. In the future, the work could be extended to other scientific systems beyond the natural environment, and the verification could be tested over a real-time cloud space.

Author Contributions:
The initial writing of the ideas and the preparation, creation, and presentation of the published work by the original research group, specifically the critical review, commentary, or revision, including pre- or post-publication stages, have been done by the first author, V.P. The other authors, S.B. and L.G., have played the role of supervisors, with oversight and leadership responsibility for the research activity planning and execution, including mentorship external to the core team. Further, they have provided valuable suggestions, specifically on the critical review, commentary, or revision, including pre- or post-publication stages. All authors have read and agreed to the published version of the manuscript.