A New Hybrid Fault Tolerance Approach for Internet of Things

Abstract: According to the Distributed Management Task Force (DMTF), management software for the Internet of Things (IoT) should have five capabilities: Fault Tolerance, Configuration, Accounting, Performance, and Security. Given the importance of IoT management and of fault tolerance capacity, this paper introduces a new fault tolerance architecture. The proposed hybrid architecture uses all of the reactive and proactive policies simultaneously in its structure. A further objective of the paper is to develop an indicator for measuring the fault tolerance capacity of different architectures. The CloudSim simulator was used to evaluate and compare the proposed architecture. In addition to CloudSim, another simulator, based on the Pegasus Workflow Management System (WMS), was implemented to validate the proposed architecture. Finally, in a third step, fuzzy inference systems were designed to model and evaluate fault tolerance in the various architectures. The results confirm the positive effect of combining reactive and proactive policies on the fault tolerance of the proposed architecture.


Introduction
The Internet has evolved from the Internet of content to the Internet of services and the Internet of people, and today to the Internet of things. The Internet of things, hereinafter referred to as IoT, consists of a series of smart sensors that work together, directly and without human intervention, in new kinds of applications. The management and traditional architecture of the Internet must therefore change. For example, the address space must move from IPv4 to IPv6. The first and most important requirement for upgrading IoT management is a layered and flexible architecture. Many models have been proposed for the IoT architecture. A basic model is a three-layered architecture formed from sensor, network, and process layers (see, e.g., [1]).
The explosive growth of smart objects and their dependence on wireless communication technologies make the Internet more vulnerable to faults. These faults may arise for different reasons, including cyber-attacks, as discussed in [2]. They pose daunting challenges to experts, as managing the components that participate in the IoT becomes more important. The Distributed Management Task Force (DMTF) has announced that cloud management software should have the FCAPS capabilities. The first letter, F, stands for fault tolerance. In other words, the first feature of the management software should be fault tolerance. Other management capabilities can be considered only if fault tolerance is present; in its absence, the other features provide no real management ability. Fault tolerance refers to providing uninterrupted service even in the presence of a fault in the system. Clearly, if we want to review and implement management in the IoT layered architecture, the primary focus should be on this first feature, fault tolerance.
Different methods have been introduced to increase fault tolerance in the IoT. For example, in [3], virtual cluster head (CH) nodes are used, given that CH nodes have the role of forwarding collected data in the IoT application. As a result, the network's tolerance of CH node failures increases. Furthermore, in [4], virtualization is used in wireless sensor networks in response to the growth of IoT services. When a fault develops in wireless sensor network communications, it has a significant impact on the many virtual networks running IoT services. The framework proposed in [4] optimizes fault tolerance in virtualized wireless sensor networks, with a focus on heterogeneous networks for service-based IoT applications. Apart from these methods, the main techniques for increasing fault tolerance can be divided into three broad groups. The first group includes redundancy techniques, which can be implemented as hardware redundancy, software redundancy, or time redundancy. The second group includes load balancing techniques, which can also be implemented in hardware, in software, or in the network. The third group of techniques for increasing the fault tolerance (FT) capability relies on different policies. These policies depend on the environment in which they are implemented. For example, two types of policies are used in cloud computing environments: proactive and reactive. Proactive policies do not allow the fault to occur: the design forecasts the fault before it arises so that it can be prevented. Accordingly, these policies cover two phases, fault forecasting and fault prevention. Reactive policies, by contrast, act after the fault has occurred, in the form of fault detection, fault isolation, fault diagnosis, and fault recovery. Each of these methods imposes overhead costs on the system, but the highest performance is achieved when reactive and proactive policies are implemented simultaneously.
The aim of the present study was to evaluate the management of fault tolerance in the IoT communication platform, i.e., the second layer of its architecture. To this end, the analysis carried out in [5] was used. Subsequently, a new architecture was proposed that covers as many phases of the FT capability as possible. By using reactive and proactive policies simultaneously, it aims at the highest possible performance in the output.
The Internet has brought extensive changes to industrial scopes and business models. The industrial Internet is the result of combining the Internet, big data, artificial intelligence, and the world economy. Based on the latest industrial Internet technologies, a developed digital channel is responsible for information delivery, so that intelligent decisions can be made in real time to enhance efficiency. To this end, a real-world platform is needed that achieves rapid integration and responds to the market as fast as possible. Since the architecture proposed in this article is mainly applicable to real-time systems, it can support such rapid integration and response. Hence, it can have a wide range of applications in industrial Internet platforms and the IIoT architecture.
The contributions of this paper are as follows:

• We offer a modern hybrid architecture that simultaneously uses proactive and reactive policies.
• Our proposed architecture uses all three types of reactive policies at the same time.
• Both proactive policies are implemented at the same time in the structure of our proposed architecture.
• Maximum use of different fault detection/recovery methods is also considered in our proposed architecture.
• In addition to the simulations that were carried out, the design and implementation of scientific workflows for the past architectures and our proposed architecture is one of the most important achievements and innovations of this study.

The remainder of the article is divided into eight sections. The related previous architectures are presented in Section 2. In Section 3, the proposed architecture is introduced and described. In Section 4, the CloudSim simulator is used to simulate the proposed architecture and compare its performance with previous architectures. In Section 5, scientific workflows are introduced and the proposed architecture is simulated. In Section 6, the evaluation systems in fuzzy logic are designed and implemented, and the results are analyzed. In Section 7, optimal decision-making is discussed. In Section 8, the conclusions and ideas for future work are presented.

Related Works
As expressed in [5], the fault-tolerant architectures of cloud computing are divided into two general categories according to the policies they use. The first group is formed by proactive architectures and the second by reactive architectures. As expressed in [6][7][8], the preemptive migration and self-healing policies are used in proactive architectures, whereas the Checkpoint/Restart, Replication, and Job Migration policies are used in reactive architectures. The Map-Reduce and FT-cloud architectures are examples of proactive architectures. Map-Reduce, introduced in [9-13], uses both proactive policies in its structure. FT-cloud, described in [14], uses only the self-healing policy.
FTWS, LLFT, FTM-2, FTM, MPI, Gossip, BFT-Cloud, Haproxy, PLR, AFTRC, Vega Warden, MagiCube, and Candy, as presented in [15][16][17][18][19][20][21][22][23][24][25][26][27][28], are among the reactive architectures. Each of these architectures uses one or two policies simultaneously. The exceptions are PLR, AFTRC, and FTM, presented in [15][16][17], which use all three reactive policies. FTM differs from the other two in that AFTRC and PLR are devoted to real-time systems, whereas FTM is not. PLR differs from AFTRC in that it has a lower Wall Clock Time. All of these architectures are explained in detail in [5]. As expressed in [5,16], the AFTRC architecture has five modules in its internal structure: Recovery Cache (RC), Decision Maker (DM), Reliability Assessor (RA), Time Checker (TC), and Acceptance Test (AT). The AT module is responsible for checking the output value of the virtual machine. The TC module performs time checking, i.e., it checks the time required to generate the output of the virtual machine. The RA module evaluates the reliability of the output; in effect, it determines the percentage of credit given to the output based on the two previous parameters, AT and TC. The DM module is responsible for determining and selecting the final output. Finally, the RC module stores the checkpoints used when an operation is replicated. Figure 1 shows the structure of this architecture.
The two main phases of the FT capability are fault detection and fault recovery, and each can be conducted in different ways. The Gossip architecture presented in [20], the FTM presented in [15], and the PLR presented in [17] use the self-detection method. This is the weakest method of fault detection, because a node must detect its own fault. Since the PLR architecture uses only this method in its fault detection phase, its fault detection capability is very weak, which is a great disadvantage of this architecture. The FTM and Gossip architectures, by contrast, also use the other detection and group detection methods in addition to self-detection.
The majority of the architectures proposed in the FT field of cloud computing implement the other detection method in their fault detection phase. Architectures such as LLFT and BFT-Cloud, presented in [19,22], use only the group detection method. The dominant method in fault recovery is system recovery, whereby recovery is performed at the overall level of the system. In this phase, the LLFT and Vega Warden architectures presented in [22,26] perform fault recovery at the node level, which is a weaker method than system recovery. The weakest recovery method is fault masking, which relies on techniques for removing and covering the effect of the fault. The FTM and AFTRC architectures introduced in [15,18] use this technique together with other methods.
It is important to note that all of the architectures studied so far are either reactive or proactive. The architecture described in [27], termed VFT, is instead a hybrid one. The VFT architecture uses the Self-healing, Preemptive Migration, and Replication policies. In [28], it is stated that the architecture introduced in [27] simultaneously uses both proactive and reactive policies. Figure 2 shows the structure of this architecture.
In the VFT architecture, a computational algorithm called the success rate (SR) is applied to each node individually. The algorithm decides based on two factors. The first factor is PR, which represents the performance rate. The second factor is the max success rate, which represents the maximum level considered for the success rate. If the SR of a node is less than or equal to zero, the load balancer does not assign tasks to that virtual machine in the next cycles.
The failure of a node can be decided according to the parameters SC and TDC, where SC stands for Status Checker and TDC stands for Task Deadline Checking. If all of the nodes fail, the FDM, which stands for Final Decision-Making, sends feedback to the fault handler to make it aware of this issue. The fault handler detects and recovers the fault using the techniques defined and implemented for fault detection and fault recovery. It is worth noting that, according to the VFT scenario, the architecture relies on the other detection method in its fault detection phase. The fault recovery method in the VFT architecture is implemented at the system level.
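The VFT rules described above can be illustrated with a minimal Python sketch. It is not the implementation from [27]: the class and function names are hypothetical, and it assumes that a node counts as failed when either the SC or the TDC check fails, a combination rule the text does not spell out.

```python
from dataclasses import dataclass


@dataclass
class VftNode:
    """Hypothetical view of one virtual node in a VFT-style setup."""
    name: str
    success_rate: float  # SR of the node
    status_ok: bool      # result of the Status Checker (SC)
    deadline_met: bool   # result of Task Deadline Checking (TDC)


def node_failed(node: VftNode) -> bool:
    # Assumption: a node fails if either SC or TDC reports a problem.
    return not (node.status_ok and node.deadline_met)


def assignable_nodes(nodes: list[VftNode]) -> list[VftNode]:
    # The load balancer skips nodes whose SR is less than or equal to zero.
    return [n for n in nodes if n.success_rate > 0]


def final_decision(nodes: list[VftNode]) -> str:
    # FDM notifies the fault handler only when every node has failed.
    if all(node_failed(n) for n in nodes):
        return "notify_fault_handler"
    return "select_output"
```

Keeping the SR gate separate from the failure check mirrors the text, where the load balancer and the FDM act on different signals.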
The approaches presented in [29][30][31][32][33] are examples of fault tolerance architectures built on other architectural models of the IoT. Fault tolerance in the Internet of military things has been investigated in [29]. In that model, called IOMT for short, an approach called MM, presented by Malek and Maeng, is used to facilitate fault detection. This method takes the majority over duplicate entries that have been sent to a dual processor. In this architecture, the fault mask technique is also used in the fault recovery phase. Fault tolerance in IoT routing is the method proposed in [30]. Layered fault management, as a plan for end-to-end transfer, has been introduced in [31]. In [32], another method is presented using the concept of virtual services; that article uses a genetic algorithm known as NSGA-II. Fault tolerance based on a middleware architecture is shown in [33]; in this method, redundancy is implemented at the service level. To conclude, the shortcomings of the existing solutions are as follows:

• The highest efficiency of a fault tolerance architecture is obtained through hybrid architectures. Among all of the investigated architectures, only VFT is hybrid, so the lack of hybrid architectures is strongly felt in this scope.
• The VFT architecture uses the proactive policies in an integrated way. However, among the reactive policies it uses only replication, so it does not integrate all of the methods and cannot reach maximum reliability.
• Fault detection in this architecture relies on the other detection method, which is an average method among the detection methods.
• VFT, the only previous hybrid architecture, performs fault recovery at the system level only; faults are not recovered at the VM level. Moreover, it does not use the fault mask method in its fault recovery phase.

Proposed Approach
The architecture proposed in this paper is called PRP-FRC. It is a hybrid architecture in that it implements both reactive and proactive policies. PRP-FRC seeks to implement simultaneously the proactive policies, namely preemptive migration and self-healing, and the reactive policies, namely Checkpoint/Restart, Job Migration, and Replication. Figure 3 shows the structure of the proposed PRP-FRC architecture.
In the proposed PRP-FRC architecture, tasks are initially divided between the physical nodes, which are the hosts, by the load balancer. Subsequently, the manager available in each node divides them between the virtual nodes with the help of the mapping table. The AT module checks the accuracy and reliability of the output of a virtual machine. If the AT does not confirm the validity of a virtual node's output, the manager assigns the task to another virtual machine, using the feedback path from the AT to the manager. The TC module decides on the time validity of the output generated by a virtual machine. In other words, the TC checks whether the virtual machine produced its output within the allowed response time or exceeded the defined period.
This module is critical because the proposed PRP-FRC architecture, like the AFTRC architecture, is intended for real-time applications. The RA module decides on the reliability rate of a node based on the output values of the two previous parameters (AT and TC). If the reliability percentage of a node is less than the limit defined in the RA module, no new task is assigned to that virtual node in the next cycles. In the process of verifying the time validity of the virtual nodes, if none of the nodes has the desired output, the TC issues an "All Nodes Failed" message. In this case, the request for job migration is activated, a guest virtual node on another host is selected by the MPI architecture, and the job migration is carried out.
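The RA gate and the "All Nodes Failed" condition described above could be sketched as follows. This is an illustrative sketch only: the limit value `RELIABILITY_LIMIT`, the function names, and the dictionary-based inputs are assumptions and do not come from the paper.

```python
# Hypothetical reliability limit; the paper does not state its value.
RELIABILITY_LIMIT = 0.5


def eligible_vms(reliability_by_vm: dict[str, float],
                 limit: float = RELIABILITY_LIMIT) -> list[str]:
    """Return the VMs that may receive new tasks in the next cycle.

    A VM whose reliability is below the limit is excluded (RA module).
    """
    return sorted(vm for vm, r in reliability_by_vm.items() if r >= limit)


def time_check(output_ok_by_vm: dict[str, bool]) -> str:
    """Mimic the TC verdict over all virtual nodes of a host."""
    if not any(output_ok_by_vm.values()):
        # No VM produced the desired output: job migration is requested.
        return "All Nodes Failed"
    return "OK"
```

For example, `time_check({"vm1": False, "vm2": False})` yields the "All Nodes Failed" message that triggers the job migration request.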
When not all of the nodes have failed and only some of them have reached the reliability necessary to produce the Trust Output, one of two modes is applied based on the decision of the DM module, which stands for Decision Making. In the first mode, the task is restarted from the last checkpoint, using the checkpoints stored in the architectural structure. In the second mode, the job is migrated to a virtual machine on another host, regardless of the checkpoint. The choice between the two depends on the policies defined in the DM module. Algorithm 1 shows the algorithm of the proposed PRP-FRC architecture.
In summary, the proposed architecture covers the weaknesses of the previous architectures as follows:

• The proposed architecture is a hybrid architecture, so it is expected to have the highest efficiency.
• The proposed architecture uses all reactive and proactive policies in an integrated way, which leads the system to the highest level of fault tolerance.
• The proposed architecture uses two methods, self-detection and other detection, simultaneously in the fault detection phase.
• Unlike the previous architectures, including the hybrid VFT, the proposed architecture uses every fault recovery method: it masks faults, recovers faults at the VM level, and finally provides fault recovery at the system level.
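The recovery choice made by the DM module, as described earlier in this section, can be sketched as a small decision function. The enum, the function name, and the `prefer_checkpoint` flag are hypothetical stand-ins for "the policies defined in the DM module"; returning `None` when every VM succeeded is also an assumption.

```python
from enum import Enum
from typing import Optional


class Recovery(Enum):
    CHECKPOINT_RESTART = "restart from last checkpoint"
    JOB_MIGRATION = "migrate job to a VM on another host"


def choose_recovery(vm_ok: list[bool],
                    has_checkpoint: bool,
                    prefer_checkpoint: bool = True) -> Optional[Recovery]:
    """Pick a recovery action from per-VM validation results."""
    if not any(vm_ok):
        # All nodes failed: job migration is requested directly.
        return Recovery.JOB_MIGRATION
    if all(vm_ok):
        # Assumption: nothing to recover when every VM succeeded.
        return None
    # Only some VMs are reliable: the DM policy picks one of two modes.
    if prefer_checkpoint and has_checkpoint:
        return Recovery.CHECKPOINT_RESTART
    return Recovery.JOB_MIGRATION
```

A real DM policy would likely weigh checkpoint age and migration cost; the boolean flag here only marks where such a policy would plug in.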

Simulation of the Proposed Architecture in CloudSim
CloudSim is used to simulate the proposed architecture and compare it with previous architectures. This simulator can generate various reports on the energy consumption, cost, and execution time of each Cloudlet. Its benefits include ease of implementation, simulation of different types of network topologies and architectures, the definition of multiple DataCenters, easy implementation of different policies for allocating VMs and Hosts, and support for Space Shared and Time-Shared scheduling.
The scenario implemented in this article was designed in the CloudSim simulator as a two-host DataCenter with different numbers of processors. Each host has four VMs with various specifications. The capabilities of the VMs are defined differently so that each Cloudlet is handled in different ways and the resulting behaviors and functions can be examined. Tables 1 and 2 show the configuration specifications of the DataCenter and the VMs, respectively. Moreover, a number of Cloudlets were designed and implemented with different processing lengths. The purpose of varying the Cloudlet lengths was to create different fault modes in the system. After designing the DataCenter, Hosts, VMs, and Cloudlets, and before simulation, it was necessary to extend CloudSim according to the proposed architecture. In particular, the manner in which computational resources, such as RAM and especially PEs, are distributed among requests must be determined. This scheduling can take the form of a Space Shared, Time-Shared, or customized policy. In all simulations, we used all resources in the Space Shared manner.
In the reliability calculation algorithms, the calculations are done in such a way that a failure or a success changes the reliability with a gentler slope at each step. Additionally, a failure at the beginning of a VM's operation has a different effect from failures that occur after several requests. The reliability at each stage of the simulation equals the average reliability of all of the VMs at that stage. The simulation results are shown in Tables 3-5 and Figure 4. These results demonstrate the positive effect of using all policies as hybrids in the proposed architecture. As can be seen from the DataCenter and VM configurations in Tables 1 and 2, the Cloudlets had a data size with a constant value of 300. However, the lengths of the jobs were varied so that various behaviors could appear when different faults were encountered. The acceptable execution time was set between 1 and 7 s. Thus, if a VM presented an output in less than 1 s, the output would be unacceptable, and if it took more than 7 s, the output would be unacceptable as well. This time range is therefore considered in the first phase of analyzing the virtual machines' outputs. In the second phase, if the output was produced within the acceptable time range, the value of the output is validated to see whether the output produced by the VM is acceptable or not.
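The two-phase acceptance of a VM output described above can be written as a short Python sketch. The 1 s and 7 s bounds come from the text; the function names and the caller-supplied value check `is_valid_value` are hypothetical, since the paper does not define how the output value is validated.

```python
from typing import Any, Callable

MIN_SECONDS = 1.0  # outputs produced faster than this are rejected
MAX_SECONDS = 7.0  # outputs produced slower than this are rejected


def time_valid(elapsed_seconds: float) -> bool:
    """Phase 1: the output must arrive within the accepted time window."""
    return MIN_SECONDS <= elapsed_seconds <= MAX_SECONDS


def output_accepted(elapsed_seconds: float,
                    value: Any,
                    is_valid_value: Callable[[Any], bool]) -> bool:
    """Phase 2 runs only if phase 1 passed: validate the output value."""
    if not time_valid(elapsed_seconds):
        return False
    return is_valid_value(value)
```

Running the cheap time check first means the value check is never evaluated for outputs that are already outside the window.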

Validation platform
The reliability of a Host Vm equals the average of all Vms' reliability of one Host.It is well evident that, when all Vms have acceptable outputs, the amount of total reliability would be increased.Moreover, when all of the Vms fail, it is quite natural that their total reliability would be decreased.These two, are respectively the best and worst possible modes, which include the Max and Min of the reliability.In the case that Vm has some acceptable outputs and lacks the expected output, the reliability would be high or low.On the other hand, as time passes, the results become more important.Our criteria is the current time and status of each Vm.It is possible that a Vm may have had an acceptable output before, but it lacks proper output at the present, or vice versa.Therefore, the number of successful outputs or failed ones, as well as the success-effect coefficient versus the fail-effect coefficient is considered to be the effective parameters when doing calculations.
For example, assume that the success-effect coefficient, denoted SE, equals 0.1; the fail-effect coefficient, denoted FE, equals 0.03; and the importance factor, denoted IF, whose highest value is 1, equals 0.6. The numbers of successes and failures are denoted CSC and CFC, respectively, where CSC stands for Continuous Success Counter and CFC stands for Continuous Fail Counter; both are initially zero. As a result, the reliability of every VM equals the previous round's reliability plus the product of the previous round's reliability, CSC, the success-effect or fail-effect coefficient, and the importance factor.
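A minimal sketch of this update rule is given below, using the example values of SE, FE, and IF. Several details are assumptions rather than statements from the text: the failure branch is assumed to subtract the analogous term built from CFC and FE, each counter is assumed to reset when the opposite outcome occurs, the result is clamped to [0, 1], and the starting reliability of 1.0 is arbitrary. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

SE = 0.1    # success-effect coefficient (example value from the text)
FE = 0.03   # fail-effect coefficient (example value from the text)
IF = 0.6    # importance factor, at most 1 (example value from the text)


@dataclass
class VmReliability:
    reliability: float = 1.0  # assumed starting value; the text does not give one
    csc: int = 0              # Continuous Success Counter, initially zero
    cfc: int = 0              # Continuous Fail Counter, initially zero

    def update(self, success: bool) -> float:
        """Apply one round of the reliability update and return the new value."""
        if success:
            self.csc += 1     # assumption: a success extends the success streak
            self.cfc = 0      # assumption: and resets the fail streak
            delta = self.reliability * self.csc * SE * IF
        else:
            self.cfc += 1
            self.csc = 0
            # assumption: a failure subtracts the analogous term
            delta = -self.reliability * self.cfc * FE * IF
        # assumption: reliability is kept inside [0, 1]
        self.reliability = min(1.0, max(0.0, self.reliability + delta))
        return self.reliability


def host_reliability(vms: Sequence[VmReliability]) -> float:
    """Host reliability is the average reliability of its VMs."""
    return sum(vm.reliability for vm in vms) / len(vms)
```

For instance, starting from a reliability of 0.5, one success adds 0.5 × 1 × 0.1 × 0.6 = 0.03 under these assumptions.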

Simulation Environment
Many scientific calculations use workflow technologies to manage complexity [6]. Workflow technology is used to schedule computational tasks on distributed resources and to manage task interdependence. The aim of this scheduling is to optimize the mapping between tasks and appropriate resources. One of the important factors that greatly influences the choice of a scheduling strategy is whether the tasks depend on one another. Some tasks have to be done in succession; these are called dependent tasks. In contrast, other tasks can be run simultaneously or in any order; these are called independent tasks. Scheduling dependent tasks is known as workflow scheduling.
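The distinction between dependent and independent tasks can be illustrated with a small sketch that groups tasks into successive "waves": tasks in the same wave do not depend on one another and could run simultaneously, while each wave must wait for the previous ones. The task names are hypothetical, and the sketch uses Python's standard `graphlib` module; it is not the scheduler used in the simulations.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract": set(),
    "clean_a": {"extract"},
    "clean_b": {"extract"},
    "merge": {"clean_a", "clean_b"},
}


def schedule_in_waves(dependencies):
    """Group tasks into waves; tasks within one wave are mutually independent."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for a stable display order
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave finished to unlock successors
    return waves


waves = schedule_in_waves(workflow)
```

Here `clean_a` and `clean_b` land in the same wave because neither depends on the other, whereas `merge` has to wait for both.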
A simulated environment based on the Pegasus Workflow Management System (Pegasus-WMS) has been implemented to validate the architecture proposed in this article. In Pegasus, workflows are described as Directed Acyclic Graphs (DAGs). In a DAG, each node represents one of the tasks, and the edges represent the interdependence between those tasks. Montage and CyberShake are among the best-known scientific workflows. Montage is applied to the processing and transmission of images and has been used at NASA. Figure 5a shows the Montage workflow, and Figure 5b shows the CyberShake workflow. The CyberShake workflow has been used to process the waves at the California Seismological Center. Pegasus-WMS acts as a bridge between the scientific domain and the execution environment. In addition to executing a described abstract workflow, Pegasus-WMS can translate the tasks into jobs and then oversee the running of those jobs. It can manage data and run jobs simultaneously, and it can monitor and track job execution. Finally, Pegasus can handle jobs in the case of failure. These actions are performed by the five Pegasus subsystems. Figure 6 shows the architecture of the Pegasus workflow management system.
The first major Pegasus subsystem is the Mapper, which produces an executable workflow based on the abstract workflow provided by the user. The second subsystem is the local execution engine, which is responsible for submitting the jobs defined by the workflow. Jobs are submitted according to the workflow; subsequently, the jobs' states are tracked and the time at which each job is ready to run is determined. The next subsystem is the job scheduler, which is responsible for managing the scheduling of individual jobs and monitoring their execution on local or remote resources. The remote execution engine manages the execution of one or more tasks, possibly structured as a sub-workflow, on one or more remote computing nodes. Finally, the monitoring component monitors the execution of a workflow at run time. Analyzing the jobs in a workflow and populating them in a workflow database is the responsibility of this subsystem.
The basic structures, or main components, of a scientific workflow include process, pipeline, data distribution, data aggregation, and data redistribution. The VFT and AFTRC architectures are depicted in Figure 7a and Figure 7b, respectively. The workflow of the proposed PrP-FRC architecture is presented in Figure 8. The basic structures of process, data distribution, data aggregation, and pipeline have been exploited in implementing the workflows of the VFT and AFTRC architectures and of the proposed PrP-FRC architecture.
The proposed architecture has been evaluated against the VFT and AFTRC architectures in terms of the run time of the relevant workflows. The experiments were performed using the simulation environment described above. The following subsections describe the results of these simulations in detail.

Simulation results
Tables 6-8 present the results of simulating the AFTRC and VFT architectures and the proposed PrP-FRC architecture in the Pegasus-WMS simulator.
The proposed PrP-FRC architecture has been compared with the VFT and AFTRC architectures in terms of two criteria: average execution time and reliability.
Figure 9 shows the average execution time for the workflow of each architecture. The PrP-FRC architecture had the highest average execution time, since it implements all of the proactive and reactive policies. The AFTRC architecture, which is not a hybrid architecture but a purely reactive one, had the lowest average execution time.
On the other hand, as shown in Figure 10, the highest level of fault tolerance was provided by the proposed PrP-FRC architecture, with the VFT hybrid architecture in second place. The reliability of VFT was lower than that of PrP-FRC and higher than that of the AFTRC architecture. Additionally, Figure 11 illustrates, as an example, the numbers of failed and succeeded tasks or jobs in one of the simulation rounds.

Modelling and Fuzzy Analysis of Architectures
Our real world is a world of uncertainties and ambiguities. Given that fault tolerance is a qualitative parameter associated with uncertainty, fuzzy logic is used to model and analyse this important feature. Different methods for reliability analysis have been presented in different papers. One of these methods, as described in [35], is the modelling of fault tolerance based on fuzzy logic. The support that fuzzy systems offer for rapid pattern generation and incremental optimization increases the importance of the results. Furthermore, fuzzy-based frameworks for evaluating the fault tolerance of various architectures have been introduced in [36,37]. These frameworks analysed and measured the intended capability while considering various parameters, such as the methods used in the fault detection and fault refinement phases, by means of several designed fuzzy inference systems. A fuzzy evaluation system consists of four main parts: the fuzzifier, the defuzzifier, the fuzzy inference system (briefly referred to as FIS), and the fuzzy rule database.
The role of the fuzzifier is to convert the input terms into a set of linguistic terms; this conversion is carried out by the membership functions. The fuzzy inference engine uses the database of fuzzy rules to obtain the fuzzy output; the fuzzy rules are stored in a dedicated database on which the inference engine draws. The defuzzifier then converts the fuzzy output of the inference engine to a crisp value. The architectures have been assessed with four separate fuzzy engines, termed FIS (1), FIS (2), FIS (3), and FIS (4). Figure 12 shows the inputs and outputs of the engines. Each of the designed engines has one output. The engines FIS (1), FIS (2), and FIS (3) have three inputs. The engines have both similarities and differences in the number of database rules and membership functions.
Trapezoidal membership functions have been used in designing the FIS (1) engine. Each of the three inputs of these engines is a three-level input, with low, medium, and high levels. The output of this fuzzy evaluation engine has been designed with five levels, whose linguistic labels in FIS (1) are very low, low, medium, high, and very high. The results of assessing each architecture with the FIS (1) engine are reported in Table 9. An important point is that the VFT architecture and the proposed PrP-FRC architecture, which are hybrid architectures, cannot be evaluated by this engine, because its first input, which relates to the policies used in the architecture, is designed as a three-level input.
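The fuzzify, infer, and defuzzify chain can be sketched for an engine with three three-level inputs and a five-level output. Everything numeric here is an assumption made for illustration: the [0, 1] input range, the trapezoid breakpoints, the score-based rule base, and the output centres are hypothetical and are not the rules or functions of FIS (1). Defuzzification is done with a simple weighted average of output centres (a zero-order Sugeno-style shortcut), which may differ from the method used in the paper.

```python
from itertools import product


def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside [a, d], 1 on [b, c], linear in between."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Assumed breakpoints (a, b, c, d) for the three input levels on [0, 1].
LEVELS = {
    "low": (0.0, 0.0, 0.2, 0.4),
    "medium": (0.2, 0.4, 0.6, 0.8),
    "high": (0.6, 0.8, 1.0, 1.0),
}
SCORE = {"low": 0, "medium": 1, "high": 2}
# Assumed crisp centres for: very low, low, medium, high, very high.
OUTPUT_CENTRES = [0.0, 0.25, 0.5, 0.75, 1.0]


def fuzzify(x):
    """Fuzzifier: map a crisp input to a membership degree per linguistic level."""
    return {name: trapezoid(x, *params) for name, params in LEVELS.items()}


def evaluate(inputs):
    """Run all 27 hypothetical rules and defuzzify by weighted average."""
    if len(inputs) != 3 or not all(0.0 <= x <= 1.0 for x in inputs):
        raise ValueError("expected three inputs in [0, 1]")
    memberships = [fuzzify(x) for x in inputs]
    weighted_sum = 0.0
    total_strength = 0.0
    for labels in product(LEVELS, repeat=3):
        # Firing strength of a rule: minimum (fuzzy AND) of its antecedents.
        strength = min(m[label] for m, label in zip(memberships, labels))
        if strength == 0.0:
            continue
        # Hypothetical rule base: the summed level scores (0..6) pick one of
        # the five output levels.
        level_index = round(sum(SCORE[label] for label in labels) * 4 / 6)
        weighted_sum += strength * OUTPUT_CENTRES[level_index]
        total_strength += strength
    return weighted_sum / total_strength
```

With these assumed shapes, three inputs at the top of the range fire only the all-high rule and yield the "very high" centre, while mixed inputs blend several rules according to their firing strengths.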
Like the FIS (1) engine, the FIS (2) engine has been designed with trapezoidal membership functions. The main difference between the two engines lies in the number of linguistic variables of the first input. The first input is dedicated to the policies that are used in

Discussion
Inside the DMTF, whose focus is on the FCAPS capabilities in management, there is another group, known as the Cloud commands. The main task of this group is the development of service measurement index technologies, abbreviated SMI. The goal of the SMI is to evaluate and measure aspects of cloud performance in a standard form, and some methods have been established in this field. The purpose of this article was to present a new hybrid fault tolerance architecture. The simulation results in CloudSim and Pegasus-WMS, as well as the fuzzy modeling and evaluation, confirmed the increased fault tolerance and reliability of the proposed architecture.
VFT is a hybrid architecture, but its biggest weakness appears when all of the nodes fail. In this case, the VFT architecture sends feedback to the fault handler so that an appropriate decision can be taken. In this condition, the architecture should be designed and implemented such that the job is migrated to another host; however, since this architecture has no job migration policy, the feedback stays within the same host, which reduces the recovery ability and, ultimately, the fault tolerance of the architecture. Additionally, because the checkpoint/restart policy has not been implemented in this architecture, a task that encounters even the smallest fault has to be executed again in its entirety, so that the overall efficiency is significantly affected. If this policy were used, execution of the task would resume from the last checkpoint and, consequently, the efficiency of the architecture would increase remarkably.
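The cost difference between restarting a task from scratch and resuming it from the last checkpoint can be made concrete with a small sketch. It is a toy model under stated assumptions: a task is a sequence of equal steps, exactly one fault occurs, and checkpoints are free; the function name and parameters are hypothetical.

```python
def steps_executed(total_steps, fail_at, checkpoint_every):
    """Count the steps executed when one fault strikes after `fail_at` steps.

    `checkpoint_every=None` models an architecture without checkpoint/restart,
    so the task is rerun from step 0 after the fault. `fail_at=None` means no
    fault occurs.
    """
    executed = 0
    step = 0
    last_checkpoint = 0
    fault_pending = fail_at is not None
    while step < total_steps:
        if fault_pending and step == fail_at:
            fault_pending = False
            step = last_checkpoint  # roll back (to 0 if no checkpoint exists)
            continue
        step += 1
        executed += 1
        if checkpoint_every and step % checkpoint_every == 0:
            last_checkpoint = step  # assumption: saving a checkpoint is free
    return executed


without_checkpoints = steps_executed(100, 95, None)  # full rerun after the fault
with_checkpoints = steps_executed(100, 95, 10)       # resume from step 90
```

In this toy model a fault near the end of a 100-step task nearly doubles the work without checkpoints, whereas with a checkpoint every 10 steps only the steps since the last checkpoint are repeated.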
The proposed PrP-FRC architecture seeks to implement all of the proactive and reactive policies simultaneously. Consequently, the significant weaknesses of the VFT architecture that stem from the lack of job migration and checkpoint/restart policies are absent from the proposed PrP-FRC architecture. In the fault detection phase, the other-detection method has been applied by means of the decision maker module.
In the proposed architecture, two of the three fault detection methods have been implemented. The AT module, which examines the output of each VM, provides a self-detection method. The TC module, which directly supervises the time validity of the output generated by each VM, brings the other-detection method into this architecture. Finally, owing to the roles played by the RA and DM modules in the PrP-FRC architecture, this architecture does not use the group detection method in the fault detection phase.
In the fault recovery phase, the VFT architecture implements system recovery by means of the feedback on its final output. In the PrP-FRC architecture, feedback has been implemented separately on both the final output of the architecture and the output of each VM. This feedback to the management system triggers the recovery, which is carried out simultaneously at two levels: the node and the system.
The evaluations by the FIS (1) to FIS (4) fuzzy engines also showed that the fault tolerance of the proposed PrP-FRC architecture increased by 16 to 25 percent in comparison with the AFTRC architecture. The PrP-FRC architecture also showed a 25 to 58 percent increase over the VFT hybrid architecture across the fuzzy evaluation engines.
Fault masking, which is one of the fault recovery methods, has been implemented in the simple AFTRC architecture, because the outputs of all of the VMs are collected and the fault effects thereby disappear. However, in the proposed PrP-FRC architecture, owing to the feedback on the output of each VM, the node recovery method has been implemented and the effects of faults on the VMs are not masked. In addition, the AFTRC architecture does not implement node recovery and only masks fault effects. As an idea for future research, a strategy for masking fault effects could be incorporated into the PrP-FRC architecture.

Conclusions
The goal of this research was to present a new fault tolerance architecture that simultaneously uses proactive and reactive policies. The proposed PrP-FRC architecture covers fault tolerance across five phases: fault forecasting, fault prevention, fault detection, fault isolation, and fault recovery. The main aim was for the architecture to fully cover each of these five phases while simultaneously using all of the proactive and reactive policies. Given the full implementation of all of the policies in the proposed PrP-FRC architecture, it was expected to provide higher fault tolerance than previous architectures. The results of the simulations performed in CloudSim, together with the simulation and comparison of the workflow of the proposed architecture, confirmed this increase. The execution time of the proposed PrP-FRC architecture did not differ significantly from that of previous architectures and increased only slightly. This feature highlights the merit of the proposed architecture.
The architecture proposed in this paper simultaneously uses the self-detection and other-detection methods in the fault detection phase. Additionally, it simultaneously uses the three methods of fault masking, node recovery, and system recovery in the fault recovery phase. If a group detection method were also used in the fault detection phase of PrP-FRC, it could be considered a complete architecture that implements all fault detection methods. Implementing this fault detection method in the architecture is left as future work.

Figure 2. VFT architectural structure introduced in [27], which is the only hybrid architecture.

Figure 3. Hybrid architectural structure of the architecture proposed in this paper (PrP-FRC).

Figure 4. Comparison of Diagrams of Reliability of Architectures.

Figure 9. Average execution time of the architectures.

Figure 10. Reliability rate for each architecture.

Figure 11. The number of jobs and tasks and their status.

Figure 12. (a) Inputs of the fuzzy engines; (b) outputs of the fuzzy engines.

Table 4. Results of simulating VFT in CloudSim.