Optimizing the Reliability and Performance of Service Composition Applications with Fault Tolerance in Wireless Sensor Networks

Service composition technology provides flexible methods for building service composition applications (SCAs) in wireless sensor networks (WSNs). High reliability and high performance of SCAs help service composition technology promote the practical application of WSNs. The reliability and performance optimization methods used for traditional software systems are mostly based on the instantiation of software components, which makes them inapplicable and inefficient for the ever-changing SCAs in WSNs. In this paper, we consider SCAs with fault tolerance in WSNs. Based on the Universal Generating Function (UGF), we propose a reliability and performance model of SCAs in WSNs, which generalizes the redundancy optimization problem to a multi-state system. Building on this model, we develop an efficient optimization algorithm based on a Genetic Algorithm (GA) to find the optimal structure of SCAs with fault tolerance in WSNs. To examine the feasibility of the algorithm, we evaluate its performance on numerical examples. Furthermore, the interrelationships among reliability, performance and cost are investigated. In addition, an approach for determining the most suitable parameters of the suggested algorithm is proposed.


Introduction
Wireless Sensor Networks (WSNs) are regarded as an integral part of the Internet of Things, where they extend the Internet to the physical world [1,2]. Owing to their low power consumption, low cost and small form factor, WSNs are widely used in Enterprise-IT systems. In order to respond quickly and flexibly to market changes, WSN-based Enterprise-IT systems should be able to adapt their business processes and the underlying software infrastructure [3]. To achieve this goal, organizations have focused on the modeling, analysis and adaptation of business processes since early 2004 [4]. Yet, while Service-Oriented Architecture (SOA) is prospering in Enterprise-IT, WSNs have, despite contrary prognoses, largely not found their way into enterprises.
Parallel to the development of SOA, WSNs are envisioned to become an integral part of the Future Internet, where they extend the Internet to the physical world. In recent years, several approaches have been presented for the seamless integration of WSNs with existing, widely deployed SOA technologies such as XML, Web Services, and Business Process Execution Language (BPEL) to build SCAs in WSNs [5,6]. These research results lay the groundwork for a new class of applications in which all kinds of devices, ranging from simple sensor nodes (SNs) to large-scale application servers, interact to drive business processes in ways not possible before. In this scenario, the data stream from WSNs influences the control flow of business processes in real time, or even triggers some business processes. In these approaches, the entire WSN or every SN can be packaged as WSN services that conform to a Web services technical standard and can be published, located, and invoked across the Web [7]. Thus, these WSN services can be combined into the workflows of SCAs to fulfill specific tasks by way of services composition [8,9].
From the perspective of system structure, the SCAs in WSNs are an abstraction of the distributed software system based on WSNs and running on the Internet. Since WSNs and the Internet are open, dynamic and difficult to control, the SCAs in WSNs differ from traditional software systems in many respects, for example system structure, operation mechanisms, correctness guarantees, development methods and life cycle. Because of their static, closed and controllable running environment, traditional software systems are characterized by finite autonomy, fixed encapsulation, monotonic interaction, tightly coupled structure, and offline evolution. In contrast, the WSN services exist in each SN in the form of active software services. At runtime, SCAs in WSNs therefore show new characteristics that differ from those of traditional software systems, for example flexible evolution, continuous reaction and multi-target self-adaptation.
These new characteristics pose real challenges for researchers attempting to optimize the reliability and performance of SCAs in WSNs [10]. This paper considers the architecture of WSN service systems with fault tolerance (FT), which is shown in Figure 1. As the data resource access and control center in the framework of WSN service systems, the WSN services broker (SB) is closely related to the reliability and performance of the system [11,12]. The SB is deployed in the management server and plays several important roles. Specifically, the SB manages users' service requirements, and dynamically controls the startup, access and sharing of data resources. When a service request is received, the SB maps it into a super-service, which is a logical service in the business logic layer rather than a physical WSN service in the physical layer. Then, the SB divides this super-service into sub-services according to the business rules received from domain experts. Each sub-service represents a certain business operation in the business flow. However, in a real application scenario there are usually no physical WSN services that match these sub-services directly. Therefore, each sub-service must be fulfilled by a services composition comprising a set of physical WSN services, named atom-services (ASs). Through collaboration among these ASs, the user's service request can be fulfilled. The above mapping procedure from a service request to a SCA in WSNs is illustrated in Figure 2. During the execution of a SCA in a WSN, the execution route and the selection of ASs are dynamically determined by the SB according to the running state. In addition, outside SNs can be dynamically added to a WSN at any time.
According to the business flow specification of the user's service requests, as well as some business rules, the ASs corresponding to some of these new SNs may be selected and combined into the SCA during runtime by using the late binding mechanism of services composition technology [13,14]. Therefore, the software model of a SCA in WSNs changes dynamically. We cannot know exactly which ASs are in a SCA, nor their running states and performance indices, until the software has finished running. Consequently, the optimization of reliability and performance differs essentially between SCAs in WSNs and traditional software, and the optimization methods used for traditional software are inapplicable to SCAs in WSNs.
Besides the applicability of optimization methods, computational complexity is another crucial problem. Because of the dynamic variability of the software models, a great number of possible solutions must be evaluated when solving optimization problems for the reliability and performance of SCAs in WSNs. The traditional reliability assessment methods, such as Boolean models, Markov processes and Monte-Carlo simulation, have some disadvantages: they are either suitable only for small-scale systems, or too time-consuming in simulation [15]. Unlike the methods used for traditional software, the optimization methods for SCAs in WSNs pay more attention to flexible mechanisms for measuring, deducing and adapting reliability and performance, based on a summative evaluation of the operation information gathered in an open running environment [16,17].
In addition to the above differences, the SCAs in WSNs face ever-changing user requests, so they must be able to perceive changes in the outside environment and evolve dynamically to adapt to them. In order to provide better reliability and performance to users, the SCAs in WSNs must be adaptive enough to collect various changes in real time and to adjust themselves online at runtime [18,19].
At present, the research on the reliability and performance optimization for SCAs in WSNs is just beginning. In the face of urgent demands for SCAs with high-reliability and high-performance in WSNs in many fields, such as military affairs, precision agriculture, safety monitoring, and environmental monitoring, reliability and performance optimization has become the key to encourage the successful development, application and popularization of SCAs in WSNs [20,21].
To address the above challenges, this paper studies the reliability and performance model of SCAs in WSNs and, on this basis, presents an efficient optimization algorithm for the reliability and performance of SCAs in WSNs based on UGF and GA. The rest of this paper is organized as follows. The reliability and performance model of SCAs in WSNs is presented in Section 2. The formal definitions of the reliability and performance of SCAs in WSNs are proposed based on UGF in Section 3. On this basis, the optimization algorithm based on UGF and GA is presented in Section 4. To illustrate our approach, some numerical examples and their analysis are described in Section 5. Finally, the conclusions and future work are given in Section 6.

Reliability and Performance Model for SCAs in WSNs
Since the service time can take different values, the SCAs in WSNs should be considered as a multi-state system (MSS) [22] with performance depending on combination of states of its elements. In other words, the SCAs in WSNs can have different performance levels corresponding to different combinations of available and failed SNs with different processing speeds and failure rates, as well as their communication channels with different data transmission speeds and failure rates. This paper uses MSS theory to model and analyze the SCAs in WSNs. The next section briefly introduces the MSS theory.
Many real-world systems are composed of multi-state components, which have different performance levels and several failure modes with various effects on the performance of the entire system. Such systems are called multi-state systems (MSSs). The MSS concept was introduced in the mid-1970s in [23]. An MSS can perform its tasks with various distinguished levels of efficiency, usually referred to as performance rates. In other words, an MSS can have a finite number of performance rates [24]. Since a SCA in WSNs consists of different ASs, which have a cumulative effect on the performance of the entire system, it can be considered as an MSS.
The reliability and performance analysis of the SCAs with fault tolerance in WSNs relates to systems for which one cannot formulate an "all or nothing" type of failure criterion [25]. The SCAs with fault tolerance in WSNs are able to perform their task with partial performance (intensity of the task accomplishment). Failures of some system elements, such as some ASs in SCAs or some SNs in WSNs, lead only to the degradation of the system performance [26,27]. In order to model and analyze the SCAs in WSNs, we use MSS theory to define their reliability and performance, which is described in the next section.

Reliability and Performance Definitions for SCAs in WSNs
The MSS behavior is characterized by its evolution in the space of states. Therefore, MSS reliability can be defined as its ability to remain in the acceptable state during the operation period. Since the system functioning is characterized by its output performance G(t) where t is time, the state acceptability depends on the value of this index. In some cases this dependency can be expressed by the acceptability function F(G(t)) that takes non-negative values if and only if the MSS functioning is acceptable. This takes place when the efficiency of the system functioning is completely determined by its internal state.
Much more frequently, the acceptability of the system state depends on the relation between the MSS performance and the desired level of this performance (demand), which is determined outside of the system. In general, the demand W(t) is also a random process. It can take discrete values from the set w = {w1, …, wM}, which is a vector of user's requirement rates wj (j = 1, …, M). The desired relation between the system performance and the demand can also be expressed by the acceptability function F(G(t),W(t)). The acceptable system states correspond to F(G(t),W(t)) ≥ 0, and the unacceptable states correspond to F(G(t),W(t)) < 0. The last inequality defines the MSS failure criterion. In many practical cases, the MSS performance should exceed the demand. In such cases the acceptability function takes the form:

$$F(G(t), W(t)) = G(t) - W(t) \qquad (1)$$

From the users' perspective, the reliability of SCAs in WSNs can be defined as the probability that its performance rates satisfy the user's requirements, which are described by the vector pair (w, q). Here q = {q1, q2, …, qM} is the vector of steady-state probabilities qj = Pr{W = wj} (j = 1, …, M) of the user's requirement rates, where W is a random variable that represents the user's requirement rate. Based on the above definition, the reliability function of SCAs in WSNs under steady state can be defined as:

$$R(t) = \Pr\{T_f \ge t \mid F(G(0), W(0)) \ge 0\} \qquad (2)$$

where Tf is the time to failure, i.e., the time from the beginning of the system life up to the instant when the system first enters the subset of unacceptable states. Therefore, the reliability function R(t) is the probability that Tf is greater than or equal to the value t (t > 0), given that in the initial state (at instant t = 0) the MSS is in one of the acceptable states. Then, the reliability function R(t) under transient state can be defined as:

$$R(t) = \Pr\{F(G(t), W(t)) \ge 0\} \qquad (3)$$

where G(t) is the integral performance rate of SCAs in WSNs.
In the interval [0, T], the reliability function RT of SCAs in WSNs can be defined as:

$$R_T = \frac{1}{T}\int_0^T \Pr\{F(G(t), W(t)) \ge 0\}\,dt \qquad (4)$$

Based on Equation (4), it can be seen that for the discrete random demand with PMF w = {w1, …, wM}, q = {q1, …, qM}, the reliability function of SCAs in WSNs under dynamically changing user's requirements can be defined as:

$$R_T = \sum_{m=1}^{M} q_m \frac{1}{T}\int_0^T \Pr\{F(G(t), w_m) \ge 0\}\,dt \qquad (5)$$

According to Equation (5), the reliability and performance of SCAs in WSNs can be calculated from the probability distribution of the performance rates of the component services, for example the sub-services and ASs shown in Figure 2. In order to calculate the reliability and performance of SCAs in WSNs, we present the probability distribution representation of performance rates for any component service, which is described in the next section.

Probability Distribution of Performance Rates for Any Component Service
According to its performance rates, the component service j within a SCA in WSNs can be in one of kj different states, described by the set $g_j = \{g_{j1}, g_{j2}, \ldots, g_{jk_j}\}$, where gji is the performance rate of component service j in state i, i ∈ {1, 2, ..., kj}. The performance rate Gj(t) of component service j at any time t ≥ 0 is a random variable that takes its values from gj: Gj(t) ∈ gj. The probabilities of the performance rates of component service j in its various states at any time t can be described by the set $p_j(t) = \{p_{j1}(t), p_{j2}(t), \ldots, p_{jk_j}(t)\}$, where pji(t) = Pr{Gj(t) = gji}. Because the component service j is in exactly one of its kj states at any time t, these states form a complete set of mutually exclusive events. Therefore, Equation (6) is satisfied:

$$\sum_{i=1}^{k_j} p_{ji}(t) = 1 \qquad (6)$$

The set of value pairs <gji, pji(t)> thus completely determines the probability distribution of the performance rates of component service j at any time t. Given the probability distributions of the performance rates of all component services, the reliability and performance of the entire SCA can be calculated according to the composite structure by mapping the performance rate space of the component services into that of the entire SCA. In order to achieve this mapping, the structure functions of performance rates are defined in the next section.
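As a minimal sketch of this representation, the value pairs <gji, pji(t)> at one fixed instant t can be stored as a mapping from performance rate to probability, with Equation (6) checked on construction. The function name `make_pmf` and the three-state example values are illustrative assumptions and are not taken from the paper.

```python
from math import isclose

def make_pmf(rates, probs):
    """Pair performance rates g_ji with probabilities p_ji at a fixed time t.

    Raises ValueError if the probabilities violate Equation (6).
    """
    if len(rates) != len(probs):
        raise ValueError("rates and probs must have the same length")
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("state probabilities must sum to 1 (Equation (6))")
    pmf = {}
    for g, p in zip(rates, probs):
        # States that share a performance rate are merged into one entry.
        pmf[g] = pmf.get(g, 0.0) + p
    return pmf

# Hypothetical atom-service with three states: failed, degraded, nominal.
as_pmf = make_pmf([0.0, 2.0, 5.0], [0.1, 0.3, 0.6])
```

Storing the distribution as a dictionary keeps one entry per distinct performance rate, which is also the natural shape for the u-function terms used later.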

Structure Function of Performance Rates for SCAs in WSNs
The structure function of SCAs in WSNs can be defined as follows. Let $L^n$ be the space of possible combinations of performance rates of all component services, and M = {g1, …, gK} be the range of possible performance rates of the SCA. $L^n$ can be defined as:

$$L^n = \{g_{11}, \ldots, g_{1k_1}\} \times \{g_{21}, \ldots, g_{2k_2}\} \times \cdots \times \{g_{n1}, \ldots, g_{nk_n}\}$$

For a SCA consisting of n ASs, the performance rates of the ASs unambiguously determine the performance rate of the SCA. At every moment, each AS has a certain performance rate corresponding to its state, and the states of these ASs determine the state of the SCA. Assume that the SCA has K different states and that gi is the SCA performance rate in state i (i ∈ {1, …, K}). The SCA performance rate is then a random variable that takes values from the set {g1, …, gK}. The transform function ϕ(G1(t), …, Gn(t)): $L^n \to M$, called the structure function, maps the performance rate space of the component services into that of the entire SCA. Hence, the reliability model of SCAs in WSNs can be defined as the tuple ⟨gj, pj(t), 1 ≤ j ≤ n; ϕ(G1(t), …, Gn(t))⟩.
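The mapping performed by the structure function can be illustrated by brute-force enumeration of all combinations of component performance rates. The sketch below assumes statistically independent components and uses the minimum of the component rates as a purely hypothetical structure function; the paper does not prescribe this choice, and `system_pmf` is an illustrative name.

```python
from itertools import product

def system_pmf(component_pmfs, phi):
    """Map every combination of component rates to an SCA rate via phi.

    component_pmfs: list of dicts {performance rate: probability}.
    phi: structure function taking one rate per component.
    """
    result = {}
    for combo in product(*(pmf.items() for pmf in component_pmfs)):
        rates = [g for g, _ in combo]
        prob = 1.0
        for _, p in combo:
            prob *= p  # assumption: components are statistically independent
        g_sys = phi(*rates)
        result[g_sys] = result.get(g_sys, 0.0) + prob
    return result

# Two hypothetical atom-services, each either failed (rate 0) or working.
as1 = {0: 0.1, 4: 0.9}
as2 = {0: 0.2, 3: 0.8}

# Hypothetical structure function: the slower service limits the SCA.
series = system_pmf([as1, as2], lambda *g: min(g))
```

The number of combinations grows as the product of the kj, which is why a direct enumeration like this one is only practical for small examples.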
The structure function of SCAs in WSNs establishes a feasible way to calculate the reliability and performance of the entire SCA using those of component services. In order to efficiently calculate the reliability and performance by using a fast algebraic procedure, the UGF technique is introduced into our model. Based on UGF, the reliability and performance of SCAs in WSNs are defined in the next section.

Reliability and Performance Definition Based on UGF
In this paper, we choose the UGF technique to achieve high efficiency in calculating the reliability and performance of SCAs. The next section gives the reasons for selecting it.
The approach based on the extension of Boolean models is historically the first method that was developed and applied for the MSS reliability evaluation. It is based on the natural expansion of the Boolean methods to the multi-state systems.
The stochastic process methods that are widely used for the MSS reliability analysis are more universal. The methods can be applied only to relatively small MSSs because the number of system states increases dramatically with the increase in the number of system elements.
Even though almost every real world MSS can be represented by the Monte-Carlo simulation for the reliability assessment, the main disadvantages of this approach are the time and expenses involved in the development and execution of the model.
The computational burden is the crucial factor when one solves optimization problems in which the reliability measures have to be evaluated for a great number of possible solutions along the search process. This makes the three methods above problematic for reliability optimization [32]. In contrast, the UGF allows one to find the entire MSS performance distribution from the performance distributions of its elements by using a fast algebraic procedure. Analysts can use the same recursive procedures for MSSs with a different physical nature of performance and different types of element interaction [33]. Therefore, it is fast enough for dynamically changing SCAs in WSNs.
The UGF generalizes the well-known ordinary generating function. Its basic ideas were introduced by Ushakov [34]. It has proved very convenient for numerical realization [35], and it requires relatively small computational resources for evaluating MSS reliability and performance indices. The advantages of the UGF, as well as its computational complexity, were analyzed in detail in [36], and its efficiency was discussed in [37], where it proved both accurate and efficient. Therefore, it can be used in complex reliability and performance optimization problems. Because the relationship between the system state probabilities and the system output performance rates can be expressed explicitly by the UGF, and the UGF of a system can be obtained simply from those of its components, the UGF has proved to be an efficient reliability and performance assessment approach that is suitable for various MSSs. The problem of system reliability and performance analysis usually includes the evaluation of the probability mass function (PMF) of some random values characterizing the system's behavior. These values can be very complex functions of a large number of random variables, and the explicit derivation of such functions is an extremely complicated task. Fortunately, for many types of system the UGF method allows one to obtain the system u-function recursively. This property is based on the associative property of many functions used in reliability engineering. The recursive approach obtains the u-functions of subsystems containing several basic elements, and then treats each subsystem as a single element with the obtained u-function when computing the u-function of a higher-level subsystem. Combining the recursive approach with the simplification technique reduces the number of terms in the intermediate u-functions, and provides a drastic reduction of the computational burden.
For the above reasons, we selected UGF technique to develop an efficient reliability and performance evaluation method for SCAs in WSNs. In order to express the u-functions of reliability and performance of SCAs in WSNs, their UGF definitions are proposed in the next section.

Reliability and Performance Definitions of SCAs in WSNs Based on UGF
Based on the reliability and performance model presented in Section 2, the u-function of the reliability of SCAs in WSNs can be defined according to [24]. The general form of the definition is as follows. The reliability of the entire SCA (or of a component service within a SCA) in WSNs is a random variable X. According to the UGF technique, the probability distribution of performance can be obtained using a formal operator z that resembles the procedure of the product of polynomials. Therefore, its u-function can be defined as:

$$u(z) = \sum_{k=1}^{K} p_k z^{X_k}$$

where the discrete variable X has K possible values and pk is the probability that X takes the value Xk. Based on this definition, the u-function of the reliability of the entire SCA (or one of its component services) in transient state can be expressed as:

$$U(z) = \sum_{k=1}^{K} p_k z^{G_k}$$

where pk = pk(t) is the probability of state k at instant t. Because U(z) relates the performance rates Gk to their state probabilities pk, it describes the probability distribution of the reliability of SCAs (or of a component service) in WSNs. In order to express other indices related to reliability, such as availability, output performance and unfinished performance, we define three performance operators based on the above u-function of reliability.
(1) Availability operator δA: The availability operator δA is defined as the sum of the probabilities of all system states satisfying the condition F(Gk, Wm) ≥ 0. It can be formulated as:

$$\delta_A(U(z), F, W_m) = \sum_{k=1}^{K} p_k \cdot 1(F(G_k, W_m) \ge 0)$$

where 1(·) equals 1 if its argument holds and 0 otherwise.

(2) Output performance operator δG: The output performance operator δG is defined as the sum of the products of each performance rate Gk and its corresponding state probability pk. It can be formulated as:

$$\delta_G(U(z)) = \sum_{k=1}^{K} p_k G_k$$

(3) Unfinished performance operator δU: The unfinished performance operator δU is defined as the sum of the products of the un-acceptability (i.e., the unfinished performance max{−F(Gk, Wm), 0}) and its corresponding state probability pk. It can be formulated as:

$$\delta_U(U(z), F, W_m) = \sum_{k=1}^{K} p_k \max\{-F(G_k, W_m), 0\}$$

Based on the above performance operators, three reliability-related indices for SCAs (or for a component service) in WSNs can be defined as follows.

(1) Availability: The availability is a prediction of the ability of a SCA to perform its designated function with the required performance. It is defined as the sum of the products of the steady-state probability qm and the corresponding probability of satisfying the condition F(Gk, Wm) ≥ 0, i.e., δA(U(z), F, Wm). It can be formulated as:

$$\text{Availability} = \sum_{m=1}^{M} q_m\, \delta_A(U(z), F, W_m)$$

(2) Output performance expectation: The output performance expectation is a prediction of the quality of a future task-related behavior of a SCA in WSNs. It is defined as the sum of the products of each performance rate and its corresponding state probability. It can be calculated by the output performance operator δG:

$$\text{Output performance expectation} = \delta_G(U(z))$$

(3) Unfinished performance requirement: The unfinished performance requirement is a prediction of the risk that a SCA performs its designated function without the required performance. It is defined as the sum of the products of the steady-state probability qm and the corresponding unfinished performance δU(U(z), F, Wm). It can be formulated as:

$$\text{Unfinished performance requirement} = \sum_{m=1}^{M} q_m\, \delta_U(U(z), F, W_m)$$
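The three operators and the indices derived from them can be sketched directly from their verbal definitions. In the following example a u-function is stored as a dictionary mapping each performance rate Gk to its probability pk, and the acceptability function is assumed to be F(G, W) = G − W; the function names and all numerical values are illustrative assumptions.

```python
def delta_A(u, F, w):
    """Availability operator: total probability of states with F(G_k, w) >= 0."""
    return sum(p for g, p in u.items() if F(g, w) >= 0)

def delta_G(u):
    """Output performance operator: sum of p_k * G_k."""
    return sum(p * g for g, p in u.items())

def delta_U(u, F, w):
    """Unfinished performance operator: sum of p_k * max(-F(G_k, w), 0)."""
    return sum(p * max(-F(g, w), 0) for g, p in u.items())

def availability(u, F, demand):
    """Weight delta_A by the steady-state demand probabilities q_m."""
    return sum(q * delta_A(u, F, w) for w, q in demand.items())

def unfinished_performance(u, F, demand):
    """Weight delta_U by the steady-state demand probabilities q_m."""
    return sum(q * delta_U(u, F, w) for w, q in demand.items())

# Hypothetical u-function: U(z) = 0.1 z^0 + 0.3 z^2 + 0.6 z^5
u = {0: 0.1, 2: 0.3, 5: 0.6}
# Hypothetical demand PMF: w = {2, 4}, q = {0.5, 0.5}
demand = {2: 0.5, 4: 0.5}
# Assumed acceptability function: performance minus demand.
F = lambda g, w: g - w

A = availability(u, F, demand)               # about 0.75
E_G = delta_G(u)                             # about 3.6
E_U = unfinished_performance(u, F, demand)   # about 0.6
```

With these values, the demand of 2 is met in the two upper states (probability 0.9) and the demand of 4 only in the top state (probability 0.6), which gives the availability of about 0.75.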

Composite Operators of Reliability and Performance Indices Based on UGF
For a component based system, the overall reliability and performance are determined by all of its components. The UGF technique provides a fast route to obtain the overall reliability and performance from that of the various components. In order to achieve this goal, some composite operators are defined according to the system structure function f (X1, …, Xn) presented in Section 2.3. In other words, the properties of the composite operator strictly depend on the properties of the system structure function. Since the procedure of the multiplication of the probabilities in composite operators is commutative and associative, the entire operator can also possess these properties if the function possesses them.
Based on the reliability and performance definition expressed by UGF for component services in Section 3.2, the u-function composite operators Ω can be designed for various reliability and performance indices in diverse patterns of services composition. By the Ω calculation, the overall system reliability and performance can be worked out based on those of all components.
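Since the procedure of the multiplication of the probabilities in composite operators is commutative and associative, two rules must be satisfied in the design of the u-function composite operators Ω.

(1) Commutativity rule: The commutativity rule can be formulated as follows:

$$\Omega(u_1(z), \ldots, u_k(z), u_{k+1}(z), \ldots, u_n(z)) = \Omega(u_1(z), \ldots, u_{k+1}(z), u_k(z), \ldots, u_n(z))$$

(2) Associativity rule: The associativity rule can be formulated as follows:

$$\Omega(u_1(z), \ldots, u_k(z), u_{k+1}(z), \ldots, u_n(z)) = \Omega(\Omega(u_1(z), \ldots, u_k(z)), \Omega(u_{k+1}(z), \ldots, u_n(z)))$$

According to the above design rules, the generic form of the composite operators Ω can be expressed as:

$$\Omega(U_1(z), U_2(z)) = \Omega\left(\sum_{k} p_k z^{G_k}, \sum_{l} p_l z^{G_l}\right) = \sum_{k}\sum_{l} p_k p_l\, z^{f(G_k, G_l)}$$

where f (Gk, Gl) can be defined according to the reliability and performance indices and the composition structures of the SCAs in WSNs. Based on the UGF technique mentioned above, we propose an efficient reliability and performance optimization algorithm for WSN service systems in the next section.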
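A pairwise composite operator of this generic form can be sketched as a double loop over the terms of two u-functions. The example below stores each u-function as a dictionary from performance rate to probability and uses the minimum as a hypothetical choice of f; it then checks numerically that this particular f yields a commutative and associative operator. The names `compose` and `same` and the example values are assumptions made for illustration.

```python
def compose(u1, u2, f):
    """Composite operator: combine two u-functions term by term."""
    out = {}
    for g_k, p_k in u1.items():
        for g_l, p_l in u2.items():
            g = f(g_k, g_l)                       # exponent f(G_k, G_l)
            out[g] = out.get(g, 0.0) + p_k * p_l  # collect like terms
    return out

def same(u, v, tol=1e-12):
    """True if two u-functions agree term by term up to rounding."""
    return u.keys() == v.keys() and all(abs(u[g] - v[g]) <= tol for g in u)

# Three hypothetical component u-functions.
u_a = {0: 0.1, 4: 0.9}
u_b = {0: 0.2, 3: 0.8}
u_c = {0: 0.05, 5: 0.95}

# With f = min, the order and grouping of the operands do not matter.
commutative = same(compose(u_a, u_b, min), compose(u_b, u_a, min))
associative = same(compose(compose(u_a, u_b, min), u_c, min),
                   compose(u_a, compose(u_b, u_c, min), min))
print(commutative, associative)
```

Collecting like terms after every pairwise composition is what keeps the intermediate u-functions small when the operator is applied recursively over a larger structure.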

Architecture of WSN Service Systems with FT
In order to assure the correctness of observed data, and improve the reliability of SCAs in WSNs, some redundant SNs are deployed in WSN service systems with FT. These redundant SNs compose some sensor clusters according to the observed objects, which is depicted in Figure 1. In other words, the SNs within the same cluster are responsible for the same observed object. From the perspective of the correctness of observed data, the redundant SNs should send the same observed data for the same observed object at the same observation time.
In the architecture of WSN service systems with FT, the SNs within the same sensor cluster are controlled by the same cluster-sink. These cluster-sinks are responsible for receiving and checking the observed data from SNs within their clusters. In order to further reduce the energy consumption, n-version programming (NVP) is introduced into the check mechanism of cluster-sinks in the suggested architecture of WSN service systems with FT.
From Figure 1, one can see that the topology of WSN service systems with FT is a star structure. At every moment, outside SNs (ASs) can be dynamically added to a cluster according to the actual needs without requiring configuration changes. Therefore, the star structure can help WSAs meet the scalability demands adequately. The management server lies in the center of the star topology, which controls the startup, initialization, distribution and recovery of the sinks, cluster-sinks and SNs dynamically.
In this architecture, the sink is responsible for receiving the observed data from the cluster-sinks, and sending it to the management server through a gateway. From the perspective of data processing, the SNs can be considered as resources, because they provide the observed data of target objects to the WSN service system.
The SB is the entrance of a WSN service system for the service requests from users. It is responsible for the mapping from the service requests to the SCAs in WSNs. Figure 2 illustrates this mapping process by the SB, which forms a tree structure with three levels respectively representing the SCAs with different abstract granularities. The top level represents the super-service corresponding to the service request from users; the middle level represents the sub-services composition generated by way of the mapping according to the business rules; the bottom level represents the ASs composition comprising the physical SN services in WSNs.
The services composition application (SCA) in WSNs consists of a set of ASs that should be executed by resources of different types (i.e., SNs of different types). Therefore, when receiving a service request from a user, the SB allocates suitable resources, i.e., SNs, to the initial atom-service (AS) according to the observed object and the type of SNs, and executes this AS. Other ASs require the outputs of preceding ASs as inputs for their execution. The order in which the ASs are executed is determined by the execution logic of the SCA in WSNs. When the results are returned from one or more ASs, the SB passes them to the next ASs as their inputs according to the execution logic, and allocates suitable resources to execute them. When all of the ASs within a SCA in a WSN have been fulfilled, the service request is completely executed and the final result is returned to the user by the SB.
To simplify the problem, we assume that each resource, i.e., each SN, can process only a single AS at a time when it is available. On the other hand, the same AS can be assigned to several resources of the same type, i.e., several SNs within the same sensor cluster, for parallel execution when multiple SNs are responsible for the same observed object. Considering reliability and efficiency, the SB usually allocates multiple SNs to each AS to execute it in parallel. … r5}, ω3 = {r6, r7}, ω4 = {r8, r9}, ω5 = {r10, r11, r12}, ω6 = {r13, r14, r15}, ω7 = {r16, r17, r18}, and ω8 = {r19, r20}, as illustrated in Figure 3. In order to improve the reliability of WSN service systems, a FT model is introduced in the suggested architecture. When the first correct result corresponding to an AS is returned from one of the allocated SNs, the SB marks this AS as finished and cancels the execution of the other SNs allocated to it. The detailed FT model and FT mechanism are proposed in the next section.

FT Model in WSNs Service System
For convenience in the following description, Table 1 lists the notation used in this section.

Table 1. Notation.
Tc: the time used for the entire cluster-sink execution
T: the random task execution time for the entire SCA
w: the maximal allowed system execution time for the entire SCA
F(T, w): the system's acceptability function
R(w): the system's reliability function
…: the probability function of the number of SNs that can be executed simultaneously
ωcb: the cost of SN b used in cluster c
Ω: the entire system cost
Ω*: the maximal allowable system cost

It is assumed that nc functionally equivalent SNs are available for each cluster c in a WSN service system with FT. Each sensor node (SN) i has an estimated reliability rci and a constant observation time τci (the time for sending and transferring data is neglected). Failures of SNs in each cluster are statistically independent, as are the total failures of the different clusters, because each SN runs independently on different hardware units.
The check mechanism presumes that the different SNs in the same cluster first send their observed data to the cluster-sink. Then, the cluster-sink compares the received observation data with each other. If at least kc out of nc outputs agree, the cluster-sink sends one copy of the observed data to the sink. Otherwise, the cluster-sink discards the received observation data and requests a new observation from the SNs.
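The agreement check performed by a cluster-sink can be sketched as a simple vote over the received readings. The sketch assumes that two outputs agree only when they are exactly equal, which is a simplification for illustration (real sensor readings may need a tolerance), and the name `check_outputs` is hypothetical.

```python
from collections import Counter

def check_outputs(outputs, k):
    """Return the agreed value if at least k outputs are identical, else None.

    None stands for the case in which the cluster-sink discards the data
    and asks the SNs for another observation.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= k else None

# Hypothetical readings from three SNs in one cluster, with k_c = 2.
accepted = check_outputs([21.5, 21.5, 22.0], 2)   # two of three agree
rejected = check_outputs([21.5, 22.0, 23.1], 2)   # no two outputs agree
```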
The SNs in each cluster c run on parallel hardware units. The total number of hardware units is hc. The hardware units are independent and identical. The availability of each hardware unit is ac. The number Hc of hardware units available at the moment determines the amount of available computational resources and, therefore, the number Lc of SNs that can be executed simultaneously. In other words, Lc depends on Hc. No hardware unit can change its state during execution.
The SNs in each cluster c start their execution in accordance with a predetermined ordered list. The first Lc SNs in the list start their execution simultaneously (at time zero). If the number of terminated SNs is less than kc, a new SN from the list starts its execution immediately after each SN terminates. If the number of terminated SNs is not less than kc, the cluster-sink compares their outputs after each SN terminates. If kc outputs are identical, the cluster-sink terminates all SNs that are still executing; otherwise a new SN from the list starts immediately.
If fewer than kc identical outputs have been obtained after all nc SNs have terminated, the cluster-sink and the entire WSN service system fail.
If the cluster-sink sends the observed data to the sink successfully, the time Tc used for the entire cluster-sink execution is equal to the termination time of the SN that has produced the kc-th correct output (in most cases, the time needed by the cluster-sink to make the decision can be neglected). The cluster-sink execution time is therefore a random variable that depends on the reliability and execution time of the SNs and on the availability of the hardware units. We assume that if the cluster-sink fails to send the observed data to the sink, its execution time is equal to infinity.
The sum of the random execution time of each cluster-sink gives the random task execution time T for the entire SCA in WSNs. In order to estimate both the system's reliability and its performance, different measures can be used, depending on the application.
In a WSN service system where the execution time of each task is of critical importance, the system's acceptability function is defined as F(T,w) = 1(T < w), where w is the maximal allowed system execution time. In this case the system's reliability R(w) = E(F(T,w)) is the probability that the correct output is produced in a time less than w. The conditional expected system execution time

$$\varepsilon(w) = \frac{E\big(T \cdot 1(T < w)\big)}{R(w)}$$

is considered to be a measure of the system's performance; it gives the SCA's expected execution time given that the system does not fail.
In a WSN service system where the average productivity (the number of executed tasks) over a fixed mission time is of interest, the system's acceptability function is defined as F(T) = 1(T < ∞), the system's reliability is defined as the probability that it produces correct outputs regardless of the total execution time (this index can be referred to as R(∞)), and the conditional expected system execution time ε(∞) is considered to be a measure of the system's performance. Considering the above FT mechanism, the following sections discuss the approach for calculating the reliability and performance of a WSN service system.
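To make the two indices concrete, the sketch below computes R(w) and the conditional expected execution time from a PMF of T. The dictionary representation, the function name and the numbers in the usage example are assumptions made for illustration only; a failed run is stored under an infinite execution time, as assumed in the text.

```python
import math


def reliability_and_performance(pmf, w=math.inf):
    """Return (R(w), eps(w)) for a PMF given as {execution_time: probability}.

    Failure is represented by the key math.inf, so it never satisfies T < w.
    """
    ok = {t: p for t, p in pmf.items() if t < w}
    r = sum(ok.values())                      # R(w) = Pr{T < w}
    if r == 0:
        return 0.0, math.inf                  # no run finishes in time
    eps = sum(t * p for t, p in ok.items()) / r   # E(T | T < w)
    return r, eps


# Hypothetical PMF, for illustration only
pmf_T = {100: 0.5, 200: 0.3, 300: 0.1, math.inf: 0.1}
r_250, eps_250 = reliability_and_performance(pmf_T, w=250)
r_inf, eps_inf = reliability_and_performance(pmf_T)
```

Calling the function without w gives the indices R(∞) and ε(∞) used when the average productivity is of interest.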

Determining the Number of SNs that Can Be Simultaneously Executed
The reliability and performance of a WSN service system are influenced by the number of SNs that can be executed simultaneously. This section discusses how to determine the PMF of the number of SNs that can be simultaneously executed.
The number x of available hardware units in cluster c can vary from 0 to hc. Given that all of the units are identical and have availability ac, the number of available units follows a binomial distribution, so one can easily obtain the probabilities Qc(x) = Pr{Hc = x} for 0 ≤ x ≤ hc:

$$Q_c(x) = \Pr\{H_c = x\} = \binom{h_c}{x} a_c^{x} (1 - a_c)^{h_c - x}$$

The number x of available hardware units determines the number lc(x) of SNs that can be executed simultaneously. Therefore:

$$\Pr\{L_c = l_c(x)\} = Q_c(x)$$

Thus, the pairs <Qc(x), lc(x)> for 0 ≤ x ≤ hc determine the PMF of the discrete random value Lc. Given the PMF of the number of SNs that can be simultaneously executed, the PMF of the execution time for each SN can be determined once the termination time of each SN has been calculated. The next section presents the algorithm used for calculating the termination time of each sensor node.
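The PMF of Lc can be computed directly from the binomial distribution of the available hardware units. The sketch below is a minimal illustration; the function names are hypothetical, and the mapping lc(x) is passed in as a function because the text does not fix its form.

```python
from math import comb


def hardware_pmf(h_c, a_c):
    """Q_c(x) = Pr{H_c = x} for x = 0..h_c (binomial, identical units)."""
    return [comb(h_c, x) * a_c**x * (1 - a_c) ** (h_c - x)
            for x in range(h_c + 1)]


def parallel_sn_pmf(h_c, a_c, l_c):
    """PMF of L_c as {l_c(x): probability}.

    l_c is a caller-supplied function x -> number of SNs that can run
    simultaneously on x available units (its form is an assumption).
    Values of x that map to the same l_c(x) have their probabilities merged.
    """
    pmf = {}
    for x, q in enumerate(hardware_pmf(h_c, a_c)):
        pmf[l_c(x)] = pmf.get(l_c(x), 0.0) + q
    return pmf
```

For example, `parallel_sn_pmf(2, 0.9, lambda x: 2 * x)` models a hypothetical cluster in which every available unit can host two SNs.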

Determining the Termination Time of SN
In each cluster c, the order in which the SNs start their execution is defined by their numbers. This means that each SN i starts its execution not earlier than SNs 1, …, i − 1 and not later than SNs i + 1, …, nc. If the number of SNs that can run simultaneously is lc, we can assume that the SNs run on lc independent processors. Let αm be the time when processor m terminates the execution of an SN and is ready to run the next SN from the list of SNs not yet executed. Given the execution time τci of each SN (1 ≤ i ≤ nc), one can obtain the termination time tci(lc) of each SN i using the following simple algorithm:

1. Assign α1 = α2 = … = αlc = 0.
2. For i = 1, …, nc: find a processor m with the smallest αm, set tci(lc) = αm + τci, and assign αm = tci(lc).

The times tci(lc) (1 ≤ i ≤ nc) correspond to the intervals between the beginning of cluster execution and the moments when the SNs produce their outputs. Observe that SNs that start execution earlier can terminate later: i < y does not guarantee that tci(lc) ≤ tcy(lc). In order to obtain the sequence in which the SNs produce their outputs, the termination times should be sorted in increasing order:

$$t_{cm_1}(l_c) \le t_{cm_2}(l_c) \le \dots \le t_{cm_{n_c}}(l_c)$$

where mj denotes the SN that produces the j-th output. Based on the PMF of Lc, which can be obtained by Equations (19) and (20), and the PMF of $t_{cm_j}(l_c)$, which can be derived by the algorithm in this section, the PMF of the execution time for each SN can be determined. This provides a way to calculate the reliability and performance of each cluster and of the entire system, which is presented in the next section.
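A minimal Python sketch of one way to compute the termination times follows. It assumes that each SN, taken in list order, starts on the processor that becomes free earliest, which is how we read the definition of αm; the function names are hypothetical.

```python
def termination_times(tau, l_c):
    """Termination time of each SN, in list order, on l_c parallel processors.

    tau  : execution times of the SNs in their predetermined order
    l_c  : number of SNs that can run simultaneously (must be >= 1)
    """
    if l_c < 1:
        raise ValueError("at least one SN must be able to run")
    alpha = [0] * l_c                 # alpha[m]: time at which processor m is free
    t = []
    for tau_i in tau:
        # pick the processor that becomes free earliest (first one on ties)
        m = min(range(l_c), key=lambda p: alpha[p])
        alpha[m] += tau_i             # the SN runs from alpha[m] to alpha[m] + tau_i
        t.append(alpha[m])
    return t


def output_order(t):
    """0-based SN indices sorted by increasing termination time."""
    return sorted(range(len(t)), key=lambda i: t[i])


times = termination_times([30, 10, 20, 5], l_c=2)
order = output_order(times)
```

In this made-up example the first SN starts before the second but terminates after it, which is why the termination times have to be sorted before the outputs are compared.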

Determining the Reliability and Performance of Each Cluster and the Entire System
Let $r_{cm_i}$ be the reliability of the SN that produces the i-th output in cluster c. In other words, $r_{cm_i}$ is equal to the probability that this output is correct. Consider the probability, denoted here by R(k, n), that exactly k out of the first n SNs of cluster c succeed. Since the SNs fail independently, this probability can be obtained as:

$$R(k, n) = \sum_{S \subseteq \{1,\dots,n\},\ |S| = k} \ \prod_{i \in S} r_{cm_i} \prod_{i \notin S} \big(1 - r_{cm_i}\big)$$

The cluster c produces the correct output directly after the end of the execution of j SNs (j ≥ kc) if the mj-th SN succeeds and exactly kc − 1 out of the first j − 1 executed SNs succeed. Thus, the probability pcj(lc) of such an event is:

$$p_{cj}(l_c) = r_{cm_j} \, R(k_c - 1,\ j - 1)$$

Observe that pcj(lc) is the conditional probability that the cluster execution time is $t_{cm_j}(l_c)$, given that lc SNs can be executed simultaneously:

$$p_{cj}(l_c) = \Pr\{T_c = t_{cm_j}(l_c) \mid L_c = l_c\}$$

Since the events of successful cluster execution termination for different j and x are mutually exclusive, we can express the probability of cluster c success as (with no available hardware unit, x = 0, no SN can be executed):

$$R_c(\infty) = \Pr\{T_c < \infty\} = \sum_{x=1}^{h_c} Q_c(x) \sum_{j=k_c}^{n_c} p_{cj}\big(l_c(x)\big)$$

Since failure of any cluster constitutes the failure of the entire system, the system's reliability can be expressed as:

$$R(\infty) = \prod_{c=1}^{C} R_c(\infty)$$

From the PMF of the execution time Tc of each cluster c, one can obtain the PMF of the execution time of the entire system, which is equal to the sum of the execution times of the clusters:

$$T = \sum_{c=1}^{C} T_c$$

Given the PMF of the execution time of the entire system, we can evaluate the reliability and performance of an SCA in a WSN based on UGF. On this basis, we can embed this evaluation algorithm in a GA framework for optimizing the system reliability and performance. The optimization of the reliability and performance of SCAs in WSNs based on the UGF technique and the GA framework is proposed in the next section.
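The cluster-level probabilities can be sketched in a few lines of Python. The code below is an illustration under the independence assumption stated earlier, not the authors' implementation: it computes the probability that exactly k of the first n SNs succeed by dynamic programming, and from it the conditional PMF of the cluster execution time for a fixed lc. All function names are hypothetical.

```python
def prob_exactly_k(rs, k):
    """Probability that exactly k of the independent SNs with reliabilities rs succeed."""
    dist = [1.0]                              # dist[s] = Pr{s successes so far}
    for r in rs:
        new = [0.0] * (len(dist) + 1)
        for s, p in enumerate(dist):
            new[s] += p * (1 - r)             # this SN fails
            new[s + 1] += p * r               # this SN succeeds
        dist = new
    return dist[k] if 0 <= k < len(dist) else 0.0


def cluster_time_pmf(r_sorted, t_sorted, k_c):
    """Conditional PMF {time: p_cj} of the cluster execution time for a fixed l_c.

    r_sorted, t_sorted: reliabilities and termination times of the SNs,
    both listed in the order in which the SNs produce their outputs.
    """
    pmf = {}
    for j in range(k_c, len(r_sorted) + 1):
        # the j-th output is correct and exactly k_c - 1 of the earlier ones are
        p = r_sorted[j - 1] * prob_exactly_k(r_sorted[:j - 1], k_c - 1)
        pmf[t_sorted[j - 1]] = pmf.get(t_sorted[j - 1], 0.0) + p
    return pmf


pmf_c = cluster_time_pmf([0.9, 0.8, 0.7], [10, 20, 30], k_c=2)
success = sum(pmf_c.values())
```

The sum of the returned probabilities is the conditional probability that the cluster succeeds for the given lc; weighting it by Qc(x) over x gives the cluster reliability.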

Evaluating the Execution Time Distribution of Clusters
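The execution time distribution of cluster c for a given lc is described by the probabilities pcj(lc) and the corresponding execution times $t_{cm_j}(l_c)$ for kc ≤ j ≤ nc. It can be easily seen that, using the operator $\otimes_{+}$, we can obtain the u-function representing the PMF of the cluster execution time:

$$u_c(z, l_c) = \sum_{j=k_c}^{n_c} p_{cj}(l_c) \, z^{t_{cm_j}(l_c)}$$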

Evaluating the Different Clusters Consecutively Executed on the Same Hardware
Now consider the case where all of the clusters are consecutively executed on the same hardware consisting of h parallel identical modules with the availability a. The number of available parallel hardware modules H is random with PMF Q(x) = Pr{H = x}, 1 ≤ x ≤ h, defined in the same way as in Equation (27).
When H = x, the number of SNs that can be executed simultaneously in each cluster c is lc(x). The u-functions representing the PMF of the corresponding cluster execution times Tc are $u_c(z, l_c(x))$, defined by Equation (32). The u-function $\hat{U}(z, x)$ representing the conditional PMF of the system execution time T (given that the number of available hardware modules is x) can be obtained for any x (1 ≤ x ≤ h) as:

$$\hat{U}(z, x) = \otimes_{+}\big(u_1(z, l_1(x)), \dots, u_C(z, l_C(x))\big) = \prod_{c=1}^{C} u_c\big(z, l_c(x)\big)$$

Having the PMF of the random value H, we obtain the u-function U(z) representing the PMF of T as:

$$U(z) = \sum_{x=1}^{h} Q(x) \, \hat{U}(z, x)$$
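The composition of u-functions used in this section can be sketched as follows. A u-function is represented here as a dictionary {execution time: probability}, so that the sum-composition of two u-functions is a discrete convolution; the data layout and the function names are assumptions made for this illustration.

```python
from collections import defaultdict


def compose_sum(u1, u2):
    """Sum-composition of two u-functions: PMF of the sum of two independent times."""
    out = defaultdict(float)
    for t1, p1 in u1.items():
        for t2, p2 in u2.items():
            out[t1 + t2] += p1 * p2           # z^t1 * z^t2 = z^(t1 + t2)
    return dict(out)


def system_u_function(cluster_u, Q):
    """u-function of the system execution time T.

    cluster_u[x] : list of per-cluster u-functions given that H = x
    Q[x]         : Pr{H = x}
    """
    total = defaultdict(float)
    for x, q in Q.items():
        u_hat = {0: 1.0}                      # neutral element of the composition
        for u_c in cluster_u[x]:
            u_hat = compose_sum(u_hat, u_c)   # conditional u-function for H = x
        for t, p in u_hat.items():
            total[t] += q * p                 # mix over the PMF of H
    return dict(total)
```

Because failed runs are simply left out of each dictionary, the probabilities of the resulting u-function sum to the system reliability R(∞).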

Optimizing the Structure of SCAs in WSNs
When a SCA with FT in WSNs is designed, one has to select SNs for each cluster and find the sequence of their execution in order to achieve the greatest system reliability subject to cost constraints. The SNs are selected from a list of the available products. Each SN can be characterized by its reliability, execution time, and cost. The total cost of the system is defined according to the cost of its SNs. For each SN, its cost may be a purchase cost (if the SN or its data observation is provided by a commercial service). It also may be a comprehensive cost based on the SN's size, complexity, and performance.
Assume that Bc functionally equivalent SNs are available for each cluster c and that the number kc of the SNs that should agree in each cluster is predetermined. The choice of the SNs and the sequence of their execution in each cluster determine the system's entire reliability and performance.
The permutation x*c of Bc different integers ranging from 1 to Bc determines the order of the SNs that can be used in cluster c. Let ycb = 1 if SN b is selected to be included in cluster c and ycb = 0 otherwise. The binary vector yc = {yc1, …, ycBc} determines the set of SNs selected for cluster c, and the vector xc lists the selected SNs in their order of execution. The system structure optimization problem can now be formulated as follows: find vectors xc for 1 ≤ c ≤ C that maximize R(w) subject to the cost constraint:

$$\Omega = \sum_{c=1}^{C} \sum_{b=1}^{B_c} \omega_{cb} \, y_{cb} \le \Omega^{*}$$

where ωcb is the cost of SN b used in cluster c, Ω is the entire system cost and Ω* is the MAX allowable system cost. Note that the length of the vectors xc can vary depending on the number of SNs that are selected. In order to encode the variable-length vectors xc in the GA using constant-length integer strings, one can use strings of length Bc + 1 containing permutations of the numbers 1, …, Bc, Bc + 1. The numbers that appear before Bc + 1 determine the vector xc. For example, for Bc = 5 the permutations (2,3,6,5,1,4) and (3,1,5,4,2,6) correspond to xc = (2,3) and xc = (3,1,5,4,2), respectively. Any possible vector xc can be represented by a corresponding integer substring containing a permutation of Bc + 1 numbers. By combining the C substrings corresponding to the different clusters, one obtains the integer string a that encodes the entire system structure.
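The decoding of one cluster substring can be illustrated with a short Python sketch. The function names and the cost values in the usage example are hypothetical; only the two permutations are taken from the text.

```python
def decode_cluster(perm, B_c):
    """Return x_c: the SNs listed before the separator B_c + 1 in the permutation."""
    if sorted(perm) != list(range(1, B_c + 2)):
        raise ValueError("perm must be a permutation of 1..B_c+1")
    return tuple(perm[:perm.index(B_c + 1)])


def sequence_cost(x_c, cost):
    """Cost contribution of one cluster: sum of the costs of its selected SNs."""
    return sum(cost[b] for b in x_c)


x1 = decode_cluster((2, 3, 6, 5, 1, 4), B_c=5)
x2 = decode_cluster((3, 1, 5, 4, 2, 6), B_c=5)

# Hypothetical SN costs for one cluster, for illustration only
cost_c = {1: 4, 2: 6, 3: 5, 4: 7, 5: 3}
omega_c = sequence_cost(x1, cost_c)
```

Summing `sequence_cost` over all clusters gives the system cost Ω that is compared with Ω* in the cost constraint.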
An encoding method is used in which a single permutation defines the sequences of the SNs selected in each of the C clusters. The solution encoding string is a permutation of $\sum_{c=1}^{C} (B_c + 1)$ numbers.
In order to examine the feasibility of our algorithm for SCAs with FT in WSNs, some experiments have been performed, which are presented in the next section.

Experiments and Analysis
Consider a SCA with FT in a WSN, which consists of five clusters running on fully available hardware. The parameters of the SNs that can be used in these clusters are described in Table 2. From this table, one can see that there are six SNs in cluster 1, five SNs in cluster 2, eight SNs in cluster 3, four SNs in cluster 4, and five SNs in cluster 5. This table contains the values of Lc and kc for each cluster and the execution time τ, cost c, and reliability r for each sensor node.

Experimental Environment
To investigate the efficiency and performance of the suggested algorithm, we developed a parallel GA program based on MATLAB® Distributed Computing Server (MDCS) (The MathWorks, Inc., Natick, MA, USA) and Parallel Computing Toolbox (PCT) (The MathWorks, Inc., Natick, MA, USA). A cloud computing platform based on an IBM PureFlex® cluster with six blade servers was used for the experimental analysis. On this platform, eighteen virtual machines were built to search for the optimal solution in parallel in our GA program, as shown in Figure 4: 16 parallel process nodes (called workers in PCT), one MDCS node, and one master node. The 16 worker nodes undertake the parallel computation of the UGF of each SN, each cluster and the entire SCA, as in Algorithm 1 and Algorithm 2. The MDCS node is responsible for assigning computing tasks to the worker nodes and receiving their results. The master node prepares the input data and controls the parallel computation.
Based on the above experiment platform, we compared and analyzed the dependencies among the optimal reliability, the given expected execution time, and the MAX allowable system cost, as well as the robustness of the suggested algorithm, which are described in the following sections.

Experimental Analysis
To investigate how the reliability R(w*) of an SCA in WSNs changes with the given expected execution time w* under given cost constraints Ω*, a set of experiments was designed. In the suggested GA, the population size is set to 500 chromosomes and the maximum number of generations to 100. The size of the optimal chromosome pool is dynamically increased from 0.5% to 5% of the population size as the number of generations increases. The crossover probability is set to 0.7, the mutation probability to 0.3, and the penalty factor of reliability to 100. Each experiment is repeated 30 times. The experiments observe the search process for the reliability function R(w*) along with the execution time w under the two given expected execution times (w* = 250 and w* = 300) and the four given cost constraints (Ω* = 160, Ω* = 140, Ω* = 120 and Ω* = 100).
Given the two expected execution times (w* = 250 and w* = 300), two sets of solutions were obtained using the suggested algorithm, which are described in Table 2. For each value of w*, four different solutions (i.e., optimal execution sequences of SNs) were obtained for the four given cost constraints (Ω* = 160, Ω* = 140, Ω* = 120 and Ω* = 100). The tables contain the minimal possible system execution time Tmin, the maximal possible system execution time Tmax, the system cost Ω and the reliability R(w*) for each solution, the conditional expected execution time ε(∞), and the corresponding execution sequences of the selected SNs.
Under each constant constraint on the MAX allowable system cost Ω*, the trend of the total system cost Ω and the optimal reliability R(w*) with respect to the execution time w was investigated. Comparing the total system cost Ω and the reliability R(w*) of the optimal solutions corresponding to w* = 250 and w* = 300 in Table 2, it can be seen that, for the same value of Ω*, the total system cost Ω and the reliability R(w*) of the optimal solution for w* = 300 are always equal to or greater than those for w* = 250.
Furthermore, under a given expected execution time w*, the trend of the optimal reliability R(w*) with respect to the MAX allowable system cost Ω* was investigated. The optimal R(w*) was found for Ω* varying from 100 to 160 in increments of 20. Comparing the total system cost Ω and the reliability R(w*) of the optimal solutions corresponding to the four different MAX allowable system costs in Table 3, it can be seen that, for the same value of w, the total system cost Ω and the reliability R(w*) of the optimal solution for a larger Ω* are always equal to or greater than those for a smaller Ω*. From Table 2, it can also be seen that the system reliability gradually increases as Ω* grows.
Furthermore, the relationship between reliability and cost is investigated. The cost-reliability curves for costs Ω from 80 to 240 under the two user-given expected execution times w* = 250 and w* = 300 are presented in Figure 5. Each point on these curves corresponds to the best solution obtained by the suggested algorithm. It can be seen that the greater the reliability level achieved, the greater the cost of further reliability improvement. In other words, a greater reliability level requires more SNs. A designer can read from Figure 5 the points that meet the reliability requirements and hence find the corresponding cost. On this basis, a decision on a reasonable number of SNs can be made. In this way, the structure of a WSN service application system can be further optimized while satisfying the reliability requirements.

Figure 5. The cost-reliability curves with alterations in the cost from 80 to 240 under the two expected execution times w* = 250 and w* = 300 given by user.
In addition, under each constant constraint on the MAX allowable system cost Ω*, the trend of the reliability function R(w*) with respect to the execution time w was investigated. The curves of R(w*) for execution times w from 160 to 310, under the four constraints on the MAX allowable system cost and the two user-given expected execution times w* = 250 and w* = 300, are shown in Figures 6 and 7, respectively. It can be seen that the values of the reliability function R(w*) improve gradually as w grows.
To investigate the combined effect of the constraint on the MAX allowable system cost Ω* and the expected execution time w* on the system reliability of feasible solutions, the above experimental results are shown as 3D images in Figures 8 and 9, respectively. From these figures, one can see that both Ω* and w* influence the system reliability of feasible solutions, and that the constraint on Ω* plays a more important role in increasing it than the expected execution time w*.
The above experimental analysis indicates that selecting suitable Ω* and w* helps to improve the reliability of SCAs in WSNs and to cut down their cost. In the next section we present a distinct approach to selecting the most suitable Ω* and w* for the designers of SCAs in WSNs.
To investigate the scalability of the proposed algorithms, we ran a series of experiments on the above cloud computing platform, with the number of clusters in an SCA system growing from 10 to 50 in steps of 5. Each cluster randomly contains 8 to 10 SNs. For each cluster, Lc is set to a random number from 4 to 8 and kc to a random number from 6 to 8. For each SN, the execution time τ is set to a random number from 10 to 50, the cost c to a random number from 3 to 10, and the reliability r to a random number from 0.80 to 0.99. The parameters of the GA program are the same as in the experiment above. By inserting a pair of timers in the GA program, the exact algorithm execution time (not including data preparation time and task allocation time) is obtained. For each number of clusters, we ran the GA program 20 times and calculated the mean algorithm execution time. Figure 10 shows how the mean algorithm execution time changes as the number of clusters increases from 10 to 50. From Figure 10 one can see that the algorithm execution time gradually rises as the number of clusters increases. In the two separate stages in which the number of clusters increases from 10 to 30 and from 35 to 50, the algorithm execution time grows slowly (for ease of description, these two stages are referred to as the first Slow Growth Stage and the second Slow Growth Stage). However, in the stage in which the number of clusters increases from 30 to 35, the algorithm execution time grows fast (this stage is referred to as the Fast Growth Stage). By investigating the parallel task allocation of the MDCS node, we found that in the first Slow Growth Stage the computing tasks of each worker node were not blocked in the task queue, which indicates that the computational load of each worker node is appropriate.
All computing tasks assigned to a worker node can be completed sequentially without waiting. However, in the Fast Growth Stage, congestion began to appear in the task queues of the worker nodes as the number of clusters increased further. The computing tasks of the worker nodes must wait for the completion of the tasks in front of them, which results in a fast growth of the algorithm execution time. After this, as the number of clusters continues to increase, the execution of the algorithm enters a new stage, the second Slow Growth Stage, due to the load balancing across the 16 worker nodes. Based on the above analysis, the two Slow Growth Stages indicate that the proposed algorithm scales well.
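The setup of such a scalability experiment can be sketched in Python as follows. This is only an illustration of the parameter ranges quoted above: the uniform distributions, the integer-valued τ and cost, the dictionary layout and the function names are assumptions, and the GA itself is represented by an arbitrary callable.

```python
import random
import time


def random_instance(num_clusters, seed=None):
    """Random SCA instance using the parameter ranges of the scalability experiment."""
    rng = random.Random(seed)
    clusters = []
    for _ in range(num_clusters):
        n_sn = rng.randint(8, 10)                     # 8 to 10 SNs per cluster
        clusters.append({
            "L": rng.randint(4, 8),                   # L_c
            "k": rng.randint(6, 8),                   # k_c
            "sns": [{"tau": rng.randint(10, 50),      # execution time
                     "cost": rng.randint(3, 10),      # cost
                     "r": rng.uniform(0.80, 0.99)}    # reliability
                    for _ in range(n_sn)],
        })
    return clusters


def mean_runtime(run, repeats=20):
    """Mean wall-clock time of run() over several repetitions (a pair of timers per run)."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        total += time.perf_counter() - start
    return total / repeats
```

Instance generation happens outside `mean_runtime`, which mirrors the exclusion of data preparation time from the measured algorithm execution time.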

A Distinct Approach to Selecting the Most Suitable Ω* and w*
To help designers of SCAs in WSNs select the most suitable Ω* and w*, the curves of the reliability function R(w*) for execution times w from 160 to 310, under cost constraints Ω* from 100 to 160 and the two given expected execution times w* = 250 and w* = 300, are shown in Figure 11.

Figure 11. The values of reliability function R(w*) with changes in the execution time w from 160 to 310 under the given cost constraint on Ω* with changes from 100 to 160 and on the two given expected execution times (w* = 250 and w* = 300).
On the basis of the experimental analysis in the previous section, we present a distinct method for designers of SCAs in WSNs to select the most suitable Ω* and w* based on Figure 11. For a reliability requirement given by a user, we draw a horizontal auxiliary line at the required reliability value R*. The intersection between the horizontal auxiliary line and the reliability curves forms multiple shaded areas. The points falling into the shaded areas represent the feasible solutions subject to R(w*) ≥ R*. From Figure 11, one can see that not all curves of the reliability function R(w*) intersect the horizontal auxiliary line, which indicates that only some of the solutions meet the given reliability requirement in this example. Specifically, five sets of Ω* and w* satisfy the given reliability requirement: (w* = 250, Ω* = 160), (w* = 250, Ω* = 140), (w* = 300, Ω* = 160), (w* = 300, Ω* = 140) and (w* = 300, Ω* = 120). The set (w* = 300, Ω* = 120) is the most suitable for users who are more concerned about cost, since it has the lowest cost bound. By contrast, the set (w* = 250, Ω* = 140) is the most suitable for users who are more concerned about system performance. In addition, the reliability corresponding to the set (w* = 300, Ω* = 160) is higher than that of the other sets when w > 240; it is therefore the most suitable for users who are more concerned about system reliability.
Using the approach suggested above, designers can easily find which sets of Ω* and w* meet the reliability requirements of users, and which set is the most suitable for each type of user.
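The selection step can also be expressed programmatically. The sketch below filters candidate pairs (w*, Ω*) by a reliability requirement R* and picks the cheapest and the fastest feasible pair. The reliability values are placeholders invented for this illustration, not results from the experiments, and the function name is hypothetical.

```python
def feasible_settings(results, r_star):
    """Pairs (w*, Omega*) whose reliability meets r_star, cheapest first."""
    ok = [key for key, r in results.items() if r >= r_star]
    return sorted(ok, key=lambda key: (key[1], key[0]))   # by cost bound, then time


# Placeholder reliabilities keyed by (w*, Omega*); NOT measured values
results = {
    (250, 100): 0.70, (250, 120): 0.80, (250, 140): 0.91, (250, 160): 0.93,
    (300, 100): 0.75, (300, 120): 0.90, (300, 140): 0.94, (300, 160): 0.96,
}

feasible = feasible_settings(results, r_star=0.90)
cheapest = feasible[0]                                    # lowest cost bound
fastest = min(feasible, key=lambda key: (key[0], key[1])) # shortest w*, then cheapest
most_reliable = max(feasible, key=lambda key: results[key])
```

With these placeholder numbers the cost-oriented, performance-oriented and reliability-oriented choices come out as three different pairs, which is the kind of trade-off a designer reads off the auxiliary line.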
In order to better display the efficiency of the suggested algorithm, the selection of the most suitable Ω* and w* is shown in the form of 3D images in Figures 12 and 13, respectively. Unlike in Figure 11, the reliability requirement R* is not a horizontal auxiliary line but an auxiliary plane. The auxiliary plane intersects the surfaces of the system reliability in Figures 8 and 9, respectively. The most suitable Ω* and w* are located on the secant formed by the auxiliary plane and the surface of the system reliability. After carefully balancing the cost and the execution time, designers can find which sets of Ω* and w* meet the reliability requirements of users.
Generally, based on the most suitable set of Ω* and w* found by the suggested approach, the optimal structure of SCAs with fault tolerance in WSNs, i.e., the SNs in each cluster and their execution sequence, can be found using the suggested algorithms, which provide system reliability and performance that are as high as possible under a given cost constraint proposed by users.
The suggested algorithms and approach can be easily implemented in software. Furthermore, they are efficient, because a fast algebraic procedure is used to find the performance distribution of the entire WSN service system from those of the SNs on which the WSN service runs. Therefore, they can also be used in online optimization.

Conclusions
Traditional reliability and performance optimization methods, such as Markov models and state space analysis, have several defects: they are too time-consuming, are prone to state space explosion, and rely on unsatisfactory assumptions of component execution independence. Therefore, they are inapplicable to the ever-changing SCAs in WSNs. In this paper, a novel reliability and performance optimization model based on a multi-state system (MSS) for WSN service systems is proposed, which removes the limitation of component execution independence and better fits the actual execution of SCAs in WSNs. Based on UGF and GA, an efficient optimization algorithm for the reliability and performance of SCAs with fault tolerance in WSNs is presented, which eliminates the risk of state space explosion and provides the system with reliability and performance that are as high as possible under a given cost constraint proposed by users. The suggested algorithms and approach can be used to optimize the reliability and performance of SCAs in WSNs at both the design phase and the execution phase.