Monte Carlo Based Statistical Model Checking of Cyber-Physical Systems: A Review

The ever-increasing deployment of autonomous Cyber-Physical Systems (CPSs) (e.g., autonomous cars, UAVs) exacerbates the need for efficient formal verification methods. In this setting, the main obstacle to overcome is the huge number of scenarios to be evaluated. Statistical Model Checking (SMC) is a simulation-based approach that holds the promise to overcome this obstacle by using statistical methods to sample the set of scenarios. Many SMC tools exist, and they have been reviewed in several works. In this paper, we overview Monte Carlo-based SMC tools in order to provide selection criteria based on Key Performance Indicators (KPIs) for the verification activity (e.g., minimize verification time or cost), as well as on the environment features, the kind of system model, the language used to define the requirements to be verified, the statistical inference approach used, and the algorithm implementing it. Furthermore, we identify open research challenges in the field of SMC tools.


Introduction
The ever-increasing deployment of autonomous Cyber-Physical Systems (CPSs) [1] (e.g., autonomous cars, Unmanned Autonomous Vehicles) exacerbates the need for efficient formal verification methods [2]. In this setting, the huge number of scenarios to be evaluated (scenario explosion) is the main obstacle to overcome. Statistical Model Checking (SMC) [3] holds the promise to overcome this obstacle by using statistical methods to sample the set of scenarios.
SMC algorithms use a simulation-based approach and statistical inference over observations in order to establish the correctness of the System Under Verification (SUV) within statistical confidence bounds. Statistical inference can be applied for two different purposes: (i) Hypothesis Testing (HT), which allows us to infer whether the observations provide statistical evidence of the satisfaction of a property defined over the system; (ii) estimation, which allows us to compute approximate values for some system parameters, whose probability distribution has to be known a priori.
Many SMC tools have been developed, differing in how they carry out simulations and in how many simulations they perform. The simulation runs affect the performance of each tool, which varies significantly according to several features. As a consequence, it is hard to know a priori the tool best suited for the verification problem at hand. In this work, we overview a set of the Monte Carlo based SMC tools currently available for research purposes, in order to provide selection criteria based on Key Performance Indicators (KPIs) concerning: the verification activity (e.g., minimize verification time or cost); the environment; the kind of system model; the property specification language; the statistical inference approach; and the algorithm that implements it. For example, if the verification KPI is to ease the parallelization of the simulation activity, our analysis allows us to identify the tools that compute beforehand the number of simulation runs needed to attain a given statistical confidence.

Figure 1. Simulation-based verification setting. Signal u models exogenous inputs, signal x models the SUV state, and signal y models the verification output.
For verification purposes, we need to model both the set of operational scenarios (SUV environment) and the SUV itself. This can be done by considering deterministic systems with stochastic inputs (the typical Control Engineering approach) or by considering stochastic systems (the typical Computer Science approach). Since from a uniformly distributed random variable in the real interval [0, 1] we can obtain a random variable with any specific (possibly dependent on the SUV state) distribution (e.g., as in Chapter 2 of [8]), we can easily transform one modelling approach into the other. Accordingly, for example, a Discrete-Time Markov Chain M can be modelled with a Discrete-Time Dynamical System S with a uniformly distributed stochastic input, from which we obtain the state-dependent transition probabilities of M using techniques like those in [8]. Finally, from a practical point of view, we note that simulators generate random numbers using a pseudo-random number generator. Thus, both modelling approaches lead to the same implementation. Accordingly, for our purposes, we will consider them equivalent.
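For instance, the reduction of a DTMC to a deterministic discrete-time system driven by uniform inputs can be sketched via inverse-transform sampling. The chain below and all names are our own toy illustration, not taken from [8] or from any tool:

```python
import random

# Hypothetical 3-state DTMC: each state maps to (successor, probability) pairs.
P = {
    "a": [("a", 0.5), ("b", 0.3), ("c", 0.2)],
    "b": [("a", 0.1), ("b", 0.9)],
    "c": [("c", 1.0)],
}

def step(state, u):
    """Deterministic transition function of S: the uniform input
    u in [0, 1) selects the successor by inverse-transform sampling,
    recovering the state-dependent transition probabilities of M."""
    acc = 0.0
    for succ, prob in P[state]:
        acc += prob
        if u < acc:
            return succ
    return P[state][-1][0]  # guard against floating-point rounding

def simulate(x0, horizon, rng):
    """Run S on i.i.d. uniform inputs drawn from a pseudo-random
    number generator, as a simulator would."""
    trace = [x0]
    for _ in range(horizon):
        trace.append(step(trace[-1], rng.random()))
    return trace

trace = simulate("a", 5, random.Random(42))
```

Replaying the same pseudo-random seed reproduces the same trace, which is exactly why the two modelling approaches coincide at the implementation level.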
Many statistical model checkers work in a black box fashion by interfacing with widely used commercial as well as open source simulators for the design and the verification of CPSs in many application domains (e.g., automotive, avionics, IoT). Examples of such simulators are: Simulink [9], Dymola [10], SimulationX [11], Wolfram SystemModeler [12], MWorks [13], and Open Modelica [14]. Because the SUV model is not visible to such black box statistical model checkers (they just run the simulator and check its outputs), we will not make any distinction between synchronous and asynchronous systems for them. Of course, the situation is quite different for white box model checkers. They typically address this problem by defining probabilistic or non-deterministic schedulers, as done, for example, in PRISM [15] or SMV [16], e.g., following the approach in Chapter 2 of [17]. Using such an approach, white box model checkers can model both synchronous as well as asynchronous systems.
Because most statistical model checkers work in a black box fashion with simulators, here, we will follow the Control Engineering approach and model our SUV as a deterministic system with stochastic inputs.

Modelling the Environment
The environment (Figure 1) for the SUV defines the set of operational scenarios our SUV is supposed to withstand. In other words, an environment defines the set of admissible inputs for our SUV. Such inputs are usually named exogenous or uncontrollable inputs, since they are not under the control of the SUV. Typical examples of uncontrollable inputs are faults, changes in system parameters, and inputs from users or from other systems.
In this section, we formalize the notion of space of input functions for a system. Such a space defines the system environment, i.e., the set of admissible operational scenarios for our SUV. To this end, first of all, we define the set of time instants, the time set (Definition 1). In the following, as usual, the notation B^A denotes the set of functions from A to B.

Definition 1 (Time set).
A time set T is a subgroup of (R, +).
For example, when considering discrete-time systems, T will be the set Z of all integer numbers, whereas, for continuous-time systems, T will be the set R of all real numbers.

Definition 2 (Space of input functions).
Given a set U (of input values) and a time set T, a space of input functions U on (U, T) is a subset of U^T = {u | u : T → U}.

This allows us to easily model application domain-dependent probability distributions on the SUV input functions as well as stochastic sampling strategies on the SUV input functions.
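As a purely illustrative instance of Definition 2, the following sketch samples operational scenarios from a hypothetical space of piecewise-constant input functions over a discrete time set; all parameters (value set, dwell times) are our own choices:

```python
import random

def sample_scenario(rng, horizon, values=(-1.0, 0.0, 1.0), min_dwell=3):
    """Sample one operational scenario: a piecewise-constant input
    function u : {0, ..., horizon - 1} -> values, where each input
    value is held for at least min_dwell time steps. Sampling this way
    defines a probability distribution on a subset of U^T."""
    u = []
    while len(u) < horizon:
        value = rng.choice(values)
        dwell = rng.randint(min_dwell, 2 * min_dwell)
        u.extend([value] * dwell)
    return u[:horizon]

scenario = sample_scenario(random.Random(7), horizon=20)
```

Restricting the dwell times or the value set makes the environment less liberal, directly trading assurance for computational cost as discussed below.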
The above approach is quite convenient since, in a simulation-based verification setting, the simulation model is typically deterministic and stochastic behavior stems from uncontrollable events.
Of course, the more liberal the environment model, the larger the set of operational scenarios under which our SUV will be verified. On the other hand, the more liberal the environment model, the heavier the computational load entailed by the verification activity. Accordingly, verification experts typically seek a trade-off between the degree of assurance (i.e., to which extent all meaningful operational scenarios have been considered) and the computational cost (and, thus, time cost) of the verification activity.

Modelling the SUV
We mainly focus on the case where the System Under Verification (SUV) is a Cyber-Physical System (CPS), i.e., a system consisting of both hardware and software components. CPSs can be found in basically any domain. Examples are: biomedical devices, aerospace (e.g., airplanes, Unmanned Autonomous Vehicles (UAVs), satellites), automotive, smart grids, etc. Indeed, many CPSs are safety- or mission-critical systems which, through software components, are endowed with some degree of autonomous behavior. From this stems the need for a thorough verification of the correctness of the software controlling the CPS. This motivates our focus on formal verification of CPSs.
From a formal point of view, a CPS is basically a dynamical system whose state variables can undergo continuous as well as discrete changes. Typically, continuous dynamics model the physical components of a CPS, whereas the discrete dynamics model the software components.
To formalize our notion of system, the following definition will be useful.

Definition 3 (Restriction).
Let I be a time interval (i.e., an interval I ⊆ T). Given a function u ∈ U^I (see Definition 2) and two time instants t1 ≤ t2, we denote with u|[t1,t2) the restriction of u to the interval [t1, t2), i.e., the function u|[t1,t2) : [t1, t2) → U such that u|[t1,t2)(t) = u(t) for all t ∈ [t1, t2).

The above considerations lead us to model (Definition 4) our SUV as a dynamical system, along the lines of [18].
Definition 4 (Dynamical system).
A dynamical system H is a tuple (X, U, Y, T, U, ϕ, ψ), where:
• X, the space of state values of H, is a non-empty set whose elements are called states of H;
• U, the space of input values of H, is a non-empty set whose elements are called input values for H;
• Y, the space of output values of H, is a non-empty set whose elements are called output values for H;
• T is a time set;
• U, the space of input functions of H, is a non-empty subset of U^T;
• ϕ and ψ are, respectively, the transition function and the output function of H.
Function ϕ satisfies the following property:
• Causality. For all t0 ∈ T, t ≥ t0, x0 ∈ X, u, u′ ∈ U: if u|[t0,t) = u′|[t0,t), then ϕ(t, t0, x0, u) = ϕ(t, t0, x0, u′).

In the following, unless otherwise stated, H denotes the tuple (X, U, Y, T, U, ϕ, ψ). From a formal point of view, a CPS is just a dynamical system where T is the set of nonnegative real numbers and state variables may undergo continuous (to model the physical part of the system) as well as discrete (to model the software part of the system) changes. This typically means that ϕ is not continuous (but, typically, may be assumed continuous almost everywhere). Definition 4 is general enough to encompass discrete as well as continuous-time systems and finite as well as infinite-state systems.
We refer the reader to Section 1 of [18] for examples of how familiar systems fit Definition 4. In particular, it is shown how, from a state-based description of the system dynamics (e.g., through Ordinary Differential Equations (ODEs) for continuous-time systems or through recurrence equations for discrete-time systems), we can compute the transition function ϕ. In this respect, we note that the transition function ϕ is typically computed by simulation using numerical techniques (e.g., see [19]), since it is seldom available in a closed form (except, e.g., for linear, time-invariant systems).
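As an illustration, the sketch below approximates ϕ for a simple continuous-time system via forward Euler integration, the simplest numerical technique of the kind referred to in [19]; the example system and step size are our own choices:

```python
def phi_euler(f, t0, x0, u, t, dt=1e-3):
    """Approximate the transition function phi(t, t0, x0, u) of a
    continuous-time system x'(tau) = f(x(tau), u(tau)) by forward
    Euler integration with step dt."""
    x, tau = x0, t0
    while tau < t:
        h = min(dt, t - tau)       # do not step past the target time t
        x = x + h * f(x, u(tau))   # Euler update along the vector field
        tau += h
    return x

# Toy linear time-invariant system x' = -x + u with constant input u = 1;
# from x0 = 0 the exact solution is x(t) = 1 - exp(-t).
x_t = phi_euler(lambda x, v: -x + v, 0.0, 0.0, lambda tau: 1.0, 2.0)
```

For this linear system ϕ is known in closed form, so the numerical error of the sketch can be checked directly; for general nonlinear CPS models, simulation is the only way to evaluate ϕ.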

Modelling the Specifications
A specification (Figure 1) defines the requirements that our SUV should satisfy for all operational scenarios. In our simulation-based setting, a specification is also defined through a system, as outlined below.
Notation 1 (Product of input spaces). Let T be a time set and U ⊆ U^T, V ⊆ V^T be input spaces. By abuse of language, we denote with U × V the space of input functions on (U × V, T) consisting of the pairs (u, v) with u ∈ U and v ∈ V, regarded as functions mapping each t ∈ T to (u(t), v(t)).

A specification takes, as input, the SUV input (operational scenario) and state, and returns a real value assessing the extent to which requirements are satisfied. This is formalized in Definition 5.

Definition 5 (CPS Specification).
Let H = (X, U, Y, T, U, ϕ, ψ) be a system. A specification for H is a system Q = (Z, U × X, R, T, U × X^T, µ, θ), where:
• Z is the space of state values of the system Q;
• U × X and R are, respectively, the space of input values and the space of output values of Q;
• µ and θ are, respectively, the transition function and the output function of Q.

Of course, in general, the function ϕ is not available in a closed form and it can only be computed through simulation (e.g., using simulators like Simulink, Dymola, or Open Modelica).
Example 1 (Example of specification).
Suppose that we want to check that the output from H is always close enough to u³. This could be assessed by computing the Root Mean Squared Error (RMSE) of x with respect to u³. This leads to the specification Q = (Z, U × X, R, T, U × X^T, µ, θ), where Z = U = X = T = R and the space of input functions of Q consists of pairs of functions (u, x), where u is a constant function and x maps reals to reals.
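In discrete time, the RMSE computation underlying such a specification can be sketched as follows; the reference trace r and the tolerance tol are illustrative assumptions of ours, not taken from the original example:

```python
def rmse(x, r):
    """Root Mean Squared Error of trace x with respect to the
    reference trace r; 0 means the two traces coincide."""
    assert len(x) == len(r) and len(x) > 0
    return (sum((a - b) ** 2 for a, b in zip(x, r)) / len(x)) ** 0.5

def spec_output(x, r, tol=0.1):
    """A possible specification output theta: positive iff the SUV
    output stays close enough (within RMSE tol) to the reference."""
    return tol - rmse(x, r)
```

Returning tol - rmse(x, r) rather than the raw RMSE makes the sign convention match the verification problem below, where positive outputs mean the requirement is met.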
In order to formalize the relationship between a system and its specification, we need to define the notion of monitored system (Definition 6) [20][21][22]. In the following, as usual in modern programming languages (e.g., Java, Python), the lambda notation λt. f(t) denotes the function that, applied to t, returns f(t).

Definition 6 (Monitored system).
Let H be a system and Q be a specification for it. The (H, Q) monitored system is the system M = (X × Z, U, R, T, U, Φ, Ψ), where Φ is the transition function obtained by running H and Q in parallel (Q takes, as input, the input and the state of H) and Ψ is the output function of Q.

Example 2 (Example of monitored system).
Using the systems H and Q from Example 1, we have that the monitored system (H, Q) is the system M = (X × Z, U, R, T, U, Φ, Ψ), with X, Z, U, T, and U as in Example 1. Thus, the output Ψ(t, Φ(t, t0, (x0, z0), u), u) of the monitored system at time t is obtained from the RMSE of x with respect to u³ over the interval [t0, t], accumulated in the monitor state starting from z0.

The real value returned by Ψ defines our KPI, evaluating to which extent the system meets its requirements under the operational scenario defined by the input function u. If a KPI is Boolean (i.e., it takes only values in {0, 1}), then we have a crisp classification (i.e., the system will pass or fail to meet the given specification); otherwise, we have a metric classification measuring the extent to which requirements are violated. On such a basis, our verification problem can be formulated as follows.
Definition 7 (Verification problem). Let H be a system, Q be a specification for H and t 0 ∈ T a time instant. We say that H satisfies its specification Q from t 0 if for all (x 0 , z 0 ) ∈ (X × Z), for all u ∈ U , for all t > t 0 , we have that: Ψ(t, Φ(t, t 0 , (x 0 , z 0 ), u), u(t)) > 0.
To limit complexity, an upper limit h (horizon) is usually given to the value of t in Definition 7. In this case, we have a time-bounded verification problem.
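A time-bounded check of Definition 7 along a single scenario can be sketched as follows; the monitor callback is a stand-in for the composed functions Φ and Ψ of the monitored system:

```python
def falsifies(monitor_output, h):
    """Time-bounded verification of a single scenario: check the
    monitored-system output up to horizon h and report the first
    discrete time step at which the KPI stops being positive (i.e.,
    the requirement of Definition 7 is violated). monitor_output(t)
    stands in for Psi(t, Phi(t, t0, (x0, z0), u), u(t))."""
    for t in range(1, h + 1):
        if monitor_output(t) <= 0:
            return t  # earliest violation time within the horizon
    return None  # no violation up to the horizon

# Hypothetical monitor whose KPI decreases over time: violated at t = 4.
assert falsifies(lambda t: 4 - t, 10) == 4
assert falsifies(lambda t: 1.0, 10) is None
```

SMC then amounts to repeating this check over scenarios sampled from the environment and drawing a statistical conclusion from the outcomes.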

Statistical Model Checking
From Definition 7, we see that a verification problem can be cast as the problem of finding a state, an admissible input (operational scenario), and a time instant, such that the system KPI is negative (i.e., requirements are violated).
Of course, the above problem is computationally prohibitive in general. Thus, many techniques have been developed to compute an approximate solution to it with formal assurance about the verification outcomes. Basically, we have two main options: first, focus on simple models (e.g., finite state systems) to make the problem tractable; second, use statistical techniques in order to sample the environment (set of operational scenarios) in order to answer the verification problem (Definition 7) with some given statistical confidence.
We note that probabilistic model checking (e.g., PRISM [15]) is an exact approach, since it computes exactly the probability of reaching certain states, and it suffers from the state space explosion problem, much like deterministic model checking. Statistical techniques are used to overcome this problem. Furthermore, non-exhaustive approaches, such as falsification [23], do not provide a formal guarantee about the verification outcome. Thus, both probabilistic model checking (see [24]) and non-exhaustive verification are out of our scope: the former returns an exact result, while the latter does not provide a statistical confidence level. SMC follows the second approach outlined above and, indeed, offers a set of methods and software tools to solve the above problem in many different settings.
In the following, we will survey the available tools to carry out formal verification via SMC and categorize them on the basis of the environment model (i.e., space of input functions in Definition 2), SUV model (i.e., system in Definition 4), and specification model (i.e., SUV specification in Definition 5) they can support. This will help users in deciding which tool to use for a given verification task, trading off between the completeness of the environment model (all operational scenarios represented), faithfulness of the SUV model (all possible behaviors represented), and expressiveness of the specification language (all requirements are adequately formalized).
Finally, from a computational perspective, we will categorize tools with respect to the SMC algorithm used. This will enable users to evaluate which kind of parallelism they can expect to easily implement on the basis of the sampling mechanism used.

Background
By defining a probability measure on our set of operational scenarios U (space of input functions-see Definition 2), we can regard the deterministic system (SUV) in Definition 4 as a stochastic system. In such a context, the verification problem in Definition 7 will aim at estimating the probability that the system specification is satisfied. Such a probability can be computed while using numerical as well as statistical techniques.
Numerical Model Checking (NMC) techniques (e.g., [25,26]) need an explicit finite state model (white box) for the SUV. They compute accurate results, but they do not scale very well to large systems, because they suffer from the state space explosion problem.
SMC methods avoid an explicit representation of the SUV state space. They use a simulation-based approach and statistical inference over observations in order to establish whether a property, specified through a stochastic temporal logic (which is then transformed into a specification for the SUV as in Definition 5), is true within statistical confidence bounds. When compared to numerical techniques, SMC overcomes the state explosion problem and scales better to large systems, but it only guarantees correctness in a statistical sense. Nevertheless, it is sometimes the only feasible approach to the problem.
SMC takes as input a model defining the dynamics of a system S and a property ϕ, defined as a logical formula. The goal of SMC is to decide whether S satisfies ϕ with a probability that is greater than or equal to a certain threshold α. In formal terms, we denote the SMC problem as S |= P ≥α (ϕ).
In the next sections, we describe the models that can be used to represent the system and the kind of temporal logics used to express the properties. Because all of the approaches are simulation-based, we have that all SUV models and languages to define properties discussed in the following can be transformed, respectively, into the general definitions in Definitions 4 and 5.

System Models
Quite often, rather than considering deterministic systems with stochastic inputs as in Definition 4, SMC considers systems without inputs with nondeterministic or stochastic transitions. This can be accommodated in our setting, as discussed at the beginning of Section 2. Accordingly, in our context, we will consider these two modelling approaches to be equivalent.
SMC input systems can be modelled through different kinds of structures, such as DTMCs or CTMCs [26,27], or Generalized Semi-Markov Processes (GSMPs) [28].
A DTMC is defined as a tuple M = (S, s0, P, L), where: S is a set of states (the state space), s0 ∈ S is the initial state, P : S × S → [0, 1] is the transition probability function, and L is a function labeling states with atomic propositions. A CTMC is a Markov Chain where time runs continuously. A GSMP [29] is a stochastic process where the transition from one state to another depends on a set of several possible events associated with the current state.
Furthermore, SMC can be applied to Discrete Event Stochastic Processes (DESPs), consisting of a set of states, which can be infinite, and whose dynamics depend on a set of discrete events. DESPs can be modelled as Generalized Stochastic Petri Nets (GSPNs) [30]; a GSPN is a bipartite graph consisting of two classes of nodes: places, representing the state of the modelled system, and transitions, encoding the model dynamics.
SMC algorithms for Probabilistic Timed Automaton (PTA) are also available [31]. Using PTA, we can represent systems with probabilistic and real-time features.
More recently, SMC has also been applied to timed and hybrid systems with stochastic behavior, like Stochastic Hybrid Automata (SHA) and Stochastic Timed Automata (STA) (see [32]). Some SMC algorithms also accept, as input models, Markov Decision Processes (MDPs) [33]. MDPs are sequential decision models [34] characterized by a set of states, a set of actions, probabilistic transitions, and rewards (or costs) associated with the actions, such that each decision only depends on the current state and not on the past. MDPs are often used to model concurrent process optimization problems.
SMC tools, like VeSTa or MultiVeSTa, which will be discussed in Sections 5.1 and 5.2, take, as input, models whose behavior is described through probabilistic rewrite theories [35], a general high-level formalism to specify probabilistic systems.
The SMC tool SAM (in Section 5.16) works on systems that are modelled through StoKLAIM [36], which is a Markovian extension of KLAIM (Kernel Language for Agents Interaction and Mobility) [37], a language that is used to model mobile and distributed systems.

System Properties
The properties to be evaluated represent a set of behaviors of the system under verification and can be qualitative or quantitative. Verifying a qualitative property means deciding between two mutually exclusive hypotheses, namely if the probability to satisfy the property is above or below a certain threshold. Quantitative properties concern computing the estimation of a stochastic measure, i.e., the expected value of the probability that a given property is satisfied.
Properties are expressed through a temporal logic and, in literature, many different kinds of logics exist, as outlined below.
Linear Temporal Logic (LTL) [38] is the root of the family of temporal logics that reason along linear traces through time, where each instant is followed by exactly one future instant. Bounded Linear Temporal Logic (BLTL) extends LTL by adding upper time bounds (a bounded number of steps when the time domain is discrete) to temporal operators. Probabilistic Bounded Linear Temporal Logic (PBLTL) adds probabilistic operators to BLTL, and PBLTLc extends PBLTL with numeric constraints. Metric Temporal Logic (MTL) [39] and Metric Interval Temporal Logic (MITL) [40] both extend LTL by introducing an explicit representation of time.
Another class of stochastic logics considers time evolving in a branching fashion, rather than along a line. This class includes Computation Tree Logic (CTL) [41], which is used to express qualitative properties over (non-stochastic) transition systems. Probabilistic Computation Tree Logic (PCTL) [42] extends CTL by including operators and time bounds to express the quantitative properties of a DTMC, say M. If s is a state of M and ϕ is a PCTL formula, then s |= P ≥α (ϕ) means that "the probability that ϕ is satisfied along the paths outgoing from state s is greater than or equal to α".
Several extensions of PCTL exist, like Continuous Stochastic Logic (CSL) [43], which can be used to express quantitative properties of CTMCs; Quantitative Temporal Expressions (QuaTEx) [35], a query language that is used to investigate quantitative aspects of probabilistic rewrite systems; PCTL* [44], used to specify quantitative properties of DTMCs or MDPs. PCTL* formulas can be interpreted over the states of a fully or concurrent probabilistic system. CSL TA [45] is a superset of CSL for CTMC that allows for specifying path formulas through a one-clock Deterministic Timed Automaton (DTA).
Hybrid Automata Stochastic Language (HASL) is a temporal logic formalism used to define properties over DESPs (Section 3.1). A HASL formula consists of an automaton and an expression. The automaton is a Linear Hybrid Automaton (LHA), i.e., a hybrid automaton with data variables whose dynamics depend on the modelled system states. The expression refers to the moments of path random variables associated with path executions of the system.
Finally, Mobile Stochastic Logic (MoSL) is a temporal stochastic logic that was introduced in [36]. MoSL is based on CSL and allows us to refer to the spatial structure of a mobile network. MoSL can be used to specify both qualitative and quantitative properties.

Statistical Inference Approaches
When using SMC algorithms, the system is simulated many times by executing Monte Carlo experiments. The property ϕ to be verified is checked at each simulation (sample). Let us associate a Bernoulli random variable Z to one simulation. We assume that Z is 1 if ϕ is satisfied, 0 otherwise, and that Pr[Z = 1] = p and Pr[Z = 0] = 1 − p.
HT, as described in Section 4.1, Estimation, Section 4.2, and Bayesian analysis, Section 4.3, can be used as engines for the SMC algorithms, as outlined below.

Hypothesis Testing
The hypothesis testing approach consists of the evaluation of two mutually exclusive hypotheses, namely the null hypothesis H 0 : p ≥ θ, and the alternative hypothesis H 1 : p < θ, where θ is a given threshold value. If we assert that H 0 is false while it is true, we make a Type I error (or false positive); if we decide to reject H 1 , while it is true, we make a Type II error (or false negative). Let us denote, with α, the probability of Type I errors and with β the probability of Type II errors.
HT can be used to analyze both qualitative and quantitative properties. It can be performed on a set of samples that can be generated in advance or on demand. In the former case, the sample size is fixed. In the latter case, called sequential testing, the samples are collected incrementally and their number is not predetermined; the decision of whether or not to continue sampling is made based on the samples generated up to that point. Mixtures of the two types of sample generation are possible. This is discussed in [5], where several approaches to HT are described, differing from each other in the guarantees they give about the correctness of the result. The algorithms implementing these approaches are the following: the Chernoff C.I. method, a fixed sample size test resting on a confidence interval based on the Chernoff-Hoeffding bound [46]; the Gauss C.I. method, a similar procedure that uses the Gaussian approximation to determine the confidence interval; the Chow-Robbins test, a sequential method that continues sampling until the width of the confidence interval reaches a given value; the Sequential Probability Ratio Test (SPRT) technique of Wald [47]; and its fixed sample size variant, the Gauss-SSP test, where SSP stands for Single Sampling Plan (see [48] for details).
When compared to simple sequential testing, SPRT is optimal in the sense that it minimizes the average sample size before a decision is made. The last method is the Chernoff-SSP test, which is a variant of the Gauss-SSP test using the Chernoff-Hoeffding bound instead of the Gaussian approximation.
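A minimal sketch of Wald's SPRT in this setting, with the usual acceptance thresholds derived from α and β, is given below; the indifference-region bounds p0 and p1 and all parameter values are illustrative choices of ours:

```python
import math
import random

def sprt(sample, p0, p1, alpha, beta):
    """Wald's Sequential Probability Ratio Test for H0: p >= p0 versus
    H1: p <= p1 (with p1 < p0), with Type I / Type II error bounds
    alpha and beta. sample() draws one Bernoulli observation (1 iff
    the property held on a fresh simulation run). Returns the accepted
    hypothesis and the number of runs used."""
    accept_h1 = math.log((1 - beta) / alpha)   # upper threshold
    accept_h0 = math.log(beta / (1 - alpha))   # lower threshold
    llr, n = 0.0, 0                            # log-likelihood ratio, run count
    while True:
        z = sample()
        n += 1
        llr += math.log(p1 / p0) if z else math.log((1 - p1) / (1 - p0))
        if llr >= accept_h1:
            return "H1", n
        if llr <= accept_h0:
            return "H0", n

# Toy SUV whose property holds with probability 0.9 (above p0 = 0.8).
rng = random.Random(1)
verdict, runs = sprt(lambda: int(rng.random() < 0.9), 0.8, 0.7, 0.01, 0.01)
```

Note that the number of runs n is only known at termination, which is precisely why SPRT-style sequential tests are harder to parallelize than fixed sample size tests.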
When the properties to be verified are rare, that is, the probability that they hold is low, the number N of simulations required can explode. Fortunately, two techniques, namely Importance Sampling (ISA) and Importance Splitting (ISS), can be applied to reduce N. ISA works by sampling through a weighted distribution, such that a rare property is more likely to be observed; the results are then compensated by the weights, in order to estimate the probability under the original distribution. ISS works by decomposing the sampling from the rare-event distribution into sampling from a product of less rare conditional probabilities. See [49][50][51] for an in-depth discussion of these two rare-event techniques.
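The ISA idea can be illustrated on a textbook rare event, estimating Pr[X ≥ 20] for X ~ Exp(1) by sampling from a heavier-tailed proposal; the distributions and parameters are our own toy choices, not tied to any SMC tool:

```python
import math
import random

def is_rare_prob(rng, n=100_000, lam=0.05, c=20.0):
    """Importance-sampling (ISA) estimate of the rare probability
    p = Pr[X >= c] for X ~ Exp(1), whose true value is exp(-c).
    We sample from the heavier-tailed proposal Exp(lam), under which
    the event is no longer rare, and compensate each hit with the
    likelihood ratio w(x) = f(x)/g(x) = (1/lam) * exp(-(1 - lam) * x)."""
    total = 0.0
    for _ in range(n):
        x = -math.log(1.0 - rng.random()) / lam  # inverse-transform Exp(lam)
        if x >= c:
            total += (1.0 / lam) * math.exp(-(1.0 - lam) * x)
    return total / n

# exp(-20) is about 2.06e-9: far too rare for 1e5 crude Monte Carlo runs,
# which would almost surely return 0.
est = is_rare_prob(random.Random(3))
```

The weighting keeps the estimator unbiased for the original distribution while drastically reducing the number of runs needed to observe the event.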

Estimation
Estimation is the kind of inference approach typically used to verify quantitative properties. Existing SMC algorithms rely on classical Monte Carlo based techniques in order to estimate a property with a corresponding confidence interval. The system model is repeatedly simulated and, if ϕ is the property to be evaluated and p is the probability to satisfy ϕ, an (ε, δ) approximation p̂ of p is computed, with the assurance that the probability of error is Pr(|p̂ − p| ≥ ε) ≤ δ, where ε and δ are, respectively, the precision and the confidence.
The confidence interval can be computed using different approaches, each one affecting the number N of simulations to perform. For instance, according to the Chernoff-Hoeffding bound [46], N is fixed a priori by setting N = (ln 2 − ln δ)/(2ε²). The Confidence Interval (CI) method, taking as input two parameters α and δ, samples until the size of the (1 − α) × 100% confidence interval, computed using the Student's t-distribution, is bounded by δ.
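For example, the Chernoff-Hoeffding sample size and the resulting fixed-N estimation can be sketched as follows; the sampled property is a toy Bernoulli of our own choosing:

```python
import math
import random

def chernoff_sample_size(eps, delta):
    """Number N of simulation runs, fixed a priori, such that the
    empirical mean p_hat satisfies Pr(|p_hat - p| >= eps) <= delta by
    the Chernoff-Hoeffding bound: N = (ln 2 - ln delta) / (2 eps^2)."""
    return math.ceil((math.log(2) - math.log(delta)) / (2 * eps ** 2))

def estimate(sample, eps, delta):
    """Fixed sample size (eps, delta) estimation of p = Pr[sample() = 1]."""
    n = chernoff_sample_size(eps, delta)
    return sum(sample() for _ in range(n)) / n, n

# Toy Bernoulli property with (unknown) satisfaction probability 0.3.
rng = random.Random(5)
p_hat, n = estimate(lambda: rng.random() < 0.3, eps=0.01, delta=0.05)
```

Since N is known before any simulation starts (here, 18445 runs for ε = 0.01, δ = 0.05), the runs can be distributed over parallel workers trivially.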
In [52,53], Grosu and Smolka show how to compute an (ε, δ) approximation p̂ of p satisfying the property Pr(p(1 − ε) ≤ p̂ ≤ p(1 + ε)) ≥ 1 − δ. This can be done with an optimal number of simulation traces N, through the Optimal Approximation Algorithm (OAA) introduced by Dagum et al. in [54]. OAA uses the minimum number N of traces to compute an (ε, δ) approximation, within a constant factor, of a random variable distributed in [0, 1]. The OAA algorithm improves on the classic Monte Carlo algorithm, which is based on the hypothesis testing approach, since N is computed at runtime. This result is obtained by applying the sequential testing approach of Wald [47] to decide when to stop simulating. As for HT (Section 4.1), when the quantitative properties to be verified are rare, ISA or ISS can be applied (see [49][50][51]), together with the algorithm selected to solve the estimation problem, in order to reduce the number of simulations to perform.

Bayesian analysis
The Bayesian approach to SMC is based on Bayes' theorem and sequential sampling, and it can be associated with both HT and Estimation. In this case too, as for the Hypothesis Testing (Section 4.1) and Estimation (Section 4.2) algorithms, the system model is repeatedly simulated. The difference is that Bayes' theorem requires prior knowledge about the model to be verified, namely the prior density of a given random variable. Sequential sampling, as discussed in Section 4.1, means that the number of simulations to perform is decided at run-time. The properties to be verified are expressed as BLTL formulas and input systems are modelled as Discrete-Time Hybrid Systems (DTHSs) (see Section 2).
The algorithm to perform SMC by Bayesian HT, introduced in [55], checks, for each trace of the system model generated by simulation, whether a certain property is satisfied. It then uses the Bayes factor B to decide whether the null hypothesis H0 or the alternative hypothesis H1 has to be accepted.
Zuliani et al. in [7] propose a Bayesian statistical Estimation algorithm. The main goal of this approach is estimating the (unknown) probability p that a random execution trace of the system model satisfies a fixed property φ. The estimate corresponds to a confidence interval, i.e., an interval that contains p with high probability. Since p is unknown, it must be assumed that p is given by a random variable whose density is known a priori. The Bayesian Estimation algorithm iteratively generates traces of the system model and, for each trace, executes the following steps: it checks whether φ is satisfied; computes the posterior mean, which is the Bayes estimator of p; computes an interval estimate; and produces the posterior probability of p.
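A minimal sketch of Bayesian estimation under the standard Beta/Bernoulli conjugate prior is given below; the stopping rule and all parameters are our own simplification, not the exact interval computation of [7]:

```python
import random

def bayes_estimate(sample, a=1.0, b=1.0, half_width=0.01, max_runs=200_000):
    """Bayesian estimation sketch with a Beta(a, b) prior on the
    unknown probability p: after s successes in n Bernoulli runs the
    posterior is Beta(a + s, b + n - s). We keep simulating until the
    posterior standard deviation drops below half_width, then return
    the posterior mean (the Bayes estimator of p) and the number of
    runs used."""
    s, n = 0, 0
    mean = a / (a + b)  # prior mean, before any observation
    while n < max_runs:
        s += sample()
        n += 1
        a_post, b_post = a + s, b + n - s
        tot = a_post + b_post
        mean = a_post / tot
        var = (a_post * b_post) / (tot ** 2 * (tot + 1))
        if var ** 0.5 < half_width:
            break  # posterior is concentrated enough: stop sampling
    return mean, n

# Toy property with (unknown) satisfaction probability 0.2.
rng = random.Random(11)
p_mean, runs = bayes_estimate(lambda: int(rng.random() < 0.2))
```

As in the sequential approaches above, the number of runs is decided at run-time: here it adapts to how concentrated the posterior becomes.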
Finally, Bortolussi et al. in [56] present a novel approach to statistical model checking, based on Bayesian analysis, called Smoothed Model Checking. Their goal is computing a statistical estimate (and a confidence interval) of the satisfaction of a MITL property against an Uncertain CTMC, which is a CTMC indexed by a parameter vector.

Summary of the Algorithms Used for HT and Estimation
In Table 1, we summarize the main algorithms, explained in the previous sections, which can be used to carry out Hypothesis Testing or Estimation. For each algorithm, we specify whether the number N of simulation runs to execute is predetermined or not. Table 1. For each considered algorithm, we show with a • whether it supports Hypothesis Testing (HT) and/or Estimation (E). In the last column, yes means that the algorithm pre-computes the number of samples (#Samples Fixed a Priori), whereas no means that the number of samples is computed at runtime.

Algorithm    HT   E   #Samples Fixed a Priori
Gauss-SSP    •        yes
C.I.

Statistical Model Checking Tools
In this section, we describe some of the existing Monte Carlo based SMC tools, focusing on those that are directly available to the academic community. Tools that use simulation-based techniques but whose output is not a statistical confidence interval are not included in our review; examples are: SyLVaaS [57], a Web-based software-as-a-service tool for System Level Formal Verification of CPSs; S-TaLiRo [58], a tool performing temporal logic falsification of properties over non-linear hybrid systems; Ariadne [59], a tool for the formal verification of CPSs using reachability analysis for nonlinear hybrid automata; SpaceEx [60], a tool platform for reachability and safety verification of continuous and hybrid systems; and Symbolic PathFinder [61], an SMC tool implementing approximate algorithms for MDPs.
For each surveyed tool, we give a brief description and some details about: the kind of models supported; the languages used to specify the properties to be verified; and the statistical inference approach used (HT and/or Estimation and/or Bayesian analysis). The information about the environment model, which stems from the kind of model, is summarized in Table 2.
Tools performing statistical inference by algorithms not discussed in Section 4 (e.g., SAM, GreatSPN) will not be included in the taxonomy that is given in Table 2.

(P)VeStA
VeStA is a Java-based tool for statistical analysis of probabilistic systems. It implements the statistical methods from [27,43], based on Monte-Carlo simulation and simple statistical hypothesis testing [62].
Model and Property. This tool can verify properties expressed as CSL/PCTL formulas (Section 3.2) against probabilistic systems specified as CTMCs or DTMCs (Section 3.1). VeStA is also able to statistically evaluate, in a Monte Carlo based way, the expected value of properties expressed in QuaTEx (Section 3.2), over the observations performed on probabilistic rewrite theory models. In the latter case, models are described through PMAUDE, an executable algebraic specification language [35].
Statistical Inference approach. VeStA performs SMC by using classic statistical hypothesis testing (Section 4.1), rather than sequential hypothesis testing, according to the algorithm described in [63]. In particular, VeStA implements the Gauss-SSP hypothesis testing (Section 4.1), which is a fixed sample size version of the SPRT algorithm of Wald (Section 4.1). Consequently, it is easily parallelizable. PVeStA [64] is the tool extending and parallelizing the SMC algorithms implemented in VeStA.

MultiVeStA
MultiVeStA [65] extends VeStA (and PVeStA) with an important feature: integration with several existing discrete event simulators, in addition to the already supported ones.
Model and property. Properties to be verified are expressed in the Multi Quantitative Temporal Expressions (MultiQuaTEx) query language, which extends QuaTEx (Section 3.2) and allows for querying several variables at a time through multiple observations on the same simulation. This improves performance when evaluating several expressions. The supported SUV models are Discrete Event Systems (DESs).
Statistical Inference approach. MultiVeStA performs an estimation of the expected value of MultiQuaTEx properties via the Chow-Robbins method (Section 4.1).

Simulation-Based SMC for Hybrid Systems
There are many tools supporting SMC for hybrid systems, for example: ProbReach [66], Probably Approximately Correct (PAC) hybrid systems verification [67,68], and Plasma [69]. Here, we focus on tools that can interface in a black-box fashion with any available simulator for hybrid systems. One of those is Plasma [70,71], an SMC platform that may be invoked from the command line or embedded in other software as a library. This tool uses external simulators (e.g., Simulink, SystemC) in order to define the SUV as well as the property to be verified.
Model and Properties. Plasma Lab accepts properties described as BLTL extended with customized temporal operators, against stochastic models such as CTMCs and MDPs (Section 3.1).
Statistical Inference approach. Plasma Lab can verify both qualitative and quantitative properties. In fact, the tool implements, among others, the following algorithms: the Monte Carlo probability estimation based on the Chernoff-Hoeffding bound [46], to decide a priori the number of simulations to execute; the SPRT algorithm for hypothesis testing; ISA when the properties are "rare" (Section 4.1).
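For example, the a priori sample size given by a standard (additive) form of the Chernoff-Hoeffding bound can be computed directly; the ε and δ values below are only illustrative.

```python
import math

def chernoff_sample_size(eps, delta):
    """Number of Monte Carlo runs N such that the estimated probability is
    within eps of the true one with confidence at least 1 - delta, using the
    additive Chernoff-Hoeffding bound: N >= ln(2 / delta) / (2 * eps^2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

# e.g. precision eps = 0.01 at confidence 1 - delta = 0.95;
# since N is known beforehand, all runs can be launched in parallel
n_runs = chernoff_sample_size(0.01, 0.05)
```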

APMC
The Approximate Probabilistic Model Checker (APMC) is a model checker implementing an efficient Monte Carlo-based method in order to approximate the probability that a monotone property is satisfied, with high confidence, by a fully probabilistic system. This algorithm is described in [72]. In [73] the optimal approximation algorithm OAA (Section 4.2) is used to obtain a randomized approximation scheme with multiplicative error parameter.
Model and property. APMC is used to verify quantitative properties over fully probabilistic transition systems or DTMCs (Section 3.1). In 2006, APMC was extended to also manage CTMCs (see [74]). Properties to be checked are expressed as LTL (Section 3.2) formulas.
Inference approach. This tool performs estimation through a Monte-Carlo sampling technique, based on the Chernoff-Hoeffding bound (Section 4.1), which is naturally parallelizable.

PRISM
PRISM [15] is a Probabilistic Symbolic Model Checker which, from version 4.0, offers support for SMC of several types of probabilistic models. Furthermore, in [75], a parallelized version of PRISM has been extended to handle MDPs.
Model and property. DTMC, CTMC, and MDP models (Section 3.1) are described through the PRISM modelling language, while the properties to be verified are defined by several probabilistic temporal logics, incorporating PCTL, PCTL*, CSL, and LTL (Section 3.2).
Statistical Inference approach. The tool uses the SPRT (Section 4.1) in order to verify qualitative properties and the following algorithms to verify the quantitative properties (Section 4.2): CI method, Asymptotic Confidence Interval (ACI) method, and Approximate Probabilistic Model Checking (APMC) method. All of these algorithms precompute the number of samples to be generated. See [76] for an updated detailed description.

Ymer
Ymer, as illustrated in [77], is a command-line tool written in C and C++, which implements SMC techniques based on discrete event simulation and acceptance sampling. Furthermore, Ymer implements two sampling-based algorithms for the statistical verification of time-unbounded properties of DTMCs (see [78]).
Model and property. This tool can verify CSL properties against CTMCs and PCTL properties against DTMCs (Sections 3.1 and 3.2).
Statistical Inference approach. Ymer implements both sampling with a fixed number of observations and sequential acceptance sampling, performed through the SPRT method (Section 4.1). Ymer includes support for distributed acceptance sampling, i.e., the use of multiple machines to generate observations, which can result in significant speedup, as each observation can be generated independently. The work in [78] also implements, in Ymer, estimation through two different approaches, the first based on the Chernoff C.I. and the second based on the Chow-Robbins sequential method.
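The core of Wald's sequential test underlying this kind of acceptance sampling can be sketched as follows. This is an illustrative sketch, not Ymer's implementation: indifference-region handling is simplified and all threshold values are assumptions.

```python
import math

def sprt(trace_results, theta0, theta1, alpha=0.05, beta=0.05):
    """Wald's SPRT for H0: p >= theta0 versus H1: p <= theta1 (theta1 < theta0),
    where p is the probability that a random trace satisfies the property.
    `trace_results` yields one boolean per simulated trace; the number of
    observations is decided at runtime."""
    upper = math.log((1.0 - beta) / alpha)   # accept H1 at or above this LLR
    lower = math.log(beta / (1.0 - alpha))   # accept H0 at or below this LLR
    llr = 0.0                                # log-likelihood ratio of H1 vs H0
    for satisfied in trace_results:
        if satisfied:
            llr += math.log(theta1 / theta0)
        else:
            llr += math.log((1.0 - theta1) / (1.0 - theta0))
        if llr >= upper:
            return "H1"
        if llr <= lower:
            return "H0"
    return None  # undecided: the finite batch of traces ran out

# a run of satisfying traces drives the test towards accepting H0
decision = sprt(iter([True] * 200), theta0=0.9, theta1=0.8)
```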

UPPAAL-SMC
UPPAAL-SMC [79] is a stochastic and statistical model checking extension of UPPAAL [80], which is a toolbox for the verification of real-time systems, jointly developed by Uppsala University and Aalborg University.
Model and Properties. UPPAAL-SMC implements techniques to verify both quantitative and qualitative properties of timed and hybrid systems with a stochastic behavior, whose dynamics can be specified by SHA, effectively defining ODEs, and by STA [32]. Properties are expressed through a weighted extension of the temporal logic MITL (Section 3.2).
Statistical Inference approach. This tool carries out quantitative properties verification through a Monte Carlo based estimation algorithm using the Chernoff-Hoeffding bound (Section 4.1), where the number of samples to be generated is predetermined. Qualitative properties are verified through the SPRT algorithm (Section 4.1).

COSMOS
COSMOS [30] is a statistical verification tool implemented in C++.
Model and property. This tool analyzes DESPs, a class of stochastic models, including CTMCs, represented in the form of a GSPN (Section 3.1). Properties to be verified are expressed as HASL formulae (Section 3.2).
Statistical Inference approach. COSMOS relies on Confidence Interval based methods to estimate the probability that the property under verification holds, by implementing two possible approaches: the static sample size estimation, based on the Chernoff-Hoeffding bound (Section 4.1), where the sample size is fixed a priori; the dynamic sample size estimation, where the sample size depends on a stopping condition, such as that provided by Chow and Robbins (Section 4.1). COSMOS also provides the SPRT method.
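A dynamic sample size scheme in the style of Chow and Robbins can be sketched as follows. The stopping rule (stop when the approximate confidence interval around the running mean is narrow enough, with a vanishing 1/n term preventing premature stops) is standard, but the parameter values and the mock property are illustrative assumptions.

```python
import math
import random

def chow_robbins(sample, half_width, z=1.96):
    """Sequential estimation a la Chow-Robbins: draw observations one at a
    time (Welford's running mean/variance) and stop as soon as the
    approximate z-confidence interval around the mean has at most the
    requested half-width. Returns the estimate and the runtime-decided
    sample size."""
    n, mean, m2 = 0, 0.0, 0.0
    while True:
        x = sample()
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= 2:
            var = m2 / (n - 1)
            # the 1/n term keeps the rule from stopping while var is still 0
            if z * math.sqrt((var + 1.0 / n) / n) <= half_width:
                return mean, n

# estimate the satisfaction probability of a mock Bernoulli(0.3) property
random.seed(0)
estimate, runs = chow_robbins(lambda: 1.0 if random.random() < 0.3 else 0.0,
                              half_width=0.05)
```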

GreatSPN
GreatSPN [81] is a tool that supports the design and the qualitative and quantitative analysis of GSPN. GreatSPN contains, in particular, two modules: a CTL model checker and a CSL TA model checker.
Model and Properties. GreatSPN can verify: CTL properties against models represented as GSPN or its colored extension, defined as Stochastic Symmetric Nets (SSN); CSL TA properties (Section 3.2) against CTMCs.
Statistical Inference approach. The CTL model checker of GreatSPN verifies CTL properties by numeric symbolic algorithms (see Section 3). The CSL TA module is a probabilistic model checker for the estimation of properties, which can also interact with external tools, like PRISM (Section 5.5) and MRMC (Section 5.10).

MRMC
MRMC (Markov Reward Model Checker) [82] is a command-line tool, written in the C language, which implements SMC techniques. This tool is a Probabilistic Model Checker rather than an SMC tool. However, as MRMC performs statistical inference through Monte Carlo-based techniques, it is included in our review for the sake of completeness.
Model and property. The tool can verify CSL and PCTL properties (Section 3.2) against CTMCs and DTMCs (Section 3.1).
Statistical Inference approach. MRMC performs probability estimation by the Confidence Interval method based on the Chow-Robbins test (Section 4.1), with a dynamic sample size. A drawback is that, since MRMC always loads the Markov chain representation completely in memory, it can lose the benefits deriving from simulating instead of using numerical techniques.

SBIP
SBIP [83] is a statistical model checker for BIP (Behavior, Interaction, Priority), a general framework for the design of component-based systems developed at Verimag (see [84]).
Model and property. It supports DTMC, CTMC, and GSMP (Section 3.1) as input models. The properties to be verified can be expressed as PBLTL and MTL formulas (Section 3.2).
Statistical Inference approach. The tool implements several statistical testing algorithms for stochastic systems verification of both qualitative and quantitative properties. The qualitative properties are checked through one of the following algorithms: Single Sampling Plan (SSP) [3], where the number of samples is predetermined, and SPRT, where the number of samples is generated at runtime (Section 4.1). The quantitative properties are verified through a Probability Estimation procedure, based on the Chernoff-Hoeffding bound (Section 4.1).

MARCIE
MARCIE [85] is a tool that is written in C++ and made up of several analysis engines, in particular a Stochastic Simulation Engine that is based on statistical estimation.
Model and property. This tool can verify both quantitative and qualitative properties over systems that are modelled as GSPN, including CTMC (Section 3.1). Properties can be defined by CSL, Continuous Stochastic Reward Logic (CSRL) or PLTLc. CSRL includes CSL and adds reward intervals to the temporal operators (Section 3.2).
Statistical Inference approach. The component of MARCIE dedicated to estimation implements an algorithm performing a number of simulation runs that depends on the variance of the system's stochastic behavior and is determined through a Confidence Interval method based on the Chernoff-Hoeffding bound (Section 4.1).

Modest Toolset Discrete Event Simulator: Modes
Modest [86] is a high-level compositional modelling language for STA. It is fully supported by the Modest Toolset, available at [87], which includes tools for analyzing different subsets of STA, e.g., PTA, MDP, DTMC, CTMC (Section 3.1). In particular, the toolset provides modes, a discrete event simulator based on the Modest language, which offers SMC functionalities from version 1.4.
Model and property. modes supports the analysis of SHA, STA, PTA, and MDP. The properties to be verified are expressed in the Modest language.
Statistical Inference approach. modes verifies quantitative properties through a confidence interval based algorithm, which allows for deciding at runtime how many simulations to do; qualitative properties through the SPRT method (Section 4.1).

APD Analyser
Aggregated Power Demand (APD) Analyser [88,89] is an SMC-based domain-specific tool to compute the APD probability distribution in a smart grid scenario in the presence of highly dynamic individualised electricity price policies [90,91].
Model and property. The input model consists of a probabilistic model of end-user deviations with respect to their expected behaviors. The tool computes a single domain-specific property: the APD probability distribution in Electric Distribution Networks. Further post-processing of this distribution allows the Distribution System Operators (DSOs) to check the safety properties of interest.
Statistical Inference approach. As an exact computation of the required APD probability distribution is computationally prohibitive, because of its exponential dependence on the number of users, APD-Analyser computes a Monte Carlo based (ε, δ)-approximation, through an efficient High-Performance Computing (HPC)-based implementation of the OAA algorithm, discussed in Section 4.2.

ViP Generator
The works in [92-94] present a tool called ViP (Virtual Patient generator). This tool uses an SMC-based approach for computing complete populations of virtual patients for in silico clinical trials and for the model-based synthesis of optimal personalised pharmaceutical treatments [95,96].
Model and property. The input model is a system of ODEs and the boolean property to be satisfied is the completeness of the virtual patient set generated.
Statistical Inference approach. HT (Section 4.1) is used to check, with high statistical confidence, the completeness of the virtual patient set S generated so far. After defining a probability threshold ε and a confidence threshold δ, the SMC algorithm in [92] randomly extracts, from a parameter space Λ, a sample λ that, if admissible, is added to S. On the basis of [52,53], the algorithm ends when S remains unchanged after N = ln δ/ln(1 − ε) attempts.
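The stopping threshold N can be computed directly; the ε and δ values below are illustrative.

```python
import math

def completeness_attempts(eps, delta):
    """Number of consecutive attempts N = ln(delta) / ln(1 - eps) that must
    leave S unchanged before declaring, with confidence 1 - delta, that any
    still-uncovered region of the parameter space has measure below eps."""
    return math.ceil(math.log(delta) / math.log(1.0 - eps))

# e.g. eps = 0.01 and delta = 0.05 give N = 299 attempts
n_attempts = completeness_attempts(0.01, 0.05)
```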

SAM
SAM (Stochastic Analyzer for Mobility) [36] is a tool supporting the analysis of mobile and distributed systems. SAM uses a statistical model checking algorithm in order to verify whether a mobile system satisfies a certain property.
Model and Properties. Systems are specified in StoKLAIM (Section 3.1). The properties to be verified are expressed through MoSL (Section 3.2).
Statistical Inference approach. SAM performs the estimation of quantitative properties.

Bayesian Tool
The Bayesian analysis consists of the introduction of Bayesian statistics to the SMC approach, as discussed in Section 4.3.
Model and property. The models to be analyzed are DTHS (see Section 2) defined as Stateflow/Simulink models, while properties are expressed as BLTL formula.
Statistical Inference approach. The Bayesian analysis includes: (i) an algorithm to perform SMC using Bayesian hypothesis testing in order to verify BLTL properties against Stateflow/Simulink models with hybrid features; (ii) a Bayesian estimation algorithm, to compute an interval estimate of the probability that a BLTL formula is satisfied in a stochastic hybrid system model.

Tool Comparison
In Table 2, we show a taxonomy of the surveyed tools based on their main properties, namely: the environment model, characterized by the time (discrete or continuous) and by the set of event values (finite or infinite); the SUV model, consisting of the kind of model or language used to represent the system and of the set of states of the model (finite or infinite); the property specification, i.e., the stochastic logic used to specify the properties to be verified and the search horizon (bounded or unbounded); and the verification technique, consisting of the statistical inference procedure used (HT and/or Estimation and/or Bayesian analysis), together with the algorithm implemented to perform the inference. The algorithm indicates whether the number of simulation runs is fixed beforehand or not (as shown in Table 1).
In addition to the tools extensively reviewed in [4,5], we have included the Bayesian analysis and two other SMC domain-specific software tools, the APD Analyser and the ViP (Virtual Patient) generator. Table 2. Monte Carlo simulation-based SMC tools comparison table. In the Time column, D stands for discrete, C for continuous. In the Model column, the model is specified by its representation structure or by the language describing it. In the Event values and Set of states columns, fin means finite and inf means infinite. In the Search horizon column, bnd stands for bounded, ubnd for unbounded. In the Inference column, HT stands for Hypothesis Testing, E for Estimation, and NS for Numeric-symbolic methods (see Section 3). In the same column, for each inference approach, the name of the algorithm used is specified. For further details about the algorithms, see Table 1. Depending on the algorithm, the number of samples can be computed a priori or at runtime.

Discussion
The taxonomy shown in Table 2 highlights the following trade-offs. The environment model is: complete if it contains all possible operational scenarios; sound if it contains only the operational scenarios that can occur in a real situation. Ideally, we would like a complete and sound environment model; in practice, however, this is not easy to define. Depending on whether the priority is correctness or efficiency, we may select a complete (possibly unsound) model or a sound (possibly incomplete) model.
A SUV model has to capture the dynamics of the system. If the properties to be verified are of the safety kind, tools taking as input an over-approximated model (i.e., containing more behaviors than the real system) have to be preferred. If liveness properties have to be verified, tools accepting under-approximated models can be used.
The property language has to be selected with respect to its capability to define the property to be verified. The search horizon should be large enough, but not too large, in order to avoid making simulations overly time consuming. In this case, the trade-off is between the expressiveness and the computational complexity of the verification activity.
The verification technique selection depends on the goal (namely, Estimation or HT) and on the need to know beforehand, or not, the number of samples to be generated. Accordingly, all of the tools implement the SPRT to perform HT on qualitative statements, except: (P)VeStA, which uses the Gauss-SSP test, whose sample size is predetermined; and the Bayesian analysis, which bounds the sample size using a test based on the Bayes factor. SBIP, besides SPRT, also implements SSP.
The estimation of quantitative properties is performed through different techniques. (P)VeStA uses the Confidence Interval method to evaluate QuaTEx formulae; MultiVeStA, COSMOS (from version 1.0), and MRMC (from version 1.5) implement the Chow-Robbins test; Plasma Lab, UPPAAL-SMC, PRISM (from version 4.0), SBIP, and MARCIE use the Chernoff-C.I. method; Ymer uses the Gauss-SSP test; APMC (from v3.0) implements the so-called Chernoff-SSP test; the APD Analyser implements the OAA algorithm (see Table 1). The Bayesian Estimation procedure is different from all of the others, because it is based on the Bayes factor test; however, as with the Chow-Robbins test, the sample size is not fixed a priori.
If we have to choose a Monte Carlo based SMC tool minimizing the number of simulations to run, in the case of qualitative statements all of the tools implementing the SPRT for HT are suitable. On the other hand, an a priori fixed sample size technique is easily parallelizable, so this kind of approach may sometimes be preferable. In the case of quantitative statements, the tools implementing the Chow-Robbins test or OAA are the fastest. The Bayesian analysis also allows us to decide at runtime how many simulations to run, but it requires knowing in advance the probability distribution of the variables to be estimated.
As a last result, Table 2 highlights an open research challenge: the lack of tools to perform unbounded verification of hybrid systems whose environment has continuous or discrete time.

Conclusions
We have reviewed most of the SMC tools currently available for research purposes, focusing on the complexity, in terms of sample size, of the Monte Carlo-based techniques that are used to evaluate quantitative and qualitative properties through HT, Estimation, and Bayesian analysis.
For each tool, we have highlighted: the environment model; the SUV model; the stochastic logic that is used to define the properties to be verified, together with the search horizon type; the statistical inference procedure; and, the corresponding algorithm used, by explicitly indicating whether the number of samples is generated at runtime or not. After an in-depth comparison, we produced the taxonomy presented in Table 2.
First, we observe that most of the tools assume that the environment is sound, thus events are taken from a finite set. In this scenario, the SUV model set of states is typically finite and the search horizon for the properties to be checked is unbounded. Otherwise, if the SUV model set of states is infinite, then the search horizon is always bounded.
Second, 90% of the tools performing HT use algorithms that decide, at runtime, the number of samples to be generated; approximately 70% of the tools implementing Estimation employ algorithms computing the number of samples beforehand. Thus, according to the problem at hand, we suggest using SMC tools computing a priori the number of samples if there is a need to parallelize and, instead, using tools generating one sample at each iteration if it is more important to reduce the number of simulations.
Finally, our taxonomy points out the challenging task of deploying tools to perform unbounded verification of discrete-or continuous-time hybrid systems.