Bayesian Input Design for Linear Dynamical Model Discrimination

A Bayesian design of the input signal for linear dynamical model discrimination has been proposed. The discrimination task is formulated as an estimation problem, where the estimated parameter indexes particular models. As the mutual information between the parameter and model output is difficult to calculate, its lower bound has been used as a utility function. The lower bound is then maximized under the signal energy constraint. Selection between two models and the small energy limit are analyzed first. The solution of these tasks is given by the eigenvector of a certain Hermitian matrix. Next, the large energy limit is discussed. It is proved that almost all (in the sense of the Lebesgue measure) high energy signals generate the maximum available information, provided that the impulse responses of the models are different. The first illustrative example shows that the optimal signal can significantly reduce error probability, compared to the commonly-used step or square signals. In the second example, Bayesian design is compared with classical average D-optimal design. It is shown that the Bayesian design is superior to D-optimal design, at least in this example. Some extensions of the method beyond linear and Gaussian models are briefly discussed.


Introduction
Discrimination of various dynamical models of the same process has a wide area of applications, especially in multiple-model fault detection and isolation [1][2][3][4] and, in many other estimation and control problems [5][6][7], it is necessary to choose the most likely dynamical model from a finite set. The discrimination task can be formulated as an estimation problem, where the estimated parameter θ indexes particular models. The problem can also be considered as a finite-dimensional approximation of more general identification tasks [8]. As the error probability or variance of the estimator of θ usually depends on the input signal, it is important to select a signal that minimizes error probability or maximizes a utility function that encodes the purpose of the experiment. Selection of an input signal that maximizes a utility function is strongly related to optimal experimental design [9].
Experimental design methods can be divided into classical and Bayesian. The classical methods, also called optimal experimental design, typically use various functionals of the Fisher information matrix as a utility function. These methods are widely described in the literature and work well if the model is linear in its parameters (see [8,[10][11][12] and the review article [9]). Unfortunately, in typical identification tasks, the solution of model equation and the covariance depends non-linearly on θ, even if the model equation is linear. This implies that the information matrix and the utility function depend on the parameter θ to be estimated. Therefore, only locally-optimal design can be obtained [13]. To obtain more robust methods, an averaging over the prior parameter distribution or minimax design [14], [8] (Section 6.1) are commonly used, but these methods are not fully Bayesian.
Bayesian optimal design uses the utility function, a functional of the posterior distribution (see [15,16] and the review articles [13,17,18]). The most commonly used utility functions are mutual information between parameters and model output, Kullback-Leibler divergence between the prior and posterior distributions, and the determinant of the posterior covariance matrix [13,16]. In contrast to classical methods, in Bayesian design the utility function does not depend on the parameters to be estimated. Hence, the method can cope with non-linear problems. The utility function, which is suitable for model discrimination, is the error probability of the MAP estimator of θ [19]. Such a utility function is generally difficult to calculate (see [19]), but the result of Feder & Merhav [20] implies that the error probability of the MAP estimator is upper-bounded by some decreasing function of mutual information between θ and the output of the system. Hence, the maximization of mutual information creates the possibility of reducing the error probability, provided that appropriate estimator is used. However, the most serious problem that inhibits the development of this idea is great computational complexity in calculating the mutual information.
The main contribution of this article is a fully-Bayesian (in the terminology of [13]) method for finding an input signal that maximizes the mutual information between θ and the system output. Maximization of information or, equivalently, maximization of the output entropy has been proposed by many authors (see, e.g., [13,15,17,18,[21][22][23]), but the mutual information is very hard to compute and the problem is often intractable. To overcome this serious difficulty, instead of mutual information the lower bound, given by Kolchinsky & Tracey [24], has been used. This is a pairwise-distance based entropy estimator and it it useful here, since it is differentiable, tight, and asymptotically reaches the maximum possible information (see [24] (Sections 3.2, 4, 6)). Maximization of such a lower bound, under the signal energy (i.e., the square of the signal norm) constraints, is much simpler, gives satisfactory solutions, and allows for practical implementation of the idea of maximizing information. This is illustrated with examples. Moreover, it is shown that, for certain cases, this problem reduces to a solution of a certain eigenproblem.
The article is organized as follows. In Section 2, the estimation task is formulated and the upper bound of the error probability and the lower bound of the mutual information are given. In Section 2.1, a selection between two models is discussed and an exact solution is given. Design of input signals with small energy, which is required in some applications, is described in Section 2.2. In Section 2.3, the large energy limit is discussed. An application to linear dynamical systems with unknown parameters is given in Section 3. An example of finding the most likely model among three stochastic models with different structures is given in Section 4. Comparison with classical D-optimal design is performed in Section 5. The article ends with conclusions and references.

Maximization of Mutual Information between the System Output and Parameter
Let us consider a family of linear models where θ ∈ 1, 2, ..., r, Y, Z ∈ R n Y , and U ∈ R n U . The matrices F θ are bounded. The parameter θ is unknown. The prior distribution of θ is given by The random variable Z is conditionally normal (i.e., p(Z|θ) = N(Z, 0, S θ )), where the covariance matrices S θ are given a priori and S θ > 0, for all θ. The variable U is called the input signal. In all formulas below, the input signal U is a deterministic variable. The set of admissible signals is given by Under these assumptions, and after applying Bayes rule: The entropies of Y and θ and the conditional entropies are defined as The mutual information between θ and Y is defined as (see [25] (pp. 19, 250)) As H(θ|U) = H(θ) and H(Y|θ, U) = H(Y|θ), then I(Y; θ|U) is given by The MAP estimator of θ is defined aŝ θ(Y, U) = arg max θ∈{1,...,r} p(θ|Y, U).
The error probability ofθ is given by (see [20]) It follows from Fano's inequality ( [25] (p. 38)), that P e is lower bounded by an increasing function in H(θ|Y, U). Feder & Merhav [20] proved that 2P e (U) H(θ|Y, U) log 2 e. As H(θ|Y, U) = H(θ) − I(Y; θ|U) and H(θ) does not depend on U, then the maximization of I(Y; θ|U) creates the possibility of reducing P e , and the optimal signal is given by To overcome the problems associated with the calculation of I(Y; θ|U), we will use its lower bound.

Lemma 1. (Information bounds)
. For all U ∈ R n U , where Proof. According to (4), p(Y|U) is finite Gaussian mixture. For such mixtures, the information bounds are known. A detailed proof, based on Chernoff α-divergence, is given in [24] (Section 4).

Lemma 2.
Letθ(Y, U) = arg max θ∈{1,...,r} p(θ|Y, U) be the MAP estimator of θ, and let P e (U) denote its error probability. There exists a continuous, increasing, and concave function f Proof. Feder & Merhav [20] (see Theorem 1 and Equation (14)) proved that there exists an increasing, continuous, and convex function φ As Taking f (η) = g(η log 2 e) we obtain the result. Now, the approximate solution of (12) is given by As I l is smooth and S is compact, (19) is well-defined.

Selection between Two Models
Suppose that θ takes only two values, 1 and 2, with prior probabilities p 0,1 and p 0,2 = 1 − p 0,1 , respectively. It's easy to check, by direct calculation, that Equation (20) implies that the maximization of I l is equivalent to the maximization of D 1,2 . On the basis of (15), we have the following optimization: task The solution of (21) is the eigenvector of Q 1,2 corresponding to its largest eigenvalue; that is,

Small Energy Limit
In many practical applications, the energy of an excitation signal must be small. The second order Taylor expansion of (14) gives where If the value of (see (3)) is small, then the last term in (23) can be omitted. As the second term does not depend on U, we have the following optimization task: The solution of (26) is the eigenvector of Q corresponding to its largest eigenvalue.

Large Energy Limit
We will investigate asymptotic behaviour of I(Y; θ|U) when ||U|| → ∞. On the basis of Lemma 1, the condition min Such signals are weakly informative and they cannot generate the maximum information, even if their amplitude tends to infinity. Let S 1 denote the unit ball in R n U and let µ be the Lebesgue measure on S 1 . The set of weakly informative signals is defined as Proof. ⇐: On the basis of Lemma 1 and (28), Since S i + S j is positive-definite and F i = F j then, on the basis of (16), the matrix Q i,j has at least one positive eigenvalue. Hence, µ(Ω i,j ) = 0 and µ(Ω Hence, Q i,j has at least one positive eigenvalue, which is possible only if As a conclusion, we have the following result. ..,r} p(θ|Y, U), be the MAP estimator of θ and let P e (U) denote its error probability. If F i = F j for all i = j, then, for any > 0, there exists a number > 0 and a signal U ∈ S , such that P e (U) < .

Application to Linear Dynamical Systems
Consider, now, the family of linear systems where the prior distribution of θ is given by (2) and x k ∈ R n , y k ∈ R m , w k ∈ R n w , v k ∈ R m , w k ∼ N(0, I n w ), and v k ∼ N(0, I m ). The variables w 0 , ..., w N−1 , v 1 , ..., v N are mutually independent. The initial condition is zero. The solution of (29) with initial condition x 0 = 0 has the form If we denote X = col(x 1 , ..., x N ), Y = col(y 1 , ..., y N ), U = col(u 0 , ..., u N−1 ), W = col(w 0 , ..., w N−1 ), and V = col(v 1 , ..., v N ), then (31) and (30) can be rewritten as where the matrices B θ , G θ , C θ , and D θ follow forms (30) and (31). The variables W and V are independent, where W ∼ N(0, I Nn w ) and V ∼ N(0, I Nm ). Substituting (32) into (33) we get Equation (1) The conditional density of Z has the form p(Z|θ) = N(Z, 0, S θ ), where the covariance matrix is given by Hence, the results of Section 2 can be applied to the dynamical system (29) and (30).

Example
In some fault detection and isolation problems [3,4], there is a need to determine which of the known models of the process is the most adequate. It is, therefore, important to find a signal that emphasizes the differences between these various models. As an example of this type of problem, let us consider three stochastic continuous-time models: where θ ∈ {1, 2, 3}, x(t) ∈ R θ , x(0) = 0, u(t), w(t) ∈ R, w is a standard Wiener process, and The step responses of these models are similar and they are difficult to experimentally distinguish from each other if the noise level is significant. The observation equation has the form where v k ∼ N(0, 1), t k = kT 0 , T 0 = 0.1, and x 1 is the first component of x(t). If x k = x(t k ) and u(t) = u k , t ∈ [t k−1 , t k ), then, after discretization, the state x k and the output y k are described by (29) and (30), with appropriate matrices A θ , B θ , C θ , G θ , and D θ . The matrices F θ and S θ are calculated by using (31)-(34). Let us observe that, although the orders of the systems are different, the size of both F θ and S θ is always N × N. We are interested in the maximization of I l (U). The solutions of (19) and (26) with a uniform prior, = N, and N = 200 steps, are shown in the upper part of Figure 1. The step responses and the optimal responses are shown in the bottom part of Figure 1. Let us observe that, in contrast to the step signal, the optimal signal clearly distinguishes the systems-although the energy of all input signals was the same and equal to N.
Let U st , U sq ∈ ∂S 1 denote the normalized step and square signal with period of three, respectively, and let U * ( ) denote the optimal signal. To check the validity of the results, the error probabilities P e ( U st ), P e ( U sq ), and P e (U * ( )) were estimated by Monte Carlo simulation with 10 6 trials and N = 50 steps. The results are shown in Figure 2. It was observed that the optimal signal gives an error probability several thousand times smaller than the step or square signal with the same energy. The second observation is that P e ( U sq ) initially increased with . To explain this, let us note that Inequality (17) does not guarantee that P e is decreasing function of . Hence, it is possible that P e increases in certain directions, although Theorem 2 guarantees that P e tends to zero, provided that signal norm tends to infinity. Step response Optimal input response

Comparison with the Average D-Optimal Design
Classical methods of signal design for parameter identification use various functionals of the Fisher information matrix as a utility function. One of the most popular is D-optimal design, which consists of finding a signal that maximizes the determinant of the information matrix (see [10][11][12] and the review article [9]). These methods are well-suited to models that are linear in their parameters. Unfortunately, in typical identification and discrimination tasks, the output is a non-linear function of the parameters and the information matrix depends on unknown parameters to be identified. One of the possibilities for avoiding this problem is the averaging of the utility function over the prior parameter distribution. This method is called average D-optimal design (see [14,26] and [9] (Sections 5.3.5 and 6), for details). The Bayesian design, described in the previous sections, will be compared with the average D-optimal design. To that end, let us consider a finite family of linear models (see also [12] (pp. 91-93)) where θ ∈ {1, 2, 3, 4}, a θ = 0. N(0, 1). The prior distribution of θ is uniform (i.e., p 0,θ = 0.25). The state space representation of (40) has the form which is consistent with (29) and (30). The Fisher information matrix is given by where d k = (ξ k , η k ) T and ξ k = ∂y k ∂a θ , η k = ∂y k ∂b θ , denote the sensitivity of the output y k to changes in parameters a and b, respectively. The derivatives ξ k and η k fulfil the sensitivity equations with zero initial conditions. The average D-optimal design consists in finding a signal U that maximizes the expectation of the determinant of the information matrix (see [9] (Sections 5.3.5 and 6), [14], [11] (Chapter 6), and [12] for details). Hence, the utility function to be maximized has the form with the energy constraints given by (3). Maximization of the utility function (46) has been performed for various signal energies and the error probability of the MAP estimator was estimated by Monte Carlo with 10 5 trials. The same procedure was repeated using Bayesian design for (41) and (42).
The results are shown in Figure 3. The error rate of Bayesian method is significantly smaller when compared to D-optimal design, at least in this example. In particular, the signal shown in the upper-right part of Figure 3 gives an error probability approximately three times smaller than that of D-optimal signal, although the energy of both signals was the same.

Possible Extensions of the Results
In this section, we will briefly discuss some possible extensions of the results to an infinite set of parameters and beyond linear and Gaussian models.

Non-Linear Models
Although the article refers to linear models, it is possible to extend the results to non-linear models of the form where the conditional density of variable Z is given by and S θ (U) > 0, for all U ∈ U ad , θ ∈ {1, ..., r}. Under these assumptions, the density of Y still remains a Gaussian mixture and the information lower bound takes the form where (50)

Non-Gaussian Models
If p(Z|θ, U) is non-Gaussian distribution, then it is possible, on the basis of Equation (10) in [24], to construct an information lower bound of the form where is the Chernoff α-divergence and α ∈ [0, 1]. Unfortunately, calculation of (52) is difficult if n Y is large.

Infinite Set of Parameters
Let us consider following model: where p(Z|θ) = N(Z, 0, S(θ)) and θ ∈ R p . If we assume that prior density of θ is Gaussian; that is, and p(Y) can be approximated by a finite Gaussian mixture where p 0,j 0 and N a ∑ j=1 p 0,j = 1. It's possible to calculate weights and nodes in (56) by using multidimensional quadrature. If n θ is large, then an appropriate sparse grid should be used.
To illustrate the method, we will show only a simple, second-order quadrature with 2n θ points.

Lemma 3.
The approximate value of the integral J( f ) = N(θ, m θ , S θ ) f (θ)dθ is given by where and e i is i th basis vector. If f (θ) = 1 2 θ T Aθ + b T θ + c, then equality holds in (57).
Application of Lemma 3 to (56) gives p 0,j = p 0 = (2n θ ) −1 , N a = 2n θ . Now, since (56) is a Gaussian mixture, the results of Section 2 can be utilized and the information lower bound takes the form where D i,j and θ j are given by (15), (16), and (58), respectively, and F j = F(θ j ), S j = S(θ j ). The approximate solution of (12) can be found by maximization of (59) with the constraints (3).

Discussion and Conclusions
An effective Bayesian design method for linear dynamical model discrimination has been developed. The discrimination task is formulated as an estimation problem, where the estimated parameter θ indexes particular models. To overcome the computational complexity, instead of I(Y; θ|U), its lower bound (I l (U)), proposed by Kolchinsky & Tracey [24], was used as a utility function. This bound is especially useful, as it is differentiable, tight, and reaches the maximum available information H(θ). It has been proved, on the basis of the results of Feder & Merhav [20], that the error probability of the MAP estimator (see Lemma 2) is upper bounded by 1 2 (H(θ) − I l (U)) log 2 e. The maximization of I l (U) has been considered under the signal energy constraint, but other kinds of constraints can also be easily implemented. It was shown that the maximization of I l (U), in the case of two parameters, is equivalent to maximization of a quadratic form on the sphere (see also [3]). Next, the small energy limit was analyzed. It was proved that the solution is given by an eigenvector corresponding to maximal eigenvalue of some Hermitian matrix. This result can serve as starting point for numerical maximization of I l (U). If the energy of the signal tends to infinity, then almost all (in the sense of the Lebesgue measure) signals generate maximum information, provided that the impulse responses of the models are pairwise different. Under these conditions, it was proved that P e of the MAP estimator tends to zero.
An example of discrimination of three stochastic models with different structures was given. It is easy to observe, from Figure 1, that, in contrast to the step signal, the optimal signal clearly distinguished the systems, although the energy of both signals was the same. The P e of the MAP estimator was calculated by Monte Carlo simulation. It was observed that the square signal gave an error probability several thousand times greater than the optimal signal with the same energy. Hence, we conclude that the error probability and the accuracy of MAP estimator depends very strongly on the excitation signal. Although Theorem 2 implies that lim →∞ P e ( U) = 0 for almost all U, there exist signals such that P e ( U) is locally increasing. This is the case of a high-frequency square signal, as illustrated in Figure 2.
It was shown, in Section 5 (see Figure 3), that P e of the MAP estimator corresponding to Bayesian design was a few times smaller than P e generated by D-optimal design, at least in the analyzed example. This result suggests that Bayesian design can be applied to non-linear problems and that it is superior to classical D-optimal design. Some extensions of the results to the infinite set of parameters and beyond linear and Gaussian assumptions were briefly discussed in Section 6. Extension to non-linear models seems to be easy, but the non-Gaussian case is difficult and a deeper analysis is required. The case of an infinite set of parameters was discussed in Section 6.3. It was shown that the measurement density can be approximated by a finite Gaussian mixture, after which the results of Section 2 could be directly applied. The general conclusion of this analysis is that the information bounds can be easily constructed, as long as the measurement density is a mixture of Gaussian distributions.
An analytical gradient formula must be provided for the effective numerical maximization of I l (U). The matrix inversions and the determinants in (15) and (16) should be calculated by SVD. To reduce the computational complexity and required memory resources, the symmetries appearing in (15) and (16), and the fact that D i,i = 0, should be utilized. The determinants in (15) and the matrices F i , S i can be calculated off-line, but the matrices Q i,j may require too much memory if N and r are large. Therefore, Q i,j was calculated on-line.
Applications of the presented methods in dual control problems [5] are expected as part of our future work. An additional area in applications is the issue of automated testing of dynamical systems.
Funding: This research was financed from the statutory subsidy of the AGH University of Science and Technology, No. 16.16.120.773.