Bayesian Input Design for Linear Dynamical Model Discrimination

Department of Automatic Control and Robotics, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Krakow, Poland
Entropy 2019, 21(4), 351; https://doi.org/10.3390/e21040351
Submission received: 16 January 2019 / Revised: 12 March 2019 / Accepted: 27 March 2019 / Published: 30 March 2019
(This article belongs to the Section Signal and Data Analysis)

Abstract

A Bayesian design of the input signal for linear dynamical model discrimination has been proposed. The discrimination task is formulated as an estimation problem, where the estimated parameter indexes particular models. As the mutual information between the parameter and model output is difficult to calculate, its lower bound has been used as a utility function. The lower bound is then maximized under the signal energy constraint. Selection between two models and the small energy limit are analyzed first. The solution of these tasks is given by the eigenvector of a certain Hermitian matrix. Next, the large energy limit is discussed. It is proved that almost all (in the sense of the Lebesgue measure) high energy signals generate the maximum available information, provided that the impulse responses of the models are different. The first illustrative example shows that the optimal signal can significantly reduce error probability, compared to the commonly-used step or square signals. In the second example, Bayesian design is compared with classical average D-optimal design. It is shown that the Bayesian design is superior to D-optimal design, at least in this example. Some extensions of the method beyond linear and Gaussian models are briefly discussed.

1. Introduction

Discrimination between various dynamical models of the same process has a wide range of applications, especially in multiple-model fault detection and isolation [1,2,3,4]; in many other estimation and control problems [5,6,7], it is also necessary to choose the most likely dynamical model from a finite set. The discrimination task can be formulated as an estimation problem, where the estimated parameter θ indexes particular models. The problem can also be considered as a finite-dimensional approximation of more general identification tasks [8]. As the error probability or the variance of the estimator of θ usually depends on the input signal, it is important to select a signal that minimizes the error probability or maximizes a utility function that encodes the purpose of the experiment. Selection of an input signal that maximizes a utility function is strongly related to optimal experimental design [9].
Experimental design methods can be divided into classical and Bayesian. The classical methods, also called optimal experimental design, typically use various functionals of the Fisher information matrix as a utility function. These methods are widely described in the literature and work well if the model is linear in its parameters (see [8,10,11,12] and the review article [9]). Unfortunately, in typical identification tasks, the solution of the model equation and the covariance depend non-linearly on θ, even if the model equation is linear. This implies that the information matrix and the utility function depend on the parameter θ to be estimated. Therefore, only a locally-optimal design can be obtained [13]. To obtain more robust methods, averaging over the prior parameter distribution or minimax design [8,14] (Section 6.1) is commonly used, but these methods are not fully Bayesian.
Bayesian optimal design uses a utility function that is a functional of the posterior distribution (see [15,16] and the review articles [13,17,18]). The most commonly used utility functions are the mutual information between the parameters and the model output, the Kullback-Leibler divergence between the prior and posterior distributions, and the determinant of the posterior covariance matrix [13,16]. In contrast to classical methods, in Bayesian design the utility function does not depend on the parameters to be estimated. Hence, the method can cope with non-linear problems. A utility function that is suitable for model discrimination is the error probability of the MAP estimator of θ [19]. Such a utility function is generally difficult to calculate (see [19]), but the result of Feder & Merhav [20] implies that the error probability of the MAP estimator is upper-bounded by some decreasing function of the mutual information between θ and the output of the system. Hence, the maximization of mutual information creates the possibility of reducing the error probability, provided that an appropriate estimator is used. However, the most serious problem that inhibits the development of this idea is the great computational complexity of calculating the mutual information.
The main contribution of this article is a fully-Bayesian (in the terminology of [13]) method for finding an input signal that maximizes the mutual information between θ and the system output. Maximization of information or, equivalently, maximization of the output entropy has been proposed by many authors (see, e.g., [13,15,17,18,21,22,23]), but the mutual information is very hard to compute and the problem is often intractable. To overcome this serious difficulty, instead of the mutual information, its lower bound, given by Kolchinsky & Tracey [24], has been used. This is a pairwise-distance-based entropy estimator and it is useful here, since it is differentiable, tight, and asymptotically reaches the maximum possible information (see [24] (Sections 3.2, 4, 6)). Maximization of such a lower bound, under the signal energy (i.e., the square of the signal norm) constraint, is much simpler, gives satisfactory solutions, and allows for a practical implementation of the idea of maximizing information. This is illustrated with examples. Moreover, it is shown that, for certain cases, this problem reduces to the solution of a certain eigenproblem.
The article is organized as follows. In Section 2, the estimation task is formulated and the upper bound of the error probability and the lower bound of the mutual information are given. In Section 2.1, a selection between two models is discussed and an exact solution is given. Design of input signals with small energy, which is required in some applications, is described in Section 2.2. In Section 2.3, the large energy limit is discussed. An application to linear dynamical systems with unknown parameters is given in Section 3. An example of finding the most likely model among three stochastic models with different structures is given in Section 4. Comparison with classical D-optimal design is performed in Section 5. The article ends with conclusions and references.

2. Maximization of Mutual Information between the System Output and Parameter

Let us consider a family of linear models
$$Y = F_\theta U + Z,$$
where $\theta \in \{1, 2, \ldots, r\}$, $Y, Z \in \mathbb{R}^{n_Y}$, and $U \in \mathbb{R}^{n_U}$. The matrices $F_\theta$ are bounded. The parameter $\theta$ is unknown. The prior distribution of $\theta$ is given by
$$P(\theta = i) = p_{0,i}, \quad i = 1, \ldots, r.$$
The random variable Z is conditionally normal (i.e., $p(Z|\theta) = \mathcal{N}(Z, 0, S_\theta)$), where the covariance matrices $S_\theta$ are given a priori and $S_\theta > 0$, for all $\theta$. The variable U is called the input signal. In all formulas below, the input signal U is a deterministic variable. The set of admissible signals is given by
$$S_\varrho = \{U \in \mathbb{R}^{n_U} :\; U^T U \le \varrho\}.$$
Under these assumptions, and after applying Bayes' rule:
$$p(Y|U) = \sum_{\theta=1}^{r} p_{0,\theta}\,\mathcal{N}(Y, F_\theta U, S_\theta),$$
$$p(Y|\theta, U) = \mathcal{N}(Y, F_\theta U, S_\theta),$$
$$p(\theta|Y, U) = \frac{p_{0,\theta}\,\mathcal{N}(Y, F_\theta U, S_\theta)}{\sum_{j=1}^{r} p_{0,j}\,\mathcal{N}(Y, F_j U, S_j)}.$$
The entropies of Y and θ and the conditional entropies are defined as
$$H(\theta) = -\sum_{\theta=1}^{r} p_{0,\theta}\ln p_{0,\theta},$$
$$H(\theta|Y, U) = -\int p(Y|U)\sum_{\theta=1}^{r} p(\theta|Y, U)\ln p(\theta|Y, U)\,dY,$$
$$H(Y|U) = -\int p(Y|U)\ln p(Y|U)\,dY,$$
$$H(Y|\theta) = \frac{1}{2}\sum_{\theta=1}^{r} p_{0,\theta}\ln\big((2\pi e)^{n_Y}|S_\theta|\big).$$
The mutual information between θ and Y is defined as (see [25] (pp. 19, 250))
$$I(Y;\theta|U) = H(\theta|U) - H(\theta|Y, U) = H(Y|U) - H(Y|\theta, U).$$
As $H(\theta|U) = H(\theta)$ and $H(Y|\theta, U) = H(Y|\theta)$, then $I(Y;\theta|U)$ is given by
$$I(Y;\theta|U) = H(\theta) - H(\theta|Y, U) = H(Y|U) - H(Y|\theta).$$
The MAP estimator of θ is defined as
$$\hat{\theta}(Y, U) = \arg\max_{\theta\in\{1,\ldots,r\}} p(\theta|Y, U).$$
The error probability of $\hat{\theta}$ is given by (see [20])
$$P_e(U) = 1 - \int \max_{\theta\in\{1,\ldots,r\}} p(\theta|Y, U)\, p(Y|U)\,dY.$$
It follows from Fano's inequality ([25] (p. 38)) that $P_e$ is lower bounded by an increasing function of $H(\theta|Y, U)$. Feder & Merhav [20] proved that $2P_e(U) \le H(\theta|Y, U)\log_2 e$. As $H(\theta|Y, U) = H(\theta) - I(Y;\theta|U)$ and $H(\theta)$ does not depend on U, the maximization of $I(Y;\theta|U)$ creates the possibility of reducing $P_e$, and the optimal signal is given by
$$U^*(\varrho) = \arg\max_{U\in S_\varrho} I(Y;\theta|U).$$
To overcome the problems associated with the calculation of I ( Y ; θ | U ) , we will use its lower bound.
Lemma 1. 
(Information bounds). For all $U \in \mathbb{R}^{n_U}$,
$$I_l(U) \le I(Y;\theta|U) \le H(\theta),$$
where
$$I_l(U) = -\sum_{i=1}^{r} p_{0,i}\ln\sum_{j=1}^{r} p_{0,j}\,e^{-D_{i,j}(U)},$$
$$D_{i,j}(U) = \frac{1}{4}U^T Q_{i,j} U + \frac{1}{2}\ln\left|\tfrac{1}{2}(S_i + S_j)\right| - \frac{1}{4}\ln\big(|S_i|\,|S_j|\big), \quad \text{and}$$
$$Q_{i,j} = (F_i - F_j)^T (S_i + S_j)^{-1}(F_i - F_j).$$
Proof. 
According to (4), $p(Y|U)$ is a finite Gaussian mixture. For such mixtures, the information bounds are known. A detailed proof, based on the Chernoff α-divergence, is given in [24] (Section 4). □
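For concreteness, the bound of Lemma 1 can be evaluated numerically along the following lines. This is a minimal illustrative sketch, not code from the original work; the function and variable names are chosen here for readability.

```python
import numpy as np

def pairwise_exponent(U, F_i, F_j, S_i, S_j):
    """Exponent D_ij(U) of Lemma 1: quadratic term plus log-determinant terms."""
    S_sum = S_i + S_j
    dF_U = (F_i - F_j) @ U
    quad = 0.25 * dF_U @ np.linalg.solve(S_sum, dF_U)    # (1/4) U^T Q_ij U
    log_det_mid = np.linalg.slogdet(0.5 * S_sum)[1]      # ln |(S_i + S_j)/2|
    log_det_i = np.linalg.slogdet(S_i)[1]
    log_det_j = np.linalg.slogdet(S_j)[1]
    return quad + 0.5 * log_det_mid - 0.25 * (log_det_i + log_det_j)

def information_lower_bound(U, F, S, p0):
    """Lower bound I_l(U) of Lemma 1; F, S are lists of F_theta, S_theta, p0 is the prior."""
    r = len(F)
    value = 0.0
    for i in range(r):
        inner = sum(p0[j] * np.exp(-pairwise_exponent(U, F[i], F[j], S[i], S[j]))
                    for j in range(r))
        value -= p0[i] * np.log(inner)
    return value
```

Since $D_{i,j}(U) = D_{j,i}(U)$ and $D_{i,i}(U) = 0$, the exponents can be computed once per pair and reused when r is large.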
Lemma 2.
Let $\hat{\theta}(Y, U) = \arg\max_{\theta\in\{1,\ldots,r\}} p(\theta|Y, U)$ be the MAP estimator of θ, and let $P_e(U)$ denote its error probability. There exists a continuous, increasing, and concave function $f: [0, H(\theta)] \to [0, 1 - r^{-1}]$, such that
$$P_e(U) \le f\big(H(\theta) - I_l(U)\big) \le \tfrac{1}{2}\big(H(\theta) - I_l(U)\big)\log_2 e.$$
Proof. 
Feder & Merhav [20] (see Theorem 1 and Equation (14)) proved that there exists an increasing, continuous, and convex function $\phi: [0, 1 - r^{-1}] \to [0, H(\theta)\log_2 e]$, such that
$$2P_e(U) \le \phi\big(P_e(U)\big) \le H(\theta|Y, U)\log_2 e.$$
As $H(\theta|Y, U) = H(\theta) - I(Y;\theta|U)$ and $I_l(U) \le I(Y;\theta|U)$, then $\phi(P_e(U)) \le (H(\theta) - I_l(U))\log_2 e$. The function $g = \phi^{-1}$ is increasing, continuous, and concave, and it follows from (18) that $2g(\eta) \le \eta$. Hence, $P_e(U) \le \phi^{-1}\big((H(\theta) - I_l(U))\log_2 e\big) = g\big((H(\theta) - I_l(U))\log_2 e\big) \le \tfrac{1}{2}\big(H(\theta) - I_l(U)\big)\log_2 e$. Taking $f(\eta) = g(\eta\log_2 e)$, we obtain the result. □
Now, the approximate solution of (12) is given by
$$U^*(\varrho) = \arg\max_{U\in S_\varrho} I_l(U).$$
As $I_l$ is smooth and $S_\varrho$ is compact, (19) is well-defined.

2.1. Selection between Two Models

Suppose that θ takes only two values, 1 and 2, with prior probabilities $p_{0,1}$ and $p_{0,2} = 1 - p_{0,1}$, respectively. It is easy to check, by direct calculation, that
$$e^{-I_l(U)} = \big(p_{0,1} + p_{0,2}\,e^{-D_{1,2}(U)}\big)^{p_{0,1}}\big(p_{0,1}\,e^{-D_{1,2}(U)} + p_{0,2}\big)^{p_{0,2}}.$$
Equation (20) implies that the maximization of $I_l$ is equivalent to the maximization of $D_{1,2}$. On the basis of (15), we have the following optimization task:
$$\max_{U^T U \le \varrho} U^T Q_{1,2} U.$$
The solution of (21) is the eigenvector of $Q_{1,2}$ corresponding to its largest eigenvalue; that is,
$$Q_{1,2}U^* = \lambda_{\max}(Q_{1,2})\,U^*, \qquad \|U^*\|^2 = \varrho.$$
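The eigenvector construction can be coded directly; the sketch below is illustrative only (names chosen here), assuming dense matrices of moderate size.

```python
import numpy as np

def optimal_signal_two_models(F1, F2, S1, S2, energy):
    """Optimal input for two models: top eigenvector of Q_{1,2}, scaled so that U^T U = rho."""
    dF = F1 - F2
    Q12 = dF.T @ np.linalg.solve(S1 + S2, dF)   # Q_{1,2} of Lemma 1
    Q12 = 0.5 * (Q12 + Q12.T)                   # symmetrize against round-off
    eigenvalues, eigenvectors = np.linalg.eigh(Q12)
    u = eigenvectors[:, -1]                     # eigenvector of the largest eigenvalue
    return np.sqrt(energy) * u
```

The sign of the eigenvector is immaterial, because $I_l$ depends on U only through quadratic forms.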

2.2. Small Energy Limit

In many practical applications, the energy of an excitation signal must be small. The second-order Taylor expansion of (14) gives
$$I_l(U) = \frac{1}{4}U^T Q U - \sum_{i=1}^{r} p_{0,i}\ln\sum_{j=1}^{r}\alpha_{i,j} + o(\|U\|^2),$$
where
$$Q = \sum_{i=1}^{r} p_{0,i}\,\frac{\sum_{j=1}^{r}\alpha_{i,j}Q_{i,j}}{\sum_{j=1}^{r}\alpha_{i,j}}, \quad \text{and}$$
$$\alpha_{i,j} = p_{0,j}\,e^{-D_{i,j}(0)}.$$
If the value of ϱ (see (3)) is small, then the last term in (23) can be omitted. As the second term does not depend on U, we have the following optimization task:
$$\max_{U^T U \le \varrho} U^T Q U.$$
The solution of (26) is the eigenvector of Q corresponding to its largest eigenvalue.
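The same recipe in code form; a sketch written for this text under the stated small-energy approximation, with helper names chosen here.

```python
import numpy as np

def small_energy_signal(F, S, p0, energy):
    """Small-energy design: top eigenvector of the prior-weighted matrix Q defined above."""
    r = len(F)

    def D0(i, j):
        # D_{i,j}(0): only the log-determinant part of the exponent remains at U = 0
        return (0.5 * np.linalg.slogdet(0.5 * (S[i] + S[j]))[1]
                - 0.25 * (np.linalg.slogdet(S[i])[1] + np.linalg.slogdet(S[j])[1]))

    def Qij(i, j):
        dF = F[i] - F[j]
        return dF.T @ np.linalg.solve(S[i] + S[j], dF)

    nU = F[0].shape[1]
    Q = np.zeros((nU, nU))
    for i in range(r):
        alpha = np.array([p0[j] * np.exp(-D0(i, j)) for j in range(r)])
        Q += p0[i] * sum(alpha[j] * Qij(i, j) for j in range(r)) / alpha.sum()
    _, eigenvectors = np.linalg.eigh(0.5 * (Q + Q.T))
    return np.sqrt(energy) * eigenvectors[:, -1]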

2.3. Large Energy Limit

We will investigate the asymptotic behaviour of $I(Y;\theta|U)$ as $\|U\| \to \infty$. On the basis of Lemma 1, the condition
$$\min_{i\ne j} U^T Q_{i,j} U > 0$$
guarantees that $\lim_{\varrho\to\infty} I(Y;\theta|\varrho U) = H(\theta)$. It is also possible that $\lim_{\varrho\to\infty} I(Y;\theta|\varrho U) < H(\theta)$ for some U. Such signals are weakly informative and they cannot generate the maximum information, even if their amplitude tends to infinity. Let $S_1$ denote the unit ball in $\mathbb{R}^{n_U}$ and let $\mu$ be the Lebesgue measure on $S_1$. The set of weakly informative signals is defined as
$$\Omega = \{U \in S_1 :\; \lim_{\varrho\to\infty} I(Y;\theta|\varrho U) < H(\theta)\}.$$
Theorem 1.
$\mu(\Omega) = 0$ if and only if $F_i \ne F_j$, for all $i \ne j$.
Proof. 
($\Leftarrow$): On the basis of Lemma 1 and (28), $\Omega = \bigcup_{i\ne j}\Omega_{i,j}$, where $\Omega_{i,j} = \{\xi \in S_1 : \xi^T Q_{i,j}\xi = 0\}$. Since $S_i + S_j$ is positive-definite and $F_i \ne F_j$ then, on the basis of (16), the matrix $Q_{i,j}$ has at least one positive eigenvalue. Hence, $\mu(\Omega_{i,j}) = 0$ and $\mu(\Omega) \le \sum_{i\ne j}\mu(\Omega_{i,j}) = 0$. ($\Rightarrow$): The condition $\mu(\Omega) = 0$ implies that $\mu(\Omega_{i,j}) = 0$. Hence, $Q_{i,j}$ has at least one positive eigenvalue, which is possible only if $F_i \ne F_j$. □
As a conclusion, we have the following result.
Theorem 2.
Let $\hat{\theta}(Y, U) = \arg\max_{\theta\in\{1,\ldots,r\}} p(\theta|Y, U)$ be the MAP estimator of θ and let $P_e(U)$ denote its error probability. If $F_i \ne F_j$ for all $i \ne j$, then, for any $\epsilon > 0$, there exists a number $\varrho > 0$ and a signal $U \in S_\varrho$, such that $P_e(U) < \epsilon$.
Proof. 
By the assumption, and from Theorem 1, the set $S_1 \setminus \Omega$ is non-empty. If $U \in S_1 \setminus \Omega$, then $\min_{i\ne j} U^T Q_{i,j} U > 0$ and, from Lemma 1, we get $\lim_{\varrho\to\infty} I_l(\varrho U) = \lim_{\varrho\to\infty} I(Y;\theta|\varrho U) = H(\theta)$. Now, Lemma 2 implies that $2P_e(\varrho U) \le (H(\theta) - I_l(\varrho U))\log_2 e < \epsilon$, for sufficiently large $\varrho$. □

3. Application to Linear Dynamical Systems

Consider, now, the family of linear systems
$$x_{k+1} = A_\theta x_k + B_\theta u_k + G_\theta w_k, \quad k = 0, 1, 2, \ldots, N-1,$$
$$y_k = C_\theta x_k + D_\theta v_k, \quad k = 1, 2, \ldots, N,$$
where the prior distribution of $\theta$ is given by (2), $x_k \in \mathbb{R}^{n}$, $y_k \in \mathbb{R}^{m}$, $w_k \in \mathbb{R}^{n_w}$, $v_k \in \mathbb{R}^{m}$, $w_k \sim \mathcal{N}(0, I_{n_w})$, and $v_k \sim \mathcal{N}(0, I_m)$. The variables $w_0, \ldots, w_{N-1}, v_1, \ldots, v_N$ are mutually independent. The initial condition is zero. The solution of (29) with initial condition $x_0 = 0$ has the form
$$x_k = \sum_{i=0}^{k-1} A_\theta^{\,k-i-1} B_\theta u_i + \sum_{i=0}^{k-1} A_\theta^{\,k-i-1} G_\theta w_i.$$
If we denote $X = \mathrm{col}(x_1, \ldots, x_N)$, $Y = \mathrm{col}(y_1, \ldots, y_N)$, $U = \mathrm{col}(u_0, \ldots, u_{N-1})$, $W = \mathrm{col}(w_0, \ldots, w_{N-1})$, and $V = \mathrm{col}(v_1, \ldots, v_N)$, then (31) and (30) can be rewritten as
$$X = \mathbf{B}_\theta U + \mathbf{G}_\theta W, \quad \text{and}$$
$$Y = \mathbf{C}_\theta X + \mathbf{D}_\theta V,$$
where the forms of the block matrices $\mathbf{B}_\theta$, $\mathbf{G}_\theta$, $\mathbf{C}_\theta$, and $\mathbf{D}_\theta$ follow from (30) and (31). The variables W and V are independent, with $W \sim \mathcal{N}(0, I_{N n_w})$ and $V \sim \mathcal{N}(0, I_{N m})$. Substituting (32) into (33), we get Equation (1), where $F_\theta = \mathbf{C}_\theta\mathbf{B}_\theta$ and $Z = \mathbf{C}_\theta\mathbf{G}_\theta W + \mathbf{D}_\theta V$. The conditional density of Z has the form $p(Z|\theta) = \mathcal{N}(Z, 0, S_\theta)$, where the covariance matrix is given by
$$S_\theta = \mathbf{D}_\theta\mathbf{D}_\theta^T + \mathbf{C}_\theta\mathbf{G}_\theta\mathbf{G}_\theta^T\mathbf{C}_\theta^T.$$
Hence, the results of Section 2 can be applied to the dynamical system (29) and (30).
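As an illustration, the stacked matrices can be assembled as in the sketch below. This is a straightforward (not optimized) implementation written for this text; the names are chosen for readability.

```python
import numpy as np

def stacked_model(A, B, G, C, D, N):
    """Build F_theta and S_theta of Section 3 for one model (A, B, G, C, D) over N steps."""
    n, nu = A.shape[0], B.shape[1]
    m, nw = C.shape[0], G.shape[1]
    # Block-Toeplitz maps from the stacked input U and noise W to the stacked state X
    Bb = np.zeros((N * n, N * nu))
    Gb = np.zeros((N * n, N * nw))
    for k in range(1, N + 1):          # state x_k, k = 1, ..., N
        for i in range(k):             # input/noise index i = 0, ..., k-1
            Ak = np.linalg.matrix_power(A, k - i - 1)
            Bb[(k - 1) * n:k * n, i * nu:(i + 1) * nu] = Ak @ B
            Gb[(k - 1) * n:k * n, i * nw:(i + 1) * nw] = Ak @ G
    Cb = np.kron(np.eye(N), C)         # block-diagonal output map
    Db = np.kron(np.eye(N), D)         # block-diagonal measurement-noise map
    F = Cb @ Bb
    S = Db @ Db.T + Cb @ Gb @ Gb.T @ Cb.T
    return F, S
```

Applying this routine to the (discretized) matrices of every candidate model yields the lists of $F_\theta$ and $S_\theta$ used by the lower-bound sketch given after Lemma 1.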

4. Example

In some fault detection and isolation problems [3,4], there is a need to determine which of the known models of the process is the most adequate. It is, therefore, important to find a signal that emphasizes the differences between these various models. As an example of this type of problem, let us consider three stochastic continuous-time models:
$$dx = (A_\theta x + B_\theta u)\,dt + G_\theta\,dw,$$
where $\theta \in \{1, 2, 3\}$, $x(t) \in \mathbb{R}^\theta$, $x(0) = 0$, $u(t), w(t) \in \mathbb{R}$, w is a standard Wiener process, and
$$A_1 = -1, \quad B_1 = 1, \quad G_1 = 0.05,$$
$$A_2 = \begin{bmatrix} 0 & 1 \\ -3 & -2.5 \end{bmatrix}, \quad B_2 = \begin{bmatrix} 0 \\ 3 \end{bmatrix}, \quad G_2 = \begin{bmatrix} 0 \\ 0.05 \end{bmatrix},$$
$$A_3 = \begin{bmatrix} 0 & 1 & 0 \\ -3 & -3.5 & 1 \\ 0 & 0 & -10 \end{bmatrix}, \quad B_3 = \begin{bmatrix} 0 \\ 0 \\ 30 \end{bmatrix}, \quad G_3 = \begin{bmatrix} 0 \\ 0 \\ 0.05 \end{bmatrix}.$$
The step responses of these models are similar, and they are difficult to distinguish experimentally from each other if the noise level is significant. The observation equation has the form
$$y_k = x_1(t_k) + 0.05\,v_k, \quad k = 1, 2, \ldots, N,$$
where $v_k \sim \mathcal{N}(0, 1)$, $t_k = kT_0$, $T_0 = 0.1$, and $x_1$ is the first component of $x(t)$. If $x_k = x(t_k)$ and $u(t) = u_k$ for $t \in [t_{k-1}, t_k)$, then, after discretization, the state $x_k$ and the output $y_k$ are described by (29) and (30), with appropriate matrices $A_\theta$, $B_\theta$, $C_\theta$, $G_\theta$, and $D_\theta$. The matrices $F_\theta$ and $S_\theta$ are calculated by using (31)–(34). Let us observe that, although the orders of the systems are different, the size of both $F_\theta$ and $S_\theta$ is always $N \times N$. We are interested in the maximization of $I_l(U)$. The solutions of (19) and (26), with a uniform prior, $\varrho = N$, and $N = 200$ steps, are shown in the upper part of Figure 1. The step responses and the optimal responses are shown in the bottom part of Figure 1.
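One possible numerical route to such solutions (a sketch written for this text, not the author's code): build $F_\theta$ and $S_\theta$ for the discretized models as in Section 3 and maximize $I_l$ over the energy ball with a generic constrained optimizer. The helper information_lower_bound refers to the sketch given after Lemma 1 and is assumed to be in scope.

```python
import numpy as np
from scipy.optimize import minimize

def maximize_lower_bound(F, S, p0, energy, U0=None, seed=0):
    """Maximize I_l(U) subject to U^T U <= rho (local method; random restarts are advisable)."""
    nU = F[0].shape[1]
    rng = np.random.default_rng(seed)
    if U0 is None:
        U0 = rng.standard_normal(nU)
        U0 *= np.sqrt(energy) / np.linalg.norm(U0)
    constraint = {'type': 'ineq', 'fun': lambda U: energy - U @ U}   # energy budget
    # information_lower_bound: the sketch given after Lemma 1 (assumed in scope)
    result = minimize(lambda U: -information_lower_bound(U, F, S, p0),
                      U0, method='SLSQP', constraints=[constraint],
                      options={'maxiter': 500})
    return result.x
```

Because $I_l$ is not concave in general, the small-energy eigenvector is a natural starting point U0, and a few random restarts help to avoid poor local maxima.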
Let us observe that, in contrast to the step signal, the optimal signal clearly distinguishes the systems—although the energy of all input signals was the same and equal to N.
Let $U_{st}, U_{sq} \in S_1$ denote the normalized step signal and the normalized square signal with a period of three, respectively, and let $U^*(\varrho)$ denote the optimal signal. To check the validity of the results, the error probabilities $P_e(\varrho U_{st})$, $P_e(\varrho U_{sq})$, and $P_e(U^*(\varrho))$ were estimated by Monte Carlo simulation with $10^6$ trials and $N = 50$ steps. The results are shown in Figure 2. It was observed that the optimal signal gives an error probability several thousand times smaller than the step or square signal with the same energy. The second observation is that $P_e(\varrho U_{sq})$ initially increased with $\varrho$. To explain this, let us note that Inequality (17) does not guarantee that $P_e$ is a decreasing function of $\varrho$. Hence, it is possible that $P_e$ increases in certain directions, although Theorem 2 guarantees that $P_e$ tends to zero, provided that the signal norm tends to infinity.
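The Monte Carlo check can be reproduced along the following lines; a sketch written for this text, using the plain MAP rule of Section 2, with names chosen here.

```python
import numpy as np

def map_error_probability(U, F, S, p0, trials=100_000, seed=0):
    """Monte Carlo estimate of the error probability P_e(U) of the MAP estimator."""
    rng = np.random.default_rng(seed)
    r = len(F)
    chol = [np.linalg.cholesky(S[i]) for i in range(r)]
    S_inv = [np.linalg.inv(S[i]) for i in range(r)]
    log_det = [np.linalg.slogdet(S[i])[1] for i in range(r)]
    mean = [F[i] @ U for i in range(r)]
    errors = 0
    for _ in range(trials):
        theta = rng.choice(r, p=p0)                       # draw the true model from the prior
        Y = mean[theta] + chol[theta] @ rng.standard_normal(len(mean[theta]))
        # unnormalized log-posterior of each candidate model
        log_post = [np.log(p0[i]) - 0.5 * log_det[i]
                    - 0.5 * (Y - mean[i]) @ S_inv[i] @ (Y - mean[i])
                    for i in range(r)]
        errors += int(np.argmax(log_post) != theta)
    return errors / trials
```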

5. Comparison with the Average D-Optimal Design

Classical methods of signal design for parameter identification use various functionals of the Fisher information matrix as a utility function. One of the most popular is D-optimal design, which consists of finding a signal that maximizes the determinant of the information matrix (see [10,11,12] and the review article [9]). These methods are well-suited to models that are linear in their parameters. Unfortunately, in typical identification and discrimination tasks, the output is a non-linear function of the parameters and the information matrix depends on unknown parameters to be identified. One of the possibilities for avoiding this problem is the averaging of the utility function over the prior parameter distribution. This method is called average D-optimal design (see [14,26] and [9] (Sections 5.3.5 and 6), for details). The Bayesian design, described in the previous sections, will be compared with the average D-optimal design. To that end, let us consider a finite family of linear models (see also [12] (pp. 91–93))
$$y_k = \frac{b_\theta z^{-1}}{1 - a_\theta z^{-1}}\,u_k + \sigma_v v_k,$$
where $\theta \in \{1, 2, 3, 4\}$, $a_\theta = 0.6 + 0.1(\theta - 1)$, $b_\theta = 1 - a_\theta$, $\sigma_v = 0.1$, and $v_k \sim \mathcal{N}(0, 1)$. The prior distribution of θ is uniform (i.e., $p_{0,\theta} = 0.25$). The state-space representation of (40) has the form
$$x_{k+1} = a_\theta x_k + b_\theta u_k,$$
$$y_k = x_k + \sigma_v v_k,$$
which is consistent with (29) and (30). The Fisher information matrix is given by
$$M_F(\theta, U) = \frac{1}{N\sigma_v^2}\sum_{k=1}^{N} d_k d_k^T,$$
where $d_k = (\xi_k, \eta_k)^T$, and $\xi_k = \frac{\partial y_k}{\partial a_\theta}$ and $\eta_k = \frac{\partial y_k}{\partial b_\theta}$ denote the sensitivities of the output $y_k$ to changes in the parameters a and b, respectively. The derivatives $\xi_k$ and $\eta_k$ fulfil the sensitivity equations
$$\xi_k = 2a_\theta\xi_{k-1} - a_\theta^2\xi_{k-2} + b_\theta u_{k-2},$$
$$\eta_k = a_\theta\eta_{k-1} + u_{k-1}, \quad k = 1, 2, \ldots, N,$$
with zero initial conditions. The average D-optimal design consists in finding a signal U that maximizes the expectation of the determinant of the information matrix (see [9] (Sections 5.3.5 and 6), [11,14] (Chapter 6), and [12] for details). Hence, the utility function to be maximized has the form
$$J(U) = \sum_{\theta=1}^{4} p_{0,\theta}\,|M_F(\theta, U)|,$$
with the energy constraints given by (3). Maximization of the utility function (46) was performed for various signal energies, and the error probability of the MAP estimator was estimated by Monte Carlo simulation with $10^5$ trials. The same procedure was repeated using the Bayesian design for (41) and (42). The results are shown in Figure 3. The error rate of the Bayesian method is significantly smaller than that of the D-optimal design, at least in this example. In particular, the signal shown in the upper-right part of Figure 3 gives an error probability approximately three times smaller than that of the D-optimal signal, although the energy of both signals was the same.
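For reference, the average D-optimal utility can be evaluated from the sensitivity recursions as in the following sketch (written for this comparison, not taken from the original text; names are chosen here).

```python
import numpy as np

def average_d_optimality(U, a_list, b_list, p0, sigma_v=0.1):
    """Prior-weighted determinant of the Fisher information matrix for the first-order models."""
    N = len(U)
    J = 0.0
    for a, b, p in zip(a_list, b_list, p0):
        xi = np.zeros(N + 1)    # xi_k  = d y_k / d a, zero initial conditions
        eta = np.zeros(N + 1)   # eta_k = d y_k / d b
        M = np.zeros((2, 2))
        for k in range(1, N + 1):
            xi[k] = (2.0 * a * xi[k - 1]
                     - a**2 * (xi[k - 2] if k >= 2 else 0.0)
                     + b * (U[k - 2] if k >= 2 else 0.0))
            eta[k] = a * eta[k - 1] + U[k - 1]
            d = np.array([xi[k], eta[k]])
            M += np.outer(d, d)
        M /= N * sigma_v**2
        J += p * np.linalg.det(M)
    return J
```

Maximizing this utility on the same energy ball (for instance with the constrained-optimizer sketch from Section 4) gives the D-optimal signal of the comparison, while the Bayesian signal maximizes $I_l$ instead.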

6. Possible Extensions of the Results

In this section, we will briefly discuss some possible extensions of the results to an infinite set of parameters and beyond linear and Gaussian models.

6.1. Non-Linear Models

Although the article refers to linear models, it is possible to extend the results to non-linear models of the form
$$Y = F_\theta(U) + Z,$$
where the conditional density of the variable Z is given by
$$p(Z|\theta, U) = \mathcal{N}(Z, 0, S_\theta(U))$$
and $S_\theta(U) > 0$, for all $U \in U_{ad}$, $\theta \in \{1, \ldots, r\}$. Under these assumptions, the density of Y still remains a Gaussian mixture and the information lower bound takes the form
$$I_l(U) = -\sum_{i=1}^{r} p_{0,i}\ln\sum_{j=1}^{r} p_{0,j}\,e^{-D_{i,j}(U)},$$
where
$$D_{i,j}(U) = \frac{1}{4}\big(F_i(U) - F_j(U)\big)^T\big(S_i(U) + S_j(U)\big)^{-1}\big(F_i(U) - F_j(U)\big) + \frac{1}{2}\ln\left|\tfrac{1}{2}\big(S_i(U) + S_j(U)\big)\right| - \frac{1}{4}\ln\big(|S_i(U)|\,|S_j(U)|\big).$$

6.2. Non-Gaussian Models

If p ( Z | θ , U ) is non-Gaussian distribution, then it is possible, on the basis of Equation (10) in [24], to construct an information lower bound of the form
$$I_l(U) = -\sum_{i=1}^{r} p_{0,i}\ln\sum_{j=1}^{r} p_{0,j}\,e^{-C_\alpha(p_i\|p_j)},$$
where
$$C_\alpha(p_i\|p_j) = -\ln\int p(Z|i, U)^\alpha\, p(Z|j, U)^{1-\alpha}\,dZ$$
is the Chernoff α-divergence and $\alpha \in [0, 1]$. Unfortunately, the calculation of (52) is difficult if $n_Y$ is large.
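If the densities can only be evaluated and sampled, the divergence can still be estimated by importance sampling; a minimal sketch (written for this text, with assumed callable arguments) is given below. The variance of such estimates grows quickly with the dimension of Z, which is one way to see the difficulty mentioned above.

```python
import numpy as np

def chernoff_divergence_mc(log_p_i, log_p_j, sample_i, alpha=0.5, draws=10_000, seed=0):
    """Monte Carlo estimate of C_alpha(p_i || p_j) using samples drawn from p_i.

    log_p_i, log_p_j: callables returning log densities; sample_i(rng): draws one sample."""
    rng = np.random.default_rng(seed)
    log_ratios = np.empty(draws)
    for m in range(draws):
        z = sample_i(rng)
        # E_{p_i}[(p_j / p_i)^(1 - alpha)] equals exp(-C_alpha)
        log_ratios[m] = (1.0 - alpha) * (log_p_j(z) - log_p_i(z))
    c = log_ratios.max()                                   # log-mean-exp for stability
    return -(c + np.log(np.mean(np.exp(log_ratios - c))))
```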

6.3. Infinite Set of Parameters

Let us consider the following model:
$$Y = F(\theta)U + Z,$$
where $p(Z|\theta) = \mathcal{N}(Z, 0, S(\theta))$ and $\theta \in \mathbb{R}^p$. If we assume that the prior density of $\theta$ is Gaussian, that is,
$$p_0(\theta) = \mathcal{N}(\theta, m_\theta, S_\theta), \quad S_\theta > 0,$$
then
$$p(Y) = \int p_0(\theta)\,\mathcal{N}(Y, F(\theta)U, S(\theta))\,d\theta,$$
and p ( Y ) can be approximated by a finite Gaussian mixture
$$p(Y|U) = \int p_0(\theta)\,\mathcal{N}(Y, F(\theta)U, S(\theta))\,d\theta \approx \sum_{j=1}^{N_a} p_{0,j}\,\mathcal{N}(Y, F(\theta_j)U, S(\theta_j)),$$
where $p_{0,j} \ge 0$ and $\sum_{j=1}^{N_a} p_{0,j} = 1$. It is possible to calculate the weights and nodes in (56) by using a multidimensional quadrature. If $n_\theta$ is large, then an appropriate sparse grid should be used. To illustrate the method, we will show only a simple, second-order quadrature with $2n_\theta$ points.
Lemma 3.
The approximate value of the integral $J(f) = \int \mathcal{N}(\theta, m_\theta, S_\theta)\,f(\theta)\,d\theta$ is given by
$$J(f) \approx \frac{1}{2n_\theta}\sum_{j=1}^{2n_\theta} f(\theta_j),$$
where
$$\theta_{2i-1} = m_\theta - \sqrt{n_\theta}\,S_\theta^{0.5}e_i, \quad \theta_{2i} = m_\theta + \sqrt{n_\theta}\,S_\theta^{0.5}e_i, \quad i = 1, \ldots, n_\theta,$$
and $e_i$ is the $i$-th basis vector. If $f(\theta) = \frac{1}{2}\theta^T A\theta + b^T\theta + c$, then equality holds in (57).
Proof. 
Direct calculation. □
Application of Lemma 3 to (56) gives $p_{0,j} = p_0 = (2n_\theta)^{-1}$ and $N_a = 2n_\theta$. Now, since (56) is a Gaussian mixture, the results of Section 2 can be utilized and the information lower bound takes the form
$$I_l(U) = -p_0\sum_{i=1}^{r}\ln\sum_{j=1}^{r} e^{-D_{i,j}(U)} - \ln p_0,$$
where $D_{i,j}$ and $\theta_j$ are given by (15), (16), and (58), respectively, and $F_j = F(\theta_j)$, $S_j = S(\theta_j)$. The approximate solution of (12) can be found by maximization of (59) with the constraints (3).
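A sketch of the node construction of Lemma 3 is given below; it is written for this text, and the $\sqrt{n_\theta}$ scaling is the one assumed above so that the rule is exact for quadratic f.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_quadrature_nodes(m_theta, S_theta):
    """2*n_theta equally weighted nodes of Lemma 3 (a cubature-type rule)."""
    n = len(m_theta)
    root = np.real(sqrtm(S_theta))          # S_theta^{1/2}
    nodes = []
    for i in range(n):
        step = np.sqrt(n) * root[:, i]      # scaling assumed so that quadratics integrate exactly
        nodes.append(m_theta - step)
        nodes.append(m_theta + step)
    return nodes                            # each node carries the weight 1/(2*n_theta)
```

Evaluating $F(\theta_j)$ and $S(\theta_j)$ at these nodes, with the uniform weights $p_0 = (2n_\theta)^{-1}$, turns (56) into a finite mixture to which the lower-bound sketch given after Lemma 1 applies directly.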

7. Discussion and Conclusions

An effective Bayesian design method for linear dynamical model discrimination has been developed. The discrimination task is formulated as an estimation problem, where the estimated parameter θ indexes particular models. To overcome the computational complexity, instead of $I(Y;\theta|U)$, its lower bound $I_l(U)$, proposed by Kolchinsky & Tracey [24], was used as a utility function. This bound is especially useful, as it is differentiable, tight, and reaches the maximum available information $H(\theta)$. It has been proved, on the basis of the results of Feder & Merhav [20], that the error probability of the MAP estimator (see Lemma 2) is upper bounded by $\tfrac{1}{2}(H(\theta) - I_l(U))\log_2 e$. The maximization of $I_l(U)$ has been considered under the signal energy constraint, but other kinds of constraints can also be easily implemented. It was shown that the maximization of $I_l(U)$, in the case of two models, is equivalent to the maximization of a quadratic form on the sphere (see also [3]). Next, the small energy limit was analyzed. It was proved that the solution is given by an eigenvector corresponding to the maximal eigenvalue of a certain Hermitian matrix. This result can serve as a starting point for the numerical maximization of $I_l(U)$. If the energy of the signal tends to infinity, then almost all (in the sense of the Lebesgue measure) signals generate the maximum information, provided that the impulse responses of the models are pairwise different. Under these conditions, it was proved that $P_e$ of the MAP estimator tends to zero.
An example of discrimination of three stochastic models with different structures was given. It is easy to observe from Figure 1 that, in contrast to the step signal, the optimal signal clearly distinguished the systems, although the energy of both signals was the same. The $P_e$ of the MAP estimator was calculated by Monte Carlo simulation. It was observed that the square signal gave an error probability several thousand times greater than the optimal signal with the same energy. Hence, we conclude that the error probability and the accuracy of the MAP estimator depend very strongly on the excitation signal. Although Theorem 2 implies that $\lim_{\varrho\to\infty} P_e(\varrho U) = 0$ for almost all U, there exist signals such that $P_e(\varrho U)$ is locally increasing. This is the case for a high-frequency square signal, as illustrated in Figure 2.
It was shown, in Section 5 (see Figure 3), that P e of the MAP estimator corresponding to Bayesian design was a few times smaller than P e generated by D-optimal design, at least in the analyzed example. This result suggests that Bayesian design can be applied to non-linear problems and that it is superior to classical D-optimal design.
Some extensions of the results to the infinite set of parameters and beyond linear and Gaussian assumptions were briefly discussed in Section 6. Extension to non-linear models seems to be easy, but the non-Gaussian case is difficult and a deeper analysis is required. The case of an infinite set of parameters was discussed in Section 6.3. It was shown that the measurement density can be approximated by a finite Gaussian mixture, after which the results of Section 2 could be directly applied. The general conclusion of this analysis is that the information bounds can be easily constructed, as long as the measurement density is a mixture of Gaussian distributions.
An analytical gradient formula must be provided for the effective numerical maximization of $I_l(U)$. The matrix inversions and the determinants in (15) and (16) should be calculated by SVD. To reduce the computational complexity and the required memory resources, the symmetries appearing in (15) and (16), and the fact that $D_{i,i} = 0$, should be utilized. The determinants in (15) and the matrices $F_i$, $S_i$ can be calculated off-line, but the matrices $Q_{i,j}$ may require too much memory if N and r are large. Therefore, $Q_{i,j}$ was calculated on-line.
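For reference, differentiating the bound of Lemma 1 term by term gives one closed form of this gradient (a derivation added here for convenience, not stated in the original development):
$$\nabla_U I_l(U) = \frac{1}{2}\sum_{i=1}^{r} p_{0,i}\left(\sum_{j=1}^{r} w_{i,j}(U)\,Q_{i,j}\right)U, \qquad w_{i,j}(U) = \frac{p_{0,j}\,e^{-D_{i,j}(U)}}{\sum_{k=1}^{r} p_{0,k}\,e^{-D_{i,k}(U)}},$$
which reuses the matrices $Q_{i,j}$ already needed to evaluate $I_l$ itself.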
Applications of the presented methods in dual control problems [5] are expected as part of our future work. An additional area in applications is the issue of automated testing of dynamical systems.

Funding

This research was financed from the statutory subsidy of the AGH University of Science and Technology, No. 16.16.120.773.

Acknowledgments

I would like to thank Jerzy Baranowski for discussions and comments.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

$\xi \sim \mathcal{N}(m, S)$ means that $\xi$ has a normal distribution with mean m and covariance S. The density of a normally-distributed variable is denoted by $\mathcal{N}(x, m, S) = (2\pi)^{-\frac{n}{2}}|S|^{-\frac{1}{2}}\exp\big(-0.5\,(x-m)^T S^{-1}(x-m)\big)$. The symbol $\mathrm{col}(a_1, a_2, \ldots, a_n)$ denotes a column vector. The set of symmetric, positive-definite matrices of dimension n is denoted by $S_+(n)$.

References

  1. Bania, P.; Baranowski, J. Bayesian estimator of a faulty state: Logarithmic odds approach. In Proceedings of the 22nd International Conference on Methods and Models in Automation and Robotics (MMAR), Miedzyzdroje, Poland, 28–31 August 2017; pp. 253–257. [Google Scholar]
  2. Baranowski, J.; Bania, P.; Prasad, I.; Cong, T. Bayesian fault detection and isolation using Field Kalman Filter. EURASIP J. Adv. Signal Process. 2017, 79. [Google Scholar] [CrossRef]
  3. Blackmore, L.; Williams, B. Finite Horizon Control Design for Optimal Model Discrimination. In Proceedings of the 44th IEEE Conference on Decision and Control, Seville, Spain, 12–15 December 2005. [Google Scholar] [CrossRef]
  4. Pouliezos, A.; Stavrakakis, G. Real Time Fault Monitoring of Industrial Processes; Kluwer Academic: Boston, MA, USA, 1994. [Google Scholar]
  5. Bania, P. Example for equivalence of dual and information based optimal control. Int. J. Control 2018. [Google Scholar] [CrossRef]
  6. Lorenz, S.; Diederichs, E.; Telgmann, R.; Schütte, C. Discrimination of dynamical system models for biological and chemical processes. J. Comput. Chem. 2007, 28, 1384–1399. [Google Scholar] [CrossRef] [PubMed]
  7. Ucinski, D.; Bogacka, B. T-optimum designs for discrimination between two multiresponse dynamic models. J. R. Stat. Soc. B 2005, 67, 3–18. [Google Scholar] [CrossRef]
  8. Walter, E.; Pronzato, L. Identification of Parametric Models from Experimental Data. In Series: Communications and Control Engineering; Springer: Berlin/Heidelberg, Germany, 1997. [Google Scholar]
  9. Pronzato, L. Optimal experimental design and some related control problems. Automatica 2008, 44, 303–325. [Google Scholar] [CrossRef]
  10. Atkinson, A.C.; Donev, A.N. Optimum Experimental Design; Oxford University Press: Oxford, UK, 1992. [Google Scholar]
  11. Goodwin, G.C.; Payne, R.L. Dynamic System Identification: Experiment Design and Data Analysis; Academic Press: New York, NY, USA, 1977. [Google Scholar]
  12. Payne, R.L. Optimal Experiment Design for Dynamic System Identification. Ph.D. Thesis, Department of Computing and Control, Imperial College of Science and Technology, University of London, London, UK, February 1974. [Google Scholar]
  13. Ryan, E.G.; Drovandi, C.C.; McGree, J.M.; Pettitt, A.N. A Review of Modern Computational Algorithms for Bayesian Optimal Design. Int. Stat. Rev. 2016, 84, 128–154. [Google Scholar] [CrossRef]
  14. Fedorov, V.V. Convex design theory. Math. Operationsforsch. Stat. Ser. Stat. 1980, 1, 403–413. [Google Scholar]
  15. Lindley, D.V. Bayesian Statistics—A Review; Society for Industrial and Applied Mathematics (SIAM): Philadelphia, PA, USA, 1972. [Google Scholar]
  16. Ryan, E.G.; Drovandi, C.C.; Pettitt, A.N. Fully Bayesian Experimental Design for Pharmacokinetic Studies. Entropy 2015, 17, 1063–1089. [Google Scholar] [CrossRef]
  17. Chaloner, K.; Verdinelli, I. Bayesian Experimental Design: A Review. Stat. Sci. 1995, 10, 273–304. [Google Scholar] [CrossRef]
  18. DasGupta, A. Review of Optimal Bayes Designs; Technical Report; Purdue University: West Lafayette, IN, USA, 1995. [Google Scholar]
  19. Routtenberg, T.; Tabrikian, J. A general class of lower bounds on the probability of error in multiple hypothesis testing. In Proceedings of the 25th IEEE Convention of Electrical and Electronics Engineers in Israel, Eilat, Israel, 3–5 December 2008; pp. 750–754. [Google Scholar]
  20. Feder, M.; Merhav, N. Relations between entropy and error probability. IEEE Trans. Inf. Theory 1994, 40, 259–266. [Google Scholar] [CrossRef]
  21. Arimoto, S.; Kimura, H. Optimum input test signals for system identification—An information-theoretical approach. Int. J. Syst. Sci. 1971, 1, 279–290. [Google Scholar] [CrossRef]
  22. Fujimoto, Y.; Sugie, T. Informative input design for Kernel-Based system identification. In Proceedings of the 2016 IEEE 55th Conference on Decision and Control (CDC), Las Vegas, NV, USA, 12–14 December 2016; pp. 4636–4639. [Google Scholar]
  23. Hatanaka, T.; Uosaki, K. Optimal Input Design for Discrimination of Linear Stochastic Models Based on Kullback-Leibler Discrimination Information Measure. In Proceedings of the 8th IFAC/IFORS Symposium on Identification and System Parameter Estimation 1988, Beijing, China, 27–31 August 1988; pp. 571–575. [Google Scholar]
  24. Kolchinsky, A.; Tracey, B.D. Estimating Mixture Entropy with Pairwise Distances. Entropy 2017, 19, 361. [Google Scholar] [CrossRef]
  25. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2006. [Google Scholar]
  26. Atkinson, A.C.; Fedorov, V.V. Optimal design: Experiments for discriminating between several models. Biometrika 1975, 62, 289–303. [Google Scholar]
Figure 1. (Top) Numerical solution of (19) and the small energy approximation (26), for $\varrho = N = 200$. (Bottom) Step responses and optimal responses of all systems.
Figure 2. Error probability of the MAP estimator for the optimal signal (.), step signal (+), and square (*) signal with a period of three. The number of steps is N = 50. The error probability was estimated by a Monte Carlo method with $10^6$ trials. Standard error bars were multiplied by a factor of 10 for better visibility.
Figure 3. Error probability of the MAP estimator (see Section 2), as a function of the signal norm, and the exemplary input signals (top right) generated by the D-optimal and Bayesian methods. The error probability was calculated by a Monte Carlo method with $10^5$ trials. Standard error bars were multiplied by a factor of 10 for better visibility.
