A Natural Gradient Algorithm for Stochastic Distribution Systems

In this paper, we propose a steepest descent algorithm based on the natural gradient to design the controller of an open-loop stochastic distribution control system (SDCS) of multi-input and single output with a stochastic noise. Since the control input vector decides the shape of the output probability density function (PDF), the purpose of the controller design is to select a proper control input vector, so that the output PDF of the SDCS can be as close as possible to the target PDF. In virtue of the statistical characterizations of the SDCS, a new framework based on a statistical manifold is proposed to formulate the control design of the input and output SDCSs. Here, the Kullback–Leibler divergence is presented as a cost function to measure the distance between the output PDF and the target PDF. Therefore, an iterative descent algorithm is provided, and the convergence of the algorithm is discussed, followed by an illustrative example of the effectiveness.


Introduction
Information geometry [1][2][3][4][5][6] proposed by some scholars has been widely applied to various fields, such as neural network [7,8], control systems [9][10][11][12], dynamical system [13,14] and information science [15,16].The main advantage of information geometry is that by considering the set of probability density functions (PDFs) as a manifold, one is able to investigate its properties geometrically.The parameters of a probability density function (PDF), regarded as the coordinate system of a statistical manifold, play important roles.As a classical example, consider the set S of normal PDFs with mean µ and variance σ 2 , namely, Obviously, S can be viewed as a two-dimensional manifold, where (µ, σ) can be considered as a coordinate system.However, this statistical manifold is not a Euclidean space with respect to the Fisher metric, but a Riemannian manifold with a constant negative curvature.These observations in return enable us to investigate statistical properties from the viewpoint of information geometry.
There are many useful consequences of the information geometric theory; among them, the natural gradient algorithm [15] is well known.This algorithm has been applied to various stochastic models, such as stochastic network [16] and signal processing [17].
Stochastic distribution control systems (SDCSs) [18] were proposed from practical production systems, such as steel and paper making, and general material processing.The system is shown in Figure 1.The key point of the controller design problem is to formulate the control input, such that the output PDF is as close as possible to a required distribution shape.Generally, product quality data in industrial processes can be approximated by the Gaussian PDFs when the system operates normally.However, when abnormality occurs along the production line, these quality variables will not be Gaussian.Therefore, various iterative algorithms [19][20][21][22] have been presented to control the shape of PDFs for non-Gaussian cases.In [23], different kinds of SDCSs were discussed.One of them is the input and output model based output PDF control, which is investigated here.In this paper, we are mainly concerned with the information geometric algorithm to control the shape of PDFs.In [22], the authors firstly brought the idea of information geometry to the field of SDCSs and presented a comparative study on the parameter estimation performance between the geodesic equation and the B-spline function approximations, when the system output distributions are Gamma family distributions.Then, in [10] and [11], the authors generalized the results of [22] to more general cases where the system output distributions are assumed to be exponential family distributions and any regular distributions, respectively.There, the authors proposed information geometric algorithms using projection, natural gradient and geodesics, as well.
In the present paper, we investigate more complicated SDCSs of multi-input, single output with a stochastic noise from the viewpoint of information geometry.The remainder of this paper is organized as follows.In Section 2, we specify the SDCSs and re-describe them in the frame of information geometry.In Section 3, based on the natural gradient descent algorithm, a steepest descent algorithm is proposed from the viewpoint of information geometry.In Section 4, the convergence of the algorithm is discussed.In Section 5, an illustrative example is given.

Model Description
In this paper, we investigate the open-loop SDCSs of multi-input and single output with a stochastic noise, where the structure of the systems is characterized by a known nonlinear function f (•) and the noise term is assumed to be subject to a known PDF p ω (x).Therefore, the SDCSs can be expressed as: where u k = (u 1 k , . . ., u n k ) ∈ R n is the control input vector and y k ∈ R 1 is the output (see Figure 2).It is assumed that the function f (•) is invertible with respect to its noise term ω k .Thus, according to p ω (x) and Equation (1), the output PDFs of the system can be expressed by: p(y; u) = p ω f −1 (y, u) ∂f −1 (y, u) ∂y . ( This shows that Equation (2) implies how the control input vector u controls the shape of the output PDF of the SDCSs.For example, when the stochastic noise signal ω is subject to the normal distribution N (0, 1) and the stochastic distribution control system (SDCS) with single input and single output is formulated as y = u 2 + ω, then the output PDF can be obtained as: In order to guarantee the effectiveness, the following assumptions are required.
(1) The inverse function of y = f (u, ω) with respect to ω exists and is denoted by ω = f −1 (y, u), which is at least C 2 with respect to all variables (y, u).(2) The output PDF p(y; u) is at least C 2 with respect to all variables (y, u).
For the shape control of the PDF, the purpose of the controller design is to select the control input vector u * , so that p(y; u * ) is as close as possible to the target PDF h(y).To formulate it in the frame of information geometry, we first define the relevant statistical manifold.Definition 1.The statistical manifold S, called the system output manifold, is defined as: S = {p(y; u)} , where p(y; u) is in the form of Equation ( 2) and the control input vector u = (u 1 , . . ., u n ) T ∈ R n plays the role of a coordinate system for S. Thus, S is an n-dimensional manifold.
Definition 2 ([5,6,24]).For a given statistical manifold, the Kullback-Leibler divergence between two points P and Q corresponding to the PDF p(x) and the PDF q(x), respectively, is defined by: Notice that the Kullback-Leibler divergence neither satisfies the triangular inequality nor is symmetric.Hence, it is not a usual distance function.However, the Kullback-Leibler divergence between two neighboring points θ and θ + dθ can be approximated by using the Fisher metric: where the terms of order O(| dθ| 3 ) are neglected.Thus, the Kullback-Leibler divergence is a distance-like measure of two points on a statistical manifold and has been widely applied, for example, to information theory.Here, (g ij ) is the Fisher metric equipped on manifold S, whose components are expressed as: where l(x; θ) = log p(x; θ), ∂ i = ∂ ∂θ i and E denotes the expectation with respect to p(x; θ).Manifold (S, g) is hence a Riemannian manifold.
Here, we use the following Kullback-Leibler divergence function to measure the difference between h(y) and p(y; u) by: that is, J(u) is considered a cost function.Our purpose is to design a controller so that the control input vector u * minimizes J(u), namely, Alternatively, the problem can be re-described as selecting the points p(y; u) on S with the coordinate system u to make the points as close as possible to the target point h(y) in the frame of information geometry.

Natural Gradient Algorithm
In this section, we will introduce an iterative algorithm for the controller design from the viewpoint of information geometry.It is a fact that the ordinary gradient method is a popular learning method in Euclidean space.However, most practical problems are non-Euclidean, where the ordinary gradient method loses its effectiveness.In such cases, the ordinary gradient does not give the steepest descent direction of a target function, but the natural gradient does.Next, we will introduce an important lemma about the natural gradient.
Let {ω ∈ R n } be a parameter space on which a function L is defined.

Lemma 1 ([15]
).The steepest descent direction of L(ω) on a Riemannian manifold is given by: where G −1 = (g ij ) is the inverse of the Riemannian metric G = (g ij ) and ∇L(ω) is the ordinary gradient: To obtain the steepest descent algorithm, we first formulate the Fisher metric of S as follows.
Proposition 1.The components of the Fisher metric of S are given by: for i, j ∈ {1, 2, . . ., n}.
Proof.The first order derivatives of log p(y; u) with respect to u i (i = 1, 2, . . ., n) are given by: Note that ∂ log p(y;u) ∂u i must satisfy the condition: for all i = 1, 2, . . ., n. Combining (3) and ( 6), we obtain the conclusion in Proposition 1.This completes the proof.
Thus, we have the following iterative descent algorithm.
Theorem 1.Based on the natural gradient algorithm, the steepest descent algorithm for the control input vector u of the considered stochastic distribution control systems is given by: where G −1 k is the inverse of the Fisher metric G k = G| u=u k , and ε is a sufficiently small positive constant, which determines the step size.Here, we set: Proof.Let P k and P k+1 be two close points on S corresponding to the functions log p(y; u k ) and log p(y; u k+1 ), whose coordinates are given by u k = (u 1 k , . . ., u n k ) T and u k+1 = u k + u k , respectively, where u k+1 = (u 1 k+1 , . . ., u n k+1 ) T , and u k = ( u 1 k , . . ., u n k ) T .Therefore, our purpose is to formulate an iterative formula with respect to u k+1 .Assume that the vector − −−− → P k P k+1 ∈ T P k S has a fixed length, namely, where ε is a sufficiently small positive constant.Then, we put: where: can be considered as a tangent vector of T P k S at P k .We denote a = (a 1 , . . ., a n ) T , and the tangent vector v satisfies: where G k means that G| u=u k .
To reveal the iterative relation of the functions log p(y; u k ) and log p(y; u k+1 ) between the sample times k and k + 1, the following equation is performed approximately: where: Combining Equations ( 8), ( 9) and ( 11), we get the following equation: From Equation ( 12), we get the relations between J(u k+1 ) and J(u k ) as follows: where: Note that u k is known at the sample time k + 1.Here, a = (a 1 , . . ., a n ) T should be selected, such that the following performance function: is minimized, where the first two terms are the linear approximation of the Kullback-Leibler divergence J(u k ) at the sample time k, while the third term is a natural quadratic constraint for a = (a 1 , . . ., a n ) T .Then, the optimal vector a can be obtained as: From above, it can be seen that Equation ( 15) sets up the necessary condition for an optimal vector a.Now, let us consider the sufficient condition of Equation ( 15) to minimize the performance function (14).
Firstly, we determine the value of the parameter λ.Since: and ) is positive, we get the value of λ as: Then, the Hessian matrix of F (a 1 , . . ., a n ) with respect to the vector a = (a 1 , . . ., a n ) T is given by: since λ is positive and G k is positive definite, the Hessian matrix is positive definite.This guarantees that the vector a in the form of Equation ( 15) minimizes the performance function (14) naturally.
Substituting Equation (15) into Equation (12), we obtain the conclusion in Equation ( 7), which gives a steepest descent direction algorithm for this stochastic distribution control problem.
We summarize the algorithm above as follows: (1) Initialize u 0 .
(2) At the sample time k − 1, formulate ∇J(u k−1 ) and use Equation (1) to give the inverse G −1 k−1 of the Fisher metric G k−1 .
(3) Calculate u k using Equation (7) and apply it to the stochastic system.(4) If J(u k ) < δ, where δ is a positive constant, which is determined by the precision needed, escape.
Additionally, at the sample time k, the output PDF p(y; u k ) is the final one.If not, turn to Step 5. ( 5) Increase k by one and go back to Step 2.

Convergence of the Algorithm
Now, let us study the convergence of the algorithm for the control input vector u in Equation ( 7).Proof.Let: and: First, we shall prove the following conclusion: for any > 0, there exists a constant K > 0, such that B(a i , ) for arbitrary m > K.
Now, we give the proof by contradiction.If for a certain 0 > 0, we have that for arbitrary K > 0, there exists an m > K, such that x m ∈ s i=1 B(a i , 0 ), then, for K = 1, we get an m 1 > 1 satisfying Following this way, we get a subsequence {x m j } of {x m }, satisfying x m j ∈ s i=1 B(a i , 0 ) for arbitrary j.
Since D is compact, {x m j } must have a convergent subsequence {x m j i }, namely, It is obvious that: ) and f (x) = 0.
As f is continuous, we have: which contradicts lim m→∞ f (x m ) = 0. Therefore, the conclusion in the beginning holds.
Since {x m } ∞ m=1 ⊂ D, we can get a convergent subsequence {x m k } ∞ k=1 .Setting lim k→∞ x m k = x * , it is trivial that x * ∈ Ω according to the process of the conclusion above.
Because the set Ω is finite, setting 0 = 1 2 min 1≤i,j≤s Moreover, we shall prove lim m→∞ x m = x * by its equivalent proposition: for arbitrary 0 < < 0 2 , there exists a K > 0, such that x m ∈ B(x * , ) for any m ≥ K.
For arbitrary 0 < < 0 2 , we get that there exists a K 1 > 0, such that: for arbitrary m > K 1 .Meanwhile, we also have: where α ∈ B(a i , ) and β ∈ B(a j , ) are arbitrary, when i = j.
As lim k→∞ x m k = x * , there exists a constant L > 0, so that: when k > L.
Since lim m→∞ x m+1 − x m = 0, for 0 > 0, there exists a K 2 > 0, such that: for arbitrary m > K 2 . Take Finally, we shall finish the proof by using mathematical induction.When m = K, since x K ∈ {x m k }, we get x m ∈ B(x * , ) from Equation ( 18) directly.If when m = N ≥ K, x N ∈ B(x * , ), then when m = N + 1, we see that x N +1 should be contained in the union of the s open balls from Equation ( 16), while from Equations ( 17) and ( 19), we also get that x N and x N +1 must be in the same ball, i.e., x N +1 ∈ B(x * , ).
This finishes the proof of Lemma 2.
Lemma 3. Let J(u) be at least C 2 with respect to u.For an initial value u 0 ∈ R n , suppose that the level set L = {u ∈ R n |J(u) J(u 0 )} is compact.The sequence {u k } in Equation ( 7) has the following property: for a certain k 0 , either k ∇J(u k ) and α = ε 2 2λ for simplicity.Assume that c k = 0 for any sample time k.Now, let us give a proof by contradiction.Suppose that when k → ∞, c k → 0 does not hold, that is, there exists an ε 0 > 0, so that the norm of c k satisfies: for infinitely many k, where c k 2 = c T k G k c k .Thus, for such k, Equation ( 20) can be rewritten as: Then, from the mean value theorem of differentials, we get: where c(u) = G −1 (u)∇J(u), and v k belongs to the continuous space between u k and u k − αc k .Since c(u) is continuous and the level set L is compact, c(u) is uniformly continuous on L, which means that there exists a β > 0, when 0 holds for all the k.Then, taking α = β c k in Equation ( 22) and combining Equation (21) with Equation ( 23), we have: for infinitely many k.
On the other hand, since: in which G −1 k−1 is positive definite and λ > 0, we have J(u k ) − J(u k−1 ) < 0, i.e., {J(u k )} is monotone decreasing with respect to k.
The level set L is compact, which implies that lim k→∞ J(u k ) exists, namely, This is a contradiction.This completes the proof of Lemma 3.
Theorem 2. Let ∇J(u) be a continuous function with the compact level set L, and suppose the set Ω = {u ∈ L|∇J(u) = 0} is finite.Then, there exists u * ∈ Ω, such that: Proof.From Lemma 3, we get: Meanwhile, similarly with the process of the proof of Lemma 3, we have: Therefore, from Equations ( 24), (25) and Lemma 2, we get the conclusion in Theorem 2.

Simulations
The dynamic characteristic of a simple nonlinear stochastic system is considered as: , where ω k ∈ [0, +∞) and (µ, σ) is the input vector.Here, the stochastic noise ω k is a random process whose PDF is written as: p ω (x) = 1 3750 x 3 e − 1 5 x , where x ∈ [0, +∞).
The target PDF h(y) is given by: Then, from Equation (2), we can get the output PDF p(y; µ, σ) as: The components of the Fisher metric can hence be obtained as: To start the simulation, the initial value is chosen as u 0 = (µ 0 , σ 0 ) T = (0.7, 2.5) T .The weights ε and λ are taken as 0.6 and 0.8, respectively.As a result, the response of the output PDFs is shown in Figure 3, in which y denotes the output of the system, p(y; µ, σ) denotes the PDF of the output y and k denotes the sample time.In order to illustrate the effectiveness of the control law in detail, the comparison between the final controlled output PDF and the target function is shown in Figure 4, where the horizontal axis denotes the output y and the vertical axis denotes the PDF p(y; µ, σ) of y.Obviously, the output PDF can be controlled to be a steady state shown in Figure 4.However, as we know, assume that A and B are two sets in a metric space (e.g., Euclidean space) with the distance function d; then, the distance d(A, B) = max x∈A,y∈B d(x, y) may be larger than zero.Actually, in our simulation, the target PDF is in the set of second order polynomials, and the PDF p(y; µ, σ) of the output y is exponential.Therefore, the non-zero steady error still exists all of the time.
The response of the optimal control input sequences is shown in Figure 5, in which (µ, σ) denotes the input vector and k denotes the sample time.From above, it can be concluded that the simulation results demonstrate the effectiveness of the presented method, which gives a solution to control the shape of the output PDF.

Conclusions
In this paper, we investigate the open-loop stochastic distribution control systems of multi-input and single output with a stochastic noise, via the advantage of information geometric theory.
(1) By the statistical characterizations of the stochastic distribution control systems, we formulate the controller design in the frame of information geometry.By virtue of the natural gradient algorithm, a steepest descent algorithm is proposed.(2) The convergence of the obtained algorithm is proven.(3) An example is discussed in detail to demonstrate our algorithm.

Figure 1 .
Figure 1.The stochastic distribution control systems.

Figure 2 .
Figure 2. The open-loop stochastic distribution control systems.

Lemma 2 .
For an n-dimensional manifold M with a distance function d(x, y), the norm is defined by x − y = d(x, y).Let f : M → R n be a continuous mapping on a compact set D of M and the set Ω = {x ∈ D| f (x) = 0} be finite.If the sequence {x m } ∞ m=1 ⊂ D satisfies: lim m→∞ x m+1 − x m = 0 and lim m→∞ f (x m ) = 0, then there exists an x * ∈ Ω, such that: lim m→∞ x m = x * .

Figure 3 .
Figure 3.The response of the output probability density functions.

Figure 4 .
Figure 4.The final and the target probability density functions.

Figure 5 .
Figure 5.The optimal control input sequences.