An Introduction to the Development of Centralized and Distributed Stochastic Approximation Algorithms with Expanding Truncations

Abstract: The stochastic approximation algorithm (SAA), starting from the pioneering work by Robbins and Monro in the 1950s, has been successfully applied in systems and control, statistics, machine learning, and so forth. In this paper, we review the development of SAA in China, specifically, the stochastic approximation algorithm with expanding truncations (SAAWET) developed by Han-Fu Chen and his colleagues during the past 35 years. We first review the historical development of the centralized algorithm, including the probabilistic method (PM) and the ordinary differential equation (ODE) method for SAA and the trajectory-subsequence (TS) method for SAAWET. Then, we give an application example of SAAWET to recursive principal component analysis. We also introduce the recent progress on SAAWET in a networked and distributed setting, named the distributed SAAWET (DSAAWET).


Introduction
SAA was first introduced by Robbins and Monro in the 1950s [1] and is also called the RM algorithm in the literature. Since then, SAA has found successful applications in systems and control, statistics, machine learning, stochastic arithmetic and the CESTAC method for numerical computation [2,3], energy internet [4], and so forth. There are also plenty of studies on various theoretical properties of SAA, including sample path convergence [1,5-7], weak convergence [8], convergence with time-varying root-searching functions [9-12], robustness of SAA [13], stable and unstable limit points [14], convergence rate and asymptotic normality [7,15-20], and so forth. In this paper, due to space limitations, we do not give an overview of all aspects of SAA, but focus on the sample path analysis of this kind of algorithm, and particularly introduce the centralized and distributed SAAWET developed by Han-Fu Chen and his colleagues during the past 35 years.
At the early stage of SAA, to guarantee convergence of the algorithm, restrictive conditions were usually imposed on the noise $\{\varepsilon_k\}_{k\ge 0}$ and the function $f(\cdot)$, for example, $\{\varepsilon_k\}_{k\ge 0}$ being a martingale difference sequence and $f(\cdot)$ being bounded by a linear function. Then, by tools from probability theory, such as the martingale theory, the almost sure convergence of SAA can be established; such an analysis method is called the probabilistic method (PM) [1,20]. However, in many situations the noise $\{\varepsilon_k\}_{k\ge 0}$ may not be a martingale difference sequence, and it may depend on the past estimates generated by SAA itself, which we call state-dependent noise; in such cases PM fails. Towards relaxing the restrictive conditions of PM, in the 1970s the so-called ordinary differential equation (ODE) method was introduced [7,21], which transforms the convergence analysis of SAA into the stability analysis of the equilibrium of an associated ODE and forms another representative direction for the theoretical analysis of SAA. For the ODE method, however, the boundedness of the estimate sequence is assumed a priori. In Section 2, the centralized SAA and SAAWET together with an application to recursive principal component analysis are introduced; in Section 3, the distributed SAAWET and some application examples are given as well. In Section 4, some concluding remarks are addressed. The notations used in this paper are listed in Table 1.
Table 1. Notations.
$\sigma_{i,k}$ — truncation number of agent i at time k
$\sigma_k \triangleq \max_{i\in V}\sigma_{i,k}$ — largest truncation number of all agents at time k

Probabilistic Martingale Method
We first give the detailed problem formulation of SAA. Assume that $f(\cdot):\mathbb{R}^l\to\mathbb{R}^l$ is an unknown function with root $x^0$, that is, $f(x^0)=0$. At time k, assume that $x_k$ is the estimate of $x^0$ and the measurement of $f(\cdot)$ at $x_k$ is $y_{k+1}$, that is,
$$y_{k+1} = f(x_k) + \varepsilon_{k+1}, \qquad (1)$$
where $\varepsilon_{k+1}$ is the observation noise.
With an arbitrary initial value $x_0$, a recursive algorithm for searching the root $x^0$ of $f(\cdot)$ is introduced in [1]:
$$x_{k+1} = x_k + a_k y_{k+1}, \qquad (2)$$
where $\{a_k\}_{k\ge 1}$ is the stepsize sequence which satisfies
$$a_k > 0, \quad a_k \xrightarrow[k\to\infty]{} 0, \quad \sum_{k=1}^{\infty} a_k = \infty. \qquad (3)$$
Assumption (3) requires that the stepsizes tend to zero while their sum diverges. This indicates that the noise effect can be asymptotically reduced while the algorithm retains the ability to search over the whole feasible domain.

Example 1. Assume that at time k the estimate $x_k$ is very close to the root $x^0$. Then there exists an $\varepsilon>0$ small enough such that
$$\|x_k - x^0\| \le \varepsilon. \qquad (4)$$
Then, from (1) and (2),
$$\|x_{k+1} - x^0\| \ge a_k\|\varepsilon_{k+1}\| - \|x_k - x^0\| - a_k\|f(x_k)\|. \qquad (5)$$
From (5) it can be seen that if the stepsize sequence $\{a_k\}_{k\ge 1}$ has a positive lower bound, then for unbounded noises, such as Gaussian variables, the estimation error will not converge to zero. Thus, for the stepsize sequence $\{a_k\}_{k\ge 1}$, $a_k \xrightarrow[k\to\infty]{} 0$ is a necessary condition for convergence of SAA.
Example 2. If $\sum_{k=0}^{\infty} a_k < \infty$, then by (2) the cumulative change of the estimates, $\sum_{k=0}^{\infty} a_k\|y_{k+1}\|$, is bounded whenever the observations are bounded along the trajectory. Hence, if the initial value $x_0$ is sufficiently far from the root, then $\{x_k\}_{k\ge 0}$ will not converge to $x^0$. Thus, $\sum_{k=0}^{\infty} a_k = \infty$ is another necessary condition for convergence of SAA.
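To make the recursion (2) and the stepsize condition (3) concrete, the following minimal Python sketch runs the Robbins–Monro iteration on a hypothetical scalar problem; the function `f_noisy`, the linear choice $f(x) = -(x-1)$, the Gaussian noise level, and the stepsize $a_k = 1/k$ are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def robbins_monro(f_noisy, x_init, n_steps=10_000, seed=0):
    """Robbins-Monro iteration x_{k+1} = x_k + a_k * y_{k+1} with a_k = 1/k,
    so that a_k -> 0 and sum_k a_k = infinity, as required by (3)."""
    rng = np.random.default_rng(seed)
    x = x_init
    for k in range(1, n_steps + 1):
        y = f_noisy(x, rng)        # y_{k+1} = f(x_k) + eps_{k+1}, cf. (1)
        x = x + y / k              # a_k = 1/k
    return x

# Hypothetical test problem: f(x) = -(x - 1) with root x^0 = 1, observed
# with additive zero-mean Gaussian noise (a martingale difference sequence).
f_noisy = lambda x, rng: -(x - 1.0) + rng.normal(scale=0.5)
print(robbins_monro(f_noisy, x_init=10.0))   # close to 1.0
```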
We make the following assumptions.
A2.1 $a_k > 0$, $a_k \xrightarrow[k\to\infty]{} 0$, $\sum_{k=1}^{\infty} a_k = \infty$, and $\sum_{k=1}^{\infty} a_k^2 < \infty$.

A2.2 There exists a twice continuously differentiable function $v(\cdot):\mathbb{R}^l\to\mathbb{R}$ with bounded second derivatives such that $v(x^0)=0$, $v(x)>0$ for all $x\ne x^0$, $v(x)\to\infty$ as $\|x\|\to\infty$, and
$$f(x)^{\top}v_x(x) < 0 \quad \forall\, x\ne x^0, \qquad (6)$$
where $v_x(\cdot)$ denotes the gradient of $v(\cdot)$.

A2.3 $\{\varepsilon_k\}_{k\ge 1}$ is a martingale difference sequence (m.d.s.) with respect to a filtration $\{\mathcal{F}_k\}_{k\ge 0}$, that is,
$$E[\varepsilon_{k+1}\mid\mathcal{F}_k]=0, \qquad \sup_k E[\|\varepsilon_{k+1}\|^2\mid\mathcal{F}_k]<\infty. \qquad (7)$$

A2.4 There exists a constant $c>0$ such that
$$\|f(x)\|\le c(1+\|x\|) \quad \forall\, x\in\mathbb{R}^l. \qquad (8)$$

Theorem 1. ([20,45,46]) Assume that A2.1-A2.4 hold. Then, with an arbitrary initial value $x_0$, the estimates generated from (2) converge to the root $x^0$ of $f(\cdot)$ almost surely, that is,
$$x_k \xrightarrow[k\to\infty]{} x^0 \quad \text{a.s.} \qquad (9)$$

Proof. Here, we list the sketch of the proof. By using the function $v(\cdot)$ given in A2.2, define $v_k \triangleq v(x_k)$. By (7) and (8), we can obtain
$$E[v_{k+1}\mid\mathcal{F}_k] \le (1+c_1 a_k^2)\,v_k + a_k f(x_k)^{\top}v_x(x_k) + c_1 a_k^2 \qquad (10)$$
for some constant $c_1>0$. By the properties of a nonnegative supermartingale, we have that $\{v_k\}_{k\ge 0}$ converges almost surely. Noting (6), we further obtain that $\sum_{k=1}^{\infty} a_k f(x_k)^{\top}v_x(x_k)$ converges almost surely.
With an arbitrary $\varepsilon>0$, define the region $B_\varepsilon \triangleq \{x : \|x-x^0\|\le\varepsilon\}$ and a stopping sequence $\{\sigma^i_\varepsilon\}_{i\ge 0}$ of the successive times at which the estimates visit $B_\varepsilon$. Combining (6) with the almost sure convergence of $\sum_{k} a_k f(x_k)^{\top}v_x(x_k)$ and with $\sum_{k} a_k = \infty$, we can prove that each $\sigma^i_\varepsilon$ is finite almost surely, and hence there exists a subsequence $\{x_{k_i}\}_{i\ge 1}$ satisfying $\|x_{k_i}-x^0\|\le\varepsilon$. Since $\varepsilon>0$ can be arbitrarily small, there exists a subsequence of $\{x_{k_i}\}_{i\ge 0}$ which converges to $x^0$. By the convergence of $\{v(x_k)\}_{k\ge 0}$ and the fact that $x^0$ is the unique minimizer of $v(\cdot)$, we can finally prove (9).
The above proof is based on the convergence of the nonnegative supermartingale, and the analysis method is accordingly named the probabilistic method (PM).
Remark 1. If, instead of A2.3, the noise satisfies $\sum_{k=1}^{\infty} a_k\varepsilon_{k+1} < \infty$ almost surely, then we can still prove the almost sure convergence of the SAA (2).

Remark 2. From A2.3 we can see that PM requires the observation noise to be an m.d.s. In many problems, however, the observation noise may contain complicated state-dependent uncertainties, for example, $\varepsilon_{k+1} = \varepsilon(x_k) + e_{k+1}$, where the state-dependent component $\varepsilon(x_k)$ need not form an m.d.s. Moreover, A2.4 indicates that $f(\cdot)$ is bounded by a linear function. These are the restrictions of PM.

Ordinary Differential Equation Method
We first introduce two technical lemmas.

Lemma 1 (Arzelà–Ascoli). A sequence of functions that is uniformly bounded and equicontinuous on any finite interval contains a subsequence converging, uniformly on any finite interval, to a continuous function.

Lemma 2. Consider the ODE
$$\dot{x} = f(x). \qquad (22)$$
If there exists a continuously differentiable function $v(\cdot):\mathbb{R}^l\to\mathbb{R}$ such that $v(x^0)=0$, $v(x)>0$ for all $x\ne x^0$, $v(x)\to\infty$ as $\|x\|\to\infty$, and $f(x)^{\top}v_x(x)<0$ for all $x\ne x^0$, then $x^0$ is the globally asymptotically stable equilibrium of (22).
Before introducing the conditions and the results for the ODE method, we first introduce a crucial notation to be used in this paper. For any $T>0$, define
$$m(n,T) \triangleq \max\Big\{m : \sum_{i=n}^{m} a_i \le T\Big\}. \qquad (23)$$
From the definition in (23) we can see that $m(n,T)$ is the maximal number of steps starting from n with the sum of stepsizes not exceeding T. Noting that $a_k\to 0$ as $k\to\infty$, we have that $m(n,T)\to\infty$ as $n\to\infty$.
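The quantity $m(n,T)$ in (23) is easy to compute numerically. The following small Python sketch does so for the illustrative stepsize choice $a_k = 1/(k+1)$; the helper name `m_of` is ours.

```python
def m_of(n, T, a):
    """m(n, T) = max{m : a_n + ... + a_m <= T}, as in (23).
    Returns n - 1 when even the single term a_n already exceeds T."""
    total, m, i = 0.0, n - 1, n
    while total + a(i) <= T:
        total += a(i)
        m = i
        i += 1
    return m

a = lambda k: 1.0 / (k + 1)      # illustrative stepsize sequence
print(m_of(10, 0.5, a))          # maximal step index from n = 10 within "time" T = 0.5
```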
We make the following assumptions.
A3.1 $a_k>0$, $a_k\xrightarrow[k\to\infty]{}0$, and $\sum_{k=1}^{\infty} a_k=\infty$.

A3.2 There exists a continuously differentiable function $v(\cdot):\mathbb{R}^l\to\mathbb{R}$ such that $v(x^0)=0$, $v(x)>0$ for all $x\ne x^0$, $v(x)\to\infty$ as $\|x\|\to\infty$, and $f(x)^{\top}v_x(x)<0$ for all $x\ne x^0$.

A3.3 On the sample path ω under consideration, the observation noise satisfies
$$\lim_{T\to 0}\limsup_{n\to\infty}\frac{1}{T}\Big\|\sum_{i=n}^{m(n,T)} a_i\varepsilon_{i+1}\Big\| = 0. \qquad (24)$$

Noise condition (24) no longer requires $\{\varepsilon_k\}_{k\ge 0}$ to be an m.d.s.; it may hold for state-dependent and even deterministic noises. In this regard, (24) is more general compared with A2.3.
We give the following result for the ODE method.
Theorem 2. ([7,21]) Assume that A3.1-A3.3 hold on the sample path ω and that the estimate sequence $\{x_k\}_{k\ge 0}$ generated from (2) is bounded on ω. Then on this sample path, $x_k\xrightarrow[k\to\infty]{}x^0$.

Proof. We sketch the proof. Define $t_0 = 0$ and $t_k \triangleq \sum_{i=0}^{k-1} a_i$. By interpolating the estimates generated from (2) over the time scale $\{t_k\}_{k\ge 0}$, we obtain a sequence of continuous functions, whose uniform boundedness follows from the boundedness of $\{x_k\}_{k\ge 0}$ and whose equicontinuity follows from A3.3. Then by Lemma 1, we can prove that there exists a subsequence $\{x^{n_k}(t)\}_{k\ge 0}$ converging to a continuous function $x(t)$ satisfying the ODE (22). By A3.2 and Lemma 2, we can further prove that the equilibrium point $x^0$ is globally stable and that the estimates $\{x_k\}_{k\ge 0}$ converge to $x^0$ on the sample path ω.
The essential idea of the above proof is to transform the convergence analysis of a recursive algorithm into the stability analysis of an associated ODE, and thus the analysis method is called the ODE method. From assumption A3.3 we can see that the ODE method has a wider application potential than PM. However, the a priori boundedness assumption on the estimate sequence is still restrictive. Aiming at removing this condition and further relaxing the technical assumptions on the noise and the root-seeking function, in the next section we introduce the SAA with expanding truncations (SAAWET) and its general convergence theorem.
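The following Python sketch illustrates the ODE viewpoint on a hypothetical scalar example with $f(x) = -x$: the iterates of (2), placed on the time scale $t_k = \sum_i a_i$, track the flow of $\dot{x} = -x$ toward its equilibrium. The noise level and stepsizes are illustrative choices.

```python
import numpy as np

# Hypothetical scalar example f(x) = -x with root x^0 = 0: the SAA iterates,
# indexed by the time scale t_k = a_1 + ... + a_k, shadow the ODE dx/dt = -x.
rng = np.random.default_rng(1)
x, t = 5.0, 0.0
for k in range(1, 5001):
    a_k = 1.0 / k
    x += a_k * (-x + rng.normal(scale=0.2))   # noisy observation of f(x_k)
    t += a_k                                  # advance the ODE time scale

print(x, 5.0 * np.exp(-t))   # the iterate and the ODE solution both approach 0
```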
Stochastic Approximation Algorithm with Expanding Truncations

Example 3. Consider $f(x) = -x^3$, $x\in\mathbb{R}$, with root $x^0=0$. Clearly, $f(\cdot)$ is not bounded by any linear function, and for a large initial value the estimates generated from (2) may escape to infinity, so the boundedness of $\{x_k\}_{k\ge 0}$ cannot be guaranteed a priori.

For Example 3, neither the growth rate condition on $f(\cdot)$ used by PM nor the boundedness condition on the estimate sequence used by the ODE method holds. On the other hand, if we choose the stepsizes and the truncation bounds in an adaptive manner, the estimates generated from SAA may still converge. This adaptive choice is the core idea of SAAWET. Let us describe it.
Denote by J the root set of the unknown function $f(\cdot):\mathbb{R}^l\to\mathbb{R}^l$, that is, $f(x^0)=0$ for all $x^0\in J$. Choose a positive sequence $\{M_k\}_{k\ge 0}$ increasingly diverging to infinity, $M_k\xrightarrow[k\to\infty]{}\infty$.
With an arbitrary initial value $x_0\in\mathbb{R}^l$, the estimate sequence $\{x_k\}_{k\ge 1}$ is generated by the following SAAWET:
$$x_{k+1} = (x_k + a_k y_{k+1})\,\mathbb{1}_{\{\|x_k + a_k y_{k+1}\|\le M_{\sigma_k}\}} + x^{*}\,\mathbb{1}_{\{\|x_k + a_k y_{k+1}\| > M_{\sigma_k}\}}, \qquad (29)$$
$$\sigma_{k} = \sum_{i=0}^{k-1}\mathbb{1}_{\{\|x_i + a_i y_{i+1}\| > M_{\sigma_i}\}}, \qquad \sigma_0 = 0, \qquad (30)$$
where $y_{k+1}$ is the observation of $f(\cdot)$ at $x_k$, $\varepsilon_{k+1}$ is the observation noise, $a_k$ is the stepsize, $x_k$ is the estimate for the root set of $f(\cdot)$ at time k, $\sigma_k$ is the number of truncations up to time k, and $x^{*}$ is a fixed point with $\|x^{*}\| < M_0$.
From (29) and (30) we can see that if the magnitude of $x_k + a_k y_{k+1}$ stays within the truncation bound, then the algorithm evolves as the classical SAA (2), while if the magnitude of $x_k + a_k y_{k+1}$ exceeds the truncation bound, then $x_{k+1}$ is pulled back to $x^{*}$, with the truncation bound being enlarged and the stepsize being reduced for the next recursion.
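A minimal Python sketch of the expanding-truncation mechanism (29) and (30) is given below; the bound sequence $M_k$, the reset point $x^{*}$, and the test function $f(x) = -x^3$ (which, as in Example 3, is not bounded by any linear function) are illustrative assumptions.

```python
import numpy as np

def saawet(f_noisy, x_init=5.0, x_star=0.0, n_steps=20_000, seed=2):
    """Sketch of SAAWET (29)-(30): take an RM step; if it leaves the current
    truncation bound M_sigma, reset to x_star and enlarge the bound."""
    rng = np.random.default_rng(seed)
    M = lambda sigma: 10.0 * 2.0 ** sigma    # M_k increasing to infinity
    x, sigma = x_init, 0
    for k in range(1, n_steps + 1):
        cand = x + f_noisy(x, rng) / k       # tentative RM step, a_k = 1/k
        if abs(cand) <= M(sigma):
            x = cand                         # within the bound: keep the step
        else:
            x, sigma = x_star, sigma + 1     # truncate: reset and expand bound
    return x, sigma

# Illustrative root-seeking problem with f(x) = -x^3 (root x^0 = 0), which is
# not bounded by any linear function; truncations keep the iterates bounded.
f_noisy = lambda x, rng: -x**3 + rng.normal(scale=1.0)
print(saawet(f_noisy))   # estimate near 0 and a small number of truncations
```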
We first list conditions for theoretical analysis.
A4.1 $a_k>0$, $a_k\xrightarrow[k\to\infty]{}0$, and $\sum_{k=1}^{\infty} a_k=\infty$.

A4.2 There exists a continuously differentiable function $v(\cdot):\mathbb{R}^l\to\mathbb{R}$ such that
$$\sup_{\delta\le d(x,J)\le\Delta} f(x)^{\top}v_x(x) < 0 \qquad (31)$$
for any $\Delta>\delta>0$, where $d(x,J)\triangleq\inf_{y\in J}\|x-y\|$, and $v(J)\triangleq\{v(x):x\in J\}$ is nowhere dense in $\mathbb{R}$. Moreover, $x^{*}$ used in (29) satisfies $\|x^{*}\|<c_0$ and $v(x^{*})<\inf_{\|x\|=c_0}v(x)$ for some constant $c_0>0$.

A4.3 On the sample path ω, for any convergent subsequence $\{x_{n_k}\}_{k\ge 1}$ generated from (29) and (30), it holds that
$$\lim_{T\to 0}\limsup_{k\to\infty}\frac{1}{T}\Big\|\sum_{i=n_k}^{m(n_k,T_k)} a_i\varepsilon_{i+1}\Big\| = 0 \quad \forall\, T_k\in[0,T], \qquad (32)$$
with $m(n_k,T_k)\triangleq\max\big\{m:\sum_{i=n_k}^{m} a_i\le T_k\big\}$.

A4.4 The function $f(\cdot)$ is measurable and locally bounded.

Remark 5. A set being nowhere dense means that the interior of its closure is empty. Clearly, a set consisting of a single point is nowhere dense. Note that the noise condition A4.3 only requires verifying (32) on a fixed sample path and along the indices of convergent subsequences, which, compared with (24), is much easier to verify in practice. For this reason, the analysis method for SAAWET is called the trajectory-subsequence (TS) method.
Theorem 3. ([5]) Assume that A4.1-A4.4 hold on the sample path ω under consideration. Then, with an arbitrary initial value $x_0$, the estimates $\{x_k\}_{k\ge 0}$ generated from (29) and (30) satisfy
$$d(x_k, J^{*}) \xrightarrow[k\to\infty]{} 0,$$
where $d(x,J^{*}) = \inf_{y\in J^{*}}\|x-y\|$ and $J^{*}$ is a closed connected subset of J.
It is direct to verify that if the root set of $f(x)=0$ is a singleton $\{x^0\}$, then under the conditions of Theorem 3 we have that $d(x_k,x^0)\xrightarrow[k\to\infty]{}0$ and $J^{*}=J=\{x^0\}$.

Proof. We outline the proof.

Step 1. Prove that on the sample path ω, for any convergent subsequence $\{x_{n_k}\}_{k\ge 1}$, there exists a $T>0$ such that, for all sufficiently large k, the algorithm evolves without truncation and the estimates remain in a small neighborhood of $\lim_{k\to\infty}x_{n_k}$ for indices in $[n_k, m(n_k,T)]$; this yields the estimates (35) and (36).
Step 2. By the stability condition A4.2 and the estimates (35) and (36), prove that for SAAWET the number of truncations is finite, that is, algorithm (29) and (30) evolves as the classical SAA after a finite number of steps.
Step 3. Establish the convergence of $\{x_k\}_{k\ge 0}$ based on Steps 1 and 2.

Remark 6.
If it is a priori known that the estimates $\{x_k\}_{k\ge 0}$ generated from (29) and (30) lie in a subspace $S\subset\mathbb{R}^l$, then for the convergence of $\{x_k\}_{k\ge 0}$ we only need to verify the stability condition on $J\cap S$, and in such a case assumption A4.2 is formulated as follows:

A4.2' There exists a continuously differentiable function $v(\cdot):\mathbb{R}^l\to\mathbb{R}$ such that
$$\sup_{x\in S,\ \delta\le d(x,J\cap S)\le\Delta} f(x)^{\top}v_x(x) < 0$$
for any $\Delta>\delta>0$, and $v(J\cap S)$ is nowhere dense. Moreover, for $x^{*}$ used in (29), there exists a constant $c_0>0$ such that $\|x^{*}\|<c_0$ and $v(x^{*})<\inf_{\|x\|=c_0,\,x\in S}v(x)$.
From the above sections, we can see that SAAWET does not require that the observation noise is purely stochastic or the estimate sequence is bounded. In fact, it can be shown that conditions for convergence of SAAWET are the weakest possible in a certain sense [5]. In the next section, we will apply SAAWET to give a recursive algorithm for principal component analysis.

Recursive Principal Component Analysis Algorithm
Before introducing the recursive algorithm, we first make the following assumption: the observation at time k is a symmetric random matrix $A_k = A + N_k$, where $A\in\mathbb{R}^{n\times n}$ is an unknown symmetric matrix whose eigenvalues and eigenvectors are to be estimated, and $\{N_k\}_{k\ge 1}$ is the observation noise.
The principal component analysis problem considered in this paper aims at estimating the eigenvalues and eigenvectors of A based on the observation sequence $\{A_k\}_{k\ge 1}$. Since A is unknown, performing an SVD or QR decomposition for each $A_k$, $k\ge 1$, would be rather time-consuming. In the following, we introduce a SAAWET-based recursive algorithm for solving the problem.
The eigenvectors of matrix A are recursively estimated as follows. Choose the stepsize $a_k=\frac{1}{k}$ and any initial value $u^{(1)}_0\in\mathbb{R}^n$ with unit modulus, and define
$$u^{(1)}_{k+1} = \frac{u^{(1)}_k + a_k A_{k+1} u^{(1)}_k}{\big\|u^{(1)}_k + a_k A_{k+1} u^{(1)}_k\big\|} \qquad (39)$$
if $u^{(1)}_k + a_k A_{k+1} u^{(1)}_k \ne 0$; otherwise, set $u^{(1)}_{k+1}$ to be another vector with unit modulus. Assume that $u^{(i)}_k$, $i=1,\cdots,j$, have been well defined. Next, we inductively define the estimation algorithm for $u^{(j+1)}_k$. With an arbitrary initial vector $u^{(j+1)}_0$ with unit modulus, define
$$u^{(j+1)}_{k+1} = \frac{u^{(j+1)}_k + a_k\big(A_{k+1}-\sum_{i=1}^{j}\lambda^{(i)}_k u^{(i)}_k (u^{(i)}_k)^{\top}\big)u^{(j+1)}_k}{\big\|u^{(j+1)}_k + a_k\big(A_{k+1}-\sum_{i=1}^{j}\lambda^{(i)}_k u^{(i)}_k (u^{(i)}_k)^{\top}\big)u^{(j+1)}_k\big\|} \qquad (40)$$
if the vector being normalized is nonzero; otherwise, set $u^{(j+1)}_{k+1}$ to be another vector with unit modulus. The eigenvalue estimates $\{\lambda^{(j)}_k\}_{k\ge 1}$, $j=1,\cdots,n$, are generated by the following algorithms:
$$\lambda^{(j)}_{k+1} = \lambda^{(j)}_k + a_k\big((u^{(j)}_k)^{\top}A_{k+1}u^{(j)}_k - \lambda^{(j)}_k\big), \qquad (41)$$
$$\lambda^{(j)}_0 = 0, \quad j=1,\cdots,n, \qquad (42)$$
where $\lambda^{(j)}_k$ is the estimate for the eigenvalue corresponding to the eigenvector which $u^{(j)}_k$ estimates at time k. Denote by S the unit sphere in $\mathbb{R}^n$ and by J the set of all eigenvectors of matrix A with unit modulus. Denote the set of all eigenvalues of matrix A by $V(J)\triangleq\{\lambda^{(1)},\cdots,\lambda^{(n)}\}$.
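The following Python sketch mirrors the recursions (39)-(42) under the illustrative observation model $A_k = A + N_k$ with a known diagonal A used only to generate data; the helper names `recursive_pca` and `observe` and the noise level are our assumptions.

```python
import numpy as np

def recursive_pca(observe, n, n_steps=20_000, seed=3):
    """Sketch of the recursive eigen-estimation (39)-(42): normalized
    stochastic-approximation updates with deflation by the components
    already estimated; all eigenvector iterates stay on the unit sphere."""
    rng = np.random.default_rng(seed)
    U = np.linalg.qr(rng.normal(size=(n, n)))[0]   # unit-norm initial vectors
    lam = np.zeros(n)                              # eigenvalue estimates, cf. (42)
    for k in range(1, n_steps + 1):
        a_k = 1.0 / k
        A_k = observe(rng)                         # noisy observation of A
        for j in range(n):
            # deflate the directions already estimated, cf. (40)
            D = A_k - sum(lam[i] * np.outer(U[:, i], U[:, i]) for i in range(j))
            v = U[:, j] + a_k * D @ U[:, j]
            U[:, j] = v / np.linalg.norm(v)        # project back onto the sphere
            lam[j] += a_k * (U[:, j] @ A_k @ U[:, j] - lam[j])   # cf. (41)
    return lam, U

# Illustrative data: a known diagonal A observed through symmetric noise.
A = np.diag([5.0, 2.0, 1.0])
def observe(rng):
    N = rng.normal(size=(3, 3))
    return A + 0.05 * (N + N.T)

lam, U = recursive_pca(observe, n=3)
print(np.sort(lam)[::-1])   # approaches the eigenvalues 5, 2, 1
```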
Theorem 4. On the sample path ω where the corresponding noise conditions hold, the following assertions take place: (i) there exists a closed subset $J^{*}_j$ of J such that $d(u^{(j)}_k, J^{*}_j)\xrightarrow[k\to\infty]{}0$, $j=1,\cdots,n$; (ii) $\lambda^{(j)}_k$ converges to an eigenvalue $\lambda^{(j)}$ of matrix A, and the limit set of $\{\lambda^{(j)}_k,\ j=1,\cdots,n\}$ coincides with V(J).

Proof. Noting that $\{u^{(j)}_k\}_{k\ge 0}$, $j=1,\cdots,n$, have unit modulus, algorithms (39)-(42) are SAAWET evolving on the unit sphere S. The proof can be obtained by applying Theorem 3 and Remark 6. Due to space limitations, the detailed proof is omitted.

Distributed Stochastic Approximation Algorithm
In this section, we will introduce the distributed version of SAAWET (DSAAWET).
The key difference from the existing distributed SAA (see, for example, [40]) lies in the expanding truncation mechanism, which adaptively defines the truncation bounds. The theoretical results of DSAAWET guarantee that the estimates generated by DSAAWET for all agents converge almost surely to a consensus set, which is contained in the root set of the sum function.
We first introduce the problem formulation.
Consider the case where all agents in a network cooperatively search for the root of the sum function
$$f(x) \triangleq \sum_{i=1}^{N} f_i(x), \qquad (43)$$
where $f_i(\cdot):\mathbb{R}^l\to\mathbb{R}^l$ is the local function which can only be observed by agent i. Denote the root set of $f(\cdot)$ by
$$J \triangleq \{x\in\mathbb{R}^l : f(x)=0\}. \qquad (44)$$
For any $i\in V$, denote by $x_{i,k}\in\mathbb{R}^l$ the estimate for the root of $f(\cdot)$ given by agent i at time k. At time k+1, agent i only has its local noisy observation
$$O_{i,k+1} = f_i(x_{i,k}) + \varepsilon_{i,k+1}, \qquad (45)$$
where $\varepsilon_{i,k+1}$ is the observation noise. Hence, to estimate the root of $f(\cdot)$, all agents need to exchange information with adjacent agents via the topology of the network. The topology of the network at time k is described by a digraph $G(k)=\{V,E(k)\}$, where $V=\{1,\cdots,N\}$ is the index set of all agents and $E(k)\subset V\times V$ is the edge set, with $(j,i)\in E(k)$ representing the information flow from agent j to agent i at time k. Denote the adjacency matrix of the network at time k by $W(k)=[\omega_{ij}(k)]^{N}_{i,j=1}$, where $\omega_{ij}(k)>0$ if and only if $(j,i)\in E(k)$, and $\omega_{ij}(k)=0$ otherwise. Denote by $N_i(k)=\{j\in V : (j,i)\in E(k)\}$ the set of neighbors of agent i at time k.
We apply the idea of expanding truncations to the distributed estimation. Denote by $x_{i,k}$ the estimate of agent i at time k and by $\sigma_{i,k}$ the number of truncations of agent i up to time k. With arbitrary initial values $x_{i,0}$ and $\sigma_{i,0}=0$, DSAAWET is given by
$$\tilde{\sigma}_{i,k} = \max_{j\in N_i(k)\cup\{i\}} \sigma_{j,k}, \qquad (46)$$
$$\tilde{x}_{i,k} = x_{i,k}\,\mathbb{1}_{\{\sigma_{i,k}=\tilde{\sigma}_{i,k}\}} + x^{*}\,\mathbb{1}_{\{\sigma_{i,k}<\tilde{\sigma}_{i,k}\}}, \qquad (47)$$
$$x_{i,k+1} = x'_{i,k+1}\,\mathbb{1}_{\{\|x'_{i,k+1}\|\le M_{\tilde{\sigma}_{i,k}}\}} + x^{*}\,\mathbb{1}_{\{\|x'_{i,k+1}\|> M_{\tilde{\sigma}_{i,k}}\}}, \quad x'_{i,k+1}\triangleq\sum_{j\in N_i(k)\cup\{i\}}\omega_{ij}(k)\tilde{x}_{j,k} + \gamma_k O_{i,k+1}, \qquad (48)$$
$$\sigma_{i,k+1} = \tilde{\sigma}_{i,k} + \mathbb{1}_{\{\|x'_{i,k+1}\|> M_{\tilde{\sigma}_{i,k}}\}}, \qquad (49)$$
where $O_{i,k+1}$ is defined by (45), $\{\gamma_k\}$ is the stepsize sequence, $x^{*}\in\mathbb{R}^l$ is a vector known to all agents, and $\{M_k\}_{k\ge 0}$ is a positive sequence increasingly diverging to infinity with $M_0\ge\|x^{*}\|$.
Denote by $\sigma_k \triangleq \max_{i\in V}\sigma_{i,k}$ the largest truncation number among all agents at time k. If $N=1$, then by denoting $x_k\triangleq x_{1,k}$, $\sigma_k\triangleq\sigma_{1,k}$, $y_{k+1}\triangleq O_{1,k+1}$, and $a_k\triangleq\gamma_k$, the algorithm (46)-(49) reduces to (29) and (30), which is the SAAWET in the centralized setting. We first introduce the assumptions to be used.
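A minimal Python sketch of (46)-(49) for scalar states is given below; the network of three agents, the fixed equal-weight matrix W, the local functions $f_i(x) = b_i - x$, and the bound sequence $M_k$ are all illustrative assumptions, and the name `dsaawet` is ours.

```python
import numpy as np

def dsaawet(local_fs, W, x_star=0.0, n_steps=20_000, seed=4):
    """Sketch of DSAAWET (46)-(49) with scalar states: each agent compares
    truncation counters with its neighbors, resets if it lags behind (47),
    combines neighbors' estimates with a noisy local observation (48), and
    updates its truncation counter (49)."""
    rng = np.random.default_rng(seed)
    N = len(local_fs)
    M = lambda s: 5.0 * 2.0 ** s                   # M_k increasing to infinity
    x = rng.normal(size=N)                         # initial estimates x_{i,0}
    sig = np.zeros(N, dtype=int)                   # truncation counters
    for k in range(1, n_steps + 1):
        g = 1.0 / k                                # stepsize gamma_k
        sig_nb = np.array([sig[W[i] > 0].max() for i in range(N)])      # (46)
        x_tld = np.where(sig == sig_nb, x, x_star)                      # (47)
        obs = np.array([f(x[i]) + rng.normal(scale=0.5)
                        for i, f in enumerate(local_fs)])               # (45)
        cand = W @ x_tld + g * obs
        trunc = np.abs(cand) > M(sig_nb)
        x = np.where(trunc, x_star, cand)                               # (48)
        sig = sig_nb + trunc.astype(int)                                # (49)
    return x

# Illustrative network: three agents, equal weights, f_i(x) = b_i - x, so the
# sum function f(x) = sum_i (b_i - x) has its root at mean(b) = 2.
fs = [lambda x, b=b: b - x for b in (1.0, 2.0, 3.0)]
W = np.full((3, 3), 1.0 / 3.0)                     # doubly stochastic weights
print(dsaawet(fs, W))                              # all entries near 2.0
```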
A5.1 $\gamma_k>0$, $\gamma_k\xrightarrow[k\to\infty]{}0$, and $\sum_{k=0}^{\infty}\gamma_k=\infty$.

A5.2 There exists a continuously differentiable function $v(\cdot):\mathbb{R}^l\to\mathbb{R}$ such that
$$\sup_{\delta\le d(x,J)\le\Delta} f(x)^{\top}v_x(x) < 0$$
for any $\Delta>\delta>0$, where $v_x(\cdot)$ denotes the gradient of $v(\cdot)$ and $d(x,J)=\min_y\{\|x-y\|: y\in J\}$. Moreover, $v(J)$ is nowhere dense, and $\|x^{*}\|<c_0$ and $v(x^{*})<\inf_{\|x\|=c_0}v(x)$ for some positive constant $c_0$, where $x^{*}$ is used in (47) and (48).

A5.3 The local functions $f_i(\cdot)$ $\forall i\in V$ are continuous.

A5.4 (a) $W(k)$ $\forall k\ge 0$ are doubly stochastic matrices (a matrix is said to be doubly stochastic if all its entries are nonnegative and each row and each column sums to 1); (b) there exists a constant $0<\eta<1$ such that $\omega_{ij}(k)\ge\eta$ whenever $\omega_{ij}(k)>0$; (c) the digraph $G_\infty=\{V,E_\infty\}$ is strongly connected, where $E_\infty\triangleq\{(j,i): (j,i)\in E(k)$ for infinitely many indices $k\}$, that is, j is a neighbor of i which communicates with i infinitely often; (d) there exists a positive integer B such that for every $(j,i)\in E_\infty$, agent j sends information to the neighbor i at least once every B consecutive time slots, that is, $(j,i)\in E(k)\cup E(k+1)\cup\cdots\cup E(k+B-1)$ for all $(j,i)\in E_\infty$ and any $k\ge 0$.
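Assumption A5.4(a) is straightforward to verify numerically; a small sketch with an illustrative weight matrix:

```python
import numpy as np

# Checking A5.4(a) for an illustrative weight matrix: entries nonnegative,
# every row and every column summing to one.
W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
assert (W >= 0).all()
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)
```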

Remark 7.
Assumptions A5.1-A5.3 are similar to the conditions used for the centralized SAAWET. Since DSAAWET works in a distributed setting, assumption A5.4 is further needed; it is a commonly used condition describing the topology of the network [34].
Next, we introduce the condition on the observation noises for each agent.
A5.5 For the sample path ω under consideration and for each agent $i\in V$: (a) $\gamma_k\varepsilon_{i,k+1}\xrightarrow[k\to\infty]{}0$; (b) for any convergent subsequence $\{x_{i,n_k}\}_{k\ge 1}$ of the estimates,
$$\lim_{T\to 0}\limsup_{k\to\infty}\frac{1}{T}\Big\|\sum_{m=n_k}^{m(n_k,T_k)}\gamma_m\varepsilon_{i,m+1}\Big\|=0 \quad \forall\, T_k\in[0,T].$$
We now give the main results of DSAAWET.
Theorem 5. ([44]) Let $\{x_{i,k}\}$ be produced by (46)-(49) with arbitrary initial values $x_{i,0}$. Assume A5.1-A5.4 hold. Then on the sample path ω where A5.5 holds, the following assertions take place:
(i) $\{x_{i,k}\}$ is bounded, and there exists a positive integer $k_0$, possibly depending on ω, such that no truncation occurs after $k_0$, that is,
$$x_{i,k+1} = \sum_{j\in N_i(k)\cup\{i\}}\omega_{ij}(k)x_{j,k} + \gamma_k O_{i,k+1} \quad \forall\, k\ge k_0,\ \forall\, i\in V,$$
or in the compact form:
$$X_{k+1} = (W(k)\otimes I_l)X_k + \gamma_k O_{k+1} \quad \forall\, k\ge k_0,$$
where $X_k\triangleq\mathrm{col}\{x_{1,k},\cdots,x_{N,k}\}$ and $O_{k+1}\triangleq\mathrm{col}\{O_{1,k+1},\cdots,O_{N,k+1}\}$;
(ii) $X_{\perp,k}\xrightarrow[k\to\infty]{}0$, where $X_{\perp,k}\triangleq X_k-(\frac{1}{N}\mathbf{1}\mathbf{1}^{\top}\otimes I_l)X_k$ is the disagreement vector, that is, the estimates of all agents asymptotically reach consensus;
(iii) $d(x_{i,k},J^{*})\xrightarrow[k\to\infty]{}0$ $\forall i\in V$, where $J^{*}$ is a closed connected subset of the root set J.

Proof. The proof can be obtained by first proving that the number of truncations is finite, then the consensus of the estimates, and finally the convergence of the estimates. The detailed proof can be found in [44].
DSAAWET provides a promising tool for solving distributed problems over networks. In fact, DSAAWET has been successfully applied to the following problems:
• Distributed identification of linear systems [48];
• Distributed blind identification of communication channels [49];
• Output consensus of networked Hammerstein and Wiener systems [38].

Concluding Remarks
In this paper, we briefly reviewed the historical development of SAA and, in particular, introduced SAAWET and its distributed version (DSAAWET) developed by Han-Fu Chen and his colleagues during the past 35 years. SAAWET and DSAAWET establish general convergence results for root-seeking problems with noisy observations. Since many problems, such as identification, estimation, adaptive control, and optimization, can be transformed into root-seeking problems, SAAWET and DSAAWET can hopefully provide powerful tools for solving such problems.