On the Convergence of Stochastic Process Convergence Proofs

Sánchez-López, Borja; Cerquides, Jesus

doi:10.3390/math9131470

Open AccessArticle

On the Convergence of Stochastic Process Convergence Proofs

by

Borja Sánchez-López

^*

and

Jesus Cerquides

^*

IIIA-CSIC, Campus UAB, 08193 Cerdanyola, Spain

^*

Authors to whom correspondence should be addressed.

Mathematics 2021, 9(13), 1470; https://doi.org/10.3390/math9131470

Submission received: 5 May 2021 / Revised: 16 June 2021 / Accepted: 19 June 2021 / Published: 23 June 2021

(This article belongs to the Section D1: Probability and Statistics)

Download

Browse Figures

Versions Notes

Abstract

:

Convergence of a stochastic process is an intrinsic property quite relevant for its successful practical for example for the function optimization problem. Lyapunov functions are widely used as tools to prove convergence of optimization procedures. However, identifying a Lyapunov function for a specific stochastic process is a difficult and creative task. This work aims to provide a geometric explanation to convergence results and to state and identify conditions for the convergence of not exclusively optimization methods but any stochastic process. Basically, we relate the expected directions set of a stochastic process with the half-space of a conservative vector field, concepts defined along the text. After some reasonable conditions, it is possible to assure convergence when the expected direction resembles enough to some vector field. We translate two existent and useful convergence results into convergence of processes that resemble to particular conservative vector fields. This geometric point of view could make it easier to identify Lyapunov functions for new stochastic processes which we would like to prove its convergence.

Keywords:

stochastic process; optimization functions; stochastic gradient descent; convergence; Lyapunov functions

1. Introduction

Along most practical research branches, the solution to a given problem is often entrusted to a function optimization problem, where the effectiveness of a solution is measured by a function to be optimized. Machine learning challenges are great examples of this situation. Therefore, optimization algorithms become crucial to solve such problems. Iterative optimization methods start at an initial point and move through parameter space towards trying to minimize the objective function. Its performance may dramatically vary depending on the initial point. This dependence is somewhat diminished if the algorithm is guaranteed to converge in the long term to a minimum. Furthermore, in stochastic optimization algorithms, the quality achieved varies randomly and sometimes there are chances that the algorithm fails to converge. As an example, the stochastic natural gradient descent (SNGD) of Amari [1] and its variants often show instability depending on the starting point and learning rate tuning. Even some experiments are proved to diverge with SNGD [2]. Clearly such issue weights considerably against its practical use.

Convergent algorithms are more stable with respect to both learning rate parameters and initial point estimations. For instance, in [3], the optimization method named convergent stochastic natural gradient descent (CSNGD) is proposed. CSNGD is designed to mimic SNGD but it is proven to be convergent. Sánchez-López and Cerquides show that, unlike SNGD, CSNDG shows stability in the experiments run.

As a consequence, we are interested in understanding better the conditions that make an algorithm convergent. Convergence proofs abound in the literature. In this work we concentrate on two apparently disconnected and well known convergence results. In [4] seminal work, Bottou proved the convergence of stochastic gradient descent (SGD). Later on in [5], Sunehag provided an extended result for variable metric modified SGD. The connection between the proofs of both results are not evident. It is not clear what they have in common, and, therefore, further generalizations seem not to be within reach.

To understand the convergence results (Both theorems are added in Appendix A. However, we alleviate the conditions of Theorem A2 in [5]. The alleviated conditions turn Theorem A2 into Theorem A3 found in the same appendix), it is helpful to take a look at their proofs. Bottou’s proof relies on the construction of a Lyapunov function [6]. On the other hand, Sunehag’s proof uses the Robbins–Siegmundtheorem [7] instead. It can be seen that the latter is proving that the function to optimize serves already as a Lyapunov function, similar to latter chapters in [4]. Therefore, both proofs share some similarity but it is not evident how to raise a connection. Establishing the connection and pointing out its relevance is the main contribution of this paper and results in a generalization from which both results can be easily proved as corollaries.

Stochastic optimization algorithms rely on observations extracted from some possibly unknown probability space. Algorithms subjected to random phenomena are stochastic process [8,9,10,11]. The generalized convergence result for stochastic processes in this article is obtained after 2 main concepts. Precisely, the first one is resemblance between a stochastic process and a vector field. The second one is the locally bounded property of a stochastic process by a function. These two ingredients are enough to state and prove our convergence theorem; a stochastic process Z converges if it is locally bounded by a convex real valued, twice differentiable function

ϕ

with bounded Hessian and Zresembles to

\nabla ϕ

.

Two corollaries are extracted from this result, which we prove to be equivalent to Bottou’s and Sunehag’s convergence theorems. Moreover, we observe that convergence proof in [12] of the algorithm called discrete DSNGD can be addressed by our main theorem, since original convergence theorem of Sunehag is not general enough.

Resemblance concept involves the expected directions set of a stochastic process, that we define in Section 2, and the half-space of a vector field, a concept introduced in Section 3. Then, in Section 4 we state and prove our general result, which highlights the commonalities between Bottou and Sunehag theorems, proving convergence of a wider variety of algorithms.

2. Main Result. Director Process and the Expected Direction Set

Let

(Ω, F, P)

be a probability space and

(S, Σ)

be a measurable space. A discrete stochastic process on

(Ω, F, P)

indexed by

N

is a sequence of random variables

Z = {Z_{t}}_{t \in N}

such that

Z_{t} : Ω \to S

. In this work,

S = R^{k}

and

Σ

is the corresponding Borel

σ

-algebra. As random variables are used to describe general random phenomena, stochastic processes indexed by

N

are usually used to model random sequences.

2.1. Locally Bounded Stochastic Processes and Objective of the Work

The difference between two random variables of a stochastic process is a random variable known as increment. We say that random variable

Z_{t + s} - Z_{t}

with

1 \geq s \in N

is an s-increment at time t. For example, the 1-increments of a stochastic process Z are

Z_{t}^{*} = Z_{t + 1} - Z_{t} .

(1)

We focus our attention to a decomposition of

Z_{t}^{*}

into

Z_{t}^{*} = - γ (t) \cdot X_{t}

, such that

γ : N \to R^{+}

is a positive real valued function and

X = {X_{t}}_{t \in N}

is a stochastic process on

(Ω, F, P)

.

Definition 1.

Let Z and X be stochastic processes and

γ : N \to R^{+}

a function. Then

(X, γ)

is a decomposition of 1-increments of Z if

Z_{t + 1} = Z_{t} - γ (t) \cdot X_{t} .

(2)

Name X the director process of Z and γ the learning rate, and note it by

Z = (X, γ)

.

This way of expressing a process allows to define

Z_{t + 1}

with respect to

Z_{t}

, which gives us control of the difference between both values by means of

γ (t) X (t)

, as Figure 1 shows. This is very useful if we intend to analyse the convergence of a stochastic process.

As represented in Figure 1, we can think of

Z_{t}

as the value of the process at time t, while

- γ (t) X_{t}

is the vector going from

Z_{t}

to

Z_{t + 1}

. For the article, it is important to remember this, since we are constantly referring to

Z_{t}

as points in

R^{k}

while

X_{t}

are managed as direction vectors in

R^{k}

. This distinction is only practical for our purposes.

The trajectories of stochastic approximation algorithms, such as stochastic gradient descent (SGD), are indeed samples of stochastic processes. Furthermore, they are usually expressed by means of their decomposition of 1-increments, as can be seen in the following examples.

Example 1.

SGD [4] is the cornerstone of machine learning to solve the function optimization problem. The objective of SGD is to minimize an objective function

L (η) = E_{z \sim P^{*}} l (η; z)

for some unknown probability distribution

P^{*}

and random variable

l (η)

defined on

(Ω^{*}, F^{*}, P^{*})

. This function l is known as loss function, and it is usually differentiable with respect to η, allowing the definition of SGD as

\begin{matrix} Z_{t + 1} = & Z_{t} - γ (t) \nabla_{η} l (Z_{t}), \\ γ (t) > & 0 t \in N \end{matrix}

(3)

where

Z_{t}

are estimates of

\bar{η}

. We can see

Z = {Z_{t}}_{t \in N}

, and, therefore, SGD, as a stochastic process. Indeed, let

(Ω = \prod_{t \in N} Ω^{*}, F = \prod_{t \in N} F^{*}, P = \prod_{t \in N} P^{*})

(4)

be the product probability space (This space is guaranteed to exist according to Kolmogorov extension theorem (see for example Theorem 2.4.4 and following examples in [13])) over infinite sequences. Hence we can define the stochastic process X on

(Ω, F, P)

such that

X_{t} = \nabla_{η} l (Z_{t})

where for every

ω = {ω_{t}}_{t \in N} \in Ω

it is

X_{t} (ω) = \nabla_{η} l (Z_{t}; ω_{t})

. This implies that

(X, γ)

is a decomposition of 1-increments of SGD.

In addition, we observe that

Z_{t + 1}

depends only on last observation

Z_{t}

and t, which is known as a non-stationary Markov chain.

Example 2.

This example is worked in [5]. Again, we focus on the function optimization problem, using the same notation as in previous example. In this case, the estimation update of the minimum

\bar{η}

is defined as

\begin{matrix} Z_{t + 1} = & Z_{t} - γ (t) B_{t} \cdot Y_{t}, \\ γ (t) > & 0 t \in N \end{matrix}

(5)

where

B_{t}

is a matrix in

R^{k \times k}

known after information

Z_{0}, \dots, Z_{t}

available at time t and

Y_{t} = Y (Z_{t})

, where Y is a function mapping each

η \in R^{k}

to a random variable on the same probability space

(Ω^{*}, F^{*}, P^{*})

.

Similarly as in previous example, Y can be thought as a random variable in the product probability space (Equation (4)) that depends on previous

Z_{t}

, such that for every

ω \in Ω

it is

Y_{t} (ω) = Y (Z_{t}; ω) = Y (Z_{t}; ω_{t})

. If we define

X_{t} = B_{t} \cdot Y (Z_{t})

, then

Z = (X, γ)

is a decomposition of 1-increments of Z with

X = {X_{t}}_{t \in N}

.

Here, Z is not a (non-stationary) Markov chain, since

B_{t}

may depend on

Z_{i}

for all

i < t

.

The naming of

γ

as learning rate is commonly used in the machine learning research branch [4,14,15,16]. The director process X determines the direction

X_{t}

at time t of the update Equation (1) with

Z_{t}

as reference point, while

γ (t)

specifies a certain distance to travel along that direction

X_{t}

. Moreover we demand some constraints to both factors. Condition imposed to

γ

is usually found in the literature [3,4,5]. A learning rate

γ

holds the standard constraint if;

\sum_{t} γ {(t)}^{2} < \infty, \sum_{t} γ (t) = \infty .

(6)

Before we show the condition for the director process X, we fix some notation used throughout the article. Consider the natural filtration

F_{Z} = {F_{t}}_{t \in N}

generated by stochastic process Z, that is,

F_{t} = σ (Z_{i}^{- 1} (A) ∣ i \leq t, A \in Σ)

for all

t \in N

. Then

F_{Z}

is a filtration and by definition Z is adapted to

F_{Z}

.

Intuitively, every

F_{t}

of a filtration is a

σ

-algebra that classifies the elements of

Ω

. For example, if

Ω

is the set of colours,

F_{t}

can gather warm and cold colours into separate and complementary sets. The fact that a random variable

Z_{t}

is

F_{t}

-measurable implies that

Z_{t}

sends all warm colours to the same value and all cold colours also to the same value. Somehow

Z_{t}

is then not providing any additional information about elements of

Ω

beyond the classification of

F_{t}

. The sequence

F_{t}

is increasing, in the sense that

F_{t} \subset F_{t + 1}

for all t. Therefore, a filtration characterizes space

Ω

with sequentially higher levels of information or classification. Denote

E_{t} = E [\cdot ∣ F_{t}]

the conditional expectation given

F_{t}

[10]. Recall that if Y is a random variable in

(Ω, F, P)

then

E_{t} [Y]

is in turn a

F_{t}

-measurable random variable.

Hence, if

Z = (X, γ)

then X is locally and linearly bounded by function

ϕ : R^{k} \to R

if

(\exists A, B) (\forall t) E_{t} {∥ X_{t} ∥}^{2} \leq A + B \cdot ϕ (Z_{t}) .

(7)

These two constraints are finally combined to present the kind of stochastic processes we are interested in.

Definition 2.

Let Z be a stochastic process and

ϕ : R^{k} \to R

be a function. We say that Z is locally bounded by

ϕ

if there is a decomposition of 1-increments

(X, γ)

with γ holding the standard constraint and X locally and linearly bounded by ϕ.

Furthermore, if

Z_{0} = η_{0}

a.s. we say

η_{0}

is the initial point of Z.

For instance, Examples 1 and 2 observed in this section define Z as a locally bounded process. We see it below.

Example 3.

Recall Example 1. In the same reference [4], the optimization algorithm is asked to hold additional conditions in order to prove its convergence. We added the convergence theorem in Appendix A. Some of the conditions are

\begin{matrix} \sum_{t} γ {(t)}^{2} < \infty, \sum_{t} γ (t) = \infty, \\ Z_{0} = η_{0} \in R^{k}, \\ (\exists A, B) (\forall t) E_{t} ∥ X_{t} ∥^{2} \leq A + B {∥ Z_{t} - \bar{η} ∥}^{2} \end{matrix}

(8)

where

\bar{η} \in R^{k}

is the optimal point of L. Standard constraint to γ is clearly asked. Moreover

η_{0}

is a starting point. It remains to be seen if X is locally and linearly bounded by some function

ϕ : R^{k} \to R

. Indeed, if we define

ϕ (η) = ∥ η - \bar{η} ∥^{2}

, then the property is easily checked. Hence Z is locally bounded by ϕ with initial point

η_{0}

.

Example 4.

Recall Example 2. Convergence theorem in [5], which is added in the Appendix A, demands below conditions;

\begin{matrix} \sum_{t} γ {(t)}^{2} < \infty, \sum_{t} γ (t) = \infty, \\ Z_{0} = η_{0} \in R^{k}, \\ (\exists A, B) (\forall t) E_{t} {∥ X_{t} ∥}^{2} \leq A + B L (Z_{t}) \end{matrix}

(9)

where L is a function to optimize. For this example, Z is then locally bounded by

ϕ = L

with initial point

η_{0}

. Just as an observation, property of

B_{t}

being determined after information available at time t, is the same as seeing

B_{t}

as a

F_{t}

-measurable random variable over the product probability space.

We are interested on studying the almost sure convergence of Z to a point

\bar{η} \in R^{k}

. A stochastic process Z almost surely (a.s.) converges to a point

\bar{η} \in R^{k}

if

P [ω \in Ω : lim_{t \to \infty} Z_{t} (ω) = \bar{η}] = 1 .

(10)

Examples 3 and 4 show us that we can understand the results in [4,5] as the almost sure convergence of some locally bounded processes. In this paper, we are interested in characterizing the almost sure convergence of locally bounded processes. The objective of this work is to create a theory that allows to prove the a.s. convergence of locally bounded processes that covers Examples 3 and 4 and whose applicability generalizes to a wider set of processes, such as the one described below:

Example 5.

Assume the function

f (η) = {∥ η ∥}^{2}

defined in

R^{k}

, and the optimization method Z defined by its director process

X_{t} = G_{1} \cdot G_{2} \cdot Z_{t}

where

G_{1}

and

G_{2}

are positive definite and symmetric matrices. For simplicity, this example shows a stochastic process with no random phenomena associated. We wonder about the convergence of process Z, and if so, whether it converges to the point of

R^{k}

that optimizes function f. From Theorems A1 and A2 found in the literature (included in Appendix A) it is not possible to prove a.s. convergence of Z, since conditionsBottou resemblance and C.3, respectively, are not satisfied. That is, because

Z_{t} \cdot^{⊺} G_{1} \cdot G_{2} \cdot Z_{t}

is possibly negative a.s.

Further on, Z is assumed to be locally bounded by

ϕ

where

(X, γ)

is its corresponding decomposition of 1-increments, unless otherwise indicated.

2.2. Main Result

The objective of the article is to proof below theorem, that we prove in Section 4.1.

Theorem 1.

Let Z be a stochastic process on probability space

(Ω, F, P)

. Then Z almost surely converges to a point

\bar{η}

if there is a twice differentiable convex function ϕ with unique minimum

\bar{η}

defined in

R^{k}

with bounded Hessian norm, such that

Z is locally bounded by ϕ;
Z resembles $\nabla ϕ$ .

There is one concept of the theorem that needs a definition. That is, when a stochastic process resembles to a vector field. Next sections have that end, with our main definition that fills the gap appearing at Section 3.2. As we will see in Section 4.4, simple Example 5 finds a solution with our main theorem.

2.3. Expected Direction Set

We now define one key object of our work named the expected direction set. It focuses on gathering all directions that the update may take at time t conditioned to

F_{t}

. Before the definition we provide some concepts and notation.

Random variable

E_{t} [X_{t}]

determines all expected directions of Z at time t that the stochastic process may follow assuming

F_{t}

. For example, if

ω \in Ω

is an observation, then

E_{t} [X_{t}] (ω) \in R^{k}

is a vector pointing to the expected update direction departing from point

Z_{t} (ω)

given

F_{t}

. Denote the expected direction of Z at

ω \in Ω

and time t as

D_{Z} (ω, t) = E_{t} [X_{t}] (ω) .

(11)

The expected direction from point

η = Z_{t} (ω)

of Equation (11) depends on

ω

. That is, the path followed until reaching

η = Z_{t} (ω) \in R^{k}

matters. For instance, if

ω_{1}, ω_{2} \in Ω

are different observations, such that

η = Z_{t} (ω_{1}) = Z_{t} (ω_{2})

, then possibly

D_{Z} (ω_{1}, t) \neq D_{Z} (ω_{2}, t)

. We collect all expected directions at

η = Z_{t} (ω)

and time t in the vector set below;

S_{Z} (η, t) = {D_{Z} (ω, t) ∣ ω \in Ω, Z_{t} (ω) = η} .

(12)

The tools to define the expected direction set at

η \in R^{k}

after time

T \in N

are given, so we proceed to its formal definition.

Definition 3.

Let

Z = (X, γ)

. Define the expected directions set of Z at

η \in R^{k}

after time

T \in N

as

E D S_{Z} (η, T) : = ⋃_{t \geq T} S_{Z} (η, t) .

(13)

With a few words,

E D S_{Z} (η, T)

is a vector set containing all expected directions (provided by the director process X) conditioned to

F_{t}

for every outcome

ω

such that

Z_{t} (ω) = η

where

t \geq T

. In Definition 3,

E D S

depends on T. That is because to assess the convergence of an algorithm it is not important to consider all expected directions throughout all the process. For example, if an algorithm converges we can modify randomly all directions of the director process for just a particular time

T \in N

, and the resulting algorithm still converges. Roughly speaking, only the tail of a process matters to determine the convergence property. This concept is better addressed with Definition 4 in next section.

Example 6.

Recall Example 1. Assume that Z is then SGD. Then

E D S_{Z} (η, T)

is a singleton. Indeed,

D_{Z} (ω, t)

is the same vector for all

t \in N

and all ω with

Z_{t} (ω) = η

and hence

S_{Z} (η, t) = {D_{Z} (ω, t)}

with any

ω \in Ω

with

Z_{t} (ω) = η

. Finally

E D S_{Z} (η, T) = {E_{t} [X_{t}] (ω)} f o r a n y ω \in Ω a n d t \geq T w i t h Z_{t} (ω) = η .

(14)

This is the case of any non-stationary Markov chain.

2.4. Essential Expected Direction Set

Convergence property of an algorithm relates closely to directions followed after time

T \in N

as T tends to infinity. Equivalently, the direction set appearing repeatedly through the whole optimization process matters, while directions set only contemplated for a finite amount of iterations changes nothing, in terms of convergence guarantee. This direction set is named the essential expected directions set in this article.

To define properly the essential expected directions set, we will use the convex vector subspace of a given vector set. Given a vector set U in

R^{k}

, let

C (U)

be the smallest convex vector subspace containing U. See Figure 2 as an illustrative example. Observe that

C (U)

is always closed, but it may be unbounded.

Definition 4.

Let

Z = (X, γ)

. Define the essential expected directions set of Z at η as

E E D S_{Z} (η) : = \cap_{T} C (E D S_{Z} (η, T)) .

(15)

Example 7.

Assume Z is any non-stationary Markov chain, such that SGD in Example 1. Then

E E D S_{Z} (η) = E D S_{Z} (η, T)

for any T. Indeed, we have seen in Example 6 that

E D S_{Z} (η, T) = {E_{t} [X_{t}] (ω)}

for any

ω \in Ω

and

t \geq T

where

Z_{t} (ω) = η

. Hence

E E D S_{Z} (η) = \cap_{T} C (E D S_{Z} (η, T)) = C ({E_{t} [X_{t}] (ω)}) = {E_{t} [X_{t}] (ω)} = E D S_{Z} (η, T),

(16)

for any

ω \in Ω

with

Z_{T} (ω) = η

.

Definition of

E E D S_{Z} (η)

delimits the smallest subspace where all directions at

η

tend to. Clearly,

E E D S_{Z} (η)

is also convex and closed (possibly empty). Deeper properties of this set lead to identify divergence symptoms. For example, if it is empty or unbounded, we face instability of the process at

η

. To see this, observe below result. The proof can be found in the Appendix B.

Corollary 1.

Let

η \in R^{k}

. Then

E E D S_{Z} (η)

is a non-empty bounded set if, and only if, there exists

T \in N

, such that

C (E D S_{Z} (η, T))

is bounded.

This result relates

E E D S_{Z} (η)

with instability properties of Z. If

E E D S_{Z} (η)

is empty or unbounded, then the algorithm is unstable at

η

, since expected directions with arbitrarily large norms exist after enough iterations. Clearly, if this situation is found for all points near the optimum, the algorithm can not converge to the solution. It is desirable instead that

C (E D S_{Z} (η, T))

is compact (bounded) for some T for every

η \in R^{k}

, or equivalently, that

E E D S_{Z} (η)

is compact (bounded) and not empty.

In fact, since we are interested in the case where Z is locally bounded by

ϕ

(recall Definition 2), we can assume that

E E D S_{Z} (η)

is a non empty compact set, by virtue of below results.

Proposition 1.

Let stochastic process Z be locally bounded by ϕ. Then

C (E D S_{Z} (η, 0))

is a non-empty compact set.

Proof.

We know that X is locally and linearly bounded. Hence, applying Jensen’s inequality

∥ E_{t} [X_{t}] ∥^{2} \leq E_{t} {∥ X_{t} ∥}^{2} \leq A + B \cdot ϕ (Z_{t}) .

(17)

Let

η \in R^{k}

and

ω \in Ω

, such that

Z_{t} (ω) = η

for some

t \geq 0

. Therefore, every

v = E_{t} [X_{t}] (w) \in E D S_{Z} (η, 0)

has bounded norm by

A + B \cdot ϕ (η)

, implying that

C (E D S_{Z} (η, 0))

is a non-empty compact set. □

Below corollary is a consequence of Proposition 1 and Corollary 1.

Corollary 2.

Let stochastic process Z be locally bounded by ϕ. Then

E E D S_{Z} (η)

is a non-empty compact set for all

η \in R^{k}

.

3. Vector Field Half-Spaces and Stochastic Processes. Resemblance

This section defines the main concept of this work; the property of resemblance between a stochastic process and a vector field. The definition highlight some commonalities between Theorems A1 and A3. Both of them prove the convergence of stochastic processes that resemble to particular vector fields. A geometric interpretation and explanation of convergence theorems conditions is established in next Section 4.

Some previous definitions are needed and stated before introducing the main concepts of the article, such as

ϵ

-acute vector pair sets and the half-space of a vector field. The section starts with some basic concepts about vectors.

Definition 5.

Let

u, v \in R^{k}

be two vectors. The pair

(u, v)

is acute if u and v form an acute angle, that is, if

u^{⊺} \cdot v > 0

. Furthermore, if

u^{⊺} \cdot v \geq ϵ > 0

then

(u, v)

is ϵ-acute.

Proposition 2.

Let

u, v \in R^{k}

be two vectors. Then the pair

(u, v)

is ϵ-acute if, and only if, there exists a symmetric positive-definite matrix B, such that

B \cdot u = v

and

u^{⊺} \cdot B \cdot u \geq ϵ

.

A vector pair set V is a set of vector pairs

V = {(u_{i}, v_{i}) \in {(R^{k})}^{2} ∣ i \in I}

where I is an index set.

Definition 6.

Let V be a vector pair set. V is ϵ-acute if every vector pair

(u, v) \in V

is ϵ-acute.

Next result is a direct consequence.

Proposition 3.

Let V be a vector pair set, indexed by I. Then, V is ϵ-acute for some

ϵ > 0

if, and only if;

inf_{\begin{matrix} i \in I \\ (u_{i}, v_{i}) \in V \end{matrix}} u_{i}^{⊺} v_{i} > 0 .

(18)

Proposition 4.

Let V be a vector pair set, indexed by I. Then, V is ϵ-acute for some

ϵ > 0

if, and only if, there exist a set of symmetric positive-definite matrices

B = {B_{i} ∣ i \in I}

such that

\begin{matrix} inf_{\begin{matrix} i \in I \\ (u_{i}, v_{i}) \in V \end{matrix}} u_{i}^{⊺} B_{i} u_{i} & > 0, \\ B_{i} u_{i} & = v_{i} . \end{matrix}

(19)

Proof.

Prove, first, that if there exist a set of matrices

B = {B_{i} ∣ i \in I}

holding Equation (19) then V is

ϵ

-acute for some

ϵ > 0

. Observe that after Equation (19);

inf_{i \in I} u_{i}^{⊺} v_{i} = inf_{i \in I} u_{i}^{⊺} B_{i} u_{i} > 0 .

(20)

Then, Proposition 3 implies that V is

ϵ

-acute and finishes this part of the proof.

Now assume that V is

ϵ

-acute, prove then that there exist a set of matrices

B = {B_{i} ∣ i \in I}

holding Equation (19). Since V is

ϵ

-acute, in particular, the pair

(u_{i}, v_{i}) \in V

is

ϵ

-acute for every

i \in I

. Apply Proposition 2: for every

i \in I

there exists a symmetric positive-definite matrix

B_{i}

, such that

B_{i} u_{i} = v_{i}

and

u_{i}^{⊺} \cdot B_{i} \cdot u_{i} \geq ϵ

. This finishes the proof. □

3.1. The Half-Space of a Vector Field

The half-space determined by a vector u is the set of vectors that conform an acute angle with u. This region clearly occupies half of the total space. Additionally, the

ϵ

-half-space of u with

ϵ > 0

is the set of vectors v, such that the vector pair

(u, v)

is

ϵ

-acute. This object is needed for afterwards defining the half-space of a vector field. We define these concepts below and illustrate the

ϵ

-half-space of a vector u in Figure 3.

Definition 7.

Let u be a vector of

R^{k}

. The half-space of u is the set

H (u) = {v \in R^{k} ∣ u^{⊺} \cdot v > 0} .

(21)

Similarly, the ϵ-half-space of u with

ϵ > 0

is the set

H_{ϵ} (u) = {v \in R^{k} ∣ u^{⊺} \cdot v \geq ϵ} .

(22)

A vector field

X

over

R^{k}

is a function assigning to every

η \in R^{k}

a vector of

R^{k}

, that is

X : R^{k} \to R^{k}

. For example, if

l : R^{k} \to R

is a twice differentiable function, we can consider the vector field consisting of the gradient vectors at each point

η

. Precisely, denote the gradient vector field (GVF) as

X_{\nabla l}

, where

X_{\nabla l} (η) = \nabla l (η)

.

We are ready to define the half-space of a vector field.

Definition 8.

Let

X

be a vector field over

R^{k}

. The half-space of

X

is a function

H (X)

mapping every η to

H (X) (η) = H (X (η))

. Similarly, the ϵ-half-space of

X

with

ϵ > 0

is a function

H_{ϵ} (X)

mapping every η to

H_{ϵ} (X) (η) = H_{ϵ} (X (η))

.

3.2. Resemblance between a Stochastic Process and a Vector Field

The convergence of any locally bounded process can be proved comparing the expected directions set of the algorithm with some vector fields. When the expected directions resemble the vector field we compare it to, then we can ensure the almost sure convergence to a point of the stochastic process, after some reasonable conditions. By resemblance, we mean that the expected directions set after some time T is a subset of the

ϵ

-half-space of

X

, among other things explained later. Therefore, resemblance asks for every

η \in R^{k}

that every vector

D_{Z} (ω, t)

with

t \geq T

and every

ω \in Ω

with

η = Z_{t} (ω)

form an acute angle with the vector field at

X (η)

.

However, if the vector field sends a specific point

η

to

0 \in R^{k}

, then no direction can be set by the

D_{Z} (ω, t)

to form an acute vector pair. Therefore, resemblance property is evaluated outside the neighborhood of these annulled points. That is why we must consider now the set of annulled points of a vector field and the neighborhoods around the points of this set.

Formally, let

X

be a vector field defined in

R^{k}

. The set

K_{X}

is the set of points of

R^{k}

annulled by

X

, that is,

K_{X} : = {η \in R^{k} ∣ X (η) = 0}

. Moreover, consider the closed ball centered on

K_{X}

of radius

δ

as

B_{δ} (K_{X}) : = \cup_{η \in K_{X}} B_{δ} (η)

where

B_{δ} (η)

is the closed ball of radius

δ

centered on

η

.

We also use the notation

A^{'} = R^{k} ∖ A

for the compliment set of subset

A \subset R^{k}

. We say that Zϵ-resembles to

X

at

η

from T on if

E D S_{Z} (η, T) \subset H_{ϵ} (X) (η)

. Observe an illustrative example in Figure 4.

This intuition is naturally extended to ϵ-resemblance at sets, when the property is satisfied for every

η

in the set. With this in mind we can define the key concept of this article.

Definition 9.

Let

Z = (X, γ)

be a stochastic process and

X

be a vector field over

R^{k}

. We say that Z resembles to

X

from

T \in N

on, if;

(\forall δ > 0) (\exists ϵ > 0) Z ϵ - resembles to X at B_{δ} {(K_{X})}^{'} from T on

(23)

We say that Z resembles to

X

if there is

T \in N

such that it resembles to

X

from T on.

Everything is set up to accomplish the goal of this paper. We refresh the main theorem of this article in next section and show its proof.

4. Proof of Main Result. Reinterpretation of Convergence Theorems

The objective of the article is within reach now. That is, proving main Theorem 1. Moreover, this section addresses afterwards the task of proving that Theorems A1 and A3 are particular examples of our main Theorem 1.

4.1. Resemblance to Conservative Vector Fields and Convergence

Recall main Theorem 1 and observe that it asks the stochastic process Z to be locally bounded by some function

ϕ

and Z to resemble to

\nabla ϕ

. Therefore,

\nabla ϕ

is a particular type of vector field called conservative vector field. That is, a vector field that appears from derivation of a function. That is why we understand our main theorem as a convergence result of locally bounded processes of resemblance to conservative vector field.

In the theorem statement, it says that

ϕ

has bounded Hessian norm. Similarly to Theorem A3, it means that:

(\exists K) (\forall η) ∥ \nabla_{η}^{2} ϕ (η) ∥ \leq K^{'} .

We are ready to prove the main result of the paper.

Proof of main Theorem 1.

Observe that

ϕ

is bounded from below. Indeed,

\bar{η}

is a minimum and

ϕ

is convex with

X (\bar{η}) = 0

where

X = \nabla ϕ

. Therefore, there exists a constant

m \geq 0

such that

ϕ (η) + m \geq 0

for all

η

. Define

ψ (η) = ϕ (η) + m

. Clearly,

\nabla ψ = \nabla ϕ = X

, and, therefore, Zresembles to

\nabla ψ

. Moreover, Z is locally bounded by

ψ

and

ψ

clearly satisfies the Hessian norm bound.

From here, the prove follows the steps of Theorem A2’s proof. Taylor inequality and Hessian norm bound;

\begin{matrix} ψ (Z_{t + 1}) & = ψ (Z_{t} - γ_{t} X_{t}) \\ \leq ψ (Z_{t}) - γ_{t} X {(Z_{t})}^{⊺} X_{t} + γ_{t}^{2} K {∥ X_{t} ∥}^{2} \end{matrix}

(24)

where

K = \frac{K^{'}}{2}

. Apply expectation conditioned to information until time t and then use that Z is locally bounded by

ψ

;

\begin{matrix} E_{t} ψ (Z_{t + 1}) & \leq ψ (Z_{t}) - γ_{t} X {(Z_{t})}^{⊺} E_{t} [X_{t}] + γ_{t}^{2} K E_{t} [∥ X_{t} ∥^{2}] \\ \leq ψ (Z_{t}) - γ_{t} X {(Z_{t})}^{⊺} E_{t} [X_{t}] + γ_{t}^{2} K (A + B ψ (Z_{t})) \\ \leq (1 + γ_{t}^{2} K B) ψ (Z_{t}) - γ_{t} X {(Z_{t})}^{⊺} E_{t} [X_{t}] + γ_{t}^{2} K A . \end{matrix}

(25)

Use now that Zresembles to

X

. Then, there exists T such that for every

t \geq T

, the term

- γ_{t} X {(Z_{t})}^{⊺} E_{t} [X_{t}]

is negative. All other conditions of Robbins–Siegmund theorem (in [7], added in Appendix A) also hold for the algorithm after time T, thanks to learning rate constraints. Apply it and deduce that random variables

ψ (Z_{t})

converge almost surely to a random variable (and so does

ϕ (Z_{t})

) and that;

\sum_{t} γ_{t} X {(Z_{t})}^{⊺} E_{t} [X_{t}] < \infty a . s .

(26)

Prove now that stochastic process

ϕ (Z_{t})

converges almost surely to value

ϕ (\bar{η})

. Proceed by contradiction. Assume that for

δ_{1} > 0

P [ω \in Ω ∣ lim_{t} ϕ (Z_{t} (w)) \in B_{δ_{1}} {(ϕ (\bar{η}))}^{'}] > 0

(27)

this implies, by continuity and convexity of function

ϕ

, that there exists

δ

P [A = {ω \in Ω ∣ lim_{t} Z_{t} (w) \in B_{δ} {(\bar{η})}^{'}}] > 0 .

(28)

By resemblance and definition of the limit, there exists T and

ϵ

such that

E D S_{Z} (η, T) \subset H_{ϵ} (X) (η)

for every

η \in B_{δ} {(\bar{η})}^{'}

. This leads to a contradiction, since using learning rate standard constraint we have

\sum_{t \geq T} γ_{t} X {(Z_{t} (ω))}^{⊺} E_{t} [X_{t}] (ω) > \sum_{t \geq T} γ_{t} \cdot ϵ = \infty

(29)

for every

ω \in A

, which has measure different to 0 by Equation (28). This clearly contradicts Equation (26).

Hence,

ϕ (Z_{t})

converges almost surely to

ϕ (\bar{η})

and

Z_{t}

converges almost surely to

\bar{η}

as wanted. □

4.2. Reinterpretation of Bottou’s Convergence Theorem

The goal now is to deduce Theorem A1 as a direct consequence of main Theorem 1. Consider a particular case of main Theorem 1 where

ϕ (η) = ∥ η - \bar{η} ∥^{2}

, that reads as follows.

Corollary 3.

Let

ϕ (η) = ∥ η - \bar{η} ∥^{2}

and Z be a stochastic process on probability space

(Ω, F, P)

. Then Z almost surely converges to

\bar{η}

if

Z is locally bounded by ϕ;
Z resembles $\nabla ϕ$ .

Additional conditions to

ϕ

, such as Hessian bound or twice differentiability, are not specified in the corollary since with the particular definition of

ϕ

all those conditions are already satisfied.

To see that Corollary 3 proves Theorem A1 statement, we need to prove that Theorem A1 is assuming that Z is locally bounded by

ϕ

and that Z resembles

\nabla ϕ

. Example 3 already proves that Bottou is assuming that Z is locally bounded by

ϕ

. Therefore, it remains to check that Z resembles to

\nabla ϕ

. To that end, see below proposition proved in Appendix C.

Proposition 5.

Let

Z = (X, γ)

be a stochastic process and

X

be a vector field over

R^{k}

. Then Z resembles to

X

if, and only if,

(\exists T \in N) (\forall δ > 0) inf_{\begin{matrix} η \in R^{k} ∖ B_{δ} (K_{X}) \\ v \in E D S_{Z} (η, T) \end{matrix}} X {(η)}^{⊺} \cdot v > 0 .

(30)

Observe condition Bottou resemblance of Theorem A1 and Proposition 5. Deduce from it, that the algorithm Z of the theorem resembles to vector field

\nabla ϕ

.

Corollary 4.

Let

Z = (X, γ)

be a stochastic process and

\bar{η} \in R^{k}

. Then Z resembles to

\nabla ϕ

with

ϕ (η) = ∥ η - \bar{η} ∥^{2}

if, and only if,Bottou resemblanceholds.

4.3. Reinterpretation of Sunehag’s Convergence Theorem

Theorem A3 is deduced from main Theorem 1. Similarly to previous section, we provide a version of our main theorem for the case where

ϕ = l

is a function that we aim to minimize.

Corollary 5.

Let

l : R^{k} \to R

be a twice differentiable cost function with a unique minimum

\bar{η}

and bounded Hessian norm, and let Z be a stochastic process on probability space

(Ω, F, P)

. Then Z converges to the minimum

\bar{η}

of l almost surely if

Z is locally bounded by l;
Z resembles $\nabla l$ .

The stochastic process described in Theorem A3 has some more properties, such as

X_{t} = B_{t} \cdot Y_{t}

. However, if we prove that Z of that theorem is locally bounded by l and that Zresembles

\nabla l

, then it is clear that Corollary 5 deduces Theorem A3. Recall Example 4 and notice that we already proved that Z is locally bounded by l. The remaining property is acquired after below proposition that we prove in Appendix D.

Proposition 6.

Let

Z = (X, γ)

be a stochastic process and

X

be a vector field over

R^{k}

. Then Z resembles to

X

if, and only if, there exists T, such that for every

t \geq T

there are random vectors

Y_{t}

to

R^{k}

and symmetric and positive-definite

F_{t}

-measurable random matrices

B_{t}

such that

B_{t} \cdot Y_{t} = X_{t},

(31)

E_{t} [Y_{t}] = X (Z_{t}) Z_{t} (ω) \notin K_{X},

(32)

(\forall δ > 0) inf_{\begin{matrix} η \in R^{k} ∖ B_{δ} (K_{X}) \\ t \geq T \\ ω \in Ω, Z_{t} (ω) = η \end{matrix}} X {(η)}^{⊺} \cdot B_{t} (ω) \cdot X (η) > 0 .

(33)

It is only necessary to put together Proposition 6 and condition C.1 and Sunehag resemblance to finish our objective with the following corollary

Corollary 6.

Let l be a differentiable function and

Z = (X, γ)

be a stochastic process. Then Z resembles to

\nabla l

if, and only if, there exist T such that for every

t \geq T

there are random vectors

Y_{t}

to

R^{k}

and symmetric and positive-definite

F_{t}

-measurable random matrices

B_{t}

such that

B_{t} \cdot Y_{t} = X_{t}

and conditionsC.1andSunehag resemblancehold.

Corollaries 4 and 6 nicely show the value of Theorem 1 for proving convergence. To reinforce this, we notice that the convergence of algorithm DSNGD in [12] is easily proved by means of Corollary 5, by combining both Theorem A3 and Corollary 6. This shows that Theorem 1 allows to prove convergence of a wider set of stochastic processes and function optimization methods.

4.4. Convergence of Process in Example 5

Our theorem solves question proposed by Example 5. To see it, just define

ϕ (η) = \frac{1}{2} η^{⊺} \cdot G_{2} \cdot η .

(34)

Twice differentiable and convex function

ϕ

has bounded Hessian norm, since its Hessian is the constant matrix

G_{2}

. Moreover, Z is clearly locally bounded by

ϕ

. Indeed, recall Equation (7) and observe;

∥ G_{1} \cdot G_{2} \cdot Z_{t} ∥^{2} \leq B \cdot ϕ (Z_{t})

(35)

where

B = \frac{2 λ_{1}^{2} \cdot λ_{2}^{2}}{μ_{2}}

such that

λ_{i}

is the greatest eigenvalue of

G_{i}

and

μ_{i}

is the least eigenvalue of

G_{i}

.

Finally, check that Zresembles to

\nabla ϕ (η) = G_{2} η

. Observe that

E D S_{Z} (η, T) = {G_{1} G_{2} η}

is a singleton for every T. Then for all

δ > 0

and all

η \in B_{δ} (K_{\nabla ϕ})

it is

\nabla ϕ {(η)}^{⊺} \cdot G_{1} G_{2} η = η^{⊺} G_{2} \cdot G_{1} G_{2} η \geq ϵ,

(36)

where

ϵ = μ_{1} \cdot μ_{2}^{2} \cdot δ^{2}

. Hence Z resembles to

\nabla ϕ

and by virtue of our main Theorem 1 process Z converges a.s. to 0, and, therefore, minimizes function f as wanted.

5. Conclusions

We have presented a result that allows us to prove the convergence of stochastic processes. We have proven that two useful convergence results in the literature are a consequence of our theorem. This is made after a new theory that compares the expected directions of the algorithm to conservative vector fields. If the expected directions at a point

η

resemble enough to vector

X (η)

with

\nabla ϕ = X

a conservative vector field, then the process is stable at that point. If this happens for every

η \in R^{k}

, and in addition the process is locally bounded by

ϕ

, then the process is globally stable and converges.

Some inspiring paths remain unexplored after this work. For example, finding

ϕ

function is the key to prove convergence, and it is asked to be a convex twice differentiable function. It is interesting to study how function

ϕ

can be obtained, for instance as a sum of other convex twice differentiable functions

ϕ_{i}

.

Another promising research line is a deeper analysis of

E D S

and

E E D S

objects, which may guarantee the existence of a function

ϕ

without the need of finding it. If sufficient conditions are established for a stochastic process to ensure resemblance to some unknown conservative vector field, then

ϕ

searching can be dodged. Even proving the non-existence of such function after a wider study of

E D S

and

E E D S

is useful, forbidding the use of our theorem.

It is also interesting to study the converse implication. Specifically, investigating the conditions that lead to divergent instances after ground theory explained in the article. In this sense, Lyapunov characterization of convergent processes becomes a helpful and key theory, since great similarities arise between these two techniques.

Furthermore, in many occasions the function

ϕ

to optimize can be established beforehand (convex and twice differentiable). Therefore, the opposite process can be considered, that is, generating a set of stochastic processes that resemble to

\nabla ϕ

, assuring in consequence the convergence of such candidates.

In [17], one finds another relevant convergence result. It assures the convergence in probability of a stochastic process, instead of almost sure convergence worked in this article. We wonder about the existing commonalities with our theorem, and the possibility to relax the conditions our theorem imposes, yet ensuring convergence in probability of a process.

We are currently working on two weaker resemblance properties, that we name weak and essential resemblance. The intention is to deduce almost sure convergence of a process by only studying its essential expected direction set (EEDS).

Author Contributions

Conceptualization, B.S.-L. and J.C.; writing—original draft preparation, B.S.-L. and J.C.; review and editing, B.S.-L. and J.C.; supervision, J.C.; funding acquisition, J.C. All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the projects Crowd4SDG and Humane-AI-net, which have received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreements No. 872944 and No. 952026, respectively. This work is also partially supported by the project CI-SUSTAIN funded by the Spanish Ministry of Science and Innovation (PID2019-104156GB-I00).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. Convergence Theorems

We state below Bottou’s convergence theorem appearing in [4] and Sunehag et al. convergence theorem in [5]. We provide a generalization of such theorems, whose proofs carry no complications from their original proofs. Moreover, we adapt the notation to our text and replace algorithm concepts by the corresponding terms appearing in the more generic stochastic process theory branch. We name every condition described by the result to refer to them in the article.

Theorem A1

(Bottou’s in [4]). Let

l : R^{k} \to R

be a function with a unique minimum

\bar{η}

and

Z_{t + 1} = Z_{t} - γ (t) X_{t}

be a stochastic process. Then Z converges to

\bar{η}

almost surely if the following conditions hold;

\begin{matrix} Bottou resemblance & (\forall δ > 0) inf_{∥ Z_{t} - \bar{η} ∥ > δ} {(Z_{t} - \bar{η})}^{⊺} \cdot E_{t} [X_{t}] > 0 \\ Bottou algorithm bound & (\exists A, B) (\forall t) E ∥ X_{t} ∥^{2} \leq A + B {∥ Z_{t} - \bar{η} ∥}^{2} \\ Learning rate constraint & \sum_{t} γ {(t)}^{2} < \infty, \sum_{t} γ (t) = \infty \end{matrix}

Theorem A2

(Theorem 3.2 in [5]). Let

l : R^{k} \to R

be a twice differentiable cost function with a unique minimum

\bar{η}

and let

Z_{t + 1} = Z_{t} - γ_{t} B_{t} Y (Z_{t})

be a stochastic process where

B_{t}

is symmetric and only depends on information available at time t and. Then Z converges to the

\bar{η}

almost surely if the following conditions hold;

\begin{matrix} C . 1 & (\forall t) E_{t} Y (Z_{t}) = \nabla l (Z_{t}) \\ C . 2 & (\exists K) (\forall η) ∥ \nabla_{η}^{2} l (η) ∥ \leq 2 K \\ C . 3 & (\forall δ > 0) inf_{l (Z_{t}) - l (\bar{η}) > δ} ∥ \nabla l (Z_{t}) ∥ > 0 \\ C . 4 & (\exists A, B) (\forall t) E ∥ Y (Z_{t}) ∥^{2} \leq A + B l (Z_{t}) \\ C . 5 & (\exists a, b : 0 < a < b < \infty) (\forall t) s p e c (B_{t}) \subset [a, b] \\ C . 6 & \sum_{t} γ {(t)}^{2} < \infty, \sum_{t} γ (t) = \infty \end{matrix}

where

s p e c (B)

are the eigenvalues of matrix B.

Now, we provide a generalization of theorem of sunehag in [5]. Specifically, we deleted condition C.5 and modified (and relaxed) conditions C.3 and C.4 of the original statement. The proof is trivial after the original theorem’s proof, so the modifications present no complications.

Theorem A3

(Generalization of Theorem A2). Let

l : R^{k} \to R

be a twice differentiable cost function with a unique minimum

\bar{η}

and let

Z_{t + 1} = Z_{t} - γ_{t} B_{t} Y_{t}

a stochastic process where

B_{t}

is

F_{t}

-measurable. Then Z converges to the

\bar{η}

almost surely if the following conditions hold;

\begin{matrix} C . 1 & (\forall t) E_{t} Y_{t} = \nabla l (Z_{t}) η_{t} \neq \bar{η} \\ Hessian bound & (\exists K) (\forall η) ∥ \nabla_{η}^{2} l (η) ∥ \leq 2 K \\ Sunehag resembance & (\forall δ > 0) inf_{l (Z_{t}) - l (\bar{η}) > δ} \nabla l {(Z_{t})}^{⊺} B_{t} \nabla l (Z_{t}) > 0 \\ Sunehag algorithm bound & (\exists A, B) (\forall t) E ∥ B_{t} Y_{t} ∥^{2} \leq A + B l (Z_{t}) \\ Learning rate constraint & \sum_{t} γ {(t)}^{2} < \infty, \sum_{t} γ (t) = \infty \end{matrix}

Robbins–Siegmund theorem is the key result to prove almost sure convergence on previous theorems, as well as on our generalization result.

Theorem A4

(Robbins-Siegmund). Let

(Ω, F, P)

be a probability space and

F_{1} \subseteq F_{2} \subseteq \dots

a sequence of sub-σ-fields of

F

. Let

U_{t}, β_{t}, ϵ_{t}

and

ζ_{t}

,

t = 1, 2, \dots

be non-negative

F_{t}

-measurable random variables, such that

\begin{matrix} E (U_{t + 1} ∣ F_{t}) \leq (1 + β_{t}) U_{t} + ϵ_{t} - ζ_{t}, t = 1, 2, \dots \end{matrix}

(A1)

Then on the set

\{\sum_{t} β_{t} < \infty, \sum_{t} ϵ_{t} < \infty\}

,

U_{t}

converges almost surely to a random variable, and

\sum_{t} ζ_{t} < \infty

almost surely.

Appendix B. Proof of Corollary 1

To prove the corollary, it is enough to prove the generic proposition below.

Proposition A1.

Let

U_{t} \subset R^{k}

be non empty, closed and connected sets where

U_{t + 1} \subset U_{t}

for

t \in N

and let

V = \cap_{t} U_{t}

. Then V is a non empty bounded set if, and only if,

U_{T}

is bounded for some

T \in N

.

Proof.

Prove first that if

U_{T}

is bounded for some

T \in N

, then

V = \cap_{t} U_{t}

is a non-empty bounded set. Clearly,

V \subset U_{T}

and, therefore, V is bounded, possibly empty. Observe that

U_{t}

for all

t \geq T

is compact and closed. Then V is not empty, by the Cantor’s intersection theorem.

Conversely, prove now that if V is a non empty bounded set, then there exists T such that

U_{T}

is bounded. Assume V is non-empty bounded set, then there exists

r > 0

, such that

V \subset B_{r} (0)

where

B_{r} (0)

is the ball centered at 0 with radius r. Define

U_{t}^{*} = U_{t} ∖ ({\bar{B_{2 t} (0)}}^{'} \cup B_{r} (0)),

(A2)

where

\bar{B_{2 t} (0)}

is the closed ball of radius

2 t

and center 0 and

A^{'} = R^{k} ∖ A

. The sequence

U_{t}^{*}

is of compact and closed subsets, where

U_{t + 1}^{*} \subset U_{t}^{*}

and

\cap_{t} U_{t}^{*}

is empty. Therefore, by Cantor’s intersection theorem, there exists T such that

U_{T}^{*}

is empty. Then

U_{T} \subset {\bar{B_{2 t} (0)}}^{'} \cup B_{r} (0))

. Since

V \subset U_{T}

and

U_{T}

is connected, then

V \subset U_{T} \subset B_{r} (0)

and hence it is bounded as wanted to prove. □

Appendix C. Bottou’s Resemblance

Proposition 5 is a direct consequence of Proposition A2, that we state and prove below, and Proposition 3.

Proposition A2.

Let

Z = (X, γ)

be a stochastic process and

X

be a vector field over

R^{k}

. For

δ > 0

and

T \in N

, define the vector pair set

V_{δ, T} (X, Z) = {(X (η), v) ∣ η \in R^{k} ∖ B_{δ} (K_{X}), v \in E D S_{Z} (η, T)} .

(A3)

Then Z resembles to

X

if, and only if,

(\exists T \in N) (\forall δ > 0) (\exists ϵ > 0) V_{δ, T} (X, Z) is ϵ - acute .

(A4)

Proof.

By definition,

V_{δ, T} (X, X)

is

ϵ

-acute if, and only if, every vector pair

(u, v)

in

V_{δ, T} (X, X)

is

ϵ

-acute. By definition, such vector pairs

(X (η), v)

with

v \in E D S_{X} (η, T)

are

ϵ

-acute if, and only if,

(\forall η \in R^{k} ∖ B_{δ} (K_{X})) X {(η)}^{⊺} \cdot v \geq ϵ > 0, v \in E D S_{X} (η, T) .

(A5)

Previous equation holds if, and only if,

E D S_{X} (η, T) \subset H_{ϵ} (X) (η)

,

η \in R^{k} ∖ B_{δ} (K_{X})

as wanted to prove. □

Appendix D. Sunehag’s Resemblance

The result that translates Theorem A3 with resemblance concepts is Proposition 6, that we prove below.

Proof.

After Proposition A2 and 4 deduce that Z belongs to the half-space of

X

if, and only if, there exists

T \in N

such that for every

δ > 0

and every

t \geq T

there exist symmetric positive-definite

F_{t}

-measurable random matrices

B_{t}

, such that

\begin{matrix} inf_{\begin{matrix} η \in R^{k} ∖ B_{δ} (K_{X}) \\ t \geq T \\ ω \in Ω, Z_{t} (ω) = η \end{matrix}} X {(η)}^{⊺} \cdot B_{t} (ω) \cdot X (η) > 0, \\ B_{t} \cdot X (Z_{t}) = E_{t} [X_{t}] Z_{t} (ω) \notin K_{X} . \end{matrix}

(A6)

This matches with Equation (33). Matrix

B_{t}

is correctly and uniquely defined for all

t \geq T

and all

ω \in Ω

, such that

Z_{t} (ω) \notin K_{X}

. Define

B_{t} = I d

the identity matrix if

Z (ω) \in K_{X}

and also define

Y_{t} : = B_{t}^{- 1} \cdot X_{t} .

(A7)

Observe that

B_{t} \cdot Y_{t} = X_{t}

and that Equation (32) is then met too finishing the proof. □

References

Amari, S.I. Natural Gradient Works Efficiently in Learning. Neural Comput. 1998, 276, 251–276. [Google Scholar] [CrossRef]
Thomas, P.S. GeNGA: A generalization of natural gradient ascent with positive and negative convergence results. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014; Volume 5, pp. 3533–3541. [Google Scholar]
Sánchez-López, B.; Cerquides, J. Convergent Stochastic Almost Natural Gradient Descent. In Proceedings of the Artificial Intelligence Research and Development-Proceedings of the 22nd International Conference of the Catalan Association for Artificial Intelligence, Mallorca, Spain, 23–25 October 2019; Volume 319, pp. 54–63. [Google Scholar]
Bottou, L. Online Algorithms and Stochastic Approximations. In Online Learning and Neural Networks; Saad, D., Ed.; Cambridge University Press: Cambridge, UK, 1998; Revised, October 2012. [Google Scholar]
Sunehag, P.; Trumpf, J.; Vishwanathan, S.V.N.; Schraudolph, N. Variable Metric Stochastic Approximation Theory. In Proceedings of the Artificial Intelligence and Statistics, Clearwater, FL, USA, 16–19 April 2009; pp. 560–566. [Google Scholar]
Lyapunov, A.M. The general problem of the stability of motion. Int. J. Control 1992, 55, 531–534. [Google Scholar] [CrossRef]
Robbins, H.; Siegmund, D. A convergence theorem for non negative almost supermartingales and some applications. In Optimizing Methods in Statistics; Rustagi, J.S., Ed.; Academic Press: Cambridge, MA, USA, 1971; pp. 233–257. [Google Scholar]
Karlin, S.; Taylor, H.M. Elements of stochastic processes. In A First Course in Stochastic Processes, 2nd ed.; Karlin, S., Taylor, H.M., Eds.; Academic Press: Boston, MA, USA, 1975; Chapter 1; pp. 1–44. [Google Scholar] [CrossRef]
Ross, S.M.; Kelly, J.J.; Sullivan, R.J.; Perry, W.J.; Mercer, D.; Davis, R.M.; Washburn, T.D.; Sager, E.V.; Boyce, J.B.; Bristow, V.L. Stochastic Processes; Wiley: New York, NY, USA, 1996; Volume 2. [Google Scholar]
Bass, R.F. Stochastic Processes; Cambridge University Press: Cambridge, UK, 2011; Volume 33. [Google Scholar]
Grimmett, G.; Stirzaker, D. Probability and Random Processes; OUP Oxford: Oxford, UK, 2020. [Google Scholar]
Sánchez-López, B.; Cerquides, J. Dual Stochastic Natural Gradient Descent and convergence of interior half-space gradient approximations. arXiv 2021, arXiv:2001.06744. [Google Scholar]
Tao, T. An Introduction to Measure Theory; Graduate Studies in Mathematics, American Mathematical Society: Providence, RI, USA, 2011. [Google Scholar]
Duchi, J.; Hazan, E.; Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
Zeiler, M.D. ADADELTA: An Adaptive Learning Rate Method. arXiv 2012, arXiv:1212.5701. [Google Scholar]
Kingma, D.P.; Ba, L.J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Robbins, H.; Monro, S. A Stochastic Approximation Method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar] [CrossRef]

Figure 1. Path of stochastic process Z with director process X and learning rate

γ

.

Figure 1. Path of stochastic process Z with director process X and learning rate

γ

.

Figure 2. Set of vectors U and its convex vector subspace

C (U)

in

R^{2}

.

Figure 2. Set of vectors U and its convex vector subspace

C (U)

in

R^{2}

.

Figure 3. Shaded area representing

H_{ϵ} (u)

.

Figure 3. Shaded area representing

H_{ϵ} (u)

.

Figure 4. A stochastic process Z that

ϵ

-resembles to

X

at

η

from T on, since vector set

E D S_{Z} (η, T)

of all expected directions of Z at

η

after time T belongs to

H_{ϵ} (X) (η)

.

Figure 4. A stochastic process Z that

ϵ

-resembles to

X

at

η

from T on, since vector set

E D S_{Z} (η, T)

of all expected directions of Z at

η

after time T belongs to

H_{ϵ} (X) (η)

.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Sánchez-López, B.; Cerquides, J. On the Convergence of Stochastic Process Convergence Proofs. Mathematics 2021, 9, 1470. https://doi.org/10.3390/math9131470

AMA Style

Sánchez-López B, Cerquides J. On the Convergence of Stochastic Process Convergence Proofs. Mathematics. 2021; 9(13):1470. https://doi.org/10.3390/math9131470

Chicago/Turabian Style

Sánchez-López, Borja, and Jesus Cerquides. 2021. "On the Convergence of Stochastic Process Convergence Proofs" Mathematics 9, no. 13: 1470. https://doi.org/10.3390/math9131470

APA Style

Sánchez-López, B., & Cerquides, J. (2021). On the Convergence of Stochastic Process Convergence Proofs. Mathematics, 9(13), 1470. https://doi.org/10.3390/math9131470

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

On the Convergence of Stochastic Process Convergence Proofs

Abstract

1. Introduction

2. Main Result. Director Process and the Expected Direction Set

2.1. Locally Bounded Stochastic Processes and Objective of the Work

2.2. Main Result

2.3. Expected Direction Set

2.4. Essential Expected Direction Set

3. Vector Field Half-Spaces and Stochastic Processes. Resemblance

3.1. The Half-Space of a Vector Field

3.2. Resemblance between a Stochastic Process and a Vector Field

4. Proof of Main Result. Reinterpretation of Convergence Theorems

4.1. Resemblance to Conservative Vector Fields and Convergence

4.2. Reinterpretation of Bottou’s Convergence Theorem

4.3. Reinterpretation of Sunehag’s Convergence Theorem

4.4. Convergence of Process in Example 5

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A. Convergence Theorems

Appendix B. Proof of Corollary 1

Appendix C. Bottou’s Resemblance

Appendix D. Sunehag’s Resemblance

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI