# On the Convergence of Stochastic Process Convergence Proofs

IIIA-CSIC, Campus UAB, 08193 Cerdanyola, Spain
* Authors to whom correspondence should be addressed.
Mathematics 2021, 9(13), 1470; https://doi.org/10.3390/math9131470
Received: 5 May 2021 / Revised: 16 June 2021 / Accepted: 19 June 2021 / Published: 23 June 2021

## Abstract

Convergence of a stochastic process is an intrinsic property that is quite relevant for its successful practical use, for example in the function optimization problem. Lyapunov functions are widely used as tools to prove convergence of optimization procedures. However, identifying a Lyapunov function for a specific stochastic process is a difficult and creative task. This work aims to provide a geometric explanation of convergence results and to state and identify conditions for the convergence of any stochastic process, not exclusively optimization methods. Basically, we relate the expected directions set of a stochastic process with the half-space of a conservative vector field, concepts defined along the text. Under some reasonable conditions, it is possible to assure convergence when the expected direction resembles some vector field closely enough. We translate two existing and useful convergence results into convergence of processes that resemble particular conservative vector fields. This geometric point of view could make it easier to identify Lyapunov functions for new stochastic processes whose convergence we would like to prove.

## 1. Introduction

Across most practical research branches, the solution to a given problem is often entrusted to a function optimization problem, where the effectiveness of a solution is measured by a function to be optimized. Machine learning challenges are great examples of this situation. Therefore, optimization algorithms become crucial to solving such problems. Iterative optimization methods start at an initial point and move through parameter space trying to minimize the objective function. Their performance may vary dramatically depending on the initial point. This dependence is somewhat diminished if the algorithm is guaranteed to converge in the long term to a minimum. Furthermore, in stochastic optimization algorithms, the quality achieved varies randomly and sometimes there are chances that the algorithm fails to converge. As an example, the stochastic natural gradient descent (SNGD) of Amari [1] and its variants often show instability depending on the starting point and learning rate tuning. Some experiments have even been shown to diverge with SNGD [2]. Clearly, such an issue weighs considerably against its practical use.
Convergent algorithms are more stable with respect to both learning rate parameters and initial point estimations. For instance, in [3], the optimization method named convergent stochastic natural gradient descent (CSNGD) is proposed. CSNGD is designed to mimic SNGD but is proven to be convergent. Sánchez-López and Cerquides show that, unlike SNGD, CSNGD is stable in the experiments run.
As a consequence, we are interested in better understanding the conditions that make an algorithm convergent. Convergence proofs abound in the literature. In this work we concentrate on two apparently disconnected and well-known convergence results. In his seminal work [4], Bottou proved the convergence of stochastic gradient descent (SGD). Later on, in [5], Sunehag provided an extended result for variable metric modified SGD. The connection between the proofs of the two results is not evident. It is not clear what they have in common, and, therefore, further generalizations seem not to be within reach.
To understand the convergence results (both theorems are included in Appendix A; however, we relax the conditions of Theorem A2 in [5], and the relaxed conditions turn Theorem A2 into Theorem A3, found in the same appendix), it is helpful to take a look at their proofs. Bottou’s proof relies on the construction of a Lyapunov function [6]. On the other hand, Sunehag’s proof uses the Robbins–Siegmund theorem [7] instead. It can be seen that the latter proves that the function to optimize already serves as a Lyapunov function, similar to later chapters in [4]. Therefore, both proofs share some similarity, but it is not evident how to establish a connection. Establishing the connection and pointing out its relevance is the main contribution of this paper and results in a generalization from which both results can be easily proved as corollaries.
Stochastic optimization algorithms rely on observations extracted from some possibly unknown probability space. Algorithms subjected to random phenomena are stochastic processes [8,9,10,11]. The generalized convergence result for stochastic processes in this article is built on two main concepts. Precisely, the first one is the resemblance between a stochastic process and a vector field. The second one is the property of a stochastic process being locally bounded by a function. These two ingredients are enough to state and prove our convergence theorem: a stochastic process Z converges if it is locally bounded by a convex, real-valued, twice differentiable function $ϕ$ with bounded Hessian and Z resembles $∇ ϕ$.
Two corollaries are extracted from this result, which we prove to be equivalent to Bottou’s and Sunehag’s convergence theorems. Moreover, we observe that the convergence proof in [12] of the algorithm called discrete DSNGD can be addressed by our main theorem, since the original convergence theorem of Sunehag is not general enough.
The resemblance concept involves the expected directions set of a stochastic process, which we define in Section 2, and the half-space of a vector field, a concept introduced in Section 3. Then, in Section 4 we state and prove our general result, which highlights the commonalities between Bottou’s and Sunehag’s theorems, proving convergence of a wider variety of algorithms.

## 2. Main Result. Director Process and the Expected Direction Set

Let $( Ω , F , P )$ be a probability space and $( S , Σ )$ be a measurable space. A discrete stochastic process on $( Ω , F , P )$ indexed by $N$ is a sequence of random variables $Z = { Z t } t ∈ N$ such that $Z t : Ω → S$. In this work, $S = R k$ and $Σ$ is the corresponding Borel $σ$-algebra. As random variables are used to describe general random phenomena, stochastic processes indexed by $N$ are usually used to model random sequences.

#### 2.1. Locally Bounded Stochastic Processes and Objective of the Work

The difference between two random variables of a stochastic process is a random variable known as an increment. We say that the random variable $Z t + s − Z t$ with $1 ≤ s ∈ N$ is an s-increment at time t. For example, the 1-increments of a stochastic process Z are
$Z t * = Z t + 1 − Z t .$
We focus our attention on a decomposition of $Z t *$ into $Z t * = − γ ( t ) · X t$, such that $γ : N → R +$ is a positive real-valued function and $X = { X t } t ∈ N$ is a stochastic process on $( Ω , F , P )$.
Definition 1.
Let Z and X be stochastic processes and $γ : N → R +$ a function. Then $( X , γ )$ is a decomposition of 1-increments of Z if
$Z t + 1 = Z t − γ ( t ) · X t .$
We call X the director process of Z and γ the learning rate, and denote this by $Z = ( X , γ )$.
This way of expressing a process allows us to define $Z t + 1$ with respect to $Z t$, which gives us control of the difference between both values by means of $γ ( t ) X t$, as Figure 1 shows. This is very useful if we intend to analyse the convergence of a stochastic process.
As represented in Figure 1, we can think of $Z t$ as the value of the process at time t, while $− γ ( t ) X t$ is the vector going from $Z t$ to $Z t + 1$. For the article, it is important to remember this, since we are constantly referring to $Z t$ as points in $R k$ while $X t$ are managed as direction vectors in $R k$. This distinction is only practical for our purposes.
The trajectories of stochastic approximation algorithms, such as stochastic gradient descent (SGD), are indeed samples of stochastic processes. Furthermore, they are usually expressed by means of their decomposition of 1-increments, as can be seen in the following examples.
Example 1.
SGD [4] is the cornerstone of machine learning for solving the function optimization problem. The objective of SGD is to minimize an objective function $L ( η ) = E z ∼ P * l ( η ; z )$ for some unknown probability distribution $P *$ and random variable $l ( η )$ defined on $( Ω * , F * , P * )$. The function l is known as the loss function, and it is usually differentiable with respect to η, allowing the definition of SGD as
$Z t + 1 = Z t − γ ( t ) ∇ η l ( Z t ) , γ ( t ) > 0 t ∈ N$
where $Z t$ are estimates of the minimum $η ¯$ of L. We can see $Z = { Z t } t ∈ N$, and, therefore, SGD, as a stochastic process. Indeed, let
$( Ω = ∏ t ∈ N Ω * , F = ∏ t ∈ N F * , P = ∏ t ∈ N P * )$
be the product probability space (This space is guaranteed to exist according to Kolmogorov extension theorem (see for example Theorem 2.4.4 and following examples in [13])) over infinite sequences. Hence we can define the stochastic process X on $( Ω , F , P )$ such that $X t = ∇ η l ( Z t )$ where for every $ω = { ω t } t ∈ N ∈ Ω$ it is $X t ( ω ) = ∇ η l ( Z t ; ω t )$. This implies that $( X , γ )$ is a decomposition of 1-increments of SGD.
In addition, we observe that $Z t + 1$ depends only on the last observation $Z t$ and on t, which makes Z a non-stationary Markov chain.
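As an illustration of Example 1, below is a minimal sketch of SGD written explicitly as a decomposition of 1-increments. The quadratic loss $l ( η ; z ) = ∥ η − z ∥ 2$, the sampling distribution, and all numeric values are illustrative assumptions, not taken from the original references.

```python
import numpy as np

rng = np.random.default_rng(0)
eta_bar = np.array([1.0, -2.0])        # hypothetical optimum of L

def grad_loss(eta, z):
    # l(eta; z) = ||eta - z||^2  =>  grad_eta l(eta; z) = 2 (eta - z)
    return 2.0 * (eta - z)

def gamma(t):
    # learning rate holding the standard constraint:
    # sum gamma(t) diverges, sum gamma(t)^2 converges
    return 1.0 / (t + 1)

Z = np.zeros(2)                        # initial point Z_0 = eta_0
for t in range(5000):
    z = eta_bar + rng.normal(scale=0.5, size=2)   # observation z ~ P*
    X_t = grad_loss(Z, z)              # director process X_t = grad l(Z_t; w_t)
    Z = Z - gamma(t) * X_t             # 1-increment: Z_{t+1} = Z_t - gamma(t) X_t

print(Z)   # settles near eta_bar
```

The trajectory $Z t$ is one sample of the stochastic process, and the noisy gradients $X t$ form its director process.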
Example 2.
This example is developed in [5]. Again, we focus on the function optimization problem, using the same notation as in the previous example. In this case, the estimation update of the minimum $η ¯$ is defined as
$Z t + 1 = Z t − γ ( t ) B t · Y t , γ ( t ) > 0 t ∈ N$
where $B t$ is a matrix in $R k × k$ known after information $Z 0 , ⋯ , Z t$ available at time t and $Y t = Y ( Z t )$, where Y is a function mapping each $η ∈ R k$ to a random variable on the same probability space $( Ω * , F * , P * )$.
Similarly to the previous example, Y can be thought of as a random variable in the product probability space (Equation (4)) that depends on the previous $Z t$, such that for every $ω ∈ Ω$ it is $Y t ( ω ) = Y ( Z t ; ω ) = Y ( Z t ; ω t )$. If we define $X t = B t · Y ( Z t )$, then $Z = ( X , γ )$ is a decomposition of 1-increments of Z with $X = { X t } t ∈ N$.
Here, Z is not a (non-stationary) Markov chain, since $B t$ may depend on $Z i$ for all $i < t$.
The naming of $γ$ as the learning rate is common in the machine learning literature [4,14,15,16]. The director process X determines the direction $X t$ at time t of the update Equation (1) with $Z t$ as reference point, while $γ ( t )$ specifies a certain distance to travel along that direction $X t$. Moreover, we demand some constraints on both factors. The condition imposed on $γ$ is usually found in the literature [3,4,5]. A learning rate $γ$ holds the standard constraint if:
$∑ t γ ( t ) 2 < ∞ , ∑ t γ ( t ) = ∞ .$
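For instance, the familiar schedule $γ ( t ) = 1 / ( t + 1 )$ satisfies the standard constraint: its squared series converges (to $π 2 / 6$), while the series itself is harmonic and diverges. A small numeric sketch (the horizon of 100,000 steps is an arbitrary choice):

```python
# gamma(t) = 1/(t+1): the squared series converges (to pi^2/6 ~ 1.6449),
# while the plain series is the harmonic series and grows without bound.
N = 100_000
sum_sq = sum(1.0 / (t + 1) ** 2 for t in range(N))
sum_g = sum(1.0 / (t + 1) for t in range(N))
print(sum_sq)   # bounded for every N, approaching pi^2/6
print(sum_g)    # about 12.09 here, growing like log(N) without bound
```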
Before we show the condition for the director process X, we fix some notation used throughout the article. Consider the natural filtration $F Z = { F t } t ∈ N$ generated by stochastic process Z, that is, $F t = σ ( Z i − 1 ( A ) ∣ i ≤ t , A ∈ Σ )$ for all $t ∈ N$. Then $F Z$ is a filtration and by definition Z is adapted to $F Z$.
Intuitively, every $F t$ of a filtration is a $σ$-algebra that classifies the elements of $Ω$. For example, if $Ω$ is the set of colours, $F t$ can gather warm and cold colours into separate and complementary sets. The fact that a random variable $Z t$ is $F t$-measurable implies that $Z t$ sends all warm colours to the same value and all cold colours also to the same value. Somehow $Z t$ is then not providing any additional information about elements of $Ω$ beyond the classification of $F t$. The sequence $F t$ is increasing, in the sense that $F t ⊂ F t + 1$ for all t. Therefore, a filtration characterizes the space $Ω$ with sequentially higher levels of information or classification. Denote by $E t [ · ]$ the conditional expectation given $F t$ [10]. Recall that if Y is a random variable in $( Ω , F , P )$ then $E t [ Y ]$ is in turn an $F t$-measurable random variable.
Hence, if $Z = ( X , γ )$, we say that X is locally and linearly bounded by a function $ϕ : R k → R$ if
$( ∃ A , B ) ( ∀ t ) E t ∥ X t ∥ 2 ≤ A + B · ϕ ( Z t ) .$
These two constraints are finally combined to present the kind of stochastic processes we are interested in.
Definition 2.
Let Z be a stochastic process and $ϕ : R k → R$ be a function. We say that Z is locally bounded by$ϕ$if there is a decomposition of 1-increments $( X , γ )$ with γ holding the standard constraint and X locally and linearly bounded by ϕ.
Furthermore, if $Z 0 = η 0$ a.s. we say $η 0$ is the initial point of Z.
For instance, Examples 1 and 2 observed in this section define Z as a locally bounded process, as we verify below.
Example 3.
Recall Example 1. In the same reference [4], the optimization algorithm is asked to hold additional conditions in order to prove its convergence. We added the convergence theorem in Appendix A. Some of the conditions are
$∑ t γ ( t ) 2 < ∞ , ∑ t γ ( t ) = ∞ , Z 0 = η 0 ∈ R k , ( ∃ A , B ) ( ∀ t ) E t ∥ X t ∥ 2 ≤ A + B ∥ Z t − η ¯ ∥ 2$
where $η ¯ ∈ R k$ is the optimal point of L. The standard constraint on γ is clearly required, and $η 0$ is a starting point. It remains to be seen whether X is locally and linearly bounded by some function $ϕ : R k → R$. Indeed, if we define $ϕ ( η ) = ∥ η − η ¯ ∥ 2$, then the property is easily checked. Hence Z is locally bounded by ϕ with initial point $η 0$.
Example 4.
Recall Example 2. The convergence theorem in [5], which is included in Appendix A, demands the conditions below:
$∑ t γ ( t ) 2 < ∞ , ∑ t γ ( t ) = ∞ , Z 0 = η 0 ∈ R k , ( ∃ A , B ) ( ∀ t ) E t ∥ X t ∥ 2 ≤ A + B L ( Z t )$
where L is the function to optimize. For this example, Z is then locally bounded by $ϕ = L$ with initial point $η 0$. Just as an observation, the property of $B t$ being determined by the information available at time t is the same as seeing $B t$ as an $F t$-measurable random variable over the product probability space.
We are interested in studying the almost sure convergence of Z to a point $η ¯ ∈ R k$. A stochastic process Z almost surely (a.s.) converges to a point $η ¯ ∈ R k$ if
$P ( lim t → ∞ Z t = η ¯ ) = 1 .$
Examples 3 and 4 show us that we can understand the results in [4,5] as the almost sure convergence of some locally bounded processes. In this paper, we are interested in characterizing the almost sure convergence of locally bounded processes. The objective of this work is to create a theory that allows us to prove the a.s. convergence of locally bounded processes, one that covers Examples 3 and 4 and whose applicability generalizes to a wider set of processes, such as the one described below:
Example 5.
Consider the function $f ( η ) = ∥ η ∥ 2$ defined on $R k$, and the optimization method Z defined by its director process $X t = G 1 · G 2 · Z t$ where $G 1$ and $G 2$ are positive-definite and symmetric matrices. For simplicity, this example shows a stochastic process with no random phenomena associated. We wonder whether the process Z converges and, if so, whether it converges to the point of $R k$ that optimizes the function f. From Theorems A1 and A2 found in the literature (included in Appendix A) it is not possible to prove a.s. convergence of Z, since conditions Bottou resemblance and C.3, respectively, are not satisfied. That is because $Z t ⊺ · G 1 · G 2 · Z t$ is possibly negative a.s.
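The obstruction mentioned in Example 5 is easy to witness numerically: the product of two symmetric positive-definite matrices need not be symmetric, and its quadratic form can be negative. A small sketch with hypothetical matrices $G 1$ and $G 2$ (any similar pair works):

```python
import numpy as np

# Two symmetric positive-definite matrices (all eigenvalues positive).
G1 = np.array([[100.0, 0.0], [0.0, 1.0]])
G2 = np.array([[1.0, 0.9], [0.9, 1.0]])
assert np.all(np.linalg.eigvalsh(G1) > 0) and np.all(np.linalg.eigvalsh(G2) > 0)

# Their product is NOT symmetric, and its quadratic form can be negative,
# so the direction G1 G2 z may fail to be a descent direction.
M = G1 @ G2
z = np.array([1.0, -10.0])
q = float(z @ M @ z)       # z^T G1 G2 z
print(q)                   # negative (about -709)
```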
Further on, Z is assumed to be locally bounded by $ϕ$ where $( X , γ )$ is its corresponding decomposition of 1-increments, unless otherwise indicated.

#### 2.2. Main Result

The objective of the article is the theorem below, which we prove in Section 4.1.
Theorem 1.
Let Z be a stochastic process on probability space $( Ω , F , P )$. Then Z almost surely converges to a point $η ¯$ if there is a twice differentiable convex function ϕ with unique minimum $η ¯$ defined in $R k$ with bounded Hessian norm, such that
• Z is locally bounded by ϕ;
• Z resembles $∇ ϕ$.
One concept in the theorem still needs a definition: when a stochastic process resembles a vector field. The next sections serve that end, with our main definition filling the gap left in Section 3.2. As we will see in Section 4.4, the simple Example 5 finds a solution with our main theorem.

#### 2.3. Expected Direction Set

We now define one key object of our work named the expected direction set. It focuses on gathering all directions that the update may take at time t conditioned to $F t$. Before the definition we provide some concepts and notation.
The random variable $E t [ X t ]$ determines all expected directions of Z at time t that the stochastic process may follow given $F t$. For example, if $ω ∈ Ω$ is an observation, then $E t [ X t ] ( ω ) ∈ R k$ is a vector pointing to the expected update direction departing from point $Z t ( ω )$ given $F t$. Denote the expected direction of Z at $ω ∈ Ω$ and time t as
$D Z ( ω , t ) : = E t [ X t ] ( ω ) .$
The expected direction from point $η = Z t ( ω )$ of Equation (11) depends on $ω$. That is, the path followed until reaching $η = Z t ( ω ) ∈ R k$ matters. For instance, if $ω 1 , ω 2 ∈ Ω$ are different observations such that $η = Z t ( ω 1 ) = Z t ( ω 2 )$, then possibly $D Z ( ω 1 , t ) ≠ D Z ( ω 2 , t )$. We collect all expected directions at $η = Z t ( ω )$ and time t in the vector set below:
$S Z ( η , t ) = { D Z ( ω , t ) ∣ ω ∈ Ω , Z t ( ω ) = η } .$
The tools to define the expected direction set at $η ∈ R k$ after time $T ∈ N$ are given, so we proceed to its formal definition.
Definition 3.
Let $Z = ( X , γ )$. Define the expected directions set of Z at $η ∈ R k$ after time $T ∈ N$ as
$E D S Z ( η , T ) : = ⋃ t ≥ T S Z ( η , t ) .$
In a few words, $E D S Z ( η , T )$ is a vector set containing all expected directions (provided by the director process X) conditioned on $F t$ for every outcome $ω$ such that $Z t ( ω ) = η$ with $t ≥ T$. In Definition 3, $E D S$ depends on T. That is because, to assess the convergence of an algorithm, it is not important to consider all expected directions throughout the whole process. For example, if an algorithm converges, we can randomly modify all directions of the director process at just one particular time $T ∈ N$, and the resulting algorithm still converges. Roughly speaking, only the tail of a process matters in determining the convergence property. This idea is addressed more precisely with Definition 4 in the next section.
Example 6.
Recall Example 1 and assume that Z is SGD. Then $E D S Z ( η , T )$ is a singleton. Indeed, $D Z ( ω , t )$ is the same vector for all $t ∈ N$ and all ω with $Z t ( ω ) = η$, and hence $S Z ( η , t ) = { D Z ( ω , t ) }$ for any $ω ∈ Ω$ with $Z t ( ω ) = η$. Finally,
$E D S Z ( η , T ) = { D Z ( ω , t ) }$
for any $t ≥ T$ and any $ω ∈ Ω$ with $Z t ( ω ) = η$.
This is the case of any non-stationary Markov chain.

#### 2.4. Essential Expected Direction Set

The convergence property of an algorithm relates closely to the directions followed after time $T ∈ N$ as T tends to infinity. Equivalently, the direction set appearing repeatedly throughout the whole optimization process matters, while a direction set contemplated for only a finite number of iterations changes nothing in terms of convergence guarantees. This direction set is named the essential expected directions set in this article.
To define the essential expected directions set properly, we use the convex vector subspace of a given vector set. Given a vector set U in $R k$, let $C ( U )$ be the smallest convex vector subspace containing U. See Figure 2 for an illustrative example. Observe that $C ( U )$ is always closed, but it may be unbounded.
Definition 4.
Let $Z = ( X , γ )$. Define the essential expected directions set of Z at η as
$E E D S Z ( η ) : = ∩ T C ( E D S Z ( η , T ) ) .$
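To build intuition for this intersection of shrinking convex hulls, here is a one-dimensional caricature, assuming a hypothetical process whose expected directions at some η for times $t ≥ T$ are the scalars $v + 1 / t$ (a purely illustrative choice):

```python
# In 1-D, the convex hull C(EDS(eta, T)) of the directions {v + 1/t : t >= T}
# is just an interval; the intersection over all T shrinks toward {v},
# the single essential expected direction.
v = 3.0

def hull(T, horizon=10_000):
    # finite-horizon stand-in for C(EDS(eta, T))
    dirs = [v + 1.0 / t for t in range(T, horizon)]
    return (min(dirs), max(dirs))

print(hull(1))     # wide interval, upper end v + 1
print(hull(100))   # much tighter interval, upper end v + 1/100
```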
Example 7.
Assume Z is any non-stationary Markov chain, such as SGD in Example 1. Then $E E D S Z ( η ) = E D S Z ( η , T )$ for any T. Indeed, we have seen in Example 6 that $S Z ( η , t ) = { D Z ( ω , t ) }$ for any $ω ∈ Ω$ and $t ≥ T$ where $Z t ( ω ) = η$. Hence
$E E D S Z ( η ) = { D Z ( ω , T ) }$
for any $ω ∈ Ω$ with $Z T ( ω ) = η$.
The definition of $E E D S Z ( η )$ delimits the smallest subspace towards which all directions at $η$ tend. Clearly, $E E D S Z ( η )$ is also convex and closed (possibly empty). Deeper properties of this set lead to identifying divergence symptoms. For example, if it is empty or unbounded, we face instability of the process at $η$. To see this, observe the result below; the proof can be found in Appendix B.
Corollary 1.
Let $η ∈ R k$. Then $E E D S Z ( η )$ is a non-empty bounded set if, and only if, there exists $T ∈ N$, such that $C ( E D S Z ( η , T ) )$ is bounded.
This result relates $E E D S Z ( η )$ to instability properties of Z. If $E E D S Z ( η )$ is empty or unbounded, then the algorithm is unstable at $η$, since expected directions with arbitrarily large norms exist after enough iterations. Clearly, if this situation is found for all points near the optimum, the algorithm cannot converge to the solution. It is desirable instead that $C ( E D S Z ( η , T ) )$ is compact (bounded) for some T for every $η ∈ R k$, or equivalently, that $E E D S Z ( η )$ is compact (bounded) and non-empty.
In fact, since we are interested in the case where Z is locally bounded by $ϕ$ (recall Definition 2), we can assume that $E E D S Z ( η )$ is a non-empty compact set, by virtue of the results below.
Proposition 1.
Let stochastic process Z be locally bounded by ϕ. Then $C ( E D S Z ( η , 0 ) )$ is a non-empty compact set.
Proof.
We know that X is locally and linearly bounded. Hence, applying Jensen’s inequality
$∥ E t [ X t ] ∥ 2 ≤ E t ∥ X t ∥ 2 ≤ A + B · ϕ ( Z t ) .$
Let $η ∈ R k$ and $ω ∈ Ω$, such that $Z t ( ω ) = η$ for some $t ≥ 0$. Therefore, every $v = E t [ X t ] ( ω ) ∈ E D S Z ( η , 0 )$ has squared norm bounded by $A + B · ϕ ( η )$, implying that $C ( E D S Z ( η , 0 ) )$ is a non-empty compact set. □
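The inequality used in the proof, $∥ E t [ X t ] ∥ 2 ≤ E t ∥ X t ∥ 2$, can be checked empirically on sampled data; the sketch below uses an arbitrary Gaussian random vector as a stand-in for $X t$.

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(size=(100_000, 3))                    # draws of a random vector X
E_norm_sq = float(np.mean(np.sum(samples**2, axis=1)))     # estimates E ||X||^2
norm_E_sq = float(np.sum(np.mean(samples, axis=0)**2))     # estimates ||E[X]||^2
print(norm_E_sq <= E_norm_sq)   # True, as Jensen's inequality predicts
```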
The corollary below is a consequence of Proposition 1 and Corollary 1.
Corollary 2.
Let stochastic process Z be locally bounded by ϕ. Then $E E D S Z ( η )$ is a non-empty compact set for all $η ∈ R k$.

## 3. Vector Field Half-Spaces and Stochastic Processes. Resemblance

This section defines the main concept of this work: the property of resemblance between a stochastic process and a vector field. The definition highlights some commonalities between Theorems A1 and A3; both of them prove the convergence of stochastic processes that resemble particular vector fields. A geometric interpretation and explanation of the conditions of the convergence theorems is established in Section 4.
Some previous definitions are needed before introducing the main concepts of the article, such as $ϵ$-acute vector pair sets and the half-space of a vector field. The section starts with some basic concepts about vectors.
Definition 5.
Let $u , v ∈ R k$ be two vectors. The pair $( u , v )$ is acute if u and v form an acute angle, that is, if $u ⊺ · v > 0$. Furthermore, if $u ⊺ · v ≥ ϵ > 0$ then $( u , v )$ is ϵ-acute.
Proposition 2.
Let $u , v ∈ R k$ be two vectors. Then the pair $( u , v )$ is ϵ-acute if, and only if, there exists a symmetric positive-definite matrix B, such that $B · u = v$ and $u ⊺ · B · u ≥ ϵ$.
A vector pair set V is a set of vector pairs $V = { ( u i , v i ) ∣ i ∈ I }$, where I is an index set.
Definition 6.
Let V be a vector pair set. V is ϵ-acute if every vector pair $( u , v ) ∈ V$ is ϵ-acute.
The next result is a direct consequence.
Proposition 3.
Let V be a vector pair set, indexed by I. Then, V is ϵ-acute for some $ϵ > 0$ if, and only if;
$inf i ∈ I ( u i , v i ) ∈ V u i ⊺ v i > 0 .$
Proposition 4.
Let V be a vector pair set, indexed by I. Then, V is ϵ-acute for some $ϵ > 0$ if, and only if, there exist a set of symmetric positive-definite matrices $B = { B i ∣ i ∈ I }$ such that
$inf i ∈ I ( u i , v i ) ∈ V u i ⊺ B i u i > 0 , B i u i = v i .$
Proof.
We first prove that if there exists a set of matrices $B = { B i ∣ i ∈ I }$ holding Equation (19), then V is $ϵ$-acute for some $ϵ > 0$. Observe that by Equation (19):
$inf i ∈ I u i ⊺ v i = inf i ∈ I u i ⊺ B i u i > 0 .$
Then, Proposition 3 implies that V is $ϵ$-acute, which finishes this part of the proof.
Now assume that V is $ϵ$-acute; we prove that there exists a set of matrices $B = { B i ∣ i ∈ I }$ holding Equation (19). Since V is $ϵ$-acute, in particular, the pair $( u i , v i ) ∈ V$ is $ϵ$-acute for every $i ∈ I$. Applying Proposition 2, for every $i ∈ I$ there exists a symmetric positive-definite matrix $B i$, such that $B i u i = v i$ and $u i ⊺ · B i · u i ≥ ϵ$. This finishes the proof. □

#### 3.1. The Half-Space of a Vector Field

The half-space determined by a vector u is the set of vectors that form an acute angle with u. This region clearly occupies half of the total space. Additionally, the $ϵ$-half-space of u with $ϵ > 0$ is the set of vectors v such that the vector pair $( u , v )$ is $ϵ$-acute. This object is needed in order to afterwards define the half-space of a vector field. We define these concepts below and illustrate the $ϵ$-half-space of a vector u in Figure 3.
Definition 7.
Let u be a vector of $R k$. The half-space of u is the set
$H ( u ) = { v ∈ R k ∣ u ⊺ · v > 0 } .$
Similarly, the ϵ-half-space of u with $ϵ > 0$ is the set
$H ϵ ( u ) = { v ∈ R k ∣ u ⊺ · v ≥ ϵ } .$
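Definition 7 translates directly into a membership test: v belongs to $H ( u )$ exactly when $u ⊺ · v > 0$, and to $H ϵ ( u )$ when $u ⊺ · v ≥ ϵ$. A minimal sketch (the vectors and $ϵ$ values are arbitrary examples):

```python
import numpy as np

def in_half_space(u, v):
    # Definition 7: v lies in H(u) iff u^T v > 0 (the pair (u, v) is acute)
    return float(np.dot(u, v)) > 0.0

def in_eps_half_space(u, v, eps):
    # v lies in H_eps(u) iff u^T v >= eps > 0 (the pair (u, v) is eps-acute)
    return float(np.dot(u, v)) >= eps

u = np.array([1.0, 0.0])
print(in_half_space(u, np.array([0.5, 10.0])))            # True  (acute pair)
print(in_half_space(u, np.array([-0.1, 10.0])))           # False (obtuse pair)
print(in_eps_half_space(u, np.array([0.5, 10.0]), 0.4))   # True  (0.5 >= 0.4)
print(in_eps_half_space(u, np.array([0.5, 10.0]), 0.6))   # False (0.5 < 0.6)
```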
A vector field $X$ over $R k$ is a function assigning to every $η ∈ R k$ a vector of $R k$, that is $X : R k → R k$. For example, if $l : R k → R$ is a twice differentiable function, we can consider the vector field consisting of the gradient vectors at each point $η$. Precisely, denote the gradient vector field (GVF) as $X ∇ l$, where $X ∇ l ( η ) = ∇ l ( η )$.
We are ready to define the half-space of a vector field.
Definition 8.
Let $X$ be a vector field over $R k$. The half-space of $X$ is a function $H ( X )$ mapping every η to $H ( X ) ( η ) = H ( X ( η ) )$. Similarly, the ϵ-half-space of $X$ with $ϵ > 0$ is a function $H ϵ ( X )$ mapping every η to $H ϵ ( X ) ( η ) = H ϵ ( X ( η ) )$.

#### 3.2. Resemblance between a Stochastic Process and a Vector Field

The convergence of any locally bounded process can be proved by comparing the expected directions set of the algorithm with some vector field. When the expected directions resemble the vector field we compare them to, we can ensure the almost sure convergence of the stochastic process to a point, under some reasonable conditions. By resemblance, we mean that the expected directions set after some time T is a subset of the $ϵ$-half-space of $X$, among other things explained later. Therefore, resemblance asks, for every $η ∈ R k$, that every vector $D Z ( ω , t )$ with $t ≥ T$ and every $ω ∈ Ω$ with $η = Z t ( ω )$ form an acute angle with the vector field value $X ( η )$.
However, if the vector field sends a specific point $η$ to $0 ∈ R k$, then no direction $D Z ( ω , t )$ can form an acute vector pair with it. Therefore, the resemblance property is evaluated outside the neighborhoods of these annulled points. That is why we must now consider the set of annulled points of a vector field and the neighborhoods around the points of this set.
Formally, let $X$ be a vector field defined in $R k$. The set $K X$ is the set of points of $R k$ annulled by $X$, that is, $K X : = { η ∈ R k ∣ X ( η ) = 0 }$. Moreover, consider the closed ball centered on $K X$ of radius $δ$ as $B δ ( K X ) : = ∪ η ∈ K X B δ ( η )$ where $B δ ( η )$ is the closed ball of radius $δ$ centered on $η$.
We also use the notation $A ′ = R k ∖ A$ for the complement set of a subset $A ⊂ R k$. We say that Z $ϵ$-resembles $X$ at $η$ from T on if $E D S Z ( η , T ) ⊂ H ϵ ( X ) ( η )$. See an illustrative example in Figure 4.
This notion extends naturally to $ϵ$-resemblance at sets, when the property is satisfied for every $η$ in the set. With this in mind, we can define the key concept of this article.
Definition 9.
Let $Z = ( X , γ )$ be a stochastic process and $X$ be a vector field over $R k$. We say that Z resembles $X$ from $T ∈ N$ on if:
$( ∀ δ > 0 ) ( ∃ ϵ > 0 ) Z ϵ - resembles to X at B δ ( K X ) ′ from T on$
We say that Z resembles $X$ if there is $T ∈ N$ such that Z resembles $X$ from T on.
Everything is set up to accomplish the goal of this paper. We restate the main theorem of this article in the next section and show its proof.

## 4. Proof of Main Result. Reinterpretation of Convergence Theorems

The objective of the article, proving main Theorem 1, is now within reach. Moreover, this section afterwards addresses the task of proving that Theorems A1 and A3 are particular cases of our main Theorem 1.

#### 4.1. Resemblance to Conservative Vector Fields and Convergence

Recall main Theorem 1 and observe that it asks the stochastic process Z to be locally bounded by some function $ϕ$ and Z to resemble $∇ ϕ$. Note that $∇ ϕ$ is a particular type of vector field called a conservative vector field, that is, a vector field arising as the gradient of a function. That is why we understand our main theorem as a convergence result for locally bounded processes that resemble a conservative vector field.
The theorem statement says that $ϕ$ has bounded Hessian norm. Similarly to Theorem A3, this means that:
$( ∃ K ′ ) ( ∀ η ) ∥ ∇ η 2 ϕ ( η ) ∥ ≤ K ′ .$
We are ready to prove the main result of the paper.
Proof of main Theorem 1.
Observe that $ϕ$ is bounded from below. Indeed, $η ¯$ is a minimum and $ϕ$ is convex with $X ( η ¯ ) = 0$ where $X = ∇ ϕ$. Therefore, there exists a constant $m ≥ 0$ such that $ϕ ( η ) + m ≥ 0$ for all $η$. Define $ψ ( η ) = ϕ ( η ) + m$. Clearly, $∇ ψ = ∇ ϕ = X$, and, therefore, Z resembles $∇ ψ$. Moreover, Z is locally bounded by $ψ$ and $ψ$ clearly satisfies the Hessian norm bound.
From here, the proof follows the steps of the proof of Theorem A2. By the Taylor inequality and the Hessian norm bound:
$ψ ( Z t + 1 ) = ψ ( Z t − γ t X t ) ≤ ψ ( Z t ) − γ t X ( Z t ) ⊺ X t + γ t 2 K ∥ X t ∥ 2$
where $K = K ′ / 2$. Apply expectation conditioned on the information up to time t, and then use that Z is locally bounded by $ψ$:
$E t [ ψ ( Z t + 1 ) ] ≤ ψ ( Z t ) − γ t X ( Z t ) ⊺ E t [ X t ] + γ t 2 K ( A + B · ψ ( Z t ) ) .$
Use now that Z resembles $X$. Then, there exists T such that for every $t ≥ T$, the term $− γ t X ( Z t ) ⊺ E t [ X t ]$ is negative. All other conditions of the Robbins–Siegmund theorem (in [7], added in Appendix A) also hold for the algorithm after time T, thanks to the learning rate constraints. Apply it and deduce that the random variables $ψ ( Z t )$ converge almost surely to a random variable (and so do $ϕ ( Z t )$) and that
$∑ t γ ( t ) X ( Z t ) ⊺ E t [ X t ] < ∞ a . s .$
We now prove that the stochastic process $ϕ ( Z t )$ converges almost surely to the value $ϕ ( η ¯ )$. Proceed by contradiction. Assume that for some $δ 1 > 0$
this implies, by continuity and convexity of function $ϕ$, that there exists $δ$
By resemblance and the definition of the limit, there exist T and $ϵ$ such that $E D S Z ( η , T ) ⊂ H ϵ ( X ) ( η )$ for every $η ∈ B δ ( η ¯ ) ′$. This leads to a contradiction, since using the learning rate standard constraint we have
for every $ω ∈ A$, where A has measure different from 0 by Equation (28). This clearly contradicts Equation (26).
Hence, $ϕ ( Z t )$ converges almost surely to $ϕ ( η ¯ )$ and $Z t$ converges almost surely to $η ¯$, as wanted. □

#### 4.2. Reinterpretation of Bottou’s Convergence Theorem

The goal now is to deduce Theorem A1 as a direct consequence of main Theorem 1. Consider the particular case of main Theorem 1 where $ϕ ( η ) = ∥ η − η ¯ ∥ 2$, which reads as follows.
Corollary 3.
Let $ϕ ( η ) = ∥ η − η ¯ ∥ 2$ and Z be a stochastic process on probability space $( Ω , F , P )$. Then Z almost surely converges to $η ¯$ if
• Z is locally bounded by ϕ;
• Z resembles $∇ ϕ$.
Additional conditions on $ϕ$, such as the Hessian bound or twice differentiability, are not specified in the corollary since, with this particular definition of $ϕ$, all those conditions are already satisfied.
To see that Corollary 3 proves the statement of Theorem A1, we need to prove that Theorem A1 assumes that Z is locally bounded by $ϕ$ and that Z resembles $∇ ϕ$. Example 3 already proves that Bottou assumes that Z is locally bounded by $ϕ$. Therefore, it remains to check that Z resembles $∇ ϕ$. To that end, see the proposition below, proved in Appendix C.
Proposition 5.
Let $Z = (X, \gamma)$ be a stochastic process and $X$ be a vector field over $\mathbb{R}^k$. Then Z resembles $X$ if, and only if,
$(\exists T \in \mathbb{N})(\forall \delta > 0) \quad \inf_{\eta \in \mathbb{R}^k \setminus B_\delta(K_X),\; v \in EDS_Z(\eta, T)} X(\eta)^\top \cdot v > 0.$
Compare the condition Bottou resemblance of Theorem A1 with Proposition 5. From them, deduce that the algorithm Z of the theorem resembles the vector field $\nabla\phi$.
Corollary 4.
Let $Z = (X, \gamma)$ be a stochastic process and $\bar\eta \in \mathbb{R}^k$. Then Z resembles $\nabla\phi$ with $\phi(\eta) = \|\eta - \bar\eta\|^2$ if, and only if, Bottou resemblance holds.

#### 4.3. Reinterpretation of Sunehag’s Convergence Theorem

Theorem A3 is also deduced from the main Theorem 1. Similarly to the previous section, we provide a version of our main theorem for the case where $\phi = l$ is the function that we aim to minimize.
Corollary 5.
Let $l : \mathbb{R}^k \to \mathbb{R}$ be a twice differentiable cost function with a unique minimum $\bar\eta$ and bounded Hessian norm, and let Z be a stochastic process on the probability space $(\Omega, F, P)$. Then Z converges to the minimum $\bar\eta$ of l almost surely if
• Z is locally bounded by l;
• Z resembles $∇ l$.
The stochastic process described in Theorem A3 has some additional structure, such as $X_t = B_t \cdot Y_t$. However, if we prove that the Z of that theorem is locally bounded by l and that Z resembles $\nabla l$, then it is clear that Corollary 5 implies Theorem A3. Recall Example 4 and notice that we already proved there that Z is locally bounded by l. The remaining property follows from the proposition below, which we prove in Appendix D.
Proposition 6.
Let $Z = (X, \gamma)$ be a stochastic process and $X$ be a vector field over $\mathbb{R}^k$. Then Z resembles $X$ if, and only if, there exists T such that for every $t \ge T$ there are random vectors $Y_t$ in $\mathbb{R}^k$ and symmetric, positive-definite, $F_t$-measurable random matrices $B_t$ such that
$B_t \cdot Y_t = X_t,$
$E_t[Y_t] = X(Z_t) \quad \text{whenever } Z_t(\omega) \notin K_X,$
$(\forall \delta > 0) \quad \inf_{\substack{\eta \in \mathbb{R}^k \setminus B_\delta(K_X),\; t \ge T \\ \omega \in \Omega,\; Z_t(\omega) = \eta}} X(\eta)^\top \cdot B_t(\omega) \cdot X(\eta) > 0.$
It only remains to put together Proposition 6 with conditions C.1 and Sunehag resemblance to reach our objective, in the following corollary.
Corollary 6.
Let l be a differentiable function and $Z = (X, \gamma)$ be a stochastic process. Then Z resembles $\nabla l$ if, and only if, there exists T such that for every $t \ge T$ there are random vectors $Y_t$ in $\mathbb{R}^k$ and symmetric, positive-definite, $F_t$-measurable random matrices $B_t$ such that $B_t \cdot Y_t = X_t$ and conditions C.1 and Sunehag resemblance hold.
Corollaries 4 and 6 nicely show the value of Theorem 1 for proving convergence. To reinforce this, we note that the convergence of the DSNGD algorithm in [12] is easily proved by means of Corollary 5, combining Theorem A3 and Corollary 6. This shows that Theorem 1 allows us to prove the convergence of a wider set of stochastic processes and function optimization methods.
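As an illustration of the kind of process covered by Theorem A3 and Corollary 6, the following sketch iterates a variable-metric process $Z_{t+1} = Z_t - \gamma(t) B_t Y_t$ on the quadratic cost $l(\eta) = \|\eta - \bar\eta\|^2$. The diagonal matrices, the noise model, and the learning rate are illustrative assumptions for the demo, not taken from the paper or from DSNGD:

```python
import random

# Sketch (illustrative, not the paper's DSNGD): a variable-metric process
# Z_{t+1} = Z_t - gamma(t) * B_t * Y_t minimizing l(eta) = ||eta - eta_bar||^2.
# Y_t is a noisy but unbiased gradient estimate (E_t[Y_t] = grad l(Z_t)),
# and B_t is a random diagonal SPD matrix with eigenvalues in [a, b] = [0.5, 1.5].
random.seed(1)
eta_bar = [3.0, -2.0]
z = [0.0, 0.0]
for t in range(1, 50001):
    gamma = 1.0 / t                                           # learning rate constraint
    grad = [2.0 * (z[i] - eta_bar[i]) for i in range(2)]      # grad l(z)
    y = [g + random.gauss(0.0, 1.0) for g in grad]            # unbiased noisy gradient
    b = [random.uniform(0.5, 1.5) for _ in range(2)]          # spec(B_t) in [a, b]
    z = [z[i] - gamma * b[i] * y[i] for i in range(2)]

dist = sum((z[i] - eta_bar[i]) ** 2 for i in range(2)) ** 0.5
assert dist < 0.5    # Z_t has approached the unique minimum eta_bar
```

The diagonal choice of $B_t$ keeps the sketch free of linear algebra dependencies; any symmetric positive-definite $B_t$ with spectrum in a fixed $[a, b]$ would fit the hypotheses equally well.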

#### 4.4. Convergence of Process in Example 5

Our theorem settles the question posed by Example 5. To see it, define
$\phi(\eta) = \frac{1}{2}\, \eta^\top \cdot G_2 \cdot \eta.$
The twice differentiable and convex function $\phi$ has bounded Hessian norm, since its Hessian is the constant matrix $G_2$. Moreover, Z is clearly locally bounded by $\phi$. Indeed, recall Equation (7) and observe that
$\|G_1 \cdot G_2 \cdot Z_t\|^2 \le B \cdot \phi(Z_t),$
where $B = \frac{2\lambda_1^2 \cdot \lambda_2^2}{\mu_2}$, with $\lambda_i$ the greatest eigenvalue of $G_i$ and $\mu_i$ the least eigenvalue of $G_i$.
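The bound above can be checked numerically. The following sketch uses small, illustrative diagonal positive-definite matrices $G_1$, $G_2$ (an assumption made for the demo, so that matrix products act componentwise) and verifies $\|G_1 G_2 z\|^2 \le B\, \phi(z)$ with $B = 2\lambda_1^2 \lambda_2^2 / \mu_2$ on random points:

```python
import random

# Illustrative diagonal positive-definite matrices (assumed for this sketch):
# G1 = diag(2, 1), G2 = diag(3, 1); matrix products then act componentwise.
g1 = [2.0, 1.0]
g2 = [3.0, 1.0]
lam1, lam2 = max(g1), max(g2)       # greatest eigenvalues of G1, G2
mu2 = min(g2)                       # least eigenvalue of G2
B = 2 * lam1**2 * lam2**2 / mu2     # the constant from the text

def phi(z):
    # phi(eta) = 1/2 * eta^T G2 eta (diagonal case)
    return 0.5 * sum(g2[i] * z[i] ** 2 for i in range(2))

random.seed(0)
for _ in range(1000):
    z = [random.uniform(-10.0, 10.0) for _ in range(2)]
    lhs = sum((g1[i] * g2[i] * z[i]) ** 2 for i in range(2))   # ||G1 G2 z||^2
    assert lhs <= B * phi(z) + 1e-9                            # local bound holds
```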
Finally, check that Z resembles $\nabla\phi(\eta) = G_2\,\eta$. Observe that $EDS_Z(\eta, T) = \{G_1 G_2\, \eta\}$ is a singleton for every T. Then, for all $\delta > 0$ and all $\eta \in \mathbb{R}^k \setminus B_\delta(K_{\nabla\phi})$, we have
$\nabla\phi(\eta)^\top \cdot G_1 G_2\, \eta = \eta^\top G_2 \cdot G_1 G_2\, \eta \ge \epsilon,$
where $\epsilon = \mu_1 \cdot \mu_2^2 \cdot \delta^2$. Hence Z resembles $\nabla\phi$ and, by virtue of our main Theorem 1, the process Z converges a.s. to 0 and, therefore, minimizes the function f as wanted.
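To close the example with a concrete run, the sketch below iterates $Z_{t+1} = Z_t - \gamma(t)\, G_1 G_2\, Z_t$ with the standard learning rate $\gamma(t) = 1/t$ and the same hypothetical diagonal matrices as above (illustrative choices, not prescribed by the example); $\phi(Z_t)$ indeed vanishes:

```python
# Sketch of the Example 5 process Z_{t+1} = Z_t - gamma(t) * G1 G2 Z_t with
# illustrative diagonal positive-definite G1 = diag(2, 1), G2 = diag(3, 1).
g1 = [2.0, 1.0]
g2 = [3.0, 1.0]

def phi(z):
    # phi(eta) = 1/2 * eta^T G2 eta (diagonal case)
    return 0.5 * sum(g2[i] * z[i] ** 2 for i in range(2))

z = [5.0, -4.0]                        # arbitrary starting point
for t in range(1, 10001):
    gamma = 1.0 / t                    # sum gamma = inf, sum gamma^2 < inf
    z = [z[i] - gamma * g1[i] * g2[i] * z[i] for i in range(2)]

assert phi(z) < 1e-8                   # Z_t converges to 0, minimizing phi
```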

## 5. Conclusions

We have presented a result that allows us to prove the convergence of stochastic processes, and we have shown that two useful convergence results from the literature are consequences of our theorem. This is achieved through a new theory that compares the expected directions of an algorithm with conservative vector fields. If the expected directions at a point $\eta$ resemble the vector $X(\eta)$ closely enough, where $\nabla\phi = X$ is a conservative vector field, then the process is stable at that point. If this happens for every $\eta \in \mathbb{R}^k$ and, in addition, the process is locally bounded by $\phi$, then the process is globally stable and converges.
Some inspiring paths remain unexplored after this work. For example, finding the function $\phi$ is the key to proving convergence, and $\phi$ is required to be convex and twice differentiable. It would be interesting to study how such a function can be obtained, for instance as a sum of other convex, twice differentiable functions $\phi_i$.
Another promising research line is a deeper analysis of the $EDS$ and $EEDS$ objects, which may guarantee the existence of a function $\phi$ without the need to find it. If sufficient conditions on a stochastic process are established that ensure resemblance to some unknown conservative vector field, then the search for $\phi$ can be avoided. Even proving the non-existence of such a function after a wider study of $EDS$ and $EEDS$ would be useful, ruling out the use of our theorem.
It is also interesting to study the converse implication, that is, to investigate the conditions that lead to divergent instances within the theory explained in this article. In this sense, the Lyapunov characterization of convergent processes becomes a helpful and key theory, since great similarities arise between the two techniques.
Furthermore, on many occasions the function $\phi$ to optimize can be fixed beforehand (convex and twice differentiable). The opposite process can therefore be considered, that is, generating a set of stochastic processes that resemble $\nabla\phi$, thereby ensuring the convergence of all such candidates.
In [17], one finds another relevant convergence result. It assures the convergence in probability of a stochastic process, instead of the almost sure convergence studied in this article. We wonder about the commonalities with our theorem, and about the possibility of relaxing the conditions our theorem imposes while still ensuring convergence in probability.
We are currently working on two weaker resemblance properties, which we name weak resemblance and essential resemblance. The intention is to deduce almost sure convergence of a process by studying only its essential expected direction set (EEDS).

## Author Contributions

Conceptualization, B.S.-L. and J.C.; writing—original draft preparation, B.S.-L. and J.C.; review and editing, B.S.-L. and J.C.; supervision, J.C.; funding acquisition, J.C. All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

## Funding

This work is partially supported by the projects Crowd4SDG and Humane-AI-net, which have received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreements No. 872944 and No. 952026, respectively. This work is also partially supported by the project CI-SUSTAIN funded by the Spanish Ministry of Science and Innovation (PID2019-104156GB-I00).


## Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study, in the writing of the manuscript, or in the decision to publish the results.

## Appendix A. Convergence Theorems

We state below Bottou's convergence theorem from [4] and the convergence theorem of Sunehag et al. from [5]. We also provide generalizations of these theorems, whose proofs carry over from the original proofs without complications. Moreover, we adapt the notation to our text and replace algorithm-specific concepts with the corresponding terms from the more generic theory of stochastic processes. We name every condition stated in each result in order to refer to it in the article.
Theorem A1
(Bottou's theorem in [4]). Let $l : \mathbb{R}^k \to \mathbb{R}$ be a function with a unique minimum $\bar\eta$ and let $Z_{t+1} = Z_t - \gamma(t) X_t$ be a stochastic process. Then Z converges to $\bar\eta$ almost surely if the following conditions hold:
Theorem A2
(Theorem 3.2 in [5]). Let $l : \mathbb{R}^k \to \mathbb{R}$ be a twice differentiable cost function with a unique minimum $\bar\eta$ and let $Z_{t+1} = Z_t - \gamma_t B_t Y(Z_t)$ be a stochastic process, where $B_t$ is symmetric and depends only on information available at time t. Then Z converges to $\bar\eta$ almost surely if the following conditions hold:
• C.1: $(\forall t)\; E_t[Y(Z_t)] = \nabla l(Z_t)$
• C.2: $(\exists K)(\forall \eta)\; \|\nabla^2_\eta l(\eta)\| \le 2K$
• C.3: $(\forall \delta > 0)\; \inf_{l(Z_t) - l(\bar\eta) > \delta} \|\nabla l(Z_t)\| > 0$
• C.4: $(\exists A, B)(\forall t)\; E[\|Y(Z_t)\|^2] \le A + B\, l(Z_t)$
• C.5: $(\exists a, b : 0 < a < b < \infty)(\forall t)\; \mathrm{spec}(B_t) \subset [a, b]$
• C.6: $\sum_t \gamma(t)^2 < \infty$ and $\sum_t \gamma(t) = \infty$
where $s p e c ( B )$ are the eigenvalues of matrix B.
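Condition C.6 is the standard Robbins–Monro learning rate constraint; the canonical choice $\gamma(t) = 1/t$ satisfies it. A quick numerical sketch illustrates (but of course does not prove) the two limits:

```python
import math

# gamma(t) = 1/t: the sum of gamma(t)^2 converges (to pi^2/6), while the
# harmonic sum of gamma(t) grows without bound (roughly ln T).
T = 100_000
sum_gamma = sum(1.0 / t for t in range(1, T + 1))
sum_gamma_sq = sum(1.0 / t ** 2 for t in range(1, T + 1))

assert sum_gamma_sq < math.pi ** 2 / 6    # partial sums stay bounded
assert sum_gamma > math.log(T)            # partial sums grow like ln T
```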
Now, we provide a generalization of Sunehag's theorem in [5]. Specifically, we delete condition C.5 and modify (and relax) conditions C.3 and C.4 of the original statement. The proof is straightforward after the original theorem's proof, so the modifications present no complications.
Theorem A3
(Generalization of Theorem A2). Let $l : \mathbb{R}^k \to \mathbb{R}$ be a twice differentiable cost function with a unique minimum $\bar\eta$ and let $Z_{t+1} = Z_t - \gamma_t B_t Y_t$ be a stochastic process where $B_t$ is $F_t$-measurable. Then Z converges to $\bar\eta$ almost surely if the following conditions hold:
• C.1: $(\forall t)\; E_t[Y_t] = \nabla l(Z_t)$ whenever $\eta_t \ne \bar\eta$
• Hessian bound: $(\exists K)(\forall \eta)\; \|\nabla^2_\eta l(\eta)\| \le 2K$
• Sunehag resemblance: $(\forall \delta > 0)\; \inf_{l(Z_t) - l(\bar\eta) > \delta} \nabla l(Z_t)^\top B_t \nabla l(Z_t) > 0$
• Sunehag algorithm bound: $(\exists A, B)(\forall t)\; E[\|B_t Y_t\|^2] \le A + B\, l(Z_t)$
• Learning rate constraint: $\sum_t \gamma(t)^2 < \infty$ and $\sum_t \gamma(t) = \infty$
The Robbins–Siegmund theorem is the key result used to prove almost sure convergence in the previous theorems, as well as in our generalization result.
Theorem A4
(Robbins–Siegmund). Let $(\Omega, F, P)$ be a probability space and $F_1 \subseteq F_2 \subseteq \cdots$ a sequence of sub-σ-fields of $F$. Let $U_t, \beta_t, \epsilon_t$ and $\zeta_t$, $t = 1, 2, \dots$, be non-negative $F_t$-measurable random variables such that
$E(U_{t+1} \mid F_t) \le (1 + \beta_t) U_t + \epsilon_t - \zeta_t, \qquad t = 1, 2, \dots$
Then, on the set $\{\sum_t \beta_t < \infty,\; \sum_t \epsilon_t < \infty\}$, $U_t$ converges almost surely to a random variable, and $\sum_t \zeta_t < \infty$ almost surely.
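A deterministic toy instance of the Robbins–Siegmund recursion (a sketch with the illustrative choices $\beta_t = \epsilon_t = 1/t^2$ and $\zeta_t = 0$, so both series are summable) shows the typical behaviour: $U_t$ keeps increasing, but the increments vanish and the sequence converges.

```python
# Sketch: U_{t+1} = (1 + beta_t) U_t + eps_t with beta_t = eps_t = 1/t^2.
# Since sum beta_t and sum eps_t are finite, Robbins-Siegmund predicts that
# U_t converges; this instance is deterministic, so we can watch it directly.
U = 1.0
prev = 0.0
for t in range(1, 200_001):
    beta = eps = 1.0 / t ** 2
    prev, U = U, (1.0 + beta) * U + eps

assert U > prev            # monotone increasing...
assert U - prev < 1e-9     # ...with vanishing increments
assert U < 20.0            # and bounded, hence convergent
```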

## Appendix B. Proof of Corollary 1

To prove the corollary, it is enough to prove the generic proposition below.
Proposition A1.
Let $U_t \subset \mathbb{R}^k$ be non-empty, closed and connected sets with $U_{t+1} \subset U_t$ for $t \in \mathbb{N}$, and let $V = \cap_t U_t$. Then V is a non-empty bounded set if, and only if, $U_T$ is bounded for some $T \in \mathbb{N}$.
Proof.
We first prove that if $U_T$ is bounded for some $T \in \mathbb{N}$, then $V = \cap_t U_t$ is a non-empty bounded set. Clearly $V \subset U_T$ and, therefore, V is bounded, though possibly empty. Observe that $U_t$ is closed and bounded, hence compact, for all $t \ge T$. Then V is non-empty by Cantor's intersection theorem.
Conversely, we now prove that if V is a non-empty bounded set, then there exists T such that $U_T$ is bounded. Assume V is a non-empty bounded set; then there exists $r > 0$ such that $V \subset B_r(0)$, where $B_r(0)$ is the ball centered at 0 with radius r. Define
$U_t^* = U_t \setminus \left( \overline{B_{2t}(0)}' \cup B_r(0) \right),$
where $\overline{B_{2t}(0)}$ is the closed ball of radius $2t$ centered at 0 and $A' = \mathbb{R}^k \setminus A$. The sets $U_t^*$ form a sequence of compact subsets with $U_{t+1}^* \subset U_t^*$ and $\cap_t U_t^*$ empty. Therefore, by Cantor's intersection theorem, there exists T such that $U_T^*$ is empty. Then $U_T \subset \overline{B_{2T}(0)}' \cup B_r(0)$. Since $V \subset U_T$ and $U_T$ is connected, we get $V \subset U_T \subset B_r(0)$, and hence $U_T$ is bounded, as we wanted to prove. □
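Proposition A1 can be illustrated with a toy example (illustrative intervals chosen for the demo, not part of the proof): the nested, closed, connected sets $U_t = [0, 1 + 1/t]$ are all bounded, and their intersection $V = [0, 1]$ is non-empty and bounded, as the proposition predicts.

```python
# Sketch: intersect the nested closed intervals U_t = [0, 1 + 1/t].
# Each U_t is bounded, and the intersection approaches V = [0, 1].
def intersect(intervals):
    lo = max(a for a, _ in intervals)
    hi = min(b for _, b in intervals)
    return (lo, hi) if lo <= hi else None   # None models an empty intersection

U = [(0.0, 1.0 + 1.0 / t) for t in range(1, 1001)]
V = intersect(U)
assert V is not None                              # non-empty, as predicted
assert V[0] == 0.0 and abs(V[1] - 1.0) < 2e-3     # and bounded: V ~ [0, 1]
```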

## Appendix C. Bottou’s Resemblance

Proposition 5 is a direct consequence of Proposition A2, which we state and prove below, and of Proposition 3.
Proposition A2.
Let $Z = (X, \gamma)$ be a stochastic process and $X$ be a vector field over $\mathbb{R}^k$. For $\delta > 0$ and $T \in \mathbb{N}$, define the vector pair set
$V_{\delta,T}(X, Z) = \{ (X(\eta), v) \mid \eta \in \mathbb{R}^k \setminus B_\delta(K_X),\; v \in EDS_Z(\eta, T) \}.$
Then Z resembles $X$ if, and only if,
$(\exists T \in \mathbb{N})(\forall \delta > 0)(\exists \epsilon > 0) \quad V_{\delta,T}(X, Z) \text{ is } \epsilon\text{-acute}.$
Proof.
By definition, $V_{\delta,T}(X, Z)$ is $\epsilon$-acute if, and only if, every vector pair $(u, v)$ in $V_{\delta,T}(X, Z)$ is $\epsilon$-acute. By definition, such vector pairs $(X(\eta), v)$ with $v \in EDS_Z(\eta, T)$ are $\epsilon$-acute if, and only if,
$(\forall \eta \in \mathbb{R}^k \setminus B_\delta(K_X)) \quad X(\eta)^\top \cdot v \ge \epsilon > 0, \quad v \in EDS_Z(\eta, T).$
The previous equation holds if, and only if, $EDS_Z(\eta, T) \subset H_\epsilon(X)(\eta)$ for $\eta \in \mathbb{R}^k \setminus B_\delta(K_X)$, as we wanted to prove. □

## Appendix D. Sunehag’s Resemblance

The result that translates Theorem A3 into resemblance concepts is Proposition 6, which we prove below.
Proof.
By Propositions A2 and 4, deduce that Z belongs to the half-space of $X$ if, and only if, there exists $T \in \mathbb{N}$ such that for every $\delta > 0$ and every $t \ge T$ there exist symmetric, positive-definite, $F_t$-measurable random matrices $B_t$ such that
$\inf_{\substack{\eta \in \mathbb{R}^k \setminus B_\delta(K_X),\; t \ge T \\ \omega \in \Omega,\; Z_t(\omega) = \eta}} X(\eta)^\top \cdot B_t(\omega) \cdot X(\eta) > 0, \qquad B_t \cdot X(Z_t) = E_t[X_t] \;\text{ whenever } Z_t(\omega) \notin K_X.$
This matches Equation (33). The matrix $B_t$ is correctly and uniquely defined for all $t \ge T$ and all $\omega \in \Omega$ such that $Z_t(\omega) \notin K_X$. Define $B_t = Id$, the identity matrix, whenever $Z_t(\omega) \in K_X$, and also define
$Y_t := B_t^{-1} \cdot X_t.$
Observe that $B_t \cdot Y_t = X_t$ and that Equation (32) is then also met, finishing the proof. □

## References

1. Amari, S.I. Natural Gradient Works Efficiently in Learning. Neural Comput. 1998, 276, 251–276. [Google Scholar] [CrossRef]
2. Thomas, P.S. GeNGA: A generalization of natural gradient ascent with positive and negative convergence results. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014; Volume 5, pp. 3533–3541. [Google Scholar]
3. Sánchez-López, B.; Cerquides, J. Convergent Stochastic Almost Natural Gradient Descent. In Proceedings of the Artificial Intelligence Research and Development-Proceedings of the 22nd International Conference of the Catalan Association for Artificial Intelligence, Mallorca, Spain, 23–25 October 2019; Volume 319, pp. 54–63. [Google Scholar]
4. Bottou, L. Online Algorithms and Stochastic Approximations. In Online Learning and Neural Networks; Saad, D., Ed.; Cambridge University Press: Cambridge, UK, 1998; Revised, October 2012. [Google Scholar]
5. Sunehag, P.; Trumpf, J.; Vishwanathan, S.V.N.; Schraudolph, N. Variable Metric Stochastic Approximation Theory. In Proceedings of the Artificial Intelligence and Statistics, Clearwater, FL, USA, 16–19 April 2009; pp. 560–566. [Google Scholar]
6. Lyapunov, A.M. The general problem of the stability of motion. Int. J. Control 1992, 55, 531–534. [Google Scholar] [CrossRef]
7. Robbins, H.; Siegmund, D. A convergence theorem for non negative almost supermartingales and some applications. In Optimizing Methods in Statistics; Rustagi, J.S., Ed.; Academic Press: Cambridge, MA, USA, 1971; pp. 233–257. [Google Scholar]
8. Karlin, S.; Taylor, H.M. Elements of stochastic processes. In A First Course in Stochastic Processes, 2nd ed.; Karlin, S., Taylor, H.M., Eds.; Academic Press: Boston, MA, USA, 1975; Chapter 1; pp. 1–44. [Google Scholar] [CrossRef]
9. Ross, S.M.; Kelly, J.J.; Sullivan, R.J.; Perry, W.J.; Mercer, D.; Davis, R.M.; Washburn, T.D.; Sager, E.V.; Boyce, J.B.; Bristow, V.L. Stochastic Processes; Wiley: New York, NY, USA, 1996; Volume 2. [Google Scholar]
10. Bass, R.F. Stochastic Processes; Cambridge University Press: Cambridge, UK, 2011; Volume 33. [Google Scholar]
11. Grimmett, G.; Stirzaker, D. Probability and Random Processes; OUP Oxford: Oxford, UK, 2020. [Google Scholar]
12. Sánchez-López, B.; Cerquides, J. Dual Stochastic Natural Gradient Descent and convergence of interior half-space gradient approximations. arXiv 2021, arXiv:2001.06744. [Google Scholar]
13. Tao, T. An Introduction to Measure Theory; Graduate Studies in Mathematics, American Mathematical Society: Providence, RI, USA, 2011. [Google Scholar]
14. Duchi, J.; Hazan, E.; Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
16. Kingma, D.P.; Ba, L.J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
17. Robbins, H.; Monro, S. A Stochastic Approximation Method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar] [CrossRef]
Figure 1. Path of stochastic process Z with director process X and learning rate $γ$.
Figure 2. Set of vectors U and its convex vector subspace $C ( U )$ in $R 2$.
Figure 3. Shaded area representing $H ϵ ( u )$.
Figure 4. A stochastic process Z that $ϵ$-resembles to $X$ at $η$ from T on, since vector set $E D S Z ( η , T )$ of all expected directions of Z at $η$ after time T belongs to $H ϵ ( X ) ( η )$.

## Share and Cite

MDPI and ACS Style

Sánchez-López, B.; Cerquides, J. On the Convergence of Stochastic Process Convergence Proofs. Mathematics 2021, 9, 1470. https://doi.org/10.3390/math9131470
