On the Convergence of Stochastic Process Convergence Proofs

Abstract: Convergence of a stochastic process is an intrinsic property quite relevant for its successful practical use, for example, in function optimization problems. Lyapunov functions are widely used tools to prove convergence of optimization procedures. However, identifying a Lyapunov function for a specific stochastic process is a difficult and creative task. This work aims to provide a geometric explanation of convergence results and to state and identify conditions for the convergence of not only optimization methods but any stochastic process. In essence, we relate the expected directions set of a stochastic process with the half-space of a conservative vector field, concepts defined along the text. Under some reasonable conditions, convergence is guaranteed when the expected directions resemble such a vector field closely enough. We translate two existing and useful convergence results into convergence of processes that resemble particular conservative vector fields. This geometric point of view could make it easier to identify Lyapunov functions for new stochastic processes whose convergence we would like to prove.


Introduction
Across most practical research branches, the solution to a given problem is often entrusted to a function optimization problem, where the effectiveness of a solution is measured by a function to be optimized. Machine learning challenges are great examples of this situation. Therefore, optimization algorithms become crucial to solve such problems. Iterative optimization methods start at an initial point and move through parameter space trying to minimize the objective function. Their performance may vary dramatically depending on the initial point. This dependence is somewhat diminished if the algorithm is guaranteed to converge in the long term to a minimum. Furthermore, in stochastic optimization algorithms, the quality achieved varies randomly and sometimes there are chances that the algorithm fails to converge. As an example, the stochastic natural gradient descent (SNGD) of Amari [1] and its variants often show instability depending on the starting point and learning rate tuning. Some experiments are even proved to diverge with SNGD [2]. Clearly, such an issue weighs considerably against its practical use.
Convergent algorithms are more stable with respect to both learning rate parameters and initial point estimations. For instance, in [3], the optimization method named convergent stochastic natural gradient descent (CSNGD) is proposed. CSNGD is designed to mimic SNGD but is proven to be convergent. Sánchez-López and Cerquides show that, unlike SNGD, CSNGD shows stability in the experiments run.
As a consequence, we are interested in better understanding the conditions that make an algorithm convergent. Convergence proofs abound in the literature. In this work we concentrate on two apparently disconnected and well-known convergence results. In his seminal work [4], Bottou proved the convergence of stochastic gradient descent (SGD). Later on, in [5], Sunehag provided an extended result for variable metric modified SGD. The connection between the proofs of both results is not evident. It is not clear what they have in common, and, therefore, further generalizations seem out of reach.
To understand the convergence results (both theorems are included in Appendix A; however, we relax the conditions of Theorem A2 in [5], which turns Theorem A2 into Theorem A3, found in the same appendix), it is helpful to take a look at their proofs. Bottou's proof relies on the construction of a Lyapunov function [6]. On the other hand, Sunehag's proof uses the Robbins–Siegmund theorem [7] instead. It can be seen that the latter proves that the function to optimize already serves as a Lyapunov function, similar to later chapters in [4]. Therefore, both proofs share some similarity, but it is not evident how to draw a connection. Establishing this connection and pointing out its relevance is the main contribution of this paper, and it results in a generalization from which both results can be easily proved as corollaries.
Stochastic optimization algorithms rely on observations extracted from some possibly unknown probability space. Algorithms subject to random phenomena are stochastic processes [8–11]. The generalized convergence result for stochastic processes in this article is built on two main concepts. The first one is the resemblance between a stochastic process and a vector field. The second one is the property of a stochastic process being locally bounded by a function. These two ingredients are enough to state and prove our convergence theorem: a stochastic process Z converges if it is locally bounded by a convex, real-valued, twice differentiable function φ with bounded Hessian and Z resembles ∇φ.
Two corollaries are extracted from this result, which we prove to be equivalent to Bottou's and Sunehag's convergence theorems. Moreover, we observe that the convergence proof in [12] of the algorithm called discrete DSNGD can be addressed by our main theorem, since the original convergence theorem of Sunehag is not general enough.
The resemblance concept involves the expected directions set of a stochastic process, which we define in Section 2, and the half-space of a vector field, a concept introduced in Section 3. Then, in Section 4, we state and prove our general result, which highlights the commonalities between Bottou's and Sunehag's theorems, proving convergence of a wider variety of algorithms.

Main Result. Director Process and the Expected Direction Set
Let (Ω, F, P) be a probability space and (S, Σ) be a measurable space. A discrete stochastic process on (Ω, F, P) indexed by N is a sequence of random variables Z = {Z_t}_{t∈N} such that Z_t : Ω → S. In this work, S = R^k and Σ is the corresponding Borel σ-algebra. Just as random variables are used to describe general random phenomena, stochastic processes indexed by N are usually used to model random sequences.

Locally Bounded Stochastic Processes and Objective of the Work
The difference between two random variables of a stochastic process is a random variable known as an increment. We say that the random variable Z_{t+s} − Z_t with s ∈ N, s ≥ 1, is an s-increment at time t. For example, the 1-increments of a stochastic process Z are the differences Z_{t+1} − Z_t for t ∈ N. We focus our attention on decompositions of 1-increments of the form Z_{t+1} = Z_t − γ(t)X_t, where γ is a positive real-valued function and X = {X_t}_{t∈N} is a stochastic process on (Ω, F, P).

Definition 1. Let Z and X be stochastic processes and γ : N → R_{>0} be a positive real-valued function such that

Z_{t+1} = Z_t − γ(t)X_t for every t ∈ N. (1)

We call X the director process of Z and γ the learning rate, and denote it by Z = (X, γ).
Expressing a process this way allows us to define Z_{t+1} with respect to Z_t, which gives us control of the difference between both values by means of γ(t)X_t, as Figure 1 shows. This is very useful if we intend to analyse the convergence of a stochastic process.
As represented in Figure 1, we can think of Z_t as the value of the process at time t, while −γ(t)X_t is the vector going from Z_t to Z_{t+1}. Throughout the article, it is important to remember this, since we constantly refer to the Z_t as points in R^k while the X_t are treated as direction vectors in R^k. This distinction is only practical for our purposes. The trajectories of stochastic approximation algorithms, such as stochastic gradient descent (SGD), are indeed samples of stochastic processes. Furthermore, they are usually expressed by means of their decomposition of 1-increments, as can be seen in the following examples.
Example 1. SGD [4] is the cornerstone of machine learning for solving the function optimization problem. The objective of SGD is to minimize an objective function L(η) = E_{z∼P*}[l(η; z)] for some unknown probability distribution P* and random variable l(η) defined on (Ω*, F*, P*). The function l is known as the loss function, and it is usually differentiable with respect to η, allowing the definition of SGD as

Z_{t+1} = Z_t − γ(t)∇_η l(Z_t; z_t),

where the Z_t are estimates of η. We can see Z = {Z_t}_{t∈N}, and, therefore, SGD, as a stochastic process. Indeed, consider the product probability space (Ω, F, P) over infinite sequences of observations (this space is guaranteed to exist according to the Kolmogorov extension theorem; see, for example, Theorem 2.4.4 and following examples in [13]). Hence, we can define the stochastic process X on (Ω, F, P) such that X_t = ∇_η l(Z_t), where, for every ω = (z_0, z_1, ...) ∈ Ω, X_t(ω) = ∇_η l(Z_t(ω); z_t). In addition, we observe that Z_{t+1} depends only on the last observation Z_t and t, which makes Z a non-stationary Markov chain.

Example 2. This example is worked out in [5]. Again, we focus on the function optimization problem, using the same notation as in the previous example. In this case, the estimation update of the minimum η is defined as

Z_{t+1} = Z_t − γ(t) B_t • Y_t,

where B_t is a matrix in R^{k×k} known from the information Z_0, ..., Z_t available at time t, and Y_t = Y(Z_t), where Y is a function mapping each η ∈ R^k to a random variable on the same probability space (Ω*, F*, P*).
Similarly to the previous example, Y can be thought of as a random variable on the product probability space (Equation (4)) that depends on the previous Z_t, such that, for every ω ∈ Ω, the value Y_t(ω) is determined by the trajectory observed up to time t. Here, Z is not a (non-stationary) Markov chain, since B_t may depend on Z_i for all i < t.
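Example 1 above can be sketched in a few lines of Python. The following is a minimal, illustrative sketch: the scalar loss l(η; z) = (η − z)², the sampling distribution (Gaussian with mean η* = 3), and the learning rate γ(t) = 1/(t + 1) are all hypothetical choices made here, not taken from the text.

```python
import random

def sgd(eta0, grad, gamma, steps, seed=0):
    """Iterate the 1-increment decomposition Z_{t+1} = Z_t - gamma(t) * X_t,
    where X_t = grad(Z_t, z_t) is a stochastic gradient estimate."""
    random.seed(seed)
    z = eta0
    for t in range(steps):
        sample = random.gauss(3.0, 1.0)      # z_t drawn from the (here assumed) P*
        z = z - gamma(t) * grad(z, sample)   # one step of the process
    return z

# Hypothetical loss l(eta; z) = (eta - z)^2, so grad_eta l = 2 (eta - z);
# its expectation L(eta) is minimized at eta* = E[z] = 3.
grad = lambda eta, z: 2.0 * (eta - z)
gamma = lambda t: 1.0 / (t + 1)              # a learning rate with the standard constraint

estimate = sgd(eta0=0.0, grad=grad, gamma=gamma, steps=5000)
```

After a few thousand iterations, the iterates concentrate around the minimizer of L, illustrating the almost sure convergence studied in this article.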
The naming of γ as learning rate is commonly used in the machine learning research branch [4,14–16]. The director process X determines the direction X_t at time t of the update Equation (1) with Z_t as reference point, while γ(t) specifies a certain distance to travel along that direction X_t. Moreover, we demand some constraints on both factors. The condition imposed on γ is usually found in the literature [3–5]. A learning rate γ holds the standard constraint if

∑_{t∈N} γ(t) = ∞ and ∑_{t∈N} γ(t)² < ∞.

Before we show the condition for the director process X, we fix some notation used throughout the article. Consider the natural filtration F^Z = {F_t}_{t∈N} generated by the stochastic process Z, that is, F_t = σ(Z_0, ..., Z_t). Intuitively, every F_t of a filtration is a σ-algebra that classifies the elements of Ω. For example, if Ω is the set of colours, F_t can gather warm and cold colours into separate and complementary sets. The fact that a random variable Z_t is F_t-measurable implies that Z_t sends all warm colours to the same value and all cold colours also to the same value. Somehow, Z_t is then not providing any additional information about elements of Ω beyond the classification of F_t. The sequence F_t is increasing, in the sense that F_t ⊂ F_{t+1} for all t. Therefore, a filtration characterizes the space Ω with sequentially higher levels of information or classification. Denote by E_t[·] = E[· | F_t] the expectation conditioned to F_t. Hence, if Z = (X, γ), we say that X is locally and linearly bounded by a function φ : R^k → R if there exist constants a, b > 0 such that E_t[‖X_t‖²] ≤ a + b φ(Z_t) for every t ∈ N. These two constraints are finally combined to present the kind of stochastic processes we are interested in.

Definition 2. Let Z be a stochastic process and φ : R^k → R be a function. We say that Z is locally bounded by φ if there is a decomposition of 1-increments (X, γ) with γ holding the standard constraint and X locally and linearly bounded by φ.
Furthermore, if Z_0 = η_0 a.s., we say η_0 is the initial point of Z.

For instance, Examples 1 and 2 of this section define Z as a locally bounded process, as we see below.
Example 3. Recall Example 1. In the same reference [4], the optimization algorithm is required to satisfy additional conditions in order to prove its convergence. We include the convergence theorem in Appendix A. Among these conditions, the director process is asked to satisfy a second-moment bound of the form E_t[‖X_t‖²] ≤ a + b ‖Z_t − η*‖², where η* ∈ R^k is the optimal point of L. The standard constraint on γ is clearly asked for. Moreover, η_0 is a starting point. It remains to be seen whether X is locally and linearly bounded by some function φ : R^k → R. Indeed, if we define φ(η) = ‖η − η*‖², then the property is easily checked. Hence, Z is locally bounded by φ with initial point η_0.

Example 4. Recall Example 2. The convergence theorem in [5], which is included in Appendix A, demands conditions C.1–C.3, where L is the function to optimize. For this example, Z is then locally bounded by φ = L with initial point η_0. Just as an observation, the property of B_t being determined by the information available at time t is the same as seeing B_t as an F_t-measurable random variable over the product probability space.
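The standard constraint on γ appearing in Examples 3 and 4 is easy to check numerically for the classic choice γ(t) = 1/(t + 1); assuming the constraint takes the usual Robbins–Monro form (∑ γ(t) = ∞ and ∑ γ(t)² < ∞), the partial sums of γ grow without bound, while the partial sums of γ² stay below their limit π²/6:

```python
import math

gamma = lambda t: 1.0 / (t + 1)

N = 100_000
partial_sum = sum(gamma(t) for t in range(N))        # harmonic sums: grow like log N
partial_sq = sum(gamma(t) ** 2 for t in range(N))    # converge to pi^2 / 6

print(partial_sum > math.log(N))   # divergent series: partial sums exceed log N
print(partial_sq < math.pi ** 2 / 6)
```

Any positive multiple of this γ also satisfies the constraint, which is why scaling the learning rate does not affect the convergence guarantee.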
We are interested in studying the almost sure convergence of Z to a point η* ∈ R^k. A stochastic process Z almost surely (a.s.) converges to a point η* ∈ R^k if P(lim_{t→∞} Z_t = η*) = 1. Examples 3 and 4 show us that we can understand the results in [4,5] as the almost sure convergence of some locally bounded processes. In this paper, we are interested in characterizing the almost sure convergence of locally bounded processes. The objective of this work is to create a theory that allows proving the a.s. convergence of locally bounded processes, one that covers Examples 3 and 4 and whose applicability generalizes to a wider set of processes, such as the one described below.

Example 5. Consider the function f(η) = ‖η‖² defined on R^k, and the optimization method Z defined by its director process X_t = G_1 • G_2 • Z_t, where G_1 and G_2 are positive-definite and symmetric matrices. For simplicity, this example shows a stochastic process with no random phenomena associated. We wonder about the convergence of the process Z and, if so, whether it converges to the point of R^k that optimizes the function f. From Theorems A1 and A2 found in the literature (included in Appendix A), it is not possible to prove a.s. convergence of Z, since conditions Bottou resemblance and C.3, respectively, are not satisfied. That is because the product G_1 • G_2 of two symmetric positive-definite matrices need not be symmetric, so the structure those conditions demand is not available. Further on, Z is assumed to be locally bounded by φ, where (X, γ) is its corresponding decomposition of 1-increments, unless otherwise indicated.
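A quick numerical sketch of Example 5 suggests that the process does converge to the optimum 0 of f, even though Theorems A1 and A2 cannot certify it. The 2×2 matrices below are hypothetical illustrative choices of G_1 and G_2 (not taken from the text), and γ(t) = 1/(t + 1) is an assumed learning rate satisfying the standard constraint:

```python
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(2)) for i in range(2)]

def norm(v):
    return (v[0] ** 2 + v[1] ** 2) ** 0.5

# Hypothetical symmetric positive-definite matrices G1, G2.
G1 = [[2.0, 1.0], [1.0, 2.0]]
G2 = [[1.0, 0.0], [0.0, 2.0]]

gamma = lambda t: 1.0 / (t + 1)    # learning rate with the standard constraint

z = [1.0, 1.0]                     # initial point eta_0
for t in range(10_000):
    x = matvec(G1, matvec(G2, z))  # director process X_t = G1 . G2 . Z_t
    z = [z[i] - gamma(t) * x[i] for i in range(2)]

print(norm(z))                     # the iterates shrink towards the optimum 0
```

The iterates approach the origin, which is precisely the behaviour our main theorem will justify in Section 4.4.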

Main Result
The objective of the article is to prove the theorem below, which we do in Section 4.1.
Theorem 1. Let Z be a stochastic process on probability space (Ω, F, P). Then Z almost surely converges to a point η* if there is a twice differentiable convex function φ, defined on R^k, with unique minimum η* and bounded Hessian norm, such that Z is locally bounded by φ and Z resembles ∇φ.

There is one concept of the theorem that still needs a definition: when a stochastic process resembles a vector field. The next sections have that end, with our main definition filling the gap appearing in Section 3.2. As we will see in Section 4.4, the simple Example 5 finds a solution with our main theorem.

Expected Direction Set
We now define one key object of our work, named the expected directions set. It gathers all directions that the update may take at time t conditioned to F_t. Before the definition, we provide some concepts and notation.
The random variable E_t[X_t] determines all expected directions of Z at time t that the stochastic process may follow given F_t. For example, if ω ∈ Ω is an observation, then E_t[X_t](ω) ∈ R^k is a vector pointing to the expected update direction departing from the point Z_t(ω) given F_t. Denote the expected direction of Z at ω ∈ Ω and time t as D_Z(ω, t) = E_t[X_t](ω). The expected direction from the point η = Z_t(ω) depends on ω, that is, on the path followed until reaching η. We collect all expected directions at η = Z_t(ω) and time t in the vector set S_Z(η, t) = {D_Z(ω, t) | ω ∈ Ω, Z_t(ω) = η}. The tools to define the expected directions set at η ∈ R^k after time T ∈ N are now given, so we proceed to its formal definition.

Definition 3. Let Z = (X, γ) be a stochastic process, η ∈ R^k and T ∈ N. The expected directions set of Z at η after time T is EDS_Z(η, T) = ∪_{t≥T} S_Z(η, t).
In a few words, EDS_Z(η, T) is a vector set containing all expected directions (provided by the director process X) conditioned to F_t for every outcome ω such that Z_t(ω) = η with t ≥ T. In Definition 3, EDS depends on T. That is because, to assess the convergence of an algorithm, it is not important to consider all expected directions throughout the whole process. For example, if an algorithm converges, we can randomly modify all directions of the director process for just a particular time T ∈ N, and the resulting algorithm still converges. Roughly speaking, only the tail of a process matters to determine the convergence property. This idea is better addressed with Definition 4 in the next section.

Example 6. Recall Example 1 and assume that Z is SGD. Then EDS_Z(η, T) is a singleton. Indeed, D_Z(ω, t) is the same vector for all t ∈ N and all ω with Z_t(ω) = η, and hence S_Z(η, t) = {D_Z(ω, t)} for any ω ∈ Ω with Z_t(ω) = η.

This is the case for any non-stationary Markov chain.

Essential Expected Direction Set
The convergence property of an algorithm relates closely to the directions followed after time T ∈ N as T tends to infinity. Equivalently, the direction set appearing repeatedly throughout the whole optimization process matters, while directions contemplated only for a finite amount of iterations change nothing in terms of convergence guarantees. This direction set is named the essential expected directions set in this article.
To properly define the essential expected directions set, we will use the convex vector subspace of a given vector set. Given a vector set U in R^k, let C(U) be the smallest convex vector subspace containing U. See Figure 2 for an illustrative example. Observe that C(U) is always closed, but it may be unbounded.

Definition 4. Let Z = (X, γ). The essential expected directions set of Z at η is EEDS_Z(η) = ∩_{T∈N} C(EDS_Z(η, T)).
The definition of EEDS_Z(η) delimits the smallest subspace towards which all directions at η tend. Clearly, EEDS_Z(η) is also convex and closed (possibly empty). Deeper properties of this set lead to identifying divergence symptoms. For example, if it is empty or unbounded, we face instability of the process at η. To see this, observe the result below, whose proof can be found in Appendix B.
Corollary 1. Let η ∈ R^k. Then EEDS_Z(η) is a non-empty bounded set if, and only if, there exists T ∈ N such that C(EDS_Z(η, T)) is bounded.

This result relates EEDS_Z(η) with instability properties of Z. If EEDS_Z(η) is empty or unbounded, then the algorithm is unstable at η, since expected directions with arbitrarily large norms exist after enough iterations. Clearly, if this situation is found for all points near the optimum, the algorithm cannot converge to the solution. It is desirable instead that C(EDS_Z(η, T)) is compact (bounded) for some T for every η ∈ R^k, or, equivalently, that EEDS_Z(η) is compact (bounded) and non-empty.
In fact, since we are interested in the case where Z is locally bounded by φ (recall Definition 2), we can assume that EEDS_Z(η) is a non-empty compact set, by virtue of the results below.

Proposition 1. Let the stochastic process Z be locally bounded by φ. Then C(EDS_Z(η, 0)) is a non-empty compact set.

Proof. We know that X is locally and linearly bounded by φ. Hence, applying Jensen's inequality, ‖E_t[X_t]‖² ≤ E_t[‖X_t‖²] ≤ a + b φ(Z_t). Let η ∈ R^k and ω ∈ Ω be such that Z_t(ω) = η for some t ≥ 0. Therefore, every expected direction D_Z(ω, t) at η has norm at most (a + b φ(η))^{1/2}, so EDS_Z(η, 0) is bounded and C(EDS_Z(η, 0)) is a non-empty compact set.

The corollary below is a consequence of Proposition 1 and Corollary 1.

Corollary 2.
Let the stochastic process Z be locally bounded by φ. Then EEDS_Z(η) is a non-empty compact set for all η ∈ R^k.

Vector Field Half-Spaces and Stochastic Processes. Resemblance.
This section defines the main concept of this work: the property of resemblance between a stochastic process and a vector field. The definition highlights some commonalities between Theorems A1 and A3. Both of them prove the convergence of stochastic processes that resemble particular vector fields. A geometric interpretation and explanation of the conditions of the convergence theorems is established in Section 4.
Some previous definitions are needed before introducing the main concepts of the article, such as ε-acute vector pair sets and the half-space of a vector field. The section starts with some basic concepts about vectors.

Definition 5. Let u, v ∈ R^k be two vectors. The pair (u, v) is acute if u and v form an acute angle, that is, if u^T • v > 0. Given ε > 0, the pair (u, v) is ε-acute if u^T • v ≥ ε.

Proposition 2. Let u, v ∈ R^k be two vectors. Then the pair (u, v) is ε-acute if, and only if, there exists a symmetric positive-definite matrix B such that B • u = v and u^T • B • u ≥ ε.

A vector pair set is a set of vector pairs V = {(u_i, v_i) | i ∈ I}, where I is an index set.

Definition 6. Let V be a vector pair set. V is ε-acute if every vector pair (u, v) ∈ V is ε-acute.
The next result is a direct consequence.

Proposition 3. Let V be a vector pair set indexed by I. Then V is ε-acute for some ε > 0 if, and only if, inf_{i∈I} u_i^T • v_i > 0.

Proposition 4. Let V be a vector pair set indexed by I. Then V is ε-acute for some ε > 0 if, and only if, there exists a set of symmetric positive-definite matrices B = {B_i | i ∈ I} such that

B_i • u_i = v_i and u_i^T • B_i • u_i ≥ ε for every i ∈ I. (19)

Proof. We first prove that, if there exists a set of matrices B = {B_i | i ∈ I} holding Equation (19), then V is ε-acute for some ε > 0. Observe that, after Equation (19), u_i^T • v_i = u_i^T • B_i • u_i ≥ ε for every i ∈ I. Then, Proposition 3 implies that V is ε-acute, which finishes this part of the proof. Now assume that V is ε-acute; we prove that there exists a set of matrices B = {B_i | i ∈ I} holding Equation (19). Since V is ε-acute, in particular the pair (u_i, v_i) ∈ V is ε-acute for every i ∈ I. Apply Proposition 2: for every i ∈ I there exists a symmetric positive-definite matrix B_i such that B_i • u_i = v_i and u_i^T • B_i • u_i ≥ ε. This finishes the proof.
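A minimal sketch of these notions, reading an ε-acute pair as one whose inner product is at least ε (the inner-product characterization used in Propositions 3 and 4); the concrete vectors are illustrative choices:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def is_acute(u, v):
    """(u, v) is acute when the vectors form an acute angle: u . v > 0."""
    return dot(u, v) > 0

def is_eps_acute_set(pairs, eps):
    """A vector pair set is eps-acute when every pair satisfies u . v >= eps;
    by Proposition 3 this holds for some eps > 0 iff inf_i u_i . v_i > 0."""
    return all(dot(u, v) >= eps for u, v in pairs)

pairs = [([1.0, 0.0], [0.5, 0.5]),
         ([0.0, 2.0], [1.0, 1.0])]
print(is_acute([1.0, 0.0], [0.0, 1.0]))   # orthogonal pair: not acute
print(is_eps_acute_set(pairs, eps=0.5))   # both inner products are >= 0.5
```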

The Half-Space of a Vector Field
The half-space determined by a vector u is the set of vectors that form an acute angle with u. This region clearly occupies half of the total space. Additionally, the ε-half-space of u with ε > 0 is the set of vectors v such that the vector pair (u, v) is ε-acute. This object is needed for afterwards defining the half-space of a vector field. We define these concepts below and illustrate the ε-half-space of a vector u in Figure 3.

Definition 7. Let u be a vector of R^k. The half-space of u is the set H(u) = {v ∈ R^k | u^T • v > 0}. Similarly, the ε-half-space of u with ε > 0 is the set H_ε(u) = {v ∈ R^k | u^T • v ≥ ε}.

A vector field X over R^k is a function assigning to every η ∈ R^k a vector of R^k, that is, X : R^k → R^k. For example, if l : R^k → R is a twice differentiable function, we can consider the vector field consisting of the gradient vectors at each point η. Precisely, denote the gradient vector field (GVF) as X_∇l, where X_∇l(η) = ∇l(η).
We are ready to define the half-space of a vector field.
Definition 8. Let X be a vector field over R^k. The half-space of X is the function H(X) mapping every η to H(X)(η) = H(X(η)). Similarly, the ε-half-space of X with ε > 0 is the function H_ε(X) mapping every η to H_ε(X)(η) = H_ε(X(η)).
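A small sketch of Definitions 7 and 8 for the gradient vector field of the hypothetical function l(η) = ‖η‖², whose gradient at η is 2η (the function and the test vectors are assumptions made for illustration):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def grad_l(eta):
    """Gradient vector field X_grad_l of the assumed l(eta) = ||eta||^2."""
    return [2.0 * a for a in eta]

def in_half_space(field, eta, v):
    """v belongs to H(X)(eta) when the pair (X(eta), v) is acute."""
    return dot(field(eta), v) > 0

def in_eps_half_space(field, eta, v, eps):
    """v belongs to H_eps(X)(eta) when X(eta) . v >= eps."""
    return dot(field(eta), v) >= eps

eta = [1.0, -1.0]
print(in_half_space(grad_l, eta, [1.0, 0.0]))   # 2 > 0: inside H(X)(eta)
print(in_half_space(grad_l, eta, [0.0, 1.0]))   # -2 < 0: outside H(X)(eta)
```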

Resemblance between a Stochastic Process and a Vector Field
The convergence of any locally bounded process can be proved by comparing the expected directions set of the algorithm with some vector field. When the expected directions resemble the vector field we compare them to, then, under some reasonable conditions, we can ensure the almost sure convergence of the stochastic process to a point. By resemblance, we mean that the expected directions set after some time T is a subset of the ε-half-space of X, among other things explained later. Therefore, resemblance asks, for every η ∈ R^k, that every vector D_Z(ω, t) with t ≥ T and every ω ∈ Ω with η = Z_t(ω) form an acute angle with the vector field value X(η).
However, if the vector field sends a specific point η to 0 ∈ R^k, then no direction D_Z(ω, t) can form an acute vector pair with it. Therefore, the resemblance property is evaluated outside neighborhoods of these annulled points. That is why we must now consider the set of annulled points of a vector field and the neighborhoods around the points of this set.
Formally, let X be a vector field defined on R^k. The set K_X is the set of points of R^k annulled by X, that is, K_X = {η ∈ R^k | X(η) = 0}. We also use the notation Ā = R^k \ A for the complement set of a subset A ⊂ R^k. We say that Z ε-resembles X at η from T on if EDS_Z(η, T) ⊂ H_ε(X)(η). See Figure 4 for an illustrative example.
This intuition naturally extends to ε-resemblance at sets, when the property is satisfied for every η in the set. With this in mind, we can define the key concept of this article.

Figure 4. A stochastic process Z that ε-resembles X at η from T on, since the vector set EDS_Z(η, T) of all expected directions of Z at η after time T belongs to H_ε(X)(η).

Definition 9. Let Z = (X, γ) be a stochastic process and X be a vector field over R^k. We say that Z resembles X from T ∈ N on if, for every neighborhood A of K_X, there exists ε > 0 such that Z ε-resembles X at every η ∈ Ā from T on. We say that Z resembles X if there is T ∈ N such that it resembles X from T on.
Everything is now set up to accomplish the goal of this paper. We restate the main theorem of this article in the next section and show its proof.

Proof of Main Result. Reinterpretation of Convergence Theorems
The objective of the article, proving the main Theorem 1, is now within reach. Moreover, this section afterwards addresses the task of proving that Theorems A1 and A3 are particular cases of our main Theorem 1.

Resemblance to Conservative Vector Fields and Convergence
Recall the main Theorem 1 and observe that it asks the stochastic process Z to be locally bounded by some function φ and Z to resemble ∇φ. Note that ∇φ is a particular type of vector field called a conservative vector field, that is, a vector field arising as the gradient of a function. That is why we understand our main theorem as a convergence result for locally bounded processes that resemble conservative vector fields.
The theorem statement says that φ has bounded Hessian norm. Similarly to Theorem A3, this means that there exists K > 0 such that ‖∇²φ(η)‖ ≤ K for every η ∈ R^k. We are ready to prove the main result of the paper.
Proof. Define ψ(η) = φ(η) − φ(η*) ≥ 0. From here, the proof follows the steps of the proof of Theorem A2. By the Taylor inequality and the Hessian norm bound,

ψ(Z_{t+1}) ≤ ψ(Z_t) − γ(t) ∇ψ(Z_t)^T • X_t + K' γ(t)² ‖X_t‖²,

where K' = K/2. Apply expectation conditioned to the information until time t, and then use that Z is locally bounded by ψ:

E_t[ψ(Z_{t+1})] ≤ ψ(Z_t) − γ(t) ∇ψ(Z_t)^T • E_t[X_t] + K' γ(t)² (a + b ψ(Z_t)).

Now use that Z resembles X = ∇ψ. Then there exists T such that, for every t ≥ T, the term −γ(t) ∇ψ(Z_t)^T • E_t[X_t] is non-positive. All other conditions of the Robbins–Siegmund theorem (in [7], included in Appendix A) also hold for the algorithm after time T, thanks to the learning rate constraints. Apply it to deduce that the random variables ψ(Z_t) converge almost surely to a random variable (and so do the φ(Z_t)) and that

∑_t γ(t) ∇ψ(Z_t)^T • E_t[X_t] < ∞ almost surely. (26)

We now prove that the stochastic process φ(Z_t) converges almost surely to the value φ(η*). Proceed by contradiction: assume that, on a set A of positive measure, φ(Z_t) does not converge to φ(η*), so that for some δ_1 > 0 we eventually have φ(Z_t) ≥ φ(η*) + δ_1 on A. By continuity and convexity of the function φ, this implies that there exists δ > 0 such that, eventually, ‖Z_t − η*‖ ≥ δ on A. By resemblance and the definition of the limit, there exist T and ε > 0 such that EDS_Z(η, T) ⊂ H_ε(X)(η) for every η with ‖η − η*‖ ≥ δ. This leads to a contradiction, since, using the learning rate standard constraint,

∑_{t≥T} γ(t) ∇ψ(Z_t(ω))^T • E_t[X_t](ω) ≥ ε ∑_{t≥T} γ(t) = ∞

for every ω ∈ A, which has positive measure. This clearly contradicts Equation (26). Hence, φ(Z_t) converges almost surely to φ(η*), and Z_t converges almost surely to η*, as wanted.

Reinterpretation of Bottou's Convergence Theorem
The goal now is to deduce Theorem A1 as a direct consequence of the main Theorem 1. Consider the particular case of the main Theorem 1 where φ(η) = ‖η − η*‖², which reads as follows.

Corollary 3. Let Z be a stochastic process on probability space (Ω, F, P) and η* ∈ R^k. Then Z almost surely converges to η* if Z is locally bounded by φ(η) = ‖η − η*‖² and Z resembles ∇φ.
Additional conditions on φ, such as the Hessian bound or twice differentiability, are not specified in the corollary since, with this particular definition of φ, all those conditions are already satisfied.
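Indeed, writing η* for the optimum, a direct computation for φ(η) = ‖η − η*‖² gives

```latex
\nabla \varphi(\eta) = 2(\eta - \eta^{*}), \qquad
\nabla^{2} \varphi(\eta) = 2 I_{k}, \qquad
\left\| \nabla^{2} \varphi(\eta) \right\| = 2 \quad \text{for every } \eta \in \mathbb{R}^{k},
```

so φ is convex, twice differentiable, has η* as its unique minimum, and its Hessian norm is bounded by K = 2.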
To see that Corollary 3 proves the statement of Theorem A1, we need to prove that Theorem A1 assumes that Z is locally bounded by φ and that Z resembles ∇φ. Example 3 already proves that Bottou assumes that Z is locally bounded by φ. Therefore, it remains to check that Z resembles ∇φ. To that end, see the proposition below, proved in Appendix C.

Proposition 5. Let Z = (X, γ) be a stochastic process and X be a vector field over R^k. Then Z resembles X if, and only if, for every neighborhood A of K_X there exist T ∈ N and ε > 0 such that D^T • X(η) ≥ ε for every η ∈ Ā and every D ∈ EDS_Z(η, T).

Observe condition Bottou resemblance of Theorem A1 and Proposition 5, and deduce from them that the algorithm Z of the theorem resembles the vector field ∇φ.

Reinterpretation of Sunehag's Convergence Theorem
Theorem A3 is deduced from the main Theorem 1. Similarly to the previous section, we provide a version of our main theorem for the case where φ = l is the function that we aim to minimize.

Corollary 5. Let l : R^k → R be a twice differentiable cost function with a unique minimum η* and bounded Hessian norm, and let Z be a stochastic process on probability space (Ω, F, P). Then Z converges to the minimum η* of l almost surely if

• Z is locally bounded by l;
• Z resembles ∇l.
The stochastic process described in Theorem A3 has some additional structure, such as X_t = B_t • Y_t. However, if we prove that the Z of that theorem is locally bounded by l and that Z resembles ∇l, then it is clear that Corollary 5 implies Theorem A3. Recall Example 4 and notice that we already proved that Z is locally bounded by l. The remaining property is obtained from the proposition below, which we prove in Appendix D.

Proposition 6. Let Z = (X, γ) be a stochastic process and X be a vector field over R^k. Then Z resembles X if, and only if, there exists T such that, for every t ≥ T, there are random vectors Y_t in R^k with E_t[Y_t] = X(Z_t) and symmetric and positive-definite F_t-measurable random matrices B_t, with spectrum uniformly bounded away from zero, such that B_t • Y_t = X_t.

It only remains to put together Proposition 6, condition C.1 and condition Sunehag resemblance to finish our objective with the following corollary.

Corollary 6. Let l be a differentiable function and Z = (X, γ) be a stochastic process. Then Z resembles ∇l if, and only if, there exists T such that, for every t ≥ T, there are random vectors Y_t in R^k and symmetric and positive-definite F_t-measurable random matrices B_t such that B_t • Y_t = X_t and conditions C.1 and Sunehag resemblance hold.
Corollaries 4 and 6 nicely show the value of Theorem 1 for proving convergence. To reinforce this, we notice that the convergence of the algorithm DSNGD in [12] is easily proved by means of Corollary 5, combining Theorem A3 and Corollary 6. This shows that Theorem 1 allows proving convergence of a wider set of stochastic processes and function optimization methods.

Convergence of Process in Example 5
Our theorem solves the question proposed by Example 5. To see it, just define φ(η) = (1/2) η^T • G_2 • η. This twice differentiable and convex function φ has bounded Hessian norm, since its Hessian is the constant matrix G_2. Moreover, Z is clearly locally bounded by φ. Indeed, recall Equation (7) and observe

‖X_t‖² = ‖G_1 • G_2 • Z_t‖² ≤ λ_1² λ_2² ‖Z_t‖² ≤ (2 λ_1² λ_2² / μ_2) φ(Z_t),

where λ_i is the greatest eigenvalue of G_i and μ_i is the least eigenvalue of G_i. Finally, Z resembles ∇φ, since ∇φ(η) = G_2 • η and, for every η ≠ 0, the pair (G_2 • η, G_1 • G_2 • η) is acute: (G_2 • η)^T • G_1 • (G_2 • η) > 0 because G_1 is positive definite.
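The resemblance part of this argument can also be checked numerically. Taking φ(η) = ½ η^T G_2 η, so that its Hessian is the constant matrix G_2, the field value is G_2 η and the expected direction is G_1 G_2 η; the 2×2 matrices below are hypothetical illustrative choices of G_1 and G_2:

```python
import random

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(2)) for i in range(2)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical symmetric positive-definite matrices.
G1 = [[2.0, 1.0], [1.0, 2.0]]
G2 = [[1.0, 0.0], [0.0, 2.0]]

random.seed(0)
all_acute = True
for _ in range(1000):
    eta = [random.uniform(-5.0, 5.0), random.uniform(-5.0, 5.0)]
    w = matvec(G2, eta)    # field value (grad phi)(eta) = G2 . eta
    d = matvec(G1, w)      # expected direction X_t = G1 . G2 . eta
    # (G2 eta)^T G1 (G2 eta) > 0 whenever eta != 0, since G1 is positive definite
    all_acute = all_acute and dot(d, w) > 0
```

Every sampled pair is acute, matching the algebraic argument above.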

Conclusions
We have presented a result that allows us to prove the convergence of stochastic processes. We have proven that two useful convergence results in the literature are consequences of our theorem. This is achieved through a new theory that compares the expected directions of the algorithm to conservative vector fields. If the expected directions at a point η resemble closely enough the vector X(η), with X = ∇φ a conservative vector field, then the process is stable at that point. If this happens for every η ∈ R^k, and in addition the process is locally bounded by φ, then the process is globally stable and converges. Some inspiring paths remain unexplored after this work. For example, finding the function φ is the key to proving convergence, and it is required to be a convex, twice differentiable function. It would be interesting to study how the function φ can be obtained, for instance as a sum of other convex twice differentiable functions φ_i.
Another promising research line is a deeper analysis of the EDS and EEDS objects, which may guarantee the existence of a function φ without the need of finding it. If sufficient conditions are established for a stochastic process to ensure resemblance to some unknown conservative vector field, then the search for φ can be dodged. Even proving the non-existence of such a function, after a wider study of EDS and EEDS, is useful, ruling out the use of our theorem.
It is also interesting to study the converse implication, specifically, investigating the conditions that lead to divergent instances within the theory explained in the article. In this sense, the Lyapunov characterization of convergent processes becomes a helpful and key theory, since great similarities arise between these two techniques.
Furthermore, on many occasions the function φ to optimize can be established beforehand (convex and twice differentiable). Therefore, the opposite process can be considered, that is, generating a set of stochastic processes that resemble ∇φ, thereby assuring the convergence of such candidates.
In [17], one finds another relevant convergence result. It assures the convergence in probability of a stochastic process, instead of the almost sure convergence worked with in this article. We wonder about the existing commonalities with our theorem, and about the possibility of relaxing the conditions our theorem imposes while still ensuring convergence in probability of a process.

Figure 1. Path of stochastic process Z with director process X and learning rate γ.

Figure 2. Set of vectors U and its convex vector subspace C(U) in R².
Example 7. Assume Z is any non-stationary Markov chain, such as SGD in Example 1. Then EEDS_Z(η) = EDS_Z(η, T) for any T. Indeed, we have seen in Example 6 that EDS_Z(η, T) = {E_t[X_t](ω)} for any ω ∈ Ω and t ≥ T with Z_t(ω) = η.