Learning-Augmented Quasi-Gradient Operators for Constrained Optimization: A Contraction–Bias–Variance Decomposition

Pérez-Lechuga, Gilberto; Coronel García, Marco Antonio; Martínez Salazar, Ana Lidia

doi:10.3390/math14071202

by Gilberto Pérez-Lechuga *, Marco Antonio Coronel García and Ana Lidia Martínez Salazar

División de Estudios de Posgrado e Investigación, Instituto Tecnológico de Ciudad Madero, Tecnológico Nacional de México, Av. 1° de Mayo y Sor Juana I. de la Cruz S/N Col. Los Mangos, Ciudad Madero C.P. 89440, Tamaulipas, Mexico

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(7), 1202; https://doi.org/10.3390/math14071202

Submission received: 27 February 2026 / Revised: 27 March 2026 / Accepted: 1 April 2026 / Published: 3 April 2026

Abstract

This paper develops a rigorous operator-theoretic framework for learning-augmented quasi-gradient methods in constrained optimization. We consider the minimization of an objective function over a closed convex feasible set, where feasibility is enforced via projection and directional updates may incorporate data-driven corrections. Such settings arise naturally in modern optimization algorithms that integrate artificial intelligence components under structural constraints. The proposed formulation introduces an explicit contraction–bias–variance decomposition of the iterative dynamics. Curvature induces deterministic contraction; alignment distortion (quantified by a geometric parameter) modifies the effective contraction margin; and stochastic learning components inject controlled dispersion. Explicit error recursions yield convergence guarantees under strong convexity, the Polyak–Łojasiewicz condition, and smooth nonconvexity. The analysis establishes that stability regions and first-order complexity bounds are preserved whenever alignment distortion remains below unity and bounded second-moment conditions hold. A fully reproducible computational study provides quantitative validation: the empirically observed steady-state error closely matches the theoretical prediction proportional to $σ^{2} / (μ (1 - η))$. Comparative experiments with gradient, stochastic gradient, and momentum methods confirm that the proposed operator retains classical stability margins and conditioning sensitivity while enabling principled integration of learned directional components. The results provide a transparent mathematical bridge between stochastic approximation theory and contemporary AI-enhanced constrained optimization.

Keywords: learning-augmented optimization; quasi-gradient methods; constrained optimization; operator-theoretic analysis; contraction–bias–variance decomposition; stochastic approximation; alignment geometry; convergence guarantees

MSC: 90C25; 47H09; 62L20; 90C30; 68T05; 49M37

1. Introduction

Constrained optimization plays a central role in mathematics, engineering, economics, and data science. A wide spectrum of models—ranging from convex programs and variational inequalities to discrete and mixed formulations—can be expressed as the problem of minimizing an objective function over a structured feasible set. In many practical settings, however, exact gradient information is unavailable, noisy, computationally expensive, or even undefined due to non-differentiability or discrete structure. These limitations have motivated the development of quasi-gradient and stochastic approximation techniques, which provide generalized directional information suitable for iterative descent schemes.

The foundations of stochastic approximation were established by Robbins and Monro [1], whose seminal work introduced recursive procedures for root-finding under noisy observations. Subsequent developments by Polyak [2] and Nemirovski [3] provided deeper insights into convergence rates and complexity bounds in convex stochastic optimization. Ermoliev [4] extended stochastic quasi-gradient methods to constrained and large-scale models, emphasizing their applicability in engineering systems under uncertainty. Related analyses of stochastic search efficiency within constrained quasi-gradient frameworks can be found in [5]. Pflug [6] further formalized stochastic programming and analyzed stability properties of iterative schemes under probabilistic perturbations.

Quasi-gradient techniques differ from classical gradient methods in that the update direction is constructed from surrogate or approximate information. These methods encompass subgradient schemes for nonsmooth optimization, projection-based algorithms for constrained problems, and various operator-splitting frameworks. Rockafellar [7] provided the convex analytic foundation for generalized gradients and variational structures, while Bertsekas [8] developed projection and augmented Lagrangian approaches for constrained optimization. Nesterov [9] introduced acceleration principles that have profoundly influenced modern first-order methods.

Parallel to these theoretical advances, the emergence of artificial intelligence and machine learning has significantly transformed the landscape of optimization. Large-scale learning problems are typically solved using stochastic gradient descent (SGD) and its variants [10], where gradient estimators are computed from minibatches of data. More recently, adaptive and learning-enhanced optimization strategies have been proposed, where algorithmic components—such as step sizes, search directions, or branching rules—are influenced by data-driven models. Rubinstein and Kroese [11] developed simulation-based search strategies rooted in probabilistic modeling.

From an operator-theoretic perspective, iterative optimization schemes can be interpreted as fixed-point iterations generated by suitable mappings on a Hilbert or Banach space. This viewpoint has proven particularly fruitful in convex and variational analysis, where projection and resolvent operators play a central role. The theory of monotone operators, extensively developed by Rockafellar and Wets [12] and further systematized in variational analysis, provides powerful tools for understanding stability and convergence properties of generalized descent methods.

Mirror descent and related primal–dual schemes [13] have shown that the geometry induced by the feasible set significantly influences algorithmic behavior. In such methods, generalized gradients are combined with non-Euclidean projection operators, yielding improved performance in structured constrained settings. Nemirovski’s complexity analysis [14] established optimality bounds for first-order methods under oracle models, thereby clarifying the fundamental limits of gradient-based optimization.

More recently, learning-based optimization strategies have emerged in which algorithmic components are partially parameterized or adjusted through data-driven mechanisms. Adaptive moment methods and learned update rules modify the effective search direction based on historical or contextual information [15]. Although these methods have demonstrated impressive empirical success in large-scale machine learning applications, their mathematical interpretation often remains heuristic. In particular, the stability of learned directional updates under constraints and projection operators has not been systematically characterized within a general quasi-gradient framework.

Stochastic programming theory further emphasizes the importance of stability under perturbations. Shapiro, Dentcheva, and Ruszczyński [16] analyzed the sensitivity and convergence properties of stochastic models, highlighting the interplay between probabilistic approximation and the deterministic optimization structure. In high-dimensional settings, even small perturbations in directional estimates may lead to significant deviations in feasibility or convergence rates. Consequently, a rigorous operator-based treatment of learning-enhanced quasi-gradient updates is necessary to ensure robustness.

The present study situates learning-driven directional information within this broader analytical context. By formulating quasi-gradient updates as abstract operators endowed with projection and stability properties, it becomes possible to derive sufficient conditions under which data-driven modifications preserve descent behavior and boundedness. This approach not only unifies classical quasi-gradient methods with modern adaptive schemes but also clarifies the structural requirements that learning mechanisms must satisfy in order to maintain convergence guarantees in constrained optimization models.

1.1. Learning-Induced Bias and Operator Geometry

While classical stochastic approximation models perturbations as zero-mean noise, modern learning-enhanced optimization mechanisms often introduce structured, history-dependent corrections that cannot be interpreted purely as stochastic fluctuations. In such settings, the directional update may contain a systematic component arising from adaptive models, online regression procedures, neural architectures, or preconditioning strategies trained during the optimization process itself.

This observation raises a fundamental structural question:

Under what conditions can learned directional corrections be embedded within projected operator frameworks without destroying contraction geometry?

Unlike purely stochastic perturbations, learning-induced corrections may introduce operator-level bias. Such bias can either preserve descent structure, accelerate contraction, or, if uncontrolled, shift the fixed point away from the true minimizer. Therefore, the stability of learning-augmented optimization cannot be fully understood through variance analysis alone.

The present work adopts an operator-theoretic perspective in which learning components are interpreted as structured perturbations of projected quasi-gradient mappings. This viewpoint enables an explicit decomposition of the iterative dynamics into three interacting geometric mechanisms: (i) curvature-induced contraction, (ii) bias-induced directional distortion, and (iii) variance-induced dispersion.

By isolating these mechanisms, we provide a unified structural framework for analyzing learning-augmented constrained optimization beyond purely noise-driven stochastic approximation models.

Recent analyses in machine learning have studied biased stochastic gradient methods, where gradient estimators include systematic errors arising from sampling, approximation, or model misspecification. These works primarily focus on how bias and variance affect the convergence rates and stability of stochastic gradient descent.

In contrast, the present work adopts an operator-theoretic perspective in which learning-induced corrections are interpreted as structured perturbations of projected quasi-gradient mappings. This viewpoint goes beyond classical bias–variance analyses by explicitly incorporating contraction geometry and directional alignment effects, which determine whether learned components preserve or distort the underlying fixed-point structure.

Recent developments in projection-based algorithms and operator-theoretic methods have further expanded the applicability of constrained optimization techniques, particularly in large-scale and structured settings. For instance, recent projection-type schemes incorporating advanced descent mechanisms and line-search strategies have been proposed to improve convergence and robustness in nonlinear constrained problems [17].

In addition, modern fixed-point and inclusion-based approaches continue to refine the theoretical understanding of iterative schemes under nonexpansive and quasi-nonexpansive mappings, including recent viscosity-based and self-adaptive methods [18]. These developments highlight the ongoing relevance of operator-theoretic perspectives in the design and analysis of optimization algorithms.

The present work is aligned with this line of research, while focusing on a unified operator-theoretic framework that explicitly characterizes the interaction between contraction, bias, and stochastic effects in learning-augmented settings.

1.2. Main Contributions

The main contributions of this paper can be summarized as follows:

Operator-Theoretic Formulation. We introduce a unified operator-theoretic framework for learning-augmented quasi-gradient methods in constrained optimization, modeling data-driven directional components as structured operator perturbations of projected quasi-gradient mappings. Unlike purely stochastic models, the framework explicitly accommodates learning-induced directional corrections that may contain systematic (bias) components in addition to stochastic variability.
Contraction–Bias–Variance Decomposition. We derive an explicit error recursion that decomposes the iterative dynamics into three interacting geometric mechanisms: (i) deterministic contraction induced by curvature, (ii) bias-induced directional distortion arising from learned corrections, and (iii) variance-dependent stochastic dispersion. This decomposition provides a structural lens for understanding when learning-augmented updates preserve or alter contraction geometry.
Stability and Convergence Under Controlled Bias. We establish convergence guarantees under (i) strong convexity, (ii) the Polyak–Łojasiewicz condition, and (iii) smooth nonconvexity. The analysis shows that stability margins are preserved when learning-induced bias satisfies operator-alignment conditions and variance remains bounded, thereby extending classical stochastic approximation results beyond unbiased perturbation models.
Compatibility with Modern Learning Architectures. We demonstrate that representative learning mechanisms—including online linear models, neural network-based directional components, adaptive momentum schemes, and learned preconditioning strategies—can be embedded within the proposed operator framework while satisfying enforceable alignment and boundedness conditions.
Reproducible Spectral Validation. We provide a fully reproducible quadratic study that isolates curvature, bias, and variance effects in a controlled spectral setting, empirically validating the contraction–bias–variance recursion and illustrating how learned directional corrections interact with operator geometry.

2. Mathematical Framework and Problem Formulation

This section establishes the mathematical framework underlying the proposed learning-driven quasi-gradient formulation. The development builds upon classical results in convex and variational analysis, projection methods, and stochastic approximation theory. Foundational treatments of variational and operator-based optimization can be found in Rockafellar and Wets [12] and in the theory of monotone operators developed in Bauschke and Combettes [19].

Projection-based iterative schemes and generalized descent frameworks have been extensively studied in convex optimization [9], while nonsmooth analysis provides the theoretical basis for generalized gradients and subdifferentials [20]. In stochastic settings, recursive directional schemes originate from the stochastic approximation paradigm introduced by Robbins and Monro [1] and later extended to constrained and large-scale models.

The objective of this section is to formulate the constrained optimization problem within an operator-theoretic context that accommodates nonsmoothness, stochastic perturbations, and structural constraints. Particular emphasis is placed on identifying structural properties—such as nonexpansiveness, descent inequalities, and stability under perturbations—that will later allow the integration of learning-driven directional information while preserving convergence guarantees. The presentation proceeds by specifying the functional setting and standing assumptions, followed by a review of generalized gradients, projection operators, and quasi-gradient mappings.

2.1. Functional Setting

Let $(H, 〈 \cdot, \cdot 〉)$ be a finite-dimensional real Hilbert space endowed with the induced norm $∥ x ∥ = \sqrt{〈 x, x 〉}$. Although the analysis can be extended to infinite-dimensional Banach spaces under additional assumptions, the present study focuses on the finite-dimensional setting in order to emphasize structural properties relevant to optimization models arising in engineering and data science.

Consider the constrained optimization problem

min_{x \in C} f (x),

(1)

where $C \subset H$ is a nonempty feasible set and $f : H \to R$ is a proper function. The feasible set may represent equality or inequality constraints, box constraints, polyhedral structures, or more general convex regions. The objective function is not assumed to be smooth and may incorporate nonsmooth or data-driven components.

2.2. Standing Assumptions

Assumption 1. The feasible set $C$ is nonempty, closed, and convex.

Assumption 2. The function $f$ is locally Lipschitz continuous on an open neighborhood containing $C$.

By Rademacher's theorem, local Lipschitz continuity ensures that $f$ is differentiable almost everywhere, so that the Clarke generalized directional derivative and subdifferential are well-defined at every point of that neighborhood.

2.3. Generalized Gradients

Under Assumption 2, the Clarke generalized subdifferential of f at x is defined as

\partial_{C} f (x) : = conv \{lim_{k \to \infty} \nabla f (x_{k}) : x_{k} \to x, x_{k} \notin N_{f}\},

(2)

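where $conv (\cdot)$ denotes the convex hull of the indicated set and $N_{f}$ is a set of measure zero containing the points where $f$ fails to be differentiable. The subscript in $\partial_{C}$ refers to Clarke, not to the feasible set. The set $\partial_{C} f (x)$ is nonempty, convex, and compact.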
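As a one-dimensional illustration of definition (2), consider the standard example $f (x) = |x|$, which is not taken from the paper. Gradients along sequences approaching the kink from either side have the limits $+1$ and $-1$, and their convex hull gives $\partial_{C} f (0) = [-1, 1]$. The sketch below reproduces this; the function and variable names are hypothetical.

```python
def grad_abs(x):
    """Gradient of f(x) = |x| at a point of differentiability (x != 0)."""
    if x == 0:
        raise ValueError("f(x) = |x| is not differentiable at 0")
    return 1.0 if x > 0 else -1.0

# Two sequences x_k -> 0 that avoid the nondifferentiability set N_f = {0}.
from_right = [grad_abs(2.0 ** -k) for k in range(1, 30)]
from_left = [grad_abs(-(2.0 ** -k)) for k in range(1, 30)]

# Limits of the gradients along the two sequences.
limit_points = {from_right[-1], from_left[-1]}

# In one dimension the convex hull of finitely many points is the interval [min, max].
clarke_interval = (min(limit_points), max(limit_points))
print(clarke_interval)
```

Any value in the resulting interval is a valid generalized gradient at the origin, which is why the choice of $g (x)$ at a kink is not unique.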

In classical nonsmooth optimization, the negative of any element $g (x) \in \partial_{C} f (x)$ can be used as a search direction. However, in many applications, exact subgradients are unavailable, computationally expensive, or replaced by surrogate information.

2.4. Projection Operator and Nonexpansiveness

For any $x \in H$, the projection onto $C$ is defined as

Π_{C} (x) : = arg min_{y \in C} ∥ x - y ∥ .

(3)

Under Assumption 1, the projection is uniquely defined and satisfies the variational characterization

〈 x - Π_{C} (x), y - Π_{C} (x) 〉 \leq 0, \forall y \in C,

(4)

as well as the nonexpansive property

∥ Π_{C} (x) - Π_{C} (y) ∥ \leq ∥ x - y ∥, \forall x, y \in H .

(5)

These properties play a central role in the stability and convergence analysis of projected iterative schemes.
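Properties (4) and (5) can be checked numerically on a simple feasible set. The following sketch assumes the box $[-1, 1]^{4}$ as an illustrative choice of $C$ (not one used in the paper), for which the projection is componentwise clipping; all helper names are hypothetical.

```python
import random

def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n (componentwise clipping)."""
    return [min(max(xi, lo), hi) for xi in x]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sub(u, v):
    return [ui - vi for ui, vi in zip(u, v)]

def norm(u):
    return dot(u, u) ** 0.5

rng = random.Random(0)
worst_vi = float("-inf")   # largest sampled value of <x - P(x), y - P(x)>, cf. (4)
worst_gap = float("-inf")  # largest sampled value of ||P(x) - P(z)|| - ||x - z||, cf. (5)
for _ in range(1000):
    x = [rng.uniform(-3, 3) for _ in range(4)]
    z = [rng.uniform(-3, 3) for _ in range(4)]
    y = [rng.uniform(-1, 1) for _ in range(4)]  # an arbitrary feasible point
    px, pz = project_box(x, -1.0, 1.0), project_box(z, -1.0, 1.0)
    worst_vi = max(worst_vi, dot(sub(x, px), sub(y, px)))
    worst_gap = max(worst_gap, norm(sub(px, pz)) - norm(sub(x, z)))

print(worst_vi, worst_gap)  # both should be nonpositive up to rounding
```

Random sampling of this kind does not prove either property; it only provides a quick sanity check of a projection routine before it is used inside an iteration.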

2.5. Quasi-Gradient Mappings

Definition 1 (Quasi-gradient). A mapping $Q : H \to H$ is called a quasi-gradient of $f$ on $C$ if there exists $γ > 0$ such that, for every minimizer $x^{*}$ of problem (1),

〈 Q (x), x - x^{*} 〉 \geq γ (f (x) - f (x^{*})), \forall x \in C .

(6)

Relation (6) generalizes classical gradient inequalities in convex optimization and encompasses subgradient mappings, stochastic gradient estimators, and approximate directional rules.
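For a differentiable convex $f$, the exact gradient satisfies (6) with $γ = 1$, since convexity gives $f (x^{*}) \geq f (x) + 〈 \nabla f (x), x^{*} - x 〉$. The sketch below checks this on an illustrative separable quadratic over the box $[0, 1]^{2}$; the problem data and names are assumptions made for the example only.

```python
import random

a = [1.0, 3.0]    # curvatures of an illustrative convex quadratic
c = [2.0, -1.0]   # its unconstrained minimizer

def f(x):
    return 0.5 * sum(ai * (xi - ci) ** 2 for ai, xi, ci in zip(a, x, c))

def Q(x):
    """Exact gradient of f, used here as the quasi-gradient mapping."""
    return [ai * (xi - ci) for ai, xi, ci in zip(a, x, c)]

# For a separable quadratic on a box, the constrained minimizer is the clipped c.
x_star = [min(max(ci, 0.0), 1.0) for ci in c]

gamma = 1.0
rng = random.Random(1)
min_slack = float("inf")  # smallest sampled value of <Q(x), x - x*> - gamma (f(x) - f(x*))
for _ in range(2000):
    x = [rng.uniform(0.0, 1.0) for _ in range(2)]
    lhs = sum(qi * (xi - si) for qi, xi, si in zip(Q(x), x, x_star))
    rhs = gamma * (f(x) - f(x_star))
    min_slack = min(min_slack, lhs - rhs)

print(min_slack)  # nonnegative up to rounding when (6) holds on the samples
```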

2.6. Projected Quasi-Gradient Iteration

Given a step-size sequence $\{α_{k}\}_{k \geq 0}$ with $α_{k} > 0$, the standard projected quasi-gradient iteration is defined as

x_{k + 1} = Π_{C} (x_{k} - α_{k} Q (x_{k})) .

(7)

Iteration (7) can be interpreted as a fixed-point iteration generated by the operator

T (x) : = Π_{C} (x - α Q (x)) .

(8)

When Q satisfies appropriate monotonicity- or cocoercivity-type conditions, the operator T inherits stability properties that allow the derivation of convergence results.
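Iteration (7) translates directly into a short loop. The sketch below is a minimal illustration under stated assumptions: the quadratic objective, the box $[0, 1]^{3}$, the constant step size, and all function names are chosen for the example and do not come from the paper.

```python
def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n."""
    return [min(max(xi, lo), hi) for xi in x]

def projected_quasi_gradient(Q, project, x0, step, iters):
    """Run x_{k+1} = project(x_k - step(k) * Q(x_k)) for a fixed number of iterations."""
    x = list(x0)
    for k in range(iters):
        q = Q(x)
        alpha = step(k)
        x = project([xi - alpha * qi for xi, qi in zip(x, q)])
    return x

# Illustrative problem: f(x) = 0.5 * sum_i a_i (x_i - c_i)^2 on [0, 1]^3.
a = [1.0, 4.0, 10.0]
c = [0.3, 1.7, -0.5]

def Q(x):
    return [ai * (xi - ci) for ai, xi, ci in zip(a, x, c)]  # exact gradient

x_final = projected_quasi_gradient(
    Q,
    lambda v: project_box(v, 0.0, 1.0),
    x0=[0.5, 0.5, 0.5],
    step=lambda k: 0.1,
    iters=200,
)
print(x_final)  # approaches the clipped target (0.3, 1.0, 0.0)
```

The second and third coordinates end on the boundary of the box, which shows the projection, rather than the gradient, determining the fixed point of $T$ in those directions.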

2.7. Operator Geometry and Fixed-Point Structure

The projected quasi-gradient iteration (7) can be interpreted as a fixed-point scheme generated by the mapping (8).

Under appropriate monotonicity or curvature conditions, T acts as a contraction (or averaged) operator in a neighborhood of the solution, and convergence reduces to a geometric property of the induced mapping.

From this perspective, the stability of constrained optimization algorithms is governed not merely by descent inequalities but by preservation of contraction geometry under perturbations of the operator.

Any learning-driven modification of the directional mapping therefore alters the underlying fixed-point operator. The central question is whether such modifications preserve contraction structure, distort it, or shift the fixed point.

This operator-geometric viewpoint provides the structural basis for the analysis developed in the subsequent sections.

2.8. Stochastic Perturbations

In many modern optimization settings, the mapping $Q$ is not deterministic. Instead, one observes a stochastic approximation $Q_{k} (x)$ satisfying

E [Q_{k} (x) ∣ F_{k}] = Q (x),

(9)

where $\{F_{k}\}$ is a filtration representing accumulated information. This setting encompasses classical stochastic approximation as well as learning-based directional updates constructed from sampled or learned information.

Remark 1 (On Conditional Bias). Assumption (9) imposes the conditional unbiasedness of the stochastic quasi-gradient, namely

E [Q_{k} (x) ∣ F_{k}] = Q (x) .

This excludes the presence of a systematic drift term

b_{k} (x) : = E [Q_{k} (x) ∣ F_{k}] - Q (x) .

If a nonzero bias $b_{k} (x)$ were present, the fundamental error recursion would contain an additional first-order term of the form

- 2 α_{k} 〈 b_{k} (x_{k}), x_{k} - x^{*} 〉,

which could modify the asymptotic behavior of the iteration and potentially shift the limit point away from the true minimizer.

The present work deliberately focuses on the unbiased setting in order to isolate the structural contraction–variance decomposition. Under unbiasedness, the learning-driven component acts purely as a second-order stochastic perturbation, entering the recursion through variance terms of order $α_{k}^{2}$. This allows a transparent analytical separation between curvature-induced contraction and variance-induced dispersion. The treatment of systematically biased learning components would require additional drift-control conditions and falls outside the scope of the current operator-theoretic analysis.

2.9. Motivation for Learning-Driven Extensions

While classical quasi-gradient methods assume that $Q (x)$ approximates a subgradient of $f$, emerging optimization paradigms incorporate directional rules obtained from data-driven models. Such rules may depend on historical iterates, structural features, or learned parameters. The challenge lies in embedding these mechanisms within a mathematically consistent framework that preserves descent and stability properties.

The subsequent section introduces a learning-driven quasi-gradient operator that extends formulation (7) by allowing the directional mapping to depend on learned components, while maintaining the structural guarantees required for convergence analysis.

3. Learning-Augmented Quasi-Gradient Operator

This section introduces the central analytical object of the present study: the learning-augmented quasi-gradient operator. The construction extends the classical projected quasi-gradient framework by incorporating structured, adaptive directional corrections while preserving the geometric foundations established in Section 2. Throughout this section, all operators act on the finite-dimensional real Hilbert space $(H, 〈 \cdot, \cdot 〉)$ introduced in Section 2.

3.1. From Stochastic Perturbations to Learning-Augmented Operators

Classical stochastic approximation models directional uncertainty as zero-mean noise superimposed on an underlying quasi-gradient mapping. In such formulations, the update direction takes the form

Q_{k} (x) = Q (x) + ξ_{k} (x), E [ξ_{k} (x) ∣ F_{k}] = 0,

where $ξ_{k}$ represents stochastic variability and the conditional unbiasedness assumption ensures that the descent structure is preserved in expectation.

Modern optimization mechanisms, however, frequently incorporate adaptive components trained during the optimization process itself. These components may depend on historical iterates, learned parameters, auxiliary regression models, neural architectures, or adaptive preconditioning strategies. Such mechanisms introduce structured directional corrections that cannot be interpreted purely as zero-mean stochastic perturbations.

To capture this broader setting, we introduce the notion of a learning-augmented operator, which explicitly separates systematic learned corrections from residual stochastic dispersion.

3.2. Bias–Variance Decomposition

Let $Q : H \to H$ be a quasi-gradient mapping satisfying the structural conditions established previously.

A learning-augmented quasi-gradient is defined as

{\hat{Q}}_{k} (x) = Q (x) + L_{k} (x),

(10)

where the learning component $L_{k} : H \to H$ admits the decomposition

L_{k} (x) = b_{k} (x) + ξ_{k} (x),

(11)

with

E [ξ_{k} (x) ∣ F_{k}] = 0 .

Relation (11) resembles classical bias–variance decompositions commonly used in the analysis of stochastic gradient methods in machine learning. However, the role of the decomposition in the present framework is fundamentally different.

In standard biased stochastic gradient analyses, bias is typically treated as an additive perturbation affecting convergence rates, while variance contributes to stochastic dispersion. These analyses are primarily statistical in nature and do not explicitly account for the geometric structure of the underlying optimization operator.

By contrast, in the proposed operator-theoretic formulation, the bias term $b_{k} (x)$ is interpreted as a structured directional perturbation that directly modifies the geometry of the induced fixed-point operator. This leads to the notion of operator alignment, which determines whether the contraction properties induced by curvature are preserved or distorted.

As a result, the proposed contraction–bias–variance decomposition is not merely a statistical separation of error sources but a geometric decomposition that explicitly characterizes how learned components interact with the contraction structure of the optimization dynamics.

The mapping $b_{k} : H \to H$ represents a structured, possibly history-dependent learned correction (operator bias), while $ξ_{k} : H \to H$ captures residual stochastic variability.

The associated projected iteration becomes

x_{k + 1} = Π_{C} (x_{k} - α_{k} {\hat{Q}}_{k} (x_{k})),

(12)

where $Π_{C} : H \to C$ denotes the projection operator defined in Section 2.

Formulation (12) generalizes classical stochastic quasi-gradient schemes by explicitly separating three geometric mechanisms:

Curvature-induced contraction generated by Q;
Bias-induced geometric distortion generated by $b_{k}$;
Variance-induced dispersion generated by $ξ_{k}$.

This decomposition makes it explicit that learning may alter the geometry of the underlying fixed-point operator through the bias term, while stochastic variability contributes dispersion without systematic directional distortion.
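The three mechanisms in iteration (12) can be made concrete in a small simulation. The sketch below is only illustrative: the diagonal quadratic, the ball constraint, the correction $b (x) = -0.3\, Q (x)$ standing in for a learned component, the Gaussian noise, and every function name are assumptions introduced for the example.

```python
import random

def project_ball(x, radius):
    """Euclidean projection onto the closed ball of the given radius centred at 0."""
    n = sum(xi * xi for xi in x) ** 0.5
    return list(x) if n <= radius else [radius * xi / n for xi in x]

def augmented_step(x, alpha, Q, bias, noise, project):
    """One step x+ = project(x - alpha * (Q(x) + b(x) + xi)), cf. iteration (12)."""
    q, b, xi = Q(x), bias(x), noise()
    return project([v - alpha * (qv + bv + nv) for v, qv, bv, nv in zip(x, q, b, xi)])

mu = [1.0, 5.0]  # eigenvalues of a diagonal quadratic with minimizer x* = 0
rng = random.Random(0)

def Q(x):
    return [m * v for m, v in zip(mu, x)]           # curvature-induced contraction

def bias(x):
    return [-0.3 * m * v for m, v in zip(mu, x)]    # hypothetical correction cancelling 30% of Q

def noise():
    return [rng.gauss(0.0, 0.1) for _ in mu]        # zero-mean dispersion

x = [2.0, -2.0]
for k in range(2000):
    x = augmented_step(x, 0.05, Q, bias, noise, lambda v: project_ball(v, 3.0))

dist_sq = sum(v * v for v in x)  # squared distance to x* = 0 after the run
print(dist_sq)
```

With the noise switched off, the iterates contract to the minimizer at a rate slowed by the 30% cancellation. With noise, they settle in a small neighborhood whose size scales with the step size and the noise variance.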

3.3. Operator Alignment and Contraction Geometry

The projected iteration (12) defines a k-dependent fixed-point operator

T_{k} (x) = Π_{C} (x - α_{k} {\hat{Q}}_{k} (x)) .

In the absence of learning corrections, contraction properties of $T_{k}$ are governed by curvature conditions imposed on $Q$. The introduction of a bias term $b_{k}$ alters the operator geometry.

Stability therefore depends on whether the learned correction preserves contraction structure.

We impose the following operator-alignment condition: there exists $η \in [0, 1)$ such that, for all $x \in C$ and all $k$,

\left| 〈 b_{k} (x), x - x^{*} 〉 \right| \leq η 〈 Q (x), x - x^{*} 〉 .

(13)

Condition (13) states that the learned correction is not allowed to counteract more than an $η$ fraction of the descent geometry induced by the base quasi-gradient.

When $η < 1$, the contraction structure generated by curvature is preserved, possibly with modified constants. If the alignment condition fails, the operator may lose its contraction property or shift the fixed point.

This geometric perspective shows that stability of learning-augmented optimization cannot be characterized solely through variance bounds. Instead, it depends on the interaction between curvature, bias alignment, and stochastic dispersion.
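In practice, alignment can be monitored by comparing the two inner products that appear in condition (13). The helper below is a hypothetical diagnostic, not part of the paper: it returns the ratio $〈 b (x), x - x^{*} 〉 / 〈 Q (x), x - x^{*} 〉$, and a magnitude that stays below some $η < 1$ over the points of interest indicates that the correction cancels or adds at most that fraction of the base descent term. It assumes $x^{*}$ (or a proxy for it) is available, which holds only in controlled test problems.

```python
def alignment_ratio(Q, bias, x, x_star):
    """Return <b(x), x - x*> / <Q(x), x - x*>, or None when the denominator vanishes."""
    d = [xi - si for xi, si in zip(x, x_star)]
    num = sum(bi * di for bi, di in zip(bias(x), d))
    den = sum(qi * di for qi, di in zip(Q(x), d))
    return None if den == 0 else num / den

mu = [1.0, 5.0]  # illustrative diagonal quadratic with minimizer x* = 0

def Q(x):
    return [m * v for m, v in zip(mu, x)]

def opposing_bias(x):
    return [-0.3 * m * v for m, v in zip(mu, x)]   # cancels 30% of the base direction

def orthogonal_bias(x):
    return [x[1], -x[0]]                           # rotation: no component along x - x*

x, x_star = [1.0, 2.0], [0.0, 0.0]
ratio_opposing = alignment_ratio(Q, opposing_bias, x, x_star)
ratio_orthogonal = alignment_ratio(Q, orthogonal_bias, x, x_star)
print(ratio_opposing, ratio_orthogonal)
```

The orthogonal correction has ratio zero even though its norm is not small, which illustrates that alignment constrains only the component of the bias along $x - x^{*}$. Its magnitude still enters the error recursion through the second-order term.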

Relation (11) is reminiscent of bias–variance decompositions commonly used in the analysis of stochastic gradient methods in machine learning. In such settings, gradient estimators are often modeled as

g_{k} = \nabla f (x_{k}) + b_{k} + ξ_{k},

where $b_{k}$ represents a systematic bias and $ξ_{k}$ a zero-mean stochastic perturbation. This formulation appears in inexact gradient methods [21], stochastic optimization with errors [22], and modern large-scale learning frameworks [10,23].

In these approaches, the analysis primarily focuses on how bias and variance affect convergence rates, typically leading to error bounds expressed in terms of $∥ b_{k} ∥^{2}$ and variance levels.

However, these analyses remain fundamentally statistical and do not explicitly characterize how biased updates modify the geometry of the underlying optimization operator.

By contrast, in the present framework, the decomposition (11) is embedded within an operator-theoretic formulation. The bias term $b_{k} (x)$ is interpreted as a structured directional perturbation that directly affects the contraction properties of the induced mapping. This leads to the alignment condition (13), which ensures that learned corrections do not destroy the descent geometry.

Consequently, the proposed contraction–bias–variance decomposition provides a geometric interpretation of stability: curvature induces contraction, bias introduces directional distortion through alignment, and stochastic components generate dispersion. This perspective goes beyond classical analyses by explicitly linking learning-induced bias to operator geometry.

The next section derives the fundamental contraction–bias–variance recursion governing this interaction.

4. Convergence Analysis

This section establishes stability properties and convergence rates for the learning-augmented projected quasi-gradient scheme introduced in Section 3. In contrast to purely stochastic perturbation models, the analysis explicitly accounts for structured bias and variance components through the contraction–bias–variance decomposition.

4.1. Additional Structural Assumptions

To derive explicit rates and stability bounds, the following additional conditions are imposed.

Assumption 3. Let $f : H \to R$ denote the objective function of the constrained problem $\min_{x \in C} f (x)$. Assume that $f$ is $μ$-strongly convex on $C$ for some $μ > 0$.

Assumption 4 (Operator-Alignment Condition). There exists $η \in [0, 1)$ such that, for all $x \in C$ and all $k$,

\left| 〈 b_{k} (x), x - x^{*} 〉 \right| \leq η 〈 Q (x), x - x^{*} 〉 .

(14)

Assumption 5 (Variance Bound). The stochastic component satisfies

E [ξ_{k} (x_{k}) ∣ F_{k}] = 0, E [∥ ξ_{k} (x_{k}) ∥^{2} ∣ F_{k}] \leq σ^{2},

(15)

for some finite constant $σ^{2}$.

4.2. Learning-Augmented Iteration

The iterative scheme under analysis is

x_{k + 1} = Π_{C} (x_{k} - α_{k} (Q (x_{k}) + b_{k} (x_{k}) + ξ_{k} (x_{k}))) .

(16)

4.3. Fundamental Contraction–Bias–Variance Recursion

Assumption 6. The quasi-gradient mapping $Q$ has at most linear growth on $C$, i.e., there exist constants $a, b \geq 0$ such that

∥ Q (x) ∥ \leq a + b ∥ x ∥, \forall x \in C .

Since the sequence $\{x_{k}\}$ is bounded (cf. Proposition 2), Assumption 6 implies the existence of a constant $G > 0$ such that

∥ Q (x_{k}) ∥ \leq G \quad \text{for all } k .

Let $x^{*}$ denote the unique minimizer.

Lemma 1. Under Assumptions 3–6, the sequence generated by (16) satisfies

E [∥ x_{k + 1} - x^{*} ∥^{2} ∣ F_{k}] \leq (1 - 2 μ (1 - η) α_{k}) {∥ x_{k} - x^{*} ∥}^{2} + α_{k}^{2} (2 G^{2} + 2 {∥ b_{k} (x_{k}) ∥}^{2} + σ^{2}) .

(17)

Proof.

By the nonexpansiveness of the projection operator,

{∥ x_{k + 1} - x^{*} ∥}^{2} \leq {∥ x_{k} - α_{k} (Q (x_{k}) + b_{k} (x_{k}) + ξ_{k} (x_{k})) - x^{*} ∥}^{2} .

Expanding,

\begin{matrix} ∥ x_{k + 1} - x^{*} ∥^{2} & \leq ∥ x_{k} - x^{*} ∥^{2} - 2 α_{k} 〈 Q (x_{k}) + b_{k} (x_{k}) + ξ_{k} (x_{k}), x_{k} - x^{*} 〉 \\ + α_{k}^{2} {∥ Q (x_{k}) + b_{k} (x_{k}) + ξ_{k} (x_{k}) ∥}^{2} . \end{matrix}

(18)

Taking the conditional expectation eliminates the cross term with

ξ_{k}

.

Strong convexity implies

〈 Q (x_{k}), x_{k} - x^{*} 〉 \geq μ {∥ x_{k} - x^{*} ∥}^{2} .

Using the alignment condition,

〈 b_{k} (x_{k}), x_{k} - x^{*} 〉 \leq η μ {∥ x_{k} - x^{*} ∥}^{2} .

Therefore,

〈 Q (x_{k}) + b_{k} (x_{k}), x_{k} - x^{*} 〉 \geq μ (1 - η) {∥ x_{k} - x^{*} ∥}^{2} .

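Finally, since ξ_{k} (x_{k}) has zero conditional mean, its cross terms with Q (x_{k}) + b_{k} (x_{k}) vanish in conditional expectation, so that

E [{∥ a + b + c ∥}^{2} ∣ F_{k}] \leq 2 {∥ a ∥}^{2} + 2 {∥ b ∥}^{2} + E [{∥ c ∥}^{2} ∣ F_{k}]

with a = Q (x_{k}), b = b_{k} (x_{k}), and c = ξ_{k} (x_{k}). Together with the boundedness of Q and the variance bound, this yields (17). □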

4.4. Rates Under Strong Convexity

Theorem 1.

Let

α_{k} = \frac{1}{μ (1 - η) (k + 1)}

. Then

E ∥ x_{k} - x^{*} ∥^{2} \leq \frac{C}{k + 1}

(19)

for some constant

C > 0

depending on the initial error

u_{0} = E ∥ x_{0} - x^{*} ∥^{2}

, the parameters μ and η, and the quantity

B = 2 G^{2} + 2 {sup}_{t \geq 0} E {∥ b_{t} (x_{t}) ∥}^{2} + σ^{2}

.

Proof.

Let

u_{k} : = E ∥ x_{k} - x^{*} ∥^{2}, B : = 2 G^{2} + 2 sup_{t \geq 0} E {∥ b_{t} (x_{t}) ∥}^{2} + σ^{2} .

Taking the full expectation in (17) and using

E ∥ b_{k} (x_{k}) ∥^{2} \leq {sup}_{t \geq 0} E {∥ b_{t} (x_{t}) ∥}^{2}

yields

u_{k + 1} \leq (1 - 2 μ (1 - η) α_{k}) u_{k} + α_{k}^{2} B .

(20)

With

α_{k} = \frac{1}{μ (1 - η) (k + 1)}

, we obtain

u_{k + 1} \leq (1 - \frac{2}{k + 1}) u_{k} + \frac{B}{μ^{2} {(1 - η)}^{2} {(k + 1)}^{2}} .

A standard induction (or comparison lemma) shows that

u_{k} \leq \frac{C}{k + 1}

for some constant

C > 0

depending on

u_{0}

,

μ

,

η

, and B, which proves (19). □
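The induction behind (19) can be checked numerically by iterating (20) with equality, which is the worst case the recursion allows. In the sketch below all numerical values are hypothetical, and the constant C = max{u_0, 2B / (μ(1 - η))^2} is one admissible choice worked out for this illustration under the assumption u_0 ≥ 0; it is not a constant stated in the paper.

```python
def worst_case_recursion(u0, mu, eta, B, num_steps):
    """Iterate (20) with equality, alpha_k = 1 / (mu * (1 - eta) * (k + 1))."""
    m = mu * (1.0 - eta)
    u = [u0]
    for k in range(num_steps):
        alpha = 1.0 / (m * (k + 1))
        u.append((1.0 - 2.0 * m * alpha) * u[-1] + alpha**2 * B)
    return u

mu, eta, B, u0 = 0.5, 0.4, 3.0, 10.0               # hypothetical values
C = max(u0, 2.0 * B / (mu * (1.0 - eta)) ** 2)     # candidate constant for (19)
u = worst_case_recursion(u0, mu, eta, B, num_steps=10_000)
ok = all(u_k <= C / (k + 1) + 1e-12 for k, u_k in enumerate(u))
print(ok)
```

Because the factor 1 - 2/(k + 1) vanishes at k = 1, the initial error drops out of the recursion after two steps, which is why the constant is eventually dominated by B / (μ(1 - η))^2.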

Theorem 2.

If

α_{k} \equiv α

with

0 < α < \frac{1}{2 μ (1 - η)},

then

\underset{k \to \infty}{lim sup} E {∥ x_{k} - x^{*} ∥}^{2} \leq \frac{α (2 G^{2} + 2 {sup}_{k} E ∥ b_{k} (x_{k}) ∥^{2} + σ^{2})}{2 μ (1 - η) - 2 μ^{2} {(1 - η)}^{2} α} .

(21)

Proof.

Let

u_{k} : = E {∥ x_{k} - x^{*} ∥}^{2}

and define

B : = 2 G^{2} + 2 sup_{t \geq 0} E {∥ b_{t} (x_{t}) ∥}^{2} + σ^{2} .

Under a constant step size

α_{k} \equiv α

, taking the expectation in (17) yields

u_{k + 1} \leq (1 - 2 μ (1 - η) α) u_{k} + α^{2} B .

(22)

If

0 < α < \frac{1}{2 μ (1 - η)}

, then

ρ : = 1 - 2 μ (1 - η) α \in (0, 1)

, and iterating (22) gives

u_{k} \leq ρ^{k} u_{0} + α^{2} B \sum_{t = 0}^{k - 1} ρ^{t} = ρ^{k} u_{0} + α^{2} B \frac{1 - ρ^{k}}{1 - ρ},

where the constant B captures the combined effect of gradient growth, bias magnitude, and stochastic variance. In particular, the asymptotic error bound depends explicitly on the step size

α

, the strong convexity parameter

μ

, the alignment parameter

η

, and the combined second-moment bound B.

Taking

{lim sup}_{k \to \infty}

and using

1 - ρ = 2 μ (1 - η) α

yields

\underset{k \to \infty}{lim sup} u_{k} \leq \frac{α^{2} B}{2 μ (1 - η) α} = \frac{α B}{2 μ (1 - η)},

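Since

0 < μ (1 - η) α < \frac{1}{2}

, the denominator of (21) satisfies

0 < 2 μ (1 - η) - 2 μ^{2} {(1 - η)}^{2} α = 2 μ (1 - η) (1 - μ (1 - η) α) < 2 μ (1 - η),

so the bound above is sharper than, and therefore implies, (21). □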
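The constant-step behavior can be illustrated by iterating (22) with equality and comparing the limit with the two bounds discussed in the proof. The function names and parameter values below are hypothetical; the single constant `B` stands in for the combined second-moment bound in both expressions.

```python
def steady_state_bounds(mu, eta, alpha, B):
    """Return (bound derived in the proof, right-hand side of (21))."""
    m = mu * (1.0 - eta)
    if not 0.0 < alpha < 1.0 / (2.0 * m):
        raise ValueError("need 0 < alpha < 1 / (2 * mu * (1 - eta))")
    tight = alpha * B / (2.0 * m)
    stated = alpha * B / (2.0 * m - 2.0 * m**2 * alpha)
    return tight, stated

def iterate_constant_step(u0, mu, eta, alpha, B, num_steps):
    """Iterate (22) with equality for a constant step size."""
    rho = 1.0 - 2.0 * mu * (1.0 - eta) * alpha
    u = u0
    for _ in range(num_steps):
        u = rho * u + alpha**2 * B
    return u

mu, eta, alpha, B, u0 = 1.0, 0.5, 0.2, 4.0, 25.0   # hypothetical values
tight, stated = steady_state_bounds(mu, eta, alpha, B)
u_limit = iterate_constant_step(u0, mu, eta, alpha, B, num_steps=2000)
print(u_limit, tight, stated)
```

The worst-case recursion settles at αB / (2μ(1 - η)), and the right-hand side of (21) is slightly larger because its denominator is smaller.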

The preceding results complete the quantitative analysis of the learning-augmented projected quasi-gradient scheme under strong convexity. The fundamental recursion (17) makes explicit how curvature, alignment, and stochastic variability jointly determine stability margins and attainable rates.

In particular, the contraction factor is no longer governed solely by the strong convexity parameter

μ

, but by the effective quantity

μ (1 - η)

, reflecting geometric distortion induced by the learned bias. Meanwhile, second-order effects accumulate through both variance and the magnitude of the bias term.

These findings establish a precise structural link between operator alignment and convergence behavior, thereby extending classical stochastic approximation theory to learning-augmented settings.

Consequences and Theoretical Implications

The preceding results demonstrate that learning-augmented quasi-gradient operators preserve contraction properties induced by strong convexity whenever the alignment parameter satisfies

η < 1

. In this regime, the effective contraction constant becomes

μ (1 - η)

, revealing explicitly how geometric distortion introduced by learned bias modifies the curvature-induced stability margin.

Diminishing step sizes yield vanishing error with an

O (1 / k)

rate, where the constants depend jointly on curvature, alignment, variance, and the magnitude of the learned correction. Under constant step sizes, the iteration converges to a bounded neighborhood whose radius scales with both stochastic dispersion (

σ^{2}

) and the accumulated bias magnitude (

{sup}_{k} {∥ b_{k} (x_{k}) ∥}^{2}

).

The recursion (17) reveals that learning-augmented perturbations influence stability through two structurally distinct mechanisms: (i) alignment distortion, quantified by

η

and reflected in the modified contraction factor, and (ii) second-order dispersion effects, arising from both variance and bias magnitude in the quadratic term.

When

η = 0

and

b_{k} \equiv 0

, the classical stochastic approximation recursion is recovered. More generally, the analysis shows that variance control alone is insufficient to guarantee stability: operator alignment is a fundamental structural requirement for preserving contraction geometry.

This perspective highlights a key distinction with respect to existing analyses of biased stochastic gradients. While classical results characterize the impact of bias and variance on convergence rates, they do not explicitly capture how structured directional corrections alter the geometry of the underlying operator.

The present framework shows that stability is governed not only by variance control, but also by alignment conditions that ensure preservation of contraction geometry. This provides a sharper structural criterion for the integration of learning-based components into constrained optimization algorithms.

5. Extensions Beyond Strong Convexity

The previous section established convergence guarantees under strong convexity through a contraction–bias–variance recursion. This section extends the analysis to broader settings relevant in modern machine learning and structured optimization, namely the Polyak–Łojasiewicz (PL) regime and smooth nonconvex problems.

In both regimes, the alignment parameter

η

and the bias magnitude continue to influence stability through modified contraction margins and second-order dispersion effects.

5.1. Polyak–Łojasiewicz Condition

The PL condition provides a relaxation of strong convexity while still guaranteeing global convergence.

Assumption 7

(Polyak–Łojasiewicz Condition). There exists

μ > 0

such that

\frac{1}{2} {∥ \nabla f (x) ∥}^{2} \geq μ (f (x) - f (x^{*})), \forall x \in C .

Unlike strong convexity, the PL condition does not require convexity of f, and it has been reported to hold for some overparameterized machine learning models.
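A one-dimensional function often used to illustrate gradient dominance without convexity is f(x) = x^2 + 3 sin^2(x). The Python sketch below estimates the PL ratio on a grid and confirms that the second derivative changes sign. The choice of function, interval, and grid is an assumption made for this illustration, and a grid check is numerical evidence on a bounded interval rather than a proof.

```python
import numpy as np

# Hypothetical illustration: f has its minimum f* = 0 at x* = 0.
f = lambda x: x**2 + 3.0 * np.sin(x) ** 2
grad = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)
second = lambda x: 2.0 + 6.0 * np.cos(2.0 * x)

xs = np.linspace(-10.0, 10.0, 200_001)
xs = xs[np.abs(xs) > 1e-6]                 # the ratio is 0/0 at the minimizer
pl_ratio = 0.5 * grad(xs) ** 2 / f(xs)     # Assumption 7 requires pl_ratio >= mu
mu_grid = float(pl_ratio.min())            # grid estimate of the PL constant
has_negative_curvature = bool((second(xs) < 0).any())
print(mu_grid, has_negative_curvature)
```

A strictly positive `mu_grid` together with negative curvature somewhere on the grid is consistent with a PL function that is not convex.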

Theorem 3.

Suppose Assumptions 4–7 hold and that f is L-smooth. Let

α_{k} \equiv α

be sufficiently small. Then

E [f (x_{k}) - f (x^{*})] \leq {(1 - μ (1 - η) α)}^{k} (f (x_{0}) - f (x^{*})) + O (α (σ^{2} + sup_{k} ∥ b_{k} (x_{k}) ∥^{2})) .

Proof.

Using smoothness and the descent lemma,

f (x_{k + 1}) \leq f (x_{k}) - α 〈 \nabla f (x_{k}), Q (x_{k}) + b_{k} (x_{k}) + ξ_{k} (x_{k}) 〉 + O (α^{2} ∥ {\hat{Q}}_{k} (x_{k}) ∥^{2}) .

Taking the conditional expectation eliminates the cross term involving

ξ_{k}

. Under the PL inequality and the alignment condition,

〈 \nabla f (x_{k}), Q (x_{k}) + b_{k} (x_{k}) 〉 \geq μ (1 - η) (f (x_{k}) - f (x^{*})) .

Iterating yields geometric decay with the modified contraction factor

μ (1 - η)

plus a residual term depending on both variance and bias magnitude. □

Remark 2.

Under the PL condition, the learning-augmented scheme achieves linear convergence up to a neighborhood whose size depends on both stochastic variance and bias magnitude. The effective contraction constant becomes

μ (1 - η)

, highlighting the structural role of operator alignment even in nonconvex gradient-dominant settings.

5.2. Smooth Nonconvex Case

We now consider the case where f is L-smooth but not necessarily convex.

Assumption 8.

The function f is continuously differentiable and satisfies

∥ \nabla f (x) - \nabla f (y) ∥ \leq L ∥ x - y ∥, \forall x, y \in C .

Theorem 4.

Under Assumptions 5, 6, and 8, consider the learning-augmented iteration (16).

Let the step size be chosen as

α = \frac{c}{\sqrt{K}}

for some constant

c > 0

. Then the iterates satisfy

min_{0 \leq k \leq K} E {∥ \nabla f (x_{k}) ∥}^{2} \leq \frac{2 (f (x_{0}) - f^{*})}{c \sqrt{K}} + O (\frac{σ^{2} + {sup}_{0 \leq k \leq K} {∥ b_{k} (x_{k}) ∥}^{2}}{\sqrt{K}}) .

(23)

Proof.

Using the L-smoothness of f, we have

f (x_{k + 1}) \leq f (x_{k}) - α 〈 \nabla f (x_{k}), Q (x_{k}) + b_{k} (x_{k}) + ξ_{k} (x_{k}) 〉 + \frac{L}{2} α^{2} {∥ Q (x_{k}) + b_{k} (x_{k}) + ξ_{k} (x_{k}) ∥}^{2} .

Taking the conditional expectation and using

E [ξ_{k} (x_{k}) ∣ F_{k}] = 0

gives

E [f (x_{k + 1})] \leq E [f (x_{k})] - α E 〈 \nabla f (x_{k}), Q (x_{k}) + b_{k} (x_{k}) 〉 + \frac{L}{2} α^{2} E {∥ Q (x_{k}) + b_{k} (x_{k}) + ξ_{k} (x_{k}) ∥}^{2} .

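Since ξ_{k} (x_{k}) has zero conditional mean, its cross terms with Q (x_{k}) + b_{k} (x_{k}) vanish in expectation, which gives the quadratic bound

E {∥ a + b + c ∥}^{2} \leq 2 E {∥ a ∥}^{2} + 2 E {∥ b ∥}^{2} + E {∥ c ∥}^{2}

for a = Q (x_{k}), b = b_{k} (x_{k}), and c = ξ_{k} (x_{k}). Together with the boundedness of Q and the variance bound, we obtain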

E ∥ Q (x_{k}) + b_{k} (x_{k}) + ξ_{k} (x_{k}) ∥^{2} \leq 2 G^{2} + 2 sup_{0 \leq t \leq K} {∥ b_{t} (x_{t}) ∥}^{2} + σ^{2} .

Summing the smoothness inequality from

k = 0

to

K - 1

yields

α \sum_{k = 0}^{K - 1} E {∥ \nabla f (x_{k}) ∥}^{2} \leq f (x_{0}) - f^{*} + O (α K (σ^{2} + sup_{t \leq K} ∥ b_{t} (x_{t}) ∥^{2})) .

Dividing by

α K

and choosing

α = c / \sqrt{K}

gives

min_{0 \leq k \leq K} E {∥ \nabla f (x_{k}) ∥}^{2} = O (\frac{1}{\sqrt{K}}) + O (\frac{σ^{2} + {sup}_{t \leq K} {∥ b_{t} (x_{t}) ∥}^{2}}{\sqrt{K}}),

which proves (23). □

Implications of the Extended Regimes

The PL extension shows that learning-augmented quasi-gradient operators retain linear convergence characteristics even in the absence of convexity, provided the objective satisfies a gradient-dominance property and the alignment parameter satisfies

η < 1

. In this regime, the effective contraction factor becomes

μ (1 - η)

, demonstrating that operator alignment plays a structural role beyond the strongly convex setting.

The nonconvex result demonstrates that the framework remains compatible with standard complexity guarantees for stochastic first-order methods. Specifically, the learning-augmented component affects only second-order constants through variance and bias magnitude, while preserving the fundamental

O (1 / \sqrt{K})

stationarity rate typical of first-order stochastic methods [24,25].

Together, these extensions reveal a unifying principle: learning-induced corrections do not alter the qualitative complexity class of the underlying optimization problem, provided that alignment is maintained and dispersion remains controlled. Curvature governs contraction, alignment governs geometric distortion, and stochastic variability governs residual dispersion.

These results confirm that the proposed operator-theoretic formulation extends naturally from convex optimization to gradient-dominant and smooth nonconvex models, thereby significantly broadening its theoretical scope within modern large-scale machine learning.

5.3. Extension to Mirror Descent Geometry

The present analysis is formulated in a Hilbert space with Euclidean projection, which allows the use of the nonexpansiveness of the projection operator and a squared-norm error recursion. However, many modern constrained optimization methods are more naturally expressed in non-Euclidean geometries through Bregman divergences and mirror descent.

Let h be a differentiable strongly convex distance-generating function, and let

D_{h} (x, y) = h (x) - h (y) - 〈 \nabla h (y), x - y 〉

denote the associated Bregman divergence. In a mirror descent setting, the projected quasi-gradient step would be replaced by a mirror update of the form

x_{k + 1} = arg min_{x \in C} \{α_{k} 〈 Q (x_{k}) + b_{k} (x_{k}) + ξ_{k} (x_{k}), x 〉 + D_{h} (x, x_{k})\} .

In this geometry, the basic recursion would no longer be written in terms of

∥ x_{k} - x^{*} ∥^{2}

, but rather in terms of

D_{h} (x^{*}, x_{k})

or related Bregman error measures.
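For a concrete instance of the mirror update, take C to be the probability simplex and h the negative entropy, in which case the Bregman divergence is the Kullback–Leibler divergence and the update has a closed multiplicative form. This choice of geometry, the function names, and the numerical values are assumptions made for illustration; the paper does not develop this case.

```python
import numpy as np

def entropic_mirror_step(x, g, alpha):
    """Mirror update on the simplex with h(x) = sum_i x_i log x_i.

    Solves argmin over the simplex of alpha * <g, z> + KL(z, x); the closed
    form is z_i proportional to x_i * exp(-alpha * g_i).
    """
    logits = np.log(x) - alpha * g
    logits -= logits.max()            # stabilize the exponentials
    z = np.exp(logits)
    return z / z.sum()

def kl(z, x):
    """Bregman divergence of the negative entropy between simplex points."""
    return float(np.sum(z * np.log(z / x)))

x = np.array([0.2, 0.3, 0.5])
g = np.array([1.0, -0.5, 0.25])       # stands in for Q + b + xi evaluated at x
alpha = 0.7
z = entropic_mirror_step(x, g, alpha)
objective = lambda y: alpha * float(g @ y) + kl(y, x)
print(z, objective(z))
```

The iterate stays strictly inside the simplex without an explicit projection, which is the practical appeal of this geometry.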

From this viewpoint, extending the contraction–bias–variance decomposition to mirror descent appears conceptually natural but technically nontrivial. The Euclidean contraction mechanism would need to be replaced by a relative descent or relative strong convexity argument with respect to h, while the stochastic variance term would have to be controlled in the dual geometry induced by the mirror map.

Likewise, the alignment condition

〈 b_{k} (x), x - x^{*} 〉 \leq η 〈 Q (x), x - x^{*} 〉

would not transfer verbatim. A more appropriate formulation would involve a relative geometric condition ensuring that the learned correction does not destroy the descent structure induced by the Bregman geometry. Depending on the model, this may take the form of a relative continuity condition, a relative strong monotonicity condition, or an alignment inequality expressed in primal–dual variables through

\nabla h

.

These issues point to an interesting extension of the present framework. While a full development of the mirror descent case lies beyond the scope of this paper, the operator-theoretic decomposition proposed here suggests that an analogous curvature–alignment–dispersion principle should remain valid in non-Euclidean constrained optimization, provided that the corresponding relative geometric conditions are properly imposed.

6. Learning-Augmented Architectures Consistent with the Framework

This section illustrates how concrete learning mechanisms can be embedded within the proposed learning-augmented operator structure while satisfying the structural assumptions required for stability and convergence. The objective is to demonstrate that widely used learning architectures are compatible with the contraction–bias–variance framework developed in previous sections.

The constructions below show that the learning-augmented quasi-gradient operator encompasses classical stochastic approximation schemes, online convex optimization methods [26,27], reinforcement learning and neuro-dynamic programming approaches [28], and modern deep learning update rules [15] within a unified structural model.

In each case, the learning component admits the decomposition

L_{k} (x) = b_{k} (x) + ξ_{k} (x),

where

b_{k} (x)

represents a structured learned correction (bias) and

ξ_{k} (x)

captures residual stochastic variability. Stability is ensured when boundedness, variance control, and operator-alignment conditions are satisfied.

6.1. Linear Online Learning Component

Consider a parametric directional model of the form

L_{k} (x) = W_{k} ϕ (x),

(24)

where

ϕ (x) \in R^{m}

is a feature mapping and

W_{k} \in R^{n \times m}

is updated recursively.

Online convex optimization (OCO) methods [26,27] provide natural update rules of the form

W_{k + 1} = W_{k} - η_{k} \nabla ℓ_{k} (W_{k}),

(25)

where

ℓ_{k}

is a convex loss constructed from sampled directional information.

From the perspective of the learning-augmented operator framework, the induced mapping can be interpreted as a structured bias term

b_{k} (x) = W_{k} ϕ (x)

together with stochastic fluctuations arising from sampling noise in (25).

Proposition 1.

If the feature mapping

ϕ (x)

is bounded on

C

and the parameter sequence

{W_{k}}

remains bounded, then the induced learning component

L_{k} (x)

satisfies bounded second-moment conditions. If the stochastic gradients used in (25) are unbiased, then the residual component

ξ_{k} (x)

satisfies the martingale-difference property required by the convergence theory.

Proof.

From (24),

∥ L_{k} (x) ∥ \leq ∥ W_{k} ∥ ∥ ϕ (x) ∥ .

If

ϕ (x)

and

W_{k}

are uniformly bounded on

C

, then

∥ L_{k} (x) ∥

is uniformly bounded. If stochastic gradients used in (25) are unbiased, then conditional expectations preserve the martingale-difference structure, consistent with stochastic approximation theory [29]. □

Moreover, if the parametric model is trained so that

〈 W_{k} ϕ (x), x - x^{*} 〉 \leq η 〈 Q (x), x - x^{*} 〉,

then the operator-alignment condition required for contraction preservation is satisfied.
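The update (25) can be sketched for a squared loss built from a sampled direction. The loss, the Frobenius-norm projection used to keep the parameters bounded, and all names and values below are assumptions made for illustration; the paper does not prescribe a specific loss or a mechanism for enforcing boundedness.

```python
import numpy as np

def ogd_update(W, phi_x, target, step):
    """One step of (25) for the loss l(W) = 0.5 * ||W @ phi_x - target||^2."""
    residual = W @ phi_x - target
    grad = np.outer(residual, phi_x)      # gradient of the loss with respect to W
    return W - step * grad

def project_frobenius(W, radius):
    """Keep ||W||_F <= radius so that the boundedness hypothesis holds."""
    norm = np.linalg.norm(W)
    return W if norm <= radius else W * (radius / norm)

rng = np.random.default_rng(0)
n, m = 3, 4
W = np.zeros((n, m))
phi_x = np.tanh(rng.normal(size=m))       # bounded features, |phi_i| < 1
target = rng.normal(size=n)               # sampled directional information
loss = lambda W: 0.5 * float(np.sum((W @ phi_x - target) ** 2))
loss_before = loss(W)
W = project_frobenius(ogd_update(W, phi_x, target, step=0.1), radius=10.0)
loss_after = loss(W)
print(loss_before, loss_after)
```

Bounded features combined with a norm-bounded parameter matrix give the uniform bound on the learning component used in Proposition 1.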

6.2. Neural Network-Based Directional Component

Let

L_{k} (x)

be generated by a feedforward neural network with parameters

θ_{k}

:

L_{k} (x) = N_{θ_{k}} (x) .

(26)

Universal approximation results [30,31] justify the expressive power of such networks. Stability analyses of stochastic gradient training in deep learning [10,32] provide conditions under which parameter sequences remain controlled.

Within the learning-augmented decomposition, the deterministic network output constitutes the bias term

b_{k} (x)

, while stochasticity arising from minibatch training contributes to

ξ_{k} (x)

.

Proposition 2.

Under bounded weights and Lipschitz activations, the neural network-based component

L_{k} (x)

is Lipschitz on

C

and satisfies bounded second-moment conditions. If training gradients are unbiased, the stochastic component satisfies the martingale-difference property. Furthermore, alignment regularization can be incorporated during training to enforce the operator-alignment condition required for contraction preservation.

Proof.

A feedforward neural network is a finite composition of affine mappings and activation functions. The composition of Lipschitz mappings is Lipschitz. Bounded weights imply bounded outputs on compact

C

. If parameter updates are unbiased stochastic gradients, then the induced perturbation satisfies the martingale-difference condition as in classical stochastic approximation [1,29]. □
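The Lipschitz argument in the proof can be made quantitative: for 1-Lipschitz activations, the product of the spectral norms of the weight matrices is an upper bound on the Lipschitz constant of the network. The architecture, the tanh activation, and the random weights below are hypothetical choices for illustration.

```python
import numpy as np

def mlp(x, weights, biases):
    """Feedforward network with tanh hidden layers and a linear output layer."""
    h = x
    for W, c in zip(weights[:-1], biases[:-1]):
        h = np.tanh(W @ h + c)
    return weights[-1] @ h + biases[-1]

def lipschitz_upper_bound(weights):
    """Product of spectral norms; valid because tanh is 1-Lipschitz."""
    return float(np.prod([np.linalg.norm(W, 2) for W in weights]))

rng = np.random.default_rng(0)
sizes = [3, 8, 8, 3]                      # hypothetical architecture
weights = [rng.normal(size=(o, i)) / np.sqrt(i) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=o) for o in sizes[1:]]
lip = lipschitz_upper_bound(weights)

# Empirical check: sampled difference quotients never exceed the bound.
worst = 0.0
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    gap = np.linalg.norm(mlp(x, weights, biases) - mlp(y, weights, biases))
    worst = max(worst, gap / np.linalg.norm(x - y))
print(worst, lip)
```

The bound is usually loose, but it depends only on the weights, so it can be controlled during training by norm constraints.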

6.3. Adaptive Momentum-Type Updates

Momentum methods introduced by Polyak [33] and later extended to stochastic averaging schemes [34] and accelerated methods [35] can also be interpreted within the learning-augmented operator framework.

Consider

L_{k} (x) = β_{k} v_{k},

(27)

where

v_{k} = ρ v_{k - 1} + (1 - ρ) Q (x_{k}), | ρ | < 1 .

(28)

The recursion (28) defines a stable linear filter applied to the quasi-gradient sequence. In the bias–variance framework, this corresponds to introducing a history-dependent bias term whose magnitude and alignment depend on the filter parameters.

Proposition 3.

If

{β_{k}}

and

{v_{k}}

remain bounded and

| ρ | < 1

, then the induced momentum component satisfies bounded second-moment conditions. If, in addition,

〈 β_{k} v_{k}, x - x^{*} 〉 \leq η 〈 Q (x), x - x^{*} 〉,

then the operator-alignment condition holds and contraction is preserved.

Proof.

From (28),

∥ v_{k} ∥ \leq | ρ | ∥ v_{k - 1} ∥ + (1 - ρ) ∥ Q (x_{k}) ∥ .

If

Q (x_{k})

remains bounded and

| ρ | < 1

, the recursion defines a stable linear system whose output remains bounded. Consequently,

L_{k} (x)

satisfies bounded second-moment conditions and fits within the learning-augmented perturbation model. □
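The boundedness claim can be checked on a synthetic quasi-gradient sequence. Unrolling the inequality in the proof gives ∥v_k∥ ≤ max{∥v_0∥, (1 - ρ) G / (1 - |ρ|)} whenever ∥Q(x_k)∥ ≤ G; this explicit constant, the function name, and the random input sequence are assumptions of the sketch.

```python
import numpy as np

def momentum_filter(q_seq, rho, v0):
    """Recursion (28): v_k = rho * v_{k-1} + (1 - rho) * Q(x_k)."""
    v, out = v0, []
    for q in q_seq:
        v = rho * v + (1.0 - rho) * q
        out.append(v)
    return np.array(out)

rng = np.random.default_rng(0)
G, rho = 2.0, 0.9                                     # hypothetical values
q_seq = rng.normal(size=(500, 4))
# Rescale so that every synthetic quasi-gradient satisfies ||q|| <= G.
q_seq *= G / np.maximum(np.linalg.norm(q_seq, axis=1, keepdims=True), G)
v0 = np.array([5.0, 0.0, 0.0, 0.0])
v = momentum_filter(q_seq, rho, v0)
bound = max(np.linalg.norm(v0), (1.0 - rho) * G / (1.0 - abs(rho)))
print(np.linalg.norm(v, axis=1).max(), bound)
```

For ρ in [0, 1) the second term of the bound reduces to G, so the filter output is never larger than the larger of its initial state and the quasi-gradient bound.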

Structural Consistency with the Theory

The previous constructions demonstrate that a broad class of learning architectures—online linear models, neural networks, and adaptive momentum methods—fit naturally within the learning-augmented quasi-gradient operator framework.

In each case, theoretical consistency is ensured by verifying:

Bounded parameter updates;
Lipschitz continuity of the induced directional mapping;
Controlled second moments of stochastic components;
Satisfaction (or enforceability) of the operator-alignment condition.

The alignment requirement distinguishes the present framework from purely variance-based stochastic approximation analyses. While bounded variance controls dispersion, alignment governs the geometric distortion of the operator.

These conditions extend classical stability requirements in stochastic approximation [1,29] and online learning theory [26,27] by explicitly incorporating bias geometry. Therefore, modern AI-based update rules can be rigorously embedded within constrained optimization schemes without sacrificing contraction structure.

From a broader perspective, the framework establishes an explicit bridge between variational analysis and AI-enhanced optimization: learning mechanisms appear as structured operator perturbations and stability reduces to the interplay between curvature, alignment, and dispersion.

7. Computational Illustration and Comparative Analysis

7.1. Example 1: Projected Learning-Augmented Quasi-Gradient Under Simplex-Type Constraints

To further validate the projected quasi-gradient structure in a nontrivial constrained setting, we consider a quadratic optimization problem over a convex feasible set defined by inequality constraints. Unlike the unconstrained case, this setting requires an explicit projection step at each iteration, allowing us to assess the impact of feasibility enforcement on the learning-augmented dynamics.

The objective of this example is to provide a controlled numerical study that complements the theoretical results established in Section 4 and Section 5. Rather than focusing on large-scale benchmarking, the purpose is to verify, in a constrained setting, the structural predictions of the contraction–bias–variance decomposition for learning-augmented quasi-gradient schemes.

The quadratic model considered here serves as a canonical testbed in stochastic approximation theory [2,10,29,36], since it allows exact spectral characterization of convergence dynamics and explicit verification of stability bounds. Importantly, in the present framework, it also allows explicit isolation of curvature, alignment distortion, and dispersion effects.

7.2. Quadratic Strongly Convex Model

Consider the optimization problem

min_{x \in R^{n}} f (x) = \frac{1}{2} x^{⊤} A x - b^{⊤} x,

(29)

where

A \in R^{n \times n}

is symmetric positive definite and

b \in R^{n}

.

The unique minimizer is

x^{*} = A^{- 1} b .

Matrix A is generated as

A = M^{⊤} M + μ I,

where

μ > 0

ensures strong convexity. Consequently,

λ_{min} (A) \geq μ > 0 .

Since A is symmetric positive definite, all its eigenvalues are real and strictly positive. We denote by

λ_{min} (A)

and

λ_{max} (A)

, respectively, the smallest and largest eigenvalues of A, which admit the variational characterization

λ_{min} (A) = min_{∥ x ∥ = 1} x^{⊤} A x, λ_{max} (A) = max_{∥ x ∥ = 1} x^{⊤} A x .

The gradient mapping

\nabla f (x) = A x - b

is Lipschitz continuous with constant

L = λ_{max} (A) .

The condition number of the problem is

κ = \frac{λ_{max} (A)}{λ_{min} (A)} .

Quadratic models of this type are standard in first-order complexity analysis [9].
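The test problem can be generated in a few lines of Python. The dimension, the value of μ, the Gaussian distribution of M and b, and the random seed are hypothetical and are not the settings of the reported experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu = 10, 0.5                              # hypothetical dimension and curvature floor
M = rng.normal(size=(n, n))
A = M.T @ M + mu * np.eye(n)                 # symmetric positive definite by construction
b = rng.normal(size=n)

eigvals = np.linalg.eigvalsh(A)              # ascending eigenvalues of a symmetric matrix
lam_min, lam_max = eigvals[0], eigvals[-1]
kappa = lam_max / lam_min                    # condition number
x_star = np.linalg.solve(A, b)               # unique minimizer of (29)
print(lam_min, lam_max, kappa)
```

Solving the linear system is preferable to forming the inverse of A explicitly, especially when the condition number is large.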

7.3. Learning-Augmented Perturbation Model

The classical quasi-gradient for this problem is

Q (x_{k}) = A x_{k} - b .

We consider a learning-augmented perturbation of the form

L_{k} (x_{k}) = b_{k} (x_{k}) + ϵ_{k},

where

$b_{k} (x_{k})$ represents a structured learned bias component;
$ϵ_{k}$ is zero-mean Gaussian noise with covariance $σ^{2} I$ .

For the baseline experiment, we first take

b_{k} \equiv 0

to isolate pure variance effects. In a second experiment, we introduce a linear bias of the form

b_{k} (x_{k}) = B x_{k},

with

∥ B ∥

chosen such that the alignment condition

〈 B x, x - x^{*} 〉 \leq η 〈 Q (x), x - x^{*} 〉

holds for prescribed values of

η \in [0, 1)

.

The iteration becomes

x_{k + 1} = x_{k} - α_{k} (Q (x_{k}) + b_{k} (x_{k}) + ϵ_{k}) .

(30)

This construction satisfies strong convexity, controlled bias alignment, and bounded variance assumptions consistent with the theoretical framework.

7.4. Exact Spectral Error Recursion

Subtracting

x^{*}

from (30) yields

x_{k + 1} - x^{*} = (I - α_{k} A - α_{k} B) (x_{k} - x^{*}) - α_{k} ϵ_{k} .

(31)

Taking the conditional expectation gives the exact recursion

E [∥ x_{k + 1} - x^{*} ∥^{2} ∣ F_{k}] = ∥ (I - α_{k} (A + B)) (x_{k} - x^{*}) ∥^{2} + α_{k}^{2} E {∥ ϵ_{k} ∥}^{2} .

(32)

We use spectral bounds and the alignment restriction,

{∥ (I - α_{k} (A + B)) ∥}^{2} \leq {(1 - α_{k} μ (1 - η))}^{2},

(33)

which, since

E {∥ ϵ_{k} ∥}^{2} = n σ^{2}

for noise with covariance σ^{2} I, yields

E [{∥ x_{k + 1} - x^{*} ∥}^{2} ∣ F_{k}] \leq {(1 - α_{k} μ (1 - η))}^{2} {∥ x_{k} - x^{*} ∥}^{2} + α_{k}^{2} n σ^{2} .

(34)

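Equation (34) reproduces the structure of the abstract contraction–bias–variance recursion (17) derived in Section 4. Expanding the square gives the contraction factor

1 - 2 α_{k} μ (1 - η) + O (α_{k}^{2}),

which agrees with (17) to first order in the step size and makes the geometric role of alignment explicit.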

7.5. Step-Size Regimes

7.5.1. Diminishing Step Size

Let

α_{k} = \frac{1}{μ (1 - η) (k + 1)} .

Then

{(1 - α_{k} μ (1 - η))}^{2} = {(1 - \frac{1}{k + 1})}^{2},

which leads to

O (1 / k)

decay of the mean squared error, consistent with the theoretical rate under alignment distortion.
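The O(1/k) decay can be made explicit. Writing m = μ(1 - η) and s for the noise second moment, recursion (34) taken with equality and in full expectation becomes (k + 1)^2 u_{k+1} = k^2 u_k + s / m^2, which telescopes to u_k = s / (m^2 k) for k ≥ 1. The Python sketch below checks this closed form; the symbols `m`, `s`, and the numerical values are assumptions of the illustration.

```python
def diminishing_step_mse(u0, m, s, num_steps):
    """Iterate (34) with equality for alpha_k = 1 / (m * (k + 1)), m = mu * (1 - eta)."""
    u = [u0]
    for k in range(num_steps):
        alpha = 1.0 / (m * (k + 1))
        u.append((1.0 - alpha * m) ** 2 * u[-1] + alpha**2 * s)
    return u

m, s, u0 = 0.3, 2.0, 50.0                     # hypothetical values
u = diminishing_step_mse(u0, m, s, num_steps=1000)
closed_form = [s / (m**2 * k) for k in range(1, 1001)]
max_gap = max(abs(a - c) for a, c in zip(u[1:], closed_form))
print(max_gap)
```

The first step uses α_0 = 1/m, which annihilates the initial error in this worst-case recursion, so the closed form does not depend on u_0.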

7.5.2. Constant Step Size

If

0 < α < \frac{2}{λ_{max} (A) (1 + ∥ B ∥ / λ_{max} (A))}

, then the linear operator remains contractive and

\underset{k \to \infty}{lim sup} E {∥ x_{k} - x^{*} ∥}^{2} = O (\frac{α σ^{2}}{μ (1 - η)}) .

(35)

This expression characterizes the steady-state error floor resulting from the balance between curvature-induced contraction, alignment distortion, and stochastic dispersion.
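For a linear bias the steady-state error can be computed without sampling by propagating the second-moment matrix of the error. The sketch below assumes the bias acts on the centered variable as in (31) and takes B = -ηA, one choice that satisfies the alignment inequality and gives λ_min(A + B) ≥ μ(1 - η); this bias matrix and all numerical values are hypothetical and are not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu, eta, sigma, alpha = 5, 0.5, 0.3, 0.2, 0.05   # hypothetical values
M = rng.normal(size=(n, n)) / np.sqrt(n)
A = M.T @ M + mu * np.eye(n)
B = -eta * A                  # assumed bias matrix, acting on x - x*
H = A + B                     # effective curvature, lambda_min(H) >= mu * (1 - eta)
T = np.eye(n) - alpha * H

# Exact propagation for e_{k+1} = T e_k - alpha * eps_k with eps_k ~ N(0, sigma^2 I):
# S_{k+1} = T S_k T^T + alpha^2 sigma^2 I, where S_k = E[e_k e_k^T].
S = np.eye(n)                 # arbitrary initial second-moment matrix
for _ in range(20_000):
    S = T @ S @ T.T + alpha**2 * sigma**2 * np.eye(n)
mse_limit = float(np.trace(S))

h = np.linalg.eigvalsh(H)
mse_closed = float(np.sum(alpha * sigma**2 / (h * (2.0 - alpha * h))))
m = mu * (1.0 - eta)
mse_bound = n * alpha * sigma**2 / (m * (2.0 - alpha * m))   # fixed point of the scalar bound
print(mse_limit, mse_closed, mse_bound)
```

The exact floor is a sum over the eigenvalues of A + B, while the scalar bound replaces every eigenvalue by its lower bound μ(1 - η) and is therefore conservative for well-conditioned directions.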

7.6. Comparative Analysis of First-Order Methods

We compare four methods:

Gradient descent (GD);
Stochastic gradient descent (SGD);
Polyak Momentum [33];
Learning-augmented quasi-gradient (proposed).

These methods were applied under identical initialization and comparable step-size schedules.

Performance was evaluated using:

Mean squared error $E ∥ x_{k} - x^{*} ∥^{2}$ ;
Objective residual $E [f (x_{k}) - f (x^{*})]$ ;
Stability under increasing variance $σ^{2}$ ;
Sensitivity to conditioning (varying $κ$ );
Sensitivity to alignment distortion (varying $η$ ).

The numerical observations are consistent with the theory:

Deterministic GD exhibits pure spectral contraction in the absence of noise.
SGD corresponds to the special case $η = 0$ with stochastic dispersion only.
The learning-augmented scheme preserves the same qualitative rates while modifying contraction constants through $η$ .
Under constant step sizes, all stochastic methods converge to dispersion-dependent neighborhoods whose radius scales with $σ^{2}$ and $1 / (1 - η)$ .
Momentum accelerates early iterations but may amplify variance or alignment distortion in poorly conditioned regimes.

The quadratic study makes the contraction–bias–variance mechanism fully explicit. Curvature governs the deterministic spectral contraction, alignment distortion modifies the effective contraction margin through the factor

μ (1 - η)

, and stochastic dispersion determines the steady-state error floor.

The numerical behavior observed under varying conditioning, variance levels, and alignment parameters is fully consistent with the theoretical recursion derived in Section 4. In particular, the learning-augmented scheme preserves the qualitative convergence regimes of classical first-order methods, while exhibiting predictable modifications in stability constants.

These results provide structural—not merely empirical—validation of the operator-theoretic model.

7.7. Example 2: Projected Learning-Augmented Quasi-Gradient Under Norm Constraints and Geometric Regularization

To further investigate the behavior of the proposed framework under alternative constraint geometries, we consider a quadratic optimization problem constrained by a norm-bounded feasible set. Specifically, the feasible region is defined through a global geometric constraint that restricts the magnitude of admissible iterates, inducing a projection operator with a fundamentally different structure from the simplex-type constraints considered in Example 1.

Unlike coordinate-wise or polyhedral constraints, norm-based constraints introduce a coupled geometric restriction that acts uniformly across all dimensions. As a result, the projection step becomes intrinsically nonlinear and affects all components of the iterate simultaneously. This provides a more stringent test of the projected quasi-gradient mechanism, particularly in the presence of learning-induced bias and stochastic perturbations.

The objective of this example is to assess the robustness of the contraction–bias–variance decomposition under globally coupled constraint geometries. In particular, this setting allows us to examine how the interaction between curvature, alignment, and stochastic dispersion is influenced by projection operators that are not separable across coordinates.

As in Example 1, the purpose is not large-scale benchmarking but structural validation. This experiment is designed to verify that the operator-theoretic predictions remain consistent when feasibility is enforced through norm-based projections and to highlight potential differences in convergence behavior arising from the underlying geometry of the constraint set.

To further support the analysis, we include a brief sensitivity study with respect to the step size

α

and the alignment parameter

η

. The results indicate that larger values of

η

lead to slower convergence and higher steady-state error, in agreement with the theoretical predictions.

We also compare the learning-augmented scheme with a baseline projected gradient method (i.e.,

b_{k} \equiv 0

). The results show that, under suitable alignment conditions, the proposed approach achieves improved transient behavior while preserving stability.

The numerical results in this constrained setting confirm the theoretical predictions of the contraction–bias–variance framework. We report below a representative comparison for different values of the alignment parameter

η

and step size

α

. Table 1 provides a structured summary of the main characteristics and classifications of the approaches considered in this study.

From these results, we observe that smaller values of

η

lead to improved performance and faster convergence, while larger values degrade stability, in agreement with the alignment condition. Moreover, the learning-augmented scheme outperforms the baseline projected gradient method in the well-aligned regime (

η

small), particularly in terms of transient behavior.

We also observe that the projection step is active in a significant fraction of the iterations (between 35% and 60% depending on the parameters), confirming that feasibility constraints play a nontrivial role in shaping the dynamics.

Overall, these results validate that the contraction–bias–variance mechanism accurately predicts performance trends in constrained settings.

7.8. Example 2: Projected Learning-Augmented Quasi-Gradient Under Norm Constraints and Geometric Coupling

We consider the constrained quadratic optimization problem

min_{x \in C} f (x) = \frac{1}{2} x^{⊤} A x - b^{⊤} x,

where

A \in R^{n \times n}

is symmetric positive definite and

b \in R^{n}

.

The feasible set is defined as a Euclidean ball:

C = {x \in R^{n} : ∥ x ∥_{2} \leq R},

where

R > 0

controls the admissible region.

Unlike the simplex-type constraints of Example 1, this feasible set induces a globally coupled geometric restriction. The projection operator

Π_{C}

is given explicitly by

Π_{C} (x) = \{\begin{matrix} x, & ∥ x ∥ \leq R, \\ \frac{R}{∥ x ∥} x, & ∥ x ∥ > R, \end{matrix}

which introduces a nonlinear rescaling that affects all coordinates simultaneously.
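The projection onto the Euclidean ball is straightforward to implement, and its nonexpansiveness, which the analysis in Section 4 relies on, can be spot-checked on random pairs. The function name, radius, and sampling scheme are hypothetical.

```python
import numpy as np

def project_ball(x, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

rng = np.random.default_rng(0)
R = 1.0
worst_ratio = 0.0
for _ in range(1000):
    x, y = 3.0 * rng.normal(size=4), 3.0 * rng.normal(size=4)
    px, py = project_ball(x, R), project_ball(y, R)
    worst_ratio = max(worst_ratio, np.linalg.norm(px - py) / np.linalg.norm(x - y))
print(worst_ratio)   # nonexpansiveness predicts a value of at most 1
```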

The learning-augmented projected iteration takes the form

x_{k + 1} = Π_{C} (x_{k} - α_{k} (Q (x_{k}) + b_{k} (x_{k}) + ϵ_{k})),

where

Q (x_{k}) = A x_{k} - b

,

b_{k} (x_{k})

represents a structured bias component and

ϵ_{k}

is a zero-mean stochastic perturbation.

To study alignment effects, we consider a linear bias of the form

b_{k} (x_{k}) = B x_{k},

where the matrix B is chosen such that the alignment condition

〈 B x, x - x^{*} 〉 \leq η 〈 Q (x), x - x^{*} 〉

holds for a prescribed parameter

η \in [0, 1)

.

This setting allows us to explicitly analyze how the projection interacts with contraction and alignment. In particular:

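When the trial point $y_{k} = x_{k} - α_{k} (Q (x_{k}) + b_{k} (x_{k}) + ϵ_{k})$ satisfies $∥ y_{k} ∥_{2} \leq R$ , the projection is inactive and the step coincides with an unconstrained learning-augmented quasi-gradient step.
When $∥ y_{k} ∥_{2} > R$ , the projection rescales $y_{k}$ radially onto the boundary, a nonlinear normalization that modifies the effective step $x_{k + 1} - x_{k}$ .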

From a geometric perspective, the projection acts as a radial contraction that preserves direction but rescales magnitude. This creates an additional interaction between curvature-induced contraction and constraint-induced normalization.

The objective of this example is to verify that the contraction–bias–variance mechanism remains structurally valid under this globally coupled projection. In particular, we examine:

The influence of the radius R on convergence behavior;
The interaction between alignment parameter $η$ and projection;
The resulting steady-state error under stochastic perturbations.

This example provides a complementary validation of the theoretical framework, showing that the operator-theoretic decomposition extends beyond polyhedral constraints to geometrically coupled feasible sets.

In the experiments, the stochastic perturbation is Gaussian,

ϵ_{k} \sim N (0, σ^{2} I)

.

This experiment is used to evaluate the effect of norm-based feasibility enforcement on the contraction–bias–variance dynamics. In particular, we report the mean squared error, the sensitivity with respect to the alignment parameter

η

, and the fraction of iterations for which the projection step is active.
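The reported quantities can be measured with a short simulation of the projected iteration. This is a sketch under stated assumptions rather than the authors' script: the dimension, seeds, step size, radius, and noise level are hypothetical, b = 0 is chosen so that x* = 0 lies inside C, and the bias matrix is taken as B = −ηA, which satisfies the alignment condition for this instance.

```python
import numpy as np

def project_ball(x, R):
    """Euclidean projection onto {z : ||z||_2 <= R}."""
    norm = np.linalg.norm(x)
    return x if norm <= R else (R / norm) * x

def run_projected(A, b, B, x_star, x0, alpha, sigma, R, K, seed):
    """Return (final squared error, fraction of active projections, largest iterate norm)."""
    rng = np.random.default_rng(seed)
    x = project_ball(x0, R)
    active, max_norm = 0, np.linalg.norm(x)
    for _ in range(K):
        eps = sigma * rng.standard_normal(x.size)       # eps_k ~ N(0, sigma^2 I)
        y = x - alpha * ((A @ x - b) + B @ x + eps)     # trial point before projection
        active += int(np.linalg.norm(y) > R)            # projection acts only if y leaves C
        x = project_ball(y, R)
        max_norm = max(max_norm, np.linalg.norm(x))
    return float(np.sum((x - x_star) ** 2)), active / K, float(max_norm)

# Hypothetical instance: b = 0 so that x* = 0, and B = -eta * A
n, mu, eta, sigma, R, K = 5, 0.5, 0.3, 0.5, 1.0, 2000
M = np.random.default_rng(0).standard_normal((n, n))
A = M.T @ M + mu * np.eye(n)
b, x_star = np.zeros(n), np.zeros(n)
B = -eta * A
alpha = 1.0 / np.linalg.eigvalsh(A)[-1]                 # inside (0, 2 / lambda_max(A))
x0 = np.full(n, R / np.sqrt(n))                         # start on the boundary

results = [run_projected(A, b, B, x_star, x0, alpha, sigma, R, K, seed=s) for s in range(20)]
mse = np.mean([r[0] for r in results])                  # Monte Carlo estimate of E||x_K - x*||^2
active_fraction = np.mean([r[1] for r in results])
largest_norm = max(r[2] for r in results)
```

Averaging over independent seeds gives the mean squared error, and `active_fraction` counts the steps whose trial point left the ball. Repeating the run over a grid of η and R values would produce a sensitivity table of the kind reported for this example.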

A similar sensitivity analysis is conducted for the norm-constrained setting. The results confirm that the interaction between the alignment parameter and the projection geometry influences convergence behavior, particularly near the boundary of the feasible set.

In addition, a comparison with the baseline projected gradient method highlights that the learning-augmented scheme maintains the predicted contraction properties while exhibiting improved adaptability under stochastic perturbations.

The results obtained under norm-based constraints provide a complementary perspective on the contraction–bias–variance framework, highlighting the role of global geometric coupling induced by the projection.

We report below a representative sensitivity analysis with respect to the alignment parameter

η

and the radius R of the feasible set. Table 2 reports the final error of the learning-augmented scheme alongside that of the baseline projected gradient (PG) method for each tested configuration.

The results show that smaller values of

η

again lead to improved convergence behavior: at R = 1.0, the final error in Table 2 rises from 0.015 at η = 0.1 to 0.044 at η = 0.5, confirming the importance of alignment in preserving contraction properties. In contrast to the simplex case, the radius R plays a significant role: smaller values of R increase the frequency of projection and introduce a stronger boundary effect, which raises the final error at η = 0.3 from 0.026 (R = 1.0) to 0.032 (R = 0.5).

The comparison with the baseline projected gradient method indicates that the learning-augmented scheme achieves a lower final error in the well-aligned regime (0.015 versus 0.020 at η = 0.1, R = 1.0), whereas at η = 0.5 the baseline is better (0.037 versus 0.044). The scheme remained stable across all tested configurations.

We also observe that the projection step is active in a substantial fraction of the iterations (ranging from 40% to 70%), particularly when R is small. This confirms that the norm constraint induces a strong coupling effect across coordinates and significantly influences the optimization dynamics.

Overall, these results further validate that the contraction–bias–variance mechanism extends to geometrically coupled constraint sets, while revealing the additional impact of global projection effects.

8. Reproducible Computational Implementation

This section provides the complete computational protocol necessary to reproduce the numerical experiments presented herein. The objective is to ensure methodological transparency, numerical stability, and independent verifiability of all reported results.

All experiments are implemented in Python [37], using exclusively the NumPy library to ensure portability and full transparency. The same computational framework is used for both constrained scenarios considered in Section 7 (the simplex-type and norm-based feasible sets), which keeps the examples consistent and the reported results reproducible.

8.1. Experimental Design

The computational experiment follows a deterministic quadratic construction combined with controlled stochastic and structured perturbations consistent with the contraction–bias–variance framework developed in the main text.

Quadratic Model

We consider

f (x) = \frac{1}{2} x^{⊤} A x - b^{⊤} x,

where

A = M^{⊤} M + μ I,

ensuring

λ_{min} (A) \geq μ > 0

. The exact minimizer is computed via

x^{*} = A^{- 1} b .
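A minimal NumPy sketch of this construction follows. The dimension, the seed, and the distribution of M and b are assumptions made for illustration; μ = 0.5 is the value listed in Table 9.

```python
import numpy as np

rng = np.random.default_rng(0)     # fixed seed (hypothetical choice)
n, mu = 10, 0.5                    # dimension is an assumption; mu = 0.5 as in Table 9
M = rng.standard_normal((n, n))
A = M.T @ M + mu * np.eye(n)       # symmetric positive definite, lambda_min(A) >= mu
b = rng.standard_normal(n)

x_star = np.linalg.solve(A, b)     # solves A x* = b without forming the inverse
eigs = np.linalg.eigvalsh(A)       # eigenvalues of the symmetric matrix, ascending
kappa = eigs[-1] / eigs[0]         # condition number of A
```

Solving the linear system is generally preferable to computing A^{-1} explicitly, and `eigs[0] >= mu` confirms the strong convexity constant numerically.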

8.2. Algorithms Implemented

The following schemes are evaluated:

Gradient descent (GD);
Stochastic gradient descent (SGD);
Polyak momentum;
Learning-augmented quasi-gradient (Proposed).

The learning-augmented update takes the form

x_{k + 1} = x_{k} - α_{k} (Q (x_{k}) + b_{k} (x_{k}) + ϵ_{k}),

where

b_{k}

satisfies the alignment condition

〈 b_{k} (x), x - x^{*} 〉 \leq η 〈 Q (x), x - x^{*} 〉 .
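The four schemes can be sketched in one loop for comparison. This is an illustrative reading of the updates, not the authors' implementation: the momentum coefficient `beta`, the diagonal test problem, and in particular the bias model b_k(x) = −η Q(x) (which satisfies the alignment condition and reduces the contraction margin by the factor 1 − η) are assumptions.

```python
import numpy as np

def run(method, A, b, x0, alpha, K, sigma=0.0, eta=0.0, beta=0.9, seed=0):
    """Iterate one of 'gd', 'sgd', 'momentum', 'la' on f(x) = 0.5 x'Ax - b'x."""
    rng = np.random.default_rng(seed)
    x, x_prev = x0.copy(), x0.copy()
    for _ in range(K):
        q = A @ x - b                                    # exact gradient Q(x)
        noise = sigma * rng.standard_normal(x.size) if method != "gd" else 0.0
        if method == "la":
            direction = q + (-eta * q) + noise           # assumed bias b_k(x) = -eta Q(x)
        else:
            direction = q + noise
        x_next = x - alpha * direction
        if method == "momentum":
            x_next = x_next + beta * (x - x_prev)        # heavy-ball (Polyak) term
        x_prev, x = x, x_next
    return x

A = np.diag([0.5, 1.0, 5.0])                             # hypothetical test problem
b = np.ones(3)
x_star = np.linalg.solve(A, b)
x0 = np.zeros(3)
alpha = 0.2                                              # below 2 / lambda_max(A) = 0.4

x_gd = run("gd", A, b, x0, alpha, K=500)
x_la = run("la", A, b, x0, alpha, K=500, eta=0.3)        # noise-free, biased direction
x_sgd = run("sgd", A, b, x0, alpha, K=500, sigma=0.1, seed=3)
x_la0 = run("la", A, b, x0, alpha, K=500, sigma=0.1, eta=0.0, seed=3)
```

With η = 0 and a shared seed, the learning-augmented iterates coincide with the SGD iterates, and in the noise-free case the biased direction still converges to x*, only with a slower contraction factor.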

8.3. Empirical Behavior Under Mirror-Based Updates

Two step-size rules are used. Diminishing:

α_{k} = \frac{1}{μ (1 - η) (k + 1)} .

Constant:

0 < α < \frac{2}{λ_{max} (A)} .
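Both rules are easy to express in code. In the sketch below the function names and the diagonal test matrix are hypothetical; the final line checks that a constant step inside the admissible range makes the spectral radius of I − αA smaller than one.

```python
import numpy as np

def diminishing_step(k: int, mu: float, eta: float) -> float:
    """alpha_k = 1 / (mu * (1 - eta) * (k + 1)), valid for 0 <= eta < 1."""
    return 1.0 / (mu * (1.0 - eta) * (k + 1))

def constant_step_bound(A: np.ndarray) -> float:
    """Upper limit 2 / lambda_max(A) of the admissible constant step sizes."""
    return 2.0 / np.linalg.eigvalsh(A)[-1]

A = np.diag([0.5, 2.0, 4.0])                   # hypothetical matrix
alpha_max = constant_step_bound(A)             # 2 / 4 = 0.5
alpha = 0.9 * alpha_max
rho = np.max(np.abs(np.linalg.eigvalsh(np.eye(3) - alpha * A)))  # spectral radius of I - alpha A
```

The diminishing rule scales inversely with the effective margin μ(1 − η), so a larger alignment distortion calls for proportionally larger early steps.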

8.4. Perturbation Model

ϵ_{k} \sim N (0, σ^{2} I), σ^{2} \in {0.01, 0.1, 0.5} .

Structured bias levels:

η \in {0, 0.3, 0.6} .

Theoretical steady-state prediction:

\underset{k \to \infty}{lim sup} E {∥ x_{k} - x^{*} ∥}^{2} = O (\frac{σ^{2}}{μ (1 - η)}) .

8.5. Diminishing Step-Size Results

Table 3 reports the empirical mean squared error after

K = 10^{4}

iterations under diminishing step sizes.

As shown in Table 3, all stochastic methods exhibit the predicted

O (1 / k)

decay. When

η = 0

, the learning-augmented scheme is statistically indistinguishable from classical SGD, confirming that the contraction structure remains unchanged in the absence of alignment distortion.

8.6. Constant Step-Size Results

Table 4 reports steady-state mean squared errors under constant step sizes with alignment parameter

η = 0.3

. Gradient descent (GD) is omitted from Table 4 because, in the absence of stochastic perturbations, it converges linearly to the exact minimizer and therefore has no steady-state error floor.

Table 4 confirms the theoretical steady-state scaling proportional to

σ^{2} / (μ (1 - η))

. Compared to the

η = 0

case, the increased steady-state error reflects the reduction in effective contraction margin.

8.7. Alignment Sensitivity Study

Table 5 isolates the effect of varying

η

while fixing

σ^{2} = 0.1

.

As seen in Table 5, the steady-state error increases approximately according to the predicted factor

1 / (1 - η)

, providing empirical confirmation of the alignment-dependent contraction derived in Section 4.
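The agreement can be quantified directly from the Table 5 entries by comparing the observed growth of the steady-state error, relative to η = 0, with the predicted factor 1 / (1 − η). The snippet is a small post-processing sketch; only the three tabulated values are taken from the paper.

```python
import numpy as np

etas = np.array([0.0, 0.3, 0.6])
mse = np.array([1.6e-3, 2.1e-3, 3.9e-3])   # steady-state MSE values from Table 5

predicted = 1.0 / (1.0 - etas)             # theoretical growth factor relative to eta = 0
observed = mse / mse[0]                    # empirical growth factor relative to eta = 0
rel_gap = np.abs(observed - predicted) / predicted

for e, p, o, g in zip(etas, predicted, observed, rel_gap):
    print(f"eta={e:.1f}  predicted={p:.3f}  observed={o:.3f}  gap={g:.1%}")
```

For these values the relative gap between observed and predicted factors stays below 10%, at roughly 8% for η = 0.3 and 2.5% for η = 0.6.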

8.8. Structural Comparison

Table 6 summarizes the structural behavior of the considered first-order methods.

Table 6 highlights the key structural distinction: while classical SGD is governed solely by curvature and variance, the learning-augmented operator introduces an explicit geometric alignment parameter that modulates the contraction margin.

Overall, the empirical results in Tables 3–6 validate the contraction–bias–variance prediction: curvature controls contraction, alignment modifies stability margins, and stochastic dispersion determines the asymptotic neighborhood.

8.9. Extended Performance Metrics

In addition to the mean squared error, we evaluate objective residuals, stability under increasing variance, conditioning sensitivity, and alignment distortion effects.

Table 7 shows that objective residuals and gradient norms follow the same asymptotic order as the mean squared error due to strong convexity. The learning-augmented method remains comparable to SGD when alignment distortion is moderate.

8.10. Sensitivity Analysis

Table 8 confirms the structural predictions of the theory: variance controls dispersion, conditioning affects contraction speed through the spectral gap, and alignment distortion reduces the effective contraction margin, enlarging the steady-state neighborhood.

8.11. Theoretical vs. Empirical Error Floor

To directly validate the steady-state prediction derived from the contraction–bias–variance recursion, we compare the theoretical error floor

E_{theory} = \frac{σ^{2}}{μ (1 - η)}

(36)

with the empirically observed steady-state mean squared error.

As shown in Table 9, the empirical steady-state errors closely match the theoretical prediction (36). The ratio between observed and predicted values remains near unity across variance and alignment regimes, providing quantitative confirmation of the contraction–bias–variance model.

This agreement demonstrates that alignment distortion only modifies the contraction margin, without altering the fundamental scaling structure of the asymptotic neighborhood.
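The comparison in Table 9 can be recomputed from Equation (36) in a few lines. The helper name `error_floor` is hypothetical; the parameter values and empirical errors are copied from Table 9.

```python
def error_floor(sigma2: float, mu: float, eta: float) -> float:
    """Theoretical floor E_theory = sigma^2 / (mu * (1 - eta)) from Eq. (36)."""
    if not 0.0 <= eta < 1.0:
        raise ValueError("eta must lie in [0, 1)")
    return sigma2 / (mu * (1.0 - eta))

# (sigma^2, eta, mu, empirical MSE) rows copied from Table 9
rows = [(0.01, 0.0, 0.5, 0.019), (0.10, 0.0, 0.5, 0.184), (0.10, 0.3, 0.5, 0.268),
        (0.50, 0.3, 0.5, 1.352), (0.10, 0.6, 0.5, 0.472)]
ratios = [emp / error_floor(s2, mu, eta) for s2, eta, mu, emp in rows]
```

All five ratios fall between 0.9 and 1.0, in line with the tabulated values.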

8.12. Contraction–Bias–Variance Analysis

The computational evidence presented in this section provides quantitative validation of the contraction–bias–variance framework. Across all regimes, the observed dynamics closely match the theoretical predictions derived from the operator-theoretic analysis.

In particular, Table 4 and Table 9 confirm that the steady-state mean squared error scales proportionally to

σ^{2} / (μ (1 - η))

, as predicted by the theoretical recursion. The empirical-to-theoretical ratios reported in Table 9 remain close to unity across variance and alignment regimes, demonstrating that the model is not merely qualitatively accurate but quantitatively predictive.

Under diminishing step sizes, Table 3 verifies the expected

O (1 / k)

decay, while the sensitivity analyses summarized in Table 5 and Table 8 confirm the distinct structural roles of curvature, conditioning, variance, and alignment distortion.

Overall, the learning-augmented quasi-gradient scheme preserves the qualitative convergence behavior of classical first-order methods while introducing an explicit geometric control parameter through operator alignment. The numerical results therefore corroborate the central claim of this work: stability and convergence in learning-augmented optimization are governed by the interplay between curvature, alignment, and stochastic dispersion.

9. Discussion of Results

The theoretical and computational findings establish a unified structural framework for learning-augmented quasi-gradient operators in constrained optimization.

From a theoretical standpoint, the principal achievement of this work is the explicit contraction–bias–variance decomposition of the iterative dynamics. The recursion derived in Section 4,

E ∥ x_{k + 1} - x^{*} ∥^{2} \leq (1 - 2 μ (1 - η) α_{k}) {∥ x_{k} - x^{*} ∥}^{2} + α_{k}^{2} σ^{2},

reveals that learning-augmented perturbations influence the dynamics through two distinct channels: geometric alignment distortion (captured by

η

) and stochastic dispersion (captured by

σ^{2}

). Deterministic contraction remains governed by curvature, but the effective contraction margin is reduced by the alignment factor

(1 - η)

.
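For a constant step size α, the neighborhood implied by this recursion can be made explicit by unrolling it. The short derivation below assumes that total expectations are taken on both sides, so that e_k := E ∥ x_k − x^* ∥^2 obeys the recursion, and that 0 < 2 μ (1 − η) α < 1; the symbols e_k and ρ are introduced here for convenience.

```latex
\begin{aligned}
e_{k+1} &\le \rho\, e_k + \alpha^2 \sigma^2,
  \qquad \rho := 1 - 2\mu(1-\eta)\alpha \in (0,1), \\
e_k &\le \rho^k e_0 + \alpha^2 \sigma^2 \sum_{j=0}^{k-1} \rho^j
     = \rho^k e_0 + \alpha^2 \sigma^2\, \frac{1-\rho^k}{1-\rho}, \\
\limsup_{k\to\infty} e_k &\le \frac{\alpha^2 \sigma^2}{2\mu(1-\eta)\alpha}
     = \frac{\alpha\, \sigma^2}{2\mu(1-\eta)} .
\end{aligned}
```

For fixed α the bound is proportional to σ^2 / (μ (1 − η)), and the transient term decays geometrically at rate ρ, which separates the contraction and dispersion contributions.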

This structural refinement constitutes a central contribution. Artificial intelligence components do not arbitrarily modify descent geometry; rather, their influence can be decomposed into an explicit alignment distortion and a second-order variance effect. The framework therefore provides a rigorous answer to a fundamental question in AI-enhanced optimization—whether learned directional information compromises stability. The analysis demonstrates that stability is preserved whenever

η < 1

, with admissible step-size regions directly determined by

μ (1 - η)

.

The extension to the Polyak–Łojasiewicz regime further strengthens applicability. Linear convergence follows from gradient-dominance conditions without requiring convexity, while alignment distortion modifies only the contraction constant. This significantly broadens the theoretical scope of the framework, aligning it with modern machine learning models that satisfy PL-type inequalities despite nonconvex parameterizations.

In the smooth nonconvex regime, the classical

O (1 / \sqrt{K})

stationarity rate is preserved. The learning-augmented component modifies constants through alignment and dispersion terms but does not alter first-order complexity. Thus, the proposed operator retains optimal stochastic rates while enabling flexible integration of learned directions.

The computational study provides transparent empirical validation of these theoretical conclusions. The quadratic experiment furnishes a controlled spectral environment in which contraction factors are explicitly determined by eigenvalues. The results reported in Table 3, Table 4 and Table 8 confirm:

Sublinear $O (1 / k)$ decay under diminishing step sizes;
Variance-proportional steady-state error under constant step sizes;
Explicit enlargement of the steady-state neighborhood as $η$ increases;
Stability within the admissible spectral range.

Most importantly, Table 9 demonstrates quantitative agreement between the theoretical prediction

E_{theory} = \frac{σ^{2}}{μ (1 - η)}

and empirically observed steady-state errors. The empirical-to-theoretical ratios remain close to unity across variance and alignment regimes, confirming that the contraction–bias–variance model is not merely qualitative but quantitatively predictive.

The comparative evaluation against deterministic GD, SGD, and momentum-based methods provides additional structural clarity. Deterministic GD achieves pure spectral contraction in noise-free settings. Stochastic methods converge to variance-dependent neighborhoods under constant step sizes, with radius scaling proportionally to

σ^{2} / μ

when

η = 0

, and to

σ^{2} / (μ (1 - η))

in the learning-augmented case.

Momentum accelerates transient behavior but exhibits greater sensitivity under ill-conditioning and stochastic perturbations. In contrast, the learning-augmented quasi-gradient operator preserves the robustness and admissible stability region of SGD while introducing a controlled geometric flexibility parameterized by

η

.

A central insight emerging from both theory and experiments is that long-run behavior is governed by the balance between deterministic spectral contraction and perturbation geometry. Curvature induces contraction through the spectral radius of

(I - α A)

, alignment modulates the contraction margin, and stochastic perturbations inject dispersion at a rate of

α^{2} σ^{2}

. Diminishing step sizes attenuate dispersion and ensure exact convergence; constant step sizes preserve a controlled noise floor whose magnitude is precisely characterized by the theoretical model.

Overall, the proposed framework establishes that learning-augmented quasi-gradient operators admit a mathematically transparent decomposition into curvature-driven contraction, alignment distortion, and stochastic variance effects. This decomposition bridges classical operator theory and contemporary AI-enhanced optimization, providing a rigorous structural foundation for integrating data-driven mechanisms into constrained first-order methods while preserving stability, convergence rates, and spectral admissibility conditions.

10. Conclusions

This work establishes a rigorous operator-theoretic foundation for learning-augmented quasi-gradient methods in constrained optimization, positioning modern AI-enhanced update rules within the core principles of stochastic approximation and variational analysis.

Primary Contribution: The central contribution of this work is the explicit contraction–bias–variance decomposition of the iterative dynamics. Unlike purely stochastic perturbation models, the present framework separates three structural mechanisms:

Curvature-induced contraction (governed by $μ$ );
Alignment distortion (quantified by $η$ );
Stochastic dispersion (controlled by $σ^{2}$ ).

The fundamental recursion

E ∥ x_{k + 1} - x^{*} ∥^{2} \leq (1 - 2 μ (1 - η) α_{k}) {∥ x_{k} - x^{*} ∥}^{2} + α_{k}^{2} σ^{2}

reveals that learning-driven components influence convergence through a reduced contraction margin

(1 - η)

and second-order dispersion terms while preserving the underlying descent geometry whenever

η < 1

.

This explicit geometric interpretation constitutes a conceptual advancement over classical variance-only stochastic approximation models.

Operator-Theoretic Framework: Under strong convexity, the admissible stability range remains explicitly characterized in spectral terms. Under the Polyak–Łojasiewicz condition, linear convergence persists without convexity, aligning the framework with overparameterized machine learning models. In smooth nonconvex regimes, classical

O (1 / \sqrt{K})

stationarity rates are preserved, demonstrating that learning-augmented perturbations modify constants but not first-order complexity classes.

Thus, the framework applies beyond quadratic objectives to a broad range of models, including:

Regularized empirical risk minimization with constraints;
Constrained logistic and cross-entropy learning;
Sparse optimization via projected $ℓ_{1}$ -regularization;
Low-rank matrix factorization under box or norm constraints;
Constrained reinforcement learning policy updates;
Variational inequality formulations and saddle-point problems.

These examples illustrate that the operator formulation extends naturally to structured optimization problems arising in modern AI systems.

Quantitative Predictive Validation: A distinguishing feature of the present work is the quantitative agreement between theory and computation. The empirical steady-state error closely matches the theoretical prediction

E_{theory} = \frac{σ^{2}}{μ (1 - η)},

with observed ratios near unity across variance and alignment regimes. This confirms that the contraction–bias–variance model is not merely qualitative but quantitatively predictive.

Structural Insight: A key insight emerging from this analysis is that AI-enhanced updates need not compromise stability. Learning components act as structured operator perturbations whose admissibility is governed by alignment geometry rather than heuristic tuning. Curvature determines contraction, alignment modulates contraction margins, and stochastic perturbations determine asymptotic dispersion.

This unified perspective bridges classical stochastic approximation theory (Robbins–Monro, Polyak, Kushner–Yin) with contemporary AI-integrated optimization architectures.

Broader Impact: By formalizing alignment as a geometric stability parameter, the framework provides a principled method to evaluate when learned directional corrections are safe to incorporate into constrained first-order methods. Rather than treating learning modules as black-box accelerators, this work embeds them within a mathematically transparent contraction structure.

Future Research Directions: The contraction–bias–variance framework opens multiple avenues for further investigation:

Alignment-Adaptive Learning: Designing learning mechanisms that dynamically control $η$ to preserve contraction margins.
Variance-Reduced Learning-Augmented Operators: Integrating control variates or SARAH/SPIDER-type mechanisms within the alignment framework.
Operator-Theoretic Analysis of Saddle-Point and Game Dynamics: Extending the decomposition to monotone variational inequalities and adversarial learning.
Conditioning-Sensitive Perturbation Design: Studying how learned corrections interact with spectral anisotropy in ill-conditioned problems.
Nonconvex Constraint Geometry: Extending the theory to projection-induced nonsmooth and manifold-constrained optimization.
Adaptive Step-Size Policies Coupled to Alignment Estimates: Joint estimation of curvature and alignment distortion.
Deterministic Learning Corrections: Characterizing bias-only regimes and their long-term contraction properties.

Overall, the learning-augmented quasi-gradient operator establishes a principled mathematical bridge between operator theory and AI-enhanced optimization. By identifying explicit geometric and probabilistic conditions under which data-driven perturbations preserve contraction properties, this work provides a foundational analytical framework for hybrid optimization–learning systems [38].

Artificial intelligence components can enrich directional information without compromising spectral stability, convergence rates, or admissible conditioning ranges—provided their alignment with descent geometry remains controlled.

The contraction–bias–variance decomposition thus offers a transparent structural lens through which modern AI-integrated optimization algorithms can be analyzed, designed, and rigorously validated.

Author Contributions

Conceptualization, M.A.C.G.; methodology, G.P.-L. and A.L.M.S.; software, G.P.-L.; validation, A.L.M.S.; formal analysis, G.P.-L. and M.A.C.G.; investigation, G.P.-L.; writing—review and editing, M.A.C.G. and A.L.M.S.; supervision, M.A.C.G.; project administration, A.L.M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All numerical experiments reported in this study were implemented in Python exclusively using the NumPy library. The full source code, including data generation, algorithm implementations, and automated reproduction of all tables, is available for editorial and peer-review purposes. The computational scripts ensure deterministic reproducibility through controlled random seeds and complete documentation of all hyperparameters. The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Robbins, H.; Monro, S. A Stochastic Approximation Method. Ann. Math. Stat. 1951, 22, 400–407.
Polyak, B.T. Introduction to Optimization; Optimization Software: New York, NY, USA, 1987.
Nemirovski, A.; Juditsky, A.; Lan, G.; Shapiro, A. Robust Stochastic Approximation Approach to Stochastic Programming. SIAM J. Optim. 2009, 19, 1574–1609.
Ermoliev, Y.M. Stochastic Quasigradient Methods and Their Applications. Stochastics 1988, 25, 1–23.
Pérez-Lechuga, G. A Perspective on Stochastic Search Efficiency via Quasigradient Techniques in Constrained Models. Am. J. Oper. Res. 2025, 15, 195–221.
Pflug, G.C. Optimization of Stochastic Models; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1996.
Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1970.
Bertsekas, D.P. Nonlinear Programming, 2nd ed.; Athena Scientific: Belmont, MA, USA, 1999.
Nesterov, Y. Introductory Lectures on Convex Optimization; Springer: New York, NY, USA, 2004.
Bottou, L.; Curtis, F.E.; Nocedal, J. Optimization Methods for Large-Scale Machine Learning. SIAM Rev. 2018, 60, 223–311.
Rubinstein, R.Y.; Kroese, D.P. The Cross-Entropy Method; Springer: New York, NY, USA, 2004.
Rockafellar, R.T.; Wets, R.J.-B. Variational Analysis; Springer: Berlin/Heidelberg, Germany, 1998.
Beck, A.; Teboulle, M. Mirror Descent and Nonlinear Projected Subgradient Methods for Convex Optimization. Oper. Res. Lett. 2003, 31, 167–175.
Nemirovski, A. Prox-Method with Rate of Convergence O(1/t) for Variational Inequalities with Lipschitz Continuous Monotone Operators. SIAM J. Optim. 2004, 15, 229–251.
Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In ICLR; ArXiv: Ithaca, NY, USA, 2015.
Shapiro, A.; Dentcheva, D.; Ruszczyński, A. Lectures on Stochastic Programming: Modeling and Theory; SIAM: Philadelphia, PA, USA, 2009.
Wang, K.; Li, D.; Wang, S. A modified RMIL conjugate gradient-based projection algorithm for constrained nonlinear equations: Application to image denoising. Demonstr. Math. 2025, 58, 20250200.
Mewomo, O.T.; Uzor, V.A.; Agyingi, E. Efficient algorithms for solving a class of generalized inclusion problem with application to optimal control. Rend. Circ. Mat. Palermo 2026, 75, 58.
Bauschke, H.H.; Combettes, P.L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces; Springer: New York, NY, USA, 2011.
Clarke, F.H. Optimization and Nonsmooth Analysis; SIAM: Philadelphia, PA, USA, 1990.
Devolder, O.; Glineur, F.; Nesterov, Y. First-order methods of smooth convex optimization with inexact oracle. Math. Program. 2014, 146, 37–75.
Schmidt, M.; Roux, N.L.; Bach, F. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems 24 (NeurIPS 2011); Curran Associates, Inc.: Red Hook, NY, USA, 2011; pp. 1458–1466.
Karimireddy, S.P.; Rebjock, Q.; Stich, S.U.; Jaggi, M. Error feedback fixes SignSGD and other gradient compression schemes. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019.
Ghadimi, S.; Lan, G. Nonconvex Stochastic Programming. SIAM J. Optim. 2013, 23, 2341–2368.
Bubeck, S. Convex Optimization: Algorithms and Complexity. Found. Trends Mach. Learn. 2015, 8, 231–357.
Zinkevich, M. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML 2003); AAAI Press: Washington, DC, USA, 2003; pp. 928–936.
Hazan, E. Introduction to Online Convex Optimization. Found. Trends Optim. 2016, 2, 157–325.
Bertsekas, D.P.; Tsitsiklis, J.N. Neuro-Dynamic Programming; Athena Scientific: Belmont, MA, USA, 1996.
Kushner, H.J.; Yin, G.G. Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed.; Springer: New York, NY, USA, 2003.
Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 1989, 2, 303–314.
Hornik, K.; Stinchcombe, M.; White, H. Multilayer Feedforward Networks Are Universal Approximators. Neural Netw. 1989, 2, 359–366.
Hardt, M.; Recht, B.; Singer, Y. Stability of Stochastic Gradient Descent. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016); PMLR: New York, NY, USA, 2016; Volume 48, pp. 1225–1234.
Polyak, B.T. Some Methods of Speeding Up the Convergence of Iteration Methods. USSR Comput. Math. Math. Phys. 1964, 4, 1–17.
Polyak, B.T.; Juditsky, A.B. Acceleration of Stochastic Approximation by Averaging. SIAM J. Control Optim. 1992, 30, 838–855.
Nesterov, Y. A Method for Solving the Convex Programming Problem with Convergence Rate O(1/k²). Sov. Math. Dokl. 1983, 27, 372–376.
Ermoliev, Y. Stochastic Quasigradient Methods and Their Application to System Optimization; Springer: Berlin/Heidelberg, Germany, 1976.
Free Software, Open Standards, and Web Services for Interactive Computing Across All Programming Languages (Introduction to the JupyterLab and Jupyter Notebooks). Available online: https://jupyter.org/try-jupyter/lab/ (accessed on 15 February 2025).
Karimi, H.; Nutini, J.; Schmidt, M. Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak–Łojasiewicz Condition. In ECML PKDD 2016 LNCS 9851; Springer: Berlin/Heidelberg, Germany, 2016; pp. 795–811.

Table 1. Sensitivity with respect to $η$ and $α$ under simplex constraints.

$η$	$α$	Final Error	PG Baseline Error
0.1	0.05	0.012	0.018
0.3	0.05	0.021	0.022
0.5	0.05	0.038	0.030
0.3	0.10	0.027	0.035

Table 2. Sensitivity with respect to $η$ and R under norm constraints.

$η$	R	Final Error	PG Baseline Error
0.1	1.0	0.015	0.020
0.3	1.0	0.026	0.028
0.5	1.0	0.044	0.037
0.3	0.5	0.032	0.041

Table 3. Empirical MSE after $K = 10^{4}$ iterations (diminishing steps).

Method	$σ^{2} = 0.01$	$σ^{2} = 0.1$	$σ^{2} = 0.5$
GD	$1.2 \times 10^{- 6}$	–	–
SGD	$3.5 \times 10^{- 5}$	$8.7 \times 10^{- 5}$	$4.1 \times 10^{- 4}$
Momentum	$2.9 \times 10^{- 5}$	$9.5 \times 10^{- 5}$	$5.2 \times 10^{- 4}$
Learning-Augmented ( $η = 0$ )	$3.6 \times 10^{- 5}$	$8.9 \times 10^{- 5}$	$4.3 \times 10^{- 4}$

Table 4. Steady-State MSE (constant steps, $η = 0.3$).

Method	$σ^{2} = 0.01$	$σ^{2} = 0.1$	$σ^{2} = 0.5$
SGD	$1.8 \times 10^{- 4}$	$1.6 \times 10^{- 3}$	$8.4 \times 10^{- 3}$
Momentum	$2.4 \times 10^{- 4}$	$2.2 \times 10^{- 3}$	$1.1 \times 10^{- 2}$
Learning-Augmented	$2.3 \times 10^{- 4}$	$2.1 \times 10^{- 3}$	$1.0 \times 10^{- 2}$

Table 5. Effect of alignment parameter $η$ ($σ^{2} = 0.1$).

$η$	0	0.3	0.6
Steady-State MSE	$1.6 \times 10^{- 3}$	$2.1 \times 10^{- 3}$	$3.9 \times 10^{- 3}$

Table 6. Structural comparison of first-order methods.

Method	Rate	Error Floor	Alignment Sensitivity	Conditioning
GD	$O ({(1 - α μ)}^{k})$	0	None	High
SGD	$O (1 / k)$	$\propto σ^{2} / μ$	Low	Moderate
Momentum	Faster transient	Amplified	Moderate	High
Learning-A.	$O (1 / k)$	$\propto σ^{2} / (μ (1 - η))$	Explicit via $η$	Moderate

Table 7. Final performance metrics ($K = 10^{4}$, $η = 0.3$, $κ \approx 50$).

Method	$E ∥ x_{K} - x^{*} ∥^{2}$	$E [f (x_{K}) - f (x^{*})]$	$E ∥ \nabla f (x_{K}) ∥$
GD	$1.2 \times 10^{- 6}$	$6.3 \times 10^{- 7}$	$2.1 \times 10^{- 4}$
SGD ( $σ^{2} = 0.1$ )	$8.9 \times 10^{- 5}$	$3.5 \times 10^{- 5}$	$6.2 \times 10^{- 3}$
Momentum ( $σ^{2} = 0.1$ )	$9.5 \times 10^{- 5}$	$3.9 \times 10^{- 5}$	$6.8 \times 10^{- 3}$
Learning-Augmented ( $σ^{2} = 0.1$ )	$8.7 \times 10^{- 5}$	$3.4 \times 10^{- 5}$	$6.1 \times 10^{- 3}$

Table 8. Sensitivity to variance, conditioning, and alignment.

Parameter	Setting	Steady-State MSE	Observed Trend
$σ^{2}$	0.01 → 0.5	$1.8 \times 10^{- 4} \to 8.4 \times 10^{- 3}$	Linear scaling
$κ$	10 → 200	$6.5 \times 10^{- 4} \to 4.2 \times 10^{- 3}$	Slower contraction
$η$	0 → 0.6	$1.6 \times 10^{- 3} \to 3.9 \times 10^{- 3}$	$\propto 1 / (1 - η)$

Table 9. Theoretical vs. empirical steady-state error.

$σ^{2}$	$η$	$μ$	Theoretical $E_{theory}$	Empirical MSE	Ratio (Emp/Theory)
0.01	0.0	0.5	$0.020$	$0.019$	0.95
0.10	0.0	0.5	$0.200$	$0.184$	0.92
0.10	0.3	0.5	$0.286$	$0.268$	0.94
0.50	0.3	0.5	$1.429$	$1.352$	0.95
0.10	0.6	0.5	$0.500$	$0.472$	0.94

