Article

Incremental Weak Subgradient Methods for Non-Smooth Non-Convex Optimization Problems

by Narges Araboljadidi †,‡ and Valentina De Simone *,‡
Department of Mathematics and Physics, University of Campania “Luigi Vanvitelli”, Viale Abramo Lincoln, 5, 81100 Caserta, Italy
*
Author to whom correspondence should be addressed.
Current address: Viale Abramo Lincoln, 5, 81100 Caserta, Italy.
These authors contributed equally to this work.
Information 2025, 16(6), 509; https://doi.org/10.3390/info16060509
Submission received: 24 April 2025 / Revised: 29 May 2025 / Accepted: 12 June 2025 / Published: 19 June 2025
(This article belongs to the Special Issue Emerging Research in Optimization and Machine Learning)

Abstract

Non-smooth, non-convex optimization problems frequently arise in modern machine learning applications, yet solving them efficiently remains a challenge. This paper addresses the minimization of functions of the form $f(x) = \sum_{i=1}^{m} f_i(x)$, where each component is Lipschitz continuous but potentially non-smooth and non-convex. We extend the incremental subgradient method by incorporating weak subgradients, resulting in a framework better suited for non-convex objectives. We provide a comprehensive convergence analysis for three step size strategies: constant, diminishing, and a novel dynamic approach. Our theoretical results show that all variants converge to a neighborhood of the optimal solution, with the size of this neighborhood governed by the weak subgradient parameters. Numerical experiments on classification tasks with non-convex regularization, evaluated on the Breast Cancer Wisconsin dataset, demonstrate the effectiveness of the proposed approach. In particular, the dynamic step size method achieves superior practical performance, outperforming both classical and diminishing step size variants in terms of accuracy and convergence speed. These results position the incremental weak subgradient framework as a promising tool for scalable and efficient optimization in machine learning settings involving non-convex objectives.

1. Introduction

Optimization problems involving the minimization of a sum of component functions arise in numerous applications spanning machine learning, neural network training, and large-scale data analysis. These problems take the general form
$$\min_{x \in X} f(x) := \sum_{i=1}^{m} f_i(x).$$
Assume that each $f_i: \mathbb{R}^d \to \mathbb{R}$, $i = 1, \ldots, m$, is $L_i$-Lipschitz continuous on $X$ but not necessarily convex or smooth, and that $X$ is a nonempty, closed, and convex subset of $\mathbb{R}^d$. While traditional subgradient methods are well established for convex optimization, their extension to non-convex settings requires more sophisticated tools. Although widely applicable, traditional subgradient methods suffer from significant numerical instability near non-differentiable points, leading to erratic updates and slow convergence rates [1]. To address the limitations of classical subgradient methods in non-smooth optimization, we adopt the framework of weak subgradients, a relaxed notion of the standard subdifferential. This approach offers several advantages:
  • Numerical stability: Weak subgradients mitigate the high variance of standard subgradients near non-differentiable points, enabling more stable and smoother convergence trajectories [2].
  • Scalability: In large-scale problems like empirical risk minimization, they support efficient incremental or stochastic updates without requiring full subgradient computations [3].
  • Adaptive step sizes: Their synergy with dynamic or diminishing step size schemes enhances both convergence speed and robustness in practice [4].
  • Non-smooth structures: Particularly effective in settings with non-smooth regularizers (e.g., $\ell_1$), piecewise-linear losses (e.g., hinge loss), or discontinuous activations, where classical subgradients are ill defined [5,6].
One such advancement is the weak subgradient concept introduced by Azimov and Gasimov [7], which replaces the supporting hyperplanes used in convex analysis with supporting conic surfaces to the graph of a function. This approach enables the handling of non-convex functions without requiring convexity assumptions.
A weak subgradient is defined as a pair $(v, c) \in \mathbb{R}^d \times \mathbb{R}_+$ such that the graph of the superlinear function $\eta_{(v,c)}(x) = \langle v, x\rangle - c\|x\|$ supports the epigraph of the function under consideration at a given point [8]. This construction significantly expands the class of functions that can be effectively analyzed, including many non-smooth and non-convex functions that arise in practical applications.
The relationship between directional derivatives and weak subdifferentials established by Kasimbeyli and Mammadov [9] provides a theoretical foundation for developing computational methods for non-convex optimization. Their work showed that under certain conditions, the directional derivative can be expressed as a “sup” relation using elements from the weak subdifferential, which generalizes the well-known “max-relation” from convex analysis.
The computational burden of processing large objective functions led to the development of incremental methods, which process component functions sequentially rather than evaluating the entire objective at each iteration. The incremental subgradient algorithm was first systematically studied by Nedic and Bertsekas [1], who established convergence properties under various step size rules. Their work was motivated by applications in neural network training [10,11], image reconstruction, and machine learning problems where the objective function naturally decomposes into a sum of many component functions. Earlier incremental gradient methods for smooth problems [12,13,14] provided important foundations, but extensions to non-smooth settings required more sophisticated analysis.
A critical factor affecting the performance of subgradient methods is the step size selection strategy. Traditional approaches include: constant step sizes, which typically converge to a neighborhood of the optimal solution whose size depends on the step magnitude [15]; diminishing step sizes satisfying the standard conditions $\alpha_n \to 0$ and $\sum_n \alpha_n = \infty$, which guarantee asymptotic convergence but often exhibit slow practical performance [16]; and Polyak step sizes, which utilize estimates of the optimal function value to determine appropriate step lengths [16].
A significant advancement came with Goffin and Kiwiel’s [17] introduction of dynamic step size rules, which automatically adapt based on the algorithm’s progress. This approach was extended to incremental settings by Nedic and Bertsekas [1], where it demonstrated superior empirical performance compared to other step size strategies. Yang and Wang [18] further refined this approach, proposing two modified dynamic step size rules that demonstrated improved convergence properties and practical performance for separable convex optimization problems.
While convex optimization benefits from well-established theory and methods, non-convex problems present additional challenges due to the potential presence of multiple local minima and saddle points. The concept of weak subgradients offers a more suitable framework for non-convex optimization, providing a generalization that accommodates the complex geometry of non-convex functions while maintaining useful theoretical properties. Recent work by Wang [19] has explored subgradient algorithms on Riemannian manifolds with lower bounded curvatures, while Nesterov [20] contributed with primal–dual subgradient methods, establishing important theoretical foundations. Yao et al. [21] explored projected subgradient algorithms for pseudomonotone equilibrium problems, demonstrating the breadth of applications for advanced subgradient techniques.
This paper addresses a significant gap in the literature by extending incremental subgradient methods to non-smooth, non-convex optimization problems using weak subgradients. Our primary contributions include the following: (1) Development of an incremental weak subgradient framework that accommodates non-convex component functions while maintaining theoretical convergence guarantees. (2) Comprehensive analysis of three step size strategies (constant, diminishing, and a novel dynamic approach), establishing that all variants converge to a neighborhood of optimal solutions, with the neighborhood size determined by weak subgradient parameters. (3) Introduction of a modified dynamic step size algorithm with practical enhancements that improve robustness and convergence behavior, particularly for challenging optimization landscapes. (4) Empirical validation through extensive numerical experiments on classification tasks with non-convex regularization, demonstrating our approach's practical effectiveness compared to conventional methods.
Our theoretical analysis builds upon and extends the work of Yang and Wang [18] on incremental subgradient methods with dynamic step sizes for convex optimization, adapting their insights to the more challenging non-convex setting through the use of weak subgradients. The proposed methods maintain the computational efficiency of incremental approaches while accommodating the additional complexity introduced by non-convexity.
We establish that all three variants of our algorithm—with constant, diminishing, and dynamic step sizes—converge to a neighborhood of the optimal solution, with the size of this neighborhood determined by the parameters of the weak subgradients. Notably, our dynamic step size approach demonstrates superior practical performance by automatically adjusting to problem characteristics during the optimization process.
The theoretical contributions of this paper build upon extensive literature, extending the analysis to accommodate both the incremental processing of component functions and the use of weak subgradients for non-convex optimization. By establishing convergence guarantees for constant, diminishing, and dynamic step size strategies in this context, we bridge an important gap in the existing theory while providing practical algorithms for challenging real-world problems.
The remainder of this paper is organized as follows: Section 2 introduces preliminary concepts and properties of weak subgradients. Section 3 presents the incremental weak subgradient algorithm and analyzes its convergence properties under different step size strategies. Section 4 provides experimental results on classification tasks with non-convex regularization, Section 5 reports a first scalability evaluation of a preliminary parallel implementation, and Section 6 concludes with a discussion of implications and future research directions.

2. Preliminaries

In this section, we explore the weak subdifferential and its fundamental properties. Although not all of the theorems presented are directly used in the subsequent proofs, they serve to establish a foundational understanding of the weak subdifferential. For a deeper insight into this topic, readers are encouraged to consult references [7,9,22].
Definition 1.
Let $f: S \to (-\infty, +\infty]$ be a given function, where $S$ is a nonempty and compact subset of $\mathbb{R}^d$, and let $\bar{x} \in S$. A pair $(v, c) \in \mathbb{R}^d \times \mathbb{R}_+$ is called a weak subgradient of $f$ at $\bar{x}$ on $S$ if
$$f(x) \ge f(\bar{x}) + \langle v, x - \bar{x}\rangle - c\,\|x - \bar{x}\|, \qquad \forall x \in S.$$
The set
$$\partial^w_S f(\bar{x}) = \big\{(v, c) \in \mathbb{R}^d \times \mathbb{R}_+ : f(x) \ge f(\bar{x}) + \langle v, x - \bar{x}\rangle - c\,\|x - \bar{x}\|,\ \forall x \in S\big\}$$
of all weak subgradients of $f$ at $\bar{x}$ is called the weak subdifferential of $f$ at $\bar{x}$ on $S$.
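As a concrete illustration (our own example, not taken from [7,8]), consider $f(x) = -|x|$ on $S = [-1, 1]$. This function is concave, hence non-convex, and has no classical subgradient at $\bar{x} = 0$, yet the pair $(v, c) = (0, 1)$ is a weak subgradient there, since $-|x| \ge 0 + 0\cdot x - 1\cdot|x|$ holds with equality for every $x$. A minimal numerical check of the inequality in Definition 1 on a sample of $S$:

```python
import numpy as np

def is_weak_subgradient(f, x_bar, v, c, samples, tol=1e-12):
    """Check the inequality of Definition 1 on a finite sample of S."""
    f_bar = f(x_bar)
    for x in samples:
        lower = f_bar + np.dot(v, x - x_bar) - c * np.linalg.norm(x - x_bar)
        if f(x) < lower - tol:        # inequality violated at x
            return False
    return True

# f(x) = -|x| is concave (non-convex) and has no classical subgradient at 0,
# but (v, c) = (0, 1) is a weak subgradient there.
f = lambda x: -np.abs(x).sum()
S = [np.array([t]) for t in np.linspace(-1.0, 1.0, 201)]
print(is_weak_subgradient(f, np.array([0.0]), np.array([0.0]), 1.0, S))  # True
print(is_weak_subgradient(f, np.array([0.0]), np.array([0.0]), 0.5, S))  # False: c too small
```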
Definition 2.
A function $f: S \to (-\infty, +\infty]$ is called locally lower Lipschitz at $\bar{x} \in S$ if there exist a non-negative number $L$ (Lipschitz constant) and a neighborhood $N(\bar{x})$ of $\bar{x}$ such that
$$f(x) - f(\bar{x}) \ge -L\,\|x - \bar{x}\|, \qquad \forall x \in N(\bar{x}).$$
If the above inequality holds for all $x \in S$, then $f$ is called lower Lipschitz at $\bar{x}$ with Lipschitz constant $L$.
Theorem 1.
Let $f: S \to (-\infty, +\infty]$ be a given function.
  • Let $f$ be lower semi-continuous at $\bar{x} \in S$. Then, $f$ is weakly subdifferentiable at $\bar{x}$ if it is locally lower Lipschitz at $\bar{x}$ and there exists a point $\tilde{x} \in S$ at which $f$ is subdifferentiable.
  • The function $f$ is weakly subdifferentiable at $\bar{x}$ if $f$ is lower Lipschitz at $\bar{x}$.
  • The function $f$ is weakly subdifferentiable at $\bar{x}$ if it is locally lower Lipschitz at $\bar{x}$ and bounded below.
  • The function $f$ is weakly subdifferentiable at $\bar{x}$ if $f$ is locally lower Lipschitz at $\bar{x}$ and there exist numbers $p \ge 0$ and $q$ such that $f(x) \ge -p\,\|x\| + q$ for all $x \in S$.
  • If $f$ is a positively homogeneous function bounded from below on some neighborhood of $0_{\mathbb{R}^d}$, then $f$ is weakly subdifferentiable at $0_{\mathbb{R}^d}$.
Theorem 2.
The weak subdifferential $\partial^w f(\cdot)$ of a function $f$ is convex and closed.

3. Incremental Subgradient Method for Non-Smooth Non-Convex Optimization

The incremental weak subgradient method addresses problem (3) by processing the component functions sequentially in cycles. The following algorithm presents the core update procedure for a single cycle.
The critical element that determines the performance of Algorithm 1 is the choice of step size α n . In this section, we analyze three strategies for selecting the step size: constant, diminishing, and a novel dynamic approach. For each strategy, we establish convergence properties and characterize the quality of the solutions obtained. We use the notation
$$d_X = \operatorname{diam}(X) = \max_{x_1, x_2 \in X}\|x_1 - x_2\|, \qquad f^* = f(x^*) = \inf_{x \in X} f(x).$$
At each iteration $n$, the weak subgradient approximation algorithm of [8] can be used. This algorithm is based on the approximate representation of the weak subdifferential $\partial^w_X f_i(\varphi_{i-1,n})$, which assumes that there exist positive numbers $D_i$ and $K_i$ such that
$$\|g_{i,n}\| \le D_i, \qquad c_{i,n} \le K_i, \qquad \forall n \ge 0,\ i = 1, \ldots, m.$$
Algorithm 1 Incremental weak subgradient update step.
Require: Current point $x_n$, step size $\alpha_n$
1: Set $\varphi_{0,n} = x_n$
2: for $i = 1$ to $m$ do
3:    Select $(g_{i,n}, c_{i,n}) \in \partial^w_X f_i(\varphi_{i-1,n})$ such that $\|g_{i,n}\|$ is minimized
4:    $\varphi_{i,n} = P_X(\varphi_{i-1,n} - \alpha_n g_{i,n})$
5: end for
6: return $\varphi_{m,n}$
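The paper's experiments were implemented in MATLAB; purely as an illustration, the following Python/NumPy sketch mirrors one cycle of Algorithm 1 under our own assumptions: each component function supplies a weak subgradient oracle returning a pair $(g_{i,n}, c_{i,n})$, and a Euclidean projection onto $X$ is available.

```python
import numpy as np

def incremental_cycle(x_n, alpha_n, weak_subgrad_oracles, project_X):
    """One cycle of Algorithm 1 (illustrative sketch).

    weak_subgrad_oracles -- list of m callables; oracle(phi) returns a pair
                            (g_i, c_i) in the weak subdifferential of f_i at phi,
                            with ||g_i|| as small as the oracle can provide.
    project_X            -- callable implementing the Euclidean projection P_X.
    """
    phi = np.array(x_n, dtype=float)                       # phi_{0,n} = x_n
    for oracle in weak_subgrad_oracles:                    # i = 1, ..., m
        g_i, c_i = oracle(phi)                             # select (g_{i,n}, c_{i,n})
        phi = project_X(phi - alpha_n * np.asarray(g_i))   # projected weak subgradient step
    return phi                                             # phi_{m,n}, taken as x_{n+1}

# For unconstrained problems the projection reduces to the identity map.
identity_projection = lambda z: z
```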
Lemma 1.
Let $\{x_n\}$ be the sequence generated by the incremental weak subgradient method. Then, for all $n \ge 0$ and $y \in X$, we have
$$\|x_{n+1}-y\|^2 \le \|x_n-y\|^2 - 2\alpha_n\big(f(x_n)-f(y)\big) + M_1\alpha_n^2 + M_2\alpha_n,$$
where $M_1 := 2\sum_{i=1}^{m}\Big\{(L_i + K_i)\sum_{j=1}^{i-1} D_j + D_i^2\Big\}$ and $M_2 := 2\,d_X\sum_{i=1}^{m} K_i$.
Proof. 
Using the nonexpansion property of the projection, the weak subgradient boundedness, and the weak subgradient inequality for each component function $f_i$, we obtain for all $y \in X$
$$\begin{aligned}
\|\varphi_{i,n}-y\|^2 &= \|P_X(\varphi_{i-1,n}-\alpha_n g_{i,n}) - y\|^2 \\
&\le \|\varphi_{i-1,n}-\alpha_n g_{i,n} - y\|^2 \\
&= \|\varphi_{i-1,n}-y\|^2 - 2\alpha_n g_{i,n}^T(\varphi_{i-1,n}-y) + \alpha_n^2\|g_{i,n}\|^2 \\
&\le \|\varphi_{i-1,n}-y\|^2 - 2\alpha_n g_{i,n}^T(\varphi_{i-1,n}-y) + \alpha_n^2 D_i^2 \\
&\le \|\varphi_{i-1,n}-y\|^2 - 2\alpha_n\big(f_i(\varphi_{i-1,n}) - f_i(y) - c_{i,n}\|y-\varphi_{i-1,n}\|\big) + \alpha_n^2 D_i^2.
\end{aligned}$$
Since $\|\varphi_{i-1,n}-x_n\| \le \alpha_n\sum_{j=1}^{i-1} D_j$ for $i = 1, \ldots, m$ and $n \ge 0$, by adding these inequalities over $i = 1, \ldots, m$, we have, for all $y \in X$ and $n$,
$$\begin{aligned}
\|x_{n+1}-y\|^2 = \|\varphi_{m,n}-y\|^2 &\le \|x_n-y\|^2 - 2\alpha_n\sum_{i=1}^{m}\big(f_i(\varphi_{i-1,n}) - f_i(y) - c_{i,n}\|y-\varphi_{i-1,n}\|\big) + \alpha_n^2\sum_{i=1}^{m} D_i^2 \\
&= \|x_n-y\|^2 - 2\alpha_n\Big(f(x_n)-f(y) + \sum_{i=1}^{m}\big[f_i(\varphi_{i-1,n}) - f_i(x_n) - c_{i,n}\|\varphi_{i-1,n}-y\|\big]\Big) + \alpha_n^2\sum_{i=1}^{m} D_i^2 \\
&\le \|x_n-y\|^2 + 2\alpha_n\Big(f(y)-f(x_n) + \sum_{i=1}^{m}\big[L_i\|\varphi_{i-1,n}-x_n\| + K_i\big(\|\varphi_{i-1,n}-x_n\| + \|x_n-y\|\big)\big]\Big) + \alpha_n^2\sum_{i=1}^{m} D_i^2 \\
&\le \|x_n-y\|^2 - 2\alpha_n\big(f(x_n)-f(y)\big) + \alpha_n^2\, 2\sum_{i=1}^{m}\Big\{(L_i+K_i)\sum_{j=1}^{i-1} D_j + D_i^2\Big\} + 2\,d_X\sum_{i=1}^{m} K_i\,\alpha_n \\
&= \|x_n-y\|^2 - 2\alpha_n\big(f(x_n)-f(y)\big) + M_1\alpha_n^2 + M_2\alpha_n,
\end{aligned}$$
where $M_1 := 2\sum_{i=1}^{m}\Big\{(L_i + K_i)\sum_{j=1}^{i-1} D_j + D_i^2\Big\}$ and $M_2 := 2\,d_X\sum_{i=1}^{m} K_i$.    □

3.1. Convergence Analysis for the Constant Step Size

We start the convergence analysis with the constant step size rule.
Proposition 1.
For the sequence $\{x_n\}$ generated by the incremental weak subgradient method with a constant step size $\alpha_n = \alpha$, we have
$$\liminf_{n\to\infty} f(x_n) \le f^* + \frac{\alpha M_1 + M_2}{2}.$$
Proof. 
If the result does not hold, there must exist an $\epsilon > 0$ such that
$$\liminf_{n\to\infty} f(x_n) > f^* + \frac{\alpha M_1 + M_2}{2} + 2\epsilon.$$
Let $\hat{y} \in X$ be such that
$$\liminf_{n\to\infty} f(x_n) \ge f(\hat{y}) + \frac{\alpha M_1 + M_2}{2} + 2\epsilon,$$
and let $k_0$ be large enough so that for all $n \ge k_0$,
$$f(x_n) \ge \liminf_{k\to\infty} f(x_k) - \epsilon.$$
By adding the preceding two relations, we obtain, for all $n \ge k_0$,
$$f(x_n) - f(\hat{y}) \ge \frac{\alpha M_1 + M_2}{2} + \epsilon.$$
Using Lemma 1 for the case where $y = \hat{y}$ together with the above relation, we obtain, for all $n \ge k_0$,
$$\|x_{n+1}-\hat{y}\|^2 \le \|x_n-\hat{y}\|^2 - 2\alpha\epsilon \le \|x_{n-1}-\hat{y}\|^2 - 2\cdot 2\alpha\epsilon \le \cdots \le \|x_{k_0}-\hat{y}\|^2 - 2(n+1-k_0)\alpha\epsilon,$$
which cannot hold for $n$ sufficiently large, so it is a contradiction.    □

3.2. Convergence Analysis for the Diminishing Step Size

Proposition 2.
Assume that the step size $\alpha_n$ satisfies
$$\lim_{n\to\infty}\alpha_n = 0, \qquad \sum_{n=0}^{\infty}\alpha_n = \infty.$$
Then, for the sequence $\{x_n\}$ generated by the incremental weak subgradient method, we have
$$\liminf_{n\to\infty} f(x_n) \le f^* + \frac{M_2}{2}.$$
Proof. 
Assume, to arrive at a contradiction, that the above inequality does not hold; then there exist an $\epsilon > 0$ and a point $\hat{y} \in X$ such that
$$\liminf_{n\to\infty} f(x_n) > f(\hat{y}) + \frac{M_2}{2} + \epsilon.$$
Let $k_0$ be large enough so that for all $n \ge k_0$, we have
$$f(x_n) - f(\hat{y}) \ge \frac{M_2}{2} + \epsilon.$$
Again, by using Lemma 1 in a way similar to the proof of Proposition 1, we obtain
$$\begin{aligned}
\|x_{n+1}-\hat{y}\|^2 &\le \|x_n-\hat{y}\|^2 - 2\alpha_n\Big(\frac{M_2}{2}+\epsilon\Big) + M_1\alpha_n^2 + M_2\alpha_n \\
&= \|x_n-\hat{y}\|^2 - 2\alpha_n\epsilon + M_1\alpha_n^2 \\
&= \|x_n-\hat{y}\|^2 - \alpha_n(2\epsilon - \alpha_n M_1).
\end{aligned}$$
Because $\alpha_n \to 0$, without loss of generality we may assume that $k_0$ is large enough so that $\alpha_n M_1 \le \epsilon$ for all $n \ge k_0$, implying that
$$\|x_{n+1}-\hat{y}\|^2 \le \|x_n-\hat{y}\|^2 - \alpha_n\epsilon \le \cdots \le \|x_{k_0}-\hat{y}\|^2 - \epsilon\sum_{j=k_0}^{n}\alpha_j.$$
Since $\sum_{j=k_0}^{\infty}\alpha_j = \infty$, this relation cannot hold for $n$ sufficiently large, so it is a contradiction.    □
Proposition 3.
Assume that the step size $\alpha_n$ satisfies
$$\sum_{n=0}^{\infty}\alpha_n = \infty, \qquad \sum_{n=0}^{\infty}\alpha_n^2 < \infty.$$
Then, for the sequence $\{x_n\}$ generated by the incremental weak subgradient method, we have
$$\liminf_{n\to\infty} f(x_n) \le f^* + \frac{M_2}{2}.$$
Proof. 
Assume, to arrive at a contradiction, that the above inequality does not hold; then there exist an $\epsilon > 0$ and a point $\hat{y} \in X$ such that
$$\liminf_{n\to\infty} f(x_n) > f(\hat{y}) + \frac{M_2}{2} + \epsilon.$$
Let $k_0$ be large enough so that for all $n \ge k_0$, we have
$$f(x_n) - f(\hat{y}) \ge \frac{M_2}{2} + \epsilon.$$
Again, by using Lemma 1 in a way similar to the proof of Proposition 1, we obtain
$$\begin{aligned}
\|x_{n+1}-\hat{y}\|^2 &\le \|x_n-\hat{y}\|^2 - 2\alpha_n\epsilon + M_1\alpha_n^2 \\
&\le \|x_{n-1}-\hat{y}\|^2 - 2(\alpha_n+\alpha_{n-1})\epsilon + M_1(\alpha_n^2+\alpha_{n-1}^2) \\
&\le \cdots \le \|x_{k_0}-\hat{y}\|^2 - 2\epsilon\sum_{j=k_0}^{n}\alpha_j + M_1\sum_{j=k_0}^{n}\alpha_j^2.
\end{aligned}$$
Since $\sum_j \alpha_j = \infty$ and $\sum_j \alpha_j^2 < \infty$, the right-hand side tends to $-\infty$; according to the assumption, we reach a contradiction.    □

3.3. Convergence Analysis for the Dynamic Step Size

Now, we propose a dynamic step size strategy for the incremental weak subgradient method that adapts based on the progress of the optimization process. This approach offers advantages over both constant and diminishing step size methods by automatically adjusting the step size based on the current function value and a target level.
In Algorithm 2, the step size $\alpha_n$ is determined dynamically using the following formula:
$$\alpha_n = \max\left\{\gamma_n\,\frac{f(x_n) - f_{\mathrm{lev}}^n}{C^2},\ \alpha_{\min}\right\},$$
where
  • $f(x_n)$ is the current function value at iteration $n$;
  • $f_{\mathrm{lev}}^n = f_{\mathrm{rec}}^n - \delta_l$ is the target level, with $f_{\mathrm{rec}}^n$ being the best function value found so far;
  • $\gamma_n \in (0, \bar{\gamma})$, where $\bar{\gamma} < 2$ is a control parameter;
  • $\alpha_{\min} > 0$ is a minimum step size threshold.
Algorithm 2 Incremental weak subgradient method with dynamic step size.
Require: Initial point $x_0 \in X$, initial target level parameter $\delta_0 > 0$, minimum target level $\delta_{\min} > 0$
1: Set $n = 0$, $l = 0$, $f_{\mathrm{rec}}^{-1} = +\infty$.
2: while stopping criterion not satisfied do
3:    Compute $f(x_n)$
4:    if $f(x_n) < f_{\mathrm{rec}}^{n-1}$ then
5:       $f_{\mathrm{rec}}^{n} = f(x_n)$, $x_{\mathrm{rec}}^{n} = x_n$
6:    else
7:       $f_{\mathrm{rec}}^{n} = f_{\mathrm{rec}}^{n-1}$, $x_{\mathrm{rec}}^{n} = x_{\mathrm{rec}}^{n-1}$
8:    end if
9:    if $(0_{\mathbb{R}^d}, 0_{\mathbb{R}}) \in \partial^w f(x_n)$ then
10:      STOP
11:   end if
12:   if $\delta_l < \delta_{\mathrm{tol}}$ then
13:      RETURN $x_{\mathrm{rec}}^{n}$ {Early termination when the target level is sufficiently small}
14:   end if
15:   if $f(x_n) \le f_{\mathrm{rec}}^{n-1} - \frac{1}{2}\delta_l$ then
16:      $\delta_{l+1} = \delta_l$, $l = l + 1$
17:   else
18:      $\delta_{l+1} = \max\{\frac{\delta_0}{l+1}, \delta_{\min}\}$, $l = l + 1$
19:   end if
20:   Set $f_{\mathrm{lev}}^{n} = f_{\mathrm{rec}}^{n} - \delta_l$
21:   Set $C^2 = \sum_{i=1}^{m} D_i^2$ {Define C based on weak subgradient bounds}
22:   Consider the step size $\alpha_n = \max\{\gamma_n\,\frac{f(x_n) - f_{\mathrm{lev}}^{n}}{C^2}, \alpha_{\min}\}$, where $0 < \underline{\gamma} \le \gamma_n \le \bar{\gamma} < 2$
23:   Compute $\varphi_{m,n}$ according to Algorithm 1 and set $x_{n+1} = \varphi_{m,n}$
24:   Set $n = n + 1$
25: end while
26: return $x_{\mathrm{rec}}^{n}$
This adaptive step size strategy offers several advantages over traditional approaches:
  • Problem-aware adaptation: The term $f(x_n) - f_{\mathrm{lev}}^n$ allows the step size to automatically adjust based on the gap between the current function value and our target level. This enables larger steps when far from the target (accelerating progress) and smaller, more careful steps as we approach it (preventing overshooting).
  • Gradient normalization: The denominator $C^2$ normalizes the step size relative to the magnitudes of the weak subgradients, ensuring appropriate scaling regardless of the problem's conditioning.
  • Lower bound guarantee: The minimum step size $\alpha_{\min}$ prevents the algorithm from stalling in challenging regions of non-convex functions, such as plateaus or shallow local minima.
  • Theoretical convergence: The parameter $\gamma_n < 2$ ensures the method maintains theoretical convergence guarantees, while $\alpha_{\min}$ provides practical robustness.
We have enhanced the standard method with several practical improvements:
  • A minimum target level $\delta_{\min}$ to prevent excessive step size reduction;
  • Early termination based on a tolerance $\delta_{\mathrm{tol}}$;
  • Explicit definition of the parameter $C$ based on weak subgradient bounds;
  • A minimum step size $\alpha_{\min}$ for numerical stability.
Together, these modifications provide robustness for non-convex optimization while maintaining theoretical convergence guarantees. Unlike constant step sizes (which cannot adapt to varying problem difficulty) or diminishing step sizes (which decrease regardless of progress), our dynamic approach intelligently responds to the local geometry of the objective function, balancing theoretical convergence guarantees with practical efficiency for non-smooth, non-convex optimization problems.
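To make the bookkeeping concrete, the sketch below combines the target-level and step size updates of Algorithm 2 with the cycle from the Algorithm 1 sketch above (the `incremental_cycle` helper). It is a simplified, illustrative rendering under our own assumptions; it omits the stationarity test $(0_{\mathbb{R}^d}, 0_{\mathbb{R}}) \in \partial^w f(x_n)$ and is not the authors' MATLAB implementation.

```python
import numpy as np

def dynamic_incremental_method(x0, f, weak_subgrad_oracles, project_X, D,
                               delta0=1.0, delta_min=1e-3, delta_tol=1e-6,
                               gamma=1.5, alpha_min=1e-5, max_iter=200):
    """Algorithm 2 without the weak-stationarity test (illustrative sketch)."""
    C2 = float(np.sum(np.asarray(D, dtype=float) ** 2))   # C^2 = sum_i D_i^2
    x = np.array(x0, dtype=float)
    f_rec, x_rec = np.inf, x.copy()                       # best value / point so far
    delta, l = delta0, 0
    for n in range(max_iter):
        fx = f(x)
        f_rec_prev = f_rec
        if fx < f_rec:                                    # record improvement
            f_rec, x_rec = fx, x.copy()
        if delta < delta_tol:                             # early termination
            break
        if not (fx <= f_rec_prev - 0.5 * delta):          # insufficient progress:
            delta = max(delta0 / (l + 1), delta_min)      # shrink target level (floored)
        l += 1
        f_lev = f_rec - delta                             # target level
        alpha = max(gamma * (fx - f_lev) / C2, alpha_min) # dynamic step size
        x = incremental_cycle(x, alpha, weak_subgrad_oracles, project_X)
    return x_rec, f_rec
```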
The main reason for this selection rule for the sequence $\{\delta_l\}_{l\in\mathbb{N}}$ in non-convex optimization is to address the fundamentally different landscape characteristics compared to convex problems. In convex optimization, progress tends to be relatively predictable and monotonic, with the function value consistently decreasing as we approach the global minimum. Standard methods with fixed or simply diminishing step sizes work well because the landscape does not contain local minima or saddle points that complicate the trajectory.
Non-convex functions, however, present several unique challenges:
  • Multiple local minima that can trap optimization algorithms;
  • Saddle points where progress may temporarily stall;
  • Regions of varying curvature requiring different step sizes;
  • Plateaus and steep valleys that demand different exploration strategies.
The dynamic delta selection addresses these challenges by enabling the algorithm to automatically sense and adapt to the local geometry of the function. When the algorithm encounters steep descents, it maintains the current target level to capitalize on rapid progress. When it hits challenging regions like plateaus or approaches local minima, it reduces the target level to facilitate more careful exploration.
This adaptive mechanism acts as a form of “landscape sensing” that is not necessary in convex optimization but becomes crucial for efficiently navigating the complex topography of non-convex functions. The addition of a minimum target level δ min further enhances this approach by ensuring the algorithm maintains sufficient exploration capability to escape shallow local minima—a concern that does not exist in the convex case.
We now establish several important theoretical properties of this algorithm, starting with the behavior of the target level parameter.
Lemma 2 (Finiteness of Parameters).
Let $\{x_n\}$ be the sequence generated by Algorithm 2. Then, exactly one of the following statements holds:
  • The algorithm terminates with $(0_{\mathbb{R}^d}, 0_{\mathbb{R}}) \in \partial^w f(x_n)$;
  • The algorithm terminates with $\delta_l < \delta_{\mathrm{tol}}$;
  • $l \to \infty$ and $\liminf_{l\to\infty}\delta_l = \delta_{\min}$.
Proof. 
From the structure of Algorithm 2, there are only two explicit termination conditions:
  • Lines 9–11: The algorithm terminates if $(0_{\mathbb{R}^d}, 0_{\mathbb{R}}) \in \partial^w f(x_n)$, which indicates that a necessary condition for optimality holds.
  • Lines 12–14: The algorithm terminates if $\delta_l < \delta_{\mathrm{tol}}$, which is the early termination criterion based on the target level parameter becoming sufficiently small.
If neither of these conditions is met, the algorithm continues executing. We will now prove that in this case, $l \to \infty$ and $\liminf_{l\to\infty}\delta_l = \delta_{\min}$.
Assume, by contradiction, that the algorithm does not terminate due to either condition above, and that $l$ remains bounded with $\delta_l > \delta_{\min}$ for all $l$ beyond some index. Since $l$ is incremented in each iteration where the algorithm does not terminate, this means the algorithm must terminate after finitely many iterations, contradicting our assumption. Thus, if the algorithm does not terminate due to the explicit conditions, we must have $l \to \infty$.
Now, we need to show that $\liminf_{l\to\infty}\delta_l = \delta_{\min}$. According to the update rule in lines 15–19 of the algorithm,
$$\delta_{l+1} = \max\Big\{\frac{\delta_0}{l+1},\ \delta_{\min}\Big\}.$$
For sufficiently large $l$, either $\delta_{l+1} = \delta_l$ when progress is significant, or $\delta_{l+1}$ is determined by the formula above. Observe that for any $l > \frac{\delta_0}{\delta_{\min}} - 1$, we have $\frac{\delta_0}{l+1} < \delta_{\min}$, which means $\delta_{l+1} = \delta_{\min}$ for all sufficiently large $l$. Therefore, the sequence $\{\delta_l\}$ eventually becomes constant at $\delta_{\min}$, which implies $\liminf_{l\to\infty}\delta_l = \delta_{\min}$.
However, to be thorough, we will also prove that $\delta_l$ cannot remain above $\delta_{\min}$ indefinitely. Assume that there exists $\bar{\delta} > \delta_{\min}$ such that $\delta_l \ge \bar{\delta}$ for all $l$.
From the step size calculation in line 21 of the algorithm and the fact that $f_{\mathrm{lev}}^n = f_{\mathrm{rec}}^n - \delta_l$, with our redefined $C^2 = \sum_{i=1}^m D_i^2$,
$$\alpha_n \ge \max\Big\{\gamma_n\,\frac{f(x_n) - f_{\mathrm{rec}}^n + \bar{\delta}}{\sum_{i=1}^m D_i^2},\ \alpha_{\min}\Big\}.$$
We now consider two cases:
Case 1:  $f(x_n) - f_{\mathrm{rec}}^n \ge -\frac{\bar{\delta}}{2}$ for infinitely many $n$.
For these iterations:
$$\alpha_n \ge \max\Big\{\gamma_n\,\frac{\bar{\delta}/2}{\sum_{i=1}^m D_i^2},\ \alpha_{\min}\Big\} \ge \max\Big\{\frac{\underline{\gamma}\,\bar{\delta}}{2\sum_{i=1}^m D_i^2},\ \alpha_{\min}\Big\} = \mu > 0.$$
Using Lemma 1 with $y = x^*$, for the iterations where $\alpha_n \ge \mu$:
$$\|x_{n+1}-x^*\|^2 \le \|x_n-x^*\|^2 - 2\mu\big(f(x_n)-f^*\big) + M_1\mu^2 + M_2\mu.$$
If $f(x_n) > f^* + \frac{M_1\mu + M_2}{2} + \varepsilon$ for some $\varepsilon > 0$ and infinitely many $n$, then
$$\|x_{n+1}-x^*\|^2 \le \|x_n-x^*\|^2 - \mu\varepsilon.$$
Applying this inequality recursively for all such iterations would eventually make $\|x_n - x^*\|^2$ negative, which is a contradiction.
Therefore, for infinitely many iterations, we must have
$$f(x_n) \le f^* + \frac{M_1\mu + M_2}{2} + \varepsilon.$$
Since this holds for any $\varepsilon > 0$, we have $\liminf_{n\to\infty} f(x_n) \le f^* + \frac{M_1\mu + M_2}{2}$, where $\mu = \max\big\{\frac{\underline{\gamma}\,\bar{\delta}}{2\sum_{i=1}^m D_i^2}, \alpha_{\min}\big\}$.
Given that $f(x_n) - f_{\mathrm{rec}}^n \ge -\frac{\bar{\delta}}{2}$ for infinitely many $n$ in this case, we know that $f_{\mathrm{rec}}^n \le f(x_n) + \frac{\bar{\delta}}{2}$ for these iterations. Combined with the bound on $\liminf_n f(x_n)$, this implies that for infinitely many iterations,
$$f_{\mathrm{rec}}^n \le f^* + \frac{M_1\mu + M_2}{2} + \frac{\bar{\delta}}{2} + \varepsilon.$$
For sufficiently small $\varepsilon$, this means $f(x_n) \le f_{\mathrm{rec}}^n - \frac{\bar{\delta}}{2}$ for infinitely many $n$, which would cause $\delta_l$ to remain constant by the condition in line 15. This contradicts our assumption that $l \to \infty$.
Case 2:  $f(x_n) - f_{\mathrm{rec}}^n < -\frac{\bar{\delta}}{2}$ for all but finitely many $n$.
This means $f(x_n) < f_{\mathrm{rec}}^n - \frac{\bar{\delta}}{2}$ for almost all $n$. According to the update rule in line 15, this would cause $\delta_l$ to remain constant. Since we have established that $l \to \infty$, this is a contradiction.
Therefore, our assumption that $\delta_l \ge \bar{\delta} > \delta_{\min}$ for all $l$ must be false. Combined with our earlier observation that $\delta_l = \delta_{\min}$ for all sufficiently large $l$, we can conclude that $\liminf_{l\to\infty}\delta_l = \delta_{\min}$.
Thus, we have proven that exactly one of the three statements in the lemma must hold: either the algorithm terminates with $(0_{\mathbb{R}^d}, 0_{\mathbb{R}}) \in \partial^w f(x_n)$, or it terminates with $\delta_l < \delta_{\mathrm{tol}}$, or $l \to \infty$ and $\liminf_{l\to\infty}\delta_l = \delta_{\min}$.    □
Theorem 3 (Convergence to Neighborhood).
Let $\{x_n\}$ be the sequence generated by Algorithm 2. Then,
$$\liminf_{n\to\infty} f(x_n) \le f^* + \frac{M_2}{2},$$
where $M_2$ is as defined in Lemma 1.
Proof. 
We prove by contradiction. Assume the inequality does not hold. Then, there exists $\varepsilon > 0$ such that
$$\liminf_{n\to\infty} f(x_n) > f^* + \frac{M_2}{2} + \varepsilon.$$
By Lemma 2, exactly one of three cases must occur:
Case 1: The algorithm terminates with $(0_{\mathbb{R}^d}, 0_{\mathbb{R}}) \in \partial^w f(x_n)$.
In this case, $x_n$ is an optimal point, so $f(x_n) = f^*$. This directly contradicts our assumption that $\liminf_{n\to\infty} f(x_n) > f^* + \frac{M_2}{2} + \varepsilon$.
Case 2: The algorithm terminates with $\delta_l < \delta_{\mathrm{tol}}$.
If the algorithm terminates after finitely many iterations, $\liminf_{n\to\infty} f(x_n)$ is not well defined. However, we can interpret our contradiction assumption as meaning $f(x_n) > f^* + \frac{M_2}{2} + \varepsilon$ for the final iterate $x_n$. We will show this is impossible.
Case 3: The algorithm continues indefinitely with $l \to \infty$ and $\liminf_{l\to\infty}\delta_l = \delta_{\min}$.
This is the main case to analyze. Let $\hat{y} \in X$ be a point such that $f(\hat{y}) \le f^* + \frac{\varepsilon}{3}$ (such a point exists by the definition of $f^*$).
From our contradiction assumption, there exists an index $k_0$ such that for all $n \ge k_0$,
$$f(x_n) > f^* + \frac{M_2}{2} + \frac{2\varepsilon}{3}.$$
This implies that
$$f(x_n) - f(\hat{y}) > f^* + \frac{M_2}{2} + \frac{2\varepsilon}{3} - \Big(f^* + \frac{\varepsilon}{3}\Big) = \frac{M_2}{2} + \frac{\varepsilon}{3}.$$
Using Lemma 1 with $y = \hat{y}$,
$$\begin{aligned}
\|x_{n+1}-\hat{y}\|^2 &\le \|x_n-\hat{y}\|^2 - 2\alpha_n\big(f(x_n)-f(\hat{y})\big) + M_1\alpha_n^2 + M_2\alpha_n \\
&< \|x_n-\hat{y}\|^2 - 2\alpha_n\Big(\frac{M_2}{2}+\frac{\varepsilon}{3}\Big) + M_1\alpha_n^2 + M_2\alpha_n \\
&= \|x_n-\hat{y}\|^2 - \frac{2\varepsilon\alpha_n}{3} + M_1\alpha_n^2.
\end{aligned}$$
Let $N = \{n_j \mid j \in \mathbb{N},\ l \text{ increases at iteration } n_j\}$. Since we are in Case 3, where $l \to \infty$, this set is infinite.
From the step size calculation with our redefined $C^2 = \sum_{i=1}^m D_i^2$, and given that $f_{\mathrm{lev}}^n = f_{\mathrm{rec}}^n - \delta_l$, with $\delta_l$ approaching $\delta_{\min}$, there are infinitely many iterations where
$$\alpha_n \ge \alpha_{\min} > 0.$$
For sufficiently large $n \ge n_0$, we can choose $\alpha_{\min}$ small enough that
$$M_1\alpha_n < \frac{\varepsilon}{3}.$$
For these iterations, the inequality becomes
$$\|x_{n+1}-\hat{y}\|^2 \le \|x_n-\hat{y}\|^2 - \frac{2\varepsilon\alpha_n}{3} + \frac{\varepsilon\alpha_n}{3} = \|x_n-\hat{y}\|^2 - \frac{\varepsilon\alpha_n}{3}.$$
Applying this inequality recursively for all iterations from $n_0$ to $N$,
$$\|x_{N+1}-\hat{y}\|^2 < \|x_{n_0}-\hat{y}\|^2 - \frac{\varepsilon}{3}\sum_{n=n_0}^{N}\alpha_n.$$
Since $\alpha_n \ge \alpha_{\min} > 0$ for all $n$, and $N$ can be arbitrarily large, the sum $\sum_{n=n_0}^{N}\alpha_n$ can be made arbitrarily large as $N \to \infty$. This would eventually make $\|x_{N+1}-\hat{y}\|^2$ negative, which contradicts the non-negativity of squared norms.
Therefore, our initial assumption must be false in all three cases, and we have
$$\liminf_{n\to\infty} f(x_n) \le f^* + \frac{M_2}{2}.$$
This confirms that the convergence guarantee holds even with the redefined $C^2 = \sum_{i=1}^m D_i^2$, although the practical convergence behavior may differ due to the different step size calculation.    □
Proposition 4 (Convergence Rate).
Let $\{x_n\}$ be the sequence generated by Algorithm 2, and let $\{f_{\mathrm{rec}}^n\}$ be the sequence of recorded best function values. Then, for any $\varepsilon > 0$, there exists an iteration number $N(\varepsilon)$ such that
$$f_{\mathrm{rec}}^n \le f^* + \frac{M_2}{2} + \varepsilon, \qquad \forall n \ge N(\varepsilon).$$
Furthermore, if the algorithm parameters are chosen such that $\delta_l = \max\{\frac{\delta_0}{l}, \delta_{\min}\}$ and $\delta_{\min}$ is sufficiently small, then
$$f_{\mathrm{rec}}^n - f^* - \frac{M_2}{2} = O\!\left(\frac{1}{n}\right).$$
Proof. 
From Theorem 3, we know that $\liminf_{n\to\infty} f(x_n) \le f^* + \frac{M_2}{2}$. Since $f_{\mathrm{rec}}^n$ is a non-increasing sequence that tracks the best function value found so far, and $f_{\mathrm{rec}}^n = f(x_n)$ whenever $f(x_n)$ improves upon the previous best value, we have
$$\lim_{n\to\infty} f_{\mathrm{rec}}^n \le f^* + \frac{M_2}{2}.$$
This implies that for any $\varepsilon > 0$, there exists $N(\varepsilon)$ such that $f_{\mathrm{rec}}^n \le f^* + \frac{M_2}{2} + \varepsilon$ for all $n \ge N(\varepsilon)$.
For the convergence rate, we need to analyze the case where the algorithm does not terminate finitely. By Lemma 2, this means $l \to \infty$ and $\liminf_{l\to\infty}\delta_l = \delta_{\min}$.
When $\delta_{\min}$ is chosen to be sufficiently small, the target level update becomes effectively $\delta_l = \frac{\delta_0}{l}$ for most iterations.
From the step size calculation (3) with $C^2 = \sum_{i=1}^m D_i^2$, to establish the claimed rate we analyze the recurrence relation for $\|x_n - x^*\|^2$ given by the inequality in Lemma 1. When $\delta_l = \frac{\delta_0}{l}$, the step sizes $\alpha_n$ are related to $\frac{1}{n}$ in order of magnitude.
Let $\{n_k\}$ be the sequence of iterations where a new best function value is found, i.e., $f_{\mathrm{rec}}^{n_k} < f_{\mathrm{rec}}^{n_k - 1}$. For these iterations, we have $f(x_{n_k}) = f_{\mathrm{rec}}^{n_k}$.
For each $k$, we can bound the progress as
$$\begin{aligned}
\|x_{n_k}-x^*\|^2 &\le \|x_0-x^*\|^2 - 2\sum_{j=0}^{n_k-1}\alpha_j\big(f(x_j)-f^*\big) + \sum_{j=0}^{n_k-1}\big(M_1\alpha_j^2 + M_2\alpha_j\big) \\
&\le \|x_0-x^*\|^2 - 2\sum_{j=0}^{n_k-1}\alpha_j\big(f_{\mathrm{rec}}^{j}-f^*\big) + \sum_{j=0}^{n_k-1}\big(M_1\alpha_j^2 + M_2\alpha_j\big).
\end{aligned}$$
Due to the relationship between $\delta_l$ and $\frac{1}{l}$, and consequently between $\alpha_n$ and $\frac{1}{n}$, we can show that
$$\sum_{j=0}^{n-1}\alpha_j = O(n) \qquad\text{and}\qquad \sum_{j=0}^{n-1}\alpha_j^2 = O(1).$$
Rearranging the inequality and using the fact that $\|x_{n_k}-x^*\|^2 \ge 0$,
$$2\sum_{j=0}^{n_k-1}\alpha_j\big(f_{\mathrm{rec}}^{j}-f^*\big) \le \|x_0-x^*\|^2 + \sum_{j=0}^{n_k-1}\big(M_1\alpha_j^2 + M_2\alpha_j\big).$$
Since $f_{\mathrm{rec}}^{j}$ is non-increasing, we have
$$2\big(f_{\mathrm{rec}}^{n_k}-f^*\big)\sum_{j=0}^{n_k-1}\alpha_j \le \|x_0-x^*\|^2 + M_1\sum_{j=0}^{n_k-1}\alpha_j^2 + M_2\sum_{j=0}^{n_k-1}\alpha_j.$$
Dividing both sides by $2\sum_{j=0}^{n_k-1}\alpha_j$,
$$f_{\mathrm{rec}}^{n_k}-f^* \le \frac{\|x_0-x^*\|^2 + M_1\sum_{j=0}^{n_k-1}\alpha_j^2}{2\sum_{j=0}^{n_k-1}\alpha_j} + \frac{M_2}{2}.$$
Using our earlier asymptotic bounds on the sums of step sizes, and noting that $n_k \le n$ when the best value recorded at iteration $n$ was found at iteration $n_k$, we obtain
$$f_{\mathrm{rec}}^{n}-f^* - \frac{M_2}{2} \le \frac{\|x_0-x^*\|^2 + O(1)}{2\cdot O(n)} = O\!\left(\frac{1}{n}\right).$$
This establishes the claimed $O\!\left(\frac{1}{n}\right)$ convergence rate for the optimality gap $f_{\mathrm{rec}}^{n}-f^* - \frac{M_2}{2}$.
The rate matches the best known rate for subgradient methods with diminishing step sizes, while the dynamic approach offers practical advantages through its automatic adaptation to the problem difficulty. The specific value of $C^2 = \sum_{i=1}^m D_i^2$ affects the constants in the convergence rate but not the asymptotic order.    □
The theoretical analysis demonstrates that the modified dynamic step size strategy achieves the same asymptotic convergence guarantees as the diminishing step size approach, with additional robustness provided by the minimum target level and step size parameters. As we will show in our numerical experiments, the dynamic approach offers significant practical advantages by automatically adjusting to the optimization landscape, leading to faster convergence and better solution quality across a variety of test problems.

4. Numerical Results

In this section, we present the results of numerical experiments conducted to evaluate the performance of incremental weak subgradient algorithms in non-smooth and non-convex settings. We begin by comparing the results with those obtained using state-of-the-art methods for standard non-convex benchmark problems. Although the proposed algorithm is inherently distributed, evaluating its behavior in a sequential setting still offers valuable insights, particularly regarding verifying the correctness and assessing convergence. Next, we analyze the algorithm’s performance on a binary classification task. Finally, we investigate the scalability of a preliminary parallel implementation to identify strengths and potential bottlenecks. All the tests were run with MATLAB R2024a on an Apple Mac equipped with an M3 Pro chip, featuring a 12-core CPU, 18-core GPU, and 36 GB of unified memory.

4.1. Benchmarking Optimization Methods

This subsection compares the proposed subgradient-based algorithms and classical and modern optimization solvers designed for non-smooth and non-convex problems. The baseline methods include MATLAB’s built-in fminsearch and patternsearch functions, representing well-established derivative-free approaches. Furthermore, we consider two popular adaptive gradient-based optimizers: Adam [23] and Adagrad [24], both widely used in large-scale optimization due to their ability to adjust learning rates during training.
Step size strategies for the subgradient-based methods are defined as follows:
  • Constant step: $\alpha_k = 0.01$;
  • Diminishing step: $\alpha_k = \frac{\alpha_0}{1 + \beta k}$, with $\alpha_0 = 0.1$, $\beta = 0.5$;
  • Dynamic step: an adaptive strategy based on the incremental weak subgradient framework, where the subgradient bound $C$ from definition (3) is not fixed a priori but adaptively estimated. Specifically, we maintain a vector $D \in \mathbb{R}^m$, where each $D_i$ stores the maximum observed norm of the weak subgradient for the corresponding component function $f_i$. Initially, $D_i = 1$ for all $i$, and during optimization it is updated as follows:
$$D_i = \max\{D_i, \|g_i\|\},$$
    where $g_i$ denotes the current weak subgradient of component $i$. The global subgradient bound is then computed as $C^2 = \sum_{i=1}^{m} D_i^2$. We fix $\delta_0 = 0.01$, $\gamma_0 = 0.5$, $\epsilon = 10^{-6}$, $\alpha_{\min} = 10^{-5}$, $\alpha_{\max} = 10^{-1}$.
The evaluation is conducted across three benchmark problems:
  • Rastrigin:
$$f(x) = 10n + \sum_{i=1}^{n}\big[x_i^2 - 10\cos(2\pi x_i)\big].$$
    This function is a non-smooth variant of the classical Rastrigin function, characterized by a multimodal landscape with many local minima.
  • Rosenbrock:
$$f(x) = \sum_{i=1}^{n-1}\big[100(x_{i+1} - x_i^2)^2 + (1 - x_i)^2\big].$$
    This is a non-smooth version of the Rosenbrock function, presenting a curved narrow valley with a challenging landscape.
  • Smoothly clipped absolute deviation (SCAD):
$$f(x) := \sum_{i=1}^{n}\rho_{\lambda,a}(x_i),$$
    where $\rho_{\lambda,a}$ is defined as
$$\rho_{\lambda,a}(x_i) = \begin{cases} \lambda|x_i| & \text{if } |x_i| \le \lambda, \\ \dfrac{2a\lambda|x_i| - x_i^2 - \lambda^2}{2(a-1)} & \text{if } \lambda < |x_i| \le a\lambda, \\ \dfrac{(a+1)\lambda^2}{2} & \text{if } |x_i| > a\lambda, \end{cases}$$
    with parameters $a = 3.7$, $\lambda = 0.01$ (a NumPy sketch of this penalty is given after this list).
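Because the SCAD penalty reappears in the classification experiments of Section 4.2, a vectorized NumPy evaluation of the objective above may be helpful. This is our own helper (parameter defaults follow the stated $a = 3.7$, $\lambda = 0.01$), not the authors' code:

```python
import numpy as np

def scad_penalty(x, lam=0.01, a=3.7):
    """SCAD objective: sum of rho_{lambda,a}(x_i) over all coordinates."""
    ax = np.abs(np.asarray(x, dtype=float))
    rho = np.where(ax <= lam, lam * ax,
          np.where(ax <= a * lam,
                   (2.0 * a * lam * ax - ax**2 - lam**2) / (2.0 * (a - 1.0)),
                   (a + 1.0) * lam**2 / 2.0))
    return float(rho.sum())

x0 = np.random.default_rng(0).normal(size=1000)   # random start, as in the experiments
print(scad_penalty(x0))                            # roughly n*(a+1)*lam^2/2, i.e. about 0.23 here
```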
Each problem is defined over a high-dimensional space with $n = 1000$, and all experiments start from a randomly generated initial point $x_0 \sim \mathcal{N}(0, I_n)$. The maximum number of iterations is set to 1000.
The results presented in Table 1 demonstrate the effectiveness of the proposed subgradient-based methods, particularly the dynamic step size strategy, compared to classical and modern optimization algorithms. While adaptive gradient methods such as Adam and Adagrad achieve lower final objective values in some cases, their computational cost is considerably higher, especially noticeable in large-scale scenarios. The dynamic weak subgradient method achieves competitive final objective values with substantially reduced computation times relative to Adam and Adagrad. This efficiency gain is crucial for large-dimensional problems, where scalability and computational budget are often limiting factors. Derivative-free methods like fminsearch and patternsearch show reliable performance but suffer from significantly longer runtimes, rendering them less practical for high-dimensional problems. Moreover, by adaptively estimating subgradient bounds during optimization, the weak subgradient framework enables the dynamic method to adjust step sizes intelligently, balancing convergence speed and stability. This feature highlights its robustness and suitability for non-smooth, non-convex problems where traditional gradient-based methods might struggle. These results position the dynamic weak subgradient approach as a competitive and computationally efficient alternative to classical and adaptive gradient methods, making it particularly attractive for large-scale non-smooth optimization tasks.

4.2. Application of Incremental Weak Subgradient Methods to Classification

To validate the practical relevance of our theoretical framework, we apply the proposed incremental weak subgradient methods to a binary classification task with non-convex SCAD regularization. This setup reflects the challenges typical of modern machine learning, where optimization landscapes are often both non-smooth and non-convex.
We conducted experiments using the Breast Cancer Wisconsin dataset [25], a widely adopted benchmark for predictive breast cancer diagnosis, publicly available through the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic), accessed on 1 January 2025). The dataset comprises 569 instances, including 357 benign (62.7%) and 212 malignant (37.3%) cases. Each instance is described by 30 numerical features of cell nuclei (e.g., mean radius, texture, perimeter), extracted from digitized images of breast tissue biopsies. The classification task aims to distinguish between benign and malignant tumors. We transformed the dataset by (1) scaling features with different magnitudes, (2) introducing correlations between selected features, and (3) adding outliers to 5% of samples.
The classification task is formulated as a regularized empirical risk minimization problem:
$$\min_{w \in \mathbb{R}^d} f(w) := \sum_{i=1}^{m}\ell(y_i, x_i^T w) + \lambda\sum_{j=1}^{d}\rho(w_j),$$
where $\ell$ is the hinge loss function for binary classification:
$$\ell(y_i, x_i^T w) = \max(0,\, 1 - y_i\, x_i^T w),$$
and the SCAD penalty $\rho$ promotes sparsity while avoiding the excessive penalization of large coefficients that occurs with conventional $\ell_1$ regularization, making it particularly valuable for feature selection in high-dimensional problems [26,27].
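For intuition about how a single component $f_i$ might be processed by Algorithm 1 in this setting, the sketch below evaluates one sample's hinge loss plus an equal share of the SCAD term and returns a simple generalized-gradient vector that can serve as the $g$ part of a weak subgradient pair. The paper's weak subgradient approximation [8] is more elaborate; the even split of the regularizer across components and the treatment of the kink at zero are our own assumptions (it reuses `scad_penalty` from the earlier sketch).

```python
import numpy as np

def scad_derivative(w, lam=0.01, a=3.7):
    """Pointwise derivative of the SCAD penalty (set to 0 at w = 0)."""
    aw, s = np.abs(w), np.sign(w)
    return np.where(aw <= lam, lam * s,
           np.where(aw <= a * lam, (a * lam * s - w) / (a - 1.0), 0.0))

def component_value_and_grad(w, x_i, y_i, lam=0.01, m=1):
    """f_i(w) = hinge(y_i, x_i^T w) + (lam/m) * sum_j rho(w_j), with a generalized gradient."""
    margin = 1.0 - y_i * float(x_i @ w)
    hinge = max(margin, 0.0)
    g_hinge = -y_i * x_i if margin > 0 else np.zeros_like(w)
    value = hinge + (lam / m) * scad_penalty(w, lam)      # scad_penalty from the earlier sketch
    grad = g_hinge + (lam / m) * scad_derivative(w, lam)
    return value, grad
```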
For the Breast Cancer dataset with approximately 400 training samples, the initial value of $C$ is $\sqrt{400} = 20$, which then adapts throughout training.
The effectiveness of the dynamic step size method is closely tied to the careful tuning of its hyperparameters. We employed grid-based cross-validation to explore the following parameter ranges: $\delta_0 \in \{0.05, 0.1, 0.5, 1, 2, 5, 10, 25, 50\}$, $\gamma \in \{0.1, 0.5, 1.0, 1.5, 2.0\}$, $\lambda \in \{0.0001, 0.001, 0.01, 0.05, 0.1, 0.5\}$, and $c_{\mathrm{init}} \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$. The optimal configuration identified was $\delta_0 = 5.0$, $\gamma = 1.5$, $\lambda = 0.01$, and $c_{\mathrm{init}} = 0.1$, which yielded the best validation performance. All methods were initialized consistently using random weights scaled by 0.01 to ensure comparability. For the computation of weak subgradients, we implemented an adaptive epsilon selection strategy, dynamically testing values $\epsilon \in \{10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}\}$ to maintain numerical stability throughout the optimization process. This meticulous hyperparameter tuning is crucial for the dynamic step size method to effectively adjust to the local geometry of the optimization landscape, thereby enhancing convergence speed and overall performance.
Performance evaluation was performed using five-fold cross-validation, with 70% of the data used for training and 30% for testing. Early stopping was employed when the change in objective function value fell below $10^{-6}$ or after 200 iterations.
Table 2 presents the comparative performance of our three incremental weak subgradient methods on the classification task.
The results demonstrate that the dynamic step size strategy significantly enhances the performance of subgradient methods in the classification setting. The dynamic approach achieves 71.93% accuracy, representing a 17.84% improvement over the constant step size method and a 28.58% improvement over the diminishing step size approach. These results align with our theoretical analysis and highlight the importance of adaptive step sizing for non-convex classification problems.
Performance Analysis by Method:
  • Constant Step Size (Case 1): As established in Proposition 1, this method converges to a neighborhood of the optimal solution, but the neighborhood size depends on the fixed step magnitude. This limits fine-tuning near the optimum in the complex SCAD-regularized classification landscape.
  • Diminishing Step Size (Case 2): While Proposition 2 guarantees asymptotic convergence, the rigid step reduction schedule cannot adapt to the varying curvature and difficulty regions in the non-convex objective function, resulting in suboptimal classification performance.
  • Dynamic Step Size (Case 3): Theorem 3 demonstrates that our adaptive approach provides both theoretical convergence guarantees and superior adaptivity to problem characteristics. The method automatically adjusts step sizes based on the local geometry of the SCAD-regularized hinge loss, enabling effective navigation of the non-convex optimization landscape.
The dynamic method also demonstrates the shortest convergence time (1.21 s), indicating computational efficiency alongside improved accuracy. The measured adaptation range was a factor of 18.4 between the maximum and minimum step sizes, illustrating the method's ability to automatically adjust to varying problem difficulty during optimization. This adaptivity, combined with the $O(1/n)$ convergence rate established in Proposition 4, explains why the dynamic approach outperforms the alternatives on challenging non-convex classification problems.
The accuracy results in Table 2 highlight the relative effectiveness of our dynamic step size method, though they remain below the 95%+ accuracy typically reported for the Breast Cancer Wisconsin dataset. This performance gap stems from several factors that point to promising directions for future research:
  • Model Simplicity: We focused on a linear model with hinge loss and SCAD regularization to emphasize optimization aspects. Applying the weak subgradient framework to more expressive models (e.g., kernels, neural networks, ensembles) could yield better classification results.
  • Challenging Preprocessing: Our deliberately complex preprocessing—featuring outliers, correlated features, and uneven scaling—complicates optimization. Future work could explore robust or adaptive preprocessing techniques to preserve difficulty while improving performance.
  • Adaptive Hyperparameters: Parameters such as $c_{\mathrm{init}}$, $\gamma$, and $\delta_0$ could be tuned dynamically during training using adaptive schemes or meta-learning, enhancing both convergence and generalization.
  • Optimization Enhancements: Incorporating momentum, variance reduction, or second-order approximations may accelerate convergence and improve final accuracy by better exploiting the problem’s structure.
These directions could significantly enhance the performance and applicability of weak subgradient methods in complex non-convex settings, helping bridge the gap with state-of-the-art classifiers.

5. First Scalability Evaluation in a Parallel Setting

In this work, we consider a preliminary parallel version of the incremental weak subgradient algorithm to explore the potential of the method, with a focus on analyzing its weak scalability. This approach evaluates how effectively the algorithm distributes a fixed computational workload per core as the total problem size increases proportionally with the number of processors. Such analysis provides a realistic and informative baseline, especially for large-scale optimization problems where computational resources are scaled accordingly. Moreover, it helps identify parallel overheads and potential inefficiencies in the current implementation before moving on to more demanding scalability studies, such as strong scaling. We tested the dynamic variant of the incremental weak subgradient method on the minimization of the SCAD function to assess weak scalability. The problem is decomposed into 10 fixed blocks (functions), each working on approximately n / 10 variables. In our experiments, a constant workload of 500 variables per core was assigned, ensuring a fixed computational burden per processor. This configuration aligns with the principles of weak scalability by isolating the effects of increasing problem size under parallel execution.
The algorithm was parallelized using MATLAB’s parfor construct, which enables concurrent evaluation of weak subgradients for each component function. Each parallel worker computes the weak subgradient of a single component function independently. These partial gradients are then communicated back to the main process, where they are aggregated to form the full subgradient vector. This aggregation introduces synchronization at each iteration, resulting in communication overhead that tends to increase with the number of workers. As the optimization problem is unconstrained, the projection operator reduces to the identity map, eliminating additional costs associated with constraint handling.
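The authors parallelize with MATLAB's parfor; the following Python multiprocessing sketch only mirrors the described pattern under our own assumptions: a block decomposition of the unconstrained SCAD objective, concurrent per-block weak subgradient evaluation, aggregation in the main process, and the identity projection. Block counts, worker counts, and the step value are illustrative.

```python
import numpy as np
from multiprocessing import Pool

def block_subgradient(args):
    """Generalized gradient of the SCAD objective restricted to one block of variables."""
    w_block, lam, a = args
    aw, s = np.abs(w_block), np.sign(w_block)
    return np.where(aw <= lam, lam * s,
           np.where(aw <= a * lam, (a * lam * s - w_block) / (a - 1.0), 0.0))

if __name__ == "__main__":
    n_blocks, block_size = 10, 500                   # 10 fixed blocks, 500 variables per core
    w = np.random.default_rng(0).normal(size=n_blocks * block_size)
    blocks = np.split(w, n_blocks)
    with Pool(processes=4) as pool:                  # workers evaluate blocks concurrently
        parts = pool.map(block_subgradient, [(b, 0.01, 3.7) for b in blocks])
    g = np.concatenate(parts)                        # synchronize: assemble the full vector
    w = w - 1e-2 * g                                 # one step; projection is the identity here
```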
Table 3 reports the wall-clock execution times for increasing numbers of cores and correspondingly larger problem sizes. As the problem dimension grows from 500 variables on a single core to 6000 variables on twelve cores, the total execution time increases moderately, from 1.19 s to 2.47 s. This growth is expected due to overhead from parallel execution, such as communication and synchronization, and it is not proportional to the problem size. At the same time, the time per variable decreases significantly, from 2.38 ms on a single core down to 0.41 ms on twelve cores, indicating that the cost of parallelization is effectively amortized as the problem size and computational resources grow. These results highlight the algorithm's promising weak scalability: it maintains, and even improves, per-variable efficiency despite larger workloads.
These preliminary results demonstrate that, even in its current form, the parallel implementation of the proposed method is well suited for weakly scalable problems with separable structure. Future work may address the remaining overheads by introducing asynchronous updates to reduce synchronization delays, adopting adaptive step size strategies to improve convergence speed, and implementing dynamic load balancing to compensate for variability in subfunction evaluation times. Additionally, detailed profiling of communication and synchronization costs would provide valuable insight into performance bottlenecks, facilitating more refined optimizations on larger parallel systems.

6. Conclusions

We have extended incremental subgradient methods to handle non-smooth, non-convex optimization problems using weak subgradients. Our theoretical analysis confirms that all three variants—constant, diminishing, and dynamic step size strategies—converge to a neighborhood of the optimal solution, with the size of this neighborhood influenced by weak subgradient parameters.
Among these, the dynamic step size method stands out by automatically adapting to the local geometry of the objective function, leading to superior performance. Experimental results on classification tasks with non-convex SCAD regularization show that the dynamic method significantly outperforms classical approaches, achieving 71.93% accuracy compared to 59.64% and 53.22% for constant and diminishing step sizes, respectively.
Notably, the method’s strong point lies in its fast convergence and ease of implementation, making it a practical and computationally efficient alternative to classical and adaptive gradient methods. However, performance depends on careful hyperparameter tuning.
Future work will explore stochastic extensions, momentum incorporation, and scalability to large distributed systems. Overall, this approach offers a promising tool for solving complex non-convex optimization problems in machine learning and related domains, particularly when speed and simplicity of implementation are key considerations.

Author Contributions

Conceptualization, N.A. and V.D.S.; methodology, N.A. and V.D.S.; software, N.A. and V.D.S.; validation, N.A. and V.D.S.; formal analysis, N.A. and V.D.S.; writing—original draft preparation, N.A. and V.D.S.; writing—review and editing, N.A. and V.D.S.; supervision, V.D.S. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors acknowledge the MIUR-PRIN 2022 Project “Numerical Optimization with Adaptive Accuracy and Applications to Machine Learning” grant number 2022N3ZNAX (CUP E53D2300 7700006), under the National Recovery and Resilience Plan (PNRR), Italy, Mission 04 Component 2 Investment 1.1 funded by the European Commission and the National Group for Scientific Computation (INdAM-GNCS).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Nedic, A.; Bertsekas, P.D. Incremental subgradient methods for nondifferentiable optimization. SIAM J. Optim. 2001, 12, 109–138. [Google Scholar] [CrossRef]
  2. Hiriart-Urruty, J.-B.; Lemaréchal, C. Convex Analysis and Minimization Algorithms II; Springer: New York, NY, USA, 1993. [Google Scholar]
  3. Bertsekas, D.P. Convex Optimization Algorithms; Athena Scientific: Nashua, NH, USA, 2015. [Google Scholar]
  4. Polyak, B.T. Introduction to Optimization; Optimization Software, Inc.: New York, NY, USA, 1987. [Google Scholar]
  5. Liu, J.; Wright, S.J.; Ré, C.; Bittorf, V.; Sridhar, S. An asynchronous parallel stochastic coordinate descent algorithm. In Proceedings of the ICML 2014: 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014. [Google Scholar]
  6. Pan, T.Y.; Yang, G.; Zhao, J.; Ding, J. Smoothing Piecewise Linear Activation Functions Based on Mollified Square Root Functions. Math. Found. Comput. 2023, 7, 578–601. [Google Scholar] [CrossRef]
  7. Azimov, A.; Gasimov, R. On weak conjugacy, weak subdifferentials and duality with zero gap in nonconvex optimization. Int. J. Appl. Math. 1999, 1, 171–192. [Google Scholar]
  8. Dinc Yalcin, G.; Kasimbeyli, R. Weak subgradient method for solving nonsmooth nonconvex optimization problems. Optimization 2021, 70, 1513–1553. [Google Scholar] [CrossRef]
  9. Kasimbeyli, R.; Mammadov, M. On weak subdifferentials, directional derivatives, and radial epiderivatives for nonconvex functions. SIAM J. Optim. 2009, 20, 841–855. [Google Scholar] [CrossRef]
  10. Gaivoronski, A.A. Convergence analysis of parallel backpropagation algorithm for neural network. Optim. Methods Softw. 1994, 4, 117–134. [Google Scholar] [CrossRef]
  11. Grippo, L. A class of unconstrained minimization methods for neural network training. Optim. Methods Softw. 1994, 4, 135–150. [Google Scholar] [CrossRef]
  12. Bertsekas, D.P. A new class of incremental gradient methods for least squares problems. SIAM J. Optim. 1997, 7, 913–926. [Google Scholar] [CrossRef]
  13. Solodov, M.V.; Zavriev, S.K. Incremental gradient algorithms with stepsizes bounded away from zero. Comput. Optim. Appl. 1998, 11, 23–35. [Google Scholar] [CrossRef]
  14. Tseng, P. An Incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM J. Optim. 1998, 8, 506–531. [Google Scholar] [CrossRef]
  15. Shor, N.Z. Minimization Methods for Nondifferentiable Functions; Naukova Dumka: Kiev, Ukraine, 1979. [Google Scholar]
  16. Bertsekas, D.P. Nonlinear Programming. J. Oper. Res. Soc. 1997, 48, 334. [Google Scholar] [CrossRef]
  17. Goffin, J.L.; Kiwiel, K.C. Convergence of a simple subgradient level method. Math. Program. 1999, 85, 207–211. [Google Scholar] [CrossRef]
  18. Yang, D.; Wang, X. Incremental subgradient algorithms with dynamic step sizes for separable convex optimizations. Math. Methods Appl. Sci. 2023, 46, 7108–7124. [Google Scholar] [CrossRef]
  19. Wang, X.M. Subgradient algorithms on Riemannian manifolds of lower bounded curvatures. Optimization 2018, 67, 1–16. [Google Scholar] [CrossRef]
  20. Nesterov, Y. Primal-dual subgradient methods for convex problems. Math. Program. 2009, 120, 221–259. [Google Scholar] [CrossRef]
  21. Yao, Y.H.; Naseer, S.; Yao, J.C. Projected subgradient algorithms for pseudomonotone equilibrium problems and fixed points of pseudocontractive operators. Mathematics 2020, 8, 461. [Google Scholar] [CrossRef]
  22. Azimov, A.Y.; Gasimov, R.N. Stability and duality of nonconvex problems via augmented Lagrangian. Cybern. Syst. Anal. 2002, 38, 412–421. [Google Scholar] [CrossRef]
  23. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  24. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
  25. Street, W.N.; Wolberg, W.H.; Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. In Proceedings of the IS&T/SPIE’s Symposium on Electronic Imaging: Science and Technology, San Jose, CA, USA, 31 January–5 February 1993; pp. 861–870. [Google Scholar]
  26. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
  27. Wu, Z.; Xie, G.; Ge, Z.; De Simone, V. Nonconvex multi-period mean-variance portfolio optimization. Ann. Oper. Res. 2024, 332, 617–644. [Google Scholar] [CrossRef]
Table 1. Performance of optimization methods on three non-smooth and non-convex benchmark problems. The table reports computation time (in seconds) and final objective value for each method and problem.

| Problem | Method | Time [s] | Final f |
|---|---|---|---|
| Rastrigin | Constant | 0.061 | 1.52 × 10⁴ |
| | Diminishing | 0.037 | 1.63 × 10⁴ |
| | Dynamic | 0.036 | 1.63 × 10⁴ |
| | Adam | 15.470 | 1.00 × 10⁴ |
| | Adagrad | 16.378 | 1.00 × 10⁴ |
| | Fminsearch | 1.267 | 1.64 × 10⁴ |
| | Patternsearch | 22.766 | 1.54 × 10⁴ |
| Rosenbrock | Constant | 0.044 | 4.74 × 10⁶ |
| | Diminishing | 0.021 | 3.18 × 10⁵ |
| | Dynamic | 0.026 | 3.82 × 10⁵ |
| | Adam | 3.056 | 8.34 × 10⁴ |
| | Adagrad | 3.053 | 1.40 × 10⁵ |
| | Fminsearch | 1.386 | 4.49 × 10⁵ |
| | Patternsearch | 10.194 | 3.55 × 10⁵ |
| SCAD | Constant | 0.026 | 2.32 × 10⁻¹ |
| | Diminishing | 0.020 | 2.32 × 10⁻¹ |
| | Dynamic | 0.015 | 2.32 × 10⁻¹ |
| | Adam | 2.146 | 2.27 × 10⁻¹ |
| | Adagrad | 2.140 | 2.27 × 10⁻¹ |
| | Fminsearch | 2.135 | 2.32 × 10⁻¹ |
| | Patternsearch | 12.383 | 1.91 × 10⁻¹ |
Table 2. Performance comparison of incremental weak subgradient methods for non-convex classification.

| Method | Accuracy (%) | Conv. Time (s) | Rel. Improvement (%) |
|---|---|---|---|
| Case 1 (constant) | 59.64 | 1.43 | — |
| Case 2 (diminishing) | 53.22 | 1.87 | −10.76 |
| Case 3 (dynamic) | 71.93 | 1.21 | +17.84 |
Table 3. Execution times and time per variable for the weak scalability test using the SCAD function decomposed into 10 fixed blocks.

| Cores | Problem Dimension (n) | Execution Time [s] | Time per Variable [ms] |
|---|---|---|---|
| 1 | 500 | 1.1911 | 2.38 |
| 2 | 1000 | 1.2654 | 1.27 |
| 4 | 2000 | 1.4477 | 0.72 |
| 8 | 4000 | 2.1513 | 0.54 |
| 12 | 6000 | 2.4738 | 0.41 |