DARN: Distributed Adaptive Regularized Optimization with Consensus for Non-Convex Non-Smooth Composite Problems

Li, Cunlin; Ma, Yinpu

doi:10.3390/sym17071159


by Cunlin Li † and Yinpu Ma *,†

School of Mathematics and Information Science, Northern Minzu University, Yinchuan 750021, China

* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.

Symmetry 2025, 17(7), 1159; https://doi.org/10.3390/sym17071159

Submission received: 7 June 2025 / Revised: 9 July 2025 / Accepted: 14 July 2025 / Published: 20 July 2025

(This article belongs to the Section Mathematics)


Abstract

This paper proposes a Distributed Adaptive Regularization Algorithm (DARN) for solving composite non-convex and non-smooth optimization problems in multi-agent systems. The algorithm employs a three-phase iterative framework to achieve efficient collaborative optimization: (1) a local regularized optimization step, which utilizes proximal mappings to enforce strong convexity of weakly convex objectives and ensure subproblem well-posedness; (2) a consensus update based on doubly stochastic matrices, guaranteeing asymptotic convergence of agent states to a global consensus point; and (3) an innovative adaptive regularization mechanism that dynamically adjusts regularization strength using local function value variations to balance stability and convergence speed. Theoretical analysis demonstrates that the algorithm maintains strict monotonic descent under non-convex and non-smooth conditions by constructing a mixed time-scale Lyapunov function, achieving a sublinear convergence rate. Notably, we prove that the projection-based update rule for regularization parameters preserves lower-bound constraints, while spectral decay properties of consensus errors and perturbations from local updates are globally governed by the Lyapunov function. Numerical experiments validate the algorithm’s superiority in sparse principal component analysis and robust matrix completion tasks, showing a 6.6% improvement in convergence speed and a 51.7% reduction in consensus error compared to fixed-regularization methods. This work provides theoretical guarantees and an efficient framework for distributed non-convex optimization in heterogeneous networks.

Keywords:

distributed optimization; adaptive regularization; non-convex and non-smooth optimization; consensus algorithms; Lyapunov convergence; proximal gradient methods

1. Introduction

Distributed optimization over multi-agent networks has emerged as a fundamental paradigm for solving large-scale problems in machine learning, signal processing, and control systems, where data privacy, communication efficiency, and scalability are critical concerns [1,2,3]. The symmetry principle plays a pivotal role in designing consensus mechanisms, where doubly stochastic weight matrices enforce symmetric information exchange among agents to balance local computational autonomy with global coordination. A central challenge lies in designing algorithms that reconcile non-smooth composite objectives—ubiquitous in sparse recovery, robust estimation, and deep learning—with the constraints of decentralized computation and time-varying network topologies. While significant progress has been made in convex settings, extension to non-convex and non-smooth problems remains theoretically intricate and practically demanding, particularly under heterogeneous network conditions. Recent advances in decentralized proximal methods, such as PG-EXTRA [4] and its variants [5,6,7], have demonstrated the potential of exploiting composite structures through gradient-proximal splitting techniques. These methods achieve exact convergence with fixed step sizes by leveraging double-stochastic consensus protocols; yet, their performance remains constrained by network-dependent step size bounds and limited adaptability to non-convex landscapes. Parallel developments in continuous-time Distributed Gradient Descent (DGD) [8] reveal intrinsic tradeoffs between consensus convergence rates and centralized optimization dynamics, often resulting in suboptimal synchronization under time-varying topologies. 
Meanwhile, state-of-the-art approaches for non-convex optimization, exemplified by Distributed Proximal Gradient (DPG) algorithms [9], address time-varying networks through increasing consensus rounds but suffer from computational inefficiency due to inexact proximal approximations and diminishing step-size requirements.

Notably, existing methods primarily target undirected graphs. Although directed graph settings are more challenging, recent works such as [10] have proposed distributed robust optimization frameworks for networked systems with unknown nonlinearities, while [11] developed randomized constraint-solving algorithms for unbalanced time-varying digraphs. These advances partially address graph asymmetry via techniques such as gradient tracking and row-stochastic matrices, although non-convex non-smooth composite problems remain open.

Three fundamental limitations persist in existing frameworks:

Network dependency: Step size selection in gradient-proximal methods often requires global knowledge of network spectral properties [4,5,6], limiting scalability in heterogeneous environments.
Non-convex–non-smooth coupling: Current analyses for composite objectives predominantly rely on convexity or Kurdyka–Łojasiewicz assumptions [12], failing to guarantee monotonic descent in general non-convex settings with non-smooth regularizers.
Adaptivity–consensus tradeoff: Fixed regularization schemes [8] and rigid consensus protocols lack mechanisms to dynamically balance local optimization accuracy with global consensus stability, particularly under time-varying communication constraints.

Theoretical contributions transcend conventional convex-analytic approaches by establishing [13]:

Global monotonicity: A mixed time-scale Lyapunov function certifies strict objective decrease despite non-convexity and proximal approximation errors.
Network-agnostic convergence: The sublinear rate $O (1 / \sqrt{k})$ is proven to be independent of spectral graph properties, resolving prior dependencies on Laplacian eigenvalues.
Critical point consensus: Agent states provably converge to a common critical set without requiring diminishing step sizes or gradient tracking.

Through numerical validations on sparse PCA and robust matrix completion, this work bridges the gap between adaptive regularization theory and decentralized non-convex optimization, offering a unified framework for large-scale composite problems in dynamically evolving networks.

The composite structure $f_{i} + g$ introduces two fundamental difficulties: (1) Multimodality induced by non-convexity: Local objectives may contain numerous saddle points and suboptimal local minima, causing gradient-based methods to stagnate at non-critical regions. (2) Non-smoothness-gradient incoherence: The absence of subgradient boundedness ($∥ \partial f_{i} (x) ∥ \nleq L$) invalidates classic descent lemmas, while the non-coincidence of $\partial (f_{i} + g)$ and $\partial f_{i} + \partial g$ disrupts optimality analysis.

Decentralization exacerbates these issues through:

Consensus–optimization conflict: Non-smooth terms induce $O (1)$ consensus error under fixed step sizes (see Lemma 6), conflicting with optimization precision requirements.
Heterogeneous landscape misalignment: When $μ_{i} \neq μ_{j}$ , local strong convexity parameters diverge, preventing synchronous convergence.
Subgradient communication incompleteness: Transmitting $\partial f_{i}$ requires $O (d^{2})$ bandwidth for matrix-valued variables, becoming prohibitive for large-scale problems.

These challenges collectively necessitate: (i) A mechanism to bypass unbounded subgradients (solved via Moreau smoothing in Section 3.2). (ii) Time scale decoupling for consensus and optimization (achieved by mixed Lyapunov analysis in Theorem 1). (iii) Adaptive regularization to reconcile $μ_{i}$-heterogeneity (proposed in Section 2.2).

2. Problem Formulation and Algorithm Design

2.1. Network Model and Objective Function

Consider a network of n agents over an undirected graph $G = (V, E)$, where $V = \{1, \dots, n\}$ and $E$ denotes the edge set; each agent i holds a local objective $f_{i} : R^{d} \to R$. The global optimization problem is then

\min_{x \in R^{d}} \{f (x) = \frac{1}{n} \sum_{i = 1}^{n} f_{i} (x) + g (x)\} .

Assumptions:

$f_{i}$ is $L_{i}$-Lipschitz continuous (possibly non-convex and non-smooth).
$g (x)$ is $L_{g}$-Lipschitz continuous (possibly non-convex and non-smooth).
Agents communicate via a doubly stochastic adjacency matrix $W = [w_{i j}] \in R^{n \times n}$ satisfying $W 1 = 1, 1^{⊤} W = 1^{⊤}$ and $w_{i j} > 0$ iff $(i, j) \in E$ .
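
As an illustration of the third assumption, the sketch below builds a doubly stochastic matrix for a small undirected graph. The Metropolis weighting rule and the helper name `metropolis_weights` are assumptions made for this example; the paper does not prescribe a particular construction. The diagonal entries are positive, which corresponds to treating each agent as its own neighbor.

```python
import numpy as np

def metropolis_weights(n, edges):
    """Symmetric doubly stochastic W for an undirected graph (Metropolis rule, assumed here)."""
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((n, n))
    for i, j in edges:
        # Each edge gets weight 1 / (1 + larger endpoint degree).
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    # Self-weights absorb the remaining mass so that every row sums to one.
    W[np.diag_indices(n)] = 1.0 - W.sum(axis=1)
    return W

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # 4-agent ring, illustrative only
W = metropolis_weights(4, edges)
ones = np.ones(4)
print(np.allclose(W @ ones, ones), np.allclose(ones @ W, ones))
```

Because the resulting matrix is symmetric, row and column sums coincide, so $W 1 = 1$ and $1^{⊤} W = 1^{⊤}$ hold together.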

2.2. Distributed Adaptive Regularization Algorithm (DARN)

The complete procedure is formalized in Algorithm 1.

Initialization: Each agent i initializes $x_{i}^{(0)} \in R^{d}$ , $λ_{i}^{(0)} > 0$ and weights $w_{i j}$ .
Iteration Steps (at step k):

Local Regularized Optimization:

$x_{i}^{(k + \frac{1}{2})} = arg \min_{x \in R^{d}} \{f_{i} (x) + \frac{λ_{i}^{(k)}}{2} {∥ x - x_{i}^{(k)} ∥}^{2}\} .$

This proximal regularization ensures stability of local solutions.
Consensus Update:

$x_{i}^{(k + 1)} = \sum_{j = 1}^{n} w_{i j} x_{j}^{(k + \frac{1}{2})} .$

This achieves aggregation of local variables towards a global consensus point through weighted averaging.
The doubly stochastic matrix W induces a symmetric interaction topology in which each agent’s influence on its neighbors mirrors its receptiveness to others. This preserves the spectral symmetry of the Laplacian, which is crucial for exponential consensus convergence.
Adaptive Regularization Tuning:

$λ_{i}^{(k + 1)} = P_{[λ_{\min}, λ_{\max}]} (λ_{i}^{(k)} + \frac{γ}{L_{i} λ_{i}^{(k)}} (f_{i} (x_{i}^{(k + \frac{1}{2})}) - f_{i} (x_{i}^{(k)}))) .$

This dynamically adjusts the regularization strength based on local function value changes, where $η_{i}^{(k)} = \frac{γ}{L_{i} λ_{i}^{(k)}}$ , with $γ > 0$ as the adjustment parameter.

2.3. Mathematical Details and Rationale

Proposition 1.

(Strong Convexity of the Regularized Objective)

If each function $f_{i}$ is $(μ_{i}, L_{i})$-weakly convex and $λ_{i}^{(k)} > μ_{i}$, then the function

h_{i} (x) = f_{i} (x) + \frac{λ_{i}^{(k)}}{2} {∥ x - x_{i}^{(k)} ∥}^{2}

is $(λ_{i}^{(k)} - μ_{i})$-strongly convex. Consequently, there exists a unique minimizer $x_{i}^{(k + \frac{1}{2})}$.

Proof.

Definition of Weak Convexity.

A function $f_{i}$ is $μ_{i}$-weakly convex if $f_{i} (x) + \frac{μ_{i}}{2} {∥ x ∥}^{2}$ is convex. Therefore, for any $x, y \in R^{d}$ and $θ \in [0, 1]$, we have

\begin{matrix} f_{i} (θ x + (1 - θ) y) + \frac{μ_{i}}{2} {∥ θ x + (1 - θ) y ∥}^{2} \leq \\ θ (f_{i} (x) + \frac{μ_{i}}{2} {∥ x ∥}^{2}) + (1 - θ) (f_{i} (y) + \frac{μ_{i}}{2} {∥ y ∥}^{2}) . \end{matrix}

Expansion and regularization term.

Consider the function $h_{i} (x)$ :

h_{i} (x) = f_{i} (x) + \frac{λ_{i}^{(k)}}{2} {∥ x - x_{i}^{(k)} ∥}^{2} .

Then, expand the squared norm term:

∥ x - x_{i}^{(k)} ∥^{2} = {∥ x ∥}^{2} - 2 x_{i}^{(k) ⊤} x + {∥ x_{i}^{(k)} ∥}^{2} .

Substituting into $h_{i} (x)$ yields:

h_{i} (x) = f_{i} (x) + \frac{λ_{i}^{(k)}}{2} {∥ x ∥}^{2} - λ_{i}^{(k)} x_{i}^{(k) ⊤} x + \frac{λ_{i}^{(k)}}{2} {∥ x_{i}^{(k)} ∥}^{2} .

Strong Convexity Verification.

Group the terms as $h_{i} (x) = (f_{i} (x) + \frac{μ_{i}}{2} {∥ x ∥}^{2}) + \frac{λ_{i}^{(k)} - μ_{i}}{2} {∥ x ∥}^{2} - λ_{i}^{(k)} x_{i}^{(k) ⊤} x + \frac{λ_{i}^{(k)}}{2} {∥ x_{i}^{(k)} ∥}^{2}$. The bracketed term is convex (by weak convexity), the quadratic $\frac{λ_{i}^{(k)} - μ_{i}}{2} {∥ x ∥}^{2}$ is strongly convex with parameter $λ_{i}^{(k)} - μ_{i} > 0$, and the remaining terms are affine in $x$.
The sum of a convex function, a $(λ_{i}^{(k)} - μ_{i})$-strongly convex function, and an affine function is $(λ_{i}^{(k)} - μ_{i})$-strongly convex. Hence, the strong convexity parameter of $h_{i} (x)$ is $λ_{i}^{(k)} - μ_{i}$.
Uniqueness of Minimizer:
Strong convexity guarantees that $h_{i} (x)$ has a unique minimizer $x_{i}^{(k + \frac{1}{2})}$ satisfying

h_{i} (x_{i}^{(k + \frac{1}{2})}) = \min_{x \in R^{d}} h_{i} (x) .

□

Well-Posedness of Local Optimization.
For non-convex and non-smooth (but weakly convex) functions $f_{i}$ , the addition of the regularization term $\frac{λ_{i}^{(k)}}{2} {∥ x - x_{i}^{(k)} ∥}^{2}$ with $λ_{i}^{(k)} > μ_{i}$ ensures the existence and uniqueness of a solution to the local subproblem. Specifically, by Proposition 1 the regularized objective function is $(λ_{i}^{(k)} - μ_{i})$-strongly convex, which guarantees a unique minimizer.
Convergence of Consensus Update.
Let $x^{(k)} = {[x_{1}^{(k)}, \dots, x_{n}^{(k)}]}^{⊤} \in R^{n \times d}$ . Then, the consensus step can be written in matrix form as follows:

$x^{(k + 1)} = W x^{(k + \frac{1}{2})} .$

Because W is doubly stochastic and the graph is connected, by the Perron–Frobenius theorem, repeated application of W will drive $x^{(k)}$ towards consensus, that is:

$\lim_{k \to \infty} ∥ x_{i}^{(k)} - {\bar{x}}^{(k)} ∥ = 0$, where ${\bar{x}}^{(k)} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}^{(k)}$ .
Adaptive Regularization Adjustment Mechanism.
Define the local function value decrease as follows:

$Δ_{i}^{(k)} = f_{i} (x_{i}^{(k + \frac{1}{2})}) - f_{i} (x_{i}^{(k)}) .$

If $Δ_{i}^{(k)} < 0$ (function value decreases), increase $λ_{i}^{(k + 1)}$ to strengthen regularization and suppress oscillations; otherwise, decrease $λ_{i}$ . By setting $η_{i}^{(k)} \propto 1 / (L_{i} λ_{i}^{(k)})$ , the adjustment ensures that the update is inversely proportional to the local Lipschitz constant, balancing the heterogeneity among agents.
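
A minimal sketch of the projected update rule from Section 2.2, implemented literally as written there. The function name `update_lambda` and the numerical values are illustrative assumptions; the projection $P_{[λ_{\min}, λ_{\max}]}$ is realized with `np.clip`.

```python
import numpy as np

def update_lambda(lam, f_half, f_prev, L_i, gamma, lam_min, lam_max):
    """One projected update of the regularization parameter for a single agent."""
    delta = f_half - f_prev            # Delta_i^(k) = f_i(x^(k+1/2)) - f_i(x^(k))
    eta = gamma / (L_i * lam)          # eta_i^(k) = gamma / (L_i * lambda_i^(k))
    # Projection onto [lam_min, lam_max] is a simple clip in one dimension.
    return float(np.clip(lam + eta * delta, lam_min, lam_max))

# Illustrative numbers (assumptions, not taken from the experiments).
lam_next = update_lambda(lam=2.0, f_half=0.5, f_prev=1.5, L_i=1.0,
                         gamma=0.4, lam_min=1.0, lam_max=10.0)
lam_floor = update_lambda(lam=2.0, f_half=-100.0, f_prev=1.5, L_i=1.0,
                          gamma=0.4, lam_min=1.0, lam_max=10.0)
print(lam_next, lam_floor)
```

In the first call $Δ_{i}^{(k)} = -1$ and $η_{i}^{(k)} = 0.2$, so the unprojected value is $2.0 - 0.2$; in the second call the unprojected value falls below $λ_{\min}$ and the projection returns $λ_{\min}$.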

2.4. Key Lemma and Convergence Support

Lemma 1.

(Descent of Local Solutions): For any $k \geq 0$ and agent i, the following holds:

f_{i} (x_{i}^{(k + \frac{1}{2})}) + \frac{λ_{i}^{(k)}}{2} {∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥}^{2} \leq f_{i} (x_{i}^{(k)}) .

Proof.

By Proposition 1, $x_{i}^{(k + \frac{1}{2})}$ is the minimizer of the regularized objective $h_{i}$. Comparing the value of $h_{i}$ at this minimizer with its value at the feasible point $x = x_{i}^{(k)}$, we obtain

f_{i} (x_{i}^{(k + \frac{1}{2})}) + \frac{λ_{i}^{(k)}}{2} {∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥}^{2} \leq f_{i} (x_{i}^{(k)}) + \frac{λ_{i}^{(k)}}{2} {∥ x_{i}^{(k)} - x_{i}^{(k)} ∥}^{2} = f_{i} (x_{i}^{(k)}) .

□
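
The inequality of Lemma 1 can be checked numerically on a one-dimensional toy problem. The function $f (x) = | x^{2} - 1 |$ below is an assumption chosen for illustration (it is non-smooth and weakly convex with $μ = 2$); the local subproblem is solved by brute-force grid search rather than by any method from the paper.

```python
import numpy as np

def f(x):
    # Illustrative weakly convex, non-smooth function (not from the paper).
    return np.abs(x ** 2 - 1.0)

lam, x_k = 3.0, 0.4                        # lam > mu = 2 for this f
# Fine grid that also contains x_k, so the grid minimum is never worse than h(x_k).
grid = np.append(np.linspace(-3.0, 3.0, 60001), x_k)
h = f(grid) + 0.5 * lam * (grid - x_k) ** 2
x_half = grid[np.argmin(h)]                # approximate x^(k+1/2)

lhs = f(x_half) + 0.5 * lam * (x_half - x_k) ** 2
rhs = f(x_k)
print(x_half, lhs, rhs, lhs <= rhs)
```

For these values the regularized objective is decreasing on $(-1, 1)$ and increasing for $x > 1$, so the grid minimizer sits at the kink $x = 1$.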

Lemma 2.

(Consensus Error Decay): There exists a constant $ρ \in (0, 1)$ such that

∥ x^{(k + 1)} - 1 {\bar{x}}^{(k + 1)} ∥ \leq ρ ∥ x^{(k + \frac{1}{2})} - 1 {\bar{x}}^{(k + \frac{1}{2})} ∥ .

Proof.

Due to the spectral properties of W, its second-largest singular value satisfies $σ_{2} (W) < 1$, so we can choose $ρ = σ_{2} (W)$. Because W is doubly stochastic, the consensus step preserves the average, i.e., ${\bar{x}}^{(k + 1)} = {\bar{x}}^{(k + \frac{1}{2})}$, where ${\bar{x}}^{(k + \frac{1}{2})}$ denotes the average of the rows of $x^{(k + \frac{1}{2})}$. Consequently, using $W 1 = 1$,

∥ x^{(k + 1)} - 1 {\bar{x}}^{(k + 1)} ∥ = ∥ W x^{(k + \frac{1}{2})} - 1 {\bar{x}}^{(k + \frac{1}{2})} ∥ = ∥ W (x^{(k + \frac{1}{2})} - 1 {\bar{x}}^{(k + \frac{1}{2})}) ∥ .

The vector $x^{(k + \frac{1}{2})} - 1 {\bar{x}}^{(k + \frac{1}{2})}$ is orthogonal to $1$, and $σ_{2} (W)$ governs the contraction of W on this subspace. We can therefore bound the norm as follows:

∥ W x^{(k + \frac{1}{2})} - 1 {\bar{x}}^{(k + \frac{1}{2})} ∥ \leq σ_{2} (W) ∥ x^{(k + \frac{1}{2})} - 1 {\bar{x}}^{(k + \frac{1}{2})} ∥ = ρ ∥ x^{(k + \frac{1}{2})} - 1 {\bar{x}}^{(k + \frac{1}{2})} ∥,

which completes the proof. □
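
The contraction in Lemma 2 can be observed numerically. The sketch below is a sanity check under assumed data: a 4-agent ring with uniform weights $1 / 3$ and random half-step states. The factor $ρ$ is computed as the spectral norm of $W - \frac{1}{n} 1 1^{⊤}$, which for a doubly stochastic matrix equals the second-largest singular value of W.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3
# Ring with self-loops and uniform weights 1/3 (illustrative doubly stochastic matrix).
W = np.array([[1, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 1]]) / 3.0
J = np.ones((n, n)) / n                    # averaging matrix (1/n) 1 1^T
rho = np.linalg.norm(W - J, 2)             # spectral norm of W - J

X_half = rng.normal(size=(n, d))           # rows play the role of x_i^(k+1/2)
X_next = W @ X_half                        # consensus update
err_half = np.linalg.norm(X_half - J @ X_half)   # Frobenius norm of consensus error
err_next = np.linalg.norm(X_next - J @ X_next)
print(rho, err_next <= rho * err_half + 1e-12)
```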

2.5. Mathematical Description of the Algorithm Pseudocode

Input: Initial states $x_{i} (0)$ for all agents, $λ_{i} (0) > 0$ , adjacency matrix W, and parameter $γ > 0$ .

Algorithm 1 Distributed Adaptive Regularization Algorithm (DARN)
1: Input: Initial states $x_{i} (0)$ , $λ_{i} (0) > 0$ , adjacency matrix W, parameter $γ > 0$
2: for $k = 0, 1, 2, \dots$ do
3: $x_{i}^{(k + \frac{1}{2})} \leftarrow arg \min_{x} \{f_{i} (x) + \frac{λ_{i}^{(k)}}{2} {∥ x - x_{i}^{(k)} ∥}^{2}\}$	▹ Local optimization
4: $x_{i}^{(k + 1)} \leftarrow \sum_{j = 1}^{n} w_{i j} x_{j}^{(k + \frac{1}{2})}$	▹ Consensus update
5: $λ_{i}^{(k + 1)} \leftarrow P_{[λ_{\min}, λ_{\max}]} (λ_{i}^{(k)} + \frac{γ}{L_{i} λ_{i}^{(k)}} (f_{i} (x_{i}^{(k + \frac{1}{2})}) - f_{i} (x_{i}^{(k)})))$
6: end for
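
The three steps of Algorithm 1 can be sketched in a few lines of NumPy. This is a toy instance under stated assumptions, not the authors' implementation: the local objectives are taken to be $f_{i} (x) = {∥ x - c_{i} ∥}_{1}$ (convex, so $μ_{i} = 0$), for which the local minimization has a closed form via soft-thresholding, the global term $g$ is omitted, and the names `darn` and `soft` as well as all parameter values are hypothetical.

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding operator, applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def darn(C, W, lam0=2.0, gamma=0.5, lam_min=1.0, lam_max=10.0, iters=50):
    n, d = C.shape
    L = np.sqrt(d)                          # Lipschitz constant of ||x - c_i||_1 in the 2-norm
    f = lambda X: np.abs(X - C).sum(axis=1) # f_i evaluated row by row
    X = np.zeros((n, d))
    lam = np.full(n, lam0)
    for _ in range(iters):
        # Line 3: local regularized minimization (closed form for this f_i).
        X_half = C + soft(X - C, 1.0 / lam[:, None])
        # Line 5 needs f_i at x^(k+1/2) and x^(k); evaluate before X is overwritten.
        delta = f(X_half) - f(X)
        # Line 4: consensus update with the doubly stochastic matrix W.
        X = W @ X_half
        # Line 5: projected adaptive regularization update.
        lam = np.clip(lam + gamma / (L * lam) * delta, lam_min, lam_max)
    return X, lam

C = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 3.0], [5.0, 1.0]])
W = np.array([[1, 1, 0, 1], [1, 1, 1, 0], [0, 1, 1, 1], [1, 0, 1, 1]]) / 3.0
X, lam = darn(C, W)
print(X)
print(lam)
```

The sketch only illustrates the data flow of one iteration; it makes no claim about the accuracy reached on this toy problem.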

Remark 1.

(Consensus for Directed Graphs): For directed communication topologies, DARN can employ the push-sum [14] protocol as an alternative:

\begin{matrix} x_{i}^{(k + 1)} & = \sum_{j \in N_{i}^{in}} a_{i j} x_{j}^{(k + \frac{1}{2})} \\ ϕ_{i}^{(k + 1)} & = \sum_{j \in N_{i}^{in}} a_{i j} ϕ_{j}^{(k)} \\ z_{i}^{(k + 1)} & = x_{i}^{(k + 1)} / ϕ_{i}^{(k + 1)} \end{matrix}

where $A = [a_{i j}]$ is row-stochastic. This extension enables DARN operation in unidirectional networks but reduces convergence to the sublinear rate $O (1 / k)$.

2.6. Supplementary Remarks

Handling Non-Smooth Terms: If $f_{i}$ and $g$ contain non-smooth terms (e.g., the $\ell_{1}$-norm), a proximal operator can be introduced in the local optimization step. Specifically, when $f_{i} (x) = h_{i} (x) + r_{i} (x)$ , where $h_{i}$ is smooth and $r_{i}$ is non-smooth, the local step is modified as follows:

$x_{i}^{(k + \frac{1}{2})} = {prox}_{r_{i}, λ_{i}^{(k)}} (x_{i}^{(k)} - \frac{1}{λ_{i}^{(k)}} \nabla h_{i} (x_{i}^{(k)}))$

where the proximal operator is defined as

${prox}_{r, λ} (y) = arg \min_{x} \{r (x) + \frac{λ}{2} {∥ x - y ∥}^{2}\} .$
Integration of Global Term $g (x)$ : Because $g (x)$ is a global term, it cannot be directly handled in the local optimization step. Instead, it is implicitly optimized through the consensus step, where each $x_{i}$ converges to a common point x that minimizes $\frac{1}{n} \sum f_{i} (x) + g (x)$ .
Constrained Optimization Extension: DARN extends to constrained problems via

Local step: $x_{i}^{(k + \frac{1}{2})} = arg \min_{x \in X_{i}} \{f_{i} (x) + \frac{λ_{i}^{(k)}}{2} {∥ x - x_{i}^{(k)} ∥}^{2}\}$
Global constraints: $x_{i}^{(k + 1)} = {proj}_{C} (\sum_{j} w_{i j} x_{j}^{(k + \frac{1}{2})})$
Regularization update: $λ_{i}^{(k + 1)} = P (λ_{i}^{(k)} + \frac{γ}{L_{i} λ_{i}^{(k)}} (Δ f_{i}^{(k)} - β \cdot ∥ x_{i}^{(k)} - {proj}_{C} (x_{i}^{(k)}) ∥)) .$

Projection non-expansiveness preserves strong convexity, while the extended Lyapunov function ensures constraint violation convergence.
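
The modified local step for $f_{i} = h_{i} + r_{i}$ described under "Handling Non-Smooth Terms" can be sketched as follows. The choices $h (x) = \frac{1}{2} {∥ A x - b ∥}^{2}$ and $r (x) = τ {∥ x ∥}_{1}$, the weight $τ$, and the function names are assumptions for illustration. With the $\frac{λ}{2}$ scaling used in the definition of ${prox}_{r, λ}$, the proximal map of $τ {∥ \cdot ∥}_{1}$ is soft-thresholding at level $τ / λ$.

```python
import numpy as np

def prox_l1(y, lam, tau=1.0):
    """prox_{r,lam}(y) for r(x) = tau * ||x||_1, using the lam/2 quadratic scaling."""
    return np.sign(y) * np.maximum(np.abs(y) - tau / lam, 0.0)

def local_prox_grad_step(x, grad_h, lam, tau=1.0):
    """Gradient step on the smooth part h, followed by the prox of the non-smooth part r."""
    return prox_l1(x - grad_h(x) / lam, lam, tau)

# Illustrative smooth part h(x) = 0.5 * ||A x - b||^2 (assumed data).
A = np.array([[1.0, 0.0], [0.0, 2.0]])
b = np.array([1.0, 0.1])
grad_h = lambda x: A.T @ (A @ x - b)

x_half = local_prox_grad_step(np.zeros(2), grad_h, lam=4.0, tau=0.5)
print(x_half)
```

Starting from the origin, the gradient step gives $(0.25, 0.05)$ and the threshold is $τ / λ = 0.125$, so the second coordinate is set to zero.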

3. Convergence Theory Analysis for Distributed Adaptive Regularization Algorithm (DARN)

3.1. Convergence Theory Analysis

Critical Point Quality Guarantee: DARN ensures convergence to high-quality critical points via dual mechanisms:

(1): Strict monotonic descent of the mixed time-scale Lyapunov function $Φ^{(k)}$ (Theorem 1) avoids high-value saddles.
(2): Adaptive regularization guides optimization path through $λ_{i}$ tuning:

$λ_{i} ↑$ when $Δ f_{i}^{(k)} ≪ 0$ (stabilize updates during rapid descent)
$λ_{i} ↓$ when $Δ f_{i}^{(k)} \approx 0$ (refine solutions near stationarity)

which prioritizes convergence to deep minima, supplemented by multi-start initialization for enhanced robustness.

Assumptions:

Weakly Convex-Lipschitz Function Class: Each $f_{i} (x)$ is a $(μ_{i}, L_{i})$-weakly convex-Lipschitz function [15], i.e.,

$f_{i} (x) + \frac{μ_{i}}{2} {∥ x ∥}^{2}$

is convex, and

$| f_{i} (x) - f_{i} (y) | \leq L_{i} ∥ x - y ∥$ for all $x, y \in R^{d}$ .
Communication Graph and Mixing Matrix: The adjacency matrix W is doubly stochastic and satisfies the spectral gap condition

$∥ W - \frac{1}{n} 1 1^{⊤} ∥_{2} \leq ρ < 1 .$
Bounded Regularization Parameters: There exist constants $λ_{\min}, λ_{\max}$ such that

$0 < λ_{\min} \leq λ_{i}^{(k)} \leq λ_{\max}$

for all $i, k$ , with $λ_{\min} > \max_{i} μ_{i}$ .
Global Regularizer Properties: The function $g (x)$ is $L_{g}$-Lipschitz continuous and lower semi-continuous (l.s.c.).

3.2. Key Lemmas and Preliminaries

Proposition 2

([16]). Kurdyka–Łojasiewicz (KL): If a function $h : R^{d} \to R$ satisfies the condition that, in a neighborhood of a point $x^{*}$ , there exist $η > 0$ , $θ \in [0, 1)$ , and a concave function $φ (s) = s^{1 - θ}$ such that

φ^{'} (h (x) - h (x^{*})) \cdot ∥ \partial h (x) ∥ \geq 1, \forall x \in B (x^{*}, η) \cap \{x ∣ 0 < h (x) - h (x^{*}) < η\},

then $h (x)$ is said to satisfy the Kurdyka–Łojasiewicz (KL) property at $x^{*}$ . Functions that are semi-algebraic are known to satisfy the global KL property, which serves as a fundamental tool in the convergence analysis of non-convex optimization problems.

Lemma 3

([17]). (Smoothness of the Moreau Envelope): For any weakly convex-Lipschitz function $f_{i} (x)$ , its Moreau envelope $M_{λ f_{i}} (y)$ satisfies

\nabla M_{λ f_{i}} (y) = λ (y - arg \min_{x} \{f_{i} (x) + \frac{λ}{2} {∥ x - y ∥}^{2}\}),

and when $λ > μ_{i}$ , then $M_{λ f_{i}} (y)$ is $\frac{λ L_{i}}{λ - μ_{i}}$-smooth.

Proof.

Let $x^{*} (y) = arg \min_{x} \{f_{i} (x) + \frac{λ}{2} {∥ x - y ∥}^{2}\}$ . For $λ > μ_{i}$ , the strong convexity of $h_{i} (x; y) = f_{i} (x) + \frac{λ}{2} {∥ x - y ∥}^{2}$ ensures that $x^{*} (y)$ exists uniquely. The gradient expression follows directly from the envelope theorem.

To prove smoothness, consider $y_{1}, y_{2} \in R^{d}$ with $x_{1}^{*} = x^{*} (y_{1})$ , $x_{2}^{*} = x^{*} (y_{2})$ . The optimality conditions provide

\begin{matrix} 0 & \in \partial f_{i} (x_{1}^{*}) + λ (x_{1}^{*} - y_{1}), \\ 0 & \in \partial f_{i} (x_{2}^{*}) + λ (x_{2}^{*} - y_{2}) . \end{matrix}

By weak convexity of $f_{i}$ , the subdifferential satisfies

〈 λ (y_{1} - x_{1}^{*}) - λ (y_{2} - x_{2}^{*}), x_{1}^{*} - x_{2}^{*} 〉 \geq - μ_{i} {∥ x_{1}^{*} - x_{2}^{*} ∥}^{2} .

Rearranging yields the following key inequality:

λ 〈 y_{1} - y_{2}, x_{1}^{*} - x_{2}^{*} 〉 \geq (λ - μ_{i}) {∥ x_{1}^{*} - x_{2}^{*} ∥}^{2} . \quad (1)

Applying the Cauchy–Schwarz inequality to (1),

∥ x_{1}^{*} - x_{2}^{*} ∥ \leq \frac{λ}{λ - μ_{i}} ∥ y_{1} - y_{2} ∥ . \quad (2)

The gradient difference is

∥ \nabla M_{λ f_{i}} (y_{1}) - \nabla M_{λ f_{i}} (y_{2}) ∥ = λ ∥ (y_{1} - y_{2}) - (x_{1}^{*} - x_{2}^{*}) ∥ .

Substituting (2) and refining the bound using $∥ \partial f_{i} (\cdot) ∥ \leq L_{i}$ (from Lipschitz continuity) provides

λ ∥ (y_{1} - y_{2}) - (x_{1}^{*} - x_{2}^{*}) ∥ \leq \frac{λ L_{i}}{λ - μ_{i}} ∥ y_{1} - y_{2} ∥ . □
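
The gradient formula of Lemma 3 can be verified numerically for a simple case. The sketch assumes the scalar function $f (x) = | x |$ (convex, so $μ = 0$), whose minimizer in the envelope definition is a soft-thresholding of $y$ at level $1 / λ$; the formula $λ (y - x^{*} (y))$ is compared against a central finite difference of the envelope. Function names are illustrative.

```python
import numpy as np

def prox_abs(y, lam):
    """argmin_x |x| + (lam/2) (x - y)^2, i.e. soft-thresholding at level 1/lam."""
    return np.sign(y) * max(abs(y) - 1.0 / lam, 0.0)

def moreau_abs(y, lam):
    """Moreau envelope of |x| with the lam/2 scaling used in Lemma 3."""
    p = prox_abs(y, lam)
    return abs(p) + 0.5 * lam * (p - y) ** 2

def grad_moreau_abs(y, lam):
    """Gradient formula from Lemma 3: lam * (y - x*(y))."""
    return lam * (y - prox_abs(y, lam))

lam, eps = 2.0, 1e-6
for y in (-2.0, 0.3, 1.7):
    fd = (moreau_abs(y + eps, lam) - moreau_abs(y - eps, lam)) / (2 * eps)
    print(y, grad_moreau_abs(y, lam), fd)
```

For this choice the envelope is the Huber function: quadratic for $| y | \leq 1 / λ$ and linear with slope $\pm 1$ outside that interval.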

Lemma 4.

(Lower Bound Preservation of Regularization Parameters):

If the initial values satisfy $λ_{i}^{(0)} \in [λ_{\min}, λ_{\max}]$ with $λ_{\min} > μ_{i}$ and the parameter update rule is provided by

λ_{i}^{(k + 1)} = P_{[λ_{\min}, λ_{\max}]} (λ_{i}^{(k)} + \frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)}),

then for all $k \geq 0$ we have $λ_{i}^{(k)} \geq λ_{\min} > μ_{i}$ .

Proof.

Base Case: For $k = 0$ , the assumption $λ_{i}^{(0)} \geq λ_{\min} > μ_{i}$ holds.

Inductive Hypothesis: Assume for some k, $λ_{i}^{(k)} \geq λ_{\min} > μ_{i}$ .
Update Analysis: By the descent property of the local optimization step (Lemma 1), the change in the local function value satisfies

Δ_{i}^{(k)} = f_{i} (x_{i}^{(k + \frac{1}{2})}) - f_{i} (x_{i}^{(k)}) \leq - \frac{λ_{i}^{(k)} - μ_{i}}{2} {∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥}^{2} \leq 0 .

Thus, the unprojected update value satisfies

{\tilde{λ}}_{i}^{(k + 1)} = λ_{i}^{(k)} + \frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)} \leq λ_{i}^{(k)} .

Projection Analysis:

Because the projection operator $P_{[λ_{\min}, λ_{\max}]}$ restricts values to $[λ_{\min}, λ_{\max}]$ and $λ_{\min} > μ_{i}$ , we have

λ_{i}^{(k + 1)} = P_{[λ_{\min}, λ_{\max}]} ({\tilde{λ}}_{i}^{(k + 1)}) \geq λ_{\min} > μ_{i} .

By mathematical induction, $λ_{i}^{(k)} \geq λ_{\min} > μ_{i}$ holds for all k. □

Lemma 5

([18]). (Lyapunov Function Descent Under Projection)

Define the modified Lyapunov function

Φ^{(k)} = \frac{1}{n} \sum_{i = 1}^{n} f_{i} (x_{i}^{(k)}) + g ({\bar{x}}^{(k)}) + \frac{α}{2} {∥ x^{(k)} - 1 {\bar{x}}^{(k)} ∥}^{2} + \frac{β}{n} \sum_{i = 1}^{n} {(λ_{i}^{(k)} - λ_{\min})}^{2},

where $α, β > 0$ are tuning parameters. Then, there exists $γ > 0$ such that

Φ^{(k + 1)} \leq Φ^{(k)} - γ (\frac{1}{n} \sum_{i = 1}^{n} ∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥^{2} + {∥ x^{(k)} - 1 {\bar{x}}^{(k)} ∥}^{2}) .

Proof.

Non-Expansiveness of the Projection Operator. For any $a \leq b$ and $x, y \in R$, we have

| P_{[a, b]} (x) - P_{[a, b]} (y) | \leq | x - y | .

Thus, the projection operation does not increase parameter variations.

Parameter Update Decomposition. Let $Δ_{i}^{(k)} = f_{i} (x_{i}^{(k + \frac{1}{2})}) - f_{i} (x_{i}^{(k)})$ ; then, applying non-expansiveness with $y = λ_{\min}$ (which is a fixed point of the projection), we have

{(λ_{i}^{(k + 1)} - λ_{\min})}^{2} \leq {(λ_{i}^{(k)} + \frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)} - λ_{\min})}^{2} .

Expanding the right-hand side provides

{(λ_{i}^{(k)} - λ_{\min})}^{2} + 2 \frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)} (λ_{i}^{(k)} - λ_{\min}) + {(\frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)})}^{2} .

Lyapunov Function Descent Analysis. Combining the original Lyapunov function descent (Theorem 1) and the parameter update terms yields

\begin{matrix} Φ^{(k + 1)} & \leq Φ^{(k)} - γ_{1} (\frac{1}{n} \sum_{i = 1}^{n} ∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥^{2} + {∥ x^{(k)} - 1 {\bar{x}}^{(k)} ∥}^{2}) \\ & + \frac{β}{n} \sum_{i = 1}^{n} [2 \frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)} (λ_{i}^{(k)} - λ_{\min}) + {(\frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)})}^{2}] . \end{matrix}

Because $Δ_{i}^{(k)} \leq 0$ and $λ_{i}^{(k)} - λ_{\min} \geq 0$ , the cross term in the bracket is non-positive, while the squared term is non-negative but of second order in $γ$. By choosing sufficiently small $β$ and $γ$ , the overall descent is guaranteed. □

Lemma 6.

(Consensus Error Recursion). Let $x^{(k)} = {[x_{1}^{(k)}, \dots, x_{n}^{(k)}]}^{⊤}$ and ${\bar{x}}^{(k)} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}^{(k)}$ ; then

∥ x^{(k + 1)} - 1 {\bar{x}}^{(k + 1)} ∥ \leq ρ ∥ x^{(k + \frac{1}{2})} - 1 {\bar{x}}^{(k + \frac{1}{2})} ∥ + C \sum_{i = 1}^{n} ∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥,

where $C = \frac{\sqrt{n} L_{\max}}{λ_{\min}}$ and $L_{\max} = \max_{i} L_{i}$ .

Proof.

Consensus Error Decomposition: The consensus update step is $x^{(k + 1)} = W x^{(k + \frac{1}{2})}$ and the global average satisfies

{\bar{x}}^{(k + 1)} = \frac{1}{n} 1^{⊤} x^{(k + 1)} = \frac{1}{n} 1^{⊤} W x^{(k + \frac{1}{2})} = {\bar{x}}^{(k + \frac{1}{2})} .

We define the consensus error

e^{(k + 1)} = x^{(k + 1)} - 1 {\bar{x}}^{(k + 1)} = W x^{(k + \frac{1}{2})} - 1 {\bar{x}}^{(k + \frac{1}{2})}

and decompose $x^{(k + \frac{1}{2})}$ into the global average and error terms

x^{(k + \frac{1}{2})} = 1 {\bar{x}}^{(k + \frac{1}{2})} + e^{(k + \frac{1}{2})},

where $e^{(k + \frac{1}{2})} = x^{(k + \frac{1}{2})} - 1 {\bar{x}}^{(k + \frac{1}{2})}$ . Substituting into the consensus error expression yields

e^{(k + 1)} = W (1 {\bar{x}}^{(k + \frac{1}{2})} + e^{(k + \frac{1}{2})}) - 1 {\bar{x}}^{(k + \frac{1}{2})} = W e^{(k + \frac{1}{2})},

as $W 1 = 1$ .

Spectral Norm Contraction. From the spectral properties [19] of the doubly stochastic matrix $W$ , there exists $ρ = σ_{2} (W) < 1$ such that

∥ W e^{(k + \frac{1}{2})} ∥ \leq ρ ∥ e^{(k + \frac{1}{2})} ∥ .

Thus,

∥ e^{(k + 1)} ∥ = ∥ W e^{(k + \frac{1}{2})} ∥ \leq ρ ∥ e^{(k + \frac{1}{2})} ∥ .

Impact of Local Updates. The local optimization step can be viewed as a perturbation of the previous state

x^{(k + \frac{1}{2})} = x^{(k)} + Δ x^{(k)},

where $Δ x^{(k)} = {[x_{1}^{(k + \frac{1}{2})} - x_{1}^{(k)}, \dots, x_{n}^{(k + \frac{1}{2})} - x_{n}^{(k)}]}^{⊤}$ .

The local optimization step satisfies the following optimality condition:

0 \in \partial f_{i} (x_{i}^{(k + \frac{1}{2})}) + λ_{i}^{(k)} (x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)}) .

Per the $L_{i}$-Lipschitz continuity of $f_{i}$ , the subgradient is bounded as follows:

∥\partial f_{i} (x_{i}^{(k + \frac{1}{2})})∥ \leq L_{i} .

Combining with the optimality condition,

∥λ_{i}^{(k)} (x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)})∥ \leq L_{i} ⟹ ∥x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)}∥ \leq \frac{L_{i}}{λ_{i}^{(k)}} .

By the assumption $λ_{i}^{(k)} \geq λ_{\min}$ , we obtain a uniform upper bound:

∥x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)}∥ \leq \frac{L_{\max}}{λ_{\min}} .

Effect of Perturbations on Consensus Error. Incorporating the perturbation term into the consensus error recursion, we have

e^{(k + \frac{1}{2})} = x^{(k + \frac{1}{2})} - 1 {\bar{x}}^{(k + \frac{1}{2})} = x^{(k)} + Δ x^{(k)} - 1 {\bar{x}}^{(k + \frac{1}{2})} .

Noting that $1 {\bar{x}}^{(k + \frac{1}{2})} = 1 {\bar{x}}^{(k)} + 1 ({\bar{x}}^{(k + \frac{1}{2})} - {\bar{x}}^{(k)})$ , through combination with $x^{(k)} = 1 {\bar{x}}^{(k)} + e^{(k)}$ we obtain

e^{(k + \frac{1}{2})} = e^{(k)} + Δ x^{(k)} - 1 ({\bar{x}}^{(k + \frac{1}{2})} - {\bar{x}}^{(k)}) .

Because ${\bar{x}}^{(k + \frac{1}{2})} - {\bar{x}}^{(k)} = \frac{1}{n} \sum_{i = 1}^{n} Δ x_{i}^{(k)}$ and $1 ({\bar{x}}^{(k + \frac{1}{2})} - {\bar{x}}^{(k)}) = \frac{1}{n} 1 1^{⊤} Δ x^{(k)}$ , we have

e^{(k + \frac{1}{2})} = e^{(k)} + (I - \frac{1}{n} 1 1^{⊤}) Δ x^{(k)} .

Let $J = \frac{1}{n} 1 1^{⊤}$ denote the projection matrix; then, $I - J$ is the centering projection matrix satisfying ${∥ I - J ∥}_{2} = 1$ .

Final Form of the Error Recursion.
Combining the spectral contraction (Consensus Error Decomposition) and the perturbation decomposition (Effect of Perturbations on Consensus Error), we have

e^{(k + 1)} = W e^{(k + \frac{1}{2})} = W (e^{(k)} + (I - J) Δ x^{(k)}) .

Taking the norms and applying the triangle inequality yields

∥ e^{(k + 1)} ∥ \leq ∥ W e^{(k)} ∥ + ∥ W (I - J) Δ x^{(k)} ∥ .

Further, using the contraction $∥ W e^{(k)} ∥ \leq ρ ∥ e^{(k)} ∥$ (valid because $e^{(k)}$ is orthogonal to $1$) together with the spectral norm properties ${∥ W ∥}_{2} \leq 1$ and ${∥ I - J ∥}_{2} = 1$ , we have

∥ e^{(k + 1)} ∥ \leq ρ ∥ e^{(k)} ∥ + ∥ Δ x^{(k)} ∥ .

Substituting the perturbation bound from (Impact of Local Updates)

∥ Δ x^{(k)} ∥ \leq \sqrt{n} \cdot \frac{L_{\max}}{λ_{\min}},

we obtain

∥ e^{(k + 1)} ∥ \leq ρ ∥ e^{(k)} ∥ + \frac{\sqrt{n} L_{\max}}{λ_{\min}} \sum_{i = 1}^{n} ∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥ .

Letting $C = \frac{\sqrt{n} L_{\max}}{λ_{\min}}$ , the lemma then follows:

∥ e^{(k + 1)} ∥ \leq ρ ∥ e^{(k)} ∥ + C \sum_{i = 1}^{n} ∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥ .

□
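
Two linear-algebra facts used in the proof above can be checked numerically: the decomposition $e^{(k + \frac{1}{2})} = e^{(k)} + (I - J) Δ x^{(k)}$ and the norm ${∥ I - J ∥}_{2} = 1$. The sketch uses random matrices as stand-ins for the agent states (an assumption for illustration) and additionally confirms the elementary bound that the Frobenius norm of the stacked perturbation is at most the sum of the per-agent norms.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 2
J = np.ones((n, n)) / n
P = np.eye(n) - J                           # centering projection I - J

X = rng.normal(size=(n, d))                 # stand-in for x^(k), one row per agent
dX = rng.normal(size=(n, d))                # stand-in for the local perturbation
e_k = P @ X                                 # consensus error before the local step
e_half = P @ (X + dX)                       # consensus error after the local step

print(np.allclose(e_half, e_k + P @ dX))
print(np.isclose(np.linalg.norm(P, 2), 1.0))
# Frobenius norm of the stacked perturbation versus the sum of per-agent norms.
print(np.linalg.norm(dX) <= np.linalg.norm(dX, axis=1).sum())
```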

4. Global Convergence Analysis

Theorem 1.

(Monotonicity of the Lyapunov Function). Define the global Lyapunov function as

Φ^{(k)} = \frac{1}{n} \sum_{i = 1}^{n} f_{i} (x_{i}^{(k)}) + g ({\bar{x}}^{(k)}) + \frac{α}{2} {∥ x^{(k)} - 1 {\bar{x}}^{(k)} ∥}^{2} + \frac{β}{n} \sum_{i = 1}^{n} {(λ_{i}^{(k)})}^{2},

where $α, β > 0$ are tuning parameters. Under Assumptions 1–4, there exists a constant $γ > 0$ such that

Φ^{(k + 1)} \leq Φ^{(k)} - γ (\frac{1}{n} \sum_{i = 1}^{n} ∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥^{2} + {∥ x^{(k)} - 1 {\bar{x}}^{(k)} ∥}^{2}) .
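
For monitoring purposes, the Lyapunov function of Theorem 1 can be evaluated directly from the agent states. The sketch below is a plain transcription of the four terms; the function name `lyapunov`, the toy objectives $f_{i} (x) = {∥ x - c_{i} ∥}_{1}$ and $g (x) = 0.1 {∥ x ∥}_{1}$, and all numerical values are assumptions for illustration.

```python
import numpy as np

def lyapunov(X, lam, f_list, g, alpha, beta):
    """Evaluate Phi^(k): local objectives + g at the average + consensus and lambda terms."""
    x_bar = X.mean(axis=0)                                  # global average
    local = np.mean([f(x) for f, x in zip(f_list, X)])      # (1/n) sum_i f_i(x_i)
    consensus = 0.5 * alpha * np.sum((X - x_bar) ** 2)      # (alpha/2) ||x - 1 x_bar||^2
    penalty = beta * np.mean(lam ** 2)                      # (beta/n) sum_i lambda_i^2
    return local + g(x_bar) + consensus + penalty

# Toy data (assumed): two agents that already agree on x = (1, 0).
C = np.array([[0.0, 0.0], [2.0, 0.0]])
f_list = [lambda x, c=c: np.abs(x - c).sum() for c in C]
g = lambda x: 0.1 * np.abs(x).sum()
X = np.array([[1.0, 0.0], [1.0, 0.0]])
lam = np.array([1.0, 3.0])
phi = lyapunov(X, lam, f_list, g, alpha=2.0, beta=0.5)
print(phi)
```

Here the consensus term vanishes because both rows of `X` coincide, so the value is the sum of the averaged local objectives, $g$ at the average, and the regularization-parameter term.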

Proof of Theorem 1.

Local Optimization Descent Analysis. By the optimality condition of the local optimization step, for any agent i we have:

0 \in \partial f_{i} (x_{i}^{(k + \frac{1}{2})}) + λ_{i}^{(k)} (x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)}) .

Combining this with the weak convexity assumption ($f_{i} (x) + \frac{μ_{i}}{2} {∥ x ∥}^{2}$ is convex), we obtain

f_{i} (x_{i}^{(k + \frac{1}{2})}) \leq f_{i} (x_{i}^{(k)}) - \frac{λ_{i}^{(k)} - μ_{i}}{2} {∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥}^{2} .

Because $λ_{i}^{(k)} \geq λ_{\min} > μ_{i}$ (Assumption 3), the local optimization step guarantees a decrease in the function value.

Impact of Consensus Update on the Lyapunov Function. After the consensus update, the global average variable is

{\bar{x}}^{(k + 1)} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}^{(k + 1)} = \frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1}^{n} w_{i j} x_{j}^{(k + \frac{1}{2})} = \frac{1}{n} \sum_{j = 1}^{n} x_{j}^{(k + \frac{1}{2})} = {\bar{x}}^{(k + \frac{1}{2})} .

Thus, the global regularization term satisfies

g ({\bar{x}}^{(k + 1)}) = g ({\bar{x}}^{(k + \frac{1}{2})}) \leq g ({\bar{x}}^{(k)}) + L_{g} ∥ {\bar{x}}^{(k + \frac{1}{2})} - {\bar{x}}^{(k)} ∥ .

Consensus Error Recursion and Contraction. By Lemma 6 and the boundedness of local updates,

∥ x^{(k + 1)} - 1 {\bar{x}}^{(k + 1)} ∥ \leq ρ ∥ x^{(k + \frac{1}{2})} - 1 {\bar{x}}^{(k + \frac{1}{2})} ∥ + C \sum_{i = 1}^{n} ∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥ .

Further, using the local optimization descent bound

∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥ \leq \frac{2}{λ_{\min} - μ_{i}} \sqrt{f_{i} (x_{i}^{(k)}) - f_{i} (x_{i}^{(k + \frac{1}{2})})} ,

we obtain

∥ x^{(k + 1)} - 1 {\bar{x}}^{(k + 1)} ∥^{2} \leq ρ^{2} {∥ x^{(k + \frac{1}{2})} - 1 {\bar{x}}^{(k + \frac{1}{2})} ∥}^{2} + \frac{C^{2} n}{λ_{\min}^{2}} \sum_{i = 1}^{n} (f_{i} (x_{i}^{(k)}) - f_{i} (x_{i}^{(k + \frac{1}{2})})) .

Energy Control of Regularization Parameter Updates. The regularization parameter update rule is

λ_{i}^{(k + 1)} = λ_{i}^{(k)} + \frac{γ}{L_{i} λ_{i}^{(k)}} (f_{i} (x_{i}^{(k + \frac{1}{2})}) - f_{i} (x_{i}^{(k)})) .

Because

f_{i} (x_{i}^{(k + \frac{1}{2})}) \leq f_{i} (x_{i}^{(k)})

, the update term is non-positive; thus,

λ_{i}^{(k + 1)} \leq λ_{i}^{(k)}

. Combined with the boundedness in Assumption 3, there exists a constant

D > 0

such that

\sum_{i = 1}^{n} {(λ_{i}^{(k + 1)})}^{2} \leq \sum_{i = 1}^{n} {(λ_{i}^{(k)})}^{2} - D \sum_{i = 1}^{n} (f_{i} (x_{i}^{(k)}) - f_{i} (x_{i}^{(k + \frac{1}{2})})) .

Overall Descent of the Lyapunov Function. Combining Steps 1–4, the descent of the Lyapunov function is as follows:

Φ^{(k + 1)} \leq Φ^{(k)} - \sum_{i = 1}^{n} [\frac{λ_{\min} - μ_{i}}{2 n} {∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥}^{2}] - \frac{α (1 - ρ^{2})}{2} {∥ x^{(k + \frac{1}{2})} - 1 {\bar{x}}^{(k + \frac{1}{2})} ∥}^{2} .

Choosing

α = \frac{2 C^{2} n}{λ_{\min}^{2} (1 - ρ^{2})}

and

β = \frac{D}{n}

, we obtain

Φ^{(k + 1)} \leq Φ^{(k)} - γ (\frac{1}{n} \sum_{i = 1}^{n} ∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥^{2} + {∥ x^{(k)} - 1 {\bar{x}}^{(k)} ∥}^{2}),

where

γ = \min \{\min_{i} \frac{λ_{\min} - μ_{i}}{2}, \frac{α (1 - ρ^{2})}{2}\}

.

Impact of Projection on the Lyapunov Function. By Lemma 3, the projection operation does not disrupt the descent property:

Definitions and Assumptions:
The Lyapunov function is defined as

$Φ^{(k)} = \underbrace{\frac{1}{n} \sum_{i = 1}^{n} f_{i} (x_{i}^{(k)}) + g ({\bar{x}}^{(k)}) + \frac{α}{2} {∥ x^{(k)} - 1 {\bar{x}}^{(k)} ∥}^{2}}_{\text{Original Lyapunov Term}} + \underbrace{\frac{β}{n} \sum_{i = 1}^{n} {(λ_{i}^{(k)} - λ_{\min})}^{2}}_{\text{Additional Parameter Penalty}} .$

The parameter update rule is

$λ_{i}^{(k + 1)} = P_{[λ_{\min}, λ_{\max}]} (λ_{i}^{(k)} + \frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)}),$

where $Δ_{i}^{(k)} = f_{i} (x_{i}^{(k + \frac{1}{2})}) - f_{i} (x_{i}^{(k)})$ .
Known conditions:

$Δ_{i}^{(k)} \leq - \frac{λ_{i}^{(k)} - μ_{i}}{2} {∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥}^{2}$ (Lemma 1).

$λ_{i}^{(k)} \geq λ_{\min} > μ_{i}$ (Lemma 2).
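The projected parameter update can be sketched in a few lines. This is a minimal illustration of the rule stated above, not the authors' code: the function names are hypothetical, and the numerical values used in the usage note (initial λ = 2.5, range [0.5, 5.0], L ≈ 6.32, γ = 0.001) are borrowed from the FRMC configuration in Section 7.3.1.

```python
def project(v: float, lo: float, hi: float) -> float:
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return min(max(v, lo), hi)


def update_lambda(lam: float, delta: float, L: float, gamma: float,
                  lam_min: float, lam_max: float) -> float:
    """Projected update P_[lam_min, lam_max](lam + gamma / (L * lam) * delta).

    delta is the local function-value change f_i(x_half) - f_i(x), which is
    non-positive after a successful local step, so lam can only shrink
    until the projection holds it at lam_min.
    """
    return project(lam + gamma / (L * lam) * delta, lam_min, lam_max)
```

Calling `update_lambda(2.5, -1.0, 6.32, 0.001, 0.5, 5.0)` returns a value slightly below 2.5, while a very large decrease in the local objective is clipped to the lower bound 0.5.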
Analysis of the Projection’s Impact:
Goal: Show that the additional parameter penalty term satisfies

$\frac{β}{n} \sum_{i = 1}^{n} {(λ_{i}^{(k + 1)} - λ_{\min})}^{2} \leq \frac{β}{n} \sum_{i = 1}^{n} {(λ_{i}^{(k)} + \frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)} - λ_{\min})}^{2} .$

Proof.

Non-Expansiveness of the Projection Operator [20]. For any

a \leq b

and

x, y \in R

, the projection operator

P_{[a, b]}

satisfies

| P_{[a, b]} (x) - P_{[a, b]} (y) | \leq | x - y | .

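Specifically, when

y = λ_{\min}

, we have

P_{[λ_{\min}, λ_{\max}]} (λ_{\min}) = λ_{\min}

, so that for any x,

| P_{[λ_{\min}, λ_{\max}]} (x) - λ_{\min} | \leq | x - λ_{\min} | .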

Summation Expansion: For each agent i, we expand the right-hand side:

{(λ_{i}^{(k)} + \frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)} - λ_{\min})}^{2} = {(λ_{i}^{(k)} - λ_{\min})}^{2} + 2 \frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)} (λ_{i}^{(k)} - λ_{\min}) + {(\frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)})}^{2} .

Overall Summation: Summing over all agents and multiplying by

\frac{β}{n}

yields

\frac{β}{n} \sum_{i = 1}^{n} {(λ_{i}^{(k + 1)} - λ_{\min})}^{2} \leq \frac{β}{n} \sum_{i = 1}^{n} [{(λ_{i}^{(k)} - λ_{\min})}^{2} + 2 \frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)} (λ_{i}^{(k)} - λ_{\min}) + {(\frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)})}^{2}] .

Combining with Lyapunov Function Descent: The goal is to show that the additional terms on the right-hand side can be offset by the original Lyapunov descent terms.

Decomposition of Lyapunov Function Change:

Φ^{(k + 1)} - Φ^{(k)} \leq - γ (\frac{1}{n} \sum_{i = 1}^{n} ∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥^{2} + {∥ x^{(k)} - 1 {\bar{x}}^{(k)} ∥}^{2}) + \frac{β}{n} \sum_{i = 1}^{n} [2 \frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)} (λ_{i}^{(k)} - λ_{\min}) + {(\frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)})}^{2}] .

Sign Analysis of Terms:

Cross Term: $2 \frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)} (λ_{i}^{(k)} - λ_{\min})$ is non-positive, since $Δ_{i}^{(k)} \leq 0$ and $λ_{i}^{(k)} - λ_{\min} \geq 0$ .
Squared Term: ${(\frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)})}^{2} \geq 0$ .
Control of the Squared Term:
Using $Δ_{i}^{(k)} \leq - \frac{λ_{i}^{(k)} - μ_{i}}{2} {∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥}^{2}$ , we obtain

{(\frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)})}^{2} \leq {(\frac{γ (λ_{i}^{(k)} - μ_{i})}{2 L_{i} λ_{i}^{(k)}} {∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥}^{2})}^{2} .

Choosing

β

to be sufficiently small, e.g.,

β \leq \frac{α L_{\min}^{2} λ_{\min}^{2}}{γ^{2} {(λ_{\max} - μ_{\max})}^{2}}

, ensures

\frac{β}{n} \sum_{i = 1}^{n} {(\frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)})}^{2} \leq \frac{γ_{1}}{n} \sum_{i = 1}^{n} {∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥}^{2},

where

γ_{1} < γ

. □

Integration of Results.
Combining the above analysis, we obtain the following inequality for the Lyapunov function descent:

Φ^{(k + 1)} - Φ^{(k)} \leq - γ (\frac{1}{n} \sum_{i = 1}^{n} ∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥^{2} + {∥ x^{(k)} - 1 {\bar{x}}^{(k)} ∥}^{2}) + \frac{β}{n} \sum_{i = 1}^{n} [\text{Non-positive Terms} + \text{Offset Terms}] .

Because the non-positive terms (e.g., the cross term

2 \frac{γ}{L_{i} λ_{i}^{(k)}} Δ_{i}^{(k)} (λ_{i}^{(k)} - λ_{\min})

) do not increase the Lyapunov function and because the squared terms are controlled by the original descent terms, we can further simplify the inequality to

Φ^{(k + 1)} - Φ^{(k)} \leq - γ^{'} (\frac{1}{n} \sum_{i = 1}^{n} ∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥^{2} + {∥ x^{(k)} - 1 {\bar{x}}^{(k)} ∥}^{2}),

where

γ^{'} = \min \{γ, γ_{1}\}

is a positive constant.

This completes the proof of Theorem 1, showing that the Lyapunov function $Φ^{(k)}$ decreases monotonically with each iteration and ensuring global convergence of the algorithm. □

5. Convergence Rate and Complexity Analysis

Theorem 2.

(Sublinear Convergence Rate): Under Assumptions 1–4, there exists a constant

C > 0

such that for any

K \geq 1

, the sequence generated by the algorithm satisfies

\min_{0 \leq k \leq K} (\frac{1}{n} \sum_{i = 1}^{n} {∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥}^{2}) \leq \frac{C Φ^{(0)}}{K},

where

C = \frac{2}{γ (λ_{\min} - μ_{\max})}

and

μ_{\max} = \max_{i} μ_{i}

.

Proof of Theorem 2.

Accumulation of Lyapunov Descent. By Theorem 1, summing over K iterations yields

\sum_{k = 0}^{K - 1} (\frac{1}{n} \sum_{i = 1}^{n} {∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥}^{2}) \leq \frac{Φ^{(0)} - Φ^{(K)}}{γ} \leq \frac{Φ^{(0)}}{γ} .

Minimum Value Inequality. There exists some iteration

k^{*} \in \{0, 1, \dots, K - 1\}

such that

\frac{1}{n} \sum_{i = 1}^{n} {∥ x_{i}^{(k^{*} + \frac{1}{2})} - x_{i}^{(k^{*})} ∥}^{2} \leq \frac{Φ^{(0)}}{γ K} .
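The step from the accumulated descent to the minimum-value bound is the usual "minimum is at most the average" argument. Written out as a worked chain, with the shorthand a_k introduced here only for readability and with the last inequality relying, as in the accumulation step, on Φ^{(K)} being non-negative:

```latex
a_k := \frac{1}{n} \sum_{i=1}^{n} \bigl\| x_i^{(k+\frac{1}{2})} - x_i^{(k)} \bigr\|^2 ,
\qquad
\min_{0 \le k \le K-1} a_k
\;\le\; \frac{1}{K} \sum_{k=0}^{K-1} a_k
\;\le\; \frac{\Phi^{(0)} - \Phi^{(K)}}{\gamma K}
\;\le\; \frac{\Phi^{(0)}}{\gamma K} .
```

Any index attaining the minimum on the left can serve as k^{*}.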

Critical Point Approximation. By the optimality condition of the local optimization step

\nabla M_{λ_{i}^{(k^{*})} f_{i}} (x_{i}^{(k^{*})}) = λ_{i}^{(k^{*})} (x_{i}^{(k^{*})} - x_{i}^{(k^{*} + \frac{1}{2})}),

and combining with the first-order condition of the global objective, we obtain

∥\frac{1}{n} \sum_{i = 1}^{n} \nabla M_{λ_{i}^{(k^{*})} f_{i}} (x_{i}^{(k^{*})}) + \nabla g ({\bar{x}}^{(k^{*})})∥ \leq \frac{C Φ^{(0)}}{K} .

□

Complexity Analysis

Communication Complexity:
- Each iteration requires one round of neighborhood communication, transmitting d-dimensional vectors.
- To reach an $ε$-critical point, $K = O (\frac{1}{ε^{2}})$ communication rounds are needed, resulting in a total communication cost of
  
  $\text{Total Communication Cost} = O (\frac{n d}{ε^{2}}) .$
Computational Complexity:
- Local Optimization Step: Each local problem is assumed to be solved by the proximal gradient method in $T_{i} = O (\log \frac{1}{ε_{\text{local}}})$ steps to reach precision $ε_{\text{local}}$ .
- Total Gradient Computations:
  
  $\text{Total Computational Cost} = O (\frac{n}{ε^{2}} \log \frac{1}{ε_{\text{local}}}) .$
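To get a feel for how these estimates scale, the following sketch evaluates them for concrete inputs. It is only indicative: the parameter `c` is a hypothetical stand-in for the constants hidden by the O(·) notation (set to 1 by default), and the function names are illustrative.

```python
import math


def communication_rounds(eps: float, c: float = 1.0) -> int:
    """K = ceil(c / eps^2); c stands in for the constant hidden by O(.)."""
    return math.ceil(c / eps ** 2)


def total_communication(n: int, d: int, eps: float, c: float = 1.0) -> int:
    """Floats transmitted: n agents each send one d-dimensional vector per round."""
    return n * d * communication_rounds(eps, c)


def total_computation(n: int, eps: float, eps_local: float, c: float = 1.0) -> float:
    """Inner steps: n local solves per round, each about log(1/eps_local) steps."""
    return n * communication_rounds(eps, c) * math.log(1.0 / eps_local)
```

Halving ε multiplies both totals by roughly four, whereas tightening the local precision only adds a logarithmic factor to the computational cost.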

6. Stability and Robustness Analysis

Theorem 3.

(Stability Under Dynamic Perturbations): Suppose that there exists a perturbation sequence

\{ δ_{i}^{(k)} \}

satisfying

\sum_{k = 0}^{\infty} ∥ δ_{i}^{(k)} ∥ < \infty

. Then, the modified update rule

x_{i}^{(k + 1)} = \sum_{j = 1}^{n} w_{i j} x_{j}^{(k + \frac{1}{2})} + δ_{i}^{(k)}

still ensures that the sequence generated by the algorithm converges to a critical point of the original problem.

Proof of Theorem 3.

Incorporating Perturbations into the Lyapunov Function. Define the modified Lyapunov function as follows:

{\tilde{Φ}}^{(k)} = Φ^{(k)} + \frac{α}{2} \sum_{i = 1}^{n} {∥ δ_{i}^{(k)} ∥}^{2} .

Control of Perturbation Errors. Because the perturbation norms are non-negative and summable, their squares are also summable, so the cumulative effect of the perturbation terms satisfies

\sum_{k = 0}^{\infty} {∥ δ_{i}^{(k)} ∥}^{2} \leq {(\sum_{k = 0}^{\infty} ∥ δ_{i}^{(k)} ∥)}^{2} < \infty .

Preservation of Convergence. The original Lyapunov descent dominates the perturbation terms, and since

\lim_{k \to \infty} ∥ δ_{i}^{(k)} ∥ = 0

, the convergence properties remain unchanged. □
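The summability bound used in the proof can be illustrated with a concrete sequence. The sketch below assumes hypothetical perturbation norms ∥δ^{(k)}∥ = 1/(k+1)^2, which are summable, and compares the partial sum of the norms with the partial sum of their squares; the bound holds because all cross terms in the squared sum are non-negative.

```python
def partial_sums(K: int) -> tuple[float, float]:
    """For the assumed norms 1/(k+1)^2, k = 0..K-1, return
    (sum of norms, sum of squared norms)."""
    norms = [1.0 / (k + 1) ** 2 for k in range(K)]
    return sum(norms), sum(v * v for v in norms)


s1, s2 = partial_sums(1000)
# The sum of squares never exceeds the square of the sum for non-negative terms.
print(s2 <= s1 ** 2)
```

Both partial sums stay bounded as K grows, which is the property the modified Lyapunov function relies on.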

Theorem 4.

(Robustness to Topology Changes): Suppose that the sequence of communication matrices

\{ W^{(k)} \}

satisfies the following:

1. Double Stochasticity: For all k, $W^{(k)} 1 = 1$ and $1^{⊤} W^{(k)} = 1^{⊤}$ .
2. Uniform Spectral Gap: There exists a constant $ρ \in (0, 1)$ such that ${∥ W^{(k)} - \frac{1}{n} 1 1^{⊤} ∥}_{2} \leq ρ$ for all k.

Then, the sequence $\{ x_{i}^{(k)} \}$ generated by the algorithm still converges to a critical point of the global objective function $f (x)$ .

Proof of Theorem 4.

Time-Varying Consensus Error Recursion. Define the time-varying consensus error as

e^{(k)} = x^{(k)} - 1 {\bar{x}}^{(k)}

. Its recursion relation is then

e^{(k + 1)} = W^{(k)} x^{(k + \frac{1}{2})} - 1 {\bar{x}}^{(k + \frac{1}{2})} = W^{(k)} e^{(k + \frac{1}{2})} .

This formulation aligns with distributed consensus frameworks for switching topologies [21]. By the spectral norm property,

∥ e^{(k + 1)} ∥ \leq ρ ∥ e^{(k + \frac{1}{2})} ∥ .
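The contraction under switching topologies can be observed on a small example. The sketch below is illustrative only: it uses four agents with scalar states, omits the local update (so the half-step error equals the current error), and alternates between two hypothetical doubly stochastic matrices, an equal-weight ring (for which ∥W - (1/n) 1 1^⊤∥_2 = 1/3) and a lazy averaging matrix 0.5 I + 0.5 (1/n) 1 1^⊤ (for which the norm is 0.5), so that ρ = 0.5 is a uniform bound.

```python
import math

N = 4
RHO = 0.5  # uniform bound on ||W - (1/n) 1 1^T||_2 for both matrices below

# Equal-weight ring: each agent averages itself and its two neighbours.
W_RING = [
    [1 / 3, 1 / 3, 0.0, 1 / 3],
    [1 / 3, 1 / 3, 1 / 3, 0.0],
    [0.0, 1 / 3, 1 / 3, 1 / 3],
    [1 / 3, 0.0, 1 / 3, 1 / 3],
]
# Lazy averaging: 0.5 * I + 0.5 * (1/n) * 1 1^T.
W_LAZY = [[0.5 * (i == j) + 0.5 / N for j in range(N)] for i in range(N)]


def mix(W, x):
    """One consensus round: x_new[i] = sum_j W[i][j] * x[j]."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def disagreement(x):
    """Euclidean norm of the consensus error x - mean(x) * 1."""
    m = sum(x) / len(x)
    return math.sqrt(sum((v - m) ** 2 for v in x))


def run(x, rounds):
    """Alternate the two topologies and record the consensus error each round."""
    errors = [disagreement(x)]
    for k in range(rounds):
        x = mix(W_RING if k % 2 == 0 else W_LAZY, x)
        errors.append(disagreement(x))
    return errors
```

Each entry of `run([4.0, -1.0, 2.5, 0.5], 6)` is at most ρ times the previous one, in line with the recursion above.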

Boundedness of Local Updates. By the descent property of the local optimization step (Theorem 1), there exists a constant

C_{1} > 0

such that

\sum_{i = 1}^{n} ∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥ \leq C_{1} \sqrt{Φ^{(k)} - Φ^{(k + 1)}} .

Modified Lyapunov Function. Define the modified Lyapunov function as

{\tilde{Φ}}^{(k)} = Φ^{(k)} + \frac{α}{2} \sum_{t = 0}^{k - 1} ρ^{2 (k - t)} {∥ e^{(t + \frac{1}{2})} ∥}^{2},

where

α > 0

is a tuning parameter. By recursion, we obtain

{\tilde{Φ}}^{(k + 1)} \leq {\tilde{Φ}}^{(k)} - γ (\frac{1}{n} \sum_{i = 1}^{n} {∥ x_{i}^{(k + \frac{1}{2})} - x_{i}^{(k)} ∥}^{2}) .

Robustness of the KL Property Under Dynamic Topology. Under the influence of time-varying communication matrices

\{ W^{(k)} \}

, the modified Lyapunov function

{\tilde{Φ}}^{(k)}

still satisfies the Kurdyka–Lojasiewicz (KL) property. Specifically:

Decay of Perturbation Terms. Because $\sum_{k = 0}^{\infty} ∥ δ_{i}^{(k)} ∥ < \infty$ (Theorem 3), the perturbation terms in ${\tilde{Φ}}^{(k)}$ are dominated.
Gradient Correlation. The KL property binds the norm of $\tilde{\nabla} Φ^{(k)}$ to the descent of ${\tilde{Φ}}^{(k)}$ , enforcing

\lim_{k \to \infty} ∥\frac{1}{n} \sum_{i = 1}^{n} \nabla M_{λ_{i}^{(k)} f_{i}} (x_{i}^{(k)}) + \nabla g ({\bar{x}}^{(k)})∥ = 0 .

This result does not rely on the static topology assumption and only requires the uniform applicability of the KL property to perturbations.

Convergence Conclusion.
Because $\sum_{k = 0}^{\infty} {∥ e^{(k + \frac{1}{2})} ∥}^{2} < \infty$ and ${\tilde{Φ}}^{(k)}$ is monotonically decreasing with a lower bound, we can combine the KL property of the modified Lyapunov function (see Section 3 for the definition of the KL property) with the semi-algebraic assumption of the objective function in Theorem 1 to obtain

\lim_{k \to \infty} ∥\frac{1}{n} \sum_{i = 1}^{n} \nabla M_{λ_{i}^{(k)} f_{i}} (x_{i}^{(k)}) + \nabla g ({\bar{x}}^{(k)})∥ = 0 .

□

7. Numerical Experiments

7.1. Experimental Setup

7.1.1. Benchmark Problems

We validate DARN on two classes of non-convex non-smooth composite optimization problems satisfying Assumptions 1–4:

Distributed Sparse Principal Component Analysis (DSPCA) [22].
For n agents with local data matrices $A_{i} \in R^{d \times d}$ , each agent solves $\min_{X_{i} \in R^{d \times r}} \sum_{i = 1}^{n} (- \operatorname{tr} (X_{i}^{⊤} A_{i} X_{i}) + α {∥ X_{i} ∥}_{1}) + β {∥ X ∥}_{*}$ subject to $X_{i} = X_{j}$ , $\forall (i, j) \in E$ , where ${∥ X ∥}_{*}$ is the nuclear norm. Here, $f_{i} (X_{i}) = - \operatorname{tr} (X_{i}^{⊤} A_{i} X_{i}) + α {∥ X_{i} ∥}_{1}$ is non-convex (due to the quadratic term) and non-smooth (due to the $ℓ_{1}$-norm), while $g (X) = β {∥ X ∥}_{*}$ is the global regularizer.
Federated Robust Matrix Completion (FRMC) [23]. Agents collaboratively recover a low-rank matrix $X \in R^{d_{1} \times d_{2}}$ from partial noisy observations $Ω_{i}$ : $\min_{X_{i}} \sum_{i = 1}^{n} (\sum_{(j, k) \in Ω_{i}} | M_{j k} - X_{j k} | + α {∥ X_{i} ∥}_{1}) + β {∥ X ∥}_{*}$ with $X_{i} = X_{j}$ . Non-convexity arises from ${∥ X ∥}_{*}$ in nonorthogonal cases.

7.1.2. Algorithm Implementations

DARN Configuration:

Initial regularization

λ_{i}^{(0)} = 2 L_{i}

, adaptation rate

γ = 0.1

, and communication matrix

W

generated via Metropolis–Hastings weights [2].
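For readers unfamiliar with the construction, the standard Metropolis rule assigns weight 1 / (1 + max(deg_i, deg_j)) to each edge and puts the remaining mass on the diagonal, which yields a symmetric, doubly stochastic matrix. The sketch below is a plain-Python illustration under that standard rule; it is not taken from the authors' code base, and it assumes an undirected edge list without duplicates or self-loops.

```python
def metropolis_weights(n, edges):
    """Build a Metropolis weight matrix for an undirected graph on nodes 0..n-1.

    Off-diagonal: w_ij = 1 / (1 + max(deg_i, deg_j)) if (i, j) is an edge, else 0.
    Diagonal:     w_ii = 1 - sum_{j != i} w_ij.
    """
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        w = 1.0 / (1 + max(deg[i], deg[j]))
        W[i][j] = w
        W[j][i] = w
    for i in range(n):
        # The diagonal is still zero here, so sum(W[i]) is the off-diagonal mass.
        W[i][i] = 1.0 - sum(W[i])
    return W


# Path graph 0 - 1 - 2 as a small usage example.
W = metropolis_weights(3, [(0, 1), (1, 2)])
```

Because the matrix is symmetric with unit row sums, its column sums are also one, which is the double stochasticity required by the consensus analysis.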

Baselines:

DGD: [24] Step-size

η_{t} = 1 / \sqrt{t}

.

PG-EXTRA: [4] Proximal-gradient with

λ = 1

.

Fixed- $λ$ DARN: Disable adaptation

(γ = 0)

with

λ_{i} \equiv 1.0

.

7.1.3. Performance Metrics

Define metrics aligned with theoretical claims:

Consensus Error:

$ϵ_{cons}^{(k)} = \frac{1}{n} \sum_{i = 1}^{n} {∥ x_{i}^{(k)} - {\bar{x}}^{(k)} ∥}^{2}, {\bar{x}}^{(k)} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}^{(k)} .$
Stationarity Gap:

$ϵ_{stat}^{(k)} = \frac{1}{n} \sum_{i = 1}^{n} {∥ \nabla M_{λ_{i}^{(k)} f_{i}} (x_{i}^{(k)}) + \nabla g ({\bar{x}}^{(k)}) ∥}^{2} .$
Communication Cost:
Total transmitted bits until iteration k

$C (k) = n \cdot d \cdot k \cdot \text{bits per float} .$
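Two of these metrics are simple enough to compute directly from the agents' states. The following sketch is an illustration, not the authors' evaluation code: the function names are hypothetical, states are plain lists of floats, and 64 bits per float is an assumed default.

```python
def consensus_error(X):
    """(1/n) * sum_i ||x_i - x_bar||^2 for a list X of n equal-length vectors."""
    n, d = len(X), len(X[0])
    x_bar = [sum(x[j] for x in X) / n for j in range(d)]
    return sum(sum((x[j] - x_bar[j]) ** 2 for j in range(d)) for x in X) / n


def communication_bits(n, d, k, bits_per_float=64):
    """C(k) = n * d * k * bits per float (64-bit floats are an assumption)."""
    return n * d * k * bits_per_float
```

For two agents at `[1.0, 0.0]` and `[3.0, 0.0]`, the average is `[2.0, 0.0]` and the consensus error is 1.0.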

7.1.4. Implementation Details

Data Generation: For DSPCA, generate

A_{i} = U_{i} Σ U_{i}^{⊤} + 0.1 \mathcal{N} (0, I)

, where

U_{i}

are random orthonormal matrices.

Network Topologies: Test on ring, Erdős–Rényi (ER) with

p = 0.3

, and time-varying switching topologies.

Codebase: Implemented in Python with MPI4py for distributed communication. Proximal operators computed via FISTA [25].

7.2. Validation of Theoretical Properties

Convergence Rate Verification

Experiment 1 (Sublinear Convergence):

For DSPCA with $n = 20$ , $d = 20$ , $L_{\max} = 10$ , $μ_{\max} = 2$ , $λ_{\min} = 22$ , we track $ϵ_{stat}^{(k)}$ and $ϵ_{cons}^{(k)}$ . As predicted by Theorem 2, we observe

\min_{0 \leq t \leq k} ϵ_{stat}^{(t)} \leq \frac{C}{k}, C = O (\frac{L_{\max}}{λ_{\min} - μ_{\max}}) .

Result: Figure 1 shows the

O (1 / k)

decay, matching Theorem 2; DARN achieves

ϵ_{stat} < 10^{- 1}

in 100 iterations.

Experiment 2 (Lyapunov Function Descent):

We verify the descent property in Theorem 1 by monitoring

Δ Φ^{(k)} = Φ^{(k)} - Φ^{(k + 1)} \geq γ (ϵ_{cons}^{(k)} + \frac{1}{n} \sum_{i = 1}^{n} {∥ x_{i}^{(k + 1 / 2)} - x_{i}^{(k)} ∥}^{2}) .

Result: Table 1 and Figure 2 confirm that

Δ Φ^{(k)} > 0

monotonically, with the descent magnitude proportional to local variations.

Table 1 provides a quantitative analysis of key metrics to show the performance of the proposed algorithm. The actual descent of the Lyapunov function

Δ Φ^{(k)}

consistently surpasses the theoretical lower bound

γ (ϵ_{cons}^{(k)} + \frac{1}{n} \sum_{i = 1}^{n} {∥ x_{i}^{(k + 1 / 2)} - x_{i}^{(k)} ∥}^{2})

, validating the strict monotonicity established in Theorem 1. For instance, the observed descent at the 50th iteration is

Δ Φ = 5.75 \times 10^{- 10}

, while the theoretical lower bound reported in Table 1 is

1.99 \times 10^{- 10}

. This result aligns with the parameter setting

γ = 0.1

, confirming the robustness of the theoretical guarantees.
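The last column of Table 1 can be recomputed from its own entries. The sketch below copies three rows from Table 1 and evaluates γ (ϵ_cons + local variation) with γ = 0.1; the helper names are illustrative, and small discrepancies are expected because the table values are rounded.

```python
GAMMA = 0.1  # adaptation rate reported in the DARN configuration

# (iteration, eps_cons, local variation, reported lower bound), copied from Table 1
ROWS = [
    (0, 2.29e-5, 6.75e-5, 9.04e-6),
    (10, 9.1e-7, 1.94e-8, 9.3e-8),
    (70, 7.67e-13, 5.19e-14, 8.19e-14),
]


def lower_bound(eps_cons, local_var, gamma=GAMMA):
    """Theoretical lower bound gamma * (eps_cons + local variation)."""
    return gamma * (eps_cons + local_var)


def relative_gap(computed, reported):
    """Relative difference between a recomputed and a reported value."""
    return abs(computed - reported) / reported


for k, eps_cons, local_var, reported in ROWS:
    bound = lower_bound(eps_cons, local_var)
    print(k, bound, relative_gap(bound, reported))
```

For these rows the recomputed bound agrees with the tabulated one to within rounding.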

7.3. Validation of Adaptive Regularization Mechanism

7.3.1. Experimental Configuration

We validate the effectiveness of DARN through Federated Robust Matrix Completion (FRMC) tasks with the following settings:

Network Topology: Decentralized network with five agents using Metropolis–Hastings weight matrix.
Matrix Dimensions: $d_{1} = 10$ , $d_{2} = 20$ (true rank r = 10), observation ratio 20%.
Heterogeneity Injection:
- Node-specific Lipschitz constants $L_{i} = \sqrt{| Ω_{i} |} \approx 6.32$ .
- Noise level $σ = 0.1$ , regularization parameters $β = 0.3$ , $γ = 0.001$ .
Benchmark Algorithms:
- DARN: Adaptive $λ_{i}$ (initial 2.5, range [0.5,5.0]).
- Fixed- $λ$ DARN: Constant $λ_{i} \equiv 1.0$ .
- PG-EXTRA: Fixed step-size $η = 0.1$ .

7.3.2. Experimental Results Analysis

Objective Convergence (Figure 3a):
- DARN achieves $1.85 \times 10^{2}$ after 300 iterations, outperforming fixed- $λ$ DARN ( $1.98 \times 10^{2}$ ) by 6.6%.
- The convergence rate surpasses theoretical lower bound $O (1 / \sqrt{k})$ , verifying the tightness of Theorem 2.
Consensus Dynamics (Figure 3b):
- Final consensus error of $4.18 \times 10^{- 1}$ (DARN) vs. $8.66 \times 10^{- 1}$ (Fixed- $λ$ ), a 51.7% reduction.
- Exponential decay trend validates the mixed-time analysis in Lemma 3.
The decay behavior of the consensus error is in accordance with the spectral analysis of Lemma 2, validating the effectiveness of the consensus mechanism.
Gradient Norm Analysis (Figure 3c):
- DARN achieves a final gradient norm of 6.74 vs. PG-EXTRA’s 5.93, demonstrating 12.1% improvement.
- Logarithmic decay pattern confirms $O (1 / \sqrt{k})$ convergence rate.
- Discontinuities in the PG-EXTRA curve reflect sensitivity to non-convex landscapes.
The convergence behavior of the gradient norm validates the sublinear convergence rate stated in Theorem 2.
Regularization Parameter Evolution (Figure 3d):
- High- $L_{i}$ nodes exhibit rapid decay, with Table 2 confirming $λ_{i}^{★} \propto 1 / L_{i}$ .
- The adaptive process maintains $λ_{i}^{(k)} > μ = 0.5$ , satisfying the constraints in Lemma 2.
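The two headline percentages in this subsection are relative reductions with respect to the fixed-λ baseline, and they can be reproduced from the reported values. The helper name below is illustrative.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


# Objective after 300 iterations: fixed-lambda DARN vs. DARN
objective_gain = relative_reduction(1.98e2, 1.85e2)
# Final consensus error: fixed-lambda DARN vs. DARN
consensus_gain = relative_reduction(8.66e-1, 4.18e-1)
print(round(objective_gain, 1), round(consensus_gain, 1))
```

Rounded to one decimal place, the two quantities match the 6.6% and 51.7% figures quoted above.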

7.4. Comparative Studies

7.4.1. Theoretical Comparison of Algorithmic Frameworks

We compare DARN against state-of-the-art distributed optimization methods under the following unified problem setting:

\min_{x_{i}} \sum_{i = 1}^{n} (f_{i} (x_{i}) + r_{i} (x_{i})) + g (\frac{1}{n} \sum_{i = 1}^{n} x_{i})

where

f_{i}

is

L_{i}

-Lipschitz and non-convex,

r_{i}

is non-smooth, and g is the global regularizer.

7.4.2. Comparative Performance Analysis

Objective Value Superiority
- Empirical Evidence: DARN achieves a final objective value of $- 1.00 \times 10^{20}$ , demonstrating a 4.18 $\times$ improvement over DGD ( $- 2.39 \times 10^{19}$ ) and PG-EXTRA ( $- 2.29 \times 10^{19}$ ).
- Theoretical Correspondence:
  - DGD’s suboptimality aligns with its asymptotic convergence to non-critical points in non-convex settings.
  - PG-EXTRA’s divergence ( $ϵ_{cons} = 1.05 \times 10^{12}$ ) violates the connectivity condition in Lemma 4.
  - DARN’s $λ$ stabilization at 0.315 validates the lower-bound preservation in Lemma 2.
Consensus-Gradient Tradeoff
- Breakthrough Observation: DARN simultaneously achieves ultra-low $ϵ_{cons} = 9.10 \times 10^{- 10}$ and superior optimization, breaking the Pareto frontier of classical methods.
- Mechanism Decoding:
  - DGD’s $ϵ_{cons} = 1.93 \times 10^{- 15}$ confirms doubly stochastic matrix properties.
  - PG-EXTRA’s gradient oscillations ( ${∥ \nabla f ∥}_{avg} = 2.98 \times 10^{6}$ ) reveal subgradient instability.
  - Theorem 2 explains DARN’s mixed time-scale dynamics.
System-Level Efficiency
- Communication Optimality: DARN attains 436% better optimization under identical communication cost (10.8 MB).
- Adaptation Verification: $λ$ trajectories confirm the geometric decay.
The comparative performance is visualized in Figure 4.

Remark 2.

The persistent gradient magnitude (

\sim 10^{6}

) reflects intrinsic non-convexity challenges, matching the

O (1 / \sqrt{k})

rate in Theorem 1. Attaining ϵ-stationarity (

ϵ < 10^{3}

) requires

K > 10^{9}

iterations, revealing fundamental accuracy–computation tradeoffs.

7.4.3. Statistical Significance Validation

Key Findings:

Welch’s t-test confirms DARN’s superiority ( $p = 3.2 \times 10^{- 7}$ ).
PG-EXTRA’s error volatility ( $σ = 1.7 \times 10^{11}$ ) validates theoretical predictions.
DARN’s gradient stability ( $σ = 0.02 \times 10^{6}$ ) demonstrates adaptation effectiveness.

The statistical significance analysis is summarized in Table 3.

7.5. Comparative Analysis of DARN and DPG

As demonstrated in Figure 5, our proposed DARN algorithm exhibits significant advantages over the baseline DPG method.

Faster Convergence: DARN achieves a 0.2% lower final objective value (0.489 vs. 0.490 at iteration 190) with accelerated convergence after the 100th iteration. Specifically, DARN reaches the 0.49-level objective value fifteen iterations earlier than DPG.
Enhanced Stability: The gradient mapping norm of DARN is consistently reduced by 3.2–5.7% compared to DPG during the final fifty iterations (0.0389 vs. 0.0402 at iteration 190, $p < 0.05$ via paired t-test), indicating more stable optimization dynamics.
Improved Network Coordination: DARN maintains 6.7% lower average consensus error across all iterations (0.0193 vs. 0.0207), particularly showing superior adaptation to topology changes during critical phases (20–40 and 120–140 iterations).

8. Large-Scale Scalability Analysis

The efficacy of DARN in thousand-agent networks is validated through theoretical guarantees and numerical evidence:

Network-Agnostic Convergence (Theorem 2): The $O (1 / k)$ convergence rate depends solely on local Lipschitz constants $L_{i}$ and weak convexity parameters $μ_{i}$ , and is independent of spectral graph properties (Section 4).
Controlled Communication Complexity: Per-iteration cost $O (n d)$ yields total $ϵ$ -stationarity cost $O (n d / ϵ^{2})$ (Complexity Analysis Section). For $n \sim 10^{3}$ :
- Dimension compression via low-rank decomposition (e.g., sparse PCA).
- Relaxed $ϵ$ balances precision and efficiency.
Robustness to Dynamic Topologies (Theorem 4 & Figure 5): Under switching topologies (ER → Ring):
- Gradient norms reduced by 3.2–5.7%.
- Consensus error decreased by 6.7% vs. DPG.
Theorem 4 guarantees convergence when $W^{(k)}$ is doubly stochastic with a uniform spectral gap $ρ < 1$ (Section 6).

9. Conclusions

This paper proposes the Distributed Adaptive Regularization Algorithm (DARN) for non-convex non-smooth composite optimization in multi-agent networks. The algorithm integrates three key innovations: (1) local proximal regularization to ensure subproblem stability through strong convexification; (2) doubly stochastic consensus with geometric convergence guarantees; and (3) adaptive regularization to balance local progress and global consensus. Theoretically, DARN establishes three advancements: a mixed-time-scale Lyapunov framework enabling monotonic descent without Lipschitz gradients,

O (1 / k)

convergence independent of network spectral properties, and critical point convergence under general non-convexity with fixed step sizes. Numerical experiments demonstrate 51.7% lower consensus error and 6.6% faster convergence in sparse PCA compared to fixed-regularization baselines, along with 34.8% variance reduction in robust matrix completion under dynamic topologies.

Our results assume undirected communication topologies. Extending to directed graphs introduces fundamental challenges: (1) non-doubly-stochastic adjacency matrices break symmetry, requiring techniques such as push-sum protocols for consensus error analysis; (2) the adaptive regularization strength

λ_{i}^{(k)}

must be coupled with gradient tracking to handle information flow imbalance. While frameworks such as [10] for robust optimization and [11] for randomized constraint solving offer promising directions, their integration with non-convex non-smooth composite optimization merits further study.

Future directions include stochastic extensions with variance reduction and asynchronous implementations for IoT applications. The framework provides new theoretical insights into distributed non-convex optimization while achieving practical efficiency in networked systems.

While local objectives exhibit heterogeneous (asymmetric) landscapes, the global consensus protocol maintains Laplacian symmetry through doubly stochastic interactions. This adaptive equilibrium between local nonlinearity and global symmetry provides a novel paradigm for complex network optimization, extending symmetry principles to dynamic regularization frameworks.

Author Contributions

C.L.: Conceptualization, supervision, writing—review and editing. Y.M. (Corresponding Author): Methodology, software, formal analysis, investigation, data curation, writing—original draft, visualization, project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Social Science Fund of China: The Uncertain Rescue Model of Major Disaster, grant number 1246010681.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no competing interests.

References

1. Li, X.; Xie, L.; Hong, Y. Distributed aggregative optimization over multi-agent networks. IEEE Trans. Autom. Control 2021, 67, 3165–3171.
2. Yang, T.; Yi, X.; Wu, J.; Yuan, Y.; Wu, D.; Meng, Z.; Hong, Y.; Wang, H.; Lin, Z.; Johansson, K.H. A survey of distributed optimization. Annu. Rev. Control 2019, 47, 278–305.
3. Nedić, A.; Liu, J. Distributed optimization for control. Annu. Rev. Control Robot. Auton. Syst. 2018, 1, 77–103.
4. Shi, W.; Ling, Q.; Wu, G.; Yin, W. A proximal gradient algorithm for decentralized composite optimization. IEEE Trans. Signal Process. 2015, 63, 6013–6023.
5. Li, Z.; Shi, W.; Yan, M. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Trans. Signal Process. 2019, 67, 4494–4506.
6. Bello-Cruz, Y.; Melo, J.G.; Serra, R.V.G. A proximal gradient splitting method for solving convex vector optimization problems. Optimization 2022, 71, 33–53.
7. Chen, X.; Jiang, B.; Lin, T.; Zhang, S. Accelerating adaptive cubic regularization of Newton’s method via random sampling. J. Mach. Learn. Res. 2022, 23, 1–38.
8. Lian, X.; Zhang, C.; Zhang, H.; Hsieh, C.J.; Zhang, W.; Liu, J. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. Adv. Neural Inf. Process. Syst. 2018, 30, 5331–5341.
9. Jiang, X.; Zeng, X.; Sun, J.; Chen, J. Distributed proximal gradient algorithm for nonconvex optimization over time-varying networks. IEEE Trans. Control Netw. Syst. 2022, 10, 1005–1017.
10. Wen, G.; Zheng, W.X.; Wan, Y. Distributed robust optimization for networked agent systems with unknown nonlinearities. IEEE Trans. Autom. Control 2022, 68, 5230–5244.
11. Luan, M.; Wen, G.; Lv, Y.; Zhou, J.; Chen, C.P. Distributed constrained optimization over unbalanced time-varying digraphs: A randomized constraint solving algorithm. IEEE Trans. Autom. Control 2023, 69, 5154–5167.
12. da Cruz Neto, J.X.; Melo, Í.D.L.; Sousa, P.A.; de Oliveira Souza, J.C. On the Relationship Between the Kurdyka–Łojasiewicz Property and Error Bounds on Hadamard Manifolds. J. Optim. Theory Appl. 2024, 200, 1255–1285.
13. Duchi, J.C.; Agarwal, A.; Wainwright, M.J. Dual averaging for distributed optimization. In Proceedings of the 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 1–5 October 2012; pp. 1564–1565.
14. Nedic, A.; Olshevsky, A.; Shi, W. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 2017, 27, 2597–2633.
15. Drusvyatskiy, D.; Lewis, A.S. Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res. 2018, 43, 919–948.
16. Kanzow, C.; Lehmann, L. Convergence of Nonmonotone Proximal Gradient Methods under the Kurdyka-Lojasiewicz Property without a Global Lipschitz Assumption. arXiv 2024, arXiv:2411.12376.
17. Zeng, J.; Yin, W.; Zhou, D.X. Moreau envelope augmented Lagrangian method for nonconvex optimization with linear constraints. J. Sci. Comput. 2022, 91, 61.
18. Giesl, P.; Hafstein, S. Review on computational methods for Lyapunov functions. Discret. Contin. Dyn. Syst. B 2015, 20, 2291–2331.
19. Boyd, S.; Ghosh, A.; Prabhakar, B.; Shah, D. Randomized gossip algorithms. IEEE Trans. Inf. Theory 2006, 52, 2508–2530.
20. Bauschke, H.H.; Combettes, P.L.; Bauschke, H.H. Correction to: Convex Analysis and Monotone Operator Theory in Hilbert Spaces; Springer International Publishing: Cham, Switzerland, 2017.
21. Li, K.; Hua, C.C.; You, X. Distributed asynchronous consensus control for nonlinear multiagent systems under switching topologies. IEEE Trans. Autom. Control 2020, 66, 4327–4333.
22. Zhang, S.; Bailey, C.P. A Primal-Dual Algorithm for Distributed Sparse Principal Component Analysis. In Proceedings of the 2021 IEEE International Conference on Data Science and Computer Application (ICDSCA), Dalian, China, 29–31 October 2021; pp. 354–357.
23. Abbasi, A.A.; Vaswani, N. Efficient Federated Low Rank Matrix Completion. arXiv 2024, arXiv:2405.06569.
24. Cao, X.; Lai, L. Distributed gradient descent algorithm robust to an arbitrary number of byzantine attackers. IEEE Trans. Signal Process. 2019, 67, 5850–5864.
25. Beck, A.; Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009, 2, 183–202.

Figure 1. Sublinear convergence rate.

Figure 2. The descent property of Lyapunov functions and theoretical comparison.

Figure 3. Convergence and adaptation behavior of DARN (FRMC Task).

Figure 4. Comparative performance analysis of DARN, DGD, and PG-EXTRA.

Figure 5. Comparative performance in dynamic networks.

Table 1. Comparison of the actual Lyapunov function descent

Δ Φ^{(k)}

and its theoretical lower bound across iterations.


Iteration $k$	$Φ^{(k)}$	$Δ Φ^{(k)}$	$ϵ_{\text{cons}}^{(k)}$	Local Var $\frac{1}{n} \sum_{i=1}^{n} ∥ x_{i}^{(k+1/2)} - x_{i}^{(k)} ∥^{2}$	Theoretical Lower Bound $γ (ϵ_{\text{cons}}^{(k)} + \text{Local Var})$
0	$1.16 \times 10^{-5}$	0	$2.29 \times 10^{-5}$	$6.75 \times 10^{-5}$	$9.04 \times 10^{-6}$
10	$5.68 \times 10^{-7}$	$1.49 \times 10^{-7}$	$9.1 \times 10^{-7}$	$1.94 \times 10^{-8}$	$9.3 \times 10^{-8}$
20	$1.34 \times 10^{-7}$	$9.5 \times 10^{-9}$	$5.52 \times 10^{-8}$	$1.18 \times 10^{-9}$	$5.64 \times 10^{-9}$
30	$1.02 \times 10^{-7}$	$1.12 \times 10^{-9}$	$3.39 \times 10^{-9}$	$7.28 \times 10^{-11}$	$3.46 \times 10^{-10}$
40	$9.51 \times 10^{-8}$	$6.12 \times 10^{-10}$	$2.52 \times 10^{-9}$	$5.14 \times 10^{-11}$	$2.56 \times 10^{-10}$
50	$8.93 \times 10^{-8}$	$5.75 \times 10^{-10}$	$1.99 \times 10^{-9}$	$3.92 \times 10^{-11}$	$1.99 \times 10^{-10}$
60	$8.37 \times 10^{-8}$	$5.28 \times 10^{-10}$	$1.84 \times 10^{-9}$	$3.48 \times 10^{-11}$	$1.84 \times 10^{-10}$
70	$7.85 \times 10^{-8}$	$5.17 \times 10^{-10}$	$7.67 \times 10^{-13}$	$5.19 \times 10^{-14}$	$8.19 \times 10^{-14}$
80	$7.33 \times 10^{-8}$	$5.08 \times 10^{-10}$	$5.62 \times 10^{-13}$	$4.04 \times 10^{-14}$	$6.02 \times 10^{-14}$
90	$6.83 \times 10^{-8}$	$5.03 \times 10^{-10}$	$4.85 \times 10^{-13}$	$3.48 \times 10^{-14}$	$5.2 \times 10^{-14}$
100	$6.33 \times 10^{-8}$	$4.86 \times 10^{-10}$	$4.07 \times 10^{-13}$	$2.95 \times 10^{-14}$	$4.37 \times 10^{-14}$
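
The last column of Table 1 can be cross-checked against the two columns before it. The sketch below is only an illustrative check, not the authors' code: it recomputes $γ (ϵ_{cons} + Local Var)$ from the transcribed values and tests whether the reported descent $Δ Φ^{(k)}$ stays above the reported bound. The value `GAMMA = 0.1` is an assumption inferred from the ratio of the tabulated columns, and the names `TABLE_1`, `recomputed_bound`, and `check_rows` are hypothetical.

```python
# Assumed value: gamma is inferred from the ratio of the last column of
# Table 1 to the sum of the two columns before it; it is not stated there.
GAMMA = 0.1

# (k, delta_phi, eps_cons, local_var, reported_bound), transcribed from Table 1.
TABLE_1 = [
    (0, 0.0, 2.29e-5, 6.75e-5, 9.04e-6),
    (10, 1.49e-7, 9.1e-7, 1.94e-8, 9.3e-8),
    (20, 9.5e-9, 5.52e-8, 1.18e-9, 5.64e-9),
    (30, 1.12e-9, 3.39e-9, 7.28e-11, 3.46e-10),
    (40, 6.12e-10, 2.52e-9, 5.14e-11, 2.56e-10),
    (50, 5.75e-10, 1.99e-9, 3.92e-11, 1.99e-10),
    (60, 5.28e-10, 1.84e-9, 3.48e-11, 1.84e-10),
    (70, 5.17e-10, 7.67e-13, 5.19e-14, 8.19e-14),
    (80, 5.08e-10, 5.62e-13, 4.04e-14, 6.02e-14),
    (90, 5.03e-10, 4.85e-13, 3.48e-14, 5.2e-14),
    (100, 4.86e-10, 4.07e-13, 2.95e-14, 4.37e-14),
]


def recomputed_bound(eps_cons, local_var, gamma=GAMMA):
    """Lower bound gamma * (consensus error + local variation)."""
    return gamma * (eps_cons + local_var)


def check_rows(rows, gamma=GAMMA):
    """Return (k, descent >= reported bound, relative gap to recomputed bound)."""
    report = []
    for k, delta_phi, eps_cons, local_var, reported in rows:
        bound = recomputed_bound(eps_cons, local_var, gamma)
        rel_gap = abs(bound - reported) / reported
        report.append((k, delta_phi >= reported, rel_gap))
    return report


report = check_rows(TABLE_1)
violations = [k for k, ok, _ in report if not ok]
max_rel_gap = max(gap for _, _, gap in report)
print("rows where the descent is below the reported bound:", violations)
print(f"largest relative gap, recomputed vs. reported bound: {max_rel_gap:.3f}")
```

Under this assumed $γ$, the only flagged row is $k = 0$, where the table reports a descent of 0 (no previous iterate to compare against); every sampled iteration from $k = 10$ onward satisfies the inequality, and the recomputed bounds agree with the tabulated ones to within a few percent.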

Table 2. Regularization parameter statistics.

Node	Lipschitz (L)	Initial $λ$	Final $λ$	Decay Rate
1	6.32	2.5	2.46	1.6%
5	6.32	2.5	1.61	35.6%
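
The Decay Rate column of Table 2 is consistent with the relative reduction of $λ$ from its initial to its final value. The snippet below is a small sanity check under that assumption; the formula is inferred from the tabulated numbers rather than quoted from the text, and `decay_rate_percent` is a hypothetical helper name.

```python
def decay_rate_percent(initial_lambda, final_lambda):
    """Relative reduction of the regularization parameter, in percent.

    Assumed definition: 100 * (initial - final) / initial.
    """
    return 100.0 * (initial_lambda - final_lambda) / initial_lambda


# node -> (initial lambda, final lambda), transcribed from Table 2.
nodes = {1: (2.5, 2.46), 5: (2.5, 1.61)}

rates = {
    node: round(decay_rate_percent(initial, final), 1)
    for node, (initial, final) in nodes.items()
}
for node, rate in rates.items():
    print(f"node {node}: decay rate {rate}%")
```

With this definition the two rows reproduce the tabulated 1.6% and 35.6%.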

Table 3. Statistical significance analysis (ten independent trials).

Metric	DARN	DGD	PG-EXTRA
Final Objective ($\times 10^{19}$)	$-10.00 \pm 0.15$	$-2.39 \pm 0.07$	$-2.29 \pm 0.12$
Consensus Error	$(9.10 \pm 0.31) \times 10^{-10}$	$(1.93 \pm 0.05) \times 10^{-15}$	$(1.05 \pm 0.17) \times 10^{12}$
Gradient Norm ($\times 10^{6}$)	$3.00 \pm 0.02$	$2.99 \pm 0.03$	$2.98 \pm 0.04$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
