Article

An Inexact Nonsmooth Quadratic Regularization Algorithm

by
Anliang Wang
,
Xiangmei Wang
* and
Chunfang Liao
School of Mathematics and Statistics, Guizhou University, Guiyang 550025, China
*
Author to whom correspondence should be addressed.
Axioms 2025, 14(8), 604; https://doi.org/10.3390/axioms14080604
Submission received: 4 June 2025 / Revised: 5 July 2025 / Accepted: 6 July 2025 / Published: 4 August 2025

Abstract

The quadratic regularization technique is widely used in the literature for constructing efficient algorithms, particularly for solving nonsmooth optimization problems. We propose an inexact nonsmooth quadratic regularization algorithm for large-scale optimization problems whose objective is the sum of a large-scale smooth separable term and a nonsmooth term. The main difference between our algorithm and the (exact) quadratic regularization algorithm is that it employs inexact gradients instead of the full gradients of the smooth term. A slightly different update rule for the regularization parameters is also adopted for easier implementation. Under certain assumptions, we prove that the algorithm reaches a first-order approximate critical point of the problem and that its iteration complexity is $O(\varepsilon^{-2})$. Finally, we apply the algorithm to LASSO problems. The numerical results show that the inexact algorithm is more efficient than the corresponding exact one in large-scale cases.

1. Introduction

Consider the following nonsmooth optimization problem:
$$\min_{x\in\mathbb{R}^d} f(x) + h(x), \tag{1}$$
where $f:\mathbb{R}^d\to\mathbb{R}$ is a continuously differentiable function and $h:\mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}$ is a lower semi-continuous proper function (both f and h can be nonconvex). When $h\equiv 0$, the problem reduces to a smooth optimization problem. The trust-region-based algorithm is one of the classical methods for solving this problem, and it has attracted great attention in the literature since its introduction. Comprehensive and systematic introductions can be found, for example, in the book [1], and two novel variants, including regularization versions of the algorithm, in [2]. By introducing a second-order nonlinear step-size control into the algorithm, Grapiglia et al. [3] proved that the iteration complexity of the trust-region algorithm is $O(\varepsilon^{-3})$ $(0<\varepsilon<1)$. Bergou et al. [4] used scaled norms for the trust-region algorithm and a cubic adaptive regularization algorithm, and by introducing a line search into the algorithm, the iteration complexity was improved to $O(\varepsilon^{-3/2})$. In the case where $f\equiv 0$ in problem (1), nonsmooth trust-region algorithms and nonsmooth regularization methods (based on general directional derivatives) have also been proposed and studied; see, e.g., [5,6,7]. For more smooth/nonsmooth methods for solving these two kinds of problems, the readers are referred to [8,9,10] and references therein. In the composite case of problem (1) where f and h are convex, Lee et al. [11] proposed a proximal Newton method for solving the problem and established its global convergence. Kim et al. [12] introduced a nonsmooth trust-region algorithm for the problem, which employs the full gradient of f in each subproblem. For the more general case where $h:=g\circ c$, with g convex and c continuously differentiable, Cartis et al. [13] proposed both a nonsmooth trust-region algorithm and quadratic regularization variants for solving the problem. The iteration complexities of these algorithms were proven to be $O(\varepsilon^{-2})$.
For more generalized and improved nonsmooth trust-region-based algorithms for solving such problems, see, e.g., [14,15,16]. It is worth mentioning that proximal gradient methods are also a commonly used class of algorithms for this problem; they have been extensively studied and applied to problems such as (group) sparse optimization and constrained optimization. For more details, refer to [17,18,19,20,21] and their references.
We focus on the more general composite problem (1) (where both f and h can be nonconvex). Li et al. [22] proposed a monotone proximal gradient algorithm for solving the problem; its sublinear convergence properties are established under the Kurdyka-Łojasiewicz condition. In particular, we note the work of Aravkin et al. [23], where a nonsmooth trust-region algorithm and nonsmooth quadratic regularization algorithms were proposed and studied. Under certain conditions, they proved that the iteration complexities of both algorithms are $O(\varepsilon^{-2})$. It should be noted that the aforementioned works all use the full gradient $\nabla f$ of the smooth term f. To solve the following large-scale optimization problems, we shall propose an algorithm that uses inexact gradients of f instead of the full gradients. Specifically, we consider the following large-scale separable nonsmooth optimization problem:
$$\min_{x\in\mathbb{R}^d} f(x) + h(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) + h(x), \tag{2}$$
where each component function $f_i:\mathbb{R}^d\to\mathbb{R}$ is continuously differentiable ($i=1,2,\dots,n$), $h:\mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}$ is a proper and lower semi-continuous function, and $n\gg 1$ (i.e., n is very large). Such large-scale problems have broad applications in areas like machine learning, signal processing, and image analysis and restoration; see, e.g., [24,25,26,27]. To reduce the computational complexity and memory usage per iteration, inexact techniques are applied in algorithm design. For example, incremental algorithms process data incrementally, typically one sample or a small batch at a time; see, e.g., [24]. Stochastic algorithms update estimates sequentially using stochastic batches of data; see, e.g., [25,26,27] and references therein. For solving problem (2), we propose an inexact nonsmooth quadratic regularization algorithm, which employs inexact gradients of f in the subproblems and also accounts for the nonsmooth function h (note that the inexact gradient of f includes the sub-sampled gradient $\frac{1}{|I_g|}\sum_{i\in I_g}\nabla f_i$ of f as a special case, where $I_g$ is a sample collection from $I:=\{1,2,\dots,n\}$; see Remark 4 for more details). The algorithm is actually an inexact version of the algorithm in ([23], Algorithm 6.1). The main difference between our algorithm and ([23], Algorithm 6.1) is that we employ inexact gradients of f, while the latter uses full gradients of f. Moreover, we adopt a different stopping criterion and slightly different update rules for the regularization parameters; see details in Remarks 1 and 2. Under certain conditions, we show that the iteration complexity of the algorithm is $O(\varepsilon^{-2})$, which is similar to the result for the exact algorithm established in [23]. Finally, the algorithm is applied to LASSO problems, and numerical experiments show that in large-scale cases, the inexact algorithm is generally more efficient than the corresponding exact one.
The paper is organized as follows. The next section provides some necessary notions, notation and results. The main results are presented in Section 3, including the inexact nonsmooth quadratic regularization algorithm and the convergence analysis. Some numerical comparisons are illustrated in Section 4 for solving LASSO problems. The conclusion is given afterwards.
In addition, some useful notations are displayed in Table 1.

2. Preliminaries

We gather some notation, notions, and results which will be used in this paper. For more details, the readers are referred to [1,23,28].
Let $\mathbb{R}$ denote the set of all real numbers and $\mathbb{N}$ the set of all non-negative integers. Let $\|\cdot\|$ and $\langle\cdot,\cdot\rangle$ denote the $\ell_2$-norm and the inner product in the Euclidean space $\mathbb{R}^d$, respectively. The closed ball centered at 0 with radius $\delta>0$ is denoted by $B(0,\delta)$; that is, $B(0,\delta):=\{x\in\mathbb{R}^d:\|x\|\le\delta\}$. Let $x\in\mathbb{R}^d$, and let $A\subset\mathbb{R}^d$ be a nonempty subset. The distance from x to A is denoted by $\operatorname{dist}(x;A)$, defined by $\operatorname{dist}(x;A):=\inf_{a\in A}\|a-x\|$. Let $h:\mathbb{R}^d\to\bar{\mathbb{R}}:=\mathbb{R}\cup\{+\infty\}$ be a lower semi-continuous proper function. As usual, $\operatorname{dom}h$ stands for the domain of h, that is,
$$\operatorname{dom}h:=\{x\in\mathbb{R}^d:h(x)<+\infty\}.$$
The function h is said to be convex if the following inequality holds for any $x,y\in\operatorname{dom}h$:
$$h(\lambda x+(1-\lambda)y)\le\lambda h(x)+(1-\lambda)h(y)\quad\forall\lambda\in(0,1).$$
Furthermore, h is said to be strongly convex with modulus $c>0$ if for any $x,y\in\operatorname{dom}h$, the following holds:
$$h(\lambda x+(1-\lambda)y)\le\lambda h(x)+(1-\lambda)h(y)-\frac{1}{2}c\lambda(1-\lambda)\|x-y\|^2\quad\forall\lambda\in(0,1).$$
For $\alpha\in\mathbb{R}$, we use $L_\alpha(h)$ to stand for the $\alpha$-level set of h, that is,
$$L_\alpha(h):=\{x\in\mathbb{R}^d:h(x)\le\alpha\}.$$
If $L_\alpha(h)$ is bounded for every $\alpha\in\mathbb{R}$, then h is said to be level-bounded. When a lower semi-continuous proper function h is level-bounded, the set $\arg\min_{x\in\mathbb{R}^d}h(x):=\{z\in\mathbb{R}^d:h(z)=\min_{x\in\mathbb{R}^d}h(x)\}$ is nonempty and compact; see, e.g., ([28], Theorem 1.9).
The following concepts of the Moreau envelope and the proximal mapping of h can be found in ([23], Definition 2.1).
Definition 1 (Moreau Envelope and Proximal Mapping).
Let $x\in\mathbb{R}^d$ and $\lambda>0$. The Moreau envelope $e_{\lambda h}(x)$ and the proximal mapping $\operatorname{prox}_{\lambda h}(x)$ of the function h at x are, respectively, defined by
$$e_{\lambda h}(x):=\inf_{\omega}\Big\{\frac{1}{2\lambda}\|\omega-x\|^2+h(\omega)\Big\}=\lambda^{-1}\inf_{\omega}\Big\{\frac{1}{2}\|\omega-x\|^2+\lambda h(\omega)\Big\}\tag{3}$$
and
$$\operatorname{prox}_{\lambda h}(x):=\arg\min_{\omega}\Big\{\frac{1}{2\lambda}\|\omega-x\|^2+h(\omega)\Big\}=\arg\min_{\omega}\Big\{\frac{1}{2}\|\omega-x\|^2+\lambda h(\omega)\Big\}.\tag{4}$$
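To make the definitions concrete: for the $\ell_1$-norm $h=\mu\|\cdot\|_1$, which appears later in the LASSO experiments, the proximal mapping has the closed-form componentwise soft-thresholding solution, and the Moreau envelope can be evaluated at the prox point. The following minimal sketch uses helper names of our own choosing, not taken from any library:

```python
def prox_l1(x, lam):
    """Componentwise prox of lam * ||.||_1 (soft thresholding)."""
    return [(abs(v) - lam) * (1 if v > 0 else -1) if abs(v) > lam else 0.0
            for v in x]

def moreau_l1(x, lam):
    """Moreau envelope e_{lam h}(x) for h = ||.||_1, evaluated via the prox point."""
    w = prox_l1(x, lam)
    quad = sum((wi - xi) ** 2 for wi, xi in zip(w, x)) / (2.0 * lam)
    return quad + sum(abs(wi) for wi in w)
```

For instance, `prox_l1([3.0, -0.5], 1.0)` returns `[2.0, 0.0]`, and $e_{1\cdot h}(3)=\tfrac{1}{2}(2-3)^2+|2|=2.5$.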
For a general lower semi-continuous proper function h, the set $\operatorname{prox}_{\lambda h}(x)$ can be empty or contain multiple elements. The range of parameter values for which the Moreau envelope of h admits a finite value is given by the following definition, taken from ([28], Definition 1.23).
Definition 2 (Prox-boundedness).
The function h is said to be prox-bounded if there exist a parameter $\lambda>0$ and at least one point $x\in\mathbb{R}^d$ such that $e_{\lambda h}(x)>-\infty$. The threshold of prox-boundedness $\lambda_h$ of h is the supremum of all such $\lambda>0$.
It can be shown that the threshold of prox-boundedness of h is independent of the point x in the definition. In the special case when h is level-bounded, we have $\lambda_h=+\infty$. A prox-bounded function exhibits the following properties ([28], Theorem 1.25).
Proposition 1.
Let $h:\mathbb{R}^d\to\bar{\mathbb{R}}$ be a prox-bounded function with prox-boundedness threshold $\lambda_h>0$. Then, for any $\lambda\in(0,\lambda_h)$ and $x\in\mathbb{R}^d$, the following properties hold:
(i) $\operatorname{prox}_{\lambda h}(x)$ is a nonempty and compact set;
(ii) the function $(\lambda,x)\mapsto e_{\lambda h}(x)$ is continuous, and as λ decreases monotonically to 0, $e_{\lambda h}(x)$ increases monotonically to $h(x)$.
We end this section with some optimality conditions for the nonsmooth optimization
$$\min_{x\in\mathbb{R}^d}\phi(x),\tag{5}$$
where $\phi:\mathbb{R}^d\to\bar{\mathbb{R}}$ is a lower semi-continuous proper function; this material can be found in [23] or [28]. To this end, we first recall the following notions of the subgradient and subdifferential.
Definition 3 (Subgradient and Subdifferential).
Let $\bar{x}\in\operatorname{dom}\phi$. A vector $v\in\mathbb{R}^d$ is said to be a regular subgradient of ϕ at the point $\bar{x}$ if
$$\liminf_{x\to\bar{x},\,x\neq\bar{x}}\frac{\phi(x)-\phi(\bar{x})-v^T(x-\bar{x})}{\|x-\bar{x}\|}\ge 0.$$
The set of all regular subgradients of ϕ at $\bar{x}$ is denoted by $\hat{\partial}\phi(\bar{x})$ and is called the Fréchet subdifferential of ϕ at $\bar{x}$. Moreover, if there exist sequences $\{x_k\}\subset\mathbb{R}^d$ and $\{v_k\}$ with $v_k\in\hat{\partial}\phi(x_k)$ satisfying $x_k\to\bar{x}$, $\phi(x_k)\to\phi(\bar{x})$ and $v_k\to v$, then the vector v is called a general subgradient of ϕ at $\bar{x}$. The set of all general subgradients of ϕ at $\bar{x}$ is denoted by $\partial\phi(\bar{x})$ and is called the limiting subdifferential of ϕ at $\bar{x}$.
It is clear that $\hat{\partial}\phi(\bar{x})\subset\partial\phi(\bar{x})$ for any $\bar{x}\in\operatorname{dom}\phi$. When ϕ is convex, the Fréchet and limiting subdifferentials coincide with the subdifferential of convex analysis. Let $\phi:=f+h$, where $f:\mathbb{R}^d\to\mathbb{R}$ is a continuously differentiable function, and let $\bar{x}\in\operatorname{dom}\phi$. Then, $\partial\phi(\bar{x})=\nabla f(\bar{x})+\partial h(\bar{x})$. The following necessary optimality condition for (5) is well known.
Proposition 2.
Let $\bar{x}\in\mathbb{R}^d$ be a local minimizer of ϕ. Then, $0\in\hat{\partial}\phi(\bar{x})\subset\partial\phi(\bar{x})$. If ϕ is convex, then $\bar{x}$ is a global minimizer of ϕ.
A point $\bar{x}\in\mathbb{R}^d$ is called a critical point of ϕ if $0\in\partial\phi(\bar{x})$. In the iteration complexity analysis below, we need the following definition of approximate critical points, where $\operatorname{dist}(\cdot;S):=\inf_{y\in S}\|y-\cdot\|$ is the distance associated with a set $S\subset\mathbb{R}^d$.
Definition 4.
Let $\varepsilon>0$. A point $\bar{x}\in\operatorname{dom}\phi$ is said to be an ε-approximate critical point of ϕ if $\operatorname{dist}(0;\partial\phi(\bar{x}))\le\varepsilon$.

3. Inexact Nonsmooth Quadratic Regularization Algorithm

We consider the following large-scale separable nonsmooth optimization problem:
$$\min_{x\in\mathbb{R}^d} f(x)+h(x):=\frac{1}{n}\sum_{i=1}^{n}f_i(x)+h(x),\tag{6}$$
where $n\gg 1$ (n is very large), each component function $f_i:\mathbb{R}^d\to\mathbb{R}$ is continuously differentiable ($i=1,2,\dots,n$), and $h:\mathbb{R}^d\to\bar{\mathbb{R}}$ is a lower semi-continuous proper function. Since n is very large, computing the (full) gradient of f becomes prohibitively expensive. To reduce the computational cost per iteration, we propose an inexact nonsmooth quadratic regularization algorithm (i.e., Algorithm 1) for solving problem (6), which employs approximate gradients of f instead of the full gradients in the subproblems. The algorithm is actually an inexact version of the (exact) quadratic regularization algorithm proposed in ([23], Algorithm 6.1) for solving nonsmooth optimization problems. In the k-th iteration, we construct an approximate model $\varphi(\cdot;x_k)+\psi(\cdot;x_k)$ for $f(x_k+\cdot)+h(x_k+\cdot)$ at the iterate $x_k$ and then solve a quadratic regularization subproblem:
$$\min_{s}\; m(s;x_k,\sigma_k):=\varphi(s;x_k)+\psi(s;x_k)+\frac{1}{2}\sigma_k\|s\|^2,\tag{7}$$
where $\sigma_k>0$ is the regularization parameter.
We always make the following blanket assumption on the model.
Assumption 1 (Model Assumption).
For any x R d ,
(i) $\varphi(\cdot;x)$ is a linear function, defined by
$$\varphi(s;x)=f(x)+g(x)^Ts\quad\forall s\in\mathbb{R}^d,\tag{8}$$
where $g(x)$ is an approximation of $\nabla f(x)$;
(ii) $\psi(\cdot;x):\mathbb{R}^d\to\bar{\mathbb{R}}$ is a lower semi-continuous proper function satisfying $\psi(0;x)=h(x)$ and $\partial\psi(\cdot;x)(0)=\partial h(x)$. Furthermore, $\psi(\cdot;x)$ is uniformly prox-bounded; that is, there exists $\lambda_\psi\in(0,+\infty)$ such that $\lambda_\psi<\lambda_{\psi(\cdot;y)}$ for any $y\in\mathbb{R}^d$.
Remark 1.
We note that Assumption 1 is similar to the corresponding assumption made in [23] for the (exact) nonsmooth quadratic regularization algorithm ([23], Algorithm 6.1), except that the latter uses the full gradient of f in (8), that is, $g(x)=\nabla f(x)$. This means that both our algorithm (see Algorithm 1 below) and ([23], Algorithm 6.1) adopt a linear approximation φ of f in the quadratic regularization subproblem. The main differences lie in the following.
  • The first-order term of φ: Algorithm 1 uses the inexact gradient g as stated in (8), while ([23], Algorithm 6.1) uses the full gradient $\nabla f$, that is,
$$\varphi(\cdot;x)=f(x)+\nabla f(x)^T(\cdot)\quad\text{for any }x\in\mathbb{R}^d.$$
    We adopt a different tolerance that is easier to verify, which in turn requires a slightly modified update rule for the regularization parameters in Algorithm 1. These are specified as follows; see more explanations in Remark 2.
  • Different stopping criterion: the stopping criterion of Algorithm 1 is $\|s_k\|<\varepsilon$ (see Step 5 of Algorithm 1), while the stopping criterion of ([23], Algorithm 6.1) is
$$f(x_k)+h(x_k)-p(\sigma_{\max};x_k)<\varepsilon^2.$$
  • Different update rule for $\{\sigma_k\}$: Algorithm 1 uses a parameter $\sigma_{\min}$ with $\sigma_{\min}^{-1}\in(0,\lambda_\psi)$ to ensure that all regularization parameters $\{\sigma_k\}$ have a positive lower bound, while ([23], Algorithm 6.1) does not use such a bound.
Below, we show that one can use proximal mappings to solve the quadratic regularization subproblems of Algorithm 1. To proceed, let $k\in\mathbb{N}$. Recalling the definition of subproblem (7) in the k-th iteration, we set
$$p(\sigma_k;x_k):=\min_{s} m(s;x_k,\sigma_k),\tag{9}$$
$$P(\sigma_k;x_k):=\arg\min_{s} m(s;x_k,\sigma_k).\tag{10}$$
In view of (7) and (8), we can write
$$m(s;x_k,\sigma_k)=f(x_k)+g(x_k)^Ts+\psi(s;x_k)+\frac{1}{2}\sigma_k\|s\|^2=\frac{1}{2}\sigma_k\big\|s+\sigma_k^{-1}g(x_k)\big\|^2+\psi(s;x_k)+f(x_k)-\frac{1}{2}\sigma_k^{-1}\|g(x_k)\|^2.\tag{11}$$
Recalling (3) and (4), it follows from (9) and (10) that
$$p(\sigma_k;x_k)=e_{\sigma_k^{-1}\psi(\cdot;x_k)}\big(-\sigma_k^{-1}g(x_k)\big)+f(x_k)-\frac{1}{2}\sigma_k^{-1}\|g(x_k)\|^2\tag{12}$$
and
$$P(\sigma_k;x_k)=\arg\min_{s}\Big\{\frac{1}{2}\sigma_k\big\|s+\sigma_k^{-1}g(x_k)\big\|^2+\psi(s;x_k)\Big\}=\operatorname{prox}_{\sigma_k^{-1}\psi(\cdot;x_k)}\big(-\sigma_k^{-1}g(x_k)\big).\tag{13}$$
Noting that $\varphi(\cdot;x)$ is a linear function, it is easy to verify that $\lambda_{\psi(\cdot;x)}$ is also the threshold of prox-boundedness of $\varphi(\cdot;x)+\psi(\cdot;x)$. Thus, the following result is immediate from Proposition 1(i).
Proposition 3.
Let $x\in\mathbb{R}^d$ and $\sigma^{-1}\in(0,\lambda_{\psi(\cdot;x)})$. Then, $P(\sigma;x)$ is a nonempty and compact set.
By Assumption 1 and (13), the following decrease property, as for the proximal gradient method, is guaranteed.
Lemma 1.
Let $x\in\operatorname{dom}h$ and $\sigma^{-1}\in(0,\lambda_{\psi(\cdot;x)})$. Then, we have
$$f(x)+h(x)-\big(\varphi(s;x)+\psi(s;x)\big)\ge\frac{1}{2}\sigma\|s\|^2\quad\forall s\in P(\sigma;x).\tag{14}$$
Below, we present the inexact nonsmooth quadratic regularization algorithm for solving problem (6), where the parameter $\sigma_{\min}$ satisfies $\sigma_{\min}^{-1}\in(0,\lambda_\psi)$.
Algorithm 1 Inexact Nonsmooth Quadratic Regularization Algorithm
1: Input: choose $0<\sigma_{\min}<\sigma_0$, $0<\eta_1\le\eta_2<1$, $0<\gamma_3\le 1<\gamma_1\le\gamma_2$, $\varepsilon>0$.
2: Choose an initial point $x_0\in\operatorname{dom}h$ and compute $f(x_0)+h(x_0)$.
3: for $k=0,1,\dots$ do
4:  Solve the subproblem (7) to obtain $s_k\in P(\sigma_k;x_k)$.
5:  If $\|s_k\|<\varepsilon$, then stop and output $x_k$.
6:  Compute
$$\rho_k:=\frac{f(x_k)+h(x_k)-\big(f(x_k+s_k)+h(x_k+s_k)\big)}{\varphi(0;x_k)+\psi(0;x_k)-\big(\varphi(s_k;x_k)+\psi(s_k;x_k)\big)}.$$
7:  Set $x_{k+1}=x_k+s_k$ if $\rho_k\ge\eta_1$, and $x_{k+1}=x_k$ otherwise.
8:  Update the regularization parameter
$$\sigma_{k+1}\in\begin{cases}[\max\{\sigma_{\min},\gamma_3\sigma_k\},\sigma_k]&\text{if }\rho_k\ge\eta_2\ (\text{very successful iteration}),\\[2pt] [\sigma_k,\gamma_1\sigma_k]&\text{if }\eta_1\le\rho_k<\eta_2\ (\text{successful iteration}),\\[2pt] [\gamma_1\sigma_k,\gamma_2\sigma_k]&\text{if }\rho_k<\eta_1\ (\text{unsuccessful iteration}).\end{cases}$$
9:  Set $k=k+1$ and go to Step 3.
10: end for
Remark 2.
Compared with ([23], Algorithm 6.1), we adopt a slightly different update rule for the regularization parameters at very successful iterations. This is because we use a different tolerance, $\|s_k\|<\varepsilon$, in Algorithm 1, which is easier to verify than the tolerance $\xi(\sigma_{\max};x_k):=f(x_k)+h(x_k)-p(\sigma_{\max};x_k)<\varepsilon^2$ used in [23]. This makes the modified update rule necessary when we study the iteration complexity of Algorithm 1 in what follows. In addition, we note that, just as for ([23], Algorithm 6.1), there may be iterations of Algorithm 1 where $\sigma_k^{-1}\ge\lambda_{\psi(\cdot;x_k)}$. At such iterations, $\sigma_k$ is increased, and after a finite number of such increases, $\sigma_k^{-1}<\lambda_{\psi(\cdot;x_k)}$ holds.
Similar to the (exact) quadratic regularization algorithm in ([23], Algorithm 6.1), we always make the following step assumption.
Assumption 2 (Step Assumption).
There exists $\kappa_m>0$ such that for all k,
$$\big|f(x_k+s_k)+h(x_k+s_k)-\varphi(s_k;x_k)-\psi(s_k;x_k)\big|\le\kappa_m\|s_k\|^2.\tag{15}$$
Remark 3.
Let $k\in\mathbb{N}$, and let $\psi(s;x_k)$ be defined as $\psi(s;x_k):=h(x_k+s)$. If $\nabla f$ is Lipschitz continuous with constant $L>0$ and there exists a constant $C>0$ such that for all k,
$$\|\nabla f(x_k)-g(x_k)\|\le C\|s_k\|,$$
then Assumption 2 is satisfied. In fact, in this case,
$$\big|f(x_k+s_k)+h(x_k+s_k)-\varphi(s_k;x_k)-\psi(s_k;x_k)\big|=\big|f(x_k+s_k)-\varphi(s_k;x_k)\big|.$$
By the mean value theorem, there exists $\bar{x}_{k,1}$ on the line segment connecting $x_k+s_k$ and $x_k$ such that
$$f(x_k+s_k)=f(x_k)+\nabla f(\bar{x}_{k,1})^Ts_k.\tag{16}$$
From the definition of φ and (16), we have
$$\big|f(x_k+s_k)-\varphi(s_k;x_k)\big|\le\|\nabla f(\bar{x}_{k,1})-\nabla f(x_k)\|\cdot\|s_k\|+\|\nabla f(x_k)-g(x_k)\|\cdot\|s_k\|\le(L+C)\|s_k\|^2.$$
Thus, Assumption 2 holds with $\kappa_m:=L+C$. Additionally, when $h=g\circ c$, where $c:\mathbb{R}^d\to\mathbb{R}^l$ has a Lipschitz continuous Jacobian and $g:\mathbb{R}^l\to\mathbb{R}$ is Lipschitz continuous, the condition $\psi(s;x_k)=h(x_k+s)$ above can be replaced by $\psi(s;x_k)=g\big(c(x_k)+\nabla c(x_k)^Ts\big)$, and Assumption 2 is still satisfied.
Regarding the sequence of regularization parameters $\{\sigma_k\}$, the following important property holds, where
$$\sigma_{\mathrm{succ}}:=\max\Big\{\frac{1}{\lambda_\psi},\,\frac{2\kappa_m}{1-\eta_2}\Big\}>0.$$
Lemma 2.
If Algorithm 1 does not terminate at the k-th iteration and the regularization parameter satisfies $\sigma_k\ge\sigma_{\mathrm{succ}}$, then the k-th iteration is very successful, and $\sigma_{k+1}\le\sigma_k$.
Proof of Lemma 2.
Let $\sigma_k\ge\sigma_{\mathrm{succ}}$, and let $x_k$ and $s_k$ be the corresponding iterate and step, respectively. Since Algorithm 1 does not terminate at the k-th iteration, we have $s_k\neq 0$. Combining Assumption 2 and (14), we obtain
$$|\rho_k-1|=\frac{\big|f(x_k+s_k)+h(x_k+s_k)-\big(\varphi(s_k;x_k)+\psi(s_k;x_k)\big)\big|}{\varphi(0;x_k)+\psi(0;x_k)-\big(\varphi(s_k;x_k)+\psi(s_k;x_k)\big)}\le\frac{\kappa_m\|s_k\|^2}{\frac{1}{2}\sigma_k\|s_k\|^2}=\frac{2\kappa_m}{\sigma_k}.$$
From the definition of $\sigma_{\mathrm{succ}}$ and $\sigma_k\ge\sigma_{\mathrm{succ}}$, we have $|\rho_k-1|\le 1-\eta_2$, and hence $\rho_k\ge\eta_2$. By the definition of Algorithm 1, the k-th iteration is very successful, and $\sigma_{k+1}\le\sigma_k$. □
Lemma 2 and the definition of Algorithm 1 indicate that the regularization parameters $\{\sigma_k\}$ generated by the algorithm have positive lower and upper bounds, i.e.,
$$\sigma_{\min}\le\sigma_k\le\sigma_{\max}\quad\forall k\in\mathbb{N},$$
where $\sigma_{\max}:=\max\{\sigma_0,\gamma_2\sigma_{\mathrm{succ}}\}$. Next, we show the iteration complexity of Algorithm 1. To this end, let $k(\varepsilon)$ represent the number of iterations needed for the termination condition $\|s_k\|<\varepsilon$ to be satisfied. Define the sets of all successful iterations and all unsuccessful iterations as
$$S(\varepsilon):=\{k<k(\varepsilon):\rho_k\ge\eta_1\}\quad\text{and}\quad U(\varepsilon):=\{k<k(\varepsilon):\rho_k<\eta_1\}.$$
Let $|J|$ stand for the cardinality of an index subset $J\subset\mathbb{N}$. The following lemma gives estimates of the cardinalities of $S(\varepsilon)$ and $U(\varepsilon)$, respectively. We note that the estimation method is similar to the complexity analysis of the classical regularization algorithm by Cartis et al. [13] and of the nonsmooth quadratic regularization algorithm in [23].
Lemma 3.
Suppose that $f(x_k)+h(x_k)\ge(f+h)_{\mathrm{low}}$ for all $k\in\mathbb{N}$. Then, the following estimates hold:
$$|S(\varepsilon)|\le\frac{2\big((f+h)(x_0)-(f+h)_{\mathrm{low}}\big)}{\eta_1\sigma_{\min}\varepsilon^2}=O(\varepsilon^{-2})\tag{18}$$
and
$$|U(\varepsilon)|\le\log_{\gamma_1}(\sigma_{\max}/\sigma_0)+|S(\varepsilon)|\,\big|\log_{\gamma_1}\gamma_3\big|=O(\varepsilon^{-2}).\tag{19}$$
Proof of Lemma 3.
Let $k\in S(\varepsilon)$. By the definition of Algorithm 1 and (14), it holds that
$$f(x_k)+h(x_k)-f(x_k+s_k)-h(x_k+s_k)\ge\eta_1\Big(\varphi(0;x_k)+\psi(0;x_k)-\big(\varphi(s_k;x_k)+\psi(s_k;x_k)\big)\Big)\ge\frac{1}{2}\eta_1\sigma_k\|s_k\|^2\ge\frac{\sigma_{\min}}{2}\eta_1\varepsilon^2.$$
Summing these inequalities over all $k\in S(\varepsilon)$, we get
$$\sum_{k\in S(\varepsilon)}\big((f+h)(x_k)-(f+h)(x_{k+1})\big)\ge|S(\varepsilon)|\,\frac{\sigma_{\min}}{2}\eta_1\varepsilon^2.$$
Noting that the sequence $\{f(x_k)+h(x_k)\}$ is decreasing and bounded from below by $(f+h)_{\mathrm{low}}$, it follows that
$$(f+h)(x_0)-(f+h)_{\mathrm{low}}\ge|S(\varepsilon)|\,\frac{\sigma_{\min}}{2}\eta_1\varepsilon^2,$$
which implies (18). Note that each unsuccessful iteration increases the regularization parameter by at least a factor of $\gamma_1$, while each successful iteration reduces it by at most a factor of $\gamma_3$ ($0<\gamma_3\le 1<\gamma_1$). Therefore, we have
$$\sigma_{\max}\ge\sigma_{k(\varepsilon)-1}\ge\sigma_0\gamma_1^{|U(\varepsilon)|}\gamma_3^{|S(\varepsilon)|},$$
and so
$$\log_{\gamma_1}(\sigma_{\max}/\sigma_0)\ge|U(\varepsilon)|+|S(\varepsilon)|\log_{\gamma_1}\gamma_3.$$
Thus, (19) is valid, completing the proof. □
The following iteration complexity for Algorithm 1 is immediate from the above lemma.
Theorem 1.
Under the assumptions made in Lemma 3, we have $k(\varepsilon)=O(\varepsilon^{-2})$.
The following proposition provides some conditions to ensure that Algorithm 1 terminates at an approximate stationary point.
Proposition 4.
Suppose that Algorithm 1 terminates at the k-th iteration, that is, $\|s_k\|<\varepsilon$. Suppose further that $\nabla f$ is Lipschitz continuous with constant $L>0$, $\psi(\cdot;x_k):=h(x_k+\cdot)$, and
$$\|\nabla f(x_k)-g(x_k)\|\le c_k\|s_k\|$$
holds with some $c_k>0$. Then, $x_k+s_k$ is a $(\sigma_k+c_k+L)\varepsilon$-approximate critical point of $f+h$.
Proof of Proposition 4.
By assumption and the definition of $s_k$, we have
$$0\in g(x_k)+\partial\psi(\cdot;x_k)(s_k)+\sigma_ks_k.$$
Thus,
$$-\sigma_ks_k+\big(\nabla f(x_k+s_k)-g(x_k)\big)\in\nabla f(x_k+s_k)+\partial\psi(\cdot;x_k)(s_k)=\nabla f(x_k+s_k)+\partial h(x_k+s_k)=\partial(f+h)(x_k+s_k).$$
Noting that
$$\big\|-\sigma_ks_k+\big(\nabla f(x_k+s_k)-g(x_k)\big)\big\|\le\sigma_k\|s_k\|+\|\nabla f(x_k+s_k)-g(x_k)\|$$
and
$$\|\nabla f(x_k+s_k)-g(x_k)\|\le\|\nabla f(x_k+s_k)-\nabla f(x_k)\|+\|\nabla f(x_k)-g(x_k)\|\le(L+c_k)\|s_k\|,$$
we get that
$$\operatorname{dist}\big(0;\partial(f+h)(x_k+s_k)\big)<(\sigma_k+c_k+L)\varepsilon.$$
By definition, this implies that $x_k+s_k$ is a $(\sigma_k+c_k+L)\varepsilon$-approximate critical point of $f+h$. The proof is complete. □
Remark 4.
Noting that $f:=\frac{1}{n}\sum_{i=1}^{n}f_i$ is separable, one can adopt the sub-sampled inexact gradient $g:=\frac{1}{|I_g|}\sum_{i\in I_g}\nabla f_i$, where $I_g$ is a sample collection drawn with or without replacement from $I:=\{1,2,\dots,n\}$. There are methods that provide sufficient sample sizes to guarantee that the sub-sampled inexact gradient g approximates $\nabla f$ in a probabilistic sense; for more details, the readers are referred to, for example, ([29], Lemma 3). In the next section, we always adopt sub-sampled inexact gradients in implementing Algorithm 1.
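The sub-sampling just described can be sketched as follows; the helper name and interface are ours (the component gradients are passed in as callables), and the 10% default batch fraction matches the experiments of the next section:

```python
import random

def subsampled_grad(component_grads, x, batch_frac=0.1, rng=random):
    """Sub-sampled inexact gradient (1/|I_g|) * sum_{i in I_g} grad f_i(x),
    with the batch I_g drawn without replacement from {0, ..., n-1}."""
    n = len(component_grads)
    m = max(1, round(batch_frac * n))
    batch = rng.sample(range(n), m)           # sampling without replacement
    return sum(component_grads[i](x) for i in batch) / m
```

When all component gradients agree, the estimate coincides with the full gradient regardless of the batch; in general, its accuracy is controlled by the sample size, e.g., in the probabilistic sense of ([29], Lemma 3).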

4. Implementation and Numerical Results

We consider the following LASSO problem:
$$\min_{x\in\mathbb{R}^d}\frac{1}{2n}\|Ax-b\|_2^2+\mu\|x\|_1=\frac{1}{2n}\sum_{i\in I}(A_ix-b_i)^2+\mu\|x\|_1,\tag{21}$$
where $A\in\mathbb{R}^{n\times d}$ is a given design matrix, $I:=\{1,2,\dots,n\}$, $\|\cdot\|_1$ is the $\ell_1$-norm in the Euclidean space $\mathbb{R}^d$, $A_i$ is the i-th row vector of the matrix A, and $b\in\mathbb{R}^n$ is the observed noisy data. This model aims to recover a sparse coefficient vector x and is popular in high-dimensional data analysis, sparse feature selection, dimensionality reduction, etc.; see, e.g., [23,30]. All numerical experiments are implemented in Python 3.11.7 on a Lenovo 90M2CTO1WW PC (Intel(R) Core(TM) i5-9500, 3.00 GHz, 8.00 GB (2667 MHz) RAM).
In particular, we generate a sparse vector $x_{\mathrm{true}}\in\mathbb{R}^d$ that contains mostly zeros and 10 entries equal to $\pm 1$, where both the indices of the nonzero entries and their signs are randomly generated. We generate the matrix $A\in\mathbb{R}^{n\times d}$ with i.i.d. (independent and identically distributed) random entries sampled from the standard normal distribution $N(0,1)$, and set $b:=Ax_{\mathrm{true}}+\epsilon$, where $\epsilon\sim N(0,0.01)$. The parameter μ of problem (21) is set to $\mu:=0.01$.
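The data-generation procedure just described can be sketched as follows (the function name, seed handling, and default sizes are ours; $N(0,0.01)$ denotes variance 0.01, i.e., standard deviation 0.1):

```python
import numpy as np

def make_lasso_data(n=1000, d=50, nnz=10, noise_std=0.1, seed=0):
    """Synthetic LASSO instance: x_true has nnz entries equal to +/-1 at
    random positions, A has i.i.d. N(0,1) entries, and b = A @ x_true + eps
    with eps ~ N(0, noise_std**2)."""
    rng = np.random.default_rng(seed)
    x_true = np.zeros(d)
    support = rng.choice(d, size=nnz, replace=False)   # random nonzero positions
    x_true[support] = rng.choice([-1.0, 1.0], size=nnz)
    A = rng.standard_normal((n, d))
    b = A @ x_true + noise_std * rng.standard_normal(n)
    return A, b, x_true
```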
We apply Algorithm 1 to solve problem (21), where the subproblem (7) is set as
$$\min_{s} m(s;x_k,\sigma_k):=f(x_k)+g(x_k)^Ts+\mu\|x_k+s\|_1+\frac{1}{2}\sigma_k\|s\|_2^2\quad\forall k\in\mathbb{N}$$
(here $f(\cdot):=\frac{1}{2n}\sum_{i=1}^{n}(A_i(\cdot)-b_i)^2$). Two versions of Algorithm 1 are considered as follows.
(1)
Algorithm 1 with inexact gradients, referred to as IG-QR for short. That is, $g(x_k):=\frac{1}{|I_g|}\sum_{i\in I_g}(A_ix_k-b_i)A_i^T$ in the subproblem (11) in the k-th iteration, where $I_g$ is an index subset randomly sampled from I without replacement. The sampling technique follows the stochastic gradient literature; see, for example, [25,26,31]. More specifically, we set the sampling ratio to 10%.
(2)
Algorithm 1 with full gradients, referred to as FG-QR for short. That is, $g(x_k):=\nabla f(x_k)=\frac{1}{|I|}\sum_{i\in I}(A_ix_k-b_i)A_i^T$ in the subproblem (11) in the k-th iteration.
In all implementations, the initial points $x_0$ are drawn from a standard normal distribution, denoted $x_0\sim N(0,1)$. The parameter values are set as follows: $\sigma_0:=10$, $\sigma_{\min}:=8.0$, $\eta_1:=0.25$, $\eta_2:=0.75$, $\gamma_1:=2.0$, $\gamma_2:=3.0$, $\gamma_3:=0.5$ and $\varepsilon:=0.001$.
Table 2 reports the numerical results of FG-QR and IG-QR across various dimensions. For the special case $n=150{,}000$, $d=500$, we performed 20 independent trials. In each trial, we regenerate the sparse vector $x_{\mathrm{true}}$, the data pair $(A,b)$ and the initial point $x_0$. As illustrated in Figure 1, we compare the objective function values of FG-QR and IG-QR versus CPU time across the 20 independent trials. Meanwhile, Figure 2 shows $x_{\mathrm{true}}$ alongside the coefficients recovered by FG-QR and IG-QR. We observe from the numerical results that FG-QR and IG-QR achieve comparable accuracy in recovering $x_{\mathrm{true}}$. Notably, although IG-QR requires more iterations, it outperforms FG-QR in terms of CPU time. This observation shows the superiority of IG-QR, especially for high-dimensional problems.

5. Conclusions

This paper proposes an inexact nonsmooth quadratic regularization algorithm based on approximate gradients. The primary objective is to reduce the computational cost per iteration when dealing with large-scale problems. We have shown that the iteration complexity of this algorithm is comparable to that of the corresponding algorithm using full gradients. When the approximate gradient meets certain conditions, an approximate first-order stationary point of the original problem can be achieved. To verify the efficiency of the algorithm in practical applications, we applied it to large-scale LASSO problems. Numerical results indicate that Algorithm 1 outperforms the exact algorithm using full gradients, especially when dealing with high-dimensional LASSO problems. Note that, in general, it is difficult to obtain an exact solution of the subproblem. Our future work will consider algorithms allowing inexact solutions of the subproblems as well as some specific computational issues. Additionally, we will consider cubic regularization methods and more general nonlinear step-size control approaches.

Author Contributions

Conceptualization, A.W. and X.W.; methodology, A.W. and X.W.; software, A.W. and C.L.; validation, A.W. and C.L.; formal analysis, A.W. and X.W.; investigation, A.W. and C.L.; resources, A.W.; data curation, A.W.; writing—original draft preparation, A.W.; writing—review and editing, X.W. and C.L.; visualization, X.W.; supervision, X.W. and C.L.; project administration, A.W., X.W. and C.L.; funding acquisition, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant number: 12161017), Guizhou Provincial Science and Technology Projects (grant number: ZK[2022]110). The APC was funded by X.W.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors are indebted to the handling editor and the referees.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  1. Conn, A.R.; Gould, N.I.M.; Toint, P.L. Trust Region Methods, 1st ed.; SIAM: Philadelphia, PA, USA, 2000; pp. 113–434. ISBN 978-0-89871-985-7. [Google Scholar]
  2. Toint, P.L. Nonlinear stepsize control, trust regions and regularizations for unconstrained optimization. Optim. Method Softw. 2013, 28, 82–95. [Google Scholar] [CrossRef]
  3. Grapiglia, G.N.; Yuan, J.Y.; Yuan, Y.X. Nonlinear stepsize control algorithms: Complexity bounds for first- and second-order optimality. J. Optim. Theory Appl. 2016, 171, 980–997. [Google Scholar] [CrossRef]
  4. Bergou, E.H.; Diouane, Y.; Gratton, S. A line-search algorithm inspired by the adaptive cubic regularization framework and complexity analysis. J. Optim. Theory Appl. 2018, 178, 885–913. [Google Scholar] [CrossRef]
  5. Dennis, J.E., Jr.; Li, S.B.B.; Tapia, R.A. A unified approach to global convergence of trust region methods for nonsmooth optimization. Math. Program. 1995, 68, 319–346. [Google Scholar] [CrossRef]
  6. Qi, L.Q.; Sun, J. A trust region algorithm for minimization of locally Lipschitzian functions. Math. Program. 1994, 66, 25–43. [Google Scholar] [CrossRef]
  7. Mashreghi, Z.; Nasri, M. Bregman distance regularization for nonsmooth and nonconvex optimization. Can. Math. Bull. 2024, 67, 415–424. [Google Scholar] [CrossRef]
  8. Wang, X.M. Subgradient algorithms on Riemannian manifolds of lower bounded curvatures. Optimization 2018, 67, 179–194. [Google Scholar] [CrossRef]
  9. Wang, J.H.; Wang, X.M.; Li, C.; Yao, J.C. Convergence analysis of gradient algorithms on Riemannian manifolds without curvature constraints and application to Riemannian Mass. SIAM J. Optim. 2021, 31, 172–199. [Google Scholar] [CrossRef]
  10. Sun, W.M.; Liu, H.W.; Liu, Z.X. A regularized limited memory subspace minimization conjugate gradient method for unconstrained optimization. Numer. Algorithms 2023, 94, 1919–1948. [Google Scholar] [CrossRef]
  11. Lee, J.D.; Sun, Y.K.; Saunders, M.A. Proximal Newton-type methods for minimizing composite functions. SIAM J. Optim. 2014, 24, 1420–1443. [Google Scholar] [CrossRef]
  12. Kim, D.; Sra, S.; Dhillon, I. A scalable trust-region algorithm with application to mixed-norm regression. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Haifa, Israel, 21–24 June 2010; pp. 519–526. [Google Scholar]
  13. Cartis, C.; Gould, N.I.M.; Toint, P.L. On the evaluation complexity of composite function minimization with applications to nonconvex nonlinear programming. SIAM J. Optim. 2011, 21, 1721–1739. [Google Scholar] [CrossRef]
  14. Chen, Z.A.; Milzarek, A.; Wen, Z.W. A trust-region method for nonsmooth nonconvex optimization. J. Comput. Math. 2023, 41, 683–716. [Google Scholar] [CrossRef]
  15. Liu, R.Y.; Pan, S.H.; Wu, Y.Q.; Yang, X.Q. An inexact regularized proximal Newton method for nonconvex and nonsmooth optimization. Comput. Optim. Appl. 2024, 88, 603–641. [Google Scholar] [CrossRef]
  16. Bolte, J.; Sabach, S.; Teboulle, M. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 2014, 146, 459–494. [Google Scholar] [CrossRef]
  17. Fukushima, M.; Mine, H. A generalized proximal point algorithm for certain non-convex minimization problems. Int. J. Syst. Sci. 1981, 12, 989–1000. [Google Scholar] [CrossRef]
  18. Tseng, P. On accelerated proximal gradient methods for convex-concave optimization. SIAM J. Optim. 2008, 2, 1–20. [Google Scholar]
  19. Tseng, P.; Yun, S. Block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. J. Optim. Theory Appl. 2009, 140, 513–535. [Google Scholar] [CrossRef]
  20. Nesterov, Y. Gradient methods for minimizing composite functions. Math. Program. 2013, 140, 125–161. [Google Scholar] [CrossRef]
  21. Wu, Q.Q.; Peng, D.T.; Zhang, X. Continuous exact relaxation and alternating proximal gradient algorithm for partial sparse and partial group sparse optimization problems. J. Sci. Comput. 2024, 100, 20. [Google Scholar] [CrossRef]
  22. Li, H.; Lin, Z.C. Accelerated proximal gradient methods for nonconvex programming. In Proceedings of the 29th International Conference on Neural Information Processing Systems (NIPS’15), Montreal, QC, Canada, 7–12 December 2015; pp. 379–387. [Google Scholar]
  23. Aravkin, A.Y.; Baraldi, R.; Orban, D. A proximal quasi-Newton trust-region method for nonsmooth regularized optimization. SIAM J. Optim. 2022, 32, 900–929. [Google Scholar] [CrossRef]
  24. Yang, D.; Wang, X.M. Incremental subgradient algorithms with dynamic step sizes for separable convex optimizations. Math. Meth. Appl. Sci. 2023, 46, 7108–7124. [Google Scholar] [CrossRef]
  25. Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Statist. 1951, 22, 400–407. [Google Scholar] [CrossRef]
  26. Franchini, G.; Porta, F.; Ruggiero, V.; Trombini, I.; Zanni, L. A stochastic gradient method with variance control and variable learning rate for Deep Learning. J. Comput. Appl. Math. 2024, 451, 116083. [Google Scholar] [CrossRef]
  27. Johnson, R.; Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’13), Lake Tahoe, NV, USA, 5–10 December 2013; pp. 315–323. [Google Scholar]
  28. Rockafellar, R.T.; Wets, R.J.B. Variational Analysis, 3rd ed.; Springer Science & Business Media: Berlin, Germany, 2009; pp. 1–74. ISBN 978-3-642-02431-3. [Google Scholar]
  29. Roosta-Khorasani, F.; Mahoney, M.W. Sub-sampled Newton methods. Math. Program. 2019, 174, 293–326. [Google Scholar] [CrossRef]
  30. Shen, H.L.; Peng, D.T.; Zhang, X. Smoothing composite proximal gradient algorithm for sparse group Lasso problems with nonsmooth loss functions. J. Appl. Math. Comput. 2024, 70, 1887–1913. [Google Scholar] [CrossRef]
  31. Yang, J.D.; Song, H.M.; Li, X.X.; Hou, D. Block Mirror Stochastic Gradient Method For Stochastic Optimization. J. Sci. Comput. 2023, 94, 69. [Google Scholar] [CrossRef]
Figure 1. Numerical performance of FG-QR and IG-QR with n = 150,000 and d = 500, evaluated across 20 independent trials.
Figure 2. The true coefficient vector x_true alongside the coefficients recovered by FG-QR and IG-QR in one trial.
Table 1. Notations and descriptions.

| Notation | Description |
| --- | --- |
| ∇f(x) | gradient of function f at point x |
| g(x) | approximation of ∇f(x) |
| ∂̂h(x) | Fréchet subdifferential of h at x |
| ∂h(x) | limiting subdifferential of h at x |
| e_{λh}(x) | Moreau envelope of h at x with parameter λ |
| prox_{λh}(x) | proximal mapping of h at x with parameter λ |
| λ_h^* | threshold of prox-boundedness of function h * |
| σ_k | regularization parameter in the k-th iteration |
| IG-QR | quadratic regularization algorithm employing inexact gradients |
| FG-QR | quadratic regularization algorithm employing full gradients |

* The definition of the threshold of prox-boundedness is given in Definition 2.
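For the nonsmooth term of the LASSO experiments, h = ‖·‖₁, the proximal mapping in Table 1 reduces to the classical soft-thresholding operator, and the Moreau envelope can be evaluated at that minimizer. A minimal sketch, using only the standard definitions (function names are illustrative, not from the paper):

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal mapping prox_{lam*h}(x) for h = ||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau_envelope_l1(x, lam):
    """Moreau envelope e_{lam*h}(x) = min_u ||u||_1 + ||u - x||^2 / (2*lam),
    evaluated at the minimizer u* = prox_{lam*h}(x)."""
    u = prox_l1(x, lam)
    return np.sum(np.abs(u)) + np.sum((u - x) ** 2) / (2.0 * lam)

x = np.array([3.0, -0.5, 1.0])
u = prox_l1(x, 1.0)               # components with |x_i| <= lam are set to 0
e = moreau_envelope_l1(x, 1.0)    # -> 3.125 here: 2 + (1 + 0.25 + 1)/2
```

Since the ℓ₁ proximal mapping is available in closed form, the regularization subproblems in both FG-QR and IG-QR can be solved exactly in this setting.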
Table 2. Numerical results across various dimensions, where O-v = f(x) + μ‖x‖₁, R-e = ‖x − x_true‖/‖x_true‖, I-k is the number of iterations, and T-s is the CPU time (in seconds).

| n | d | True O-v | FG-QR O-v | R-e | I-k | T-s | IG-QR O-v | R-e | I-k | T-s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100,000 | 200 | 0.105016 | 0.104550 | 0.011752 | 49 | 4.601011 | 0.104558 | 0.012024 | 51 | 2.642049 |
| 100,000 | 500 | 0.105025 | 0.104531 | 0.011712 | 48 | 12.862375 | 0.104545 | 0.011038 | 52 | 7.583251 |
| 100,000 | 800 | 0.104999 | 0.104523 | 0.012177 | 50 | 20.439099 | 0.104526 | 0.011822 | 57 | 13.037192 |
| 150,000 | 200 | 0.104985 | 0.104511 | 0.011667 | 49 | 5.919224 | 0.104521 | 0.011709 | 50 | 3.301263 |
| 150,000 | 500 | 0.105012 | 0.104526 | 0.012019 | 50 | 13.428264 | 0.104523 | 0.011853 | 52 | 8.217642 |
| 150,000 | 800 | 0.105001 | 0.104528 | 0.011814 | 50 | 25.827659 | 0.104539 | 0.011963 | 56 | 16.310059 |
| 200,000 | 200 | 0.105009 | 0.104548 | 0.011770 | 49 | 7.683134 | 0.104547 | 0.011395 | 50 | 4.764928 |
| 200,000 | 500 | 0.104975 | 0.104503 | 0.011600 | 49 | 26.579966 | 0.104510 | 0.011754 | 52 | 15.295921 |
| 200,000 | 800 | 0.105024 | 0.104552 | 0.012110 | 50 | 39.642553 | 0.104554 | 0.011776 | 55 | 24.811985 |
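The columns O-v and R-e in Table 2 can be computed directly from a recovered solution. A minimal sketch, assuming the smooth term is an averaged least-squares loss f(x) = ‖Ax − b‖²/(2n) (the exact form of f used in the experiments is given in the paper's numerical section; helper names are illustrative):

```python
import numpy as np

def lasso_metrics(A, b, x, x_true, mu):
    """O-v = f(x) + mu*||x||_1 and R-e = ||x - x_true|| / ||x_true||,
    with f an assumed averaged least-squares loss."""
    n = A.shape[0]
    f_val = np.sum((A @ x - b) ** 2) / (2.0 * n)   # assumed smooth term
    ov = f_val + mu * np.sum(np.abs(x))
    re = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
    return ov, re

# toy check: exact recovery gives R-e = 0, so O-v reduces to mu*||x||_1
A = np.eye(3)
b = np.array([1.0, 0.0, 2.0])
ov, re = lasso_metrics(A, b, b.copy(), b.copy(), mu=0.1)  # ov ≈ 0.3, re = 0.0
```

Reading Table 2 through these metrics, IG-QR reaches objective values and relative errors comparable to FG-QR in every configuration while consistently using less CPU time, which matches the paper's claim that the inexact variant is more efficient in large-scale cases.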
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, A.; Wang, X.; Liao, C. An Inexact Nonsmooth Quadratic Regularization Algorithm. Axioms 2025, 14, 604. https://doi.org/10.3390/axioms14080604

