1. Introduction
Let
be an infinite-dimensional real Hilbert space equipped with the standard inner product
and the induced norm
for
. We are mainly concerned with the following monotone inclusion:
where both
and
are maximal monotone operators, and the notation “double arrows” means that the mapping may be multi-valued. Throughout this article, we always assume that (
1) has at least one zero point. This problem model covers monotone variational inequalities, convex minimization, optimal control, and feature selection in deep learning, among others [
1,
2].
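For the reader's convenience, the display below records the presumed form of the inclusion (1) under the usual convention for Douglas–Rachford-type splitting; it is a reconstruction, since the original formula is not reproduced here.

```latex
% Presumed form of the monotone inclusion (1), standard for DR-type splitting
% (a reconstruction, not recovered verbatim from the source):
\[
  \text{find } x \in H \quad \text{such that} \quad 0 \in A(x) + B(x).
\]
```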
A popular iterative scheme is the following Douglas–Rachford (DR for short) splitting method (cf. Equation (
10) of [
3]). Choose
. Choose
. At the
k-th iteration, compute the following:
Denote the DR operator by the following:
Then (
2) becomes the following:
To improve the numerical efficiency of the DR splitting method, we may resort to its directed version [
4], written as follows:
where
and
further satisfy some assumptions (to be specified later), and
are called the directed factors. Obviously, in the case of
, it reduces to the DR splitting method above.
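To fix ideas, here is a minimal computational sketch of the directed update, assuming it takes the relaxed Krasnosel’skiĭ–Mann form $x^{k+1} = x^k + t_k\,(N(x^k) - x^k)$ with directed factors $t_k$, so that $t_k \equiv 1$ recovers the DR splitting method; the operator N and the factor schedule are passed in as placeholders and are not the exact objects of (5).

```python
# Minimal sketch (assumption): the directed update is taken in the relaxed
# Krasnosel'skii-Mann form x^{k+1} = x^k + t_k * (N(x^k) - x^k), where N is the
# DR operator and t_k are the directed factors; t_k = 1 recovers the DR method.
import numpy as np

def directed_dr(N, x0, factors, num_iters=100):
    """Iterate x^{k+1} = x^k + t_k * (N(x^k) - x^k) for k = 0, ..., num_iters - 1."""
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        t_k = factors(k)          # directed factor at iteration k (placeholder rule)
        x = x + t_k * (N(x) - x)  # t_k = 1 gives the plain DR step
    return x
```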
For the directed DR splitting method (
5), an important task is to find the largest possible upper bound of the directed factors. As is well known, the traditional technique for estimating this bound is as follows. First, view (
5) as the Krasnosel’skiĭ–Mann iteration for the non-expansive DR operator. Then, borrow the corresponding estimate concerning this iteration as the upper bound of the directed factors [
2,
4].
In this article, we aim to derive the best possible estimates of the directed factors. To this end, we exploit the
firm non-expansiveness (Lemma 1 of [
3]) of the DR operator for the first time. Thus, as far as the directed factors are concerned, our new estimates of their upper bounds are much larger than those in [
2,
4].
Clearly, the new technique corresponds to firm non-expansiveness of the DR operator, whereas the traditional one corresponds to its non-expansiveness.
For an application to deep learning, we model rare feature selection [
5,
6] (see (
33) below) into an appropriate monotone inclusion (
36) below, which is a special case of (
1). As shown in
Section 6, both the DR splitting method and its directed version can be used to solve this inclusion, without worrying about the issue of choosing multiple proximal parameters involved in those splitting methods [
7]. Numerical results indicate that they can solve (
36) to accuracy similar to that of the methods in [
7] in about 7 or 8 seconds under the given stopping criteria, but they avoid the time-consuming (tens of minutes or more) trial-and-error process of determining the parameters. In this sense, the directed DR splitting method is user-friendly and efficient in solving this feature selection model.
The rest of this article is organized as follows. In
Section 2, we review the definitions of firmly non-expansive operator and maximal monotone operator, and give some related properties. In
Section 3, we formally state the aforementioned directed DR splitting method, with new and much weaker conditions on the directed factors; see (
7) and (
8) below. In
Section 4, we prove weak convergence of the directed DR splitting method. In
Section 5, we compare the new conditions on the directed factors with existing ones, and we numerically demonstrate that the upper bounds derived from the former are much larger. In
Section 6, we model rare feature selection in deep learning in a new way, and we conduct numerical experiments to confirm the efficiency of our directed DR splitting method in solving the corresponding test problems. Finally, in
Section 7, we conclude this article.
2. Preliminaries
In this section, we first give some basic definitions and then provide some auxiliary results for later use.
Definition 1 ([
3,
8]).
Let $D$ be a nonempty subset of $H$. An operator $N: D \to H$ is called non-expansive if and only if the following is met: $\|N x - N y\| \le \|x - y\|$ for all $x, y \in D$; it is called firmly non-expansive if and only if the following is met: $\|N x - N y\|^2 \le \langle N x - N y, x - y \rangle$ for all $x, y \in D$.
Definition 2 ([
3,
8]).
Let $A: H \rightrightarrows H$ be an operator. It is called monotone iff the following holds: $\langle u - v, x - y \rangle \ge 0$ for all $x, y \in H$, $u \in A(x)$, and $v \in A(y)$; it is called maximal monotone iff it is monotone and, for given $x \in H$ and $u \in H$, the following implication relation holds: if $\langle u - v, x - y \rangle \ge 0$ for all $y \in H$ and $v \in A(y)$, then $u \in A(x)$. For any given operator $A$, its effective domain is defined by the following: $\operatorname{dom} A := \{x \in H : A(x) \neq \emptyset\}$.
Assume that
is maximal monotone. A fundamental property is that, as proved by Minty [
9], for any given positive number
and
, there exists a unique
such that
or
, where
I is the identity operator, i.e.,
for all
. Thus,
is well-defined and is called the resolvent of
A.
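As a finite-dimensional illustration (not taken from the original text), the resolvent can be computed explicitly when A is the linear monotone operator induced by a positive semidefinite matrix; the matrix below is an arbitrary example.

```python
# Illustration (assumption: a finite-dimensional linear monotone operator).
# For A(x) = M x with M positive semidefinite, the resolvent with parameter mu
# maps x to the unique u satisfying x = u + mu * M u, i.e., u = (I + mu*M)^{-1} x.
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])       # positive definite, hence A is (maximal) monotone
mu = 0.5
I2 = np.eye(2)

def resolvent(x):
    return np.linalg.solve(I2 + mu * M, x)

x = np.array([1.0, -3.0])
u = resolvent(x)
assert np.allclose(x, u + mu * (M @ u))   # u is the unique preimage under I + mu*A
```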
Definition 3. Let $f: H \to (-\infty, +\infty]$ be a convex function. Then, for any given $x \in H$, the sub-differential of f at x is defined by the following: $\partial f(x) := \{ s \in H : f(y) \ge f(x) + \langle s, y - x \rangle \ \text{for all } y \in H \}$. Each s is called a sub-gradient of f at x. Moreover, if f is further continuously differentiable, then $\partial f(x) = \{\nabla f(x)\}$, where $\nabla f(x)$ is the gradient of f at x. As is well known, the sub-differential of any closed proper convex function in an infinite-dimensional Hilbert space is maximal monotone as well. An important example is the sub-differential of the indicator function $\iota_C$ defined by $\iota_C(x) = 0$ if $x \in C$ and $\iota_C(x) = +\infty$ otherwise,
where
is some nonempty closed convex set in
. Moreover, for any given positive number
, we have
where
is the usual projection onto
.
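For instance, taking C to be the closed unit Euclidean ball (an arbitrary choice made only for this illustration), the resolvent of $\mu\,\partial\iota_C$ is the projection onto C for every $\mu > 0$:

```python
# Illustration: the resolvent of mu times the sub-differential of the indicator
# function of a closed convex set C equals the projection onto C, for every
# mu > 0. Here C is the closed unit Euclidean ball (an arbitrary example).
import numpy as np

def project_unit_ball(x):
    norm_x = np.linalg.norm(x)
    return x if norm_x <= 1.0 else x / norm_x

x = np.array([3.0, 4.0])
print(project_unit_ball(x))   # [0.6, 0.8]; the same output for every mu > 0
```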
Lemma 1 (Theorem 2 of [
8]).
Let μ be any positive number. An operator A on H is monotone if and only if its resolvent is firmly non-expansive. Below, we give a lemma that characterizes the relation between the zero points of problem (1) and the fixed points of the DR operator N.
Lemma 2 ([
3]).
If solves (1), then there exists such that the following holds:
Lemma 3. Assume that both and are maximal monotone operators. Then the resulting operator is given as follows:
which must be maximal monotone. Proof. Note that
A and
B are maximal monotone. Thus, the first operator on the right-hand side is maximal monotone. Meanwhile, the linearity of the identity operator means that the second is also maximal monotone [
9]. Maximality of
T follows from [
10]. □
Lemma 4 ([
11,
12]).
Consider any maximal monotone operator . Assume that the sequence in converges weakly to w, and the sequence on dom T converges strongly to s. If for any k, then the relation must hold.
3. Directed DR Splitting
In this section, we describe in detail the directed DR splitting method, i.e., Algorithm 1 below, and make assumptions on
and the directed factors.
Algorithm 1 Directed DR splitting method
1: Choose . Choose , and . Set .
2: Choose and satisfying (7) and (8). Compute Set
In this article, to guarantee its weak convergence, we make the following assumptions.
Assumption 1. Assume that and in the directed DR splitting method satisfy the following:
where ε is a prescribed sufficiently small positive number,
and the following:
where σ is chosen in in advance.
In particular, in the case of
and
, (
7) and (
8) can be replaced by the following:
In contrast, other existing upper bounds in
Section 5 are around
.
As to
in (
6), we may evaluate it in the following way. First solve the following:
to obtain
, then solve the following:
to obtain
. Thus, we write the following:
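A minimal sketch of this two-step evaluation is given below; it assumes that the two subproblems amount to the resolvent steps $y = J_{\mu B}(x)$ and $z = J_{\mu A}(2y - x)$ and that $N(x) = x + z - y$, as for the classical DR operator, with the resolvents supplied as black boxes.

```python
# Sketch of evaluating the DR operator N at x (assumption: the two subproblems
# in the text correspond to resolvent steps of B and of A, respectively, and
# N(x) = x + z - y, as for the classical DR operator).
def dr_operator(x, resolvent_B, resolvent_A):
    y = resolvent_B(x)            # first subproblem:  y = J_{mu B}(x)
    z = resolvent_A(2 * y - x)    # second subproblem: z = J_{mu A}(2y - x)
    return x + z - y              # value of N(x)
```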
The directed DR splitting method above reminds us of the Krasnosel’skiĭ–Mann iteration, which is used for finding a fixed point of a non-expansive operator
N (whose non-expansiveness is implied by its firm non-expansiveness); see [
2,
4,
13] for recent discussions and the references cited therein.
4. Weak Convergence
In this section, we prove weak convergence of Algorithm 1 in the Hilbert space.
For brevity, we may write as throughout this section.
Lemma 5. Assume that . If , then the following holds:
Notice that this lemma was given in the first author’s manuscript, which is discussed further in Remark 1. Its proof is elementary and based on the following:
To simplify the proof of weak convergence of Algorithm 1, we first give the following result.
Theorem 1. If Assumption 1 holds, then we have the following: Proof. It follows from (
6) that the following is valid:
On the other hand, Lemma 2 implies that there exists
such that the following holds:
Thus, we have the following:
So, we can obtain the following:
where the inequality follows from the firm non-expansiveness (Lemma 1 of [
3]) of the DR operator. As to the
term, it is straightforward to obtain the following:
Combining this with (
13) and the following:
yields the following:
Below, we bound the inner product
in (
14). In view of (
10)–(
12), we have
Thus, we obtain the following:
where the inequality follows from (Lemma 4 of [
14]). Combining this with the following:
yields the following:
Next, we bound the inner product
in (
14). Since the following holds:
and
we have the following:
In summary, using the last equality and (
17) to bound (
14) yields the desired result. □
Now we state and prove weak convergence of the sequence generated by Algorithm 1.
Theorem 2. Let be the sequence generated by Algorithm 1. If (7) and (8) hold, then the sequence weakly converges to a fixed point of N. Proof. In view of Theorem 1 and (
7), we can obtain the following:
where
and the following holds:
Then we obtain the following:
Combining this with Lemma 5 and (
8) results in the following:
which further yields the following:
Meanwhile, by Lemma 2 and (
8) we obtain the following:
from which we can obtain the following:
Obviously, based on (
19)–(
21), we can conclude the following:
It follows from (
7), (
6) and (
23) that the following holds:
On the other hand, as proved in (
22),
is bounded in norm, thus there exists at least one weak cluster point
, i.e., written as follows:
Finally, since is continuous and monotone, it must be maximal monotone. In view of Lemma 4, .
Denote the following:
Then, it follows from (
18), (
22) and (
23) that
exists as well.
Below, we show that
weakly converges to
. Let
and
be two weak cluster points of
. Then, repeating the arguments above yields that
and
are solutions. Correspondingly, we set the following:
Consider the following:
where the inequality follows from (
8), which indicates the following:
Meanwhile, we obtain the following:
Combining this with (
23) and taking the limit along
, where
such that
along
weakly converges to
, we obtain the following:
Similarly, we also obtain the following:
Adding these two inequalities yields the following:
and
converges weakly. □
Finally, with the help of the theorem above, we further state and prove weak convergence of the main sequence .
Theorem 3. Let be the sequence generated by Algorithm 1. If (7) and (8) hold, then the main sequence weakly converges to a solution of (1). Proof. In view of (
10)–(
12), we have the following:
Thus, we obtain the following:
Note that (
22) and the following:
imply that
is bounded in norm as well.
On the other hand, by the definition of
T, we have the following:
where the relation ∋ follows from (
27) and (
28).
Since
is bounded in norm, so is
, where
corresponds to the one in (
25). Thus, there exists at least one weak cluster point, written as follows:
Combining this with (
29), Lemma 3 and Lemma 4 yields the following:
This shows the following:
Finally, we prove uniqueness of the weak cluster point
. Let
and
be two weak cluster points of
. In view of (
27), we have the following:
Since
A is monotone, we further have the following:
as proved in [
14]. Combining this with (
26) yields
. □
Remark 1. In the proof of Theorem 1, there are two crucial parts. One is that, by fully exploiting firm non-expansiveness of the DR operator, we obtain the inequality (13). The other is an appropriate application of (Lemma 4 of [14]) in deriving the inequality (15). In the proof of Theorem 2, except for the uniqueness part, we follow point by point the analytical techniques recently developed in the first author’s manuscript entitled “On an accelerated Krasnosel’skiĭ-Mann iteration”. This manuscript and a revised version entitled “Directed Krasnosel’skiĭ-Mann iteration” can be found in a single pdf file via https://ydong2024.github.io/downloads/directedKM.pdf (accessed on 22 August 2025). Almost all the main results in the three theorems above are taken from the second author’s thesis, completed in April 2024, and they are presented here with slight revisions. The exceptions are the two proofs of uniqueness of the weak cluster point in Theorem 2 and Theorem 3, which are absent from that thesis. Later on, the first author completed a detailed proof in the revised version just mentioned. Furthermore, the proof of uniqueness of the weak cluster point in Theorem 3 is also due to the first author.
5. Comparisons
In this section, we numerically demonstrate the assumption (
8) to some extent. Meanwhile, we compare (
8) with other recently proposed ones.
For brevity, we simply set
. Then (
8) reduces to the following:
Note that, in contrast to (
8), we no longer introduce the extra
above. First, we present the function of
and
under different values of
, as shown in
Figure 1:
Numerical demonstration of (
30) is given in
Table 1, where
stands for a slightly lower approximation of the maximum of
f in (
30) with respect to
, and
stands for the directed factor.
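For completeness, a slightly lower approximation of such a maximum can be obtained by a simple grid search; the function used below is only a placeholder, since the actual f in (30) is not reproduced here.

```python
# Grid search giving a slightly lower approximation of max f over an interval
# (the function f below is a placeholder, not the actual f in (30)).
import numpy as np

def grid_max(f, lo, hi, num=100_000):
    grid = np.linspace(lo, hi, num)
    return float(np.max(f(grid)))     # never exceeds the true maximum on [lo, hi]

approx = grid_max(lambda s: s * (1.0 - s), 0.0, 1.0)   # true maximum is 0.25
```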
Next, we consider those assumptions made in the first author’s manuscript mentioned in Remark 1. We set
. Thus, they become the following:
Numerical demonstration of (
31) is given in
Table 2.
Finally, we consider the assumptions recently made in [
2]. We set the following:
in our notation, the algorithm therein can also be written in the form (
6), and the corresponding assumptions become the following:
Numerical demonstration of (
32) is given in
Table 3.
From
Table 1,
Table 2 and
Table 3, we can observe that our computed values of
are consistently larger than the other two for each sampling point. In particular, in the case of
, our new upper bound of
is
, in sharp contrast to
in
Table 2 and
in
Table 3, respectively.
6. Modeling and Numerical Experiments
In this section, we modeled rare feature selection in deep learning in a new way and confirmed the efficiency of Algorithm 1 in solving the corresponding test problems. Rather than striving for the largest possible test problems, we tried to make the basic ideas and techniques as clear as possible.
We performed all numerical experiments in Python 3.9.2.
We compared Algorithm 1 with the classical DR splitting method, selected because of their similarity in features, applicability, and implementation effort.
Our test problem is motivated by “rare feature selection” formulated in Johnstone and Eckstein (Section 6.3 of [
6]) (TripAdvisor data are available at
https://github.com/yanxht/TripAdvisorData, accessed on 1 May 2024), which can be stated as follows:
where
X is the
n-by-
d data matrix,
H is the
d-by-
r coefficient matrix,
is the target vector,
is the all-one vector, and
is an offset and
. In [
5,
6], the authors gave a detailed description of the relationship of
(see (Section 6.3 of [
6]) and (Section 3 of [
5]) for more details). The
norm on
enforces sparsity of
, which in turn fuses together coefficients associated with similar features. The
norm on
additionally enforces sparsity on these coefficients, which is also desirable.
As described in [
5,
6], we applied this model to the TripAdvisor hotel-review dataset. The response variable
y was the overall rating of the hotel, in the set
. The features were the counts of certain adjectives in the review. Many adjectives were very rare, with 95% of the adjectives appearing in less than 5% of the reviews. There were 7573 adjectives from 169,987 reviews, and the auxiliary similarity tree
had 15,145 nodes. The 169,987 × 7573 design matrix
X and the 7573 × 15,145 matrix
H arising from the similarity tree
were both sparse, having 0.32% and 0.15% nonzero entries, respectively.
By replacing the least squares term with the
norm term, we obtained:
(note that
here corresponds to
in (Equation (7.4) of [
7]) when the parameter μ there is taken to be
) and the corresponding optimality condition is written as follows:
where
In this article, we rewrote (
34) as follows:
Denote the following:
and the corresponding optimality condition is as follows:
Thus, we obtain the following:
or
where
Note that the classical DR splitting method reads as follows:
So, we obtained the intermediate iterates, written as follows:
Moreover, the resolvent of the
norm term can be computed by the following:
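Concretely, this resolvent acts componentwise by soft-thresholding; in the sketch below the threshold stands for the product of the proximal parameter and the regularization weight, and the exact constant in the formula above is not reproduced.

```python
# Componentwise soft-thresholding, i.e., the resolvent (proximal mapping) of a
# weighted l1 norm. The threshold `thresh` stands for the product of the
# proximal parameter and the regularization weight (an assumption; the exact
# constant used in the formula above is not reproduced here).
import numpy as np

def soft_threshold(v, thresh):
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)
```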
As for Algorithm 1, we first generated the intermediate iterates in the same way, and then we considered the following:
Equivalently, we obtain the following:
Thus, we first solve the following:
to obtain
, then the following:
Finally, we compute the following:
Note that we implemented the celebrated conjugate gradient method with only one iteration (which is quite cheap) to solve subproblem (
40).
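For reference, a single conjugate gradient iteration for a symmetric positive definite system $Mz = b$ reduces to one exact line search along the initial residual; the sketch below treats the operator M and the right-hand side b of subproblem (40) as black boxes, since they are not reproduced here.

```python
# One conjugate gradient iteration for a symmetric positive definite system
# M z = b, started from z0 (a single CG iteration performs one exact line
# search along the initial residual). The operator M (via matvec) and b stand
# for the data of subproblem (40), which is not reproduced in this sketch.
import numpy as np

def cg_one_iteration(matvec, b, z0):
    r = b - matvec(z0)            # initial residual, also the search direction
    Mr = matvec(r)
    denom = float(r @ Mr)
    if denom == 0.0:              # z0 already solves the system (or r = 0)
        return z0
    alpha = float(r @ r) / denom  # exact step size along r
    return z0 + alpha * r
```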
DR: The classical DR splitting method.
dDR , : Algorithm 1 with an upper bound given in [2].
dDR , : Algorithm 1 with an upper bound given in this article.
For the starting points, we chose
For the parameter , we considered the following two cases.
In this case, we set
(corresponding to
in [
7]), and we took
, where
. We used the stopping criterion, written as follows:
In this case, we set
(corresponding to
in [
7]), and we took
, where
49,233. We used the following stopping criterion:
The corresponding numerical results were plotted in the figures (left) in
Figure 2 and
Figure 3, where “CPU Time” stands for the elapsed time in seconds (measured using tic and toc). Notice that the figures (right) were taken from [
7], in which “Algorithm 1”, “Algorithm 2” and “Algorithm 3” are those algorithms recently proposed in [
7], “JE splitting” is a splitting method of (Algorithm 1 of [
6]), and “Vũ splitting” is a splitting method proposed in [
15].
From
Figure 2 and
Figure 3, it can be seen that dDR with
,
was faster than the other two.
Note that the ratio
is actually identical to the counterpart given in [
7] provided that
,
are the same. Thus, comparisons with the numerical results in [
7] (i.e., the figures (right) in
Figure 2 and
Figure 3) are meaningful. At first sight, the algorithms of [
7] outperformed dDR with
,
. However, we would like to stress that the latter solved (
36) to accuracy similar to that of the methods in [
7] in about 7 or 8 s, while avoiding the time-consuming (tens of minutes or more) trial-and-error process of determining the proximal parameters required by the algorithms of [
7]. In this sense, the directed DR splitting method is user-friendly and efficient in solving this feature selection model.