1. Introduction
In this paper, we are interested in designing a proximal point procedure to solve the following optimization problem
$$\min_{X\in\mathbb{R}^{n\times p}} F(X) \quad \text{s.t.}\quad X^{\top}X = I_p, \qquad (1)$$
where $F:\mathbb{R}^{n\times p}\to\mathbb{R}$ is a continuously differentiable matrix function and $I_p$ denotes the $p\times p$ identity matrix. The feasible set of Problem (1), denoted by $\mathrm{St}(n,p) = \{X\in\mathbb{R}^{n\times p} : X^{\top}X = I_p\}$, is known as the Stiefel manifold [1]. This set constitutes an embedded Riemannian submanifold of the Euclidean space $\mathbb{R}^{n\times p}$ with dimension equal to $np - \tfrac{1}{2}p(p+1)$ (see [1]). Notice that (1) is a well-defined optimization problem because $\mathrm{St}(n,p)$ is a compact set and $F$ is a continuous function; therefore, the Weierstrass theorem ensures the existence of at least one global minimizer (and even a global maximizer) of $F$ on the Stiefel manifold.
The orthogonality-constrained minimization problem (1) is widely applicable in many fields, such as the nearest low-rank correlation matrix problem [2,3], the linear eigenvalue problem [4,5,6], sparse principal component analysis [5,7], Kohn–Sham total energy minimization [4,6,8,9], low-rank matrix completion [10], the orthogonal Procrustes problem [8,11], maximization of sums of heterogeneous quadratic functions from statistics [4,12,13], the joint diagonalization problem [13], dimension reduction techniques in pattern recognition [14], and deep neural networks [15,16], among others.
In the Euclidean setting, given a closed proper convex function $f:\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\}$, the proximal operator $\mathrm{prox}_{\lambda f}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is defined by
$$\mathrm{prox}_{\lambda f}(v) = \arg\min_{x\in\mathbb{R}^{n}}\; f(x) + \frac{1}{2\lambda}\|x - v\|_2^2, \qquad (2)$$
where $\|\cdot\|_2$ is the standard norm of $\mathbb{R}^{n}$ and $\lambda>0$ is the proximal parameter [17].
The scalar $\lambda$ plays an important role in controlling the magnitude by which the proximal operator sends points towards the optimal values of $f$. In particular, larger values of $\lambda$ map points closer to the optimum, while smaller values of this parameter promote a smaller movement towards the minimum. One can design iterative optimization procedures that use the proximal operator to define the recursive update scheme. For example, the proximal minimization algorithm [17] minimizes the cost function $f$ by consecutively applying the proximal operator $\mathrm{prox}_{\lambda f}$, similarly to fixed-point methods, to some given initial vector $x^{0}\in\mathbb{R}^{n}$.
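To make this concrete, the following minimal MATLAB sketch (not taken from the paper) applies the proximal minimization algorithm to a strongly convex quadratic, for which the proximal operator (2) has a closed form; the matrix Q, the vector b, and the value of lambda below are illustrative choices.

```matlab
% Proximal minimization sketch for f(x) = 0.5*x'*Q*x - b'*x.
% For this f, prox_{lambda f}(v) = (Q + (1/lambda)*eye(n)) \ (b + v/lambda).
n = 50;
rng(0);
M = randn(n); Q = M'*M + eye(n);          % symmetric positive definite (assumed data)
b = randn(n,1);
lambda = 1.0;                             % proximal parameter
x = zeros(n,1);                           % initial vector
for k = 1:200
    x_new = (Q + (1/lambda)*eye(n)) \ (b + x/lambda);   % prox_{lambda f}(x)
    if norm(x_new - x) < 1e-10
        x = x_new;
        break;
    end
    x = x_new;
end
fprintf('distance to the minimizer Q\\b: %.2e\n', norm(x - Q\b));
```

In agreement with the discussion above, increasing lambda in this sketch makes each proximal step travel farther towards the minimizer.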
Several researchers have proposed generalizations of the proximal minimization algorithm to the Riemannian context. This kind of method was first considered in the Riemannian setting by Ferreira and Oliveira [18] in the particular case of Hadamard manifolds. Papa Quiroz and Oliveira [19] adapted the proximal point method to quasiconvex functions and proved full convergence of the generated sequence to a minimizer over Hadamard manifolds. In addition, Souza and Oliveira [20] introduced a proximal point algorithm to minimize DC functions on Hadamard manifolds. Wang et al. [21] established linear convergence and finite termination of this type of algorithm on Hadamard manifolds. For this same type of manifold, the authors of [22] proved global convergence of inexact proximal point methods. Recently, Almeida et al. [23] developed a modified version of the proximal point procedure for minimization over Hadamard manifolds. For the specific case of optimization on the Stiefel manifold, the authors of [7] proposed a proximal gradient method to minimize the sum of two functions over $\mathrm{St}(n,p)$, where the first function is smooth with a Lipschitz continuous gradient and the second is convex and Lipschitz continuous. Another proximal-type algorithm was proposed in [24], where the authors developed a proximal linearized augmented Lagrangian algorithm (PLAM) to solve (1). However, the PLAM algorithm does not build a feasible sequence of iterates, which differs from the rest of the Riemannian proposals.
To minimize a function $f$ defined on a Riemannian manifold $\mathcal{M}$, all the approaches presented in [18,19,20,21,23,25] consider the following generalization of (2)
$$x^{k+1} \in \arg\min_{x\in\mathcal{M}}\; f(x) + \frac{1}{2\lambda_k}\,\mathrm{dist}^2(x, x^{k}), \qquad (3)$$
where $\mathrm{dist}(\cdot,\cdot)$ is the Riemannian distance [1]. The main disadvantage of all these works is that the proposed methods and theoretical analyses rely on the exponential mapping, which requires the construction of geodesics on $\mathcal{M}$. However, it is not always possible to find closed expressions for geodesics over a given Riemannian manifold, since geodesics are defined through a differential equation [1,26]. Even when a closed formula for the corresponding geodesics on $\mathcal{M}$ is available, the computational cost of evaluating the exponential mapping over a matrix space is high, which is an obstacle to solving large-scale problems.
In this paper, we introduce a very simple proximal point algorithm to tackle Stiefel-manifold-constrained optimization problems. The proposed approach replaces the term $\mathrm{dist}^2(x, x^{k})$ in (3) with the usual matrix distance $\|X - X^{k}\|_F^2$ in order to avoid purely Riemannian concepts and techniques such as the Riemannian distance and geodesics. The proposed iterative method tries to solve the optimization problem (1) by repeatedly applying our modified proximal operator to a given starting point. We prove (without imposing the Lipschitz continuity hypothesis) that our method converges to critical points of the restriction of the cost function to the Stiefel manifold. Our preliminary computational results suggest that our proposal presents competitive numerical performance against several feasible methods existing in the literature.
The rest of this manuscript is organized as follows.
Section 2 summarizes a few well-known notations and concepts of linear algebra and Riemannian geometry that will be exploited in this paper. Afterwards,
Section 3 introduces a new proximal point algorithm to deal with optimization problems with orthogonality constraints.
Section 4 provides a concise convergence analysis for the proposed algorithm.
Section 5 presents some illustrative numerical results, where we compare our approach with several state-of-the-art methods for the solution of linear eigenvalue problems and the minimization of sums of heterogeneous quadratic functions. Finally, the paper ends with a conclusion in
Section 6.
2. Preliminaries
Throughout this paper, we say that $W\in\mathbb{R}^{n\times n}$ is skew-symmetric if $W^{\top} = -W$. Given a square matrix $A\in\mathbb{R}^{n\times n}$, $\mathrm{skew}(A)$ denotes the skew-symmetric part of $A$; that is, $\mathrm{skew}(A) = \tfrac{1}{2}(A - A^{\top})$. The trace of $X$ is defined as the sum of its diagonal elements, which we denote as $\mathrm{Tr}(X)$. The standard inner product between two matrices $X, Y\in\mathbb{R}^{n\times p}$ is given by $\langle X, Y\rangle = \mathrm{Tr}(X^{\top}Y)$. The Frobenius norm is defined by $\|X\|_F = \sqrt{\mathrm{Tr}(X^{\top}X)}$. Let $X$ be an arbitrary matrix in the Stiefel manifold; the tangent space of the Stiefel manifold at $X$ is given by [1]
$$T_X\mathrm{St}(n,p) = \{Z\in\mathbb{R}^{n\times p} : Z^{\top}X + X^{\top}Z = 0\}.$$
Let $Z_1, Z_2\in T_X\mathrm{St}(n,p)$; the canonical metric [6] associated with the tangent space of the Stiefel manifold is defined by
$$\langle Z_1, Z_2\rangle_c = \mathrm{Tr}\!\left(Z_1^{\top}\left(I_n - \tfrac{1}{2}XX^{\top}\right)Z_2\right).$$
Let $F:\mathbb{R}^{n\times p}\to\mathbb{R}$ be a differentiable function; we denote as $\nabla F(X)$ the matrix of partial derivatives of $F$ (the Euclidean gradient of $F$). Let $F:\mathrm{St}(n,p)\to\mathbb{R}$ be a smooth function defined on the Stiefel manifold; then the Riemannian gradient of $F$ at $X\in\mathrm{St}(n,p)$, denoted as $\mathrm{grad}\,F(X)$, is the unique vector in $T_X\mathrm{St}(n,p)$ satisfying
$$\langle \mathrm{grad}\,F(X), \xi\rangle_c = \left.\frac{d}{dt}F(\gamma(t))\right|_{t=0} \quad \text{for all } \xi\in T_X\mathrm{St}(n,p),$$
where $\gamma$ is any curve on $\mathrm{St}(n,p)$ that verifies $\gamma(0) = X$ and $\gamma'(0) = \xi$.
The Riemannian gradient of $F$ under the canonical metric has the following closed expression [4,6]
$$\mathrm{grad}\,F(X) = \nabla F(X) - X\,\nabla F(X)^{\top}X. \qquad (5)$$
In addition, based on Formula (5), we can define the following projection operator over $T_X\mathrm{St}(n,p)$:
$$\Pi_X(W) = W - XW^{\top}X, \qquad W\in\mathbb{R}^{n\times p}. \qquad (6)$$
It can be easily shown that the operator (6) effectively projects matrices from $\mathbb{R}^{n\times p}$ to the tangent space $T_X\mathrm{St}(n,p)$. This projection operator was also considered in [27].
Similar to the case of smooth unconstrained optimization, $X^{*}\in\mathrm{St}(n,p)$ is a critical point of $F$ if it satisfies [1,28]
$$\mathrm{grad}\,F(X^{*}) = 0.$$
Therefore, the critical points of the restriction of the cost function $F$ to the Stiefel manifold are candidates to be local minimizers of Problem (1). Here, we clarify that the objective function $F$ that appears in (1) has both a Euclidean gradient and a Riemannian gradient. In the rest of this paper, we will denote by $\mathrm{grad}\,F(X)$ the Riemannian gradient of the restriction of $F$ to the set $\mathrm{St}(n,p)$ under the canonical metric.
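As a quick illustration of Formula (5), the following MATLAB sketch computes the canonical Riemannian gradient for the illustrative cost F(X) = -Tr(X'AX) (an assumption made here only for the example) and checks that it lies in the tangent space of the Stiefel manifold.

```matlab
% Canonical Riemannian gradient and tangency check for F(X) = -trace(X'*A*X).
n = 200; p = 10;
rng(1);
A = randn(n); A = (A + A')/2;             % assumed symmetric data matrix
[X,~] = qr(randn(n,p),0);                 % a random point on St(n,p)
G = -2*A*X;                               % Euclidean gradient of F
gradF = G - X*(G'*X);                     % Riemannian gradient, Formula (5)
fprintf('tangency residual : %.2e\n', norm(gradF'*X + X'*gradF,'fro'));
fprintf('feasibility of X  : %.2e\n', norm(X'*X - eye(p),'fro'));
```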
3. Proximal Point Algorithm on $\mathrm{St}(n,p)$
In this section, we propose an implicitly defined curve on the Stiefel manifold. We also verify that the proposed curve satisfies properties analogous to those of Riemannian gradient methods. Using this curve, we then present the proposed proximal point algorithm in detail.
As we mentioned in the introduction, we consider an adaptation of the exact proximal point method to the framework of minimization over the Stiefel manifold: given a feasible point $X\in\mathrm{St}(n,p)$, we compute the new iterate as a point on the curve $Y_X(\cdot)$ defined by
$$Y_X(\lambda) \in \arg\min_{Y\in\mathrm{St}(n,p)}\; F(Y) + \frac{1}{2\lambda}\|Y - X\|_F^2. \qquad (7)$$
Observe that the argmin set in (7) is never empty, since the objective of (7) is a continuous function on a compact domain. In addition, note that this proximal optimization problem is obtained from (3) by substituting the Riemannian distance with the standard metric associated with the matrix space $\mathbb{R}^{n\times p}$. Additionally, $Y_X$ satisfies $Y_X(0) = X$. Thus, $Y_X$ is a curve on $\mathrm{St}(n,p)$ that connects the consecutive iterates $X$ and $Y_X(\lambda)$. This is a property analogous to the one verified by retraction-based line-search curves (see [1,29]). Here it is important to remark that (7) can have multiple global solutions; in this situation, $Y_X$ would not be a curve, since it would not be a well-defined function.
The lemma below establishes that the resulting scheme is a descent iterative process.
Lemma 1. Given $X\in\mathrm{St}(n,p)$ and $\lambda>0$, consider the curve (7). Then,
$$\|Y_X(\lambda) - X\|_F^2 \le 2\lambda\left(F(X) - F(Y_X(\lambda))\right). \qquad (8)$$
Proof of Lemma 1. Let $\lambda$ be a positive real number. In view of the optimality of $Y_X(\lambda)$ in (7), we have
$$F(Y_X(\lambda)) + \frac{1}{2\lambda}\|Y_X(\lambda) - X\|_F^2 \le F(X) + \frac{1}{2\lambda}\|X - X\|_F^2 = F(X). \qquad (9)$$
Multiplying both sides of (9) by $2\lambda$ and rearranging, we obtain (8), completing the proof. □
Lemma 1 guarantees that the proposed approach is a descent iterative process. Therefore, by executing an iterative process based on the proximal curve (7), we can minimize the objective function and at the same time move towards stationarity. Taking this fact into account, we propose our proximal point algorithm on $\mathrm{St}(n,p)$, whose steps are described in Algorithm 1 (PPA-St).
Algorithm 1 PPA-St
1: Given $X^{0}\in\mathrm{St}(n,p)$, set $k = 0$, and let $\{\lambda_k\}$ be a sequence such that $0 < \lambda_{\min} \le \lambda_k \le \lambda_{\max}$ for all $k$.
2: while a stopping criterion is not satisfied do
3:  Compute
$$X^{k+1} \in \arg\min_{Y\in\mathrm{St}(n,p)}\; F(Y) + \frac{1}{2\lambda_k}\|Y - X^{k}\|_F^2. \qquad (10)$$
4:  If $X^{k+1} = X^{k}$, then stop the algorithm.
5:  $k = k + 1$,
6: end while
Notice that the proposed PPA-St algorithm can be interpreted as a Euclidean proximal point algorithm with respect to the cost function $\Psi$, where $\delta_{\mathrm{St}(n,p)}$ is the indicator function given by
$$\delta_{\mathrm{St}(n,p)}(X) = \begin{cases} 0 & \text{if } X\in\mathrm{St}(n,p),\\ +\infty & \text{otherwise}, \end{cases}$$
and $\Psi$ is defined by
$$\Psi(X) = F(X) + \delta_{\mathrm{St}(n,p)}(X).$$
Thus, the PPA-St process can be seen as a special case of the proximal alternating linearized minimization (PALM) algorithm developed in [30], by an appropriate choice of the functions involved (in the notation of [30]). In particular, PALM has a rich convergence analysis and was already applied to the Stiefel manifold in [31,32]. Nonetheless, the main differences between PALM and PPA-St are that our proposal does not require linearizing the smooth part of the objective to solve the problem, and PPA-St does not involve inertial steps.
On the other hand, there are some cost functions for which the proximal subproblem (7) can be solved analytically. For example, if $F$ is a linear function of the form $F(X) = \mathrm{Tr}(G^{\top}X)$, where $G$ is a data matrix, then the proximity operators are
$$Y_X(\lambda) \in \arg\min_{Y\in\mathrm{St}(n,p)}\; \mathrm{Tr}(G^{\top}Y) + \frac{1}{2\lambda}\|Y - X\|_F^2,$$
with $\lambda>0$ constant. Since $\|Y\|_F^2 = \mathrm{Tr}(Y^{\top}Y) = p$ for every $Y\in\mathrm{St}(n,p)$, the optimization problem above is equivalent to
$$Y_X(\lambda) \in \arg\min_{Y\in\mathrm{St}(n,p)}\; \left\|Y - \left(\tfrac{1}{\lambda}X - G\right)\right\|_F^2,$$
which has a closed-form solution (see Proposition 2.3 in [8]). Additionally, let $A$ and $B$ be two constant matrices of appropriate sizes and $X\in\mathrm{St}(n,p)$. Let us consider the objective function given by $F(X) = \|A - XB\|_F^2$. In this special case, the proximal operator is reduced to
$$Y_X(\lambda) \in \arg\min_{Y\in\mathrm{St}(n,p)}\; \|A - YB\|_F^2 + \frac{1}{2\lambda}\|Y - X\|_F^2.$$
The above constrained optimization problem can be reformulated as
$$Y_X(\lambda) \in \arg\min_{Y\in\mathrm{St}(n,p)}\; \left\|Y - \left(2AB^{\top} + \tfrac{1}{\lambda}X\right)\right\|_F^2,$$
which is an orthogonal Procrustes problem with an analytical solution (see [33]). However, in general, the proximal operator (7) does not have a closed-form expression for its solutions. Therefore, in the implementation of PPA-St, we use an efficient Riemannian gradient method based on the QR-retraction mapping (see Example 4.1.3 in [1]) in order to solve the optimization subproblem (10).
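The following MATLAB sketch illustrates the analytic case discussed above under the assumption that F is linear, F(Y) = Tr(G'Y); the subproblem (7) then reduces to a nearest-orthonormal-matrix (Procrustes-type) problem solved with a thin SVD. The sizes, the matrix G, and the value of lambda are illustrative.

```matlab
% Closed-form proximal point for a linear cost F(Y) = trace(G'*Y) (assumed).
% On St(n,p), ||Y - X||_F^2 = 2p - 2*trace(X'*Y), so subproblem (7) amounts to
% maximizing trace(C'*Y) with C = X/lambda - G; the maximizer is Y = U*V'
% from the thin SVD C = U*S*V'.
n = 100; p = 5; lambda = 0.5;
rng(2);
G = randn(n,p);                           % hypothetical data matrix
[X,~] = qr(randn(n,p),0);                 % current iterate on St(n,p)
C = X/lambda - G;
[U,~,V] = svd(C,'econ');
Y = U*V';                                 % closed-form proximal point
obj = @(Z) trace(G'*Z) + norm(Z - X,'fro')^2/(2*lambda);
fprintf('feasibility of Y       : %.2e\n', norm(Y'*Y - eye(p),'fro'));
fprintf('objective at Y vs at X : %.4f vs %.4f\n', obj(Y), obj(X));
```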
By introducing the notation $F_k(Y) = F(Y) + \frac{1}{2\lambda_k}\|Y - X^{k}\|_F^2$, we propose to use the following feasible line-search method, starting at $V_0 = X^{k}$ and $i = 0$,
$$V_{i+1} = W_i\,\mathrm{chol}\!\left(W_i^{\top}W_i\right)^{-\top}, \qquad W_i = V_i - \tau_i\,\mathrm{grad}\,F_k(V_i), \qquad (15)$$
where $\mathrm{grad}\,F_k(V_i)$ is the Riemannian gradient under the canonical metric of $F_k$ evaluated at $V_i$; that is, $\mathrm{grad}\,F_k(Y) = \nabla F_k(Y) - Y\,\nabla F_k(Y)^{\top}Y$, where $\nabla F_k(Y)$ is the Euclidean gradient of $F_k$, i.e., $\nabla F_k(Y) = \nabla F(Y) + \frac{1}{\lambda_k}(Y - X^{k})$. In addition, $\mathrm{chol}(A)$ denotes the Cholesky factor obtained from the Cholesky factorization of $A$; i.e., let $A$ be a symmetric positive definite (SPD) matrix and suppose that $A = LL^{\top}$ is its Cholesky decomposition; then $\mathrm{chol}(A) = L$. Observe that this function is well defined due to the uniqueness of the Cholesky factorization of SPD matrices. Additionally, notice that in the recursive scheme (15), we are projecting $W_i$ onto the Stiefel manifold using its QR factorization obtained from the Cholesky decomposition (see Equation (1.3) in [34]). We now present the inexact version of our Algorithm 1, based on the Riemannian gradient scheme (15).
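A minimal MATLAB sketch of one step of scheme (15) is given below; the quadratic cost used for F and the fixed step-size tau are illustrative assumptions, while the Cholesky-based QR projection follows the description above.

```matlab
% One inner step of (15): Riemannian gradient step on F_k followed by the
% Cholesky-based QR projection back onto St(n,p).
n = 300; p = 8; lambda = 1.0; tau = 1e-2;
rng(3);
A = randn(n); A = (A + A')/2;             % assumed data matrix, F(X) = -trace(X'*A*X)
[Xk,~] = qr(randn(n,p),0);                % outer iterate X^k
V = Xk;                                   % inner iterate V_0 = X^k
Gk = -2*A*V + (V - Xk)/lambda;            % Euclidean gradient of F_k at V
gradFk = Gk - V*(Gk'*V);                  % Riemannian gradient, Formula (5)
W = V - tau*gradFk;                       % Euclidean trial point
L = chol(W'*W,'lower');                   % Cholesky factor of W'*W
Vnew = W / L';                            % Cholesky-QR projection, scheme (15)
fprintf('feasibility of V_{i+1}: %.2e\n', norm(Vnew'*Vnew - eye(p),'fro'));
```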
It is well known that if we endow the iterative method (15) with a globalization strategy to determine the step-size $\tau_i$, such as Armijo's rule [1,35] or a non-monotone Zhang–Hager-type condition [36,37], then the Riemannian line-search method (15) is globally convergent (see [1,36]). This means that the inner while loop of Algorithm 2 does not run indefinitely; i.e., there must exist an index $i_k$ such that $\|\mathrm{grad}\,F_k(V_i)\|_F \le \epsilon_k$ for all $i \ge i_k$. In addition, it is always possible to determine a step-size $\tau_i$ such that Armijo's rule (16) holds. For practical purposes, we implement our Algorithm 2 using the well-known backtracking strategy to find such a step-size (see [35]). Furthermore, to reduce the number of line searches performed in each inner iteration (the iterations in terms of the index $i$), we incorporate the Barzilai–Borwein step-size, which typically improves the performance of gradient-type methods (see [38]).
Algorithm 2 Inexact PPA-St
1: Given $X^{0}\in\mathrm{St}(n,p)$, set $k = 0$; choose an Armijo parameter $\sigma\in(0,1)$, a sequence $\{\lambda_k\}$ such that $0 < \lambda_{\min} \le \lambda_k \le \lambda_{\max}$ for all $k$, and a sequence of positive real numbers $\{\epsilon_k\}$ such that $\epsilon_k \to 0$.
2: while a stopping criterion is not satisfied do
3:  Set $V_0 = X^{k}$ and $i = 0$.
4:  while $\|\mathrm{grad}\,F_k(V_i)\|_F > \epsilon_k$ do
5:   Compute $V_{i+1}$ via the update (15), where $\tau_i$ is selected in such a way that the Armijo condition is satisfied, i.e.,
$$F_k(V_{i+1}) \le F_k(V_i) - \sigma\,\tau_i\,\|\mathrm{grad}\,F_k(V_i)\|_F^2. \qquad (16)$$
6:   $i = i + 1$,
7:  end while
8:  $X^{k+1} = V_i$,
9:  $k = k + 1$,
10: end while
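For reference, the following self-contained MATLAB function is a minimal sketch of Algorithm 2 under simplifying assumptions: a fixed proximal parameter, plain Armijo backtracking (without the Barzilai–Borwein step), a heuristic inner tolerance sequence, and user-supplied handles for the cost F and its Euclidean gradient. All parameter names and default values are illustrative.

```matlab
function X = ppa_st_sketch(F, dF, X, lambda, tol, maxit)
% Inexact PPA-St sketch: outer proximal loop, inner Riemannian gradient loop.
for k = 1:maxit
    Xk  = X;
    Fk  = @(Y) F(Y) + norm(Y - Xk,'fro')^2/(2*lambda);   % subproblem objective
    dFk = @(Y) dF(Y) + (Y - Xk)/lambda;                   % its Euclidean gradient
    eps_k = max(tol, 1/k^2);                              % inner tolerance (heuristic)
    for i = 1:500
        G = dFk(X); gradFk = G - X*(G'*X);                % Riemannian gradient, (5)
        ng = norm(gradFk,'fro');
        if ng <= eps_k, break; end
        tau = 1;                                          % Armijo backtracking
        while Fk(qr_chol(X - tau*gradFk)) > Fk(X) - 1e-4*tau*ng^2 && tau > 1e-12
            tau = tau/2;
        end
        X = qr_chol(X - tau*gradFk);                      % retraction step (15)
    end
    G = dF(X);
    if norm(G - X*(G'*X),'fro') <= tol, break; end        % outer stopping rule
end
end

function Q = qr_chol(W)
% Cholesky-based QR projection onto the Stiefel manifold.
L = chol(W'*W,'lower');
Q = W / L';
end
```

For instance, with a symmetric matrix A one could call X = ppa_st_sketch(@(X) -trace(X'*A*X), @(X) -2*A*X, X0, 1, 1e-4, 100) from a feasible starting point X0.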
4. Convergence Results
In this section, we establish the global convergence of Algorithms 1 and 2. Here, we say that an algorithm is globally convergent if, for any initial point $X^{0}\in\mathrm{St}(n,p)$, the generated sequence $\{X^{k}\}$ satisfies $\lim_{k\to\infty}\|\mathrm{grad}\,F(X^{k})\|_F = 0$. Thus, global convergence does not refer to convergence towards global optima. Firstly, we analyze the convergence properties of Algorithm 1 by revealing the relationships between the residuals $\|X^{k+1} - X^{k}\|_F$, $F(X^{k}) - F(X^{k+1})$, and $\|\mathrm{grad}\,F(X^{k})\|_F$.
Since $F$ is continuously differentiable over $\mathbb{R}^{n\times p}$, its Euclidean gradient $\nabla F$ is continuous. Hence, $\nabla F$ is bounded on $\mathrm{St}(n,p)$ due to the compactness of the Stiefel manifold. Then, there exists a constant $L_0 > 0$ such that
$$\|\nabla F(X)\|_F \le L_0 \quad \text{for all } X\in\mathrm{St}(n,p).$$
Consequently, the Riemannian gradient of $F$ satisfies
$$\|\mathrm{grad}\,F(X)\|_F = \|\nabla F(X) - X\,\nabla F(X)^{\top}X\|_F \le 2L_0 \qquad (18)$$
for all $X\in\mathrm{St}(n,p)$.
The following proposition states that Algorithm 1 stops at Riemannian critical points of $F$.
Proposition 1. Let $\{X^{k}\}$ be a sequence generated by Algorithm 1. Suppose that Algorithm 1 terminates at iteration $k$; then $\mathrm{grad}\,F(X^{k}) = 0$.
Proof of Proposition 1. The first-order necessary optimality condition associated with Subproblem (10) leads to
$$\mathrm{grad}\,F(X^{k+1}) + \frac{1}{\lambda_k}\,\Pi_{X^{k+1}}\!\left(X^{k+1} - X^{k}\right) = 0, \qquad (19)$$
which follows by applying (5) and (6) to the Euclidean gradient $\nabla F(Y) + \frac{1}{\lambda_k}(Y - X^{k})$ of the objective of (10). However, since Algorithm 1 terminates at the $k$-th iteration, we have that $X^{k+1} = X^{k}$, which directly implies that $\Pi_{X^{k+1}}(X^{k+1} - X^{k}) = 0$. By substituting this fact in (19), we obtain the desired result. □
The rest of this section is devoted to studying the asymptotic behavior of Algorithm 1 for infinite sequences $\{X^{k}\}$ generated by our approach, since otherwise, Proposition 1 says that Algorithm 1 returns a stationary point of Problem (1). The lemma below provides two key theoretical results.
Lemma 2. Let $\{X^{k}\}$ be an infinite sequence generated by Algorithm 1. Then, we have
1. $\{F(X^{k})\}$ is a convergent sequence.
2. The residual sequence $\{\|X^{k+1} - X^{k}\|_F\}$ converges to zero.
Proof of Lemma 2. It follows from Lemma 1 that
$$F(X^{k+1}) \le F(X^{k}) - \frac{1}{2\lambda_k}\|X^{k+1} - X^{k}\|_F^2 \le F(X^{k}). \qquad (20)$$
Therefore, $\{F(X^{k})\}$ is a monotonically decreasing sequence. Now, since the Stiefel manifold is a compact set and $F$ is a continuous function, we obtain that $F$ has a maximum and a minimum on $\mathrm{St}(n,p)$. Therefore, $\{F(X^{k})\}$ is bounded, and then $\{F(X^{k})\}$ is a convergent sequence, which proves the first part of the lemma.
On the other hand, by rearranging Inequality (20), we arrive at
$$\|X^{k+1} - X^{k}\|_F^2 \le 2\lambda_k\left(F(X^{k}) - F(X^{k+1})\right) \le 2\lambda_{\max}\left(F(X^{k}) - F(X^{k+1})\right). \qquad (21)$$
Applying limits in (21) and using the first part of this lemma, we obtain
$$\lim_{k\to\infty}\|X^{k+1} - X^{k}\|_F = 0.$$
□
Now we are ready to prove the global convergence of Algorithm 1, which is established in the theorem below.
Theorem 1. Let $\{X^{k}\}$ be an infinite sequence generated by Algorithm 1. Then $\lim_{k\to\infty}\|\mathrm{grad}\,F(X^{k})\|_F = 0$.
Proof of Theorem 1. Firstly, let us denote by
. Now, notice that
By applying Lemma 2 in (
22), we obtain
It follows from (
24), (
7), the Cauchy–Schwarz inequality, and (
18) that
which implies that
Finally, taking limits in (
26) and considering (
23) and Lemma 2, we obtain
which completes the proof. □
We now turn to proving the convergence to stationary points of the inexact version of the PPA-St approach.
Theorem 2. Let $\{X^{k}\}$ be an infinite sequence generated by Algorithm 2. Then $\lim_{k\to\infty}\|\mathrm{grad}\,F(X^{k})\|_F = 0$.
Proof of Theorem 2. From the Armijo condition (16), Step 7, and the definition of $F_k$, we have
Additionally, the Armijo condition (16) clearly implies that, for fixed $k$, the sequence of inner objective values $\{F_k(V_i)\}_{i}$ is non-increasing. Combining this result with inequality (27) and Step 2, we arrive at
which leads to
Therefore, the sequence of objective values is convergent. Moreover, we get that $\lim_{k\to\infty}\|X^{k+1} - X^{k}\|_F = 0$.
On the other hand, notice that
The second term on the right-hand side of (
28) verifies that
By rearranging (
28), we get
Applying the norm to both sides of the above equality and considering that $\|\mathrm{grad}\,F_k(X^{k+1})\|_F \le \epsilon_k$ and (29), we arrive at
Taking limits in the above relation, we conclude that
proving the theorem. □
Corollary 1. Let $\{X^{k}\}$ be an infinite sequence of iterates generated by Algorithm 1 or Algorithm 2. Then every accumulation point of $\{X^{k}\}$ is a critical point of $F$ in the Riemannian sense.
Proof. Let $\{X^{k}\}$ be an infinite sequence generated by Algorithm 1 (or Algorithm 2). Clearly, the set of all accumulation points of the sequence $\{X^{k}\}$ is non-empty, since $\{X^{k}\}\subset\mathrm{St}(n,p)$ and $\mathrm{St}(n,p)$ is bounded. Let $X^{*}$ be an accumulation point of $\{X^{k}\}$; that is, there is a subsequence $\{X^{k_j}\}$ converging to $X^{*}$. Since $\mathrm{St}(n,p)$ is compact and $X^{k_j}$ is feasible for all $j$, we have $X^{*}\in\mathrm{St}(n,p)$. Applying Theorem 1 (or Theorem 2) and considering that $\mathrm{grad}\,F(\cdot)$ is a continuous function, we arrive at
$$\mathrm{grad}\,F(X^{*}) = \lim_{j\to\infty}\mathrm{grad}\,F(X^{k_j}) = 0,$$
proving the corollary. □
5. Computational Experiments
In this section, we present some numerical results to verify the practical performance of the proposed algorithm. We test our Algorithm 2 on academic problems, considering linear eigenvalue problems and the minimization of sums of heterogeneous quadratic functions. We coded our simulations in MATLAB (version 2017b) with double precision on a machine with an Intel(R) Core(TM) i7-4770 CPU@3.40 GHz, a 1 TB HD, and 16 GB RAM. We compare our approach with the Riemannian gradient method based on the Cayley transform [6] (OptStiefel) and with three Riemannian conjugate gradient methods (RCG1a, RCG1b, and RCG1b+ZH) developed in [13]. (The OptStiefel MATLAB code is available at https://github.com/wenstone/OptM, accessed on 10 May 2023, and the Riemannian conjugate gradient methods RCG1a, RCG1b, and RCG1b+ZH can be downloaded from http://www.optimization-online.org/DB_HTML/2016/09/5617.html, accessed on 10 May 2023.) In addition, we stop all the methods when the algorithms find a matrix $X\in\mathrm{St}(n,p)$ such that $\|\mathrm{grad}\,F(X)\|_F \le 1\times 10^{-4}$. In all the tests, we run Algorithm 2 with the same proximal parameter $\lambda_k$ for all $k$. The implementation of our algorithm is available at https://www.mathworks.com/matlabcentral/fileexchange/128644-proximal-point-algorithm-on-the-stiefel-manifold, accessed on 10 May 2023.
In the rest of this section, we use the following notation: Time, Nitr, Grad, Feasi, and Fval denote the average total computing time in seconds, the average number of iterations, the average residual $\|\mathrm{grad}\,F(X)\|_F$, the average feasibility error $\|X^{\top}X - I_p\|_F$, and the average final cost function value, respectively. In all the experiments presented below, we solve ten independent instances for each pair $(n,p)$ and then report these mean values. For all of the computational tests, we randomly generate the feasible starting point $X^{0}$ in MATLAB.
5.1. The Linear Eigenvalue Problem
In order to illustrate the numerical behavior of our method in computing some eigenvalues of a given symmetric matrix $A\in\mathbb{R}^{n\times n}$, we present a numerical experiment taken from [8]. Let $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_n$ be the eigenvalues of $A$. The $p$-largest eigenvalue problem can be mathematically formulated as
$$\max_{X\in\mathbb{R}^{n\times p}} \mathrm{Tr}(X^{\top}AX) \quad \text{s.t.}\quad X^{\top}X = I_p. \qquad (30)$$
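A minimal MATLAB sketch of this test problem is shown below; the random symmetric matrix A is an illustrative stand-in for the matrices used in the experiments, and the reference value is computed with eig. Since PPA-St is stated as a minimization method, the sketch works with F(X) = -Tr(X'AX).

```matlab
% p-largest eigenvalue problem (30) written as a minimization over St(n,p).
n = 500; p = 6;
rng(4);
B = randn(n); A = (B + B')/2;             % assumed random symmetric matrix
F  = @(X) -trace(X'*A*X);                 % cost to be minimized
dF = @(X) -2*A*X;                         % Euclidean gradient
mu = sort(eig(A),'descend');
fprintf('reference optimal value       : %.6f\n', -sum(mu(1:p)));
[X0,~] = qr(randn(n,p),0);                % feasible starting point
fprintf('value at a random feasible X0 : %.6f\n', F(X0));
```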
Firstly, we illustrate the numerical behavior of Algorithm 2 by varying the proximal parameter $\lambda$. Specifically, we conduct an experiment where we keep $\lambda_k = \lambda$ constant throughout the iterative process and solve an instance of Problem (30), generated with MATLAB's random number generators, for several values of $\lambda$. In Figure 1, we present the convergence history of Algorithm 2 for all considered values of $\lambda$. In addition, Table 1 contains the numerical results associated with each value of $\lambda$. From Figure 1 and Table 1, we clearly see that Algorithm 2 requires a larger number of iterations to achieve the desired accuracy in the gradient norm for small values of $\lambda$.
Now, we consider the following experiment design: given the pair $(n,p)$, we randomly generate dense symmetric matrices $A$ assembled from a matrix whose entries are sampled from a standard Gaussian distribution. Table 2 contains the computational results associated with varying $p$ for fixed $n$. As shown in Table 2, all of the methods obtained estimates of a solution of Problem (30) with the required precision. Furthermore, we clearly observe that as $p$ approaches $n$, our proposal converges more quickly, even in terms of computational time, than the rest of the methods.
5.2. Heterogeneous Quadratic Minimization
In this subsection, we consider the minimization of sums of heterogeneous quadratic functions over the Stiefel manifold; this problem is formulated as
$$\min_{X\in\mathrm{St}(n,p)} \sum_{i=1}^{p} x_i^{\top}A_i x_i,$$
where $A_1,\ldots,A_p$ are $n$-by-$n$ symmetric matrices and $x_i$ denotes the $i$-th column of $X$. An illustrative MATLAB sketch of this objective and its Euclidean gradient is given further below. For benchmarking, we consider two structures for the data matrices $A_i$, obtained by using the following MATLAB commands:
Structure I: , for all .
Structure II: , for all ,
where the auxiliary matrices appearing in both structures are generated randomly in MATLAB. This experiment design was taken from [13]. The numerical results concerning Structures I and II are contained in
Table 3 and
Table 4, respectively. From
Table 3, we see that the most efficient method both in terms of the number of iterations and in total computational time was our procedure. The second most efficient method was OptStiefel. However, the numerical performance of PPASt and OptStiefel is very similar.
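As announced above, the following MATLAB sketch shows how the heterogeneous quadratic objective of this subsection and its Euclidean gradient can be evaluated; the random symmetric matrices A_i below are only an illustrative stand-in for Structures I and II.

```matlab
% Heterogeneous quadratic cost F(X) = sum_i x_i'*A_i*x_i and its gradient,
% whose i-th column is 2*A_i*x_i.
n = 200; p = 5;
rng(5);
As = cell(p,1);
for i = 1:p
    B = randn(n); As{i} = (B + B')/2;     % assumed symmetric data matrices
end
F  = @(X) sum(arrayfun(@(i) X(:,i)'*As{i}*X(:,i), 1:p));
dF = @(X) cell2mat(arrayfun(@(i) 2*As{i}*X(:,i), 1:p, 'UniformOutput', false));
[X0,~] = qr(randn(n,p),0);                % feasible starting point
fprintf('F(X0) = %.4f, ||dF(X0)||_F = %.4f\n', F(X0), norm(dF(X0),'fro'));
```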
On the other hand, the results related to Structure II show that OptStiefel is slightly superior to PPASt in terms of computational time. However, our PPASt approach was much more efficient than the three Riemannian conjugate gradient methods and took fewer iterations to reach the desired tolerance than the rest of the methods. In addition, to illustrate the numerical efficiency of our proposal, we designed two randomized experiments in which the initial points were built with MATLAB and the heterogeneous quadratic minimization problems were generated according to Structures I and II described above. In Figure 2 and Figure 3, we show the convergence history, in terms of iterations and time (in seconds), of all the methods for these specific experiments. From these figures, we clearly see that our proximal point method converges faster (in terms of iterations) to a local minimizer than the rest of the methods, while in terms of computational time, the proposal achieves competitive results with respect to the other methods. In particular, PPASt is the most efficient method for problems with Structure I.
5.3. The Joint Diagonalization Problem
Now, we evaluate the performance of our method on non-quadratic objective functions. In particular, we consider the joint diagonalization problem [39], mathematically formulated as
$$\max_{X\in\mathrm{St}(n,p)} \sum_{l=1}^{N} \left\|\mathrm{ddiag}\!\left(X^{\top}A_l X\right)\right\|_F^2,$$
where the data matrices $A_1,\ldots,A_N$ are $n$-by-$n$ symmetric matrices and $\mathrm{ddiag}(M)$ is the diagonal matrix obtained from $M$ by replacing the non-diagonal elements of $M$ with zero.
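The following MATLAB sketch evaluates the joint diagonalization objective and its gradients for randomly generated symmetric matrices A_l; these matrices only illustrate the structure of the experiment, and, since the problem is a maximization, PPA-St would be applied to the negative of this cost.

```matlab
% Joint diagonalization cost F(X) = sum_l ||ddiag(X'*A_l*X)||_F^2, its
% Euclidean gradient sum_l 4*A_l*X*ddiag(X'*A_l*X), and the Riemannian
% gradient obtained through Formula (5).
n = 100; p = 4; N = 10;
rng(6);
As = cell(N,1);
for l = 1:N
    B = randn(n); As{l} = (B + B')/2;     % assumed symmetric data matrices
end
ddiag = @(M) diag(diag(M));               % zero out the off-diagonal entries
[X,~] = qr(randn(n,p),0);                 % feasible test point
Fval = 0; G = zeros(n,p);
for l = 1:N
    D    = ddiag(X'*As{l}*X);
    Fval = Fval + norm(D,'fro')^2;        % objective term
    G    = G + 4*As{l}*X*D;               % Euclidean gradient term
end
gradF = G - X*(G'*X);                     % Riemannian gradient, Formula (5)
fprintf('F(X) = %.4f, ||grad F(X)||_F = %.4f\n', Fval, norm(gradF,'fro'));
```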
In order to test all the algorithms, we randomly generate each matrix $A_l$ by combining a random matrix, whose entries are generated independently from a Gaussian distribution, with a diagonal term; here, given a vector $v$, $\mathrm{diag}(v)$ denotes the diagonal matrix of size $n$-by-$n$ whose $i$-th diagonal entry is exactly the $i$-th element of the vector $v$. This experiment was taken from [13]. We include in the numerical comparisons the Riemannian proximal gradient method (ManPG) developed in [7]. For this experiment, we generate ten independent instances for different values of $n$ and $p$ and report the mean values of Time, Nitr, Grad, Feasi, and Fval for each method. Additionally, we impose the same maximum number of iterations on all the algorithms, and we use a common tolerance for the termination rule associated with the gradient norm. The numerical results associated with this test are presented in Table 5.
From Table 5, we see that the ManPG method obtains very poor performance, possibly because the step-size used to initialize the backtracking process is reset to a fixed value in each iteration, which can lead the method to carry out many re-orthogonalizations and function evaluations per iteration. Furthermore, we notice that our proposal was the most efficient both in terms of total computational time and in the number of iterations. In fact, in terms of CPU time, the OptStiefel method is better than PPASt only on a few of the instances, while for the rest of the instances, our PPASt outperforms the other methods.