Article

An Efficient Algorithm for Convex Biclustering

Jie Chen and Joe Suzuki *
Graduate School of Engineering Science, Osaka University, Osaka 560-0043, Japan
* Author to whom correspondence should be addressed.
Mathematics 2021, 9(23), 3021; https://doi.org/10.3390/math9233021
Submission received: 18 October 2021 / Revised: 15 November 2021 / Accepted: 22 November 2021 / Published: 25 November 2021
(This article belongs to the Special Issue Mathematics, Statistics and Applied Computational Methods)

Abstract
We consider biclustering, which clusters both samples and features, and propose an efficient convex biclustering procedure. The convex biclustering algorithm (COBRA) solves the standard convex clustering problem, which contains a non-differentiable function, twice in each cycle. We instead convert the original optimization problem to a differentiable one and improve another approach based on the augmented Lagrangian method (ALM). Our proposed method combines the basic procedures in the ALM with the accelerated gradient descent method (Nesterov's accelerated gradient method), which attains an $O(1/k^2)$ convergence rate. It uses only first-order gradient information, and its efficiency is not strongly influenced by the tuning parameter $\lambda$. This advantage allows users to quickly iterate over various tuning parameters $\lambda$ and explore the resulting changes in the biclustering solutions. The numerical experiments demonstrate that our proposed method has high accuracy and is much faster than the currently known algorithms, even for large-scale problems.

1. Introduction

By clustering, such as k-means clustering [1] and hierarchical clustering [2,3], we usually mean dividing $N$ samples, each consisting of $p$ covariate values, into several categories, where $N, p \ge 1$.
In this paper, we consider biclustering [4], which is an extended notion of clustering. In biclustering, we divide both $\{1, \ldots, N\}$ and $\{1, \ldots, p\}$ simultaneously based on the data. If we are given a data matrix in $\mathbb{R}^{N \times p}$, then the rows and columns within a shared group exhibit similar characteristics. For example, given a gene expression data matrix with genes as columns and samples as rows, biclustering detects the submatrices that represent the cooperative behavior of a group of genes corresponding to a group of samples [5]. Figure 1 illustrates the intuitive difference between standard clustering and biclustering. In recent years, biclustering has become a ubiquitous data-mining technique with varied applications, such as text mining, recommendation systems, and bioinformatics. Comprehensive surveys of biclustering are given in [6,7,8].
However, as noted in [6], biclustering is an NP-hard problem. Thus, the results may vary significantly with different initializations. Moreover, some conventional biclustering models suffer from poor performance due to non-convexity, which may return locally optimal solutions. In order to avoid such an inconvenience, Chi et al. [9] proposed convex biclustering by reformulating the problem as a convex one using the fused lasso [10] concept.
In convex clustering [9,11,12,13], given a data matrix $X \in \mathbb{R}^{N \times p}$ and $\lambda > 0$, we compute a matrix $U$ of the same size as $X$. Let $X_{i\cdot}$ and $U_{i\cdot}$ be the $i$-th rows of $X$ and $U$, let $X_{\cdot j}$ and $U_{\cdot j}$ be the $j$-th columns of $X$ and $U$, let $\|X - U\|_F^2$ denote the sum of squares of the $Np$ elements, and let $\|U_{i\cdot} - U_{j\cdot}\|_2$ and $\|U_{\cdot m} - U_{\cdot n}\|_2$ be the $\ell_2$ norms of the $p$- and $N$-dimensional difference vectors, respectively. Convex clustering finds the $U \in \mathbb{R}^{N \times p}$ that minimizes a weighted sum of $\|X - U\|_F^2$ and $\{\lambda \|U_{i\cdot} - U_{j\cdot}\|_2\}_{i \ne j}$ (it may alternatively be formulated as minimizing a weighted sum of $\|X - U\|_F^2$ and $\{\lambda \|U_{\cdot m} - U_{\cdot n}\|_2\}_{m \ne n}$). If $X_{i\cdot}$ and $X_{j\cdot}$ share $U_{i\cdot} = U_{j\cdot}$, then they are in the same group w.r.t. $\{1, \ldots, N\}$. On the other hand, convex biclustering finds the $U \in \mathbb{R}^{N \times p}$ that minimizes a weighted sum of $\|X - U\|_F^2$, $\{\lambda \|U_{i\cdot} - U_{j\cdot}\|_2\}_{i \ne j}$, and $\{\lambda \|U_{\cdot m} - U_{\cdot n}\|_2\}_{m \ne n}$.
Convex biclustering achieves checkerboard-like biclusters by penalizing both the rows and columns of $U$. When $\lambda$ (the tuning parameter for $\{\|U_{i\cdot} - U_{j\cdot}\|_2\}_{i \ne j}$ and $\{\|U_{\cdot m} - U_{\cdot n}\|_2\}_{m \ne n}$) is zero, each $(i, j)$ occupies a unique bicluster $\{(i, j)\}$, and $x_{ij} = u_{ij}$ for $i = 1, \ldots, N$ and $j = 1, \ldots, p$, where $X = (x_{ij})$ and $U = (u_{ij})$. As $\lambda$ increases, the biclusters begin to fuse. For sufficiently large $\lambda$, all $(i, j)$ merge into one single bicluster $\{(i, j) \mid i = 1, \ldots, N,\ j = 1, \ldots, p\}$. The convex formulation guarantees a globally optimal solution and demonstrates superior performance to competing approaches. Chi et al. [9] reported that convex biclustering performs better than the dynamic tree-cutting algorithm [14] and the sparse biclustering algorithm [15] in their experiments.
Nevertheless, despite these advantages, convex biclustering has not yet gained widespread popularity due to its intensive computation. On the one hand, the main challenge of solving the optimization problem is the two fused penalty terms, which are indecomposable and non-differentiable. These properties make the problem harder to solve: many splitting methods for indecomposable problems are complicated and create many subproblems, and techniques such as the subgradient method for non-differentiable problems converge slowly [16,17]. Moreover, it is difficult to find the optimal tuning parameter $\lambda$, because we need to solve optimization problems over a sequence of parameters $\lambda$ and select the one matching the specific demand of the researcher. Hence, we need a fast way to solve the problems over the sequence of parameters $\lambda$. On the other hand, with the increased demand for biclustering techniques, convex biclustering faces large-scale data as the volume and complexity of data grow. Above all, it is necessary to propose an efficient algorithm to solve the convex biclustering problem.
There are only a few algorithms for solving the problem in the literature. Chi et al. [9] proposed the convex biclustering algorithm (COBRA), which uses a Dykstra-like proximal algorithm [18] to solve the convex biclustering problem. Weylandt [19] proposed using the alternating direction method of multipliers (ADMM) [20,21] and its variant, the generalized ADMM [22], to solve the problem.
However, COBRA yields subproblems, including the convex clustering problem, which requires expensive computations for large-scale problems due to the high per-iteration cost [23,24]. Essentially, COBRA is a splitting method that separately solves a composite optimization problem containing three terms. Additionally, it is sensitive to the tuning parameter $\lambda$. Therefore, obtaining the solutions under a wide range of parameters $\lambda$ takes time, which is not feasible for broad applications and different user demands. ADMM generally solves the problem by breaking it into smaller pieces and updating the variables alternately, but at the same time, it also introduces several subproblems that may cost much time. To be more specific, the ADMM proposed by Weylandt [19] requires solving a Sylvester equation in the step of updating the variable $U$. Solving the Sylvester equation, in turn, requires a Schur decomposition based on the numerical method in [25], which is complicated and time consuming. Additionally, it is known that ADMM exhibits $O(1/k)$ convergence in general, where $k$ is the number of iterations [26]. It often takes a long time to achieve relatively high precision [27], which is not feasible in some highly accurate applications. For example, gene expression data contain a huge amount of information (the feature dimension usually exceeds 1000), and COBRA and ADMM do not scale well for such large-scale problems. Overall, the above algorithms show weak performance, which motivates us to combine current algorithms to efficiently solve the convex biclustering problem, as in [28].
This paper proposes an efficient algorithm with simple subproblems and a fast convergence rate for the convex biclustering problem. Rather than updating each variable alternately, as in ADMM, we use the augmented Lagrangian method (ALM) to update the primal variables simultaneously. In this way, we can transform the optimization problem into a differentiable one, solve it via an efficient gradient descent method, and further simplify the subproblems. Our proposed method is motivated by the work [29], in which the authors presented a way to convert the augmented Lagrangian function into a composite optimization problem that can be solved by the proximal gradient method [30]. Applying the process twice to handle the two fused penalties $\{\lambda \|U_{i\cdot} - U_{j\cdot}\|_2\}_{i \ne j}$ and $\{\lambda \|U_{\cdot m} - U_{\cdot n}\|_2\}_{m \ne n}$, we obtain a differentiable problem from the augmented Lagrangian function. Then, we apply Nesterov's accelerated gradient method to solve the differentiable problem, which has an $O(1/k^2)$ global convergence rate.
Our main contributions are as follows:
  • We propose an efficient algorithm to solve the convex biclustering model for large-scale N and p. The algorithm is a first-order method with simple subproblems. It only requires calculating the matrix multiplications and simple proximal operators, while the ADMM approaches require matrix inversion.
  • Unlike the existing approaches, our proposed method does not require much more computation when the tuning parameter $\lambda$ is large, which makes it easier to obtain biclustering results for several $\lambda$ values.
The remainder of this paper is organized as follows. In Section 2, we provide the preliminaries used in the paper and introduce the convex biclustering problem. In Section 3, we describe our proposed algorithm for solving the convex biclustering model. After that, we conduct numerical experiments to evaluate the performance of our algorithm in Section 4.
Notation: In this paper, we use $\|x\|_p$ to denote the $\ell_p$ norm of a vector $x \in \mathbb{R}^d$, $\|x\|_p := (\sum_{i=1}^d |x_i|^p)^{1/p}$ for $p \in [1, \infty)$, and $\|x\|_\infty := \max_i |x_i|$. For a matrix $X \in \mathbb{R}^{p \times q}$, $\|X\|_F$ denotes the Frobenius norm, $\|X\|_2$ denotes the spectral norm, and $\|X\|_1 := \sum_{i=1}^p \sum_{j=1}^q |x_{ij}|$ unless specified otherwise.

2. Preliminaries

In this section, we provide the background for understanding the proposed method in Section 3. In particular, we introduce the ADMM, the ALM, Nesterov's accelerated gradient method (NAGM), and convex biclustering.
We say that a differentiable function $f: \mathbb{R}^n \to \mathbb{R}$ has a Lipschitz-continuous gradient if there exists $L > 0$ (the Lipschitz constant) such that
$$\|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2, \quad \forall x, y \in \mathbb{R}^n. \tag{1}$$
We define the conjugate of a function $f: \mathbb{R}^n \to \mathbb{R}$ by
$$f^*(y) := \sup_{x \in \operatorname{dom} f} \left( y^T x - f(x) \right),$$
where $\operatorname{dom} f \subseteq \mathbb{R}^n$ is the domain of $f$. The conjugate $f^*$ is closed (i.e., $\{x \in \operatorname{dom}(f^*) \mid f^*(x) \le \alpha\}$ is a closed set for any $\alpha \in \mathbb{R}$) and convex. It is known that $(f^*)^* = f$ when $f$ is closed and convex, and that Moreau's decomposition [31] is available: let $f: \mathbb{R}^n \to \mathbb{R}$ be closed and convex; then, for any $x \in \mathbb{R}^n$ and $\gamma > 0$, we have
$$\operatorname{prox}_{\gamma f}(x) + \gamma \operatorname{prox}_{\gamma^{-1} f^*}(\gamma^{-1} x) = x, \tag{2}$$
where $\operatorname{prox}_f : \mathbb{R}^n \to \mathbb{R}^n$ is the proximal operator defined by
$$\operatorname{prox}_f(x) := \arg\min_{y \in \mathbb{R}^n} \left\{ \frac{1}{2} \|x - y\|_2^2 + f(y) \right\}. \tag{3}$$
The relation (2) is derived from $(f^*)^* = f$ and the definition (3) [31].
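As a concrete illustration of (2) and (3), the following minimal NumPy sketch (ours, not part of the paper; the helper names `prox_l2` and `proj_ball` are our own) evaluates the proximal operator of $f(y) = w\|y\|_2$ in closed form and numerically checks Moreau's decomposition, using the fact that $f^*$ is the indicator of the $\ell_2$ ball of radius $w$, whose proximal operator is the projection onto that ball.

```python
import numpy as np

def prox_l2(x, w):
    # prox of f(y) = w * ||y||_2 (block soft-thresholding)
    nrm = np.linalg.norm(x)
    return np.zeros_like(x) if nrm <= w else (1.0 - w / nrm) * x

def proj_ball(x, w):
    # projection onto {y : ||y||_2 <= w}, i.e., prox of f* (an indicator function)
    nrm = np.linalg.norm(x)
    return x if nrm <= w else (w / nrm) * x

rng = np.random.default_rng(0)
x, w, gamma = rng.normal(size=5), 0.7, 2.0
# Moreau: prox_{gamma f}(x) + gamma * prox_{f*/gamma}(x / gamma) = x
lhs = prox_l2(x, gamma * w) + gamma * proj_ball(x / gamma, w)
print(np.allclose(lhs, x))  # True
```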

2.1. ADMM and ALM

In this subsection, we introduce the general optimization procedures of ADMM and ALM.
Let $f, g: \mathbb{R}^n \to \mathbb{R}$ and $h: \mathbb{R}^p \to \mathbb{R}$ be convex. Assume that $f$ is differentiable, while $g$ and $h$ are not necessarily differentiable. We consider the following optimization problem:
$$\min_{x, y} \ f(x) + h(y) + g(x) \quad \text{subject to} \quad Ax = y, \tag{4}$$
with variables $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^p$, and a matrix $A \in \mathbb{R}^{p \times n}$. To this end, we define the augmented Lagrangian function as
$$L_\nu(x, y, \lambda) := f(x) + g(x) + h(y) + \langle \lambda, Ax - y \rangle + \frac{\nu}{2} \|Ax - y\|_2^2, \tag{5}$$
where $\nu > 0$ is an augmented Lagrangian parameter and $\lambda \in \mathbb{R}^p$ is the vector of Lagrangian multipliers.
ADMM is a general procedure to find the solution to the problem (4) by iterating
$$x^{k+1} := \arg\min_x L_\nu(x, y^k, \lambda^k), \quad y^{k+1} := \arg\min_y L_\nu(x^{k+1}, y, \lambda^k), \quad \lambda^{k+1} := \lambda^k + \nu (A x^{k+1} - y^{k+1}),$$
given the initial values $y^1$ and $\lambda^1$.
What we mean by the ALM [32,33,34] is to minimize the augmented Lagrangian function (5) w.r.t. the variables $x$ and $y$ simultaneously given a $\lambda$ value, i.e., we iterate the following steps:
$$(x^{k+1}, y^{k+1}) := \arg\min_{x, y} L_\nu(x, y, \lambda^k), \tag{6}$$
$$\lambda^{k+1} := \lambda^k + \nu (A x^{k+1} - y^{k+1}). \tag{7}$$
Shimmura and Suzuki [29] considered minimizing the function $\phi: \mathbb{R}^n \to \mathbb{R}$,
$$\phi(x) := f(x) + \min_y \left\{ h(y) + \langle \lambda, Ax - y \rangle + \frac{\nu}{2} \|Ax - y\|_2^2 \right\},$$
plus the non-differentiable function $g(x)$ over $x \in \mathbb{R}^n$, replacing the joint minimization over $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^p$ in (6).
Lemma 1
([29], Theorem 1). The function $\phi(x)$ is differentiable, and its gradient is
$$\nabla \phi(x) = \nabla f(x) + A^T \operatorname{prox}_{\nu h^*}(\nu A x + \lambda).$$
By Lemma 1, the minimization in (6) can be regarded as the composite optimization problem with the differentiable function ϕ ( x ) and the non-differentiable function g ( x ) . Therefore, it is feasible to use the proximal gradient method to update the variable x, such as the fast iterative shrinkage-thresholding algorithm (FISTA) [30].

2.2. Nesterov’s Accelerated Gradient Method

Nesterov [35] proposed a variant of the gradient descent method for Lipschitz-differentiable functions. It has an $O(1/k^2)$ convergence rate, while the (traditional) gradient descent method has $O(1/k)$ [35]. Considering the minimization of a convex and differentiable function $F(x)$ whose gradient has a Lipschitz constant $L$ as in (1), NAGM is described in Algorithm 1.
Algorithm 1 NAGM.
Input: Lipschitz constant $L$, initial value $x^0 = y^0$, $t_1 = 1$.
While $k < k_{\max}$ (until convergence) do
  1: $x^{k+1} = y^k - \frac{1}{L} \nabla F(y^k)$
  2: $t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}$
  3: $y^{k+1} = x^{k+1} + \frac{t_k - 1}{t_{k+1}} (x^{k+1} - x^k)$
  4: $k = k + 1$
End while
Algorithm 1 replaces the gradient descent $y^{k+1} = y^k - \frac{1}{L} \nabla F(y^k)$ by Steps 1 to 3: Step 1 executes the gradient descent to obtain $x^{k+1}$ from $y^k$; Steps 2 and 3 calculate the new $y^{k+1}$ based on the previous $x^k$ and $x^{k+1}$; then we return to the gradient descent in Step 1. NAGM assumes that $F$ is differentiable, while FISTA [30], an accelerated version of ISTA [30], deals with non-differentiable $F$ using proximal gradient descent (see Table 1).
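For reference, a minimal NumPy sketch of Algorithm 1 for a generic differentiable objective is given below; the quadratic test problem and the function name `nagm` are our own illustrative choices, not part of the paper.

```python
import numpy as np

def nagm(grad, L, x0, max_iter=1000, tol=1e-8):
    """Nesterov's accelerated gradient method (Algorithm 1)."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(max_iter):
        x_next = y - grad(y) / L                              # Step 1: gradient step from y^k
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0     # Step 2
        y = x_next + (t - 1.0) / t_next * (x_next - x)        # Step 3: momentum
        if np.linalg.norm(x_next - x) <= tol:
            return x_next
        x, t = x_next, t_next
    return x

# Example: minimize 0.5 * ||A z - b||_2^2; grad = A^T (A z - b), L = lambda_max(A^T A)
rng = np.random.default_rng(1)
A, b = rng.normal(size=(30, 10)), rng.normal(size=30)
L = np.linalg.eigvalsh(A.T @ A).max()
z = nagm(lambda z: A.T @ (A @ z - b), L, np.zeros(10))
```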

2.3. Convex Biclustering

We consider the convex biclustering problem in a general setting. Suppose we have a data matrix $X = (x_{ij})$ consisting of $N$ observations $i = 1, \ldots, N$ w.r.t. $p$ features $j = 1, \ldots, p$. Our task is to assign each observation to one of the non-overlapping row clusters $C_1, \ldots, C_R \subseteq \{1, \ldots, N\}$ and each feature to one of the non-overlapping column clusters $D_1, \ldots, D_K \subseteq \{1, \ldots, p\}$. We assume that the clusters $C_1, \ldots, C_R$ and $D_1, \ldots, D_K$ and the values of $R$ and $K$ are not known a priori.
More precisely, the convex biclustering problem in this paper is formulated as follows:
$$\min_{U \in \mathbb{R}^{N \times p}} \ \frac{1}{2} \|X - U\|_F^2 + \lambda \left( \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \omega_{ij} \|U_{i\cdot} - U_{j\cdot}\|_2 + \sum_{m=1}^{p-1} \sum_{n=m+1}^{p} \tilde{\omega}_{mn} \|U_{\cdot m} - U_{\cdot n}\|_2 \right), \tag{8}$$
where $U_{i\cdot}$ and $U_{\cdot j}$ are the $i$-th row and $j$-th column of $U \in \mathbb{R}^{N \times p}$. Chi et al. [9] suggested the following requirement for the weight selection:
$$\omega_{ij} := \mathbb{1}^{k}_{i,j} \exp\left( -\phi \|x_{i\cdot} - x_{j\cdot}\|_2^2 \right)$$
and
$$\tilde{\omega}_{mn} := \mathbb{1}^{k}_{m,n} \exp\left( -\tilde{\phi} \|x_{\cdot m} - x_{\cdot n}\|_2^2 \right),$$
where $\mathbb{1}^{k}_{i,j}$ is 1 if $j$ belongs to the $k$-nearest neighbors of $i$ and 0 otherwise, $\mathbb{1}^{k}_{m,n}$ is defined similarly (the parameter $k$ should be specified beforehand), and $x_{i\cdot}$ and $x_{\cdot j}$ are the $i$-th row and $j$-th column of the matrix $X$. They suggested choosing the constants $\phi$ and $\tilde{\phi}$ so that the sums $\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \omega_{ij}$ and $\sum_{m=1}^{p-1} \sum_{n=m+1}^{p} \tilde{\omega}_{mn}$ equal $N^{-1/2}$ and $p^{-1/2}$, respectively.
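The following sketch shows one way to build such k-nearest-neighbor Gaussian weights for the rows in NumPy. The bisection on $\phi$ that enforces the suggested weight sum is our own assumption about how the recommendation could be realized; it is not taken from the paper or from [9].

```python
import numpy as np

def knn_gaussian_weights(X, k=5, target_sum=None):
    """Row weights w_ij = 1{kNN pair} * exp(-phi * ||x_i. - x_j.||^2), phi tuned by bisection."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # squared pairwise distances
    order = np.argsort(d2, axis=1)[:, 1:k + 1]                 # k nearest neighbors of each row
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, order[i]] = True
    mask = mask | mask.T                                       # symmetrize the kNN graph
    iu = np.triu_indices(n, 1)                                  # pairs (i, j) with i < j

    def weights(phi):
        return np.exp(-phi * d2[iu]) * mask[iu]

    if target_sum is None:
        target_sum = 1.0 / np.sqrt(n)                           # suggested sum N^{-1/2}
    lo, hi = 0.0, 1.0
    while weights(hi).sum() > target_sum:                       # bracket phi
        hi *= 2.0
    for _ in range(60):                                         # bisection on phi
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if weights(mid).sum() > target_sum else (lo, mid)
    return iu, weights((lo + hi) / 2.0)
```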
Chi et al. [9] proposed COBRA to solve the problem (8). Essentially, COBRA alternately solves the standard convex clustering problems for the rows and the columns,
$$\min_{U \in \mathbb{R}^{N \times p}} \ \frac{1}{2} \|X - U\|_F^2 + \lambda \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \omega_{ij} \|U_{i\cdot} - U_{j\cdot}\|_2 \tag{9}$$
and
$$\min_{U \in \mathbb{R}^{N \times p}} \ \frac{1}{2} \|X - U\|_F^2 + \lambda \sum_{m=1}^{p-1} \sum_{n=m+1}^{p} \tilde{\omega}_{mn} \|U_{\cdot m} - U_{\cdot n}\|_2, \tag{10}$$
until the solution converges. However, both optimization problems contain a non-differentiable $\ell_2$ norm term, and solving the convex clustering problems (9) and (10) takes much time [23]. Later, ADMM-based approaches considered alternating-variable procedures and outperformed COBRA for large parameter $\lambda$ [19].

3. The Proposed Method and Theoretical Analysis

In this section, we first show that all the terms in the augmented Lagrangian function of (8) become differentiable w.r.t. $U$ after introducing two dual variables. Therefore, we use NAGM rather than the FISTA used in [29], which assumes that the objective function contains a non-differentiable term.
In order to make the notation clear, we reformulate the problem (8). Let $\epsilon_1$ and $\epsilon_2$ be the sets $\{(i, j) \mid \omega_{ij} > 0, i < j\}$ and $\{(m, n) \mid \tilde{\omega}_{mn} > 0, m < n\}$, respectively, and denote the cardinality of a set $S$ by $|S|$. We define the matrices $C \in \mathbb{R}^{|\epsilon_1| \times N}$ and $D \in \mathbb{R}^{p \times |\epsilon_2|}$ by
$$C_{l,i} = 1, \quad C_{l,j} = -1, \quad C_{l,k} = 0 \ (k \ne i, j), \qquad l = (i, j) \in \epsilon_1,$$
so that each row of $C$ contains one $+1$ and one $-1$, e.g.,
$$C = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 & 0 \\ 1 & 0 & -1 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 & -1 \end{pmatrix} \in \mathbb{R}^{|\epsilon_1| \times N},$$
and
$$D_{m,l} = 1, \quad D_{n,l} = -1, \quad D_{k,l} = 0 \ (k \ne m, n), \qquad l = (m, n) \in \epsilon_2,$$
so that each column of $D$ contains one $+1$ and one $-1$, e.g.,
$$D = \begin{pmatrix} 1 & 1 & \cdots & 0 \\ -1 & 0 & \cdots & 0 \\ 0 & -1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & -1 \end{pmatrix} \in \mathbb{R}^{p \times |\epsilon_2|},$$
respectively.
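A small sketch of how the difference matrices $C$ and $D$ can be assembled from the supports $\epsilon_1$ and $\epsilon_2$ is given below; this construction is our reading of the definition above, and the helper name `build_C_D` is ours.

```python
import numpy as np

def build_C_D(eps1, eps2, N, p):
    """C in R^{|eps1| x N} with rows e_i - e_j; D in R^{p x |eps2|} with columns e_m - e_n."""
    C = np.zeros((len(eps1), N))
    for l, (i, j) in enumerate(eps1):
        C[l, i], C[l, j] = 1.0, -1.0
    D = np.zeros((p, len(eps2)))
    for l, (m, n) in enumerate(eps2):
        D[m, l], D[n, l] = 1.0, -1.0
    return C, D

# C @ U stacks the row differences U_i. - U_j.; U @ D stacks the column differences U_.m - U_.n
eps1, eps2 = [(0, 1), (1, 2)], [(0, 2)]
C, D = build_C_D(eps1, eps2, N=3, p=3)
```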
Then, the optimization problem (8) can be reformulated as follows:
$$\min_{U \in \mathbb{R}^{N \times p}} \ \frac{1}{2} \|X - U\|_F^2 + \lambda \left( \sum_{l \in \epsilon_1} \omega_l \|C_{l,\cdot} U\|_2 + \sum_{l \in \epsilon_2} \tilde{\omega}_l \|U D_{\cdot,l}\|_2 \right). \tag{11}$$

3.1. The ALM Formulation

To implement the ALM, we further convert the problem (11) into the following constrained optimization problem by introducing the dual variables $V \in \mathbb{R}^{|\epsilon_1| \times p}$ and $Z \in \mathbb{R}^{N \times |\epsilon_2|}$:
$$\min_{U, V, Z} \ \frac{1}{2} \|X - U\|_F^2 + \lambda \sum_{l \in \epsilon_1} \omega_l \|V_l\|_2 + \lambda \sum_{l \in \epsilon_2} \tilde{\omega}_l \|Z_l\|_2 \quad \text{subject to} \quad C_{l,\cdot} U - V_l = 0, \ l \in \epsilon_1, \quad U D_{\cdot,l} - Z_l = 0, \ l \in \epsilon_2, \tag{12}$$
where $V_l$ and $Z_l$ are the $l$-th row of $V$ and the $l$-th column of $Z$, respectively. If we introduce the following functions,
$$f(U) := \frac{1}{2} \|X - U\|_F^2, \qquad h(V) := \lambda \sum_{l \in \epsilon_1} \omega_l \|V_l\|_2, \qquad g(Z) := \lambda \sum_{l \in \epsilon_2} \tilde{\omega}_l \|Z_l\|_2, \tag{13}$$
then the problem (12) becomes
$$\min_{U, V, Z} \ f(U) + h(V) + g(Z) \quad \text{subject to the constraints in (12)}.$$
The augmented Lagrangian function of the problem (12) is given by
$$L_\nu(U, V, Z, \Lambda_1, \Lambda_2) := f(U) + h(V) + \sum_{l \in \epsilon_1} \langle \Lambda_{1l}, C_{l,\cdot} U - V_l \rangle + \frac{\nu}{2} \sum_{l \in \epsilon_1} \|C_{l,\cdot} U - V_l\|_2^2 + g(Z) + \sum_{l \in \epsilon_2} \langle \Lambda_{2l}, U D_{\cdot,l} - Z_l \rangle + \frac{\nu}{2} \sum_{l \in \epsilon_2} \|U D_{\cdot,l} - Z_l\|_2^2, \tag{14}$$
where $\nu > 0$ is an augmented Lagrangian penalty, $\Lambda_1 \in \mathbb{R}^{|\epsilon_1| \times p}$ and $\Lambda_2 \in \mathbb{R}^{N \times |\epsilon_2|}$ are Lagrangian multipliers, and $\Lambda_{1l}$ and $\Lambda_{2l}$ are the $l$-th row of $\Lambda_1$ and the $l$-th column of $\Lambda_2$, respectively.
Hence, the ALM procedure for the problem (12) consists of the following three steps:
$$(U^k, V^k, Z^k) = \arg\min_{U, V, Z} L_\nu(U, V, Z, \Lambda_1^{k-1}, \Lambda_2^{k-1}), \tag{15}$$
$$\Lambda_{1l}^{k} = \Lambda_{1l}^{k-1} + \nu (C_{l,\cdot} U^k - V_l^k), \quad l \in \epsilon_1, \tag{16}$$
$$\Lambda_{2l}^{k} = \Lambda_{2l}^{k-1} + \nu (U^k D_{\cdot,l} - Z_l^k), \quad l \in \epsilon_2. \tag{17}$$

3.2. The Proposed Method

We now construct our proposed method: we repeatedly minimize the augmented Lagrangian function in Equation (15) w.r.t. $U$, $V_l$, $Z_l$ and update the Lagrange multipliers as in Equations (16) and (17). The whole procedure is summarized in Algorithm 2.
Algorithm 2 Proposed method.
Input: Data $X$, matrices $C$ and $D$, Lipschitz constant $L$ calculated by (27), penalties $\lambda$ and $\nu$, initial values $\Lambda_1^0$, $\Lambda_2^0$, $Y^1$, $t_1 = 1$.
While $k < k_{\max}$ (until convergence) do
  1: Calculate the gradient $\nabla F(Y^k)$ by (20).
  2: Update iterate: $U^k \leftarrow Y^k - \frac{1}{L} \nabla F(Y^k)$.
  3: Update iterate: $\Lambda_{1l}^{k} \leftarrow P_{B_l}(\Lambda_{1l}^{k-1} + \nu C_{l,\cdot} U^k)$ for $l \in \epsilon_1$, by (24) and (25), where $B_l := \{y : \|y\|_2 \le \lambda \omega_l\}$.
  4: Update iterate: $\Lambda_{2l}^{k} \leftarrow P_{\tilde{B}_l}(\Lambda_{2l}^{k-1} + \nu U^k D_{\cdot,l})$ for $l \in \epsilon_2$, by (26) and (25), where $\tilde{B}_l := \{\tilde{y} : \|\tilde{y}\|_2 \le \lambda \tilde{\omega}_l\}$.
  5: $t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}$
  6: $Y^{k+1} = U^k + \frac{t_k - 1}{t_{k+1}} (U^k - U^{k-1})$
  7: $k = k + 1$
End while
Output: optimal solution to problem (8), $U^* = U^k$.
Step 1: update $U$. In the $U$-update, we define the following function from the augmented Lagrangian function (14),
$$F(U) := \min_{V, Z} L_\nu(U, V, Z, \Lambda_1, \Lambda_2) = f(U) + \min_{V, Z} \Big[ h(V) + \sum_{l \in \epsilon_1} \langle \Lambda_{1l}, C_{l,\cdot} U - V_l \rangle + \frac{\nu}{2} \sum_{l \in \epsilon_1} \|C_{l,\cdot} U - V_l\|_2^2 + g(Z) + \sum_{l \in \epsilon_2} \langle \Lambda_{2l}, U D_{\cdot,l} - Z_l \rangle + \frac{\nu}{2} \sum_{l \in \epsilon_2} \|U D_{\cdot,l} - Z_l\|_2^2 \Big]; \tag{18}$$
then the update of $U$ in (15) can be written as
$$U^{k+1} := \arg\min_U F(U). \tag{19}$$
We find that (18) is differentiable due to Lemma 1, and obtain the following proposition.
Proposition 1.
The function $F(U)$ is differentiable with respect to $U$, and
$$\nabla_U F(U) = -X + U + C^T \operatorname{prox}_{\nu h^*}(\nu C U + \Lambda_1) + \operatorname{prox}_{\nu g^*}(\nu U D + \Lambda_2) D^T. \tag{20}$$
For the proof, see the Appendix A.1.
With Proposition 1, we can use NAGM (Algorithm 1) to update U by solving the differentiable optimization problem (19).
Step 2: update $V_l$ and $\Lambda_1$. In the step (15) of the ALM procedure, we must minimize the terms in Equation (18) corresponding to the vector $V_l$, which yields the update
$$V_l^{k} = \arg\min_{V_l} \ \lambda \omega_l \|V_l\|_2 + \frac{\nu}{2} \|V_l\|_2^2 - \langle \Lambda_{1l}^{k-1} + \nu C_{l,\cdot} U^k, V_l \rangle = \arg\min_{V_l} \ \frac{\nu}{2} \big\|V_l - (C_{l,\cdot} U^k + \nu^{-1} \Lambda_{1l}^{k-1})\big\|_2^2 + h_l(V_l) = \operatorname{prox}_{h_l / \nu}\big(C_{l,\cdot} U^k + \nu^{-1} \Lambda_{1l}^{k-1}\big),$$
where $h_l(V_l) := \lambda \omega_l \|V_l\|_2$ denotes the $l$-th term in $h(V)$.
We substitute the optimal $V_l^k$ in the $k$-th iteration into the step (16),
$$\Lambda_{1l}^{k} \leftarrow \Lambda_{1l}^{k-1} + \nu (C_{l,\cdot} U^k - V_l^k), \quad l \in \epsilon_1,$$
to obtain
$$\Lambda_{1l}^{k} \leftarrow \Lambda_{1l}^{k-1} + \nu C_{l,\cdot} U^k - \nu \operatorname{prox}_{h_l / \nu}\big(C_{l,\cdot} U^k + \nu^{-1} \Lambda_{1l}^{k-1}\big). \tag{21}$$
By Moreau's decomposition (2), we further simplify the update (21) as follows:
$$\Lambda_{1l}^{k} \leftarrow \operatorname{prox}_{\nu h_l^*}\big(\Lambda_{1l}^{k-1} + \nu C_{l,\cdot} U^k\big), \tag{22}$$
which means that the updates of $V_l$ and $\Lambda_{1l}$ become the single update (22). Hence, there is no longer a need to store and compute the variable $V_l$ in the ALM updates, which reduces computational costs.
In the update (22), the conjugate function $h_l^*(y)$ of the weighted $\ell_2$ norm is an indicator function ([36], Example 3.26):
$$h_l^*(y) = \begin{cases} 0, & \text{if } \|y\|_2 \le \lambda \omega_l, \\ \infty, & \text{otherwise}. \end{cases} \tag{23}$$
Moreover, the proximal operator of the indicator function in (22) is a projection ([37], Theorem 6.24):
$$\operatorname{prox}_{\nu h_l^*}\big(\nu C_{l,\cdot} U^k + \Lambda_{1l}^{k-1}\big) = P_{B_l}\big(\nu C_{l,\cdot} U^k + \Lambda_{1l}^{k-1}\big), \tag{24}$$
where $B_l := \{y : \|y\|_2 \le \lambda \omega_l\}$ and the operator $P_{B_l}$ denotes the projection onto the ball $B_l$. It solves the problem $P_{B_l}(x) := \arg\min_{u \in B_l} \|u - x\|_2^2$, i.e.,
$$P_{B_l}(x) = \begin{cases} x, & \text{if } \|x\|_2 \le \lambda \omega_l, \\ \dfrac{\lambda \omega_l}{\|x\|_2}\, x, & \text{otherwise}. \end{cases} \tag{25}$$
This projection can be computed in $O(p)$ operations for a $p$-dimensional vector $x \in \mathbb{R}^p$.
Step 3: update $Z_l$ and $\Lambda_2$. Similarly, we can derive the following equations:
$$Z_l^{k} = \arg\min_{Z_l} \ \lambda \tilde{\omega}_l \|Z_l\|_2 + \frac{\nu}{2} \|Z_l\|_2^2 - \langle \Lambda_{2l}^{k-1} + \nu U^k D_{\cdot,l}, Z_l \rangle = \arg\min_{Z_l} \ \frac{\nu}{2} \big\|Z_l - (U^k D_{\cdot,l} + \nu^{-1} \Lambda_{2l}^{k-1})\big\|_2^2 + g_l(Z_l) = \operatorname{prox}_{g_l / \nu}\big(U^k D_{\cdot,l} + \nu^{-1} \Lambda_{2l}^{k-1}\big),$$
where $g_l(Z_l) := \lambda \tilde{\omega}_l \|Z_l\|_2$. Then, the dual variable $\Lambda_2$ update becomes
$$\Lambda_{2l}^{k} \leftarrow \operatorname{prox}_{\nu g_l^*}\big(\Lambda_{2l}^{k-1} + \nu U^k D_{\cdot,l}\big), \quad l \in \epsilon_2. \tag{26}$$
Written with the projection operator, it becomes
$$\Lambda_{2l}^{k} \leftarrow P_{\tilde{B}_l}\big(\Lambda_{2l}^{k-1} + \nu U^k D_{\cdot,l}\big),$$
where $\tilde{B}_l := \{\tilde{y} : \|\tilde{y}\|_2 \le \lambda \tilde{\omega}_l\}$.
Our proposed method uses only first-order information. Furthermore, in each iteration we only need to calculate the gradient of the function $F$ and the proximal operators, and the proximal operators are easily obtained by solving projection problems.
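Putting Steps 1 to 3 together, a compact NumPy sketch of one pass of Algorithm 2 might look as follows. This is our own minimal re-implementation for illustration under the notation above (dense $C$ and $D$, per-edge weight vectors); it is not the authors' Rcpp code.

```python
import numpy as np

def project_rows(A, radii):
    # project each row of A onto the l2 ball with the corresponding radius
    nrm = np.maximum(np.linalg.norm(A, axis=1, keepdims=True), 1e-12)
    return A * np.minimum(1.0, radii[:, None] / nrm)

def project_cols(A, radii):
    return project_rows(A.T, radii).T

def grad_F(U, X, C, D, L1, L2, nu, r1, r2):
    # Equation (20): grad = U - X + C^T P_B(nu C U + Lambda1) + P_B~(nu U D + Lambda2) D^T
    return (U - X
            + C.T @ project_rows(nu * (C @ U) + L1, r1)
            + project_cols(nu * (U @ D) + L2, r2) @ D.T)

def proposed_step(U_prev, Y, t, X, C, D, L1, L2, nu, lam, w, w_t, L):
    r1, r2 = lam * w, lam * w_t                               # ball radii lambda*omega_l, lambda*omega~_l
    U = Y - grad_F(Y, X, C, D, L1, L2, nu, r1, r2) / L        # Steps 1-2: gradient step
    L1 = project_rows(L1 + nu * (C @ U), r1)                  # Step 3: dual update via projection
    L2 = project_cols(L2 + nu * (U @ D), r2)                  # Step 4
    t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0         # Step 5
    Y = U + (t - 1.0) / t_next * (U - U_prev)                 # Step 6: momentum extrapolation
    return U, Y, t_next, L1, L2
```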

3.3. Lipschitz Constant and Convergence Rate

By the following lemma, we know that if we choose $1/L$ as the step size for each iteration of the NAGM, then the convergence rate is at most $O(1/k^2)$.
Lemma 2
([30,35,38]). Let $\{U^k\}$ be the sequence generated by Algorithm 1 with initial value $U^0$. If we take the step size as $1/L$, then for any $k \ge 1$ we have
$$F(U^k) - F(U^*) \le \frac{2 L \|U^0 - U^*\|_F^2}{(k+1)^2}.$$
In order to examine the performance of the proposed method, we derive the Lipschitz constant $L$ of $\nabla_U F(U)$ in the following proposition.
Proposition 2.
The Lipschitz constant of $\nabla_U F(U)$ is upper-bounded by
$$L = 1 + \nu \lambda_{\max}(C^T C) + \nu \lambda_{\max}(D^T D), \tag{27}$$
where $\lambda_{\max}(\cdot)$ denotes the maximum eigenvalue of the corresponding matrix.
Proof.
By the definition (1) and Proposition 1, we bound the Lipschitz constant as follows:
$$\begin{aligned} \|\nabla_U F(U_1) - \nabla_U F(U_2)\|_2 ={}& \big\| U_1 - U_2 + C^T \operatorname{prox}_{\nu h^*}(\nu C U_1 + \Lambda_1) - C^T \operatorname{prox}_{\nu h^*}(\nu C U_2 + \Lambda_1) \\ & + \operatorname{prox}_{\nu g^*}(\nu U_1 D + \Lambda_2) D^T - \operatorname{prox}_{\nu g^*}(\nu U_2 D + \Lambda_2) D^T \big\|_2 \\ \le{}& \|U_1 - U_2\|_2 + \big\| C^T \operatorname{prox}_{\nu h^*}(\nu C U_1 + \Lambda_1) - C^T \operatorname{prox}_{\nu h^*}(\nu C U_2 + \Lambda_1) \big\|_2 \\ & + \big\| \operatorname{prox}_{\nu g^*}(\nu U_1 D + \Lambda_2) D^T - \operatorname{prox}_{\nu g^*}(\nu U_2 D + \Lambda_2) D^T \big\|_2. \end{aligned}$$
By the definition of the matrix 2-norm and the nonexpansiveness of the proximal operators ([39], Lemma 2.4), we obtain
$$\begin{aligned} \|\nabla_U F(U_1) - \nabla_U F(U_2)\|_2 &\le \|U_1 - U_2\|_2 + \sqrt{\lambda_{\max}(C^T C)}\, \|\nu C U_1 - \nu C U_2\|_2 + \|\nu U_1 D - \nu U_2 D\|_2 \sqrt{\lambda_{\max}(D^T D)} \\ &\le \|U_1 - U_2\|_2 + \nu \lambda_{\max}(C^T C) \|U_1 - U_2\|_2 + \nu \lambda_{\max}(D^T D) \|U_1 - U_2\|_2 \\ &= \left( 1 + \nu \lambda_{\max}(C^T C) + \nu \lambda_{\max}(D^T D) \right) \|U_1 - U_2\|_2. \quad \square \end{aligned}$$
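In practice, the bound (27) is cheap to evaluate. A small sketch (ours) using NumPy is given below; it uses the identity that the squared spectral norm of a matrix equals the maximum eigenvalue of its Gram matrix.

```python
import numpy as np

def lipschitz_bound(C, D, nu):
    # L = 1 + nu * lambda_max(C^T C) + nu * lambda_max(D^T D)
    # ||C||_2^2 = lambda_max(C^T C) and ||D||_2^2 = lambda_max(D^T D)
    return 1.0 + nu * np.linalg.norm(C, 2) ** 2 + nu * np.linalg.norm(D, 2) ** 2
```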
Finally, it should be noted that, in terms of time complexity, the proposed method is less sensitive to the $\lambda$ value than the conventional methods. In fact, the $\lambda$ value affects the proposed method only through the functions $h_l^*$ and $g_l^*$, which take $0$ or $\infty$ depending on whether $\|y\|_2 \le \lambda \omega_l$ and $\|\tilde{y}\|_2 \le \lambda \tilde{\omega}_l$ in (23).
On the other hand, COBRA solves two optimization problems, and the ADMM-based methods need to solve the Sylvester equation, which means that all of them are strongly influenced by the $\lambda$ value.

4. Experiments

In this section, we show the performance of the proposed approach for estimating and assessing the biclusters by conducting experiments on both synthetic and real datasets. We executed the following algorithms:
  • COBRA: Dykstra-like proximal algorithm proposed by Chi et al. [9].
  • ADMM: the ADMM proposed by Weylandt [19].
  • G-ADMM (generalized ADMM): the modified ADMM presented by Weylandt [19].
  • Proposed method: the proposed algorithm shown in Algorithm 2.
They were all implemented with Rcpp on a MacBook Air with a 1.6 GHz Intel Core i5 and 8 GB of memory. We recorded the wall-clock times of the four algorithms.

4.1. Artificial Data Analysis

We evaluate the performance of the proposed methods on synthetic data in terms of the number of iterations, the execution time, and the clustering quality.
We generate the artificial data $X \in \mathbb{R}^{N \times p}$ with a checkerboard bicluster structure similar to the method in [9]. We simulate $X_{ij} \sim N(\mu_{rc}, \sigma^2)$ (i.i.d.), where the indices $r$ and $c$ range over the clusters $\{1, \ldots, R\}$ and $\{1, \ldots, C\}$, respectively, so that the number of biclusters is $M := R \times C$. Each $x_{ij}$ is randomly assigned to one of those $M$ biclusters. The mean $\mu_{rc}$ is chosen uniformly from the equally spaced sequence $\{-10, -9, \ldots, 9, 10\}$, and $\sigma$ is set to 1.5 and 3.0 for the different noise levels.
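A sketch of such a checkerboard generator is given below. This is our own minimal version under the description above; the exact generator in [9] may differ in details such as how the cluster assignments are balanced.

```python
import numpy as np

def make_checkerboard(N=100, p=100, R=4, C=4, sigma=1.5, seed=0):
    rng = np.random.default_rng(seed)
    row_lab = rng.integers(0, R, size=N)                   # random row-cluster assignment
    col_lab = rng.integers(0, C, size=p)                   # random column-cluster assignment
    means = rng.choice(np.arange(-10, 11), size=(R, C))    # mu_rc drawn from {-10, ..., 10}
    X = means[row_lab][:, col_lab] + sigma * rng.normal(size=(N, p))
    return X, row_lab, col_lab

X, row_lab, col_lab = make_checkerboard()
```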
In our experiments, we consider the following stopping criteria for the four algorithms.
  • Relative error:
    $$\frac{\|U^{k+1} - U^k\|_F}{\max\{\|U^k\|_F, 1\}} \le \epsilon.$$
  • Objective function error:
    $$|F(U^k) - F(U^*)| \le \epsilon,$$
where $\epsilon$ is a given accuracy tolerance. We terminate the algorithm if the corresponding error is smaller than $\epsilon$ or the number of iterations exceeds 10,000. We use the relative error for the time comparisons and quality assessment, and the objective function error for the convergence rate analysis.
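The two criteria translate directly into code; the following short sketch (ours, under the notation above) shows how each check can be implemented.

```python
import numpy as np

def rel_error_stop(U_new, U_old, eps):
    # relative change of the iterate, ||U^{k+1} - U^k||_F / max(||U^k||_F, 1)
    return np.linalg.norm(U_new - U_old, 'fro') / max(np.linalg.norm(U_old, 'fro'), 1.0) <= eps

def obj_error_stop(F_k, F_star, eps):
    # objective suboptimality |F(U^k) - F(U^*)|
    return abs(F_k - F_star) <= eps
```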

4.1.1. Comparisons

We vary the sizes $N$, $p$ of the data matrix $X$ and the tuning parameter $\lambda$ to compare the performance of the four algorithms. At first, we compare the execution time for different $\lambda$ ranging from 1 to 2000, setting $R = 4$, $C = 4$, $\sigma = 1.5$. The results are shown in Figure 2a.
From Figure 2a, we observe that the execution times of COBRA and G-ADMM increase rapidly as $\lambda$ grows; the execution time of COBRA is the largest when $\lambda > 1400$. Therefore, it takes a long time for COBRA to visualize the whole fusion process, particularly the single-bicluster case. Our proposed method significantly outperforms the other three algorithms and offers high stability over a wide range of $\lambda$. Due to its low computational time, our proposed method is a preferable choice for visualizing the biclusters under various $\lambda$ values when applying biclustering.
Next, we compare the execution time for different $p$ (from 1 to 200). Here, we fix the numbers of column clusters and row clusters ($C = R = 4$). Figure 2b shows that the execution time of each algorithm increases as $p$ grows, and the proposed method again performs better than the other three. In particular, the ADMM and G-ADMM suffer as the feature dimension $p$ grows, and their computation times increase dramatically.
Then, we vary the sample size $N$ from 100 to 1000 and fix the feature size $p = 40$, with $\lambda = 1$, $R = 4$, $C = 4$, $\sigma = 1.5$. Figure 3 shows the execution times of three of the algorithms (COBRA, G-ADMM, and Proposed) for each $N$.
The curves reveal that the larger the sample size $N$, the more time the algorithms require. The ADMM takes more than 500 s when $N > 500$ and around 1800 s when $N = 1000$, which is much longer than the other three algorithms; thus, we omit the ADMM results from the figure. In contrast, our proposed method takes only around 10 s even when $N = 1000$, roughly one-sixth of the time required by G-ADMM.

4.1.2. Assessment

We evaluate the clustering quality by a widely used criterion called the Rand index (RI) [40]. The value of RI ranges from 0 to 1; a higher value shows better performance, and 1 indicates the perfect quality of the clustering. Note that we can obtain the true bicluster labels in the data generation procedure. We generate the matrix data with N = 100 and p = 100 , and set two noise levels, low ( σ = 1.5 ) and high ( σ = 3.0 ). We compare the clustering quality of our proposed method with ADMM, G-ADMM, and COBRA under different settings. Setting 1: R = 2 , C = 4 , σ = 1.5 ; Setting 2: R = 4 , C = 4 , σ = 1.5 ; Setting 3: R = 4 , C = 8 , σ = 1.5 ; Setting 4: R = 2 , C = 4 , σ = 3 ; Setting 5: R = 4 , C = 4 , σ = 3 ; Setting 6: R = 4 , C = 8 , σ = 3 .
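For reference, the Rand index over bicluster labels can be computed by pair counting, as in the following sketch (ours); the helper `bicluster_labels` assigns each entry $(i, j)$ the pair of its row and column cluster labels, which is one straightforward way to compare biclusterings.

```python
import numpy as np

def rand_index(labels_true, labels_pred):
    """Rand index via the contingency table (pair counting)."""
    n = len(labels_true)
    t = np.unique(labels_true, return_inverse=True)[1]
    p = np.unique(labels_pred, return_inverse=True)[1]
    cont = np.zeros((t.max() + 1, p.max() + 1))
    np.add.at(cont, (t, p), 1)                      # contingency counts n_ij
    comb2 = lambda x: x * (x - 1) / 2.0
    total = comb2(n)
    same_true, same_pred = comb2(cont.sum(1)).sum(), comb2(cont.sum(0)).sum()
    same_both = comb2(cont).sum()
    # disagreeing pairs: same in one partition but not the other
    return 1.0 - (same_true + same_pred - 2.0 * same_both) / total

def bicluster_labels(row_lab, col_lab):
    # entry (i, j) gets a unique label encoding (row cluster of i, column cluster of j)
    return np.add.outer(row_lab * (col_lab.max() + 1), col_lab).ravel()
```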
Table 2 presents the results of the experiment. As the tuning parameter $\lambda$ increases, the biclusters tend to fuse, which reduces noise interference in the raw data. In some cases, however, for extremely high $\lambda$ such as 10,000, the biclusters may be over-smoothed and the Rand index decreases; see, for example, Setting 1 (the number of biclusters is $2 \times 4$ and $\sigma = 1.5$). The Rand indices of COBRA, ADMM, and our proposed method are similar in most cases because all of these algorithms solve the same model. However, G-ADMM exhibits the worst performance due to its slow convergence rate, and it does not converge well when the tuning parameter $\lambda$ is large ($\lambda = 5000$ and $\lambda = 10{,}000$). Overall, from the results in Table 2, our proposed method shows high accuracy and stability from low to high noise levels.

4.2. Real Data Analysis

In this section, we use three different real datasets to demonstrate the performance of our proposed method.
Firstly, we use the presidential speeches dataset preprocessed by Weylandt et al. [41], which contains 75 high-frequency words taken from the significant speeches of the 44 U.S. presidents (as of 2018). We show the heatmaps in Figure 4 under a wide range of tuning parameters $\lambda$ to exhibit the fusion process of the biclusters. We set the tolerance $\epsilon$ to $1 \times 10^{-6}$ and use the relative-error stopping criterion described in Section 4.1. The columns represent the different presidents, and the rows represent the different words. When $\lambda = 0$, the heatmap is disordered and there are no distinct subgroups. As we increase $\lambda$, the biclusters begin to merge, and we can identify the common vocabulary used in particular groups of presidents' speeches. Moreover, as shown in Figure 4f, the heatmap clearly shows four biclusters, with two subgroups of presidents and two subgroups of words, when $\lambda = 30{,}000$.
Secondly, we compare the computational times of the four algorithms on two real datasets. One is The Cancer Genome Atlas (TCGA) dataset [42], which contains 438 breast cancer patients (samples) and 353 genes (features), and the other is the diffuse large-B-cell lymphoma (DLBCL) dataset [43] with 3795 genes and 58 patient samples. In DLBCL, 32 of the 58 samples are from cured patients and 26 are from sick individuals. Furthermore, we extract the 500 genes with the highest variances among the original genes.
Figure 5a,b depicts the outcomes of the elapsed-time comparison. From the curves, we observe that our proposed approach surpasses the other three methods. In contrast, ADMM shows the worst performance on the DLBCL dataset, and also on the TCGA dataset when a tolerance of $\epsilon < 10^{-3}$ is required.
Lastly, we compare the number of iterations needed to achieve a specified tolerance of $F(U^k) - F(U^*)$ on the TCGA and DLBCL datasets. Figure 6a,b reveals that the COBRA algorithm has the fastest convergence rate, whereas the generalized ADMM is the slowest to converge. Our proposed method shows competitive performance in terms of the convergence rate.
Overall, from the above experiment results of the artificial and real datasets, our proposed method has superior computational performance with high accuracy.

5. Discussion

We proposed a method to find a solution to the convex biclustering problem and found that it outperforms conventional algorithms, such as COBRA and the ADMM-based procedures, in terms of efficiency. Our proposed method is more efficient than COBRA because the latter must solve two optimization problems containing non-differentiable fused terms in each cycle. Additionally, the proposed method performs better than the ADMM-based procedures because it builds on [29] and uses the NAGM to update the variable $U$, whereas ADMM spends much more time computing matrix inverses. Moreover, our proposed method is stable while varying the tuning parameter $\lambda$, which makes it convenient to find a suitable $\lambda$ and to visualize the variation of the heatmaps under a wide range of $\lambda$.
As for further improvements, we could use ADMM as a warm-start strategy to select an initial value for our proposed method. Moreover, according to the fusion process of the heatmap results in Figure 4, it would be meaningful to derive the range of tuning parameters $\lambda$ that yields non-trivial solutions of the convex biclustering problem with more than one bicluster. The proposed method also motivates future work: it can be extended to other clustering problems, such as the sparse singular value decomposition model [44] and the integrative generalized convex clustering model [45].

Author Contributions

Conceptualization, J.C. and J.S.; methodology, J.C. and J.S.; software, J.C.; formal analysis, J.C. and J.S.; investigation, J.C.; resources, J.C.; data curation, J.C.; writing—original draft preparation, J.C. and J.S.; writing—review and editing, J.C. and J.S.; visualization, J.C.; supervision, J.S.; project administration, J.S.; funding acquisition, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Grant-in-Aid for Scientific Research (KAKENHI) C, Grant number: 18K11192.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this paper. Presidential speeches dataset: https://www.presidency.ucsb.edu (accessed date 18 October 2021); DLBCL dataset: http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi (accessed date 18 October 2021).

Acknowledgments

The authors would like to thank Ryosuke Shimmura for helpful discussions.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ADMM    alternating direction method of multipliers
ALM     augmented Lagrangian method
COBRA   convex biclustering algorithm
DLBCL   diffuse large-B-cell lymphoma
FISTA   fast iterative shrinkage-thresholding algorithm
ISTA    iterative shrinkage-thresholding algorithm
NAGM    Nesterov's accelerated gradient method
RI      Rand index
TCGA    The Cancer Genome Atlas

Appendix A

Appendix A.1. Proof of Proposition 1

First, we define the following two functions,
$$r_1(V) := h(V) + \frac{\nu}{2} \|V\|_F^2$$
and
$$r_2(Z) := g(Z) + \frac{\nu}{2} \|Z\|_F^2.$$
By the definition of $F(U)$ in (18),
$$\begin{aligned} F(U) &= \min_{V, Z} L_\nu(U, V, Z, \Lambda_1, \Lambda_2) \\ &= f(U) + \min_{V, Z} \Big[ h(V) + g(Z) + \sum_l \langle \Lambda_{1l}, C_{l,\cdot} U - V_l \rangle + \frac{\nu}{2} \sum_l \|C_{l,\cdot} U - V_l\|_2^2 + \sum_l \langle \Lambda_{2l}, U D_{\cdot,l} - Z_l \rangle + \frac{\nu}{2} \sum_l \|U D_{\cdot,l} - Z_l\|_2^2 \Big] \\ &= f(U) + \min_V \Big[ h(V) + \frac{\nu}{2} \|V\|_F^2 - \sum_l \langle \Lambda_{1l} + \nu C_{l,\cdot} U, V_l \rangle \Big] + \frac{\nu}{2} \|C U\|_F^2 + \sum_l \langle \Lambda_{1l}, C_{l,\cdot} U \rangle \\ &\quad + \min_Z \Big[ g(Z) + \frac{\nu}{2} \|Z\|_F^2 - \sum_l \langle \Lambda_{2l} + \nu U D_{\cdot,l}, Z_l \rangle \Big] + \frac{\nu}{2} \|U D\|_F^2 + \sum_l \langle \Lambda_{2l}, U D_{\cdot,l} \rangle \\ &= -\max_V \Big[ \langle \nu C U + \Lambda_1, V \rangle - h(V) - \frac{\nu}{2} \|V\|_F^2 \Big] + f(U) + \frac{\nu}{2} \|C U\|_F^2 + \sum_l \langle \Lambda_{1l}, C_{l,\cdot} U \rangle \\ &\quad - \max_Z \Big[ \langle \nu U D + \Lambda_2, Z \rangle - g(Z) - \frac{\nu}{2} \|Z\|_F^2 \Big] + \frac{\nu}{2} \|U D\|_F^2 + \sum_l \langle \Lambda_{2l}, U D_{\cdot,l} \rangle \\ &= f(U) - r_1^*(\nu C U + \Lambda_1) - r_2^*(\nu U D + \Lambda_2) + \frac{\nu}{2} \|C U\|_F^2 + \langle \Lambda_1, C U \rangle + \frac{\nu}{2} \|U D\|_F^2 + \langle \Lambda_2, U D \rangle. \end{aligned}$$
By Theorem 26.3 in [33], if the function $r: \mathbb{R}^p \to \mathbb{R}$ is closed and strongly convex, then the conjugate function $r^*(v)$ is differentiable, and
$$\nabla r^*(v) = \arg\max_{u \in \mathbb{R}^p} \{ \langle u, v \rangle - r(u) \}.$$
Hence, we can derive the following equations:
$$\nabla r_1^*(v) = \arg\max_u \Big[ \langle u, v \rangle - h(u) - \frac{\nu}{2} \|u\|_F^2 \Big] = \arg\max_u \Big[ -\frac{1}{2} \|u\|_F^2 + \frac{1}{\nu} \langle u, v \rangle - \frac{1}{\nu} h(u) \Big] = \arg\min_u \Big[ \frac{1}{2} \Big\|u - \frac{v}{\nu}\Big\|_F^2 + \frac{1}{\nu} h(u) \Big] = \operatorname{prox}_{h/\nu}\Big(\frac{v}{\nu}\Big).$$
Then, we obtain
$$\nabla_U \, r_1^*(\nu C U + \Lambda_1) = \nu C^T \operatorname{prox}_{h/\nu}\Big( C U + \frac{\Lambda_1}{\nu} \Big)$$
and, similarly,
$$\nabla_U \, r_2^*(\nu U D + \Lambda_2) = \nu \operatorname{prox}_{g/\nu}\Big( U D + \frac{\Lambda_2}{\nu} \Big) D^T.$$
Next, we take the derivative of $F(U)$ w.r.t. $U$:
$$\begin{aligned} \nabla_U F(U) &= \nabla f(U) - \nabla_U r_1^*(\nu C U + \Lambda_1) - \nabla_U r_2^*(\nu U D + \Lambda_2) + \nu C^T C U + \nu U D D^T + C^T \Lambda_1 + \Lambda_2 D^T \\ &= \nabla f(U) - \nu C^T \operatorname{prox}_{h/\nu}(C U + \nu^{-1} \Lambda_1) - \nu \operatorname{prox}_{g/\nu}(U D + \nu^{-1} \Lambda_2) D^T + \nu C^T C U + C^T \Lambda_1 + \nu U D D^T + \Lambda_2 D^T \\ &= \nabla f(U) + C^T \operatorname{prox}_{\nu h^*}(\nu C U + \Lambda_1) + \operatorname{prox}_{\nu g^*}(\nu U D + \Lambda_2) D^T, \end{aligned}$$
where the last equality follows from Moreau's decomposition.

References

  1. Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A k-means clustering algorithm. J. R. Stat. Soc. Ser. (Appl. Stat.) 1979, 28, 100–108.
  2. Johnson, S.C. Hierarchical clustering schemes. Psychometrika 1967, 32, 241–254.
  3. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin/Heidelberg, Germany, 2017.
  4. Hartigan, J.A. Direct clustering of a data matrix. J. Am. Stat. Assoc. 1972, 67, 123–129.
  5. Cheng, Y.; Church, G.M. Biclustering of expression data. ISMB Int. Conf. Intell. Syst. Mol. Biol. 2000, 8, 93–103.
  6. Tanay, A.; Sharan, R.; Shamir, R. Discovering statistically significant biclusters in gene expression data. Bioinformatics 2002, 18, S136–S144.
  7. Prelić, A.; Bleuler, S.; Zimmermann, P.; Wille, A.; Bühlmann, P.; Gruissem, W.; Hennig, L.; Thiele, L.; Zitzler, E. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 2006, 22, 1122–1129.
  8. Madeira, S.C.; Oliveira, A.L. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. Comput. Biol. Bioinform. 2004, 1, 24–45.
  9. Chi, E.C.; Allen, G.I.; Baraniuk, R.G. Convex biclustering. Biometrics 2017, 73, 10–19.
  10. Tibshirani, R.; Saunders, M.; Rosset, S.; Zhu, J.; Knight, K. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. (Stat. Methodol.) 2005, 67, 91–108.
  11. Hocking, T.D.; Joulin, A.; Bach, F.; Vert, J.P. Clusterpath an algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, DC, USA, 28 June–2 July 2011; p. 1.
  12. Lindsten, F.; Ohlsson, H.; Ljung, L. Clustering using sum-of-norms regularization: With application to particle filter output computation. In Proceedings of the 2011 IEEE Statistical Signal Processing Workshop (SSP), Nice, France, 28–30 June 2011; pp. 201–204.
  13. Pelckmans, K.; De Brabanter, J.; Suykens, J.A.; De Moor, B. Convex clustering shrinkage. In Proceedings of the PASCAL Workshop on Statistics and Optimization of Clustering Workshop, London, UK, 4–5 July 2005.
  14. Langfelder, P.; Horvath, S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinform. 2008, 9, 1–13.
  15. Tan, K.M.; Witten, D.M. Sparse biclustering of transposable data. J. Comput. Graph. Stat. 2014, 23, 985–1008.
  16. Shor, N.Z. Minimization Methods for Non-Differentiable Functions; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 3.
  17. Boyd, S.; Xiao, L.; Mutapcic, A. Subgradient methods. Lect. Notes EE392o Stanf. Univ. Autumn Quart. 2003, 2004, 2004–2005.
  18. Bauschke, H.H.; Combettes, P.L. A Dykstra-like algorithm for two monotone operators. Pac. J. Optim. 2008, 4, 383–391.
  19. Weylandt, M. Splitting methods for convex bi-clustering and co-clustering. In Proceedings of the 2019 IEEE Data Science Workshop (DSW), Minneapolis, MN, USA, 2–5 June 2019; pp. 237–242.
  20. Glowinski, R.; Marroco, A. Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires. ESAIM Math. Model. Numer. Anal.-ModéLisation MathéMatique Anal. NuméRique 1975, 9, 41–76.
  21. Gabay, D.; Mercier, B. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl. 1976, 2, 17–40.
  22. Deng, W.; Yin, W. On the global and linear convergence of the generalized alternating direction method of multipliers. J. Sci. Comput. 2016, 66, 889–916.
  23. Chi, E.C.; Lange, K. Splitting methods for convex clustering. J. Comput. Graph. Stat. 2015, 24, 994–1013.
  24. Suzuki, J. Sparse Estimation with Math and R: 100 Exercises for Building Logic; Springer: Berlin/Heidelberg, Germany, 2021.
  25. Bartels, R.H.; Stewart, G.W. Solution of the matrix equation AX+ XB= C [F4]. Commun. ACM 1972, 15, 820–826.
  26. Goldstein, T.; O’Donoghue, B.; Setzer, S.; Baraniuk, R. Fast alternating direction optimization methods. Siam J. Imaging Sci. 2014, 7, 1588–1623.
  27. Boyd, S.; Parikh, N.; Chu, E. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers; Now Publishers Inc.: Boston, MA, USA, 2011.
  28. Abualigah, L.; Diabat, A.; Mirjalili, S.; Abd Elaziz, M.; Gandomi, A.H. The arithmetic optimization algorithm. Comput. Methods Appl. Mech. Eng. 2021, 376, 113609.
  29. Shimmura, R.; Suzuki, J. Converting ADMM to a Proximal Gradient for Convex Optimization Problems. arXiv 2021, arXiv:2104.10911.
  30. Beck, A.; Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. Siam J. Imaging Sci. 2009, 2, 183–202.
  31. Moreau, J.J. Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France 1965, 93, 273–299.
  32. Hestenes, M.R. Multiplier and gradient methods. J. Optim. Theory Appl. 1969, 4, 303–320.
  33. Rockafellar, R.T. The multiplier method of Hestenes and Powell applied to convex programming. J. Optim. Theory Appl. 1973, 12, 555–562.
  34. Nocedal, J.; Wright, S. Numerical Optimization; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2006.
  35. Nesterov, Y.E. A method for solving the convex programming problem with convergence rate O (1/k2). Dokl. Akad. Nauk Sssr 1983, 269, 543–547.
  36. Boyd, S.; Boyd, S.P.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
  37. Beck, A. First-Order Methods in Optimization; SIAM: Philadelphia, PA, USA, 2017.
  38. Nemirovski, A.; Yudin, D. Problem Complexity and Method Efficiency in Optimization; John Wiley: Hoboken, NJ, USA, 1983.
  39. Combettes, P.L.; Wajs, V.R. Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul. 2005, 4, 1168–1200.
  40. Rand, W.M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 1971, 66, 846–850.
  41. Weylandt, M.; Nagorski, J.; Allen, G.I. Dynamic visualization and fast computation for convex clustering via algorithmic regularization. J. Comput. Graph. Stat. 2020, 29, 87–96.
  42. Koboldt, D.; Fulton, R.; McLellan, M.; Schmidt, H.; Kalicki-Veizer, J.; McMichael, J.; Fulton, L.; Dooling, D.; Ding, L.; Mardis, E.; et al. Comprehensive molecular portraits of human breast tumours. Nature 2012, 490, 61–70.
  43. Rosenwald, A.; Wright, G.; Chan, W.C.; Connors, J.M.; Campo, E.; Fisher, R.I.; Gascoyne, R.D.; Muller-Hermelink, H.K.; Smeland, E.B.; Giltnane, J.M.; et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N. Engl. J. Med. 2002, 346, 1937–1947.
  44. Lee, M.; Shen, H.; Huang, J.Z.; Marron, J. Biclustering via sparse singular value decomposition. Biometrics 2010, 66, 1087–1095.
  45. Wang, M.; Allen, G.I. Integrative generalized convex clustering optimization and feature selection for mixed multi-view data. J. Mach. Learn. Res. 2021, 22, 1–73.
Figure 1. While standard clustering divides either rows or columns, biclustering divides both. (a) Row clustering; (b) column clustering; (c) biclustering.
Figure 2. Execution time for various $\lambda$ and $p$ with $N = 100$ and $\epsilon = 1 \times 10^{-6}$. (a) Different $\lambda$, with $p = 40$; (b) different $p$, with $\lambda = 1$.
Figure 3. Execution times for each $N$ with $\lambda = 1$, $p = 40$, and $\epsilon = 1 \times 10^{-6}$.
Figure 4. The heatmap results of the proposed method on the presidential speeches dataset under a wide range of $\lambda$. (a) $\lambda = 0$; (b) $\lambda = 1500$; (c) $\lambda = 2000$; (d) $\lambda = 5000$; (e) $\lambda = 15{,}000$; (f) $\lambda = 30{,}000$.
Figure 5. Plot of $\log(F(U^k) - F(U^*))$ vs. the elapsed time. (a) TCGA; (b) DLBCL.
Figure 6. Plot of $\log(F(U^k) - F(U^*))$ vs. the number of iterations. (a) TCGA; (b) DLBCL.
Table 1. Gradient descent and its modifications.

             | Differentiable     | Non-Differentiable
Ordinary     | Gradient descent   | Proximal gradient descent (e.g., ISTA [30])
Accelerated  | NAGM [35]          | FISTA [30]
Table 2. Assessment result (Rand index).

Setting    Algorithm   λ = 100   λ = 1000   λ = 5000   λ = 10,000
Setting 1  COBRA       0.874     0.875      0.999      0.931
           ADMM        0.872     0.875      0.999      0.931
           G-ADMM      0.874     0.875      0.874      0.872
           Proposed    0.875     0.875      0.999      0.931
Setting 2  COBRA       0.928     0.932      0.994      0.999
           ADMM        0.928     0.935      0.994      0.999
           G-ADMM      0.928     0.934      0.981      0.936
           Proposed    0.928     0.934      0.994      0.999
Setting 3  COBRA       0.959     0.962      0.962      0.999
           ADMM        0.961     0.962      0.962      0.998
           G-ADMM      0.959     0.962      0.967      0.967
           Proposed    0.961     0.962      0.962      0.999
Setting 4  COBRA       0.870     0.870      0.870      0.935
           ADMM        0.870     0.868      0.871      0.933
           G-ADMM      0.870     0.870      0.871      0.871
           Proposed    0.870     0.870      0.871      0.935
Setting 5  COBRA       0.934     0.934      0.934      0.964
           ADMM        0.934     0.932      0.934      0.964
           G-ADMM      0.934     0.934      0.932      0.932
           Proposed    0.934     0.934      0.934      0.964
Setting 6  COBRA       0.960     0.960      0.962      0.962
           ADMM        0.961     0.962      0.962      0.962
           G-ADMM      0.960     0.962      0.962      0.960
           Proposed    0.961     0.962      0.962      0.962
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
