1. Introduction
Recently there has been increasing interest in developing AI (artificial intelligence) algorithms using the technique of algorithm unrolling [1]. The motivation behind algorithm unrolling is to achieve some interpretability in the AI algorithm, unlike many other AI algorithms, which are usually considered ‘black boxes’. The main idea of algorithm unrolling is the conversion of an iterative algorithm, e.g., the Iterative Shrinkage–Thresholding Algorithm (ISTA) [2,3,4,5], into another iterative algorithm with a different mapping in each iteration. Algorithm unrolling has been successfully applied in areas such as compressed sensing [6], phase retrieval [7], power systems [8], image fusion [9] and microscopy [10]. The original iterative algorithm is usually derived from theoretical and/or physical considerations of the problem at hand. The unrolled algorithm inherits some of the theoretical and/or physical features of the original algorithm, and therefore retains some aspects of interpretability. For example, with ISTA, the theoretical consideration is to achieve a sparse representation of a signal vector given a dictionary of atom (constituent) vectors. An abstraction of algorithm unrolling is to consider a sequence of mappings between metric spaces, where the application of a mapping can be considered as one iteration of the algorithm. This abstraction is also relevant in other areas such as non-autonomous contraction mappings [11].
Consider a sequence of metric spaces $(X^{(m)}, d^{(m)})$, $m = 0, 1, 2, \ldots$, where $X^{(m)}$ denotes the set of elements (points) and $d^{(m)}$ denotes the corresponding distance function. The sequence of mappings $\{T^{(m)}\}$ is between the metric spaces: $T^{(m)}: X^{(m-1)} \to X^{(m)}$ for $m = 1, 2, \ldots$, as illustrated in Figure 1. Starting with an initial point $x^{(0)} \in X^{(0)}$, we consider the new point after the application of the sequence of mappings:
$$x^{(m)} = T^{(m)} \circ T^{(m-1)} \circ \cdots \circ T^{(1)}\big(x^{(0)}\big), \tag{1}$$
where the symbol ‘∘’ denotes the composition of mappings. The superscript ‘$(m)$’ in $x^{(m)}$ denotes the index of the space the point belongs to, i.e., $x^{(m)} \in X^{(m)}$. However, if the point is obtained using (1), $m$ also represents the index of the term in the sequence. In other words, $x^{(m)}$ is the $m$th term in the sequence $\{x^{(m)}\}$, where every term in the sequence, in general, belongs to a different space.
When all metric spaces are the same, i.e., $(X^{(m)}, d^{(m)}) = (X, d)$ for all $m$, and all mappings are the same, i.e., $T^{(m)} = T$ for all $m$, (1) represents the classical iteration procedure found in many branches of applied mathematics and data science, e.g., Jacobi iterations and Gauss–Seidel iterations [12]. Furthermore, when the mapping $T$ is a contraction, the Banach Fixed Point Theorem applies, and this also has applications in the study of differential and integral equations [12].
In practice, with algorithm unrolling, the different mappings are obtained through parametrization; i.e., every mapping has the same functional form, and the mapping used in each iteration is defined by its own parameter vector.
Sequences of contraction mappings have also been considered previously in [13,14,15,16]. However, the context considered in those works is different from that considered here. They are mainly interested in the convergence of the sequence of fixed points of the sequence of contraction mappings $\{T_n\}$, i.e., the convergence of $\{x_n^*\}$ where $x_n^* = T_n(x_n^*)$. These previous works do not consider the tandem application of the sequence of mappings as shown in (1). In this work, we study the behavior of (1) as $m \to \infty$ and prove some convergence results. More detailed comparisons are deferred to Section 5. Application to the unrolled ISTA will also be considered. To the best of our knowledge, similar results are not found elsewhere.
Organization of paper: In Section 2, we review some definitions and concepts in metric spaces that are relevant to this work. New definitions, which are generalizations of classical notions, are also presented there. The convergence in the general case, with different mappings and different metric spaces, is analyzed in Section 3. In Section 4, the convergence with different mappings on the same metric space is analyzed. Detailed comparisons with previous works that considered sequences of mappings are found in Section 5. The general convergence result is then used to analyze the convergence of the unrolled ISTA in Section 6. Concluding remarks are found in Section 7.
2. Preliminaries and Definitions
Firstly, we provide some comments about the notation used in the paper. As mentioned earlier, the superscript ‘$(m)$’ in $x^{(m)}$ denotes the index of the space the point belongs to, i.e., $x^{(m)} \in X^{(m)}$. In general, the spaces are different; i.e., $X^{(m)} \neq X^{(n)}$ for $m \neq n$. Since we are concerned with a sequence of mappings from one space to another space, $m$ also represents the index of the sequence; i.e., $\{x^{(m)}\}$ represents a sequence of points in different spaces (in general). In the special case when all the spaces are the same, we use a subscript to denote the index of the sequence, i.e., $\{x_m\}$, which is the common convention for sequences.
We first recall some basic definitions for metric spaces [12] which are relevant to the developments that follow. A metric space is a set of points (elements) $X$ and an endowed distance function $d: X \times X \to \mathbb{R}$ that satisfies the following axioms. For all $x, y, z \in X$, we have
- A1
$d(x, y)$ is a non-negative finite-valued function.
- A2
$d(x, y) = 0$ if and only if $x = y$.
- A3
$d(x, y) = d(y, x)$; i.e., the distance function is symmetric.
- A4
$d(x, z) \le d(x, y) + d(y, z)$, which is known as the triangle inequality.
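Using induction on axiom A4, we have the following.
Definition 1 (Generalized triangle inequality). For all $x_1, x_2, \ldots, x_n \in X$,
$$d(x_1, x_n) \le d(x_1, x_2) + d(x_2, x_3) + \cdots + d(x_{n-1}, x_n). \tag{2}$$
Definition 2 (Cauchy sequence and completeness). A sequence $\{x_n\}$ ($x_n \in X$) is a Cauchy sequence if for every $\epsilon > 0$, there exists $K$ such that
$$d(x_m, x_n) < \epsilon \quad \text{for all } m, n > K.$$
If every Cauchy sequence converges to a limit point in $X$, i.e.,
$$\lim_{n \to \infty} x_n = x^* \in X,$$
then the space $X$ is complete.
Consider a mapping $T: X \to X$ which maps points in $X$ into $X$ itself. The mapping $T$ is a contraction if there exists a constant $c \in [0, 1)$ such that for all $x, y \in X$
$$d(T(x), T(y)) \le c\, d(x, y).$$
With a contraction mapping, we have the well-known Banach Fixed Point Theorem (BFPT).
Theorem 1 (BFPT). Consider a complete metric space $(X, d)$ and a contraction mapping $T: X \to X$. We have the following:
- 1.
There exists a point $x^* \in X$ such that
$$T(x^*) = x^*.$$
The point $x^*$ is unique and is known as the fixed point of $T$.
- 2.
Given any initial point $x_0 \in X$, the sequence of points generated from the iterations
$$x_{n+1} = T(x_n), \quad n = 0, 1, 2, \ldots,$$
converges to the fixed point; i.e.,
$$\lim_{n \to \infty} x_n = x^*.$$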
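As a minimal illustration of the classical iteration in Theorem 1, the following Python sketch iterates a single contraction on the real line from two different starting points. The map `T`, the helper `fixed_point_iterate`, the tolerance and the starting values are all hypothetical choices made for this example; they do not come from the paper.

```python
def fixed_point_iterate(T, x0, tol=1e-12, max_iter=1000):
    """Iterate x_{n+1} = T(x_n) until successive iterates are within tol."""
    x = x0
    for n in range(1, max_iter + 1):
        x_next = T(x)
        if abs(x_next - x) < tol:
            return x_next, n
        x = x_next
    return x, max_iter


# Hypothetical contraction on the real line with contraction constant 0.5.
# Its unique fixed point solves x = 0.5 * x + 1, i.e. x = 2.
def T(x):
    return 0.5 * x + 1.0


limit_a, steps_a = fixed_point_iterate(T, x0=-10.0)
limit_b, steps_b = fixed_point_iterate(T, x0=35.0)
print(limit_a, limit_b)
```

Both runs approach the same point regardless of the starting value, which is the behavior Theorem 1 describes for a single contraction on a complete space.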
The discussion above is for a single metric space and a single mapping. However, in this work we consider the general case of multiple metric spaces and multiple mappings as described in (1). Some of the results above, e.g., axioms A1–A4 and Definitions 1 and 2, remain relevant when we consider each metric space in the sequence in isolation. In that case, the relationships above carry the superscript $(m)$ for the elements of the metric space $X^{(m)}$ with the corresponding distance function $d^{(m)}$. However, new definitions are needed when multiple metric spaces and mappings are involved.
Definition 3 (Lipschitz coefficient)
. For the map , let The coefficient of the map is defined as Note that since
, by axiom A2, the denominator is non-zero. If
is a trivial constant map, then the numerator and
are equal to zero. We
only consider non-trivial maps such that is positive. If
is finite-valued, we have the following inequality:
Note that when
,
, and both sides of (
3) are equal to zero; i.e., (
3) is still valid.
Definition 4 (Contraction mapping)
. The (non-trivial) mapping is a contraction if and we have. Definitions 3 and 4 are generalizations of the classical notions in a single metric space and mapping to sequences of metric spaces and mappings. Note that inequality (
3) applies to any mapping
, but (
4) applies only to mappings that are contractions.
4. Different Mappings on the Same Metric Space
We now consider the case where all spaces are the same; i.e.,
, for all
m and
. Since all points are now in the same space, we denote the sequence
(from the iterations) as
, where
Two new definitions, pertaining to the properties of the mappings, are first given.
Definition 5 (Pairwise contraction)
. A pair of mappings is a pairwise contraction if for any such that , there exists a positive constant such that For a sequence of mappings
, we define the sequence of Lipschitz coefficients
(
) as
where
Definition 6 (Sequential contraction)
. A sequence of mappings is a sequential contraction if the following conditions hold:
- 1.
The sequence of Lipschitz coefficients () are positive and finite-valued; i.e., .
- 2.
There exists a finite positive integer L such that is a pairwise contraction for all ; i.e.,for any such that , and .
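Remark 2.
- 1.
The definition above reduces to the classical definition of a contraction mapping if all the mappings are the same; i.e., every mapping in the sequence equals a single mapping $T$.
- 2.
If $T$ is a contraction mapping (in the classical sense), then the pair formed by $T$ with itself is a pairwise contraction, and the constant sequence of mappings in which every term is $T$ is a sequential contraction.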
We then have the following result.
Theorem 3. Consider the iterates from (
11)
on a complete metric space X. Suppose, for a given initial point , the following conditions hold: - 1.
The sequence is a sequential contraction.
- 2.
There exists a finite positive integer L such that - 3.
for .
Then, for any initial point whose generated iterates satisfy , for all , there exists a limit point such that Proof. Since
is positive and finite, we have
for all
and
. For
,
, and for
,
can be any finite positive value. Since
, using (
14) repeatedly, we have
Using (
13), for
we have
Using the generalized triangle inequality (
2), for
we have
Applying inequality (
16) to each term on the R.H.S. we have
Applying the geometric series formula
to the summation in brackets, where
,
and
, we have
Now (
15) implies that
and therefore
. Using this inequality we have
Now
is a finite product of terms involving
,
and
. Since the latter are all finite,
is also finite. The term
is positive and finite as
. As a consequence of (
15),
and therefore
can be made sufficiently small by choosing a sufficiently large
m. This means that for every
, there exists a
K such that
meaning
is a Cauchy sequence. Since the space
X is complete, by Definition 2, a limit point exists. □
Remark 3.
- 1.
Although a limit point exists, it is, in general, not a fixed point of any of the mappings.
- 2.
If all mappings are the same and are contractions, the conditions of the Banach Fixed Point Theorem apply, and the limit point is also the fixed point of the mapping.
- 3.
The limit point will, in general, depend on the initial point.
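Items 1 and 3 of the remark can be pictured with a toy sequence of mappings on the real line. The sketch below is only an illustration under assumed, hypothetical maps $T_m(x) = x + 2^{-m}$. It is not claimed to verify the hypotheses of Theorem 3. It shows iterates that form a Cauchy sequence because successive steps shrink geometrically, whose limit depends on the starting point, and where no individual map has a fixed point.

```python
def make_shift(m):
    """Hypothetical m-th mapping T_m(x) = x + 2**(-m), a translation of the real line."""
    return lambda x: x + 2.0 ** (-m)


def iterate_sequence(x0, n_maps):
    """Apply T_1, then T_2, ..., then T_{n_maps}: a different map at every step."""
    x = x0
    for m in range(1, n_maps + 1):
        x = make_shift(m)(x)
    return x


# The step sizes 2**(-m) sum to 1, so the iterates approach x0 + 1.
limit_from_0 = iterate_sequence(0.0, 60)
limit_from_5 = iterate_sequence(5.0, 60)
print(limit_from_0, limit_from_5)
```

Each translation moves every point, so none of these maps has a fixed point, yet the composed iterates still settle at a limit that shifts with the initial point.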
A simple formula for estimating the number of required iterations, when , is given by the following.
Corollary 2. If all conditions of Theorem 3 are satisfied, when , for a given , the boundis achieved if the number of iterations M satisfieswhere Proof. As
,
. Therefore, using inequality (
17), we have
Taking the logarithm (of any base) of both sides, and noting that
and
, so that the inequality is reversed, we have
Inequality (
18) is then obtained. □
5. Comparisons
Sequences of mappings have also been considered previously in [13,14,15,16]. We now make detailed comparisons with the previous works, and show that the previous results are fundamentally different from the results in this work.
The fundamental concept in previous works is to consider a sequence of self-mappings $T_n: X \to X$ on a complete metric space $X$ (with distance function $d$). The mappings are assumed to be contractions but with the possibility of different Lipschitz coefficients:
$$d(T_n(x), T_n(y)) \le c_n\, d(x, y),$$
where $0 < c_n < 1$ for all $n$, and in general $c_n \neq c_k$ ($n \neq k$). By the Banach Fixed Point Theorem, there exists a unique fixed point for each $T_n$; i.e.,
$$T_n(x_n^*) = x_n^*.$$
The condition imposed on the sequence of mappings is that there exists a limiting self-map $T$, which is a contraction, such that
$$\lim_{n \to \infty} T_n = T. \tag{19}$$
The convergence can be pointwise or uniform over $X$. Then the sequence of fixed points $\{x_n^*\}$ also converges to the fixed point $x^*$ of $T$; i.e.,
$$\lim_{n \to \infty} x_n^* = x^*. \tag{20}$$
The results above do not explicitly refer to any iteration process to obtain the fixed points. However, Theorem 1 implies that each fixed point $x_n^*$ can be obtained as follows. Given any initial point $x_{n,0}$, we perform the following iterations:
$$x_{n,k+1} = T_n(x_{n,k}), \quad k = 0, 1, 2, \ldots, \tag{21}$$
and
$$\lim_{k \to \infty} x_{n,k} = x_n^*. \tag{22}$$
In (21) and (22), the subscript ‘$k$’ tracks the iteration number, and the subscript ‘$n$’ tracks the mapping that is used. In general, a different initial point $x_{n,0}$ can be used for each mapping $T_n$. The important observation is that the same mapping is used for all the iterations in (21). This is fundamentally different from (11), where, in general, a different mapping is used in each iteration. The condition for convergence in (20) is the existence of a limiting map as shown in (19), but no such condition is needed for (11). Although the limit point exists in both cases, the limit point from (11) is, in general, not associated with any fixed point of the maps, unlike in (19), where the limit point is also the fixed point of the map. Furthermore, not all maps in (11) need to be contractions; only the maps after a finite number of iterations need to be contractions. In contrast, all maps in (19) are contractions.
From an application perspective, the scenario described above, where there are multiple iterative algorithms as implied in (21) (one for each $n$), is not something typically found in practice. Equations (19) and (20) are therefore primarily of theoretical interest. The iterations in (1), however, are found in practice, e.g., in algorithm unrolling, and we will next study a concrete practical example.
6. Unrolled ISTA Algorithm
We now analyze the convergence of a well-known iterative algorithm with the aid of the previous result. The relevant metric space is the Banach space of vectors in
with the
distance function
The conventional Iterative Shrinkage–Thresholding Algorithm (ISTA) [
2,
3] can be described by the following iterations:
where
and
are parameters of the algorithm and
is the given input. The soft thresholding function
is applied element-wise to vectors in
, and for a scalar
, is defined as
Starting with an initial point
, a sequence of iterates
is computed using (
23). The algorithm arises in the context of LASSO (least absolute shrinkage and selection operator) in statistics and sparse coding in signal processing. In sparse coding, given an input
, the goal is to find a parsimonious representation of
using an overcomplete dictionary of vectors, which are columns of
. A common approach to achieve this is to solve the following convex optimization problem:
where
is the regularization parameter that controls the level of sparsity. The solution to (
25) can be achieved using the iterations in (
23), where the parameter
is the iteration step size.
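The conventional ISTA iteration can be sketched in a few lines of NumPy. The sketch assumes the standard form of the problem, namely minimizing $\tfrac{1}{2}\|y - Ax\|_2^2 + \lambda \|x\|_1$ with step size $\mu$ and threshold $\lambda\mu$. The symbols `A`, `y`, `lam`, `mu` and the function names are assumptions made for this example and may differ from the notation of (23)–(25).

```python
import numpy as np


def soft_threshold(v, tau):
    """Element-wise soft thresholding: shrink each entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def ista(A, y, lam, mu, n_iter=200):
    """Run n_iter ISTA steps for min_x 0.5*||y - A x||_2^2 + lam*||x||_1 (assumed form)."""
    n = A.shape[1]
    x = np.zeros(n)
    W = np.eye(n) - mu * A.T @ A   # linear part of the update
    b = mu * A.T @ y               # offset that depends on the input y
    for _ in range(n_iter):
        x = soft_threshold(W @ x + b, lam * mu)
    return x


# Sanity check on a hypothetical problem with A = I, where the minimizer
# is known in closed form: soft_threshold(y, lam).
y_demo = np.array([3.0, -0.2, 0.5, -4.0])
x_demo = ista(np.eye(4), y_demo, lam=1.0, mu=0.5)
print(x_demo)
```

With the identity dictionary the problem decouples per coordinate, so the result can be compared directly with `soft_threshold(y_demo, 1.0)`.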
The unrolled ISTA [
1], by generalizing (
23), is given by
where the parameters of the mappings are
,
and
. When the parameters
,
and
are determined via a machine learning framework, i.e., data-driven, we have what is commonly known as Learned ISTA or LISTA.
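A hedged sketch of the unrolled iteration follows. It assumes each layer applies soft thresholding to an affine map, with its own matrix `W`, offset `b` and threshold `tau`. These names, the random test problem and the layer count are hypothetical. When every layer is tied to the ISTA values, the unrolled network should reproduce plain ISTA, which the snippet checks numerically.

```python
import numpy as np


def soft_threshold(v, tau):
    """Element-wise soft thresholding with threshold tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def unrolled_ista(x0, layers):
    """Apply one (W, b, tau) layer per iteration; parameters may differ per layer."""
    x = x0
    for W, b, tau in layers:
        x = soft_threshold(W @ x + b, tau)
    return x


# Hypothetical setup: tie every layer to the ISTA parameter values.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
y = rng.standard_normal(8)
lam = 0.1
mu = 1.0 / np.linalg.norm(A, 2) ** 2   # step size from the spectral norm of A
tied_layers = [(np.eye(5) - mu * A.T @ A, mu * A.T @ y, lam * mu)] * 50
x_tied = unrolled_ista(np.zeros(5), tied_layers)

# Plain ISTA written out directly, for comparison.
x_ref = np.zeros(5)
for _ in range(50):
    x_ref = soft_threshold(x_ref - mu * A.T @ (A @ x_ref - y), lam * mu)
print(np.max(np.abs(x_tied - x_ref)))
```

In a learned variant the entries of `layers` would come from training rather than from a fixed dictionary; that training step is outside the scope of this sketch.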
We will establish conditions for the convergence of (
26). We first prove a relevant property of the soft thresholding function.
Lemma 1. The function is
non-expansive
for any ; i.e.,for all . Proof. We first establish the scalar form of (
27). For any
, define
Due to the piece-wise nature of
, as shown in (
24), there are four cases to consider:
- 1.
For
:
Therefore and since , we have .
- 2.
For
,
:
Therefore
. Since
,
Due to symmetry, this case is similar to the , case.
- 3.
Due to symmetry, this case is similar to the case.
- 4.
For
,
:
Due to symmetry, this case is similar to the , case.
Therefore, for all
, we have
Now consider the square of the L.H.S. of (
27). Using (
28), we have
Taking the square root yields the desired result. □
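The non-expansive property in Lemma 1 can also be checked numerically. The following sketch draws random pairs of vectors and thresholds (all hypothetical test data) and records the largest observed ratio between the distance after thresholding and the distance before. It is a sanity check, not a substitute for the proof.

```python
import numpy as np


def soft_threshold(v, tau):
    """Element-wise soft thresholding with threshold tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


rng = np.random.default_rng(1)
worst_ratio = 0.0
for _ in range(1000):
    x = rng.normal(scale=3.0, size=6)
    y = rng.normal(scale=3.0, size=6)
    tau = rng.uniform(0.0, 2.0)
    after = np.linalg.norm(soft_threshold(x, tau) - soft_threshold(y, tau))
    before = np.linalg.norm(x - y)
    worst_ratio = max(worst_ratio, after / before)
print(worst_ratio)
```

A ratio that never exceeds one (up to floating-point rounding) is what the lemma predicts.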
The induced/operator norm is defined as
which is also the spectral norm. The convergence of the unrolled ISTA is given by the following.
Theorem 4. Consider two initial points and , and the corresponding iterates, and , respectively, using (
26)
. Suppose the parameters of the mappings satisfy the following conditions: - 1.
is positive and finite-valued for all m.
- 2.
There exists a finite positive integer L such that
Proof. From (
26), with the parameters of the mapping suppressed for brevity, we have
. For any two arbitrary points
, using (
27), we have
Using the definition of the operator norm on the last expression, we have
The Lipschitz coefficients are then (). By invoking Theorem 2, the required result is obtained. □
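The chain of inequalities in the proof suggests that the distance between two sets of iterates is bounded by the initial distance multiplied by the product of the spectral norms of the layer matrices. The sketch below checks this numerically under assumed conditions: random layer matrices rescaled to spectral norm 0.9, random offsets and thresholds, and two random initial points `x` and `z`. All of these values are hypothetical.

```python
import numpy as np


def soft_threshold(v, tau):
    """Element-wise soft thresholding with threshold tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


rng = np.random.default_rng(2)
n, n_layers = 5, 30
x = rng.standard_normal(n)            # first initial point
z = rng.standard_normal(n)            # second initial point
initial_gap = np.linalg.norm(x - z)
bound = initial_gap
gaps, bounds = [], []
for _ in range(n_layers):
    W = rng.standard_normal((n, n))
    W *= 0.9 / np.linalg.norm(W, 2)   # rescale so the spectral norm is 0.9
    b = rng.standard_normal(n)
    tau = rng.uniform(0.0, 1.0)
    x = soft_threshold(W @ x + b, tau)
    z = soft_threshold(W @ z + b, tau)
    bound *= np.linalg.norm(W, 2)     # running product of spectral norms
    gaps.append(np.linalg.norm(x - z))
    bounds.append(bound)
print(gaps[-1], bounds[-1])
```

Because every spectral norm is below one here, the bound decays geometrically and the two trajectories are driven together, independently of the offsets and thresholds.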
A special case of unrolling is when, instead of the generic in each iteration, we use a predetermined dictionary , and allow the step size to vary between iterations. We then have the following corollary.
Corollary 3. Let and denote, respectively, the smallest and largest singular value of . Supposeand the following conditions are satisfied: - 1.
The matrix has full column rank, so that is positive definite, with singular values that satisfy - 2.
There exists a small positive ϵ () such thatfor all m.
Convergence in (
30)
is then achieved. Proof. We will show that conditions 1 and 2 of Theorem 4 are satisfied. The theorem is then invoked to prove the corollary. Firstly, we have
This means that
is symmetric and has real eigenvalues. Therefore, the singular value (
) squared is equal to the eigenvalue (
) squared; i.e.,
and
. Using some fundamental identities of the eigenvalues of a matrix and the definition of singular values, we have
Since the operator norm is also equal to the spectral norm, we have
Firstly, condition (
33) ensures that
is positive. Now condition (
32) implies that there are at least two distinct singular values of
. Therefore the last expression in (
34) cannot be zero. Furthermore, since condition (
32) implies that all singular values
are finite, the last expression in (
34) must be finite. Therefore
is finite and positive, which is condition 1 of Theorem 4.
Condition 2 of Theorem 4 (inequality (
29)) is satisfied if there exists a small positive
(
) such that
for all
m. This is achieved if
for all
i. This implies
This condition is ensured by (
33). Therefore condition 2 of Theorem 4 is satisfied. □
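For a fixed dictionary and a varying step size, the spectral norm of the layer matrix can be computed directly from the singular values of the dictionary. The sketch below assumes the layer matrix has the form `I - mu * A.T @ A` and that the step size lies strictly between zero and two divided by the largest squared singular value. The matrix `A` and the grid of step sizes are hypothetical test data.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((10, 4))   # tall random matrix, full column rank almost surely
sigma = np.linalg.svd(A, compute_uv=False)
sigma_min, sigma_max = sigma.min(), sigma.max()

norms = []
for mu in np.linspace(0.05, 0.95, 10) * 2.0 / sigma_max**2:
    W = np.eye(4) - mu * A.T @ A
    spectral = np.linalg.norm(W, 2)
    # The eigenvalues of W are 1 - mu * sigma_i**2, so the largest magnitude
    # occurs at the smallest or the largest singular value.
    predicted = max(abs(1 - mu * sigma_min**2), abs(1 - mu * sigma_max**2))
    norms.append((mu, spectral, predicted))

for mu, spectral, predicted in norms:
    print(f"mu={mu:.4f}  norm={spectral:.6f}  predicted={predicted:.6f}")
```

Every step size on this grid gives a spectral norm below one, which is the kind of contraction condition the corollary relies on.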
Remark 4.
- 1.
Note that the condition in (29) is sufficient but not necessary.
- 2.
In both Theorem 4 and Corollary 3, there is no restriction on either the soft-threshold parameters , or the matrix .
- 3.
In the machine learning paradigm, the expressivity of the algorithm generally increases with the number of parameters. With the special case of unrolling in (
31)
, there is only one parameter in . However, for convergence, it is not necessary to haveeven though that is the case with the original ISTA algorithm in (
23)
. A general , which has parameters, can be used and results in higher expressivity. - 4.
The term is a generalization of the term in (
23)
, which is related to the input . In the original ISTA formulation, there is only one input, but in a data-driven machine learning framework, there are multiple inputs . This generalization allows the algorithm to adapt to this situation.