Article

Mitigating the Drawbacks of the L0 Norm and the Total Variation Norm

by
Gengsheng L. Zeng
Department of Computer Science, Utah Valley University, Orem, UT 84058, USA
Axioms 2025, 14(8), 605; https://doi.org/10.3390/axioms14080605
Submission received: 27 June 2025 / Revised: 23 July 2025 / Accepted: 31 July 2025 / Published: 4 August 2025

Abstract

In compressed sensing, it is believed that the L0 norm minimization is the best way to enforce a sparse solution. However, the L0 norm is difficult to implement in a gradient-based iterative image reconstruction algorithm. The total variation (TV) norm minimization is considered a proper substitute for the L0 norm minimization. This paper points out that the TV norm is not powerful enough to enforce a piecewise-constant image. This paper uses limited-angle tomography to illustrate the possibility of using the L0 norm to encourage a piecewise-constant image. However, one of the drawbacks of the L0 norm is that its derivative is zero almost everywhere, making a gradient-based algorithm useless. Our novel idea is to replace the zero value of the L0 norm derivative with a zero-mean random variable. Computer simulations show that the proposed L0 norm minimization outperforms the TV minimization. The novelty of this paper is the introduction of some randomness into the gradient of the objective function when the gradient is zero. The quantitative evaluations indicate the improvement of the proposed method in terms of the structural similarity (SSIM) and the peak signal-to-noise ratio (PSNR).
MSC:
49N30; 49Q05; 65K10; 68U10; 68W20; 68W25; 68W40; 90C23

1. Introduction

The focus of this paper is on the L0 norm minimization. If x is a scalar, the L0 norm of x is defined as [1,2,3].
$$\|x\|_{L_0} = \begin{cases} 0, & \text{if } x = 0,\\ 1, & \text{if } x \neq 0. \end{cases} \tag{1}$$
If X is a vector or a matrix, the L0 norm of X is defined as
$$\|X\|_{L_0} = \text{the total count of the nonzero elements in } X. \tag{2}$$
Here we use the term ‘norm’ even though the L0 norm is not a true norm in the mathematical sense. It is better to refer to the L0 norm as the cardinality function or sparsity measure. A norm must have the positive homogeneity property
$$\|kx\| = k\,\|x\|, \tag{3}$$
for all x in its domain and k > 0 . Clearly, the L0 norm does not satisfy the positive homogeneity property (3). In fact, according to (1), we have
$$\|kx\|_{L_0} = \|x\|_{L_0}, \tag{4}$$
for all x in its domain and k > 0 .
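As a small illustration of the counting definition (2) and of why the homogeneity property (3) fails, the following Python sketch may help. It is illustrative only: the function name l0_norm and the optional tolerance tol are placeholders and are not part of the definitions above; with tol = 0 the function reproduces the exact count of nonzero elements.

```python
import numpy as np

def l0_norm(x, tol=0.0):
    """Count the nonzero entries of a vector or matrix, as in definition (2).
    `tol` is an optional tolerance of our own; tol=0 gives the exact count."""
    x = np.asarray(x, dtype=float)
    return int(np.count_nonzero(np.abs(x) > tol))

# A sparse vector has a small L0 "norm"; scaling it does not change the count,
# which is exactly why the positive homogeneity property (3) fails, cf. (4).
x = np.array([0.0, 3.0, 0.0, 0.0, -1.5])
print(l0_norm(x), l0_norm(10.0 * x))  # 2 2
```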
One important application of the L0 norm is in compressed sensing, which was introduced in the first decade of this century [4,5,6,7,8,9,10]. The principle of compressed sensing is to exactly reconstruct a signal using a small number of samples if the signal is sparse. The problem can be described as a system of linear equations:
$$AX = P, \tag{5}$$
where A is the system matrix, P is the measurement vector, and X is the unknown vector. In a compressed sensing problem, the number of measurements in P is much smaller than the number of unknowns in X . In other words, the system matrix A has more columns than rows. We call the solution X sparse if most of the elements in X are zero.
Another situation in compressed sensing is that the solution X of the system (5) is not sparse; however, a transformed version of X is sparse. If such a sparsification transformation operator is denoted by ψ , and the elements in vector Y
$$Y = \psi X \tag{6}$$
are dominated by zeros, then Y is sparse. The total count of non-zero elements in Y is the L0 norm of Y. In theory, the solution X can be obtained by minimizing the following objective function F:
$$F = \|AX - P\|_{L_2}^2 + \alpha\,\|\psi X\|_{L_0}, \tag{7}$$
where α is a tuning parameter.
Let us explain (5) and (7) further. Many real-world problems can be modeled as a system of linear equations (5), where the measurements are represented as a vector P and the object being measured is represented by a vector X. The measurements P usually contain noise. The vector X is a discretized version of the real-world object to be estimated. The system matrix A describes the first-order approximation of the measurement physics. The matrix A is assumed to be known. In compressed sensing, the object is under-sampled. Thus, the size of P is smaller than the size of X, and the matrix A has more columns than rows.
The first term in (7), $\|AX - P\|_{L_2}^2$, is the data fidelity term. Minimizing this term is equivalent to solving (5). When the system (5) is under-determined, the solution of (5) is not unique. The second term in (7), $\alpha\,\|\psi X\|_{L_0}$, is the Bayesian term. Minimizing this term enforces $\psi X$ to be sparse, or enforces X to be piecewise constant.
Since the L0 norm counts the total number of non-zero elements, it is difficult to use a gradient-based algorithm to minimize the objective function F defined in (7). In fact, minimizing the L0 norm is an NP-hard problem [11,12,13,14,15], making it computationally infeasible for large-scale problems.
Many researchers have attempted to minimize the L0 norm and found it difficult to deal with directly [2,3,16,17,18,19,20,21,22,23,24,25]. One approach is to approximate the L0 norm by other norms [2,3,16,17,18,19]. Another is to convert an L0-norm minimization problem into an integer programming problem [20,21]. Still another is to approximate the L0 norm by a smooth function [22,23,24,25]. The L0-norm minimization problem can also be decomposed into sub-problems by using thresholds [26,27]. Robitzsch compared an Lp-norm method with an approximate L0-norm method and showed that the L0-norm method is slightly better [28].
The most popular method to replace the L0 norm is the total variation (TV) norm [29,30,31]. In the TV methods, the TV norm of the unknown X is used to replace the second term in (7). The TV-norm minimization method uses the finite difference as the sparsification operator ψ and the L1 norm as an approximation of the L0 norm. If X is piecewise constant, its finite-difference version is sparse. The L1 norm for a vector X of n elements is given by
$$\|X\|_{L_1} = \sum_{i=1}^{n} |x_i|. \tag{8}$$
Thus, the TV norm of a vector X of n elements is
$$\|X\|_{TV} = \sum_{i=1}^{n-1} |x_{i+1} - x_i|. \tag{9}$$
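As a small illustration of (8) and (9), the following Python sketch computes the L1 and TV norms of a one-dimensional signal. The helper names are placeholders and not part of the paper. Note that a piecewise-constant signal has only a few nonzero finite differences, so its TV value remains small even though the signal itself is not sparse.

```python
import numpy as np

def l1_norm(x):
    """L1 norm of a vector, as in (8)."""
    return float(np.sum(np.abs(x)))

def tv_norm_1d(x):
    """Total variation of a vector: the sum of absolute finite differences, as in (9)."""
    return float(np.sum(np.abs(np.diff(x))))

x = np.array([2.0, 2.0, 2.0, 5.0, 5.0, 1.0, 1.0, 1.0])  # piecewise constant
print(l1_norm(x), tv_norm_1d(x))  # 19.0 7.0
```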
Using the TV norm optimization, the objective function (7) for the vector case becomes
$$F = \|AX - P\|_{L_2}^2 + \alpha\,\|X\|_{TV} = \|AX - P\|_{L_2}^2 + \alpha \sum_{i=1}^{n-1} |x_{i+1} - x_i|. \tag{10}$$
The TV norm of a matrix is not a direct extension of (9) to a higher dimension. We have two different definitions of the TV norm for a two-dimensional matrix (that is, a two-dimensional image) based on the two different ways to define the finite difference [1]. One TV norm is referred to as the isotropic TV norm and the other TV norm is referred to as the anisotropic TV norm. We use double indices for each element of the matrix X .
The isotropic TV norm is defined as [1]
$$TV_{iso}(X) = \sum_{i,j} \sqrt{(x_{i+1,j} - x_{i,j})^2 + (x_{i,j+1} - x_{i,j})^2}, \tag{11}$$
and the anisotropic TV norm is defined by [1]
$$TV_{aniso}(X) = \sum_{i,j} \bigl( |x_{i+1,j} - x_{i,j}| + |x_{i,j+1} - x_{i,j}| \bigr). \tag{12}$$
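A minimal Python sketch of the two-dimensional definitions (11) and (12) is given below. It is illustrative only; the handling of the image border (simply omitting the out-of-range differences) is an implementation choice that is not specified in the text.

```python
import numpy as np

def tv_iso(X):
    """Isotropic TV of a 2D image, following (11)."""
    dx = np.diff(X, axis=0)[:, :-1]   # x[i+1, j] - x[i, j]
    dy = np.diff(X, axis=1)[:-1, :]   # x[i, j+1] - x[i, j]
    return float(np.sum(np.sqrt(dx**2 + dy**2)))

def tv_aniso(X):
    """Anisotropic TV of a 2D image, following (12)."""
    return float(np.sum(np.abs(np.diff(X, axis=0))) + np.sum(np.abs(np.diff(X, axis=1))))
```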
For the two-dimensional image case, the objective function (10) can then be written as
$$F = \|AX - P\|_{L_2}^2 + \alpha\, TV_{iso}(X) = \|AX - P\|_{L_2}^2 + \alpha \sum_{i,j} \sqrt{(x_{i+1,j} - x_{i,j})^2 + (x_{i,j+1} - x_{i,j})^2} \tag{13}$$
and
$$F = \|AX - P\|_{L_2}^2 + \alpha\, TV_{aniso}(X) = \|AX - P\|_{L_2}^2 + \alpha \sum_{i,j} \bigl( |x_{i+1,j} - x_{i,j}| + |x_{i,j+1} - x_{i,j}| \bigr), \tag{14}$$
respectively.
The justification for using the L1 norm to approximate the L0 norm is that both prefer a solution with more zeros. Figure 1 shows an exemplary solution line $x_2 = m x_1 + b$; any point $(x_1, x_2)$ on this line is a solution to (5). This solution line may intersect the coordinate axes at two points, $p_1 = (-b/m, 0)$ and $p_2 = (0, b)$. The L0 minimization method will select either $p_1$ or $p_2$ as the solution, because $\|p_1\|_{L_0} = \|p_2\|_{L_0} = 1$, while any other point on the line has the larger L0 norm $\|(x_1, x_2)\|_{L_0} = 2$. As for the L1 norm, $\|p_1\|_{L_1} = |b/m|$ and $\|p_2\|_{L_1} = |b|$. The L1 minimization method will select $p_1$ if $|m| > 1$, $p_2$ if $|m| < 1$, and either $p_1$ or $p_2$ if $|m| = 1$. Therefore, L0 minimization and L1 minimization may select the same solutions. The justification illustrated in our toy example is unrealistic, however, because in practice we do not have the luxury of obtaining this solution line to search along.
In the objective function (10), the first term, involving the L2 norm, dominates. Thus, when the parameter α is small, the optimal solution of (10) may not make $\psi X$ sparse.
A drawback of the TV norm is that it cannot tell the difference between a smooth monotonic transition and a sharp monotonic transition (see Figure 2).
Therefore, using the L0 norm in the objective function (7) may work better than the objective functions (13) and (14), in the sense that the L0 norm can tell the difference between the two curves shown in Figure 2. A drawback of applying the L0 norm to the finite differences is that we do not have an effective and efficient way to minimize the L0 norm. The main goal of this paper is to directly minimize the objective function (7) with the L0 norm and to find an innovative way of handling it.

2. Methods

In Section 1, we analyzed the drawbacks of the TV norm and the L0 norm. In this section, we develop a method that replaces the TV norm with the L0 norm in a practical iterative algorithm.
We notice from the traditional definition of the L0 norm (see Figure 3 Top)
$$\|x\|_{L_0} = \begin{cases} 0, & \text{if } x = 0,\\ 1, & \text{if } x \neq 0, \end{cases}$$
that the condition “if x = 0” is hardly ever satisfied exactly in a practical computer algorithm, because in practice a very small value (for example, an image pixel value of 0.000000000001) should be treated as zero even though it is not exactly zero. For the scalar case, it is reasonable to replace the definition (1) by (15) below:
$$f_0(x) = \begin{cases} |x|/c, & \text{if } |x| \leq c,\\ 1, & \text{if } |x| > c, \end{cases} \tag{15}$$
for a chosen $c > 0$. In (15), c is a tuning parameter, determined by trial-and-error. This modification is illustrated in Figure 3 (Middle). The piecewise-linear function $f_0(x)$ defined in (15) is continuous and is a good approximation of $\|x\|_{L_0}$ when $c \to 0$.
In fact, we can have many versions of the L0 norm. For example, the version shown in Figure 3 (Bottom) is a smooth function; the derivative of the curve exists everywhere. One such smooth function is
$$g_0(x) = 1 - e^{-x^2/c}$$
for a small c > 0 .
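Both surrogates can be coded directly. The sketch below is illustrative only; the function names are placeholders, and the default c = 0.1 simply anticipates the value selected later by trial-and-error.

```python
import numpy as np

def f0(x, c=0.1):
    """Piecewise-linear surrogate of the scalar L0 norm, as in (15)."""
    ax = np.abs(x)
    return np.where(ax <= c, ax / c, 1.0)

def g0(x, c=0.1):
    """Smooth surrogate 1 - exp(-x^2 / c) (Figure 3, Bottom)."""
    return 1.0 - np.exp(-np.square(x) / c)
```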
The requirement of differentiability everywhere is, in fact, not necessary. We believe that a gradient-based optimization algorithm only requires the existence of the left and right derivatives everywhere. Let us consider a toy example of
$$f(x) = |x|.$$
We want to find the minimum of this function. It is obvious that the solution is
$$x = 0.$$
We notice that the function $f(x) = |x|$ is not differentiable at $x = 0$. At $x = 0$, the left and right derivatives of $f(x)$ are $f'_{left}(0) = -1$ and $f'_{right}(0) = 1$, respectively. A gradient-based optimization algorithm can be crafted as
$$x_{next} = x_{current} - \lambda\,\frac{f'_{left}(x_{current}) + f'_{right}(x_{current})}{2}, \tag{18}$$
where the parameter $\lambda > 0$ controls the step size of the iterative optimization algorithm. Algorithm (18) is nothing but the commonly used gradient descent algorithm, except that the gradient is replaced by the average of the left and right derivatives. Therefore, our definition of the L0 norm is friendly to gradient-based iterative optimization algorithms, and using a smooth function $g_0(x)$, as shown in Figure 3 (Bottom), to approximate the L0 norm is not necessary.
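The toy example can be checked numerically. The following sketch is illustrative only (the step size and iteration count are arbitrary); it applies the update (18) to f(x) = |x|, and the iterate decreases toward zero and then oscillates within one step size of it, as expected for a fixed-step method of this kind.

```python
def minimize_abs(x0, lam=0.2, n_iter=50):
    """Gradient descent on f(x) = |x| using the average of the one-sided
    derivatives, as in (18)."""
    x = x0
    for _ in range(n_iter):
        if x > 0:
            grad = 1.0
        elif x < 0:
            grad = -1.0
        else:
            grad = 0.0  # (f'_left(0) + f'_right(0)) / 2 = (-1 + 1) / 2
        x -= lam * grad
    return x

print(minimize_abs(3.7))  # ends near 0 (within one step size)
```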
To extend the revised L0 definition from the scalar case (15) to a matrix X, whose elements are denoted $x_{i,j}$, we have
$$f_0(X) = \sum_{i,j} f_0(x_{i,j}), \tag{19}$$
where $f_0(x_{i,j})$ is defined in (15). The definition (19) is still not effective in a gradient-based iterative optimization algorithm. The partial derivative of $f_0(X)$ with respect to $x_{i,j}$ is calculated as
$$\frac{\partial f_0(X)}{\partial x_{i,j}} = \frac{d f_0(x_{i,j})}{d x_{i,j}} = \begin{cases} \dfrac{1}{c}\,\mathrm{sgn}(x_{i,j}), & \text{if } |x_{i,j}| \leq c,\\[4pt] 0, & \text{if } |x_{i,j}| > c. \end{cases} \tag{20}$$
Let
$$\mathrm{step}(x) = \begin{cases} 1, & \text{if } x \geq 0,\\ 0, & \text{if } x < 0. \end{cases} \tag{21}$$
Then (20) can be expressed as
$$\frac{\partial f_0(X)}{\partial x_{i,j}} = \frac{d f_0(x_{i,j})}{d x_{i,j}} = \frac{1}{c}\,\mathrm{sgn}(x_{i,j})\,\mathrm{step}\bigl(c - |x_{i,j}|\bigr). \tag{22}$$
The plot of (22) is shown in Figure 4. It is observed that when $|x| > c$ the derivative is 0. In other words, there will be no update action in the iterative algorithm for most of the image pixels. This makes the optimization algorithm almost inactive and ineffective.
In order to obtain more action in the iterative algorithm, our next innovation is to replace the zero values in $df_0(x)/dx$ with small zero-mean random variables. Thus, Figure 4 becomes Figure 5. Since the optimization algorithm only needs the expression of $df_0(x)/dx$ and does not care whether a corresponding expression for $f_0(x)$ exists, we do not bother to investigate which definition of $f_0(x)$ corresponds to the $df_0(x)/dx$ shown in Figure 5.
The mathematical expression for the revised derivative shown in Figure 5 is given as
$$\frac{\partial f_0(X)}{\partial x_{i,j}} = \frac{d f_0(x_{i,j})}{d x_{i,j}} = \frac{1}{c}\,\mathrm{sgn}(x_{i,j})\,\mathrm{step}\bigl(c - |x_{i,j}|\bigr) + \mathrm{rand}\times\mathrm{step}\bigl(|x_{i,j}| - c\bigr), \tag{23}$$
where rand is a small zero-mean random variable. The tuning parameter c is selected by trial-and-error, depending on the application.
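A short Python sketch of the plain derivative (22) and its randomized replacement (23) follows. It is illustrative only; the uniform [-1, 1] noise and c = 0.1 anticipate the values reported later in this section, and the helper names are placeholders.

```python
import numpy as np

def step(x):
    """Unit step function, as in (21)."""
    return (np.asarray(x) >= 0).astype(float)

def d_f0(x, c=0.1):
    """Derivative of the piecewise-linear L0 surrogate, as in (22).
    It is zero wherever |x| > c, which stalls a gradient-based update."""
    x = np.asarray(x, dtype=float)
    return (1.0 / c) * np.sign(x) * step(c - np.abs(x))

def d_f0_randomized(x, c=0.1, rng=None):
    """Revised derivative, as in (23): the zero region is replaced by a
    zero-mean random value uniformly distributed in [-1, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    rand = rng.uniform(-1.0, 1.0, size=x.shape)
    return (1.0 / c) * np.sign(x) * step(c - np.abs(x)) + rand * step(np.abs(x) - c)
```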
We remind the reader that our images are not sparse, but piecewise constant. We need a sparsification transformation operator ψ to convert a piecewise-constant image into a sparse image. We chose the finite difference operator as the sparsification transformation operator ψ. Just like the isotropic (11) and anisotropic (12) TV definitions, we can have two definitions of $\partial\|X\|_{L_0}/\partial x_{i,j}$: an isotropic version and an anisotropic version.
The isotropic version is defined as
$$
\begin{aligned}
\frac{\partial \|X\|_{L_0}}{\partial x_{i,j}}
&= \frac{\partial}{\partial x_{i,j}} \sum_{n,m}\Bigl\| \sqrt{(x_{n+1,m}-x_{n,m})^2 + (x_{n,m+1}-x_{n,m})^2}\,\Bigr\|_{L_0}
 = \sum_{n,m} g(n,m)\,\frac{\partial u(n,m)}{\partial x_{i,j}} \\
&= g(i-1,j)\,\frac{x_{i,j}-x_{i-1,j}}{u(i-1,j)}
 + g(i,j-1)\,\frac{x_{i,j}-x_{i,j-1}}{u(i,j-1)}
 - g(i,j)\,\frac{x_{i+1,j}+x_{i,j+1}-2x_{i,j}}{u(i,j)},
\end{aligned}
\tag{24}
$$
with
$$u(n,m) = \sqrt{(x_{n+1,m}-x_{n,m})^2 + (x_{n,m+1}-x_{n,m})^2} \tag{25}$$
and
$$g(n,m) = \frac{\partial\,\bigl\|u(n,m)\bigr\|_{L_0}}{\partial\,u(n,m)} = \frac{1}{c}\,\mathrm{step}\bigl(c - u(n,m)\bigr) + \mathrm{rand}\times\mathrm{step}\bigl(u(n,m) - c\bigr). \tag{26}$$
The anisotropic version is defined by
$$
\begin{aligned}
\frac{\partial \|X\|_{L_0}}{\partial x_{i,j}}
&= \frac{\partial}{\partial x_{i,j}} \sum_{n,m}\Bigl( \|x_{n+1,m}-x_{n,m}\|_{L_0} + \|x_{n,m+1}-x_{n,m}\|_{L_0} \Bigr) \\
&= \frac{\partial \|x_{i,j}-x_{i-1,j}\|_{L_0}}{\partial x_{i,j}}
 + \frac{\partial \|x_{i+1,j}-x_{i,j}\|_{L_0}}{\partial x_{i,j}}
 + \frac{\partial \|x_{i,j+1}-x_{i,j}\|_{L_0}}{\partial x_{i,j}}
 + \frac{\partial \|x_{i,j}-x_{i,j-1}\|_{L_0}}{\partial x_{i,j}} \\
&= \frac{1}{c}\,\mathrm{sgn}(x_{i,j}-x_{i-1,j})\,\mathrm{step}\bigl(c-|x_{i,j}-x_{i-1,j}|\bigr) + \mathrm{rand}_1\times\mathrm{step}\bigl(|x_{i,j}-x_{i-1,j}|-c\bigr) \\
&\quad + \frac{1}{c}\,\mathrm{sgn}(x_{i+1,j}-x_{i,j})\,\mathrm{step}\bigl(c-|x_{i+1,j}-x_{i,j}|\bigr) + \mathrm{rand}_2\times\mathrm{step}\bigl(|x_{i+1,j}-x_{i,j}|-c\bigr) \\
&\quad + \frac{1}{c}\,\mathrm{sgn}(x_{i,j+1}-x_{i,j})\,\mathrm{step}\bigl(c-|x_{i,j+1}-x_{i,j}|\bigr) + \mathrm{rand}_3\times\mathrm{step}\bigl(|x_{i,j+1}-x_{i,j}|-c\bigr) \\
&\quad + \frac{1}{c}\,\mathrm{sgn}(x_{i,j}-x_{i,j-1})\,\mathrm{step}\bigl(c-|x_{i,j}-x_{i,j-1}|\bigr) + \mathrm{rand}_4\times\mathrm{step}\bigl(|x_{i,j}-x_{i,j-1}|-c\bigr).
\end{aligned}
\tag{27}
$$
If a gradient-based iterative algorithm is used to minimize an objective function, the algorithm does not update the image pixel $x_{i,j}$ when the partial derivative of the objective function with respect to $x_{i,j}$ is zero. More often than not, the L0 Bayesian term is ‘silent’ and contributes little toward finding a piecewise-constant solution. If we replace zero with zero-mean random variables, we deliberately disturb the algorithm so that it is no longer ‘silent.’
This strategy is similar to the simulated annealing algorithm, which generates random solutions and gradually rejects less optimal solutions [32]. However, there is no theoretical guarantee that random solutions are better solutions.
Replacing zero with zero-mean random variables does not change the fact that the L0 norm is not convex. To obtain the global minimum, one must perform an exhaustive search over the entire solution space. In other words, L0 minimization, even with the current modification, is still NP-hard. The proposed algorithm is gradient based, does not perform an exhaustive search, and usually does not reach the global minimum.
A drawback of the objective function (7) is the difficulty in selecting the control parameter α. Instead of minimizing the objective function (7) directly, we propose to alternately minimize the data fidelity term $\|AX - P\|_{L_2}^2$ and the Bayesian term $\|\psi X\|_{L_0}$. In this way, the value of the tuning parameter α is no longer important. In other words, we use an iterative Projection onto Convex Sets (POCS) algorithm.
There are many algorithms to minimize the first term (i.e., the data fidelity term). We chose the maximum likelihood expectation maximization (MLEM) algorithm in our implementation [33,34,35,36,37]. On the other hand, the Bayesian term is minimized by a gradient descent algorithm. For an image reconstruction task, the gradient at the pixel $x_{i,j}$ is given by (27).
Now, we use a different way to explain the algorithm that enforces the L0 norm of the gradient image. For an image reconstruction task, we want to update the pixel $x_{i,j}$, as shown in Figure 6. The horizontal and vertical gradients at pixel $x_{i,j}$ are $x_{i,j}-x_{i-1,j}$, $x_{i,j}-x_{i+1,j}$ and $x_{i,j}-x_{i,j-1}$, $x_{i,j}-x_{i,j+1}$, respectively. We want to minimize the L0 norms of these four differences. The derivatives of these four L0 norms share a common expression:
$$D(x_{i,j},\, x_{\mathrm{neighbor}}) = a \times \mathrm{sgn}(x_{i,j} - x_{\mathrm{neighbor}}), \tag{28}$$
where
$$a = \begin{cases} 1/c, & \text{when } |x_{i,j} - x_{\mathrm{neighbor}}| < c,\\ \mathrm{rand}, & \text{when } |x_{i,j} - x_{\mathrm{neighbor}}| \geq c. \end{cases} \tag{29}$$
A gradient descent algorithm to update $x_{i,j}$ is given as
$$x_{i,j}^{\,next} = x_{i,j} - \mu\bigl[D(x_{i,j}, x_{i-1,j}) + D(x_{i,j}, x_{i+1,j}) + D(x_{i,j}, x_{i,j-1}) + D(x_{i,j}, x_{i,j+1})\bigr]. \tag{30}$$
Here, the zero-mean random noise, rand, is uniformly distributed in $[-1, 1]$. The tuning parameter c was chosen as 0.1 by trial-and-error in our implementation.
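To make the update concrete, here is a minimal Python sketch of (28)–(30) for one sweep over the image. It is illustrative only: it is restricted to interior pixels (boundary handling is not specified here), c = 0.1 and the uniform [-1, 1] noise follow the values stated above, and the step size mu is a placeholder default.

```python
import numpy as np

def D(x_ij, x_neighbor, c=0.1, rng=None):
    """Common derivative term for one neighbor difference, as in (28) and (29)."""
    rng = np.random.default_rng() if rng is None else rng
    diff = x_ij - x_neighbor
    a = 1.0 / c if abs(diff) < c else rng.uniform(-1.0, 1.0)
    return a * np.sign(diff)

def l0_smoothing_sweep(X, mu=0.1, c=0.1, rng=None):
    """One gradient-descent sweep of the update (30) over the interior pixels."""
    rng = np.random.default_rng() if rng is None else rng
    X_new = X.copy()
    for i in range(1, X.shape[0] - 1):
        for j in range(1, X.shape[1] - 1):
            grad = (D(X[i, j], X[i - 1, j], c, rng) + D(X[i, j], X[i + 1, j], c, rng)
                    + D(X[i, j], X[i, j - 1], c, rng) + D(X[i, j], X[i, j + 1], c, rng))
            X_new[i, j] = X[i, j] - mu * grad
    return X_new
```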
In our computer simulations, we used 100,000 iterations of the POCS algorithm. At each POCS iteration, we first ran 10 iterations of the MLEM algorithm to minimize the data fidelity term and then we ran 10 iterations of the gradient descent algorithm to minimize the L0 term with a step size of 0.1.
We summarize the main steps in the development of the proposed algorithm as follows. We started with the original L0 definition (1). Then, this expression was replaced by a piecewise-linear continuous approximation (15). Next, the partial derivative (22) was replaced by (23). The proposed POCS algorithm is summarized as a flowchart in Figure 7.
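The overall structure of the alternating POCS loop can be sketched as follows. This is an illustrative outline only: the MLEM update appears merely as a placeholder callable (mlem_update), since its details are standard and not repeated here; l0_smoothing_sweep refers to the sketch in the previous listing; and the iteration counts follow the text (10 MLEM iterations and 10 gradient-descent iterations per outer POCS iteration).

```python
def pocs_reconstruction(X0, mlem_update, projections, n_outer=400,
                        n_mlem=10, n_grad=10, mu=0.1, c=0.1):
    """Alternate between data-fidelity enforcement (MLEM) and the revised
    L0-norm gradient descent, as described in the text and in Figure 7."""
    X = X0.copy()
    for _ in range(n_outer):
        for _ in range(n_mlem):      # enforce AX = P (data fidelity)
            X = mlem_update(X, projections)
        for _ in range(n_grad):      # encourage a piecewise-constant image
            X = l0_smoothing_sweep(X, mu=mu, c=c)
    return X
```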

3. Results

We applied the proposed POCS algorithm to a limited-angle two-dimensional image reconstruction problem. Two computer-generated phantoms were considered. The first phantom had a large, uniform disk as the background and 13 smaller, uniform squares and disks. The sizes, locations, and intensities of each square and disk are listed in Table 1. The second phantom was the famous Shepp–Logan head phantom [38]. No noise was added to the phantom projection data.
In the computer simulations, the parallel-beam imaging geometry was considered. The scanning angular range was 40° for the first phantom and 90° for the second phantom. The image size was 256 × 256. For these two phantom studies, three image reconstruction algorithms were compared in Figure 8 and Figure 9, respectively: the well-known MLEM algorithm [34], the MLEM-TV algorithm [33], and the proposed POCS revised L0-norm minimization algorithm.
The iteration number was 400 in the POCS algorithm. Within each iteration, there were 10 iterations of the MLEM algorithm for data fidelity enforcement and 10 iterations of the gradient descent algorithm for piecewise-constant enforcement.
The only tuning parameter in the MLEM algorithm is the number of iterations. However, the gradient descent algorithm has two tuning parameters: the number of iterations and the step size. The step size was chosen as 0.0001 in the first phantom reconstruction and was 0.00001 in the second phantom reconstruction.
The MLEM reconstruction exhibits the most severe limited-angle artifacts, and the shapes of the small objects are not well defined. The TV reconstruction slightly improves the boundaries of the small objects in the image. The most significant improvement is achieved by the proposed POCS L0 minimization algorithm.
Table 2 and Table 3 show the quantitative evaluation results with the structural similarity (SSIM) [39] and the peak signal-to-noise ratio (PSNR) [40] for the two phantom studies, respectively. An SSIM value closer to 1 indicates better image quality. A greater PSNR value indicates better image quality.
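For reference, SSIM and PSNR can be computed with scikit-image as sketched below. This snippet is illustrative only; the exact settings (data range, windowing) used for Table 2 and Table 3 are not stated here, so the defaults are an assumption.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(reconstruction, phantom):
    """Return (PSNR, SSIM) of a reconstruction against the true phantom."""
    data_range = float(phantom.max() - phantom.min())
    psnr = peak_signal_noise_ratio(phantom, reconstruction, data_range=data_range)
    ssim = structural_similarity(phantom, reconstruction, data_range=data_range)
    return psnr, ssim
```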
We observe that the reconstruction artifacts depend heavily on the size and the orientation of an object. A larger object tends to have more severe artifacts. If there are many smaller objects inside a larger object, the artifacts from each smaller object interact. Therefore, the distance between objects affects the overall artifacts. In the first phantom, the small objects are isolated from each other. In the second phantom, the small objects are close to each other. The second phantom is more difficult to reconstruct than the first phantom.

4. Conclusions

The TV norm has the drawback that it cannot distinguish a smooth function from a piecewise-constant function. The TV Bayesian objective function may therefore not be effective in promoting a piecewise-constant solution. The L0 norm, on the other hand, is difficult to implement in a gradient-based optimization algorithm. This paper aims to address these drawbacks.
The difficulty of using the L0 norm in an optimization algorithm is well known. Remedies have been proposed and tested by many researchers. The efforts can be classified into two categories: approximating the L0 norm by a different norm and approximating the L0 norm itself by a continuous functional. The well-known TV-norm optimization belongs to the first category. Our paper belongs to the second category. A unique feature of our method lies in the region where the signal is not zero. In this region, the traditional L0 norm has $\partial\|X\|_{L_0}/\partial x_{i,j} = 0$. We replace this 0 with a zero-mean random variable. There are many methods that combine L0, L1, and TV models; the unique feature of our method is the introduction of randomness in the derivative.
The L0 norm is not convex. The gradient-based algorithm usually does not converge to the global minimum. We introduce a zero-mean random perturbation to the algorithm; this random perturbation gives the algorithm a chance to ‘jump out’ from a local minimum to another local minimum with a smaller objective function value.
In this paper, we replace the Bayesian algorithm with a POCS algorithm and revise the derivative of the L0 norm so that it does not have a constant zero in most cases. As an application to limited-angle tomography, the proposed algorithm outperforms the MLEM-TV algorithm when the scanning angular range is small.
It is difficult to compare any two algorithms in general, because the performance of the algorithms is application-dependent. As shown in our two phantom studies, different applications may require different minimal scanning angular ranges.

Funding

This research was funded by NIH, grant number 2R15EB024283-03.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MLEM    maximum-likelihood expectation-maximization
POCS    projection onto convex sets
PSNR    peak signal-to-noise ratio
SSIM    structural similarity
TV      total variation

References

  1. Zeng, G.L.; Li, Y. Morphing from the TV-norm to the l0-norm. Biomed. J. Sci. Tech. Res. 2024, 55, 46741–46747. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  2. Sun, Y.; Schaefer, S.; Wang, W. Denoising point sets via L0 minimization. Comput. Aided Geom. Des. 2015, 35, 2–15. [Google Scholar] [CrossRef]
  3. Nguyen, R.M.; Brown, M.S. Fast and effective L0 gradient minimization by region fusion. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 208–216. [Google Scholar]
  4. Baraniuk, R.G. Compressive Sensing [Lecture Notes]. IEEE Signal Process. Mag. 2007, 24, 118–121. [Google Scholar] [CrossRef]
  5. Romberg, J. Imaging via Compressive Sampling. IEEE Signal Process. Mag. 2008, 25, 14–20. [Google Scholar] [CrossRef]
  6. Donoho, D.L. Compressed sensing. IEEE Trans. Inf. Theory 2006, 52, 1289–1306. [Google Scholar] [CrossRef]
  7. Tsaig, Y.; Donoho, D.L. Extensions of compressed sensing. Signal Process. 2006, 86, 549–571. [Google Scholar] [CrossRef]
  8. Tanner, J.; Vary, S. Compressed sensing of low-rank plus sparse matrices. Appl. Comput. Harmon. Anal. 2023, 64, 254–293. [Google Scholar] [CrossRef]
  9. Donoho, D.L.; Tanner, J. Exponential bounds implying construction of compressed sensing matrices, error-correcting codes, and neighborly polytopes by random sampling. IEEE Trans. Inf. Theory 2010, 56, 2002–2016. [Google Scholar] [CrossRef]
  10. Candes, E.J.; Romberg, J.; Tao, T. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 2006, 52, 489–509. [Google Scholar] [CrossRef]
  11. Woeginger, G.J. Exact algorithms for NP-hard problems: A survey. In Combinatorial Optimization—Eureka, You Shrink! Papers Dedicated to Jack Edmonds, 5th International Workshop, Aussois, France, March 5–9, 2001, Revised Papers; Springer: Berlin/Heidelberg, Germany, 2003; pp. 185–207. [Google Scholar]
  12. Paschos, V.T. An overview on polynomial approximation of NP-hard problems. Yugosl. J. Oper. Res. 2009, 19, 3–40. [Google Scholar] [CrossRef]
  13. Tkatek, S.; Bahti, O.; Lmzouari, Y.; Abouchabaka, J. Artificial intelligence for improving the optimization of NP-hard problems: A review. Int. J. Adv. Trends Comput. Sci. Appl. 2020, 9, 7411–7420. [Google Scholar]
  14. Li, W.; Ding, Y.; Yang, Y.; Sherratt, R.S.; Park, J.H.; Wang, J. Parameterized algorithms of fundamental NP-hard problems: A survey. Hum. Centric Comput. Inf. Sci. 2020, 10, 29. [Google Scholar] [CrossRef]
  15. Lin, F.T.; Kao, C.Y.; Hsu, C.C. Applying the genetic approach to simulated annealing in solving some NP-hard problems. IEEE Trans. Syst. Man Cybern. 1993, 23, 1752–1767. [Google Scholar]
  16. Zhang, J.; Zhao, C.; Zhao, D.; Gao, W. Image compressive sensing recovery using adaptively learned sparsifying basis via L0 minimization. Signal Process. 2014, 103, 114–126. [Google Scholar] [CrossRef]
  17. Brandt, C.; Seidel, H.P.; Hildebrandt, K. Optimal spline approximation via ℓ0-minimization. In Computer Graphics Forum; John Wiley & Sons Ltd.: Hoboken, NJ, USA, 2015; Volume 34, pp. 617–626. [Google Scholar]
  18. Lu, Z.; Zhang, Y. Penalty decomposition methods for L0-norm minimization. arXiv 2010, arXiv:1008.5372. [Google Scholar]
  19. Sun, Y.; Schaefer, S.; Wang, W. Image structure retrieval via L0 minimization. IEEE Trans. Vis. Comput. Graph. 2017, 24, 2129–2139. [Google Scholar] [CrossRef]
  20. Delle Donne, D.; Kowalski, M.; Liberti, L. A novel integer linear programming approach for global L0 minimization. J. Mach. Learn. Res. 2023, 24, 18322–18349. [Google Scholar]
  21. Atamturk, A.; Gómez, A.; Han, S. Sparse and smooth signal estimation: Convexification of l0-formulations. J. Mach. Learn. Res. 2021, 22, 1–43. [Google Scholar]
  22. Hyder, M.; Mahata, K. An approximate l0 norm minimization algorithm for compressed sensing. In Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, China, 12 May 2008; pp. 3365–3368. [Google Scholar]
  23. Wang, L.; Yin, X.; Yue, H.; Xiang, J. A regularized weighted smoothed L0 norm minimization method for underdetermined blind source separation. Sensors 2018, 18, 4260. [Google Scholar] [CrossRef]
  24. Robitzsch, A. Computational aspects of L0 linking in the Rasch model. Algorithms 2025, 18, 213. [Google Scholar] [CrossRef]
  25. O’Neill, M.; Burke, K. Variable selection using a smooth information criterion for distributional regression models. Stat. Comput. 2023, 33, 71. [Google Scholar] [CrossRef]
  26. Zhang, Y.; Dong, B.; Lu, Z. ℓ0 Minimization for wavelet frame based image restoration. Math. Comput. 2013, 82, 995–1015. [Google Scholar] [CrossRef]
  27. Cheng, X.; Zeng, M.; Liu, X. Feature-preserving filtering with L0 gradient minimization. Comput. Graph. 2014, 38, 150–157. [Google Scholar] [CrossRef]
  28. Robitzsch, A. L0 and Lp loss functions in model-robust estimation of structural equation models. Psych 2023, 5, 1122–1139. [Google Scholar] [CrossRef]
  29. Needell, D.; Ward, R. Stable image reconstruction using total variation minimization. SIAM J. Imaging Sci. 2013, 6, 1035–1058. [Google Scholar] [CrossRef]
  30. Yang, J.; Yu, H.; Jiang, M.; Wang, G. High-order total variation minimization for interior tomography. Inverse Probl. 2010, 26, 035013. [Google Scholar] [CrossRef]
  31. Sidky, E.Y.; Pan, X. Image reconstruction in circular cone-beam computed tomography by constrained, total-variation minimization. Phys. Med. Biol. 2008, 53, 4777. [Google Scholar] [CrossRef]
  32. Van Laarhoven, P.J.; Aarts, E.H. Simulated annealing. In Simulated Annealing: Theory and Applications; Springer: Dordrecht, The Netherlands, 1987; pp. 7–15. [Google Scholar]
  33. Panin, V.Y.; Zeng, G.L.; Gullberg, G.T. Total variation regulated EM algorithm. IEEE Trans. Nucl. Sci. 1999, 46, 2202–2210. [Google Scholar] [CrossRef]
  34. Shepp, L.A.; Vardi, Y. Maximum likelihood reconstruction for emission tomography. IEEE Trans. Med. Imaging 2007, 1, 113–122. [Google Scholar] [CrossRef]
  35. Snyder, D.L.; Miller, M.I.; Thomas, L.J.; Politte, D.G. Noise and edge artifacts in maximum-likelihood reconstructions for emission tomography. IEEE Trans. Med. Imaging 1987, 6, 228–238. [Google Scholar] [CrossRef]
  36. Levitan, E.; Herman, G.T. A maximum a posteriori probability expectation maximization algorithm for image reconstruction in emission tomography. IEEE Trans. Med. Imaging 1987, 6, 185–192. [Google Scholar] [CrossRef]
  37. Ollinger, J.M. Maximum-likelihood reconstruction of transmission images in emission computed tomography via the EM algorithm. IEEE Trans. Med. Imaging 1994, 13, 89–101. [Google Scholar] [CrossRef]
  38. Shepp, L.A.; Logan, B.F. The Fourier reconstruction of a head section. IEEE Trans. Nucl. Sci. 1974, 21, 21–43. [Google Scholar] [CrossRef]
  39. Zhou, W.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  40. Jain, A.K. Fundamentals of Digital Image Processing; Prentice Hall: Hoboken, NJ, USA, 1989. [Google Scholar]
Figure 1. Points $(x_1, x_2)$ on the line $x_2 = m x_1 + b$ are solutions to (5).
Figure 2. Two curves with the same TV value but different finite-difference plus L0 values.
Figure 3. Three ways to re-define the L0 norm for a scalar x. Top: The traditional definition (1). Middle: The proposed definition (15). Bottom: A smooth version of (15).
Figure 4. The curve of $df_0(x)/dx$ according to the new definition (15).
Figure 5. The curve of $df_0(x)/dx$ according to the new definition (15), with the zeros replaced by zero-mean random variables.
Figure 6. We consider updating a pixel $x_{i,j}$ by calculating the gradients with its horizontal and vertical neighbors.
Figure 7. The flowchart of the proposed POCS algorithm.
Figure 8. Results of the first phantom study. (A) True phantom; (B) MLEM reconstruction; (C) TV reconstruction; (D1, D2) Proposed revised L0 norm reconstruction.
Figure 9. Results of the second phantom study. (A) True phantom; (B) MLEM reconstruction; (C) TV reconstruction; (D) Proposed revised L0 norm reconstruction.
Table 1. Parameters for the first phantom.

Type      Center x    Center y    Diameter or Side Length    Rotation Angle    Density
Circle    0           0           230.40                     0                 0.5
Square    89.60       0           25.60                      0                 1.0
Square    0           89.60       23.04                      0                 0.5
Square    −89.60      0           20.48                      0                 1.0
Square    0           −89.60      17.92                      0                 1.0
Square    64.00       64.00       15.36                      0                 0.5
Square    −64.00      −64.00      12.80                      0                 1.0
Square    −64.00      64.00       10.24                      0                 1.0
Square    64.00       −64.00      7.68                       0                 0.5
Square    0           0           5.12                       0                 1.0
Circle    44.80       0           23.04                      0                 1.0
Circle    0           44.80       17.92                      0                 0.5
Circle    −44.80      0           12.80                      0                 1.0
Circle    0           −44.80      7.68                       0                 0.5
Table 2. Quantitative evaluation results for the first phantom.

Method             PSNR       SSIM
MLEM (B)           9.3822     0.4001
TV (C)             13.4989    0.6341
Proposed 1 (D1)    14.6675    0.6761
Proposed 2 (D2)    17.0831    0.7144
Table 3. Quantitative evaluation results for the second phantom.

Method          PSNR       SSIM
MLEM (B)        17.2455    0.7067
TV (C)          17.2679    0.7300
Proposed (D)    22.0501    0.8547