1. Introduction
Image reconstruction in medical imaging, in general, considers estimating pixel intensities or attenuations from measurements obtained from an imaging system. For example, for positron emission tomography (PET), the measurements are obtained according to the procedure summarized below; see [1,2] for more details. A radioactive isotope is introduced into the body of a patient and, as the radioisotope decays, it emits positrons. Each positron travels a small distance in the body (usually less than 1 mm) and then interacts with an electron to produce a pair of gamma photons that travel in almost opposite directions. The scanning device in the imaging system detects each pair of gamma photons with a certain probability, and all such detections form the measurements, which can appear in histogram or list form [3]. It is usually assumed that the detection probabilities are known and that they can be pre-computed and stored or computed on-the-fly.
Note that a special feature of the measurements is that they are contaminated by noise, which can be a severe problem particularly if each measurement is small in value due to dose safety limits. If the noise is not properly addressed, the reconstructed image can be distorted by excessive noise. For example, in low-dose X-ray CT (a type of transmission tomography), metal streak artifacts (e.g., [4]) can be a severe problem for the traditional filtered backprojection method. Statistical iterative reconstruction methods, due to their ability to model the physics and measurements more accurately, are capable of reducing metal streak artifacts [5].
To deal with the noise contamination problem, statistical image reconstruction methods for emission, transmission, X-ray CT, etc. have been developed based on specified probability models for the measurements. For example, for single photon emission computed tomography (SPECT), possible options include weighted least squares (equivalent to a variable-variance Gaussian) [6], fixed-variance Gaussian [7] and Poisson [8] models. These models can also be used for transmission scans. Since accidental coincidences are the main source of background noise in PET, most PET scans are precorrected for accidental coincidences by real-time subtraction of the coincidences in the delayed window [9]. For randoms-precorrected PET scans, possible measurement models are Gaussian, ordinary Poisson and shifted Poisson [9], and all of these are only approximations as the true probability density function (pdf) of the measurements is difficult to derive. The shifted Poisson model is also used for X-ray CT measurements [10].
Different algorithms have been proposed to maximize the corresponding objective functions. For example, for emission tomography, the expectation-maximization (EM) algorithm [8] is designed to maximize the log-likelihood formulated from Poisson distributed measurements, while the image space reconstruction algorithm (ISRA) [7] maximizes the log-likelihood formulated from Gaussian (fixed-variance) distributed measurements. An attractive aspect of both EM and ISRA is that they are very easy to implement and both respect the non-negativity constraint on the reconstructions. However, if the objective function contains a penalty term, which is normally used to smooth the reconstruction, then both EM and ISRA become impractical, as each iteration involves a non-linear system of equations that is tedious to solve exactly due to the large number of unknowns. Moreover, the penalty function adds an extra inconvenience when a non-negative solution is sought.
To simplify notation, both the measurements and the unknown image are lexicographically ordered into vectors. More specifically, we use $y = (y_1, \ldots, y_n)^T$ to denote the measurement vector and $x = (x_1, \ldots, x_p)^T$ to denote the unknown image vector, where superscript $T$ denotes matrix transpose. Note that although the notation is unified for different reconstruction problems in this paper, the meaning of symbols such as $x$ and $y$ can differ between imaging modalities. Vectors $y$ and $x$ are related through a system matrix $A$; see Equation (4) below for some examples. For tomographic reconstruction problems, the matrix $A$ is usually assumed known, so its estimation is not covered by this paper. Rather, we focus on how to estimate $x$ from the observed $y$ and the known system matrix $A$. We denote the estimate of $x$ by $\widehat{x}$.
The statistical reconstruction $\widehat{x}$ obtained by maximum penalized likelihood (MPL) (also known as maximum a posteriori (MAP)) is defined by
$\widehat{x} = \arg\max_{x \ge 0} \Psi(x; y)$,  (1)
where $\Psi(x; y)$ is an objective function derived from the probability distribution of the measurements and the penalty function. When the $y_i$'s are assumed independent (given $x$), the penalized likelihood objective function is
$\Psi(x; y) = l(x; y) - h J(x)$,  (2)
where $l(x; y)$ is the log-likelihood function given by
$l(x; y) = \sum_{i=1}^{n} f(y_i; \mu_i(x))$.  (3)
Here $h$ is the smoothing parameter and $J(x)$ is the penalty function used to smooth $\widehat{x}$. In Equation (3), $f(y_i; \mu_i(x))$ denotes the log-density function for measurement $y_i$, and $\mu_i(x)$ is a function of $x \in \mathbb{R}_+^p$ (here $\mathbb{R}_+^p$ denotes the non-negative orthant of $\mathbb{R}^p$) representing the mean measurement of camera bin $i$. Examples of $\mu_i(x)$ include
$\mu_i(x) = a_i^T x + r_i$ (emission) and $\mu_i(x) = b_i e^{-a_i^T x} + r_i$ (transmission),  (4)
where $a_i^T x = \sum_{j=1}^{p} a_{ij} x_j$ with $a_i^T$ being the $i$th row of matrix $A$, $b_i$ is the known blank scan count of the $i$th detector and $r_i$ the known mean background count. Another example is polyenergetic transmission scans (such as X-ray CT), where
$\mu_i(x) = \sum_{m} b_{im}\, e^{-a_i^T x^{(m)}} + r_i$,  (5)
and here $x^{(m)}$ denotes the attenuation map corresponding to the $m$-th energy spectrum, $x$ is a vector formed by the $x^{(m)}$'s and $b_{im}$ is the blank scan count from energy spectrum $m$.
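To make the mean-measurement models in Equations (4) and (5) concrete, here is a minimal NumPy sketch (the dense system matrix and the function names are our illustrative assumptions; real systems use sparse or on-the-fly projectors):

```python
import numpy as np

def mean_emission(A, x, r):
    """Emission model of Equation (4): mu_i = [Ax]_i + r_i."""
    return A @ x + r

def mean_transmission(A, x, b, r):
    """Monoenergetic transmission model of Equation (4): mu_i = b_i exp(-[Ax]_i) + r_i."""
    return b * np.exp(-(A @ x)) + r
```

The polyenergetic model of Equation (5) simply sums the second function over the energy spectra, with one blank-scan vector and one attenuation map per spectrum.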
In Equation (3) the notation $f(y_i; \mu_i(x))$ is used to emphasize that $f$ is a function of $\mu_i(x)$ and that it also involves the measurement $y_i$. We can also write this function as $f(\mu_i)$ or $f_i(x)$ in different contexts when there is no ambiguity. However, the functional properties of $f$ may change with respect to its different arguments. For example, if $y_i$ is assumed to follow a Poisson distribution for either emission or transmission scans, then
$f(y_i; \mu_i) = y_i \log \mu_i - \mu_i$ (up to an additive constant).  (6)
This is clearly a concave function of $\mu_i$ for both the emission and transmission cases. However, $f_i(x)$ (treated as a function of $x$) may no longer be concave for transmission scans, while it remains concave for emission scans. Concavity is an important property exploited by the optimization transfer algorithms.
Let $\mu(x)$ be the $n$-vector of all $\mu_i(x)$. The first term of Equation (2), i.e., $l(x; y)$, measures the similarity between $y$ and $\mu(x)$. Different probability distributions have been used to model $y$ even under the same imaging modality. For example, for emission tomography, if the Poisson model is assumed for $y_i$ (i.e., $y_i \sim \mathrm{Poisson}(\mu_i(x))$) then $f(y_i; \mu_i)$ is given by Equation (6); if instead weighted least squares is considered then $f(y_i; \mu_i) = -\tfrac{w_i}{2}(y_i - \mu_i)^2$, where $w_i$ is the weight. For a particular choice of $w_i$ we obtain the weighted least squares model suggested in [11]. Another example in emission (or transmission) tomography is the randoms-precorrected PET scan (assuming no scattering, to simplify). In this context, the observed measurements are $y_i = y_i^{\mathrm{p}} - y_i^{\mathrm{d}}$, where $y_i^{\mathrm{p}}$ and $y_i^{\mathrm{d}}$ (both unavailable directly) denote the numbers of coincidences in the prompt and delayed windows, respectively. Although we can assume that $y_i^{\mathrm{p}}$ and $y_i^{\mathrm{d}}$ are Poisson distributed and independent, the exact distribution of $y_i$ cannot be derived directly (e.g., [9]). An approximate probability model suggested in [9] is the shifted Poisson distribution, in which $y_i$ shifted by twice the mean of the delayed counts is treated as a Poisson random variable; this yields the corresponding log-density, or alternatively a weighted least squares objective can be used. Note that the shifted Poisson approximation matches the first two moments of the true probability model for $y_i$ when both the prompt and delayed measurements are assumed independent and follow Poisson distributions.
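As an illustration of the shifted Poisson model of [9], the following sketch evaluates its log-likelihood under the common convention that the delayed-window counts have known mean r_i, so that y_i + 2 r_i is treated as Poisson with mean mu_i + 2 r_i; the clipping of negative shifted values is our choice, not necessarily the paper's:

```python
import numpy as np

def shifted_poisson_loglik(y, mu, r):
    """Shifted Poisson log-likelihood for randoms-precorrected data.

    y  : precorrected measurements (prompts minus delayeds), may be negative
    mu : mean 'true' coincidences for the current image estimate
    r  : known mean of the delayed-window (randoms) counts
    Treats y + 2r as Poisson with mean mu + 2r; the constant log-factorial
    term is dropped and negative shifted values are clipped to zero.
    """
    shifted = np.maximum(y + 2.0 * r, 0.0)
    lam = mu + 2.0 * r
    return float(np.sum(shifted * np.log(lam) - lam))
```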
In this paper, we present and discuss several important non-negatively constrained penalized likelihood reconstruction algorithms. When designing a reconstruction algorithm in tomographic imaging, one considers the following important issues: (i) the algorithm is computationally efficient and ideally involves only forward-projection (e.g., $Ax$) and back-projection (e.g., $A^T y$) operations; (ii) the algorithm can be easily applied to different measurement probability models and imaging modalities; (iii) the algorithm can impose the non-negativity constraint; and (iv) the algorithm converges fast. Our discussion of the algorithms in this paper will mainly focus on these points.
In tomographic imaging, it is important to produce smoothed reconstructions, as severe noise in a reconstruction can cause false diagnoses. Smoothing can generally be achieved by one of the following five practices: (i) early termination of the iterations (e.g., [12]); (ii) MPL reconstruction with an appropriate smoothing parameter (e.g., [13]); (iii) functional representation of the unknown image by a set of smooth basis functions (e.g., [14]); (iv) post-smoothing of the reconstruction within each iteration (e.g., [15]) or after all iterations ([16]); and (v) pre-smoothing of the camera data (i.e., the sinogram) followed by filtered backprojection (FBP) (e.g., [17,18]). We focus on the penalized likelihood approach to smoothing in this paper. In Equation (2), the smoothing parameter $h$ balances two conflicting targets: fidelity of the $\mu_i(x)$'s to the $y_i$'s and smoothness of $x$. Although an appropriate choice of $h$ is important for achieving a reconstruction with balanced fidelity and smoothness, we will not consider how to estimate $h$ in this paper. A penalty function $J(x)$ is used to smooth or regularize the estimate $\widehat{x}$. Usually, $J(x)$ takes the form $J(x) = \sum_j \rho(v_j)$, where $v_j$ represents a neighborhood operation (such as a first- or second-order difference) on pixel $j$, and the function $\rho$ measures the magnitude of $v_j$. A common choice of $\rho$ is the quadratic function $\rho(v) = v^2/2$. Generally, a quadratic penalty tends to produce images with over-smoothed edges. Possible edge-preserving penalties include total variation (TV) (e.g., [19]), Huber [20] and hyperbolic functions (e.g., [21]). Note that $\rho$ is convex for all these options.
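As a small illustration of penalties of this form, the sketch below evaluates quadratic and Huber roughness penalties on the first-order differences between neighbouring pixels of a 2-D image (the helper names and the Huber threshold are our illustrative choices):

```python
import numpy as np

def huber(t, delta):
    """Huber function: quadratic near zero, linear in the tails (edge preserving)."""
    a = np.abs(t)
    return np.where(a <= delta, 0.5 * t**2, delta * a - 0.5 * delta**2)

def roughness_penalty(img, rho=lambda t: 0.5 * t**2):
    """J(x) = sum of rho over first-order horizontal and vertical neighbour differences."""
    dh = np.diff(img, axis=1)   # horizontal neighbour differences
    dv = np.diff(img, axis=0)   # vertical neighbour differences
    return float(rho(dh).sum() + rho(dv).sum())

# Example usage on a random image
img = np.random.rand(8, 8)
print(roughness_penalty(img))                               # quadratic penalty
print(roughness_penalty(img, rho=lambda t: huber(t, 0.1)))  # Huber penalty
```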
The optimal choices of the penalty function $J$ and the smoothing parameter $h$ are unsolved problems in image processing and will not be further elaborated in this paper. We emphasize that smoothing by MPL indeed produces visually improved reconstructions over the traditional filtered backprojection method, particularly in dose-limited tomography such as low-dose X-ray CT. Edge-preserving penalties, such as the TV and Huber penalties, are extremely useful; see [22,23,24]. However, MPL reconstructions can have unnatural noise textures very different from those of the familiar filtered backprojection method. Their impact on diagnostic tasks is still unknown and this is an active research area; see [25] for examples and discussions.
We adopt the following notation throughout this paper. Let $x^{(k)}$ be the estimate of $x$ obtained at iteration $k$ of an algorithm. The notation $\dot b(\cdot)$ indicates the derivative of a function $b$ with respect to the variable in the brackets. For example, $\dot b(\mu_i)$ represents the derivative of $b$ with respect to $\mu_i$ and $\dot b(x)$ the derivative of $b$ with respect to $x$. We use $\dot b(x_j)$ to denote the derivative of $b$ with respect to $x_j$, the $j$-th element of vector $x$. We also let $\dot b(\mu_i^{(k)})$ and $\dot b(x_j^{(k)})$ represent, respectively, $\dot b(\mu_i)$ and $\dot b(x_j)$ evaluated at $x^{(k)}$.
Non-negatively constrained MPL image reconstruction algorithms can be classified into simultaneous and block-iterative (a.k.a. ordered subset (OS)) algorithms. For simultaneous algorithms, all elements in
y are used to update
x in each iteration, and for block-iterative algorithms, distinct portions of
y are used in turn to update
x. We discuss in this paper some simultaneous algorithms for non-negatively constrained MPL reconstructions, and the block-iterative algorithms are not included in our discussions. The rest of this paper is arranged as follows. The expectation-maximization algorithm for emission tomography is discussed in
Section 2.
Section 3 explains the alternating minimization algorithm designed specifically for transmission tomography.
Section 4 contains explanations on the optimization transfer algorithms and their applications to tomographic reconstructions. The multiplicative iterative (MI) algorithms for tomographic imaging are provided in
Section 5 and the Fisher scoring based Jacobi or Gauss–Seidel over-relaxation algorithms are presented in
Section 6.
Section 7 explains another Gauss–Seidel method named the iterative coordinate ascent algorithm. Finally,
Section 8 includes discussions and remarks about this paper.
In this paper we focus on explaining and summarizing different non-negatively constrained tomographic imaging algorithms. Numerical comparisons of some of these algorithms are available in [26], and therefore will not be given in this paper.
2. EM Algorithm for Maximum Likelihood Reconstruction in Emission Tomography
The expectation-maximization (EM) algorithm [27] is a statistical algorithm for iteratively computing maximum likelihood estimates when data contain random missing values. Here “random” means these missing values do not provide extra information about the parameters we wish to estimate. We first give a brief summary of the EM algorithm below.
Since there exist missing and observed (or incomplete) components, we can define the complete data set as the combination of the incomplete and the missing data. Note, however, that our aim is to estimate the unknown parameters by maximizing the log-likelihood of the incomplete data. The rationale for the EM algorithm is that, if maximizing the incomplete data likelihood is difficult while maximizing the complete data likelihood is easy, then EM can be used to compute iteratively the maximum of the incomplete data likelihood by maximizing the complete data likelihood in each iteration.
Let $w$ be the complete data set given by $w = (y, z)$, where $y$ denotes the incomplete data and $z$ the missing data. Let $l_c(x; w)$ be the log-likelihood based on the complete data $w$ and $l(x; y)$ the log-likelihood of the incomplete data $y$, where $x$ is a $p$-vector of the unknown parameters. Let $\widehat{x}$ be the maximum likelihood (ML) estimate of $x$. Then iteration $k+1$ of the EM algorithm comprises two steps:
E-Step: Compute the conditional expectation of the complete data log-likelihood given the incomplete data and $x^{(k)}$, and denote this function by $Q(x; x^{(k)}) = E\!\left[ l_c(x; w) \mid y, x^{(k)} \right]$.
M-Step: Update the $x$ estimate by maximizing the $Q$ function, namely $x^{(k+1)} = \arg\max_x Q(x; x^{(k)})$.
One major advantage of EM is that it guarantees, under certain regularity conditions, that the incomplete data log-likelihood $l(x^{(k)}; y)$ increases over consecutive iterations before convergence. Note that EM requires availability of the $Q$ function in closed form; otherwise, a Monte-Carlo E-step can be used to replace the E-step [28].
The EM algorithm was first applied to emission tomography by Shepp and Vardi [8] and Lange and Carson [29]. Both papers adopt the Poisson model for emission counts, namely the $y_i$ are independent Poisson random variables with mean $\mu_i(x) = a_i^T x$. This model assumes $r_i = 0$; otherwise, we can regard $y_i$ as the value obtained after subtracting $r_i$ from the bin $i$ measurement. From this Poisson model, we can formulate the complete data as $\{z_{ij}\}$, where $z_{ij}$ follows the Poisson distribution with mean $a_{ij} x_j$. Clearly, each $z_{ij}$ represents the unknown portion of the measurement on camera bin $i$ attributed to image pixel $j$. The corresponding complete data log-likelihood is (up to an additive constant)
$l_c(x; z) = \sum_i \sum_j \left( z_{ij} \log(a_{ij} x_j) - a_{ij} x_j \right)$,
and the corresponding $Q$ function is
$Q(x; x^{(k)}) = \sum_i \sum_j \left( \widehat{z}_{ij} \log(a_{ij} x_j) - a_{ij} x_j \right)$,
where $\widehat{z}_{ij} = E[z_{ij} \mid y, x^{(k)}]$. Since the conditional distribution of $z_{ij}$ given $y_i$ is binomial, we have $\widehat{z}_{ij} = a_{ij} x_j^{(k)} y_i / \mu_i(x^{(k)})$. Thus, after solving $\partial Q(x; x^{(k)}) / \partial x_j = 0$, the M-step of the EM algorithm gives the following updating formula for $x$:
$x_j^{(k+1)} = \frac{x_j^{(k)}}{\sum_i a_{ij}} \sum_i \frac{a_{ij} y_i}{\mu_i(x^{(k)})}$  (15)
for $j = 1, \ldots, p$. It has been pointed out in [23,30] that formula (15) can also be explained by the Bayes conditional probability formula. This EM algorithm possesses the following properties, making it attractive for emission tomography:
If the initial $x^{(0)} > 0$ then $x^{(k)} \ge 0$ for all $k$; i.e., the algorithm automatically satisfies the non-negativity constraint on $x$.
The algorithm is easy to implement as it only involves forward- and back-projections.
The updating formula in Equation (15) increases the incomplete data log-likelihood: $l(x^{(k+1)}; y) \ge l(x^{(k)}; y)$, where equality holds only when the iteration has converged.
$x^{(k+1)}$ satisfies $\sum_i \mu_i(x^{(k+1)}) = \sum_i y_i$, where $\mu_i(x^{(k+1)})$ is $\mu_i(x)$ evaluated at $x = x^{(k+1)}$. Thus the $x$ estimate at any iteration satisfies that the total expected and the total observed counts are equal.
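To make the update in Equation (15) concrete, here is a minimal ML-EM sketch (the dense NumPy matrix and the small epsilon guard are our illustrative choices):

```python
import numpy as np

def ml_em(A, y, n_iter=50, eps=1e-12):
    """ML-EM for emission tomography, Equation (15); assumes r_i = 0.

    A : (n, p) non-negative system matrix
    y : (n,) measured counts
    Returns the reconstruction after n_iter iterations.
    """
    n, p = A.shape
    x = np.ones(p)                     # strictly positive start keeps iterates >= 0
    sens = A.sum(axis=0)               # sensitivity image: sum_i a_ij
    for _ in range(n_iter):
        mu = A @ x                     # forward projection: mean counts
        ratio = y / np.maximum(mu, eps)
        x = x / np.maximum(sens, eps) * (A.T @ ratio)   # back-project the ratio
    return x
```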
The above EM is easy to implement and possesses attractive properties for the reconstructions. This algorithm, however, is restricted to emission tomography with Poisson distributed measurements. It cannot be easily extended to other reconstruction tasks. For example, application of the EM algorithm to transmission tomography does not lead to an exact updating formula because its M-step does not produce a closed-form solution; see [29]. Another limitation is that this EM algorithm can only be used for maximum likelihood reconstructions, and its application to MPL reconstruction will not in general result in a closed-form updating formula. To rectify this problem, Green [31] developed a one-step-late (OSL) algorithm for MPL reconstruction by replacing $x$ in the derivative of the penalty function with its current estimate $x^{(k)}$, so that an “exact” solution can still be obtained. However, this method suffers from the deficiencies that (i) the algorithm may be non-convergent; and (ii) some estimates may be negative.
De Pierro [32] reproduced the EM updating formula using a totally different argument. In his derivation, there are no missing data and hence no E-step. Although the algorithm is named “modified EM”, it is not a real EM. In fact, this algorithm belongs to a more general class called the optimization transfer algorithms, since the Poisson log-likelihood optimization problem is transferred to a simpler optimization in each iteration. We will summarize the optimization transfer algorithms in Section 4.
3. Alternating Minimization Algorithms for Transmission Tomography
We have explained in Section 2 that the EM algorithm is not directly suitable for transmission scans, as its M-step cannot be computed exactly. In this section, we summarize an alternating minimization algorithm designed to solve the transmission tomographic problem, including X-ray CT. This algorithm is a generalization of the EM algorithm [33], and its application to transmission tomography can be found in [34].
Following [34], we explain this algorithm using the polyenergetic transmission tomography example. In this context, assuming the transmission scans follow Poisson distributions, the corresponding log-likelihood is
$l(z) = \sum_i \left( y_i \log \mu_i(z) - \mu_i(z) \right)$,  (16)
where $y_i$ is the scan count of detector $i$ and $\mu_i(z)$ (now expressed as a function of the vector $z$, which will be defined below) is given by Equation (5). Moreover, the elements of the attenuation map associated with spectrum $m$, namely the elements of $x^{(m)}$ in Equation (5), are further modeled, as in Equation (17), as linear combinations of known linear attenuation coefficients with the unknown partial densities $z_{jr}$ (e.g., [34]), which are the quantities we wish to estimate; here $j$ indexes pixels and $r$ represents different types of materials. In Equation (16), $z$ is the vector formed by column-wise stacking the partial-density vectors of the different materials.
Define one set whose elements are, for each detector $i$, the exponential terms $b_{im} e^{-a_i^T x^{(m)}}$ of Equation (5) for the different spectra, together with an element that equals the background noise $r_i$. Clearly, $\mu_i(z)$ given in Equation (5) can now be expressed as the sum of these elements. Define another set as in Equation (20). In [34], the former is called the exponential family and the latter the linear family. Let $p$ and $q$ be the vectors created from these two families, respectively. It can be shown that the problem of maximizing the log-likelihood in Equation (16) can be re-written as minimizing the I-divergence $I(p \parallel q)$ subject to $p$ and $q$ belonging to their respective families, where $I(p \parallel q)$ is the I-divergence [35] given by
$I(p \parallel q) = \sum_i \left( p_i \log \frac{p_i}{q_i} - p_i + q_i \right)$.
Thus, maximizing the log-likelihood in Equation (16) can be achieved iteratively. Assuming the estimates of $z$, $p$ and $q$ are obtained at iteration $k$, iteration $k+1$ contains two steps:
- (i) compute the updated $p$ by minimizing the I-divergence subject to $p$ lying in its family;
- (ii) compute the updated $q$ (and hence $z$) by minimizing the I-divergence subject to $q$ lying in its family.
Note that the second step is equivalent to minimizing the objective over $z$ with $p$ being given by the expression in Equation (19).
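For reference, here is a small sketch of the Csiszár I-divergence that these two steps minimize (this helper is ours, not the implementation of [34]):

```python
import numpy as np

def i_divergence(p, q, eps=1e-12):
    """Csiszar I-divergence I(p || q) = sum p*log(p/q) - p + q for non-negative vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # 0 * log(0 / q) is taken as 0 by masking zero entries of p
    ratio = np.where(p > 0, np.log(np.maximum(p, eps) / np.maximum(q, eps)), 0.0)
    return float(np.sum(p * ratio - p + q))
```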
Minimizing the I-divergence over $p$ is easily achieved using a Lagrange multiplier, and the result has a simple closed form. On the other hand, direct optimization over $z$ is an unmanageable task as the $z_{jr}$'s are mixed (i.e., not decoupled or separated from each other) within the objective function. One approach to overcoming this problem is to use a decoupled objective function representing an upper bound of the original objective function. In fact, it can be shown that, for $p$ given by Equation (19), the inequality in Equation (24) holds, where the quantities involved are evaluated at the estimates from iteration $k$. This inequality is obtained from the convexity of the exponential function. Clearly, the terms on the right hand side of Equation (24) are decoupled, and thus their non-negatively constrained optimizations result in closed-form solutions. With an appropriate choice of the weights, the optimal solution is available in closed form. We give some remarks about this algorithm below.
Remarks
- (1) This algorithm is designed for maximum likelihood estimation. However, it can be easily extended to MPL, provided the penalty function is convex and can therefore also be decoupled.
- (2) This algorithm is developed for the likelihood function derived from simple Poisson measurement noise. Note that the alternating minimization algorithm was also developed for a compound Poisson noise model in [36], and its comparison with the simple Poisson alternating minimization was provided in [37]. For other measurement distributions, however, the corresponding algorithms have to be completely re-developed.
- (3) The convergence properties of the alternating minimization algorithm have been studied in [34]. In particular, it is monotonically convergent under certain conditions.
- (4) It will become clear in Section 5 (Example 5.3) that the multiplicative iterative algorithm can be derived more easily for this transmission reconstruction problem.
- (5) The trick of decoupling the objective function using its convex (or concave) property is also the key technique of the optimization transfer algorithms discussed in Section 4.
4. Optimization Transfer Algorithms
Details of the optimization transfer (OT) algorithm (also called the minorization–maximization (MM) algorithm for maximizations) can be found in, for example, [38]. In this section we present this algorithm briefly and explain its application in emission and transmission tomography.
The fundamental idea of the OT algorithm is that it employs a surrogate function to minorize (see the definition below) the objective function in each iteration, and then updates the parameter estimate by maximizing this surrogate function.
More specifically, a function $S(x; x^{(k)})$ is said to minorize $\Psi(x)$ at $x^{(k)}$ if it satisfies the following “minorization” conditions:
- (i) $S(x^{(k)}; x^{(k)}) = \Psi(x^{(k)})$, and
- (ii) $S(x; x^{(k)}) \le \Psi(x)$ for all $x$.
Then at iteration $k+1$, $x$ is estimated by maximizing $S(x; x^{(k)})$, i.e., $x^{(k+1)} = \arg\max_x S(x; x^{(k)})$. If the exact maximum is not easy to obtain, we can find an $x^{(k+1)}$ by simply increasing $S(x; x^{(k)})$, as this will also guarantee that the monotonic condition stated below holds for $x^{(k+1)}$.
An attractive property of using this surrogate function is that $x^{(k+1)}$ satisfies the monotonic condition, namely $\Psi(x^{(k+1)}) \ge \Psi(x^{(k)})$, where equality holds only when the iteration has converged. This monotonic property can easily be verified from the minorization conditions since
$\Psi(x^{(k+1)}) \ge S(x^{(k+1)}; x^{(k)}) \ge S(x^{(k)}; x^{(k)}) = \Psi(x^{(k)})$.
For implementation of the OT algorithm in medical imaging, a surrogate function $S(x; x^{(k)})$ must be determined. There exist different ways of choosing the surrogate function, such as those listed in [38]. We mainly consider two approaches in this paper: (i) the method based on the inequality for concave functions (called the concave inequality hereafter); and (ii) the method based on quadratic lower bounds (also known as paraboloidal surrogates [39]). These ideas are summarized below.
Let $\Phi(x) = \sum_{i=1}^{n} \phi_i(a_i^T x)$ be the objective function we wish to maximize, where $a_i^T$ is the $i$-th row of matrix $A$ and $x$ is a $p$-vector. For matrix $A$, we assume its elements $a_{ij}$ are non-negative and $\sum_i a_{ij} > 0$. We also assume that all $\phi_i$ are concave functions. Let $\alpha_{ij} \ge 0$ be weights satisfying $\sum_j \alpha_{ij} = 1$. Then, according to the concave inequality, we have
$\phi_i(a_i^T x) = \phi_i\!\left(\sum_j \alpha_{ij} \frac{a_{ij} x_j}{\alpha_{ij}}\right) \ge \sum_j \alpha_{ij}\, \phi_i\!\left(\frac{a_{ij} x_j}{\alpha_{ij}}\right)$.  (28)
There are different ways of choosing the weights $\alpha_{ij}$. For example, we can use $\alpha_{ij} = a_{ij} x_j^{(k)} / (a_i^T x^{(k)})$, which is also adopted in [32]. In this case, since each $\alpha_{ij}$ is a function of $x$, the surrogate function corresponding to Equation (28) is
$S(x; x^{(k)}) = \sum_i \sum_j \frac{a_{ij} x_j^{(k)}}{a_i^T x^{(k)}}\, \phi_i\!\left(\frac{a_i^T x^{(k)}}{x_j^{(k)}}\, x_j\right)$,  (29)
and it is easy to verify that this surrogate satisfies the minorization conditions. The right hand side of Equation (29) is a weighted summation of functions $\phi_i$, each involving a single $x_j$ only (i.e., decoupled), and therefore maximization of $S(x; x^{(k)})$ with respect to $x$ can be achieved by a sequence of 1-D optimizations. Another trick, due to De Pierro [32], uses the following concave inequality:
$\phi_i(a_i^T x) = \phi_i\!\left(\sum_j \alpha_{ij}\!\left[\frac{a_{ij}}{\alpha_{ij}}(x_j - x_j^{(k)}) + a_i^T x^{(k)}\right]\right) \ge \sum_j \alpha_{ij}\, \phi_i\!\left(\frac{a_{ij}}{\alpha_{ij}}(x_j - x_j^{(k)}) + a_i^T x^{(k)}\right)$.  (30)
If the weights $\alpha_{ij}$ do not depend on $x$, then Equation (30) leads to the surrogate function
$S(x; x^{(k)}) = \sum_i \sum_j \alpha_{ij}\, \phi_i\!\left(\frac{a_{ij}}{\alpha_{ij}}(x_j - x_j^{(k)}) + a_i^T x^{(k)}\right)$,  (31)
which clearly also meets the minorization conditions. In Equation (31), the choice of $\alpha_{ij}$ is again flexible, and one popular option is $\alpha_{ij} = a_{ij} / \sum_l a_{il}$.
The above two surrogates are developed based on the concave inequality. Another useful approach is to employ a quadratic lower bound (e.g., [40]). Assume $\phi_i(t)$ is twice differentiable with its second derivative denoted by $\ddot\phi_i(t)$. Let $c_i > 0$ be a number such that $\ddot\phi_i(t) \ge -c_i$ for all $t$; then
$\phi_i(t) \ge \phi_i(t^{(k)}) + \dot\phi_i(t^{(k)})(t - t^{(k)}) - \frac{c_i}{2}(t - t^{(k)})^2$, where $t^{(k)} = a_i^T x^{(k)}$.  (32)
The right hand side of Equation (32) is a parabola surrogate of $\phi_i$, and the condition on $c_i$ guarantees that this function lies below $\phi_i$. Unlike the previous surrogate functions, this surrogate is not separable in $x$, and therefore its maximization with respect to $x$ cannot be reduced to a series of 1-D problems. To overcome this problem we can find another function surrogating the above parabola surrogate but separable in $x$. Towards this, we denote the right hand side quadratic function of Equation (32) by $\psi_i(t)$. Since $\psi_i$ is concave in $t$, we can use either Equation (29) or (31) to find a surrogate to $\sum_i \psi_i(a_i^T x)$, and the resulting algorithm is called the separable paraboloidal surrogate (SPS) algorithm [39]. For example, corresponding to Equation (31), a separable parabola surrogate of $\sum_i \psi_i(a_i^T x)$ is
$S(x; x^{(k)}) = \sum_i \sum_j \alpha_{ij}\, \psi_i\!\left(\frac{a_{ij}}{\alpha_{ij}}(x_j - x_j^{(k)}) + a_i^T x^{(k)}\right)$.  (33)
A careful selection of the curvature $c_i$ in Equation (32) can lead to fast convergence of the SPS algorithm. Erdoǧan and Fessler [39] derived the optimal curvature for the SPS algorithm in transmission tomography.
Next, we present two examples explaining how to implement the OT algorithm in emission and transmission tomography.
Example 4.1 (OT for emission scans with Poisson noise).
In this example we explain the application of OT to MPL reconstruction in emission tomography, where the measurements are assumed to follow Poisson distributions. De Pierro’s modified EM (MEM) [32] coincides with the method discussed below in a special case. Firstly, under the Poisson model for emission scans, the penalized log-likelihood function is given by Equation (34), where $\rho$ is assumed to be a convex function. Let $\phi_i(\mu_i) = y_i \log \mu_i - \mu_i$, where $\mu_i = a_i^T x + r_i$. It is easy to verify that $\phi_i$ is concave with respect to $\mu_i$, so we can use Equation (28) to define its surrogate function. On the other hand, for the penalty function in Equation (34), $-\rho$ is concave, so we can use Equation (31) to construct its surrogate. Combining them together, we obtain the surrogate for the penalized log-likelihood given in Equation (36). Setting its derivative with respect to $x_j$ to zero, as in Equation (37), we find that this equation has a closed-form solution for $x_j$ when $\rho$ is quadratic and $r_i = 0$ for all $i$. In this context, Equation (37) reduces to a quadratic function, so we wish to solve for $x_j$ from it subject to $x_j \ge 0$, and its analytic solution is readily available. If $r_i \neq 0$ or $\rho$ is not quadratic, the analytic solution to Equation (37) does not exist. In this case, one can use a 1-D optimization method to solve it, or alternatively, one may use a separable parabola surrogate rather than Equation (36). An example of the latter is explained in the next example, where the reconstruction problem is for transmission tomography.
Example 4.2 (OT for transmission scans with Poisson noise).
This example considers the application of OT to MPL reconstruction in transmission tomography. Our explanation follows [39] closely. For transmission scans with Poisson noise, the penalized log-likelihood is given by Equation (39), where $\rho$ is convex. Since the log-likelihood term is concave with respect to the projections, a separable parabola surrogate can be defined according to Equation (33). For the first term of Equation (39) (i.e., the log-likelihood part), a separable parabola is given by Equation (41), where the curvature satisfies the lower-bound condition of Equation (32) for all projection values. For the second term of Equation (39) (i.e., the penalty part), with suitably chosen weights, its separable parabola surrogate is given by Equation (43), where the curvature is chosen such that the surrogate parabola lies above the penalty function over its range. Aggregating Equations (41) and (43), we obtain a separable parabola surrogate for the penalized log-likelihood. Setting its derivative with respect to each $x_j$ to zero, subject to $x_j \ge 0$, yields a closed-form update. This is in fact a special gradient algorithm with a diagonal preconditioning matrix.
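The following sketch shows the general shape of such an SPS update, a diagonally preconditioned gradient ascent step with non-negativity clipping; the curvature formula assumes the weights α_ij = a_ij / Σ_l a_il, the penalty curvature (if any) would simply be added to the denominator, and the names and epsilon guard are ours:

```python
import numpy as np

def sps_step(x, grad, curvatures, A, eps=1e-12):
    """One separable-paraboloidal-surrogate style update (a sketch).

    x          : (p,) current non-negative estimate
    grad       : (p,) gradient of the penalized log-likelihood at x
    curvatures : (n,) per-measurement curvatures c_i
    A          : (n, p) non-negative system matrix

    The denominator d_j = sum_i a_ij (sum_l a_il) c_i comes from the separable
    parabola built with weights alpha_ij = a_ij / sum_l a_il.
    """
    row_sums = A.sum(axis=1)                 # sum_l a_il for each i
    d = A.T @ (row_sums * curvatures)        # separable curvature for each pixel
    return np.maximum(x + grad / np.maximum(d, eps), 0.0)
```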
5. Multiplicative Iterative Algorithms
The OT algorithms presented in the last section have the following important achievements: (1) they manage to transform a high dimensional optimization problem into a series of 1-D optimizations; (2) due to the 1-D optimizations, the non-negativity constraints can be easily enforced by simply resetting negative estimates to zero in each iteration; (3) the surrogate given by the separable parabola approach is general enough to be applicable to different tomographic reconstructions. A limitation of OT is that it requires all the log-density functions and the negative penalty function to be concave.
In this section we discuss a competitive alternative to the OT method called the multiplicative iterative (MI) algorithm; its application to tomographic imaging can be found in [26] and to box-constrained image processing in [41].
The main motivation of the MI algorithm is that it can be easily derived under different imaging modalities and different measurement noise models. Moreover, for some difficult penalties, such as TV, or even non-convex penalties [42], MI can be easily implemented to solve the corresponding optimization problems.
A general MI updating formula can be developed that is suitable for all tomographic reconstruction problems regardless of the mean function model, measurement probability distribution and penalty function. The simulation study reported in [26] reveals that MI has competitive convergence speed when compared with OT and other reconstruction algorithms. The MI algorithm does not require concavity of the log-density and penalty functions and is therefore more general than the OT algorithm; it only requires the existence of their first derivatives. It is possible that the objective function $\Psi$ in Equation (2) has multiple local maxima. In this case, MI finds one of the local non-negative maxima, depending on the starting value of the algorithm.
Here is some notation needed to explain the MI algorithm. For a function $g$, let $g^+$ be the positive component of $g$ and $g^-$ the negative component, so that $g = g^+ - g^-$ with $g^+ \ge 0$ and $g^- \ge 0$. For a number $b$, let $b^+ = \max(b, 0)$ and $b^- = \max(-b, 0)$, so that $b = b^+ - b^-$. Thus, for the numerical value of a function $g$ at a point $t$, we can also write $g(t) = [g(t)]^+ - [g(t)]^-$.
We develop the MI algorithm from the Karush–Kuhn–Tucker (KKT) necessary conditions for the non-negatively constrained optimization of $\Psi(x)$. They are:
$\partial \Psi(x) / \partial x_j = 0$ when $x_j > 0$,  (49)
$\partial \Psi(x) / \partial x_j \le 0$ when $x_j = 0$,  (50)
for $j = 1, \ldots, p$. Therefore, we aim to solve for $x$ from
$x_j \left[ \frac{\partial \Psi(x)}{\partial x_j} \right] = 0, \quad j = 1, \ldots, p$.  (51)
Note that the expression inside the brackets of Equation (51) represents the gradient $\partial \Psi(x)/\partial x_j$, and $x_j$ is included in Equation (51) to reflect the conditions in Equations (49) and (50).
The key step in developing the MI algorithm is to rearrange Equation (51) such that its positive and negative terms appear on different sides of the equation. Writing $\partial \Psi(x)/\partial x_j = s_j^+(x) - s_j^-(x)$, where $s_j^+(x) \ge 0$ and $s_j^-(x) \ge 0$ collect, respectively, the positive and negative terms of the gradient, we rewrite Equation (51) as
$x_j\, s_j^+(x) = x_j\, s_j^-(x)$.  (52)
This equation naturally suggests the following fixed point algorithm to update $x$:
$x_j^{(k+1/2)} = x_j^{(k)}\, \frac{s_j^+(x^{(k)}) + \epsilon}{s_j^-(x^{(k)}) + \epsilon}$,  (53)
where the numerator and denominator correspond to the two sides of Equation (52) evaluated at $x^{(k)}$, and $\epsilon$ is a small positive constant used to avoid a zero denominator in Equation (53). Note that the $\epsilon$ value does not affect the point to which the algorithm converges. As both the numerator and denominator of Equation (53) are positive, $x_j^{(k+1/2)} \ge 0$ whenever $x_j^{(k)} \ge 0$.
In Equation (53) the updated $x_j$ is denoted by $x_j^{(k+1/2)}$, indicating that this is not the final estimate for iteration $k+1$. In fact, this update does not ensure a monotonic increase of $\Psi$, and a line search step must be included to rectify this problem. We first express Equation (53) as a gradient algorithm:
$x_j^{(k+1/2)} = x_j^{(k)} + \omega_j^{(k)}\, \frac{\partial \Psi(x^{(k)})}{\partial x_j}$,  (56)
where $\omega_j^{(k)} = x_j^{(k)} / (s_j^-(x^{(k)}) + \epsilon) \ge 0$. Note that $\omega_j^{(k)} = 0$ when $x_j^{(k)} = 0$. When $x_j^{(k)} = 0$ we set $\omega_j^{(k)} = 0$ only if $\partial \Psi(x^{(k)})/\partial x_j \le 0$ (since $x_j^{(k)} = 0$ satisfies the KKT condition in this case); otherwise, we set $\omega_j^{(k)}$ to another small positive constant. Equation (56) explains that $x^{(k+1/2)}$ emanates from $x^{(k)}$ in the gradient direction of $\Psi$ with a non-negative step size. For the line search step, the search direction is $d^{(k)} = x^{(k+1/2)} - x^{(k)}$, with $\alpha$ denoting the line search step size. Since $0 \le \alpha \le 1$ guarantees $x^{(k)} + \alpha d^{(k)} \ge 0$, we only search in the fixed range $[0, 1]$. After including the line search step, $x^{(k+1)}$ is obtained according to
$x^{(k+1)} = x^{(k)} + \alpha_k \left( x^{(k+1/2)} - x^{(k)} \right)$.  (57)
Due to the fixed search interval, this line search is remarkably simple. One simple and efficient search strategy is provided by Armijo’s rule (e.g., [43]). The Armijo line search is a finitely terminating algorithm. Briefly, it starts with $\alpha = 1$, and for each $\alpha$ it checks whether the following Armijo condition is satisfied:
$\Psi\!\left(x^{(k)} + \alpha d^{(k)}\right) \ge \Psi(x^{(k)}) + \xi\, \alpha\, \nabla \Psi(x^{(k)})^T d^{(k)}$,  (58)
where $\xi \in (0, 1)$ is a fixed parameter. If Equation (58) is true then stop; otherwise, reset $\alpha$ to a smaller value (e.g., halve it) and reevaluate the Armijo condition (58). Note that the repeated evaluations of $\Psi(x^{(k)} + \alpha d^{(k)})$ can be made with the projection of $d^{(k)}$ being computed only once. Therefore, the line search step does not add major extra computations to the MI algorithm.
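A minimal sketch of this backtracking rule (the halving factor and the default value of ξ are our illustrative choices):

```python
import numpy as np

def armijo_step(psi, x, d, grad, xi=0.01, max_halvings=20):
    """Backtracking line search over the fixed interval [0, 1].

    psi  : callable returning the objective Psi at a given image
    x    : current estimate x^(k)
    d    : search direction x^(k+1/2) - x^(k)
    grad : gradient of Psi at x^(k)
    Returns an alpha satisfying the Armijo condition (58), or the
    smallest alpha tried.
    """
    psi0 = psi(x)
    slope = float(grad @ d)          # directional derivative at alpha = 0
    alpha = 1.0
    for _ in range(max_halvings):
        if psi(x + alpha * d) >= psi0 + xi * alpha * slope:
            break
        alpha *= 0.5                 # shrink and try again
    return alpha
```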
Convergence properties of the MI algorithm are given in [26,41]. Briefly, under certain regularity conditions, MI converges monotonically to a local maximum satisfying the KKT conditions.
For the mean functions given in Equation (4), we have $\partial \mu_i / \partial x_j = a_{ij}$ for emission and $\partial \mu_i / \partial x_j = -a_{ij}\, b_i e^{-a_i^T x}$ for transmission tomography; the corresponding updating formula (53) becomes Equation (59) for emission tomography and Equation (60) for transmission tomography. The derivative $\dot f(\mu_i)$ appearing in these formulae depends on the log-density $f(y_i; \mu_i)$. Some examples are presented below.
Example 5.1 (MI for emission scans with Poisson noise).
For emission tomography with Poisson noise, we have the log-density function for $y_i$:
$f(y_i; \mu_i) = y_i \log \mu_i - \mu_i$,  (61)
where $\mu_i = a_i^T x + r_i$. Thus $\dot f(\mu_i) = y_i/\mu_i - 1$, which gives $\dot f^+(\mu_i) = y_i/\mu_i$ and $\dot f^-(\mu_i) = 1$. The updating formula (59) becomes, for $j = 1, \ldots, p$,
$x_j^{(k+1/2)} = x_j^{(k)}\, \frac{\sum_i a_{ij}\, y_i / \mu_i(x^{(k)}) + h \left[\partial J(x^{(k)})/\partial x_j\right]^- + \epsilon}{\sum_i a_{ij} + h \left[\partial J(x^{(k)})/\partial x_j\right]^+ + \epsilon}$.  (62)
Note that when $h = 0$ (i.e., maximum likelihood reconstruction), $r_i = 0$ and $\epsilon = 0$, this algorithm coincides with the EM algorithm for emission tomography. After the line search, the estimate of $x$ at iteration $k+1$ is given by Equation (57). In this algorithm, there is only one back-projection (for the numerator of Equation (62)) and one forward-projection in each iteration; its computational burden is the same as that of EM.
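A sketch in the spirit of the update in Equation (62); a unit step is used here, so a full implementation would add the line search of Equation (57), and the penalty handling and parameter values are our illustrative choices:

```python
import numpy as np

def mi_emission(A, y, r, h=0.0, pen_grad=None, n_iter=50, eps=1e-6):
    """Multiplicative iterative (MI) update for Poisson emission data.

    A, y, r  : system matrix, counts, known background means
    h        : smoothing parameter; pen_grad(x) must return dJ/dx when h > 0
    The positive/negative parts of the penalty gradient go to the numerator
    and denominator respectively.
    """
    n, p = A.shape
    x = np.ones(p)
    sens = A.sum(axis=0)                       # sum_i a_ij
    for _ in range(n_iter):
        mu = A @ x + r                         # forward projection
        num = A.T @ (y / np.maximum(mu, eps))  # positive gradient terms
        den = sens.copy()                      # negative gradient terms
        if h > 0.0 and pen_grad is not None:
            g = h * pen_grad(x)
            num += np.maximum(-g, 0.0)         # negative part of penalty gradient
            den += np.maximum(g, 0.0)          # positive part of penalty gradient
        x = x * (num + eps) / (den + eps)
    return x
```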
Example 5.2 (MI for randoms-precorrected PET emission scans).
Some PET scans produce measurements that have already been corrected for randoms [44], and such measurements no longer follow Poisson distributions. In this example we consider the weighted least squares model, which is also used in [11] but in a different context; i.e., we reconstruct from the randoms-precorrected measurements $y_i$ by maximizing the objective of Equation (2) with the log-density taken to be a weighted least squares function of $y_i - \mu_i$. For this choice, formula (59) still applies. Since the derivative of this log-density with respect to $\mu_i$ is $w_i (y_i - \mu_i)$, its positive and negative components are readily identified. The MI algorithm first updates $x$ according to the resulting multiplicative formula and then, after the line search step, computes $x^{(k+1)}$ according to Equation (57).
Example 5.3 (MI for polyenergetic transmission scans with Poisson noise).
Application of the MI algorithm to polyenergetic X-ray CT is again extremely easy. Under the assumption of Poisson noise, the log-density for measurement $y_i$ is identical to Equation (61), but now with $\mu_i$ being a function of the partial densities $z$; see Equation (17). In Example 5.1 we have already derived $\dot f^+(\mu_i)$ and $\dot f^-(\mu_i)$ for the Poisson noise log-density. On the other hand, the derivative of $\mu_i(z)$ with respect to $z_{jr}$ is readily obtained from Equations (5) and (17). Thus, the updating formula for polyenergetic transmission follows for every pixel $j$ and material $r$. After the line search step specified in Equation (57), $z^{(k+1)}$ is obtained. This iterative formula involves one forward- and two back-projections in each iteration, and therefore it demands a similar amount of computation to the alternating minimization algorithm in [34]. When $h = 0$, $r_i = 0$ and $\epsilon = 0$, this MI algorithm is identical to the algorithm given in [45] for maximum likelihood reconstruction in transmission tomography. Note that, unlike the optimization transfer and alternating minimization algorithms, the MI algorithm can be easily derived for other objective functions, such as the weighted least squares function.
The above examples demonstrate that the MI algorithms are easy to derive and to implement in tomographic imaging. The line search step they require does not incur a significant computational burden.
6. Modified Fisher’s Method of Scoring Using Jacobi or Gauss–Seidel Over-Relaxations
In this section we elaborate on another non-negatively constrained method for tomographic imaging, which is a modification of the standard Fisher's method of scoring (FS) algorithm. This method is developed based on the following steps. Firstly, the objective function $\Psi$ is approximated by a quadratic function in each iteration, where the Fisher information matrix (e.g., [46]) is used to define the quadratic term; secondly, an over-relaxation method, either the Jacobi over-relaxation (JOR) or the Gauss–Seidel over-relaxation (also called successive over-relaxation (SOR)), is employed to solve approximately the linear system derived from zeroing the derivative of this quadratic function. The resulting algorithms are called FS-JOR and FS-SOR, and their detailed descriptions can be found in [47,48]. Descriptions of the JOR and SOR methods are available, for example, in [49].
FS is a general optimization algorithm for computing maximum likelihood estimates. Its advantages over the traditional Newton's method have been documented in [50]. Briefly, FS iterations are well defined due to the non-negative definiteness of the Fisher information matrix, whereas for Newton's method the negative Hessian matrix may not even be non-negative definite, so that it does not necessarily proceed in an uphill direction in some applications. Transmission tomography is an example where this problem for Newton's method indeed occurs; see Example 6.2.
We assume the objective function $\Psi$ in Equation (2) is twice differentiable and let $F(x)$ be the Fisher information matrix, namely $F(x) = E[-\partial^2 \Psi(x) / \partial x\, \partial x^T]$. At iteration $k+1$ of the Fisher scoring algorithm, $\Psi(x)$ is approximated by the following quadratic function:
$Q_k(x) = \Psi(x^{(k)}) + \nabla \Psi(x^{(k)})^T (x - x^{(k)}) - \tfrac{1}{2} (x - x^{(k)})^T F^{(k)} (x - x^{(k)})$,
where $F^{(k)}$ denotes the Fisher information matrix at $x^{(k)}$. Then the $x$ estimate is updated by constrained maximization of $Q_k(x)$, namely $x^{(k+1)} = \arg\max_{x \ge 0} Q_k(x)$. The KKT conditions for this optimization yield, for each $j$, an equation (Equation (73)) involving the $j$-th row of $F^{(k)}$. The JOR and SOR methods solve Equation (73), for $j = 1, \ldots, p$, in different manners: JOR solves it by fixing all the elements of $x$, except $x_j$, at their estimates from the last iteration (i.e., iteration $k$), whereas SOR solves it by fixing all the elements of $x$, except $x_j$, at their most current estimates.
The above illustrations describe how to incorporate JOR or SOR sub-iterations into the FS algorithm. In fact, in each iteration, JOR or SOR is used to solve approximately the linear system of equations determined by the FS algorithm, and then this approximate solution is used as the starting value for the next FS iteration. These new schemes modify the standard FS method, and are feasible for large estimation problems.
Usually it suffices to run one JOR or SOR sub-iteration, but running more than one sub-iteration is also attractive as it has the potential to reduce the computations for the entire optimization process. Suppose within each Fisher scoring iteration we run $m$ sub-iterations of JOR or SOR. The resulting algorithms are called the $m$-step FS-JOR and $m$-step FS-SOR algorithms, respectively. Let $r$ be the sub-iteration index for the over-relaxation method and $x^{(k,r)}$ the estimate of $x$ at the $r$-th over-relaxation sub-iteration of the $k$-th FS iteration. Let $x_j^{(k,r)}$ be the $j$-th element of $x^{(k,r)}$. Assume the diagonal elements $F_{jj}^{(k)}$ are positive for all $j$. At iteration $k+1$, first set $x^{(k+1,0)} = x^{(k)}$. If using JOR to solve Equation (73), we obtain the update of Equation (74); if using SOR, we obtain the update of Equation (75), where $\omega$ is the relaxation parameter. If any $x_j^{(k,r)} < 0$ then it is reset to zero. This resetting is correct since the only possibility for $x_j^{(k,r)} < 0$ is that the expressions in the round brackets of Equations (74) and (75) are negative, because $\omega > 0$ and $F_{jj}^{(k)} > 0$. Hence resetting negative values to zero ensures that the FS-JOR and FS-SOR algorithms converge, when they converge, to the solution satisfying the KKT conditions. At the end of the sub-iterations, set $x^{(k+1)} = x^{(k+1,m)}$. Note that when $m = 1$, the last term in the round brackets of either Equation (74) or (75) becomes zero. Thus 1-step FS-JOR is basically a gradient algorithm, and we can therefore replace $\omega$ by a line search step size, where the search range is fixed to an interval that keeps the estimate non-negative.
The relaxation parameter $\omega$ is used to achieve convergence of the FS-JOR and FS-SOR algorithms. Results contained in [47] give convergence properties when the non-negativity constraint is ignored. In fact, in this context FS-SOR converges if $0 < \omega < 2$, and FS-JOR converges if $\omega$ lies below a bound determined by $\lambda_{\max}$, where $\lambda_{\max}$ is the maximum eigenvalue of the appropriately scaled Fisher information matrix evaluated at $\widehat{x}$. Here $\widehat{x}$ is the MPL solution.
From the updating formulae given in Equations (74) and (75), we can see that both FS-JOR and FS-SOR involve the gradient $\nabla \Psi(x^{(k)})$ and the Fisher information matrix based operation $F^{(k)} x$. The gradient is standard for most reconstruction algorithms, but the computation of $F^{(k)} x$ requires more careful consideration. It will become clear in Examples 6.1 and 6.2 that, for tomographic reconstructions, $F^{(k)}$ usually exhibits the form $A^T W A$ (plus a penalty term), where $W$ is a diagonal matrix. It is not wise to compute $A^T W A$ first, as this involves the multiplication of two huge matrices $A$ and $A^T$. For FS-JOR, a feasible alternative is to use the forward projection to find $A x^{(k)}$ first, then to multiply it with the diagonal values of $W$ to get $W A x^{(k)}$, and finally to back-project $W A x^{(k)}$ to obtain $A^T W A x^{(k)}$ (ignoring the penalty term). This approach involves only one forward- and one back-projection in every sub-iteration. The situation for FS-SOR is more complicated since the vector being projected changes with the pixel index $j$. The above approach for FS-JOR cannot be used here, as otherwise each FS-SOR sub-iteration would demand an infeasible $p$ pairs of forward- and back-projections. To confront this problem, note that when only a single element of $x$ changes, the forward projection changes by one column of $A$ scaled by that change; so we can start with $A x^{(k+1,0)}$ and obtain the subsequent projections by applying Equation (77), which updates the projection incrementally after each pixel update. Although the total number of multiplications then becomes comparable to a single forward projection, this requires column access to the system matrix $A$, which can be a problem if $A$ is generated on-the-fly.
We next provide examples of applying FS-JOR and FS-SOR to emission and transmission tomography.
Example 6.1 (Emission scans with Poisson noise).
For emission reconstruction with Poisson noise, the log-density of $y_i$ is given by Equation (61). Thus, for the corresponding objective function $\Psi$ of Equation (2), its gradients are
$\frac{\partial \Psi(x)}{\partial x_j} = \sum_i a_{ij}\left(\frac{y_i}{\mu_i(x)} - 1\right) - h\,\frac{\partial J(x)}{\partial x_j}$,
and its Fisher information matrix elements are
$F_{jl}(x) = \sum_i \frac{a_{ij} a_{il}}{\mu_i(x)} + h\,\frac{\partial^2 J(x)}{\partial x_j\, \partial x_l}$,
where $\mu_i(x) = a_i^T x + r_i$. Assuming we run only one sub-iteration of FS-JOR or FS-SOR (i.e., $m = 1$), the FS-JOR iterative formula is given in Equation (80) and the FS-SOR formula in Equation (81); then $x^{(k+1)} = x^{(k+1,1)}$. The formula given in Equation (80) is just a gradient algorithm, so $\omega$ can be replaced by a line search step size. Efficient computation of Equation (81) requires column access to matrix $A$, as explicated before. Hudson et al. [48] reported simulation results and a real data application for emission reconstruction. They compared FS-JOR and FS-SOR with EM. The computer time required per iteration for the EM and one-step FS-JOR algorithms was similar. Compared with the EM algorithm, FS-JOR and FS-SOR accelerated convergence when an appropriate value of $\omega$ was used. In particular, FS-SOR had a superior speed of convergence with a suitably chosen $\omega$.
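A hedged sketch of a one-step FS-JOR update for this emission model, which scales the gradient step by the diagonal of the Fisher information and resets negative values to zero (the exact form of Equation (80) is in the source; the names, the optional penalty hooks and the epsilon guard are ours):

```python
import numpy as np

def fs_jor_step(A, y, r, x, omega=1.0, h=0.0, pen_grad=None, pen_curv=None, eps=1e-12):
    """One-step FS-JOR style update for Poisson emission data (a sketch).

    Uses the diagonal Fisher information F_jj = sum_i a_ij^2 / mu_i
    (plus an optional penalty curvature) to scale a gradient-ascent step.
    """
    mu = A @ x + r                                   # forward projection
    grad = A.T @ (y / np.maximum(mu, eps) - 1.0)     # likelihood gradient
    fisher_diag = (A**2).T @ (1.0 / np.maximum(mu, eps))
    if h > 0.0 and pen_grad is not None:
        grad -= h * pen_grad(x)
        if pen_curv is not None:
            fisher_diag += h * pen_curv(x)           # diagonal of penalty Hessian
    x_new = x + omega * grad / np.maximum(fisher_diag, eps)
    return np.maximum(x_new, 0.0)                    # reset negatives to zero (KKT)
```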
Example 6.2 (Transmission scans with Poisson noise).
For transmission reconstruction with Poisson noise, we can easily work out the gradient and Fisher information matrix from its penalized likelihood function. The gradients are
$\frac{\partial \Psi(x)}{\partial x_j} = \sum_i a_{ij}\, b_i e^{-a_i^T x} \left(1 - \frac{y_i}{\mu_i(x)}\right) - h\,\frac{\partial J(x)}{\partial x_j}$,
and the Fisher information matrix elements are
$F_{jl}(x) = \sum_i \frac{a_{ij} a_{il} \left(b_i e^{-a_i^T x}\right)^2}{\mu_i(x)} + h\,\frac{\partial^2 J(x)}{\partial x_j\, \partial x_l}$,
where $\mu_i(x) = b_i e^{-a_i^T x} + r_i$. Note that for this example the Fisher information matrix is non-negative definite, but the negative Hessian matrix may not be, making Newton's method non-applicable. Corresponding to $m = 1$, the FS-JOR iterative formula is given in Equation (84) and the FS-SOR formula in Equation (85); then $x^{(k+1)} = x^{(k+1,1)}$. Again, Equation (84) is a gradient algorithm, so a line search can be used, and efficient implementation of Equation (85) demands unpleasant column access to $A$.
This section explained the Fisher scoring based image reconstruction algorithms using JOR or SOR sub-iterations. For these algorithms, any negative estimates in each iteration can be corrected by simply resetting them to zero, as this way of resetting enforces the KKT conditions. If only one sub-iteration is used, FS-JOR is equivalent to a gradient algorithm. Efficient implementation of FS-SOR requires column retrieval of the system matrix $A$, which can be infeasible for some reconstruction problems.
7. Iterative Coordinate Ascent Algorithms
Another method using SOR is the method of iterative coordinate ascent (ICA) (or iterative coordinate descent (ICD) for minimization problems). ICA was first applied to tomographic imaging in [51,52]. The basic idea of ICA is to apply SOR directly to the objective function $\Psi$, resulting in a sequence of 1-D functions, where each $x_j$ is associated with one of these 1-D functions. Then each function is maximized exactly or approximately to update the corresponding $x_j$. More specifically, using the SOR principle we can define a function of $x_j$ according to
$\Psi_j(x_j) = \Psi\!\left(x_1^{(k+1)}, \ldots, x_{j-1}^{(k+1)}, x_j, x_{j+1}^{(k)}, \ldots, x_p^{(k)}\right)$.
This is a function of $x_j$ only, and we can update the $x_j$ estimate by $x_j^{(k+1)} = \arg\max_{x_j \ge 0} \Psi_j(x_j)$. Since this is a 1-D function, the constraint $x_j \ge 0$ can easily be enforced using, for example, the resetting-to-zero approach.
One computational issue with ICA when applied to tomographic imaging is that it requires repeated calculations of $a_i^T x$ for all $i$ when updating each $x_j$. This problem can be rectified by the following approach. Let $t_i$ denote the current value of $a_i^T x$. Assuming the update of $x_j$ changes its value from $x_j^{\mathrm{old}}$ to $x_j^{\mathrm{new}}$, then $a_i^T x$ changes only through the term $a_{ij} x_j$, and therefore
$t_i^{\mathrm{new}} = t_i^{\mathrm{old}} + a_{ij}\left(x_j^{\mathrm{new}} - x_j^{\mathrm{old}}\right)$.
This relationship explains that $a_i^T x$ can be cheaply computed using its value before the $x_j$ update plus a correction term. However, similar to FS-SOR, this necessitates column access to $A$, which can be a potential issue if $A$ is generated on-the-fly.
Next, we again use the emission and transmission examples to elaborate on the ICA algorithm.
Example 7.1 (Emission scans with Poisson noise).
Firstly, we define the partially updated projection values needed for the coordinate-wise update. From the penalized log-likelihood function of the emission measurements (see, for example, Equation (34)), the function $\Psi_j(x_j)$ is obtained by fixing all other pixels at their most current estimates. Since this is a non-quadratic function of $x_j$, exact maximization is infeasible. We can obtain an approximate optimum by running a single or multiple steps of, for example, the Newton or Fisher scoring algorithm. In this example we consider using the Fisher scoring algorithm to optimize $\Psi_j(x_j)$ and call the resulting algorithm ICA-FS. After a single step of Fisher scoring we have
$x_j^{(k+1)} = \left[ x_j^{(k)} + \alpha_j\, \frac{\partial \Psi_j(x_j^{(k)})/\partial x_j}{F_{jj}} \right]_+$,
where $F_{jj}$ is the Fisher information for $x_j$, $[\,\cdot\,]_+$ denotes resetting negative values to zero, and $\alpha_j$ is a line search step size enforcing $\Psi_j(x_j^{(k+1)}) \ge \Psi_j(x_j^{(k)})$, where equality holds only when the algorithm has converged. This monotonic condition eventually leads to $\Psi(x^{(k+1)}) \ge \Psi(x^{(k)})$. The projections $a_i^T x$ are then updated using the correction relationship described above.
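A sketch of one full ICA-FS sweep for the Poisson emission model without a penalty, combining the coordinate-wise Fisher scoring step with the incremental projection update described above (symbol names are ours):

```python
import numpy as np

def ica_fs_sweep(A, y, r, x, eps=1e-12):
    """One ICA-FS sweep for Poisson emission data (maximum likelihood, no penalty).

    Each pixel is updated in turn by a single Fisher scoring step of its 1-D
    objective; the projections t = A x are maintained incrementally so that a
    full sweep costs about one forward projection worth of multiplications.
    Requires column access to A.
    """
    x = x.copy()
    t = A @ x                                  # current projections a_i^T x
    for j in range(A.shape[1]):
        col = A[:, j]
        mu = t + r
        grad = col @ (y / np.maximum(mu, eps) - 1.0)        # d Psi_j / d x_j
        fisher = (col**2) @ (1.0 / np.maximum(mu, eps))     # F_jj
        x_new = max(x[j] + grad / max(fisher, eps), 0.0)    # reset negatives to zero
        t += col * (x_new - x[j])              # incremental projection update
        x[j] = x_new
    return x
```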
Example 7.2 (Transmission scans with Poisson noise).
For this example, the 1-D objective $\Psi_j$ is obtained from the penalized log-likelihood for transmission scans, with the projection values defined in Equation (90). The ICA-FS algorithm gives a single Fisher scoring update for each $x_j$, analogous to Example 7.1 (with negative values reset to zero), and the projection values are then updated by the correction relationship before moving to the next pixel.