Algorithms for Non-Negatively Constrained Maximum Penalized Likelihood Reconstruction in Tomographic Imaging

Image reconstruction is a key component in many medical imaging modalities. The problem of image reconstruction can be viewed as a special inverse problem where the unknown image pixel intensities are estimated from the observed measurements. Since the measurements are usually noise contaminated, statistical reconstruction methods are preferred. In this paper we review some non-negatively constrained simultaneous iterative algorithms for maximum penalized likelihood reconstructions, where all measurements are used to estimate all pixel intensities in each iteration.


Introduction
Image reconstruction in medical imaging, in general, considers estimating pixel intensities or attenuations from measurements obtained from an imaging system.For example, for positron emission tomography (PET), the measurements are obtained according to the procedure summarized below; see [1,2] for more details.A type of radioactive isotope is introduced into the body of a patient and, due to the decay of radioisotope, it emits positrons.Each positron moves in the body for a small distance (usually less than 1 mm) and then interacts with an electron to produce a pair of gamma photons that travel in almost opposite directions.The scanning device in the imaging system can detect each pair of gamma photons with a certain probability and all such detections form the measurements that can appear in a histogram or a list form [3]. It is usually assumed that the detection probabilities are known and they can be pre-computed and stored or computed on-the-fly.
Note that a special feature of measurements is that they are contaminated by noises, which can be a severe problem particularly if each measurement is small in value due to dose safety limit.It is possible that, if the noises are not properly addressed, the reconstructed image can be distorted by excessive noises.For example, for low dose X-ray CT (a type of transmission tomography), the metal streak artifact (e.g., [4]) can be a severe problem for the traditional filtered backprojection method.Statistical iterative reconstruction methods, due to their ability to model the physics and measurements more accurately, are capable to reduce metal streak artifacts [5].
To deal with the noise contamination problem, statistical image reconstruction methods in emission, transmission, X-ray CT, etc. have been developed based on specified probability models for measurements.For example, for single photon emission computed tomography (SPECT), possible options include: weighted least squares (equivalent to variable variance Gaussian) [6], fixed variance Gaussian [7] and Poisson [8] models.These models can also be used for transmission scans.Since accidental coincidences are the main source of background noise in PET, most PET scans are precorrected for accidental coincidences by real-time subtraction of the coincidences in the delayed window [9].For randoms-precorrected PET scans, possible measurement models are Gaussian, ordinary Poisson and shifted Poisson [9], and all of these are just approximations as the true probability density function (pdf) for the measurements is difficult.Shifted Poisson is also used to model X-ray CT measurements [10].
Different algorithms have been proposed to maximize their corresponding objective functions.For example, for emission tomography, the expectation-maximization (EM) algorithm [8] is designed to maximize the log-likelihood formulated from Poisson distributed measurements, or the iterative space reconstruction algorithm (ISRA) [7] for maximizing the log-likelihood formulated from Gaussian (with fixed variances) distributed measurements.An attractive aspect of both EM and ISRA is that they are very easy to implement and both respect the non-negativity constraint on the reconstructions.However, if the objective function contains a penalty term, which is normally used to smooth the reconstruction, then both EM and ISRA become impractical as they involve, in each iteration, a non-linear system of equations that is tedious to solve exactly due to the large number of unknowns in these equations.Moreover, the penalty function also adds an extra inconvenience when searching for a non-negative solution is desirable.
To simplify notations, both the measurements and the unknown image are lexicographically ordered into vectors.More specifically, we use y = (y 1 , . . ., y n ) T to present the measurement vector and x = (x 1 , . . ., x n ) T to denote the unknown image vector, where superscript T denotes matrix transpose.Note although the notations are unified for different reconstruction problems in this paper, the meaning of these notations, such as x and y, can be different for different imaging modalities.Vectors y and x are related through a system matrix A; see Equation (4) below for some examples.For tomographic reconstruction problems, matrix A is usually assumed known so its estimation is not covered by this paper.Rather, we focus on how to estimate x from the observed y and the known system matrix A. We denote the estimate of x by x.
Statistical reconstruction x obtained by maximum penalized likelihood (MPL) (also known as maximum a posteriori (MAP)) is defined by where Ψ(x) is an objective function derived from the probability distribution for measurements and the penalty function.When the y i 's are assumed independent (given x), the penalized likelihood objective function is where l(x) is the log-likelihood function given by Here h > 0 is the smoothing parameter and J(x) is the penalty function used to smooth x.In Equation ( 3), l i denotes the log-density function for measurement y i , and µ i is a function of x ∈ R p + (here R p + denotes the non-negative orthant of R p ) representing the mean measurement of camera bin i. Examples of µ i include where η i (x) = A i x with A i being the ith row of matrix A, b i is the known blank scan counts of the ith detector and r i the known mean background counts.Another example is polyenergetic transmission scans (such as X-ray CT) where and here x m = (x 1m , . . ., x pm ) T denotes the attenuation map corresponding to the m-th energy spectrum, x is a vector formed by the x m 's and b im is the blank scan count from energy spectrum m.
In Equation (3) the notation l i (µ i ; y i ) is used to emphasize that l i is a function of µ i and it also involves measurement y i .We can also write this function as l i (η i ) or l i (x) in different contexts when there is no ambiguity.However, the functional properties of l i may change with respect to its different arguments.For example, if assuming y i follows a Poisson distribution for either emission or transmission scans, then This is clearly a concave function of µ i for both emission and transmission cases.However, for l i (x) (treated as a function of x), it may be no longer concave for transmission but still concave for emission scans.Concavity is an important property exploited by the optimization transfer algorithms.Let µ be an n-vector of all µ i .The first term of Equation (2), i.e., l(x), measures similarity between y and µ.Different probability distributions have been used to model y i even under the same imaging modality.For example, for emission tomography, if assuming the Poisson model for y i (i.e., y i ∼ Poisson(µ i )) then l i is given by Equation ( 6), or if considering the weighted least squares then where w i is the weight.When w i = µ i we have the weighted least squares model as suggested in [11].
Another example in emission (or transmission) tomography is the randoms-precorrected PET scan (assume no scattering to simplify).In this context, the observed measurements are where y Prompt i and y Delay i (both unavailable directly) denotes the number of coincidences of the prompt and delayed windows respectively.Although we can assume y Prompt i ∼ Poisson(A i x + r i ) and y Delay i ∼ Poisson(r i ) and that they are independent, the exact distribution of y i cannot be derived directly (e.g., [9]).An approximate probability model suggested in [9] is the shifted Poisson distribution, namely y i + 2r i ∼ Poisson(A i x + 2r i ), which gives or the weighted least squares given by Note that the shifted Poisson approximation matches the first two moments with the true probability model for y i + 2r i when both the prompt and delayed measurements are assumed independent and follow Poisson distributions.
In this paper, we present and discuss several important non-negatively constrained penalized likelihood reconstruction algorithms.When designing a reconstruction algorithm in tomographic imaging, one considers the following important issues: (i) the algorithm is computationally efficient, and ideally it involves only forward-projection (e.g., Ax) and back-projection (e.g., A T y) operations; (ii) the algorithm can be easily applied to different measurement probability models and imaging modalities; (iii) the algorithm can impose the non-negativity constraint; (iv) the algorithm converges fast.Our discussions on the algorithms in this paper will mainly focus on these points.
In tomographic imaging, it is important to produce smoothed reconstructions as severe noise in a reconstruction can cause false diagnoses.Smoothing can generally be achieved by one of the following five practices: (i) early termination of the iterations (e.g., [12]); (ii) MPL reconstructions with an appropriate smoothing parameter (e.g., [13]); (iii) functional representation of the unknown image by a set of smooth basis functions (e.g., [14]); (iv) post smoothing of the reconstruction within each iteration (e.g., [15]) or after all iterations ( [16]); and (v) pre-smoothing of the camera data (i.e., sinogram) followed by filter backprojection (FBP) (e.g., [17,18]).We focus on the penalized likelihood approach to smoothing in this paper.In Equation (2), the smoothing parameter h balances two conflicting targets: fidelity of the µ i s to the y i s and smoothness of x.Although an appropriate choice of h is important for achieving a reconstruction with balanced fidelity and smoothness, we will not consider how to estimate h in this paper.A penalty function J(x) is used to smooth or regulate the estimate x.Usually, J(x) takes the form of where C j x represents a neighborhood operation (such as the first or second order difference) on pixel j, and function ρ(•) measures the magnitude of C j x.A common choice of ρ is the quadratic function: Generally, a quadratic penalty tends to produce images with over-smoothed edges.Possible edge preserving penalties include total variation (TV) (e.g., [19]) Huber [20] and hyperbolic functions (e.g., [21]).Note that ρ(•) is convex for all these options.
The optimal choice of the penalty function J and the smoothing parameter h are unsolved problems in image processing and will not be further elaborated in this paper.We emphasize that smoothing by MPL indeed produce visually improved reconstructions over the tradition filtered-backprojection method particularly in dose-limited tomography such as low dose X-ray CT.The edge preserving penalties are extremely useful, such as TV and Huber penalties; see [22][23][24].However, the MPL reconstructions can have unnatural noise textures very different from the familiar filtered-backprojection method.Its impact on diagnostic tasks is still unknown and this is an active research area; see [25] for examples and discussions.
We adopt the following notations throughout this paper.Let x (k) be the estimate of x obtained at iteration k of an algorithm.The notation ∇b(•) indicates the derivative of function b with respect to the variable in the brackets.For example, ∇b(A i x) represents the derivative of b with respect to A i x and ∇b(x; x (k) ) the derivative of b with respect to x.We use ∇ j b(x) to denote the derivative of b with respect to x j , the j-th element of vector x.We also let ∇b(x (k) ) and ∇ j b(x (k) ) represent, respectively, ∇b(x) and ∇ j b(x) evaluated at x = x (k) .
Non-negatively constrained MPL image reconstruction algorithms can be classified into simultaneous and block-iterative (a.k.a.ordered subset (OS)) algorithms.For simultaneous algorithms, all elements in y are used to update x in each iteration, and for block-iterative algorithms, distinct portions of y are used in turn to update x.We discuss in this paper some simultaneous algorithms for non-negatively constrained MPL reconstructions, and the block-iterative algorithms are not included in our discussions.The rest of this paper is arranged as follows.The expectation-maximization algorithm for emission tomography is discussed in Section 2. Section 3 explains the alternating minimization algorithm designed specifically for transmission tomography.Section 4 contains explanations on the optimization transfer algorithms and their applications to tomographic reconstructions.The multiplicative iterative (MI) algorithms for tomographic imaging are provided in Section 5 and the Fisher scoring based Jacobi or Gauss-Seidel over-relaxation algorithms are presented in Section 6. Section 7 explains another Gauss-Seidel method named the iterative coordinate ascent algorithm.Finally, Section 8 includes discussions and remarks about this paper.
In this paper we focus on explaining and summarizing different non-negatively constrained tomographic imaging algorithms.Numerical comparisons of some of these algorithms are available in [26], and therefore will not be given in this paper.

EM Algorithm for Maximum Likelihood Reconstruction in Emission Tomography
The expectation-maximization (EM) algorithm [27] is a statistical algorithm for iteratively computing maximum likelihood estimates when data contain random missing values.Here "random" means these missing values do not provide extra information about the parameters we wish to estimate.We first give a brief summary of the EM algorithm below.
Since there exist the missing and the observed (or incomplete) components, we can define the complete data set as a combination of the incomplete and the missing data.Note, however, that our aim is to estimate the unknown parameters by maximizing the log-likelihood of the incomplete data.The rationale for the EM algorithm is that if maximizing the incomplete data likelihood is difficult while maximizing the complete data likelihood is easy, then EM can be used to compute iteratively the maximum of the incomplete data likelihood by maximizing the complete data likelihood in each iteration.
Let C be the complete data set given by C = [Y, M], where Y denotes the incomplete data and M the missing data.Let l C (x) be the log-likelihood based on the complete data C and l(x) the log-likelihood of the incomplete data Y, where x is a p-vector for the unknown parameters.Let x be the maximum likelihood (ML) estimate of x.Then iteration k + 1 of the EM algorithm comprises two steps: 1. E-Step: Compute the conditional expectation of the complete data log-likelihood given the incomplete data and x (k) , and denote this function by 2. M-Step: Update the x estimate by maximizing the Q function, namely One major advantage of EM is that it guarantees, under certain regularity conditions, that the incomplete data log-likelihood l(x) increases in consecutive iterations before convergence.Note that EM requires availability of the Q function in a closed form; otherwise, a Monte-Carlo E-step can be used to replace the E-step [28].
The EM algorithm was first applied to emission tomograph by Shepp and Vardi [8] and Lange and Carson [29].Both papers adopt the Poisson model for emission counts, namely y i are independent Poisson random variables with mean µ i = A i x.This model assumes r i = 0; otherwise, we can depict y i as the value after subtracting r i from the bin i measurement.From this Poisson model, we can formulate the complete data as C = {y ij : y i = p j=1 y ij }, where y ij follows the Poisson distribution with mean µ ij = a ij x j .Clearly, each y ij represents the unknown portion of measurement on camera bin i attributed to image pixel j.The corresponding complete data log-likelihood is and the corresponding Q function is where y t .Thus after solving ∇ j Q(x; x (k) ) = 0, the M-step of the EM algorithm gives the following updating formula for x: for j = 1, . . ., p.It has been pointed out in [23,30] that formula (15) can also be explained by the Bayes conditional probability formula.This EM algorithm possesses the following properties making it attractive for emission tomography; they are: 1.If the initial x (0) ≥ 0 then x (k) ≥ 0 for all k ≥ 1; i.e., it automatically satisfies the non-negativity constraint on x. 2. The algorithm is easy to implement as it only involves forward-and back-projections.3. The updating formula in Equation ( 15) increases the incomplete data log-likelihood: l(x (k+1) ) ≥ l(x (k) ), where equality holds only when the iteration has converged.4. k) .Thus the x estimate at any iteration satisfies that the total expected and the total observed counts are equal.
The above EM is easy to implement and possesses some attractive properties on the reconstructions.This algorithm, however, is restricted only to emission tomography with Poisson distributed measurements.It cannot be easily extended to other reconstruction tasks.For example, application of the EM algorithm to transmission tomography does not lead to an exact updating formula due to the fact that its M-step does not produce a closed-form solution; see [29].Another limitation is that this EM algorithm can only be used for maximum likelihood reconstructions, and its application to the MPL reconstruction will not in general result in closed-form updating formula.To rectify this problem, Green [31] developed a one-step-late (OSL) algorithm for the MPL reconstruction by replacing x in the derivative of the penalty function by its current estimate x (k) , and therefore an "exact" solution can still be accomplished.But this method suffers from the deficiencies that (i) the algorithm may be non-convergent; and (ii) some estimates may be negative.
De Pierro [32] reproduced the EM updating formula using a totally different argument.In his derivation, there is no missing data and hence no E-step.Although the algorithm is named "modified EM", it is not a real EM.In fact, this algorithm belongs to a more general class called the optimization transfer algorithms, since the Poisson log-likelihood optimization problem is transferred to a simpler optimization in each iteration.We will summarize the optimization transfer algorithms in the Section 4.

Alternating Minimization Algorithms for Transmission Tomography
We have explained in Section 2 that the EM algorithm is not directly suitable for transmission scans as its M-step cannot be computed exactly.In this section, we summarize an alternating minimization algorithm designed to solve the transmission tomographic problem, including X-Ray CT.This algorithm is a generalization to the EM algorithm [33] and its application to transmission tomography can be found in [34].
Following [34], we explain this algorithm using the polyenergetic transmission tomography example.In this context, if assuming transmission scans follow Poisson distributions, the corresponding log-likelihood is where y i is the scan count of detector i and µ i (now expressed as a function of vector z, which will be defined below) is given by Equation (5).Moreover, elements of the attenuation map associated with spectrum m, namely elements of x m in Equation ( 5), are further modeled by where j indexes pixels, r represents different types of materials, u mr are known linear attenuation coefficients and z rj are the unknown partial densities (e.g., [34]) we wish to estimate.In Equation ( 16), z is a vector of size pa × 1 formed by column-wise stacking the vectors z j = (z 1j , . . ., z aj ) T .
Define set E = {q im ; i = 1, . . ., n and m = 0, 1, . . ., M } where for m = 1, . . ., M and q im equals the background noise r i for m = 0. Clearly, µ i given in Equation ( 5) can now be expressed as µ i = M m=0 q im .Define another set In [34], E is called the exponential family and L the linear family.Let p and q be the vectors created from p im and q im respectively.It can be shown that the problem of maximizing the log-likelihood Equation ( 16) can be re-written as subject to z rj ≥ 0, where I(p q) is the I-divergence [35] given by Thus, maximizing the log-likelihood in Equation ( 16) can be achieved iteratively.Assuming the estimates p (k) , q (k) and z (k) are obtained at iteration k, then iteration k + 1 contains two steps: (i) compute p (k+1) by minimizing I(p q (k) ) subject to p ∈ L; (ii) compute q (k+1) by minimizing I(p (k+1) q) subject to q ∈ E.
Note that the second step is equivalent to minimizing I(p (k+1) q) over z rj ≥ 0 with q im being given by the expression in Equation (19).Minimizing I(p q (k) ) over p ∈ L is easily achieved using the Lagrange multiplier, and the result is On the other hand, direct optimization of I(p (k+1) q) over z rj ≥ 0 is an unmanageable task as the z rj 's are mixed (i.e., not decoupled or separated from each other) within the objective function.One approach to overcome this problem is by using a decoupled objective function representing an upper bound of the original objective function.In fact, it can be shown that for q im given by Equation (19), where v 0 = max (i,m) j r a ij u mr and qim is an estimate of q im corresponding to the estimate ẑrj ≥ 0 of z rj .This inequality is obtained from the fact that I(p (k+1) q) is a convex function of z rj .Clearly, z rj on the right hand side of Equation ( 24) are decoupled and thus their non-negatively constrained optimizations will result in closed-form solutions.When we take ẑrj = z (k) rj , the optimal solution to z rj is where w(k+1) im .We give some remarks about this algorithm below.

Remarks
(1) This algorithm is designed for maximum likelihood estimation.However, it can be easily extended to MPL where the penalty function must be convex and therefore can also be decoupled.
(2) This algorithm is developed for the likelihood function derived from the simple Poisson measurement noise.Note that the alternating minimization algorithm was also developed for a compound Poisson noise model in [36] and its comparison with the simple Poisson alternating minimization was provided in [37].For other measurement distributions, however, the corresponding algorithms have to be completely re-developed.(3) The convergence properties of the alternating maximization algorithm have been studied in [34].
Particularly, it is monotonically convergent under certain conditions.(4) It will become clear in Section 5 (Example 5.3) that the multiplicative-iterative algorithm can be derived more easily for this transmission reconstruction problem.(5) The trick of decoupling the objective function using its convex (or concave) property is also the key technique of the optimization transfer algorithms discussed in Section 4.

Optimization Transfer Algorithms
Details of the optimization transfer (OT) algorithm (also called the minorization-maximization (MM) algorithm for maximizations) can be found in, for example, [38].In this section we present this algorithm briefly and explain its application in emission and transmission tomography.
The fundamental idea of the OT algorithm is that it employs a surrogate function to minorize (see the definition below) the objective function Ψ(x) in each iteration, and then update the parameter estimate by maximizing this surrogate function.
More specifically, a function Φ(x; x (k) ) is said to minorize Ψ(x) at x (k) if it satisfies the following "minorization" conditions: Then at iteration k + 1, x is estimated by maximizing Φ(x; x (k) ), i.e., If the exact maximum is not easy to obtain, we can find an x (k+1) by simply increasing Φ(x; x (k) ), as this will also guarantee that the monotonic condition stated below remains for Ψ(x).
An attractive property when using this surrogate function is that x (k+1) satisfies the monotonic condition, namely where equality holds only when the iteration has converged.This monotonic property can be easily verified by the minorization conditions since For implementation of the OT algorithm to medical imaging, a surrogate function Φ(x; x (k) ) must be determined.There exist different ways of choosing the surrogate function, such as those listed in [38].We mainly consider two approaches in this paper: (i) the method based on the inequality on concave functions (called the concave inequality hereafter); and (ii) the method based on quadratic lower bounds (also known as paraboloidal surrogates [39]).These ideas are summarized below.
Let G(x) = n i=1 g i (A i x) be the objective function we wish to maximize, where A i is the i-th row of matrix A n×p and x is a p-vector.For matrix A, we assume its elements a ij are non-negative and j a ij = 0. We also assume that all g i (•) are concave functions.Let π ij ≥ 0 be weights satisfying p j=1 π ij = 1.Then according to the concave inequality we have There are different ways of choosing weights π ij .For example, we can use π ij = a ij x j /A i x, which is also adopted in [32].In this case since each π ij is a function of x, the surrogate function corresponding to Equation ( 28) is and it is easy to verify that this surrogate satisfies the minorization conditions.The right hand side of Equation ( 29) is a weighted summation of functions g i , each involving a single x j only (i.e., decoupled), and therefore maximization with respect to x of Φ(x; x (k) ) can be achieved by a sequence of 1-D optimizations.Another trick, due to De Pierro [32], uses the following concave inequality: If the weights π ij do not depend on x j , then Equation (30) leads to the surrogate function of which clearly also meets the minorization conditions.In Equation (31), the choice of π ij is again flexible, and one popular option is to use The above two surrogates are developed based on the concave inequality.Another useful approach is to employ a quadratic lower bound (e.g., [40]).Assume g i is twice differentiable with its second derivative denoted by ∇ 2 g i .Let d The right hand side of Equation ( 32) is a parabola surrogate of g i and the condition on d (k) i guarantees that this function lies below g i .Unlike the previous surrogate functions, this surrogate is not separable in x, and therefore its maximization with respect to x cannot be reduced to a series of 1-D problems.
To overcome this problem we can find another function surrogating the above parabola surrogate but with separable x.Towards this, we denote the right hand side quadratic function of Equation ( 32) by q is concave in A i x, we can use either Equations ( 29) or (31) to find a surrogate to q (k) i and the resulting algorithm is called the separable paraboloidal surrogate (SPS) algorithm [39].For example, corresponding to Equation ( 31), a separable parabola surrogate of q (k) is A careful selection of the curvature b in Equation ( 32) can lead to fast convergence of the SPS algorithm.Erdoǧan and Fessler [39] derived the optimal curvature for the SPS algorithm in transmission tomography.
Next, we present two examples explaining how to implement the OT algorithm to emission and transmission tomography.

Example 4.1 (OT for emission scans with Poisson noise).
In this example we explain the application of OT for MPL reconstruction in emission tomography, where measurements are assumed to follow Poisson distributions.De Pierro's modified EM (MEM) [32] coincides with the method discussed below when r i = 0. Firstly, under the Poisson model for emission scans, the penalized log-likelihood function is where ρ is assumed a convex function.Let where η i = A i x.It is easy to verify that l i is concave with respect to η i , so we can use Equation ( 28) to define its surrogate function.On the other hand, for the penalty function in Equation (34), −ρ is concave, so we can use Equation (31) to construct its surrogate.Combining them together we have the following surrogate for Ψ(x): where π tj = c tj / r c tr .Now The equation ∇ j Φ(x; x (k) ) = 0 has a closed-form solution for x j when ρ(v) = v 2 /2 and r i = 0 for all i.In this context, Equation (37) reduces to a quadratic function so we wish to solve for x j from subject to x j ≥ 0, and its analytic solution is readily available.If r i = 0 or ρ is not quadratic, the analytic solution to Equation (37) does not exist.In this case, one can use an 1-D optimization method to solve it, or alternatively, one may use a separable parabola surrogate rather than Equation (36).An example of the latter is explained in the next example where the reconstruction problem is for transmission tomography.

Example 4.2 (OT for transmission scans with Poisson noise).
This example considers the application of OT to MPL reconstruction in transmission tomography.Our explanations follow [39] closely.For transmission scans with Poisson noise, the penalized log-likelihood is given by where ρ is convex.Let η i = A i x and Since l i (η i ) is concave with respect to η i , a separable parabola surrogate can be defined according to Equation (33).For the first term of Equation (39) (i.e., the log-likelihood part), a separable parabola is given by where q and here d For the second term of Equation (39) (i.e., the penalty part), let γ t = C t x and let the weights ξ tj = c tj / r c tr .Its separable parabola surrogate is where w Here e (k) t is chosen such that e (k) t ≥ ∇ 2 ρ(γ t ) for all γ t in its range; this curvature e (k) t ensures that w (k) t (γ t ) lies above ρ(γ t ).Aggregating Equations ( 41) and ( 43) we obtain a separable parabola surrogate for Ψ(x): Φ(x; We have and for this example Let a i• = r a ir and c t• = r c tr .The solution of ∇ j Φ(x; x (k) ) = 0, subject to x j ≥ 0, is given by x This is in fact a special gradient algorithm with a diagonal preconditioning matrix.

Multiplicative Iterative Algorithms
The OT algorithms presented in the last section have the following important achievements: (1) they manage to transform a high dimensional optimization problem into a series of 1-D optimizations; (2) due to 1-D optimizations, the non-negativity constraints can be easily enforced by simply resetting negative estimates to zero in each iteration; (3) the surrogate given by the separable parabola approach is general enough to be applicable to different tomographic reconstructions.A limitation of OT is that it requires all l i (•) (log-density) and −J(•) (negative penalty) to be concave functions.
In this section we discuss a competitive alternative to the OT method called the multiplicative iterative (MI) algorithm; its application to tomographic imaging can be found in [26] and to box-constrained image processing in [41].
The main motivation of the MI algorithm is that it can be easily derived under different imaging modalities and different measurement noise models.Moreover, for some difficult penalties, such as TV, or even non-convex penalties [42], MI can be easily implemented to solve the corresponding optimization problems.
A general MI updating formula can be developed suitable for all tomographic reconstruction problems regardless of the mean function model, measurement probability distribution and penalty function.The simulation study reported in [26] reveals that MI has competitive convergence speed when compared with OT and other reconstruction algorithms.The MI algorithm does not require concavity of the functions l i and −J and therefore is more general than the OT algorithm.It requires existence of the first derivatives of l i (•) and J(•).It is possible that the objective function Ψ(x) in Equation ( 2) has multiple local maxima.In this case, MI finds one of the local non-negative maxima, depending on the starting value of the algorithm.
Here are some notations needed to explain the MI algorithm.We develop the MI algorithm from the Karush-Kuhn-Tucker (KKT) necessary conditions for the non-negatively constrained optimization of Ψ(x).They are: (49) for j = 1, . . ., p. Therefore, we aim to solve for x from Note that the expression inside the brackets of Equation ( 51) represents ∇ j Ψ(x), and x j is included in Equation ( 51) to reflect the conditions in Equations ( 49) and ( 50).
The key step in developing the MI algorithm is to rearrange Equation ( 51) such that its positive and negative terms appear on different sides of the Equation ( 51).Hence we rewrite Equation ( 51) as This equation naturally suggests the following fixed point algorithm to update x: j2 denote respectively the right and left hand side of Equation ( 52), namely, and and is a small positive constant, such as = 10 −5 , used to avoid zero denominate of Equation (53).Note that the value does not affect where the algorithm converges to.As both numerator and denominator of Equation ( 53) are positive, x In Equation (53) the updated x j is denoted by x (k+1/2) j indicating this is not the final estimate for iteration k + 1.In fact, this update does not ensure monotonic increment of Ψ(x) and a line search step must be included to rectify this problem.We first express Equation (53) as a gradient algorithm: where s j satisfies the KKT condition in this case); otherwise, we set s , where ˜ is another small constant such as 10 −2 .Equation (56) explains that x (k+1/2) j emanates from x (k) j in the gradient direction of Ψ with a non-negative step size s (k) j .For the line search step, the search direction is d (k) = x (k+1/2) − x (k) with α (k) > 0 denoting the line search step size.Sine α (k) ≤ 1 guarantees x (k+1) ≥ 0, we only search in the fixed range of 0 < α (k) ≤ 1.After including a line search step x (k+1) is obtained according to Due to the fixed search interval, this line search is remarkably simple.One simple and efficient search strategy is provided by the Armijo's rule (e.g., [43]).Armijo line search is a finite terminating algorithm.Briefly, it starts with α = 1, and for each α it checks if the following Armijo condition is satisfied: where 0 < ξ < 1 is a fixed parameter such as ξ = 10 −2 .If Equation ( 58) is true then stop; otherwise, reset α = ρα (such as ρ = 0.6) and reevaluate the Armijo condition (58).Note that the repeated evaluations of Ψ(x (k) + αd (k) ) can be made with Ad (k) being computed only once.Therefore, the line search step does not add extra major computations to the MI algorithm.
Convergence properties of the MI algorithm are given in [26,41].Briefly, under certain regular conditions, MI converges monotonically to a local maxima satisfying the KKT conditions.
For the mean functions given in Equation ( 4), we have ∇ j µ i (x) = a ij for emission and ∇ j µ i (x) = −b i e −A i x a ij for transmission tomography; the corresponding updating formula (53) becomes: for emission tomography, and for transmission tomography.The derivative ∇l i (µ i ) in the above formulae depends on the log-density l i (µ i ).Some examples are presented below.

Example 5.1 (MI for emission scans with Poisson noise).
For emission tomography with Poisson noise, we have the log-density function for y i : where The updating formula (59) becomes, for j = 1, . . ., p, x Note that when h = 0 (i.e., maximum likelihood reconstruction), r i = 0 and = 0, this algorithm coincides with the EM algorithm for emission tomography.After line search, the estimate of x at iteration k + 1 is given by Equation (57).In this algorithm, there is only one back-projection (for the numerator of Equation ( 62)) and one forward-projection in each iteration; its computational burden is the same as EM.
Example 5.2 (MI for randoms-precorrected PET emission scans).Some PET scans produce measurements that have already been corrected for randoms [44] and their measurements no longer follow Poisson distributions.We consider in this example the model weighted least squares which is also used in [11] but under a different context, i.e., we reconstruct from randomsprecorrected measurements y i by maximizing the objective Equation ( 2) where Here µ i is used to denote A i x, and for this µ i formula (59) still applies.Now since we have ∇l i (µ i ) + = [(y i + 2r i )/(µ i + 2r i )] 2 and ∇l i (µ i ) − = −1.The MI algorithm updates x first according to and then, after the line search step, computes x (k+1) according to Equation (57).

Example 5.3 (MI for polyenergetic transmission scans with Poisson noise).
Application of the MI algorithm to polyenergetic X-ray CT is again extremely easy.Under the assumption of Poisson noise, the log-density for measurement y i is identical to Equation (61) but now with µ i = M m=1 b im e − j a ij r umrz rj + r i ; see Equation (17).In Example 5.1 we have already derived ∇l i (µ i ) + and ∇l i (µ i ) − for the Poisson noise log-density.On the other hand, the derivative of µ i with respect to z rj (denoted by ∇ rj µ i ) is Thus, the updating formula for ployenergetic transmission is for r = 1, . . ., a and j = 1, . . ., p.After the line search step specified in Equation (57), z (k+1) is obtained.This iterative formula involves one forward-and two back-projections in each iteration, and therefore it demands similar amount of computations when compared with the alternative minimization algorithm in [34].When h = 0, r i = 0 = 0 and m = 1, this MI algorithm is identical to the algorithm given in [45] for maximum likelihood reconstruction in transmission tomography.Note that unlike the optimization transfer and alternating minimization algorithms, the MI algorithm can be easily derived for other objective functions, such as the weighted least-squares function.
The above examples demonstrate that the MI algorithms are easy to derive and to implement in tomographic imaging.The line search step it requires does not incur significant computational burden.

Modified Fisher's Method of Scoring Using Jacobi or Gauss-Seidel Over-Relaxations
In this section we elaborate on another non-negatively constrained method for tomographic imaging, which is a modification to the standard Fisher's method of scoring (FS) algorithm.This method is developed based on the following steps.Firstly, the objective function Ψ(x) is approximated by a quadratic function in each iteration, where the Fisher information matrix (e.g., [46]) is used to define the quadratic term; secondly, an over-relaxation method, either the Jacobi over-relaxation (JOR) or the Gauss-Seidel over-relaxation (also called the successive over-relaxation (SOR)), is employed to solve approximately the linear system derived from zeroing the derivative of this quadratic function.The resulting algorithms are called FS-JOR and FS-SOR and their detailed descriptions can be found in [47,48].Descriptions of the JOR and SOR methods are available, for example, in [49].
FS is a general optimization algorithm for computing maximum likelihood estimates.Its advantages over the traditional Newton's method have been documented in [50].Briefly, FS iterations are well defined due to the non-negativeness of the Fisher information matrix, but for the Newton's method, the negative Hessian matrix may not even be non-negative definite, making it unnecessarily proceed in the uphill direction in some applications.Transmission tomography is an example where this problem for the Newton's method indeed occurs; see Example 6.2.
We assume the objective function Ψ(x) in Equation ( 2) is twice differentiable and let F (x) be the Fisher information matrix, namely F (x) = E(−∇ 2 Ψ(x)).At iteration (k + 1) of the Fisher scoring algorithm, Ψ(x) is approximated by the following quadratic function: where F (k) denotes the Fisher information matrix at x (k) .Then the x estimate is updated by constrained maximization of Ψ (k) (x), namely The KKT conditions for this optimization are From the updating formulae given in Equations ( 74) and (75) we can see that both FS-JOR and FS-SOR involve the gradient ∇Ψ(x) and the Fisher information matrix based operation F (x)δ.The gradient is standard for most reconstruction algorithms, but the computation of F (x)δ requires more careful consideration.It will become clear in Examples 6.1 and 6.2 that for tomographic reconstructions F (x) usually exhibits as A T W (x)A + ∇ 2 J(x), where W (x) = diag(w x (x), . . ., w n (x)).It is not wise to compute A T W (x)A first as this involves multiplications of two huge matrices A and A T .For FS-JOR, a feasible alternative is to use the forward projection to find Aδ first, then to multiply it with the diagonal values of W to get W (x)Aδ, and finally to back-project W (x)Aδ to obtain F (x)δ (ignoring the penalty term).This approach involves only one forward-and one back-projections in every sub-iteration.The situation for FS-SOR is more complicated since δ changes with the pixel index j.The above approach for FS-JOR cannot be used here as otherwise each FS-SOR sub-iteration will demand infeasible p pairs of forward-and back-projections.To confront this problem, let The F δ part of Equation (75) involves A(x so we can start with A(x ) by applying Equation (77).Although here the number of multiplications for Aδ (where vector δ varies with its index j) becomes the same as Ax, it requires column access to the system matrix A, which can be a problem if A is generated on-the-fly.
We next provide examples of applying FS-JOR and FS-SOR to emission and transmission tomography.

Example 6.1 (Emission scans with Poisson noise).
For emission reconstruction with Poisson noise, the log-density of y i is given by Equation (61).Thus for the corresponding object function Ψ(x) of Equation ( 2), its gradients are and its Fisher information matrix elements are where µ i = A i x + r i , j and t = 1, . . ., p. Assuming we run only one sub-iteration for FS-JOR or FS-SOR (i.e., m = 1), the FS-JOR iterative formula is and the FS-SOR formula is }.The formula given in Equation ( 80) is just a gradient algorithm so ω can be replaced by a line search step size ω (k) ∈ (0, 1].Efficient computation of Equation (81) requires column access to matrix A as explicated before.Hudson et al. [48] reported simulation results and a real data application for emission reconstruction.They compared FS-JOR and FS-SOR with EM.The computer time required per iteration for the EM and one-step FS-JOR algorithms were similar.By comparison with the EM algorithm, FS-JOR and FS-SOR accelerated convergence when an appropriate value of ω was used.Particularly, FS-SOR had a superior speed of convergence when ω = 1.

Example 6.2 (Transmission scans with Poisson noise).
For transmission reconstructions with Poisson noise, we can easily work out the gradient and Fisher information matrix from its penalized likelihood function.The gradients are and the Fisher information matrix elements are where η i = A i x, µ i = b i e −η i + r i and j and t = 1, . . ., p.Note that for this example, the Fisher information matrix is non-negative but the negative Hessian matrix may not be non-negative, making the Newton method non-applicable.Corresponding to m = 1, the FS-JOR iterative formula is a ij b i e −η (k) i (−y i + µ and the FS-SOR formula is Then x (k+1) j = max{0, x(k+1) j }.Again, Equation (84) is a gradient algorithm so that a line search can be used, and efficient implementation of Equation (85) demands unpleasant column access to A. This section explains the Fisher scoring based image reconstruction algorithms using JOR or SOR sub-iterations.For these algorithms, any negative estimates in each iteration can be corrected by simply resetting to zero, as this way of resetting enforces the KKT conditions.If only one sub-iteration is used, FS-JOR is equivalent to a gradient algorithm.For efficient implementation of FS-SOR, it requires column retrieval of the system matrix A, which can be infeasible for some reconstruction problems.where µ i (x k j ) = η i (x k j ) + r i and ω (k) j is a line search step size enforcing ψ j (x (k+1) j ) ≥ ψ j (x (k) j ), where equality holds only when the algorithm is converged.This monotonic condition eventually leads to Ψ(x (k+1) ) ≥ Ψ(x (k) ).The update for x j is then x (k+1) j = max{0, x(k+1) j }.
For this example we have −(b i e −A i x (k) (j) + r i ) + y i log(b i e −A i x (k) (j) + r i ) − hJ(x where x (k) (j) is defined in Equation ( 90).The ICA-FS algorithm gives x(k+1 where µ i (x k j ) = b i e −A i x k j + r i , and then x (k+1) j = max{0, x(k+1) j }.

Conclusions
Image reconstruction from projections has wide applications, particularly in medical imaging.Emission and transmission tomography and X-ray CT all fall into this category.Three types of reconstruction methods are available: Fourier methods, algebraic methods and likelihood based reconstruction methods.Our attention in this paper is on the penalized likelihood approaches.
In this paper we present and discuss several important simultaneous MPL reconstruction algorithms, where the non-negativity constraint is enforced.The EM algorithm is limited to maximum likelihood reconstruction problems in emission tomography and is difficult to extend to other imaging modalities and probability models for the likelihood.One variation of EM, called the alternating minimization, is developed for transmission tomography.Another variation of EM, called the OT algorithm, is suitable for any imaging modalities and probability models, but its derivation is often cumbersome as the option for the surrogate function is flexible.The OT algorithm based on the separable parabola surrogate is relatively easy to implement to different tomographic imaging.The MI algorithm, on the other hand, is easy to derive and to implement as its line search step is cheap to compute.Its convergence speed, according to the simulation study, is similar to the separable parabola surrogate algorithm.The FS-JOR and FS-SOR algorithms first apply the Fisher information matrix to obtain a quadratic approximation to the objective function, and then optimize it using JOR or SOR schemes.Implementation of ICA-FS reverses the order of FS and SOR in FS-SOR.For both FS-SOR and ICA-FS, their convergence speeds are usually superior, but their potential problem is that both involves column retrieval of A, which may not be pre-generated and stored.
For some of the algorithms covered in this paper, their corresponding block-iterative algorithms have been developed.Block-iterative algorithms can usually achieve faster convergence than their simultaneous counterpart.However, discussions of the block-iterative algorithms are not included in this paper.