1. Introduction
Image reconstruction in medical imaging, in general, considers estimating pixel intensities or attenuations from measurements obtained from an imaging system. For example, for positron emission tomography (PET), the measurements are obtained according to the procedure summarized below; see [1,2] for more details. A radioactive isotope is introduced into the body of a patient and, as the radioisotope decays, it emits positrons. Each positron travels a small distance in the body (usually less than 1 mm) and then interacts with an electron to produce a pair of gamma photons that travel in almost opposite directions. The scanning device in the imaging system detects each pair of gamma photons with a certain probability, and all such detections form the measurements, which can appear in histogram or list form [3]. It is usually assumed that the detection probabilities are known; they can be pre-computed and stored or computed on-the-fly.
Note that a special feature of the measurements is that they are contaminated by noise, which can be a severe problem, particularly when each measurement is small in value due to dose safety limits. If the noise is not properly addressed, the reconstructed image can be distorted by excessive noise. For example, for low-dose X-ray CT (a type of transmission tomography), the metal streak artifact (e.g., [4]) can be a severe problem for the traditional filtered backprojection method. Statistical iterative reconstruction methods, due to their ability to model the physics and measurements more accurately, are capable of reducing metal streak artifacts [5].
To deal with the noise contamination problem, statistical image reconstruction methods in emission, transmission, X-ray CT, etc., have been developed based on specified probability models for the measurements. For example, for single photon emission computed tomography (SPECT), possible options include: weighted least squares (equivalent to variable variance Gaussian) [6], fixed variance Gaussian [7] and Poisson [8] models. These models can also be used for transmission scans. Since accidental coincidences are the main source of background noise in PET, most PET scans are precorrected for accidental coincidences by real-time subtraction of the coincidences in the delayed window [9]. For randoms-precorrected PET scans, possible measurement models are Gaussian, ordinary Poisson and shifted Poisson [9]; all of these are only approximations, as the true probability density function (pdf) for the measurements is difficult to derive. The shifted Poisson model is also used for X-ray CT measurements [10].
Different algorithms have been proposed to maximize the corresponding objective functions. For example, for emission tomography, the expectation-maximization (EM) algorithm [8] is designed to maximize the log-likelihood formulated from Poisson distributed measurements, and the iterative space reconstruction algorithm (ISRA) [7] maximizes the log-likelihood formulated from Gaussian (with fixed variances) distributed measurements. An attractive aspect of both EM and ISRA is that they are very easy to implement and both respect the non-negativity constraint on the reconstruction. However, if the objective function contains a penalty term, which is normally used to smooth the reconstruction, then both EM and ISRA become impractical, as they involve, in each iteration, a non-linear system of equations that is tedious to solve exactly due to the large number of unknowns. Moreover, the penalty function adds a further inconvenience when a non-negative solution is sought.
To simplify notation, both the measurements and the unknown image are lexicographically ordered into vectors. More specifically, we use y = (y_1, ..., y_n)^T to denote the measurement vector and x = (x_1, ..., x_p)^T to denote the unknown image vector, where superscript T denotes matrix transpose. Note that although the notation is unified for the different reconstruction problems in this paper, the meaning of symbols such as x and y can differ between imaging modalities. Vectors y and x are related through a system matrix A; see Equation (4) below for some examples. For tomographic reconstruction problems, matrix A is usually assumed known, so its estimation is not covered by this paper. Rather, we focus on how to estimate x from the observed y and the known system matrix A. We denote the estimate of x by x̂.
Statistical reconstruction 
 obtained by maximum penalized likelihood (MPL) (also known as maximum a posteriori (MAP)) is defined by
	  
 where 
 is an objective function derived from the probability distribution for measurements and the penalty function. When the 
’s are assumed independent (given 
x), the penalized likelihood objective function is
	  
 where 
 is the log-likelihood function given by
	  
 Here 
 is the smoothing parameter and 
 is the penalty function used to smooth 
. In Equation (
3), 
 denotes the log-density function for measurement 
, and 
 is a function of 
 (here 
 denotes the non-negative orthant of 
) representing the mean measurement of camera bin 
i. Examples of 
 include 
 where 
 with 
 being the 
ith row of matrix 
A, 
 is the known blank scan count of the 
ith detector and 
 the known mean background counts. Another example is polyenergetic transmission scans (such as X-ray CT) where
	  
 and here 
 denotes the attenuation map corresponding to the 
m-th energy spectrum, 
x is a vector formed by the 
’s and 
 is the blank scan count from energy spectrum 
m.
In Equation (
3) the notation 
 is used to emphasize that 
 is a function of 
 and it also involves measurement 
. We can also write this function as 
 or 
 in different contexts when there is no ambiguity. However, the functional properties of 
 may change with respect to its different arguments. For example, if we assume that 
 follows a Poisson distribution for either emission or transmission scans, then 
 This is clearly a concave function of 
 for both emission and transmission cases. However, for 
 (treated as a function of 
x), it may no longer be concave for transmission scans but is still concave for emission scans. Concavity is an important property exploited by the optimization transfer algorithms.
Let 
 be an 
n-vector of all 
. The first term of Equation (
2), 
i.e., 
, measures similarity between 
y and 
. Different probability distributions have been used to model 
 even under the same imaging modality. For example, for emission tomography, if assuming the Poisson model for 
 (
i.e., 
) then 
 is given by Equation (
6), or if considering the weighted least squares then 
 where 
 is the weight. When 
 we have the weighted least squares model as suggested in [
 Another example in emission (or transmission) tomography is the randoms-precorrected PET scan (assuming no scattering, to simplify). In this context, the observed measurements are 
, where 
 and 
 (both unavailable directly) denote the numbers of coincidences in the prompt and delayed windows, respectively. Although we can assume 
 and 
 and that they are independent, the exact distribution of 
 cannot be derived directly (e.g., [
9]). An approximate probability model suggested in [
9] is the shifted Poisson distribution, namely 
, which gives 
 or the weighted least squares given by
	  
 Note that the shifted Poisson approximation matches the first two moments of the true probability model for 
 when both the prompt and delayed measurements are assumed independent and follow Poisson distributions.
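As a concrete illustration of the shifted Poisson approximation just described, the following is a minimal sketch (ours, not taken from [9]) of evaluating the corresponding log-likelihood; the array names y, ybar and r, and the truncation of negative shifted counts at zero, are illustrative assumptions.

```python
import numpy as np

def shifted_poisson_loglik(y, ybar, r):
    """Shifted Poisson log-likelihood for randoms-precorrected measurements.

    y    : precorrected measurements (prompts minus delays), may be negative
    ybar : model means, e.g. ybar = A @ x for emission scans
    r    : known mean accidental-coincidence (randoms) counts
    The model treats y + 2r as Poisson with mean ybar + 2r; negative shifted
    counts are truncated at zero, a common practical convention.
    """
    shifted_y = np.maximum(y + 2.0 * r, 0.0)
    shifted_mean = ybar + 2.0 * r
    return np.sum(shifted_y * np.log(shifted_mean) - shifted_mean)
```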
In this paper, we present and discuss several important non-negatively constrained penalized likelihood reconstruction algorithms. When designing a reconstruction algorithm in tomographic imaging, one considers the following important issues: (i) the algorithm is computationally efficient, ideally involving only forward-projection and back-projection operations; (ii) the algorithm can be easily applied to different measurement probability models and imaging modalities; (iii) the algorithm can impose the non-negativity constraint; and (iv) the algorithm converges quickly. Our discussion of the algorithms in this paper will mainly focus on these points.
In tomographic imaging, it is important to produce smoothed reconstructions, as severe noise in a reconstruction can cause false diagnoses. Smoothing can generally be achieved by one of the following five practices: (i) early termination of the iterations (e.g., [12]); (ii) MPL reconstructions with an appropriate smoothing parameter (e.g., [13]); (iii) functional representation of the unknown image by a set of smooth basis functions (e.g., [14]); (iv) post-smoothing of the reconstruction within each iteration (e.g., [15]) or after all iterations ([16]); and (v) pre-smoothing of the camera data (i.e., the sinogram) followed by filtered backprojection (FBP) (e.g., [17,18]). We focus on the penalized likelihood approach to smoothing in this paper. In Equation (2), the smoothing parameter h balances two conflicting targets: fidelity of the fitted mean measurements to the observed measurements, and smoothness of x. Although an appropriate choice of h is important for achieving a reconstruction with balanced fidelity and smoothness, we will not consider how to estimate h in this paper. A penalty function J
 is used to smooth or regularize the estimate 
. Usually, 
 takes the form of 
 where 
 represents a neighborhood operation (such as the first or second order difference) on pixel 
j, and function 
 measures the magnitude of 
. A common choice of 
ρ is the quadratic function: 
. Generally, a quadratic penalty tends to produce images with over-smoothed edges. Possible edge-preserving penalties include the total variation (TV) (e.g., [19]), Huber [20] and hyperbolic (e.g., [21]) functions. Note that 
 is convex for all these options.
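To make the penalty term concrete, here is a small illustrative sketch (not code from the cited works) comparing a quadratic and a Huber ρ applied to first-order horizontal differences; the threshold delta and the toy image are assumptions made for illustration.

```python
import numpy as np

def quadratic_rho(t):
    return 0.5 * t ** 2

def huber_rho(t, delta=0.1):
    # quadratic near zero, linear in the tails: penalizes large edges less
    a = np.abs(t)
    return np.where(a <= delta, 0.5 * t ** 2, delta * a - 0.5 * delta ** 2)

def roughness_penalty(image, rho=quadratic_rho):
    """J(x) = sum_j rho(d_j x), with d_j the first-order horizontal difference."""
    dx = np.diff(image, axis=1)          # neighbourhood operation on each pixel
    return np.sum(rho(dx))

# usage: a sharp edge is penalised less by Huber than by the quadratic
img = np.zeros((4, 4)); img[:, 2:] = 1.0
print(roughness_penalty(img, quadratic_rho), roughness_penalty(img, huber_rho))
```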
The optimal choices of the penalty function J and the smoothing parameter h are unsolved problems in image processing and will not be further elaborated in this paper. We emphasize that smoothing by MPL indeed produces visually improved reconstructions over the traditional filtered-backprojection method, particularly in dose-limited tomography such as low-dose X-ray CT. Edge-preserving penalties, such as the TV and Huber penalties, are extremely useful; see [22,23,24]. However, MPL reconstructions can have unnatural noise textures, very different from those of the familiar filtered-backprojection method. The impact of these textures on diagnostic tasks is still unknown, and this is an active research area; see [25] for examples and discussions.
We adopt the following notations throughout this paper. Let  be the estimate of x obtained at iteration k of an algorithm. The notation  indicates the derivative of function b with respect to the variable in the brackets. For example,  represents the derivative of b with respect to  and  the derivative of b with respect to x. We use  to denote the derivative of b with respect to , the j-th element of vector x. We also let  and  represent, respectively,  and  evaluated at .
Non-negatively constrained MPL image reconstruction algorithms can be classified into simultaneous and block-iterative (a.k.a. ordered subset (OS)) algorithms. For simultaneous algorithms, all elements in 
y are used to update 
x in each iteration, and for block-iterative algorithms, distinct portions of 
y are used in turn to update 
x. We discuss in this paper some simultaneous algorithms for non-negatively constrained MPL reconstructions, and the block-iterative algorithms are not included in our discussions. The rest of this paper is arranged as follows. The expectation-maximization algorithm for emission tomography is discussed in 
Section 2. 
Section 3 explains the alternating minimization algorithm designed specifically for transmission tomography. 
Section 4 contains explanations on the optimization transfer algorithms and their applications to tomographic reconstructions. The multiplicative iterative (MI) algorithms for tomographic imaging are provided in 
Section 5 and the Fisher scoring based Jacobi or Gauss–Seidel over-relaxation algorithms are presented in 
Section 6. 
Section 7 explains another Gauss–Seidel method named the iterative coordinate ascent algorithm. Finally, 
Section 8 includes discussions and remarks about this paper.
In this paper we focus on explaining and summarizing different non-negatively constrained tomographic imaging algorithms. Numerical comparisons of some of these algorithms are available in [
26], and therefore will not be given in this paper.
  2. EM Algorithm for Maximum Likelihood Reconstruction in Emission Tomography
The expectation-maximization (EM) algorithm [
27] is a statistical algorithm for iteratively computing maximum likelihood estimates when data contain random missing values. Here “random” means these missing values do not provide extra information about the parameters we wish to estimate. We first give a brief summary of the EM algorithm below.
Since the data contain both observed (or incomplete) and missing components, we can define the complete data set as the combination of the incomplete and the missing data. Note, however, that our aim is to estimate the unknown parameters by maximizing the log-likelihood of the incomplete data. The rationale for the EM algorithm is that, if maximizing the incomplete data likelihood is difficult while maximizing the complete data likelihood is easy, then EM can be used to compute the maximum of the incomplete data likelihood iteratively by maximizing (the conditional expectation of) the complete data log-likelihood in each iteration.
Let 
 be the complete data set given by 
, where 
 denotes the incomplete data and 
 the missing data. Let 
 be the log-likelihood based on the complete data 
 and 
 the log-likelihood of the incomplete data 
, where 
x is a 
p-vector for the unknown parameters. Let 
 be the maximum likelihood (ML) estimate of 
x. Then iteration 
 of the EM algorithm comprises two steps:
	  
E-Step: Compute the conditional expectation of the complete data log-likelihood given the incomplete data and 
, and denote this function by
			
 M-Step: Update the 
x estimate by maximizing the 
Q function, namely
		  
 
 One major advantage of EM is that it guarantees, under certain regularity conditions, that the incomplete data log-likelihood 
 increases in consecutive iterations before convergence. Note that EM requires availability of the 
Q function in a closed form; otherwise, a Monte-Carlo E-step can be used to replace the E-step [
28].
The EM algorithm was first applied to emission tomography by Shepp and Vardi [
8] and Lange and Carson [
29]. Both papers adopt the Poisson model for emission counts, namely 
 are independent Poisson random variables with mean 
. This model assumes 
; otherwise, we can redefine 
 as the value after subtracting 
 from the bin 
i measurement. From this Poisson model, we can formulate the complete data as 
, where 
 follows the Poisson distribution with mean 
. Clearly, each 
 represents the unknown portion of measurement on camera bin 
i attributed to image pixel 
j. The corresponding complete data log-likelihood is
	  
 and the corresponding 
Q function is
	 
 where 
. Since the conditional distribution of 
 is 
, we have 
. Thus after solving 
, the M-step of the EM algorithm gives the following updating formula for 
x: 
 for 
. It has been pointed out in [
23,
30] that formula (
15) can also be explained by the Bayes conditional probability formula. This EM algorithm possesses the following properties, which make it attractive for emission tomography: 
If the initial  then  for all ; i.e., it automatically satisfies the non-negativity constraint on x.
The algorithm is easy to implement as it only involves forward- and back-projections.
The updating formula in Equation (
15) increases the incomplete data log-likelihood: 
, where equality holds only when the iteration has converged.
 satisfies , where  is  with . Thus the x estimate at any iteration is such that the total expected and the total observed counts are equal.
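For concreteness, the following is a minimal sketch of one ML-EM iteration as in Equation (15), assuming zero mean background counts (i.e., the subtraction mentioned above has already been applied); the dense matrix A and the variable names are illustrative assumptions.

```python
import numpy as np

def em_update(x, A, y, eps=1e-12):
    """One ML-EM iteration for emission tomography with the Poisson model.

    x : current non-negative image estimate (p,)
    A : system matrix (n, p) with non-negative entries
    y : measured counts (n,)
    """
    ybar = A @ x                              # forward-projection
    ratio = y / np.maximum(ybar, eps)         # element-wise data/mean ratio
    back = A.T @ ratio                        # back-projection
    sens = A.T @ np.ones_like(y, dtype=float) # sensitivity image, sum_i a_ij
    return x * back / np.maximum(sens, eps)   # multiplicative, stays >= 0

# usage: a few iterations from a uniform positive start
# x = np.ones(p)
# for _ in range(20): x = em_update(x, A, y)
```

Starting from a positive image, every iterate remains non-negative, in line with the first property listed above.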
The above EM is easy to implement and possesses some attractive properties. This algorithm, however, is restricted to emission tomography with Poisson distributed measurements and cannot be easily extended to other reconstruction tasks. For example, application of the EM algorithm to transmission tomography does not lead to an exact updating formula because its M-step does not produce a closed-form solution; see [29]. Another limitation is that this EM algorithm can only be used for maximum likelihood reconstructions; its application to MPL reconstruction will not, in general, result in a closed-form updating formula. To rectify this problem, Green [31] developed a one-step-late (OSL) algorithm for MPL reconstruction by replacing x in the derivative of the penalty function with its current estimate, so that an “exact” solution can still be obtained. However, this method suffers from two deficiencies: (i) the algorithm may not converge; and (ii) some estimates may be negative.
De Pierro [
32] reproduced the EM updating formula using a totally different argument. In his derivation, there are no missing data and hence no E-step. Although the algorithm is named “modified EM”, it is not a real EM algorithm. In fact, it belongs to a more general class called the optimization transfer algorithms, since the Poisson log-likelihood optimization problem is transferred to a simpler optimization in each iteration. We will summarize the optimization transfer algorithms in 
Section 4.
  3. Alternating Minimization Algorithms for Transmission Tomography
We have explained in 
Section 2 that the EM algorithm is not directly suitable for transmission scans as its M-step cannot be computed exactly. In this section, we summarize an alternating minimization algorithm designed to solve the transmission tomographic problem, including X-ray CT. This algorithm is a generalization of the EM algorithm [
33] and its application to transmission tomography can be found in [
34].
Following [
34], we explain this algorithm using the polyenergetic transmission tomography example. In this context, if we assume that the transmission scans follow Poisson distributions, the corresponding log-likelihood is
	  
 where 
 is the scan count of detector 
i and 
 (now expressed as a function of vector 
z, which will be defined below) is given by Equation (
5). Moreover, elements of the attenuation map associated with spectrum 
m, namely elements of 
 in Equation (
5), are further modeled by
	  
 where 
j indexes pixels, 
r represents different types of materials, 
 are known linear attenuation coefficients and 
 are the unknown partial densities (e.g., [
34]) we wish to estimate. In Equation (
16), 
z is a vector of size 
 formed by column-wise stacking the vectors 
.
Define set 
 where 
 for 
 and 
 equals the background noise 
 for 
. Clearly, 
 given in Equation (
5) can now be expressed as 
. Define another set
	  
 In [
34], 
 is called the exponential family and 
 the linear family. Let 
p and 
q be the vectors created from 
 and 
 respectively. It can be shown that the problem of maximizing the log-likelihood Equation (
16) can be re-written as
	  
 subject to 
, where 
 is the 
I-divergence [
35] given by
	  
 Thus, maximizing the log-likelihood in Equation (
16) can be achieved iteratively. Assuming the estimates 
, 
 and 
 are obtained at iteration 
k, then iteration 
 contains two steps: 
- (i)
 compute  by minimizing  subject to ;
- (ii)
 compute  by minimizing  subject to .
 Note that the second step is equivalent to minimizing 
 over 
 with 
 being given by the expression in Equation (
19).
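For reference, the I-divergence appearing above is Csiszár's I-divergence; a minimal sketch of evaluating it, with the usual 0 log 0 = 0 convention, is given below (a generic utility written for illustration, not the code of [34]).

```python
import numpy as np

def i_divergence(p, q, eps=1e-12):
    """Csiszar I-divergence I(p || q) = sum p*log(p/q) - p + q, for p, q >= 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    log_term = np.where(p > 0, p * np.log((p + eps) / (q + eps)), 0.0)  # 0*log0 := 0
    return np.sum(log_term - p + q)
```

I(p || q) is non-negative and vanishes only when p = q, which is what makes the alternating projections onto the exponential and linear families meaningful.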
Minimizing 
 over 
 is easily achieved using the Lagrange multiplier, and the result is 
 On the other hand, direct optimization of 
 over 
 is an unmanageable task as the 
’s are mixed (
i.e., not decoupled or separated from each other) within the objective function. One approach to overcoming this problem is to use a decoupled objective function that provides an upper bound on the original objective function. In fact, it can be shown that for 
 given by Equation (
19),
	  
 where 
 and 
 is an estimate of 
 corresponding to the estimate 
 of 
. This inequality is obtained from the fact that 
 is a convex function of 
. Clearly, 
 on the right hand side of Equation (
24) are decoupled and thus their non-negatively constrained optimizations will result in closed-form solutions. When we take 
, the optimal solution to 
 is 
 where 
 and 
. We give some remarks about this algorithm below.
Remarks- (1)
 This algorithm is designed for maximum likelihood estimation. However, it can be easily extended to MPL provided the penalty function is convex and can therefore also be decoupled.
- (2)
 This algorithm is developed for the likelihood function derived from the simple Poisson measurement noise. Note that the alternating minimization algorithm was also developed for a compound Poisson noise model in [
36] and its comparison with the simple Poisson alternating minimization was provided in [
37]. For other measurement distributions, however, the corresponding algorithms have to be completely re-developed.
- (3)
 The convergence properties of the alternating minimization algorithm have been studied in [
34]. Particularly, it is monotonically convergent under certain conditions.
- (4)
 It will become clear in 
Section 5 (Example 5.3) that the multiplicative-iterative algorithm can be derived more easily for this transmission reconstruction problem.
- (5)
 The trick of decoupling the objective function using its convex (or concave) property is also the key technique of the optimization transfer algorithms discussed in 
Section 4.
   4. Optimization Transfer Algorithms
Details of the optimization transfer (OT) algorithm (also called the minorization–maximization (MM) algorithm for maximizations) can be found in, for example, [
38]. In this section we present this algorithm briefly and explain its application in emission and transmission tomography.
The fundamental idea of the OT algorithm is that it employs a surrogate function to minorize (see the definition below) the objective function  in each iteration, and then updates the parameter estimate by maximizing this surrogate function.
More specifically, a function 
 is said to minorize 
 at 
 if it satisfies the following “minorization” conditions: 
- (i)
 , and
- (ii)
  for all x.
 Then at iteration 
, 
x is estimated by maximizing 
, 
i.e.,
	  
 If the exact maximum is not easy to obtain, we can find an 
 by simply increasing 
, as this will also guarantee that the monotonic condition stated below remains for 
.
An attractive property when using this surrogate function is that 
 satisfies the monotonic condition, namely
	  
 where equality holds only when the iteration has converged. This monotonic property can be easily verified by the minorization conditions since 
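The whole OT/MM scheme fits in a few lines of code. The sketch below is generic: the surrogate maximizer is a user-supplied function (an assumption of the illustration), and the assertion simply checks the monotonicity property just stated.

```python
def optimization_transfer(x0, surrogate_argmax, objective, n_iter=100, tol=1e-8):
    """Generic minorize-maximize (OT/MM) loop.

    surrogate_argmax(x_k) must return a maximizer of the surrogate phi(x; x_k)
    that minorizes the objective at x_k; any x that merely increases phi also
    preserves monotonicity.
    """
    x = x0
    f = objective(x)
    for _ in range(n_iter):
        x_new = surrogate_argmax(x)
        f_new = objective(x_new)
        assert f_new >= f - 1e-10, "surrogate step should not decrease the objective"
        if f_new - f < tol:
            return x_new
        x, f = x_new, f_new
    return x
```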
For implementation of the OT algorithm in medical imaging, a surrogate function 
 must be determined. There exist different ways of choosing the surrogate function, such as those listed in [
38]. We mainly consider two approaches in this paper: (i) the method based on the inequality on concave functions (called the concave inequality hereafter); and (ii) the method based on quadratic lower bounds (also known as paraboloidal surrogates [
39]). These ideas are summarized below.
Let 
 be the objective function we wish to maximize, where 
 is the 
i-th row of matrix 
 and 
x is a 
p-vector. For matrix 
A, we assume its elements 
 are non-negative and 
. We also assume that all 
 are concave functions. Let 
 be weights satisfying 
. Then according to the concave inequality we have 
 There are different ways of choosing weights 
. For example, we can use 
, which is also adopted in [
32]. In this case since each 
 is a function of 
x, the surrogate function corresponding to Equation (
28) is
	  
 and it is easy to verify that this surrogate satisfies the minorization conditions. The right hand side of Equation (
29) is a weighted summation of functions 
, each involving a single 
 only (
i.e., decoupled), and therefore maximization with respect to 
x of 
 can be achieved by a sequence of 1-D optimizations. Another trick, due to De Pierro [
32], uses the following concave inequality:
	  
 If the weights 
 do not depend on 
, then Equation (
30) leads to the surrogate function of 
 which clearly also meets the minorization conditions. In Equation (
31), the choice of 
 is again flexible, and one popular option is to use 
.
The above two surrogates are developed based on the concave inequality. Another useful approach is to employ a quadratic lower bound (e.g., [
40]). Assume 
 is twice differentiable with its second derivative denoted by 
. Let 
 be a number such that 
 for all 
, then 
 The right hand side of Equation (
32) is a parabola surrogate of 
 and the condition on 
 guarantees that this function lies below 
. Unlike the previous surrogate functions, this surrogate is not separable in 
x, and therefore its maximization with respect to 
x cannot be reduced to a series of 1-D problems. To overcome this problem we can find another function surrogating the above parabola surrogate but with separable 
x. Towards this, we denote the right hand side quadratic function of Equation (
32) by 
. Since 
 is concave in 
, we can use either Equations (
29) or (
31) to find a surrogate to 
 and the resulting algorithm is called the separable paraboloidal surrogate (SPS) algorithm [
39]. For example, corresponding to Equation (
31), a separable parabola surrogate of 
 is 
 A careful selection of the curvature 
 in Equation (
32) can lead to fast convergence of the SPS algorithm. Erdoǧan and Fessler [
39] derived the optimal curvature for the SPS algorithm in transmission tomography.
Next, we present two examples explaining how to apply the OT algorithm in emission and transmission tomography.
Example 4.1 (OT for emission scans with Poisson noise). 
In this example we explain the application of OT for MPL reconstruction in emission tomography, where measurements are assumed to follow Poisson distributions. De Pierro’s modified EM (MEM) [
32] coincides with the method discussed below when 
. Firstly, under the Poisson model for emission scans, the penalized log-likelihood function is
	  
 where 
ρ is assumed to be a convex function. Let 
 where 
. It is easy to verify that 
 is concave with respect to 
, so we can use Equation (
28) to define its surrogate function. On the other hand, for the penalty function in Equation (
34), 
 is concave, so we can use Equation (
31) to construct its surrogate. Combining them together we have the following surrogate for 
: 
 where 
. Now 
 The equation 
 has a closed-form solution for 
 when 
 and 
 for all 
i. In this context, Equation (
37) reduces to a quadratic function so we wish to solve for 
 from 
 subject to 
, and its analytic solution is readily available. If 
 or 
ρ is not quadratic, the analytic solution to Equation (
37) does not exist. In this case, one can use a 1-D optimization method to solve it or, alternatively, one may use a separable parabola surrogate rather than Equation (
36). An example of the latter is explained in the next example where the reconstruction problem is for transmission tomography.
Example 4.2 (OT for transmission scans with Poisson noise). 
This example considers the application of OT to MPL reconstruction in transmission tomography. Our explanations follow [
39] closely. For transmission scans with Poisson noise, the penalized log-likelihood is given by
	  
 where 
ρ is convex. Let 
 and 
 Since 
 is concave with respect to 
, a separable parabola surrogate can be defined according to Equation (
33). For the first term of Equation (
39) (
i.e., the log-likelihood part), a separable parabola is given by
	  
 where 
 and here 
 satisfies 
 for all 
. For the second term of Equation (
39) (
i.e., the penalty part), let 
 and let the weights 
. Its separable parabola surrogate is
	  
 where 
 Here 
 is chosen such that 
 for all 
 in its range; this curvature 
 ensures that 
 lies above 
. Aggregating Equations (
41) and (
43) we obtain a separable parabola surrogate for 
: 
 We have
	  
 and for this example
	  
 Let 
 and 
. The solution of 
, subject to 
, is given by 
 where 
 This is in fact a special gradient algorithm with a diagonal preconditioning matrix.
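To make this concrete, the following sketch implements a diagonally preconditioned, non-negatively constrained gradient step of the SPS type just described; the gradient and curvature callables, and the way the per-measurement curvatures are combined into the separable curvatures d_j, follow the standard separable-parabola construction and are assumptions of the illustration (the penalty curvature of Equation (43) is omitted for brevity).

```python
import numpy as np

def sps_update(x, A, gradient, curvatures, eps=1e-12):
    """One SPS-style update: a gradient step scaled by separable curvatures.

    x          : current image estimate (p,)
    A          : system matrix (n, p) with non-negative entries
    gradient   : callable returning the gradient of the penalized likelihood at x
    curvatures : callable returning per-measurement curvatures c_i >= 0 at x
    """
    c = curvatures(x)                     # parabola curvature for each measurement
    row_sums = A @ np.ones(A.shape[1])    # sum_l a_il
    d = A.T @ (c * row_sums)              # separable curvatures d_j = sum_i a_ij (sum_l a_il) c_i
    step = gradient(x) / np.maximum(d, eps)
    return np.maximum(x + step, 0.0)      # 1-D maximizations with non-negativity
```

The penalty curvature can be folded into d in the same separable fashion; it is left out here only to keep the sketch short.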
  5. Multiplicative Iterative Algorithms
The OT algorithms presented in the last section have the following important strengths: (1) they transform a high-dimensional optimization problem into a series of 1-D optimizations; (2) owing to the 1-D optimizations, the non-negativity constraints can be easily enforced by simply resetting negative estimates to zero in each iteration; (3) the surrogate given by the separable parabola approach is general enough to be applicable to different tomographic reconstructions. A limitation of OT is that it requires both the log-density and the negative penalty to be concave functions.
In this section we discuss a competitive alternative to the OT method called the multiplicative iterative (MI) algorithm; its application to tomographic imaging can be found in [
26] and to box-constrained image processing in [
41].
The main motivation of the MI algorithm is that it can be easily derived under different imaging modalities and different measurement noise models. Moreover, for some difficult penalties, such as TV, or even non-convex penalties [
42], MI can be easily implemented to solve the corresponding optimization problems.
A general MI updating formula can be developed that is suitable for all tomographic reconstruction problems, regardless of the mean function model, measurement probability distribution and penalty function. The simulation study reported in [
26] reveals that MI has competitive convergence speed when compared with OT and other reconstruction algorithms. The MI algorithm does not require concavity of the functions 
 and 
 and therefore is more general than the OT algorithm. It requires existence of the first derivatives of 
 and 
. It is possible that the objective function 
 in Equation (
2) has multiple local maxima. In this case, MI finds one of the local non-negative maxima, depending on the starting value of the algorithm.
Here is some notation needed to explain the MI algorithm. For a function , let  be the positive component of  and  the negative component so that  For a number b, let  and  so that . Thus, for the numerical value of function  at point , we can also write .
We develop the MI algorithm from the Karush–Kuhn–Tucker (KKT) necessary conditions for the non-negatively constrained optimization of 
. They are: 
  for 
. Therefore, we aim to solve for 
x from 
 Note that the expression inside the brackets of Equation (
51) represents 
, and 
 is included in Equation (
51) to reflect the conditions in Equations (
49) and (50).
The key step in developing the MI algorithm is to rearrange Equation (
51) such that its positive and negative terms appear on different sides of the Equation (
51). Hence we rewrite Equation (
51) as
	  
 This equation naturally suggests the following fixed point algorithm to update 
x:
	  
 where 
 and 
 denote respectively the right and left hand side of Equation (
52), namely, 
 and 
 and 
ϵ is a small positive constant, such as 
, used to avoid a zero denominator in Equation (
53). Note that the 
ϵ value does not affect where the algorithm converges to. As both numerator and denominator of Equation (
53) are positive, 
 whenever 
.
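In code, the interim MI update of Equation (53) is a one-line multiplicative step; the sketch below assumes the positive and negative parts of the rearranged gradient are supplied as separate non-negative arrays, which is the only information the update needs.

```python
import numpy as np

def mi_interim_update(x, pos_part, neg_part, eps=1e-10):
    """Interim multiplicative-iterative step of Equation (53).

    pos_part, neg_part : non-negative arrays whose difference equals the
    gradient of the objective at x, split as in Equation (52);
    eps avoids a zero denominator.
    """
    return x * (pos_part + eps) / (neg_part + eps)   # non-negative whenever x is
```

A line search along the direction from x to this interim estimate, as in Equation (57), then restores monotonicity.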
In Equation (
53) the updated 
 is denoted by 
 indicating this is not the final estimate for iteration 
. In fact, this update does not ensure a monotonic increase in 
 and a line search step must be included to rectify this problem. We first express Equation (
53) as a gradient algorithm:
	  
 where 
 with 
. Note that 
 when 
. When 
 we set 
 only if 
 (since 
 satisfies the KKT condition in this case); otherwise, we set 
, where 
 is another small constant such as 
. Equation (
56) explains that 
 emanates from 
 in the gradient direction of Ψ with a non-negative step size 
. For the line search step, the search direction is 
 with 
 denoting the line search step size. Since 
 guarantees 
, we only search in the fixed range of 
. After including a line search step 
 is obtained according to
	  
 Due to the fixed search interval, this line search is remarkably simple. One simple and efficient search strategy is provided by Armijo's rule (e.g., [43]). The Armijo line search is a finitely terminating procedure. Briefly, it starts with 
, and for each 
α it checks if the following Armijo condition is satisfied:
	  
 where 
 is a fixed parameter such as 
. If Equation (
58) is true then stop; otherwise, reset 
 (such as 
) and reevaluate the Armijo condition (
58). Note that the repeated evaluations of 
 can be made with 
 being computed only once. Therefore, the line search step does not add extra major computations to the MI algorithm.
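A minimal sketch of this Armijo backtracking step over the fixed interval [0, 1] is given below; the parameter values (ξ = 0.3, halving of α) are illustrative choices rather than prescriptions from the paper.

```python
import numpy as np

def armijo_step(psi, x_k, x_interim, grad_k, xi=0.3, shrink=0.5, max_backtracks=30):
    """Backtracking line search along d = x_interim - x_k, starting from alpha = 1.

    psi    : objective function being maximized
    grad_k : gradient of psi at x_k (computed once and reused for every trial alpha)
    Returns x_k + alpha * d for the first alpha satisfying the Armijo condition.
    """
    d = x_interim - x_k
    psi_k = psi(x_k)
    slope = float(grad_k @ d)        # directional derivative at x_k (non-negative for MI)
    alpha = 1.0
    for _ in range(max_backtracks):
        if psi(x_k + alpha * d) >= psi_k + xi * alpha * slope:
            break
        alpha *= shrink
    return x_k + alpha * d
```

Only psi is re-evaluated inside the loop, so the extra cost per iteration is small, as noted above.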
Convergence properties of the MI algorithm are given in [
26,
41]. Briefly, under certain regularity conditions, MI converges monotonically to a local maximum satisfying the KKT conditions.
For the mean functions given in Equation (
4), we have 
 for emission and 
 for transmission tomography; the corresponding updating formula (
53) becomes: 
 for emission tomography, and
	  
 for transmission tomography. The derivative 
 in the above formulae depends on the log-density 
. Some examples are presented below.
Example 5.1 (MI for emission scans with Poisson noise). 
For emission tomography with Poisson noise, we have the log-density function for 
: 
 where 
. Thus 
, which gives 
 and 
. The updating formula (
59) becomes, for 
, 
 Note that when 
 (
i.e., maximum likelihood reconstruction), 
 and 
, this algorithm coincides with the EM algorithm for emission tomography. After line search, the estimate of 
x at iteration 
 is given by Equation (
57). In this algorithm, there is only one back-projection (for the numerator of Equation (
62)) and one forward-projection in each iteration; its computational burden is the same as that of EM.
Example 5.2 (MI for randoms-precorrected PET emission scans). 
Some PET scans produce measurements that have already been corrected for randoms [
44] and their measurements no longer follow Poisson distributions. In this example, we consider the weighted least squares model, which is also used in [11] but in a different context; i.e., we reconstruct from randoms-precorrected measurements by maximizing the objective function in Equation (2), where
	  
 Here 
 is used to denote 
, and for this 
 formula (
59) still applies. Now since
	  
 we have 
 and 
. The MI algorithm updates 
x first according to 
 and then, after the line search step, computes 
 according to Equation (
57).
Example 5.3 (MI for polyenergetic transmission scans with Poisson noise). 
Application of the MI algorithm to polyenergetic X-ray CT is again extremely easy. Under the assumption of Poisson noise, the log-density for measurement 
 is identical to Equation (
61) but now with 
; see Equation (
17). In Example 5.1 we have already derived 
 and 
 for the Poisson noise log-density. On the other hand, the derivative of 
 with respect to 
 (denoted by 
) is 
 Thus, the updating formula for polyenergetic transmission is 
 for 
 and 
. After the line search step specified in Equation (
57), 
 is obtained. This iterative formula involves one forward- and two back-projections in each iteration, and therefore it demands a similar amount of computation to the alternating minimization algorithm in [
34]. When 
, 
 and 
, this MI algorithm is identical to the algorithm given in [
45] for maximum likelihood reconstruction in transmission tomography. Note that unlike the optimization transfer and alternating minimization algorithms, the MI algorithm can be easily derived for other objective functions, such as the weighted least-squares function.
The above examples demonstrate that the MI algorithms are easy to derive and implement in tomographic imaging. The line search step they require does not incur a significant computational burden.
  6. Modified Fisher’s Method of Scoring Using Jacobi or Gauss–Seidel Over-Relaxations
In this section we elaborate on another non-negatively constrained method for tomographic imaging, which is a modification to the standard Fisher’s method of scoring (FS) algorithm. This method is developed based on the following steps. Firstly, the objective function 
 is approximated by a quadratic function in each iteration, where the Fisher information matrix (e.g., [
46]) is used to define the quadratic term; secondly, an over-relaxation method, either the Jacobi over-relaxation (JOR) or the Gauss–Seidel over-relaxation (also called the successive over-relaxation (SOR)), is employed to solve approximately the linear system derived from zeroing the derivative of this quadratic function. The resulting algorithms are called FS-JOR and FS-SOR and their detailed descriptions can be found in [
47,
48]. Descriptions of the JOR and SOR methods are available, for example, in [
49].
FS is a general optimization algorithm for computing maximum likelihood estimates. Its advantages over the traditional Newton’s method have been documented in [
50]. Briefly, FS iterations are well defined due to the non-negative definiteness of the Fisher information matrix, whereas for Newton's method the negative Hessian matrix may not even be non-negative definite, so the iteration does not necessarily proceed in an uphill direction in some applications. Transmission tomography is an example where this problem with Newton's method indeed occurs; see Example 6.2.
We assume the objective function 
 in Equation (
2) is twice differentiable and let 
 be the Fisher information matrix, namely 
. At iteration 
 of the Fisher scoring algorithm, 
 is approximated by the following quadratic function: 
 where 
 denotes the Fisher information matrix at 
. Then the 
x estimate is updated by constrained maximization of 
, namely 
 The KKT conditions for this optimization are
	  
 where 
 Here 
 denotes the 
j-th row of matrix 
. The JOR and SOR methods solve, for 
, 
 in different manners: JOR solves it by fixing all the 
x elements, except 
, at their estimates from the last iteration (
i.e., iteration 
k), but SOR solves it by fixing all the 
x elements, except 
, at their most current estimates.
The above illustrations describe how to incorporate JOR or SOR sub-iterations into the FS algorithm. In fact, in each iteration, JOR or SOR is used to solve approximately the linear system of equations determined by the FS algorithm, and then this approximate solution is used as the starting value for the next FS iteration. These new schemes modify the standard FS method, and are feasible for large estimation problems.
Usually it suffices to run one JOR or SOR sub-iteration, but running more than one sub-iteration is also attractive, as it has the potential to reduce the computations for the entire optimization process. Suppose within each Fisher scoring iteration we run 
m sub-iterations of JOR or SOR. The resulting algorithms are called the 
m-step FS-JOR and 
m-step FS-SOR algorithms respectively. Let 
r be the sub-iteration index for the over-relaxation method and 
 the estimate of 
x at the 
r-th over-relaxation sub-iteration of the 
k-th FS iteration. Let 
 be the 
-th element of 
. Assume 
 for all 
j. At iteration 
, first set 
. If using JOR to solve Equation (
73) we have
	  
 and if using SOR to solve we then have
	  
 where 
 and 
 is the relaxation parameter. If any 
 then it is reset to zero. This resetting is correct since the only possibility for 
 is that the expressions in the round brackets of Equations (
74) and (
75) are negative since 
 and 
. Hence resetting 
 to zero ensures that the FS-JOR and FS-SOR algorithms, when they converge, converge to a solution satisfying the KKT conditions. At the end of the sub-iterations, set 
. Note that when 
, the last term in the round brackets of either Equation (
74) or (
75) becomes zero. Thus 1-step FS-JOR is basically a gradient algorithm and we can therefore replace 
ω by a line search step size 
, where the search range is fixed at 
 as this range will keep the estimate non-negative.
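Since the one-step FS-JOR update reduces to a diagonally scaled gradient step, it can be sketched as follows; the Poisson-emission gradient and Fisher diagonal used here, and the omission of the penalty term, are assumptions made for illustration.

```python
import numpy as np

def fs_jor_one_step(x, A, y, omega=1.0, r=0.0, eps=1e-12):
    """One-step FS-JOR update for Poisson emission data (penalty omitted).

    Reduces to x_j <- max(0, x_j + omega * grad_j / F_jj): a gradient step
    preconditioned by the diagonal of the Fisher information matrix.
    """
    ybar = A @ x + r                                    # forward-projection
    grad = A.T @ (y / np.maximum(ybar, eps) - 1.0)      # gradient of the log-likelihood
    fisher_diag = (A ** 2).T @ (1.0 / np.maximum(ybar, eps))   # F_jj = sum_i a_ij^2 / ybar_i
    x_new = x + omega * grad / np.maximum(fisher_diag, eps)
    return np.maximum(x_new, 0.0)                       # reset negatives to zero
```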
The relaxation parameter 
ω is used to achieve convergence of the FS-JOR and FS-SOR algorithms. Results contained in [
47] give convergence properties when 
 and when the non-negativity constraint is ignored. In fact in this context FS-SOR converges if 
 and FS-JOR converges if 
, where 
 is the maximum eigenvalue of 
. Here 
 is the MPL solution.
From the updating formulae given in Equations (
74) and (
75) we can see that both FS-JOR and FS-SOR involve the gradient 
 and the Fisher information matrix based operation 
. The gradient is standard for most reconstruction algorithms, but the computation of 
 requires more careful consideration. It will become clear in Examples 6.1 and 6.2 that for tomographic reconstructions 
 usually exhibits as 
, where 
. It is not wise to compute 
 first as this involves multiplications of two huge matrices 
A and 
. For FS-JOR, a feasible alternative is to use the forward projection to find 
 first, then to multiply it with the diagonal values of 
W to get 
, and finally to back-project 
 to obtain 
 (ignoring the penalty term). This approach involves only one forward- and one back-projection in every sub-iteration. The situation for FS-SOR is more complicated since 
 changes with the pixel index 
j. The above approach for FS-JOR cannot be used here, as otherwise each FS-SOR sub-iteration would demand an infeasible 
p pairs of forward- and back-projections. To confront this problem, let
	  
 The 
 part of Equation (
75) involves 
. Note that 
 so we can start with 
 and obtain 
 by applying Equation (
77). Although here the number of multiplications for 
 (where vector 
 varies with its index 
j) becomes the same as 
, it requires column access to the system matrix 
A, which can be a problem if 
A is generated on-the-fly.
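The FS-JOR computation just described can be sketched in one line: the product (A^T W A) x is obtained by a forward-projection, an element-wise weighting and a back-projection, without ever forming the p-by-p matrix; the weight vector w below stands for the diagonal of W and is an assumption of the illustration.

```python
import numpy as np

def fisher_times_image(A, w, x):
    """Compute (A^T diag(w) A) @ x without forming the p-by-p matrix.

    One forward-projection (A @ x), an element-wise weighting, and one
    back-projection (A.T @ ...) per call.
    """
    return A.T @ (w * (A @ x))
```

For FS-SOR, the corresponding product must track the most recent updates of individual pixels, which is why column access to A is needed.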
We next provide examples of applying FS-JOR and FS-SOR to emission and transmission tomography.
Example 6.1 (Emission scans with Poisson noise). 
For emission reconstruction with Poisson noise, the log-density of 
 is given by Equation (
61). Thus, for the corresponding objective function 
 of Equation (
2), its gradients are
	  
 and its Fisher information matrix elements are
	  
 where 
, 
. Assuming we run only one sub-iteration for FS-JOR or FS-SOR (
i.e., 
), the FS-JOR iterative formula is
	  
 and the FS-SOR formula is 
 Then 
. The formula given in Equation (
80) is just a gradient algorithm so 
ω can be replaced by a line search step size 
. Efficient computation of Equation (
81) requires column access to matrix 
A, as explained before. Hudson 
et al. [
48] reported simulation results and a real data application for emission reconstruction. They compared FS-JOR and FS-SOR with EM. The computer time required per iteration for the EM and one-step FS-JOR algorithms was similar. Compared with the EM algorithm, FS-JOR and FS-SOR accelerated convergence when an appropriate value of 
ω was used. Particularly, FS-SOR had a superior speed of convergence when 
.
Example 6.2 (Transmission scans with Poisson noise). 
For transmission reconstructions with Poisson noise, we can easily work out the gradient and Fisher information matrix from its penalized likelihood function. The gradients are 
 and the Fisher information matrix elements are
	  
 where 
, 
 and 
. Note that for this example, the Fisher information matrix is non-negative definite but the negative Hessian matrix may not be, making Newton's method inapplicable. Corresponding to 
, the FS-JOR iterative formula is 
 and the FS-SOR formula is 
 Then 
. Again, Equation (
84) is a gradient algorithm so that a line search can be used, and efficient implementation of Equation (
85) demands unpleasant column access to 
A.
This section explains the Fisher scoring based image reconstruction algorithms using JOR or SOR sub-iterations. For these algorithms, any negative estimate in an iteration can be corrected by simply resetting it to zero, as this resetting enforces the KKT conditions. If only one sub-iteration is used, FS-JOR is equivalent to a gradient algorithm. Efficient implementation of FS-SOR requires column retrieval of the system matrix A, which can be infeasible for some reconstruction problems.
  7. Iterative Coordinate Ascent Algorithms
Another method using SOR is the method of iterative coordinate ascent (ICA) (or iterative coordinate descent (ICD) for minimization problems). ICA was first applied to tomographic imaging in [
51,
52]. The basic idea of ICA is to apply SOR directly to the objective function 
, resulting in a sequence of 1-D functions where each 
 is associated with one of these 1-D functions. Each function is then maximized, exactly or approximately, to update the corresponding 
. More specifically, using the SOR principle we can define a function for 
 according to 
 This is a function of 
 only and we can update the 
 estimate by 
 Since this is a 1-D function, the constraint 
 can be easily enforced using, for example, the resetting to zero approach.
One computational issue with ICA when applied to tomographic imaging is that it requires repeated calculations of 
 for all 
i when updating 
. This problem can be rectified by the following approach. Let 
 Consider the evaluation of 
. Assuming the update of 
 is given by 
 then 
 and therefore 
 This relationship explains that 
 can be cheaply computed using the 
 value before the 
 update plus a correction term. However, similar to FS-SOR, it necessitates column access to 
A. This can be a potential issue if 
A is generated on-the-fly.
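The cheap correction just described amounts to maintaining the forward-projection alongside the coordinate sweep; a minimal sketch (dense A for simplicity, with the 1-D pixel update passed in as a user-supplied function, an assumption of the illustration) is:

```python
import numpy as np

def ica_sweep(x, A, Ax, coordinate_step):
    """One ICA sweep over all pixels, keeping s = A @ x up to date.

    coordinate_step(j, x_j, s) returns the new (non-negative) value of pixel j,
    e.g. from a 1-D Newton or Fisher scoring step on the function in the example
    below; x must be a float array and is updated in place.
    """
    s = Ax.copy()
    for j in range(x.size):
        x_old = x[j]
        x[j] = max(coordinate_step(j, x_old, s), 0.0)
        s += (x[j] - x_old) * A[:, j]     # correction term: column access to A
    return x, s
```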
Next we use again the emission and transmission examples to elaborate the ICA algorithm.
Example 7.1 (Emission scans with Poisson noise). 
Firstly, we define 
 From the penalized log-likelihood function of emission measurements 
 (see, for example, Equation (
34)), function 
 is given by 
 Since this is a non-quadratic function of 
, exact maximization is infeasible. We can optimize it approximately by running a single step or multiple steps of, for example, the Newton or Fisher scoring algorithm. In this example we consider using the Fisher scoring algorithm to optimize 
 and call the resulting algorithm ICA-FS. After a single step of Fisher scoring we have 
 where 
 and 
 is a line search step size enforcing 
, where equality holds only when the algorithm has converged. This monotonic condition eventually leads to 
. The update for 
 is then 
.
Example 7.2 (Transmission scans with Poisson noise). 
For this example we have 
 where 
 is defined in Equation (
90). The ICA-FS algorithm gives 
 where 
, and then 
.