1. Introduction
We begin with some terminology used. Segments are contiguous clusters. In an imaging context, this means that clusters contain adjacent or contiguous pixels. For typical 2D (two-dimensional) images, we may also consider the 1D (one-dimensional) marginal which provides an empirical estimate of the pixel (probability) density function or PDF. For 3D (three-dimensional) images, we can consider 2D marginals, based on the voxels that constitute the 3D image volume, or also a 1D overall marginal. An image is representative of a signal. More loosely a signal is just data, mostly here with neccessary sequence or adjacency relationships. Often we will use interchangeably the terms image, image volume if relevant, signal and data.
The word “model” is used, in general, in many senses–statistical [
1], mathematical, physical models; mixture model; linear model; noise model; neural network model; sparse decomposition model; even, in different senses, data model. In practice, firstly and foremostly for algorithmic tractability, models of whatever persuasion tend to be homogeneous. In this article we wish to broaden the homogeneous mixture model framework in order to accommodate heterogeneity at least as relates to resolution scale. Our motivation is to have a rigorous model-based approach to data clustering or segmentation, that also and in addition encompasses resolution scale.
  
    
  
  
    Figure 1.
      Clusters of all morphologies are sought. Figure courtesy of George Djorgovski, Caltech.
  
 
   Figure 1.
      Clusters of all morphologies are sought. Figure courtesy of George Djorgovski, Caltech.
  
 In 
Figure 1 [
2], the clustering task is portrayed in its full generality. One way to address it is to build up parametrized clusters, for example using a Gaussian mixture model (GMM), so that the cluster “pods” are approximated by the mixture made up of the cluster component “peas” (a viewpoint expressed by A.E. Raftery, quoted in [
3]).
A step beyond a pure “peas” in a “pod” approach to clustering is a hierarchical approach. Application specialists often consider hierarchical algorithms as more versatile than their partitional counterparts (for example, k-means or Gaussian mixture models) since the latter tend to work well only on data sets having isotropic clusters [
4]. So in [
5], we segmented astronomical images of different observing filters, that had first been matched such that they related to exactly the same fields of view and pixel resolution. For the segmentation we used a Markov random field and Gaussian mixture model; followed by a within-segment GMM clustering on the marginals. Further within-cluster marginal clustering could be continued if desired. For model fit, we used approximations of the Bayes factor: the pseudo-likelihood information criterion to start with, and for the marginal GMM work, a Bayesian information criterion. This hierarchical- or tree-based approach is rigorous and we do not need to go beyond the Bayes factor model evaluation framework. The choice of segmentation methods used was due to the desire to use contiguity or adjacency information whenever possible, and when not possible to fall back on use of the marginal. This mixture of segmentation models is a first example of what we want to appraise in this work.
What now if we cannot (or cannot conveniently) match the images beforehand? In that case, segments or clusters in one image will not necessarily correspond to corresponding pixels in another image. That is a second example of where we want to evaluate different families of models.
A third example of what we want to cater for in this work is the use of wavelet transforms to substitute for spatial modeling (e.g., Markov random field modeling). In this work one point of departure is a Gaussian mixture model (GMM) with model selection using the Bayes information criterion (BIC) approximation to the Bayes factor. We extend this to a new hierarchical context. We use GMMs on resolution scales of a wavelet transform. The latter is used to provide resolution scale. Between resolution scales we do not seek a strict subset or embedding relationship over fitted Gaussians, but instead accept a lattice relation. We focus in particular on the global quality of fit of this wavelet-transform based Gaussian modeling. We show that a suitable criterion of goodness of fit for cross-model family evaluations is given by Rényi quadratic entropy.
  1.1. Outline of the Article
In 
Section 2 we review briefly how modeling, with Gaussian mixture modeling in mind, is mapped into information.
In 
Section 3 we motivate Gaussian mixture modeling as a general clustering approach.
In 
Section 4 we introduce entropy and focus on the additivity property. This property is importatant to us in the following context. Since hierarchical cluster modeling, not well addressed or supported by Gaussian mixture modeling, is of practical importance we will seek to operationalize a wavelet transform approach to segmentation. The use of entropy in this context is discussed in 
Section 5.
The fundamental role of Shannon entropy together with some other definitions of entropy in signal and noise modeling is reviewed in 
Section 6. Signal and noise modeling are potentially usable for image segmentation.
For the role of entropy in image segmentation, 
section 2 presented the state of the art relative to Gaussian mixture modeling; and 
Section 6 presented the state of the art relative to (segmentation-relevant) filtering.
What if we have segmentations obtained through different modelings? 
Section 7 addresses this through the use of Rényi quadratic entropy. Finally, 
Section 8 presents a case study.
  2. Minimum Description Length and Bayes Information Criterion
For what we may term a homogenous modeling framework, the minimum description length, MDL, associated with Shannon entropy [
6], will serve us well. However as we will now describe it does not cater for hierarchically embedded segments or clusters. An example of where hierarchical embedding, or nested clusters, come into play can be found in [
5].
Following Hansen and Yu [
7], we consider a model class, Θ, and an instantiation of this involving parameters 
θ to be estimated, yielding 
. We have 
 so the parameter space is 
k-dimensional. Our observation vectors, of dimension 
m, and of cardinality 
n, are defined as: 
. A model, 
M, is defined as 
. The maximum likelihood estimator (MLE) of 
θ is 
: 
.
Using Shannon information, the description length of X based on a set of parameter values θ is: . We need to transmit parameters also (as, for example, in vector quantization). So overall code length is: . If the number of parameters is always the same, then  can be constant. Minimizing  over θ is the same as maximizing , so if  is constant, then MDL (minimum description length) is identical to maximum likelihood, ML.
The MDL information content of the ML, or equally Bayesian maximum a posteriori (MAP) estimate, is the code length of 
. First, we need to encode the 
k coordinates of 
, where 
k is the (Euclidean) dimension of the parameter space. Using the uniform encoder for each dimension, the precision of coding is then 
 implying that the magnitude of the estimation error is 
. So the price to be paid for communicating 
 is 
 nats [
7]. Going beyond the uniform coder is also possible with the same outcome.
In summary, MDL with simple suppositions here (in other circumstances we could require more than two stages, and consider other coders) is the sum of code lengths for (i) encoding data using a given model; and (ii) transmitting the choice of model. The outcome is minimal .
In the Bayesian approach we assign a prior to each model class, and then we use the overall posterior to select the best model. Schwarz’s Bayesian Information Criterion (BIC), which approximates the Bayes factor of posterior ratios, takes the form of the same penalized likelihood, 
, where 
 ML or MAP estimate of 
θ. See [
8] for case studies using BIC.
  3. Segmentation of Arbitrary Signal through a Gaussian Mixture Model
Notwithstanding the fact that often signal is not Gaussian, cf. the illustration of 
Figure 1, we can fit observational data–density 
f with support in 
m-dimensional real space, 
 – by Gaussians. Consider the case of heavy tailed distributions.
Heavy tailed probability distributions, examples of which include long memory or 
 processes (appropriate for financial time series, telecommunications traffic flows, etc.) can be modeled as a generalized Gaussian distribution (GGD, also known as power exponential, 
α-Gaussian distribution, or generalized Laplacian distribution):
      where
	  
- –
- scale parameter, α, represents the standard deviation, 
- –
- the gamma function, , and 
- –
- shape parameter, β, is the rate of exponential decay, . 
A value of  gives us a Gaussian distribution. A value of  gives a double exponential or Laplace distribution. For , the distribution is heavy tailed. For , the distribution is light tailed.
Heavy tailed noise can be modeled by a Gaussian mixture model with enough terms [
9]. Similarly, in speech and audio processing, low-probability and large-valued noise events can be modeled as Gaussian components in the tail of the distribution. A fit of this fat tail distribution by a Gaussian mixture model is commonly carried out [
10]. As in Wang and Zhao [
10], one can allow Gaussian component PDFs to recombine to provide the clusters which are sought. These authors also found that using priors with heavy tails, rather than using standard Gaussian priors, gave more robust results. But the benefit appears to be very small.
Gaussian mixture modeling of heavy tailed noise distributions, e.g. genuine signal and flicker or pink noise constituting a heavy tail in the density, is therefore feasible. A solution is provided by a weighted sum of Gaussian densities often with decreasing weights corresponding to increasing variances. Mixing proportions for small (tight) variance components are large (e.g., 0.15 to 0.3) whereas very large variance components have small mixing proportions.
Figure 2 and 
Figure 3 illustrate long-tailed behavior and show how marginal density Gaussian model fitting works in practice. The ordinates give frequencies. See further discussion in [
11,
12].
   4. Additive Entropy
Background on entropy can be found e.g., in [
13]. Following Hartley’s 1928 treatment of equiprobable events, Shannon in 1948 developed his theory around expectation. In 1960 Rényi developed a recursive rather than linear estimation. Various other forms of entropy are discussed in [
14].
Consider density 
f with support in 
. Then:
	  
- Shannon entropy:  
- Rényi entropy:  for . 
We have: 
. So 
. We also have: 
 for 
 and 
 (see e.g., [
15], section 3.3). When 
, 
 is quadratic entropy.
  
    
  
  
    Figure 2.
      Upper left: long-tailed histogram of marginal density of product of wavelet scales 4 and 5 of a  Lena image. Upper right, lower left, and lower right: histograms of classes 1, 2 and 3. These panels exemplify a nested model.
  
 
   Figure 2.
      Upper left: long-tailed histogram of marginal density of product of wavelet scales 4 and 5 of a  Lena image. Upper right, lower left, and lower right: histograms of classes 1, 2 and 3. These panels exemplify a nested model.
  
 Both Shannon and Rényi quadratic entropy are additive, a property which will be availed of by us below for example when we we define entropy for a linear transform, i.e., an additive, invertible decomposition.
To show this, let us consider a system decomposed into independent events, A, B.
So  (alternatively written:  or ) . Shannon information is then , so the information of independent events is additive. Multiplying across by , and taking  when considering only event A and similarly for B, leads to additivity of Shannon entropy for independent events, .
Similarly for Rényi quadratic entropy we use  and we have: .
  5. The Entropy of a Wavelet Transformed Signal
The wavelet transform is a resolution-based decomposition–hence with an in-built spatial model: see e.g., [
16,
17].
A redundant wavelet transform is most appropriate, even if decimated alternatives can be considered straightforwardly too. This is because segmentation, taking information into account at all available resolution scales, simply needs all available information. A non-redundant (decimated, e.g., pyramidal) wavelet transform is most appropriate for compression objectives, but it can destroy through aliasing potentially important faint features.
  
    
  
  
    Figure 3.
      Overplotting of the histograms presented in 
Figure 2. This shows how the classes reconstitute the original data. The histogram of the latter is the upper left one in 
Figure 2.
  
 
 
   Figure 3.
      Overplotting of the histograms presented in 
Figure 2. This shows how the classes reconstitute the original data. The histogram of the latter is the upper left one in 
Figure 2.
 
  
 If 
f is the original signal, or images, then the following family of redundant wavelet transforms includes various discrete transforms such as the isotropic, B
 spline, à trous transform, called the starlet wavelet transform in [
17].
      
      where: 
 is the smooth continuum, not therefore wavelet coefficients; 
 are wavelet coefficients at scale 
s. Dimensions of 
 are all identical.
Nothing prevents us having a redundant Haar or, mutatis mutandis, redundant biorthogonal 9/7 wavelet transform (used in the JPEG-2000 compression standard). As mentined above, our choice of starlet transform is due to no damage being done, through decimation, to faint features in the image. As a matched filter the starlet wavelet function is appropriate for many types of biological, astronomical and other images [
17].
Define the entropy, 
H, of the wavelet transformed signal as the sum of the entropies 
 at the wavelet resolution levels, 
s:
Shannon and quadratic Rényi entropies are additive, as noted in 
section 4. For additivity, independence of the summed components is required. A redundant transform does not guarantee independence of resolution scales, 
. However in practice we usually have approximate independence. Our argument in favor of bypassing indepence of resolution scales is based on the practical and interpretation-related benefits of doing so.
Next we will review the Shannon entropy used in this context. Then we will introduce a new application of the Rényi quadratic entropy, again in this wavelet transform context.
  6. Entropy Based on a Wavelet Transform and a Noise Model
Image filtering allows, as a special case, thresholding and reading off segmented regions. Such approaches have been used for very fast – indeed one could say with justice, turbo-charged–clustering. See [
18,
19].
Noise models are particularly important in the physical sciences (cf. CCD, charge-coupled device, detectors) and the following approach was developed in [
20]. Observed data 
f in the physical sciences are generally corrupted by noise, which is often additive and which follows in many cases a Gaussian distribution, a Poisson distribution, or a combination of both. Other noise models may also be considered. Using Bayes’ theorem to evaluate the probability distribution of the realization of the original signal 
g, knowing the data 
f, we have:
	  
 is the conditional probability distribution of getting the data 
f given an original signal 
g, i.e., it represents the distribution of the noise. It is given, in the case of uncorrelated Gaussian noise with variance 
, by:
	  
      The denominator in Equation (
3) is independent of 
g and is considered as a constant (stationary noise). 
 is the a priori distribution of the solution 
g. In the absence of any information on the solution 
g except its positivity, a possible course of action is to derive the probability of 
g from its entropy, which is defined from information theory.
If we know the entropy 
H of the solution (we describe below different ways to calculate it), we derive its probability by:
	  
Given the data, the most probable image is obtained by maximizing 
. This leads to algorithms for noise filtering and to deconvolution [
16].
We need a probability density 
 of the data. The Shannon entropy, 
 [
21], is the summing of the following for each pixel,
      
      where 
 is an image containing integer values, 
 is the number of possible values of a given pixel 
 (256 for an 8-bit image), and the 
 values are derived from the histogram of 
g as 
, where 
 is the number of occurrences in the histogram’s 
kth bin.
The trouble with this approach is that, because the number of occurrences is finite, the estimate 
 will be in error by an amount proportional to 
 [
22]. The error becomes significant when 
 is small. Furthermore this kind of entropy definition is not easy to use for signal restoration, because its gradient is not easy to compute. For these reasons, other entropy functions are generally used, including:
	  
- Gull and Skilling [ 25- ]:
           - 
          where  M-  is a given model, usually taken as a flat image 
In all definitions 
n is the number of pixels, and 
k represents an index pixel. For the three entropies above, unlike Shannon’s entropy, a signal has maximum information value when it is flat. The sign has been inverted (see Equation (
5)), to arrange for the best solution to be the smoothest.
Now consider the entropy of a signal as the sum of the information at each scale of its wavelet transform, and the information of a wavelet coefficient is related to the probability of it being due to noise. Let us look at how this definition holds up in practice. Denoting 
h the information relative to a single wavelet coefficient, we define:
      with the information of a wavelet coefficient, 
, (Burg’s definition rather than Shannon’s). 
l is the number of scales, and 
 is the number of samples in wavelet band (scale) 
j. For Gaussian noise, and recalling that wavelet coefficients at a given resolution scale are of zero mean, we get
      
      where 
 is the noise at scale 
j. When we use the information in a functional to be minimized (for filtering, deconvolution, thresholding, etc.), the constant term has no effect and we can omit it. We see that the information is proportional to the energy of the wavelet coefficients. The larger the value of a normalized wavelet coefficient, then the lower will be its probability of being noise, and the higher will be the information furnished by this wavelet coefficient.
In summary,
      
- Entropy is closely related to energy, and as shown can be reduced to it, in the Gaussian context. 
- Using probability of wavelet coefficients is a very good way of addressing noise, but less good for non-trivial signal. 
- Entropy has been extended to take account of resolution scale. 
In this section we have been concerned with the following view of the data:  where observed data f is comprised of original data g, plus (possibly) background α (flat signal, or stationary noise component), plus noise ϵ. The problem of discovering signal from noisy, observed data is important and highly relevant in practice but it has taken us some way from our goal of cluster or segmentation modeling of f – which could well have been cleaned and hence approximate well g prior to our analysis.
An additional reason for discussing the work reported on in this section is the common processing platform provided by entropy.
Often the entropy provides the optimization criterion used (see [
13,
16,
26], and many other works besides). In keeping with entropy as having a key role in a common processing platform we instead want to use entropy for cross-model selection. Note that it complements other criteria used, e.g., ML, least squares, etc. We turn now towards a new way to define entropy for application across families of GMM analysis, wavelet transform based approaches, and other approaches besides, all with the aim of furnishing alternative segmentations.
  7. Model-Based Rényi Quadratic Entropy
Consider a mixture model:
	  
      Here 
f could correspond to one level of a mutiple resolution transformed signal. The number of mixture components is 
k.
Now take 
 as Gaussian:
	  
      where 
.
Take
      
      (
I = identity) for simplicity of the basic components used in the model.
A (i) parsimonious (ii) covering of these basic components can use a BIC approximation to the Bayes factor (see 
section 2) for selection of model, 
k.
Each of the functions 
 comprising the new basis for the observed density 
f can be termed a 
radial basis [
27]. A radial basis network, in this context, is an iterative EM-like fit optimization algorithm. An alternative view of parsimony is the view of a sparse basis, and model fitting is sparsification. This theme of sparse or compressed sampling is pursued in some depth in [
17].
We have:
      See [
13] or [
26]. Consider the case – apropriate for us – of only distinct clusters so that summing over 
 we get:
	  
Hence
      
      can be written as:
	  
      from restriction (
14) and also restricting the weights, 
. The term we have obtained expresses interactions beween pairs. Function 
 is a Gaussian. There are evident links here with Parzen kernels [
28,
29] and clustering through mode detection (see e.g., [
30,
31] and references therein).
For segmentation we will simplify further Equation (
20) to take into account just the equiweighted segments reduced to their mean (cf. [
28]).
In line with how we defined mutiple resolution entropy in 
Section 6, we can also define the Rényi quadratic information of wavelet transformed data as follows: