2.1. Distortion-Rate Theory
Clustering can be viewed as projecting a large number of discrete samples from the input space into a finite set of discrete symbols in the clustered space, where each symbol represents a cluster. Thus, clustering is a many-to-one mapping from the input space, $X$, to the clustered space, $C$, and can be fully characterized by the conditional probability distribution, $p(c \mid x)$. Using this mapping, the distribution of the clustered space is estimated as:

        $p(c) = \sum_{x} p(c \mid x)\, p(x)$        (1)

where $p(x)$ is the distribution of the input space.
Figure 1 demonstrates a many-to-one mapping, where each symbol, $c_i$, for $i = 1, \ldots, M$, represents a cluster of samples from the input space, and $M$ is the number of clusters.
Although the clusters contain different numbers of samples, the average number of samples per cluster is $2^{H(X \mid C)}$, where $H(X \mid C)$ is the conditional entropy of the input space given the clustered space and is estimated as:

        $H(X \mid C) = -\sum_{c} p(c) \sum_{x} p(x \mid c) \log p(x \mid c)$        (2)

The number of clusters is $2^{H(C)}$, where $H(C)$ is the entropy of the clustered space. Note that $H(C)$ is upper bounded by $\log M$ and is equal to the upper bound only when all clusters have an equal number of samples.
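As a small numerical illustration of these quantities (the cluster sizes below are assumed toy values, not taken from the paper), the following Python snippet computes $H(C)$, its upper bound $\log_2 M$, and the effective cluster size $2^{H(X \mid C)}$ for a hard clustering:

```python
import numpy as np

# Assumed toy hard clustering: N = 12 samples split into M = 3 clusters of sizes 6, 3 and 3.
cluster_sizes = np.array([6, 3, 3])
N, M = cluster_sizes.sum(), len(cluster_sizes)

p_c = cluster_sizes / N                                  # p(c): relative cluster sizes
H_C = -np.sum(p_c * np.log2(p_c))                        # entropy of the clustered space (bits)
H_X_given_C = np.sum(p_c * np.log2(cluster_sizes))       # H(X|C) for a hard mapping with uniform p(x)

print(f"H(C)     = {H_C:.3f} bits (upper bound log2(M) = {np.log2(M):.3f})")
print(f"2^H(C)   = {2 ** H_C:.3f} effective clusters")
print(f"2^H(X|C) = {2 ** H_X_given_C:.3f} effective samples per cluster (N/M = {N / M:.1f})")
```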
  
    
  
  
Figure 1. Demonstration of a many-to-one mapping from the input space, containing a semi-infinite number of discrete samples, to a finite number of symbols, $M$, in the clustered space.
To obtain a lossless many-to-one mapping, the immediate goal is to preserve the information in $X$ in the projected space, $C$. The loss of information due to the mapping is measured by the conditional entropy, $H(X \mid C) = H(X) - I(X;C)$, where $H(X)$ is the amount of information in $X$. The mutual information between the input and clustered spaces, $I(X;C)$, is estimated as:

        $I(X;C) = \sum_{x} \sum_{c} p(x)\, p(c \mid x) \log \frac{p(c \mid x)}{p(c)}$        (3)

Notice that the mutual information is estimated based on only the input distribution, $p(x)$, and the mapping distribution, $p(c \mid x)$. The mutual information gives the rate at which the clustered space represents the input space. For a lossless mapping, $H(X \mid C) = 0$ or, equivalently, $I(X;C) = H(X)$, which means that all the information in the input space is transferred to the clustered space. A higher rate for the clustered space yields less information loss, while reducing this rate increases the information loss, thus introducing a tradeoff between the rate and the information loss. In clustering, the goal is to introduce a lossy many-to-one mapping that reduces the rate by representing the semi-infinite input space with a finite number of clusters, thereby accepting an information loss such that $I(X;C) < H(X)$.
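For concreteness, the sketch below evaluates (1) and (3) for a small, made-up input distribution $p(x)$ and a hard mapping $p(c \mid x)$ (both arrays are illustrative assumptions, not values from the paper):

```python
import numpy as np

# Assumed input distribution over six discrete samples.
p_x = np.array([0.25, 0.20, 0.15, 0.15, 0.15, 0.10])

# Assumed hard mapping p(c|x): rows index samples, columns index the M = 2 clusters.
p_c_given_x = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0],
                        [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])

# Equation (1): p(c) = sum_x p(c|x) p(x).
p_c = p_c_given_x.T @ p_x

# Equation (3): I(X;C) = sum_x sum_c p(x) p(c|x) log2(p(c|x) / p(c)); 0*log(0) terms contribute zero.
mask = p_c_given_x > 0
I_xc = np.sum(p_x[:, None] * p_c_given_x * np.log2(np.where(mask, p_c_given_x / p_c, 1.0)))

H_x = -np.sum(p_x * np.log2(p_x))
print(f"I(X;C) = {I_xc:.3f} bits <= H(X) = {H_x:.3f} bits")   # lossy mapping: I(X;C) < H(X)
```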
The primary goal in clustering is to form clusters with the highest similarity, or equivalently the lowest distortion, among their samples. The distortion is the expected value of the distance between the input and clustered spaces, $d(x, c)$, defined based on the joint distribution, $p(x, c) = p(x)\, p(c \mid x)$, as:

        $D = E[d(x, c)] = \sum_{x} \sum_{c} p(x)\, p(c \mid x)\, d(x, c)$        (4)
Different proximity measures can be defined as the distortion; for instance, for the squared Euclidean distance, $d(x, c_i) = \|x - \mu_i\|^2$, the $\mu_i$ are the centers of the clusters, $\sigma_i^2 = E\big[\|x - \mu_i\|^2 \mid c_i\big]$ is the cluster variance, and $p(x \mid c_i) = \frac{1}{Z}\,\mathbb{1}[x \in c_i]$ is a uniform distribution over the samples of the cluster, where $Z$ is the normalizing term and $\mathbb{1}[\cdot]$ is the indicator function. The tradeoff between the preserved amount of information and the expected distortion is characterized by the Shannon-Kolmogorov rate-distortion function, where the goal is to achieve the minimum rate for a given distortion, as illustrated by the horizontal arrow in
Figure 2. The rate-distortion optimization has been extensively used for quantization, where the goal is to achieve the minimum rate for a desired distortion [17]. Unlike quantization, the goal in clustering is to minimize the distortion for a preferred number of clusters, $M$; thus, the distortion-rate function is optimized instead:

        $D(R) = \min_{p(c \mid x):\, I(X;C) \le R} E[d(x, c)]$        (5)
In Figure 2, the vertical arrow demonstrates the distortion-rate optimization, which achieves the lowest distortion for a desired rate. Note that the number of clusters, $M$, places an upper bound on the rate, since $I(X;C) \le H(C) \le \log M$. Assuming that decreasing the distortion monotonically increases the mutual information, clustering can be interpreted as maximizing the mutual information for a fixed number of clusters, $M$.
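To see the tradeoff between the rate in (3) and the expected distortion in (4) numerically, the sketch below compares two hard mappings of the same one-dimensional toy samples (the data, split points, and cluster counts are our assumptions): the coarser mapping has a lower rate but a higher squared-Euclidean distortion.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D samples drawn around three assumed modes.
x = np.concatenate([rng.normal(m, 0.3, 50) for m in (0.0, 3.0, 6.0)])

def rate_and_distortion(x, labels):
    """Rate I(X;C) (equal to H(C) for a hard mapping of distinct samples, in bits)
    and expected squared-Euclidean distortion to the cluster centers."""
    p_c = np.bincount(labels) / len(x)
    rate = -np.sum(p_c * np.log2(p_c))
    centers = np.array([x[labels == c].mean() for c in range(labels.max() + 1)])
    distortion = np.mean((x - centers[labels]) ** 2)
    return rate, distortion

labels_3 = np.digitize(x, [1.5, 4.5])   # M = 3 clusters, split at the assumed mode boundaries
labels_2 = np.digitize(x, [4.5])        # M = 2 clusters: lower rate, higher distortion

for name, labels in (("M = 3", labels_3), ("M = 2", labels_2)):
    r, d = rate_and_distortion(x, labels)
    print(f"{name}: rate = {r:.2f} bits, distortion = {d:.2f}")
```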
  
    
  
  
Figure 2. Demonstration of the rate-distortion and distortion-rate optimizations by the horizontal and vertical arrows, respectively.
   2.3. Parzen Window Estimator with Gaussian Kernels
The distribution of the samples in cluster $c_i$ is approximated by the non-parametric Parzen window estimator with Gaussian kernels [21,22], in which a Gaussian function is centered on each sample as:

        $p(x \mid c_i) = \frac{1}{|c_i|} \sum_{x_k \in c_i} \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - x_k)^{T} \Sigma^{-1} (x - x_k) \right)$        (8)

where $(\cdot)^{T}$ is the transpose, $d$ is the dimension of $x$, $\Sigma$ is the covariance matrix, $x_k$ are the samples of cluster $c_i$, and the cardinality $|c_i| = N_i$ is the number of samples in that cluster. Assuming the variances of the different dimensions are equal and independent of each other, thus providing a diagonal covariance matrix with constant elements, $\Sigma = \sigma^2 I$, the distribution is simplified as:

        $p(x \mid c_i) = \frac{1}{N_i} \sum_{x_k \in c_i} G(x - x_k, \sigma^2 I)$        (9)

where $G(x - x_k, \sigma^2 I)$ is a Gaussian function with mean $x_k$ and variance $\sigma^2$.
Using the distribution estimator in (9), the quadratic terms in (7) can be further simplified as:

        $\int p(x \mid c_i)\, p(x \mid c_j)\, dx = \frac{1}{N_i N_j} \sum_{x_k \in c_i} \sum_{x_l \in c_j} G(x_k - x_l, 2\sigma^2 I)$        (10)

where $x_k$ and $x_l$ are the samples from clusters $c_i$ and $c_j$, respectively, and $c_i$ and $c_j$ are clusters from the clustered space. Note that the convolution of two Gaussian functions is also a Gaussian function.
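A minimal NumPy sketch of (9) and (10) under the above assumptions (isotropic kernels with a user-chosen variance; the two placeholder clusters are random data introduced only to exercise the functions):

```python
import numpy as np

def gaussian(diff, var):
    """Isotropic Gaussian G(diff, var*I) evaluated at the difference vectors in `diff` (rows)."""
    d = diff.shape[1]
    norm = (2.0 * np.pi * var) ** (-d / 2.0)
    return norm * np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * var))

def parzen_density(x, cluster_samples, var):
    """Equation (9): Parzen estimate of p(x | c_i) at a single point x."""
    return gaussian(x[None, :] - cluster_samples, var).mean()

def pairwise_overlap(ci, cj, var):
    """Equation (10): integral of p(x|c_i) p(x|c_j) dx, via the Gaussian convolution identity
    (the convolution of two Gaussians with variance var is a Gaussian with variance 2*var)."""
    diffs = ci[:, None, :] - cj[None, :, :]                  # all pairwise differences
    g = gaussian(diffs.reshape(-1, ci.shape[1]), 2.0 * var)  # G(x_k - x_l, 2*sigma^2*I)
    return g.mean()                                          # 1/(N_i N_j) * sum over pairs

# Placeholder clusters in 2-D, just to exercise the functions.
rng = np.random.default_rng(1)
c1 = rng.normal(0.0, 1.0, size=(40, 2))
c2 = rng.normal(3.0, 1.0, size=(30, 2))
print(pairwise_overlap(c1, c1, var=0.5), pairwise_overlap(c1, c2, var=0.5))
```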
Returning to the clustering problem, in which the input space consists of the individual samples and the clustered space is the finite set of clusters, the quadratic mutual information in (7) is restructured in the following discrete form:

        $I_Q(X;C) = \sum_{i=1}^{M} \int \big( p(x, c_i) - p(x)\, p(c_i) \big)^2\, dx$        (11)
The distribution of the data, $p(x)$, is equal to the distribution of all samples considered as one cluster, and is estimated using (9) as:

        $p(x) = \frac{1}{N} \sum_{k=1}^{N} G(x - x_k, \sigma^2 I)$        (12)

where $N$ is the total number of samples, $N = \sum_{i=1}^{M} N_i$, and $N_i$ is the number of samples in the $i$-th cluster, $c_i$. The distribution of the clustered space, on the other hand, is estimated as:

        $p(c_i) = \frac{N_i}{N}$        (13)
The joint distribution, $p(x, c_i)$, for each of the $M$ clusters of the clustered space is estimated as:

        $p(x, c_i) = p(x \mid c_i)\, p(c_i) = \frac{1}{N} \sum_{x_k \in c_i} G(x - x_k, \sigma^2 I)$        (14)
Substituting (12), (13) and (14) into (11) provides the following approximation for the discrete quadratic mutual information (proof provided in the Appendix):

        $I_Q(X;C) = \sum_{i=1}^{M} \left[ \frac{1}{N^2} \sum_{x_k \in c_i} \sum_{x_l \in c_i} G(x_k - x_l, 2\sigma^2 I) - \frac{2 N_i}{N^3} \sum_{x_k \in c_i} \sum_{l=1}^{N} G(x_k - x_l, 2\sigma^2 I) + \frac{N_i^2}{N^4} \sum_{k=1}^{N} \sum_{l=1}^{N} G(x_k - x_l, 2\sigma^2 I) \right]$        (15)

For simplification, we here define the between-cluster distance among clusters $c_i$ and $c_j$ as $V_{ij} = \frac{1}{N_i N_j} \sum_{x_k \in c_i} \sum_{x_l \in c_j} G(x_k - x_l, 2\sigma^2 I)$; therefore, (15) can be represented as:

        $I_Q(X;C) = \frac{1}{N^2} \sum_{i=1}^{M} N_i^2 \left[ V_{ii} - \frac{2}{N} \sum_{j=1}^{M} N_j V_{ij} + P \right]$        (16)

where $P = \frac{1}{N^2} \sum_{j=1}^{M} \sum_{l=1}^{M} N_j N_l V_{jl} = \int p(x)^2\, dx$ is a constant.
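The sketch below evaluates (16) from a vector of cluster labels, reusing the `pairwise_overlap` helper and the placeholder clusters `c1`, `c2` from the previous sketch; since (15) and (16) are reconstructed here, treat the exact expression as an approximation of the result derived in the Appendix:

```python
import numpy as np

def between_cluster_matrix(x, labels, var):
    """V[i, j] = (1 / (N_i N_j)) * sum_{k in c_i} sum_{l in c_j} G(x_k - x_l, 2*var*I), as in (16)."""
    clusters = [x[labels == c] for c in range(labels.max() + 1)]
    sizes = np.array([len(c) for c in clusters])
    M = len(clusters)
    V = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            V[i, j] = pairwise_overlap(clusters[i], clusters[j], var)
    return V, sizes

def quadratic_mutual_information(x, labels, var):
    """Discrete quadratic mutual information between samples and clusters, per (16)."""
    V, n = between_cluster_matrix(x, labels, var)
    N = len(x)
    P = (n[:, None] * n[None, :] * V).sum() / N**2        # constant term, independent of the clustering
    bracket = np.diag(V) - (2.0 / N) * (V @ n) + P        # bracketed term of (16), one value per cluster
    return (n**2 * bracket).sum() / N**2

x_all = np.vstack([c1, c2])
labels = np.array([0] * len(c1) + [1] * len(c2))
print(quadratic_mutual_information(x_all, labels, var=0.5))
```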
  2.4. Hierarchical Optimization
The proposed hierarchical algorithm, similar to most hierarchical clustering algorithms, operates in a bottom-up fashion. In this approach, clusters are merged until one cluster is obtained, and then the whole process is evaluated to find the number of clusters that best fits the data [4]. Such clustering algorithms start by treating each sample as an individual cluster, and therefore require $N - 1$ merging steps. To reduce the number of merging steps, hierarchical algorithms generally exploit a low-complexity initial clustering, such as k-means clustering, to generate $M_0$ clusters, far beyond the expected number of clusters in the data, but still much smaller than the number of samples, $N$ [23,24]. The initial clustering generates small spherical clusters, while significantly reducing the computational complexity of the hierarchical clustering.
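The initial over-clustering step could, for example, be sketched with scikit-learn's k-means (the library choice and the value $M_0 = 50$ are our assumptions, not prescriptions from the paper):

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data: N samples in two dimensions.
rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 2))

M0 = 50  # assumed: much larger than the expected number of clusters, much smaller than N
initial_labels = KMeans(n_clusters=M0, n_init=10, random_state=0).fit_predict(X)
```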
Similarly, in the proposed hierarchical algorithms, clusters are merged; however, the criterion is to maximize the quadratic mutual information. Here we propose two approaches for merging: agglomerative clustering and split and merge clustering. In each hierarchy of the agglomerative clustering, two clusters are merged into one cluster so as to maximize the quadratic mutual information. In each hierarchy of the split and merge clustering, on the other hand, the cluster that has the worst effect on the quadratic mutual information is first eliminated, and then its samples are assigned to the remaining clusters in the clustered space. In the following, these two approaches are explained in detail.
  2.4.1. Agglomerative Clustering
In this approach, we compute the change in the quadratic mutual information after combining each pair of clusters in order to find the best two clusters for merging. We pick the pair that generates the largest increase in the quadratic mutual information. Since clusters with the lowest distortion are generated at each hierarchy, this approach can be used to optimize the distortion-rate function in (5). Assuming that clusters $c_i$ and $c_j$ are merged to produce $c_{i \cup j}$, the change in the quadratic mutual information, $\Delta I_t(c_i, c_j)$, can be estimated as:

        $\Delta I_t(c_i, c_j) = I_t - I_{t-1} = \frac{2 N_i N_j}{N^2} \left[ V_{ij} - \frac{1}{N} \sum_{q} N_q \big( V_{iq} + V_{jq} \big) + P \right]$        (17)

where $I_t$ is the quadratic mutual information at step $t$.
The closed-form expression in (17) identifies the best pair for merging without explicitly combining each pair and re-estimating the quadratic mutual information. Eventually, the maximum $\Delta I_t$ recorded at each hierarchy determines the true number of clusters in the data.
Table 1 introduces the pseudo code for the agglomerative clustering approach.
  
    
  
  
Table 1. Pseudo code for the agglomerative clustering.

| 1: Initial clustering into $M_0$ clusters |
| 2: for $t = M_0, M_0 - 1, \ldots, 2$ do |
| 3: Estimate $\Delta I_t(c_i, c_j)$ for all pairs $(c_i, c_j)$ |
| 4: Merge clusters $c_{i^*}$ and $c_{j^*}$, in which $(i^*, j^*) = \arg\max_{(i,j)} \Delta I_t(c_i, c_j)$ |
| 5: end for |
| 6: Determine # of clusters |
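A compact Python sketch of this loop, building on the `quadratic_mutual_information` helper defined earlier; for clarity it re-evaluates the full quadratic mutual information for every candidate pair rather than using the closed form in (17), and the function name and structure are ours:

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans

def agglomerative_qmi(x, labels, var):
    """Greedily merge, at each hierarchy, the pair of clusters that best preserves the QMI."""
    labels = labels.copy()
    history = []                                        # (number of clusters, change in QMI) per merge
    while labels.max() > 0:                             # stop when a single cluster remains
        current = quadratic_mutual_information(x, labels, var)
        best_pair, best_delta = None, -np.inf
        for i, j in itertools.combinations(range(labels.max() + 1), 2):
            merged = np.where(labels == j, i, labels)   # tentatively merge cluster j into cluster i
            merged[merged > j] -= 1                     # keep labels contiguous: 0 .. M-2
            delta = quadratic_mutual_information(x, merged, var) - current
            if delta > best_delta:
                best_pair, best_delta = (i, j), delta
        i, j = best_pair
        labels = np.where(labels == j, i, labels)
        labels[labels > j] -= 1
        history.append((labels.max() + 1, best_delta))
    return history

# Example: start from an assumed over-clustering of the placeholder data into M0 = 6 clusters.
init = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(x_all)
print(agglomerative_qmi(x_all, init, var=0.5))  # inspect the QMI changes to pick the number of clusters
```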
      
   2.4.2. Split and Merge Clustering
Unlike the agglomerative clustering, this approach detects one cluster at each hierarchy for elimination. This cluster has the worst effect on the quadratic mutual information, meaning that, out of all clusters, it is the one whose elimination maximizes the mutual information. Assuming that cluster $c_w$ has the worst effect on the mutual information, the change in the quadratic mutual information, $\Delta I_t(c_w)$, can be estimated as:

        $\Delta I_t(c_w) = -\frac{N_w^2}{N^2} \left[ V_{ww} - \frac{2}{N} \sum_{j=1}^{M} N_j V_{wj} + P \right]$        (18)

i.e., the negative of the contribution of $c_w$ to (16). The samples of the worst cluster are then individually assigned to the remaining clusters of the clustered space based on the minimum Euclidean distance, with the closest samples assigned first. This process also proceeds until one cluster remains. Eventually, the true number of clusters is determined from the maximum changes in the quadratic mutual information at the different hierarchies, $\Delta I_t$.
Table 2 introduces the pseudo code for the split and merge clustering approach.
  
    
  
  
Table 2. Pseudo code for the split and merge clustering.

| 1: Initial clustering into $M_0$ clusters |
| 2: for $t = M_0, M_0 - 1, \ldots, 2$ do |
| 3: Estimate $\Delta I_t(c_w)$ for all clusters $c_w$ |
| 4: Eliminate cluster $c_{w^*}$, in which $w^* = \arg\max_{w} \Delta I_t(c_w)$ |
| 5: for each sample $x_k$ of $c_{w^*}$ do |
| 6: Assign sample $x_k$ to the cluster with the minimum Euclidean distance |
| 7: end for |
| 8: end for |
| 9: Determine # of clusters |
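A sketch of the split and merge hierarchy in the same style, again building on `quadratic_mutual_information` and reusing `x_all` and the k-means labels `init` from the previous sketches; it scores each candidate elimination by re-evaluating the QMI after a simple nearest-mean reassignment instead of using the closed form in (18), and it does not reproduce the closest-sample-first ordering, so treat it as a simplified variant:

```python
import numpy as np

def reassign(x, labels, w):
    """Remove cluster w and assign its samples to the nearest remaining cluster mean."""
    keep = [c for c in range(labels.max() + 1) if c != w]
    means = np.array([x[labels == c].mean(axis=0) for c in keep])
    new = labels.copy()
    orphan = labels == w
    nearest = np.argmin(np.linalg.norm(x[orphan][:, None, :] - means[None, :, :], axis=2), axis=1)
    new[orphan] = np.array(keep)[nearest]
    return np.searchsorted(np.unique(new), new)          # relabel to the contiguous range 0 .. M-2

def split_and_merge_qmi(x, labels, var):
    """At each hierarchy, eliminate the cluster whose removal best preserves the QMI."""
    labels = labels.copy()
    history = []
    while labels.max() > 0:
        current = quadratic_mutual_information(x, labels, var)
        best_w, best_delta = None, -np.inf
        for w in range(labels.max() + 1):
            delta = quadratic_mutual_information(x, reassign(x, labels, w), var) - current
            if delta > best_delta:
                best_w, best_delta = w, delta
        labels = reassign(x, labels, best_w)
        history.append((labels.max() + 1, best_delta))
    return history

print(split_and_merge_qmi(x_all, init, var=0.5))
```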
      
Comparing the two proposed hierarchical algorithms, the split and merge clustering has the advantage of being unbiased with respect to the initial clustering, since the eliminated cluster at each hierarchy is entirely re-clustered. However, the computational complexity of the split and merge algorithm is higher than that of the agglomerative clustering, since the samples of the eliminated cluster must be reassigned individually at every hierarchy, whereas the agglomerative approach only evaluates pairs of clusters. The split and merge clustering also has the advantage of being less sensitive to the choice of variance for the Gaussian kernels, since the re-clustering is performed based on the minimum Euclidean distance.
Both proposed hierarchical approaches are unsupervised clustering algorithms and therefore require finding the true number of clusters. Determining the number of clusters is challenging, especially when no prior information about the data is given. In the proposed hierarchical clustering, we have access to the changes of the quadratic mutual information across the hierarchies. The true number of clusters is determined when the mutual information is maximized or when a dramatic change in the rate is observed.
Another parameter to be set for the proposed hierarchical clustering is the variance of the Gaussian kernels for the Parzen window estimator. Different variances reveal different structures in the data. Although there are no theoretical guidelines for choosing this variance, some statistical methods can be used. For example, an approximation of the variance in each dimension can be obtained from the corresponding diagonal element of the covariance matrix of the data [25]. We can also set the variance proportional to the minimum of the variances observed in the individual dimensions [26].
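As an illustration of such data-driven choices (the exact rules of [25,26] are not reproduced here; the proportionality factor below is an assumed stand-in), the per-dimension sample variances can be turned into kernel variances as follows:

```python
import numpy as np

def kernel_variances(X, factor=0.1):
    """Two assumed heuristics for the Parzen kernel variance of d-dimensional data X."""
    diag = np.var(X, axis=0, ddof=1)   # diagonal elements of the data covariance matrix
    per_dimension = factor * diag      # one kernel variance per dimension, proportional to its spread
    isotropic = factor * diag.min()    # a single variance proportional to the smallest per-dimension variance
    return per_dimension, isotropic

rng = np.random.default_rng(3)
X = rng.normal(scale=[1.0, 3.0], size=(500, 2))
print(kernel_variances(X))
```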