Synthetic Aperture Radar Image Clustering with Curvelet Subband Gauss Distribution Parameters

Curvelet transform is a multidirectional multiscale transform that enables sparse representations for signals. Curvelet-based feature extraction for Synthetic Aperture Radar (SAR) naturally enables utilizing spatial locality; the use of curvelet-based feature extraction is a novel method for SAR clustering. The implemented method is based on curvelet subband Gaussian distribution parameter estimation and cascading these estimated values. The implemented method is compared against original data, polarimetric decomposition features and speckle noise reduced data with use of k-means, fuzzy c-means, spatial fuzzy c-means and self-organizing maps clustering methods. Experimental results show that the curvelet subband Gaussian distribution parameter estimation method with use of self-organizing maps has the best results among other feature extraction-clustering performances, with up to 94.94% overall clustering accuracies. The results also suggest that the implemented method is robust against speckle noise.


Introduction
Several remote sensing and observation systems are developed for earth surface monitoring, which can be grouped into three main categories: laser-based light detection and ranging (LIDAR), optical sensor-based multi-or hyper-spectral imaging, and microwave-based synthetic aperture radar (SAR).Among these methods, SAR is the most prominent as it has the best atmosphere permeability, better resolution and different modes of operation, such as polarimetry and interferometry.SAR imaging is an active imaging system with a microwave transmitter emitting pulsed radio waves and a receiver getting backscattered radio waves.Synthetic aperture utilizes the Doppler effect on microwave-illuminated regions to increase the azimuth direction resolution.The use of the Doppler effect results in increased azimuth resolution with reduced antenna length up to a physically allowed size.Commercially, SAR sensors are carried either by air or satellite platforms.The wavelength used in SAR imaging varies by usage requirements from 65 cm to 0.5 cm.SAR images are contaminated by a form of noise called speckle noise which can be modelled multiplicatively.SAR images are used in areas such as target detection, structure detection, road extraction, ship detection, land use classification, oil spill detection, ice field tracking, disaster aftermath evaluation, etc.These fields of use require a great deal of continuous observation and manual analysis.At this point, the use of automatic analysis tools is inevitable.
In SAR literature, pixel-based, region-based and contour-based clustering and segmentation algorithms are applied alone or in a cascaded structure.In [1], iterative region growing with the semantics method based on a Markov random field, edge strength model and region growing is applied for SAR image clustering.In [2], a Markov random field approach for SAR clustering is enriched by introducing a third random variable.Ensemble learning of spectral clustering results based on gray level co-occurancy matrix (GLCM) and wavelet transform is introduced in [3] for SAR imagery.Spectral clustering is carried out by k-means clustering in a projection space, where the transformation matrix is calculated by eigenvectors of the Gaussian similarity matrix of samples.In [4], cascaded implementation of Voronoi tessellation, Bayesian inference and reversible jump Markov chain Monte Carlo (RJMCMC) methods are used for SAR clustering.Voronoi tessellations are used to decompose homogeneous polygonal regions and Bayesian inference and RJMCMC is used for labeling.In [5], the integrated active contour method is introduced.Compared to the active contour method, where image segmentation is defined as an energy minimization problem for a closed curve, the integrated active contour approach defines energy based on the maximum likelihood estimation of parted regions' gamma distributions.In [6], complex Wishart distribution features are used with Chernoff distance for agglomerative hierarchical clustering.In [7], level set segmentation is used together with the SAR Wishart distribution model.In [8], GLCM calculated on the Gabor filter results in the brushlet space used for SAR clustering.
The article is structured as follows: Section 2 gives information about the proposed feature extraction method (curvelet subband µ, σ features), together with benchmark feature sets.In Section 3, the test site, data format and clustering methods implemented are introduced.In Section 4, experimental results are presented with several measures: first, the experimental setup is introduced, followed by a presentation of the accuracies, and finally, clustering maps are given as a means of visual comparison.Section 5 concludes the work emphasizing the important findings.

Proposed Method
The proposed feature extraction method (curvelet subband µ, σ features) is introduced together with the benchmark methods (original data, speckle reduced data, polarimetric decomposition features) in this section.

Original Data
The original data is used as a base benchmark feature set for comparison.The original data features are constructed as taking the absolute values of the upper triangular matrix of the coherency matrix.Original data has six features per sample.

H/A/α Polarimetric Decomposition
Eigenvalue decomposition of the coherency matrix results in occurrence probabilities of three different scattering processes.The occurrence probabilities P j (j = 1, …, 3) of these scattering processes are the ratios of relevant eigenvalue λ j by the sum of all eigenvalues and can be given in Equation (1) [9].
The measure of randomness in the whole scattering process entropy H can be given in Equation ( 2) based on scattering process probabilities where 0 ≤ H ≤ 1.The lower value of H indicates one dominant scattering process, whereas higher value shows that there is volume scattering and the overall scattering is more random.
The anisotropy A is the measure of difference in secondary scattering mechanisms and can be given in Equation (3).Anisotropy provides complementary information to entropy and helps explaining the surface scatterer.
The α value is the average of polarization dependence of scattering processes and can be given in Equation (4).In Equation (4) α j values correspond to polarization dependence of three different scattering processes.
H/A/α polarimetric decomposition features used in this work are calculated by the PolSARpro software provided by the European Space Agency (ESA).Polarimetric decomposition is carried out for a window size of 5 × 5. H/A/α data has three features per sample.

Speckle Reducing Anisotropic Diffusion (SRAD) Filtered Data
SAR images contain speckle noise that highly affects the image quality.Among several speckle filters that are defined in the literature, the speckle reducing anisotropic diffusion (SRAD) filter comes forward with its properties and success.The SRAD filter is a nonlinear anisotropic diffusion technique that preserves edge-like features and can reduce noise in homogenous regions [10].Unlike other existing diffusion techniques that use log-compressed data, SRAD can process data directly.The SRAD filter is defined as an iterative method that updates the image according to instantaneous local statistics and variations.In [10], it is shown that SRAD performs better than conventional anisotropic diffusion, Lee filter and enhanced Frost filter by means of smoothing and edge preserving.For an image, I update value for iteration step t can be given in Equation (5).
In Equation ( 5)  represents divergence operator, ∇ is used for gradient operator,  represents a smoothing-limiting function and  is given as the local statistics value.The  value gives local variation degree based on gradient and Laplace operators as given in Equation (6).∆ represents the Laplace operator.
smoothing-limiting function adjusts the degree that image gradient is effective on the update amount. function, given in Equation (7), is set to give 1 for a  0 value which is calculated from a homogenous region.
0 can be calculated by  and ρ parameters and iteration step  as given in Equation (8). 0 in Equation ( 8) reaches zero as the iteration step increases.
The effecting parameters for SRAD filter can be summarized as  value, ρ value and number of iterations.Applying SRAD speckle noise reduction for each element of the upper triangular part of the coherency matrix results in six features per sample.

Curvelet Transform Subband Statistical Moments
Curvelet transform (CT) is a multidirectional multiscale transform that can extract local spatial and textural features.Compared to wavelet and similar transforms, it can be said that curvelet transform can represent curve-like features with greater sparsity [11].CT is closely related to frequency-domain wedge filters, short-time Fourier transform, wavelet transform, Gabor wavelet transform, ridgelet transform, contourlet transform and other directional wavelet transforms.The definition and implementation for CT is given in two forms in the literature, namely unequally spaced fast Fourier transform and wrapping of specially selected Fourier samples [12].
Curvelet transform is mostly utilized for speckle noise reduction in SAR image processing.In content-based image retrieval (CBIR) literature, two forms of curvelet-based feature extraction is introduced: the first one assumes curvelet subbands are normally distributed and estimates Gaussian distribution (GD) parameters, and the second one assumes curvelet subbands are distributed according to generalized Gaussian distribution (GGD) and estimates GGD parameters [13].
Curvelet-based, histogram of curvelets (HoC) feature extraction method is introduced in our previous work together with the first implementation of curvelet subband GGD parameter estimation features for SAR image classification [14].In our previous work, using only one element of the coherency matrix, histograms for each normalized curvelet subbands are cascaded to form a feature vector per pixel.The results show that the proposed HoC feature extraction method gives the best classification accuracies for most of the test setups but when the number of training samples are heavily reduced, SRAD results overtake by means of classification accuracy.In this work using all of the elements of coherency matrix, curvelet subband GD parameters are cascaded to form a feature vector to be used in clustering.
Curvelet family for a continuous domain is composed of directional wedge filter results of concentric scales together with a low pass component in the frequency domain.Frequency domain continuous curvelet transform tiling is given with Figure 1a.Continuous curvelet transform U j,ℓ (r,θ) at scale 2 −j (for j ≤ j 0 ) and rotation θ ℓ in the frequency domain for signals of R 2 is defined with ω frequency domain Cartesian variables, r, θ frequency domain polar variables, W radial windowing function and periodic with 2π radians V angular windowing function as in Equation ( 9).The ranges for the variables can be given as: r ≥ 0, θ ∈ [0, 2π), j ∈ N 0 , ℓ ∈ N 0 , θ ℓ = 2πℓ 2 − ⌊ j/2 ⌋ .j parameter is an element of positive natural numbers and defines a 2 − scaling for the windowing function, U j,ℓ term is given to address one of several frequency domain windows at scale 2 −j and orientation θ ℓ .
The scale where j ≤ j 0 is given as the coarse curvelet represents the low pass component.Coarse curvelet can be given with W 0 windowing function in Equation (10).Curvelet transform spatially is given with the Fourier pair of F{φ(x)}=U(ω).φ is obtained as a Gauss filtered oscillating function.All together parabolic scaled (unequal stretching at different axis) with D j , rotated with R θℓ and translated with k = (k 1 , k 2 ) ∈ R 2 versions of φ give the spatial curvelet family.The spatial curvelet family can be given in Equation (11).
Spatial counterpart of the coarse curvelet can be given in Equation (12).
Given the spatial curvelet family, curvelet transform coefficients  of a continuous signal f is obtained as the inner product of the function and the curvelets as in Equation ( 13), where φ denotes complex conjugate.
Discrete curvelet family in the frequency domain is defined as shear filters at concentric square windows as given in Figure 1b.The discrete curvelet transform coefficients c D of a discrete signal f [t 1 , t 2 ] of size n × n based on spatial discrete curvelet family φ D can be given as an inner product in Equation (14).
Curvelet transform family spatially is illustrated in Figure 2 with different orientations and scales together with the coarse curvelet presented in the frequency domain and spatially.Curvelet-based subband GD parameter estimation feature extraction for SAR image is carried out first with taking the curvelet transform of a window around the pixel of interest.Based on the number of orientations and scales used for curvelet transform, the number of curvelet subbands varies.Feature extraction for a pixel and its neighbors can then be carried out as calculating the mean and standard variation values for each subband and cascading them.Supposing S number of subbands and six elements of the coherency matrix, the number of features for each pixel can be given as 2 × S × 6.
Curvelet-based feature extraction is important as it emphasizes spatial locality and can extract agricultural field furrow-like features naturally.

Dataset Description and Clustering Methods
This section is divided into two subsections mainly focusing on the dataset description and clustering methods used, respectively.

Dataset Description
Test materials for this work are from widely used Flevoland data acquired by AirSAR platform.Flevoland is mostly regained from sea and is located in the middle of the Netherlands as given in Figure 3. Air SAR data of Flevoland is in the form of multispectral (P, L and C bands) full polarimetric (V[ertical]V, VH[orizantal], HV, VV polarizations) modes.The nominal spatial resolution is given between 5 and 10 m.The C band full polarimetric data is used for clustering of crop lands in this work.The region of interest (ROI) area, that is 320 × 200 pixels in size, is given in Figure 3c with false coloring.Ground truth label map of the ROI region is given in Figure 4.
Flevoland data is provided in T3 format, which is the average coherency matrix of reduced Pauli decomposition vector for each pixel over number of looks .The Pauli decomposition vector k and the definition of coherency matrix Ω can be given in Equations ( 15) and ( 16), respectively, based on polarimetric backscattering amplitudes (  : horizontally polarized transmitter and horizontally polarized receiver,   : horizontally polarized transmitter and vertically polarized receiver,   : vertically polarized transmitter and horizontally polarized receiver,   : vertically polarized transmitter and vertically polarized receiver).The elements of reduced Pauli decomposition vector correspond to odd bounce scattering, even bounce scattering and volume scattering components, which can be utilized to understand the underlying physical properties of the landcover.Coherency matrix is formed as the average of Pauli decomposition vector multiplied by its Hermitian (conjugate transpose) over number of looks.The coherency matrix contains second order moments of the scattering process and can be used to describe the correlation properties of natural scatterers [9].
Coherency matrix (3 × 3 matrix) is obtained as a Hermitian matrix, which means conjugate transpose of the matrix is equal to itself.For that reason, commercially, SAR data is provided as real diagonal values of the coherency matrix together with real and imaginary parts of the strictly upper triangular matrix.

K-Means Clustering
K-means algorithm is defined for clustering  distinct samples into total of  clusters (  ,  = 1, … , ).The algorithm minimizes the cost function  with respect to optimum  cluster centers based on total distance of samples to the cluster centers   that they belong to, given in Equation (17).In Equation (17),   corresponds to the center of mass for clusters and    −   to distance between 'th center and associated 'th sample.

𝐽
K-means associates each sample with one cluster.This can be expressed in Equation ( 18) with c × n-sized binary membership matrix U, in which each row sums up to 1.Each element   of the matrix  is 1 if the 'th sample belongs to cluster i and 0 if not.
Centers of the clusters are calculated in Equation (19) as the average of the samples of each cluster.Here   shows the number of samples in cluster   .
K-means algorithm can be given iteratively as Algorithm 1 [15].K-means does not guarantee an optimum result as it is heavily dependent on initial cluster centers.

Fuzzy C-Means (FCM) Clustering
Fuzzy C-means, which utilizes fuzzy valued memberships for clusters, is different from k-means as k-means has strict binary membership values.In FCM, each sample is assigned a membership value for each cluster that sums up to 1 over all clusters per sample.This membership can take values between 0 and 1. Cost function  for FCM is given in Equation (20).Here   defines the fuzzy membership value of sample  to cluster , whereas   is the distance from cluster center   to sample   and  ∈ 1, ∞ is the fuzzification coefficient.2D-SOM is a two-dimensional unsupervised clustering algorithm, which can be defined with a four-neighbor rectangular grid or three-neighbor hexagonal grid artificial neural network structure, used to transform the input space into a two-dimensional projection space preserving topology [18].SOM structure can be given for a total of  neurons,  ∈    dimensional input vectors and  × weights from input layer to neurons.SOM algorithm is an iterative method that updates the most similar neuron to an input and its neighboring neurons' weights in such a way that resemblance is increased to that input.The most similar neuron at step  to an input  is called the winning neuron  and can be defined with neuron weight  as given in Equation ( 25).Here  represents the set of SOM neurons.
How much the other neurons (indexed with ) are to be updated together with the winning neuron (indexed with ) can be defined with   position vector of i'th neuron, and ς() decreasing effective neighbor-distance-function as in Equation ( 26).

𝜂 𝜈, 𝑘, 𝑡 = 𝑒
The weight update can be defined with distance of the input  to the winning neuron, neighboring distance of other neurons to the winning neuron and adaptation coefficient  as in Equation ( 27).α can be chosen to be a monotonically decreasing linear, exponential or rational function.

Experimental Results
Experiments are conducted for each feature extraction method paired together with the aforementioned clustering methods.For the clustering methods, a number of clusters are chosen to be equal to the number of classes, except for the 2D-SOM, where a higher number of clusters is constructed.In the literature, clustering performance evaluation without ground truth is carried out by cluster validation measures, whereas with ground truth information, accuracy calculation after cluster labeling can be used as a performance measure.In 2D-SOM, each cluster is labeled with the majority of the sample labels, whereas in the rest of the methods labeling is carried out by maximizing the overall accuracy in a one-to-one correspondence manner.Kappa values together with the clustering accuracies according to known labels are given as performance measures.
Curvelet subband GD parameter estimation feature extraction is carried out for window size 33 × 33, number of scales 2 and number of orientations 16 (as a result 17 subbands per coherency matrix element).Thus, the number of features per pixel can be given as 204 (17 subbands × 6 elements of coherency matrix × 2 features) for curvelet subband GD parameter estimation.This method is denoted in the tables as μ, ς.It should also be noted that curvelet subband features are normalized feature-wise prior to being fed to clustering methods.
Experimental results are presented in two forms based on clustering accuracies and clustering maps.Clustering accuracies are given according to overall accuracies and Kappa values are able to assess the performance of the proposed feature extraction method compared with standard benchmark features.Clustering accuracies are also further analyzed for higher number of clusters for clustering methods (k-means, FCM, sFCM) that practically use the same number of clusters as class labels.Clustering maps are given in order to be able to provide visual comparison between feature extraction methods on each clustering method.

Accuracies and Errors
K-means clustering accuracies are given in Table 1 for the average of 20 runs for each feature extraction method.The best clustering accuracy is achieved for SRAD features with k-means algorithm up to 65.41% with the experiments.FCM clustering overall accuracies are given in Table 2 for the average of 20 runs.It can be seen from Table 2 that clustering overall accuracies can be increased compared to hard membership k-means, with the introduction of fuzzy cluster memberships.It should also be noted that as the  value increases, the threshold constraint is met at a small number of iterations.Feature extraction methods can be ordered with respect to clustering accuracies as SRAD, μ, ς features, original data and H/A/α for FCM method.
sFCM clustering method results are calculated for a fixed fuzzy membership value (m = 2), various feature-space and spatial fuzzy membership values (,  ∈ 0,1,2,4,8 ) and different window sizes ( ∈ 5,11,21 ).sFCM results of the 20-run averages for best clustering accuracy yielding parameters are given in Table 3.Compared to k-means and FCM, it can be said that the introduction of spatial information through clustering iterations with sFCM, enhanced clustering accuracy for H/A/α more than the curvelet subband µ, σ features.The best clustering accuracy for sFCM is achieved by SRAD features up to 85.11%.2D-SOM clustering results are calculated for 7 × 7, 9 × 9, 11 × 11 and 13 × 13-sized hexagonal grid networks and 3, 4, 5 and 6 initial neighborings respectively.2D-SOMs run for 1000 iterations and resulting clusters are labeled as the majority label they contain.In Table 4, overall clustering accuracies are given together with the number of unique labels assigned in parenthesis.SRAD features with 7 × 7 and 9 × 9 SOM networks have better clustering accuracies compared to µ, σ features, whereas as the network grows µ, σ features presents better accuracies.It can be inferred from the results in Table 4 that SRAD extracts similar features and curvelet subband GD parameter estimation extracts discriminating features.Thus with SRAD as the network grows, 2D-SOM cannot set previously clustered together samples apart clearly.On the other hand, with µ, σ features, as the network grows and the number of clusters increases, the use of discriminating features results in labeled samples falling into similar clusters.
Kappa values for the best clustering accuracies are given in Table 7. Kappa value can be considered as differentiation from the expected value of random labeling accuracies and is given as Equation (28), where () is the confusion matrix accuracy probability and () is the probability of random labeling.Overall the best Kappa value as 0.9382 is reached with the use of µ, σ features in 13 × 13 topology 2D-SOM.SRAD features again with 13 × 13 topology-SOM are placed second with Kappa value 0.9009.

𝜅 =
−   1 −   (28) Overall evaluation of feature extraction methods on k-means, FCM and sFCM together with 2D-SOM is also carried out with higher number of clusters for 20 runs.That is k-means, FCM, sFCM are evaluated with 50, 100 and 150 clusters and labels are given the same way as in 2D-SOM.FCM is run for  = 2, sFCM is run for  = 2,  = 1,  = 1 and  = 21.The results are given in graphical form with x-axis showing number of clusters and y-axis showing accuracies as in Figure 5.It can be seen from the graphs in Figure 5 that curvelet subband µ, σ features starts with slightly lower accuracy compared to SRAD in all clustering methods; however, with the increasing number of clusters curvelet subband µ, σ features reach up to the SRAD accuracies.In 2D-SOM clustering curvelet subband µ, σ features gives even better accuracies compared to SRAD.The sFCM method is also run with 169 clusters for SRAD and curvelet subband µ, σ features, and accuracies of 94.82% and 94.67% are obtained respectively.The best accuracy overall is obtained by 13 × 13 2D-SOM with curvelet subband µ, σ features with 94.94%.These results are also consistent with nine cluster results of k-means, FCM and sFCM.
Practically, feature extraction in SAR images is conducted following a speckle noise reduction step.However, the proposed feature extraction method can be carried out without speckle reduction, as it utilizes spatial features naturally by curvelet transform.Moreover as the curvelet subband features are extracted on averages and standard deviations the disturbing effect of speckle noise can be eliminated to some extent.Results from Table 1 to Table 4 and graphs from Figure 5 suggest that the proposed method is as accurate as SRAD features or even better at some experimental setups thereby demonstrating that the proposed method is robust against speckle noise.

Clustering Maps
K-means clustering maps are given in Figure 6 together with labels and the label map.Using only feature space hard memberships and similarity measures to cluster centers, k-means clustering mappings result in cluttered small regions for original data and H/A/α, whereas bigger homogeneous regions can be seen for SRAD and µ, σ features where spatial information is diffused through feature extraction.2D-SOM classification maps for 13 × 13 topologies are given in Figure 9.In analyzing 2D-SOM cluster mapping for SRAD, it can be seen that confusions in the labeling mostly occur at edges of the regions.This situation can be explained with SRAD preserving edges by definition.Cluster confusion can also be seen due to feature diffusion in neighboring regions both in SRAD and µ, σ features.The speckle anisotropic diffusion (SRAD) method reveals the best clustering accuracies for k-means, FCM, sFCM and small-sized 2D-SOMs.Curvelet subband µ, σ features give the best clustering accuracies for bigger 2D-SOMs, which are also the best clustering accuracy overall as 94.94%.The results suggest that SRAD-based features can be considered as extracting similar features among samples, whereas curvelet-based features can be considered as extracting discriminating features.These results mainly rely on the test site restrictions which can be listed as: having low relief angle, and the land being suitable for land use classification.Apart from the discussed clustering methods, hierarchical clustering methods are also expected to perform well with discriminating features from curvelet subband GD parameter estimation.Also, as future work, feature selection and feature reduction can be implemented to the extracted curvelet subband features to increase accuracy.

Figure 5 .
Figure 5.Comparison of accuracies of feature extraction methods on (a) k-means; (b) FCM; (c) SRAD and (d) 2D-SOM, with different number of clusters.
clustering maps for the best clustering accuracy yielding  values are given in Figure7together with labels and the label map.In Figure7, clustering maps are given for original data with  = 16, SRAD with  = 1.2, H/A/α features with  = 16 and curvelet subband GD µ, σ parameter estimation features with  = 1.4.
sFCM clustering maps for the best clustering accuracy yielding parameters are given in Figure8together with labels and the label map.Apart from spatial information being diffused by feature extraction in SRAD and curvelet µ, σ features, the introduction of spatial information through sFCM clustering steps also enables bigger homogeneous-labeled regions for all feature extraction methods.

Algorithm 1 :
K-means IN ,   =1…   ← randomly selected number of  samples from   repeat   ← calculate membership for each sample   with Equation (18)   ← calculate new cluster centers for each cluster with Equation (19) until   −   < [16] cluster center   can be given based on fuzzy membership values   in Equation (21).FCM algorithm can be given iteratively as Algorithm 2[15].FCM does not guarantee optimum cluster centers as it is heavily dependent on initial fuzzy membership values.-means is defined as of feature space fuzzy membership values through spatial neighborhood membership values[16].In each iteration, spatial fuzzy membership values (  ) are calculated based on feature space fuzzy membership values (  ) and those two values are fused together to form the overall fuzzy membership ( ′ ) of a sample.Spatial fuzzy membership  of samples for a w × w neighborhood window  can be given in Equation (23) based on feature space fuzzy membership values.

Table 3 .
The best sFCM overall clustering accuracies for fixed  = 2 and corresponding parameter values.

Table 5 .
Cluster confusion matrix for SRAD features.