Entropy Estimators in SAR Image Classification

Remotely sensed data are essential for understanding environmental dynamics, forecasting, and the early detection of disasters. Microwave remote sensing sensors complement the information provided by observations in the optical spectrum, with the advantages of being less sensitive to adverse atmospheric conditions and of carrying their own source of illumination. On the one hand, new generations and constellations of Synthetic Aperture Radar (SAR) sensors provide images with high spatial and temporal resolution and excellent coverage. On the other hand, SAR images suffer from speckle noise and require specific models and information extraction techniques. In this sense, the G0 family of distributions is a suitable model for SAR intensity data because it describes well areas with different degrees of texture. Information theory has gained a place in signal and image processing for parameter estimation and feature extraction, and entropy stands out as one of the most expressive features in this realm. We evaluate the performance of several parametric and non-parametric Shannon entropy estimators as input for supervised and unsupervised classification algorithms. We also propose a methodology for fine-tuning non-parametric entropy estimators. Finally, we apply these techniques to actual data.


Introduction
Images obtained with coherent illumination systems, such as Synthetic Aperture Radar (SAR), are contaminated by speckle. This noise-like interference phenomenon corrupts the image in a non-Gaussian and non-additive manner, making its processing and visual interpretation difficult.
Against this backdrop, statistical procedures are essential tools for processing SAR data. A suitable model for this sort of image is fundamental to obtain features that promote a good analysis. In this sense, the family of G0 distributions [1] has been extensively used to model SAR data because of its analytical simplicity and its ability to describe targets with a wide variety of roughness.
The application of machine and deep learning techniques to the classification, segmentation, and detection of objects in SAR images has become more popular in recent times. Palacio et al. [2] used machine learning techniques in combination with filters to perform classification in PolSAR images. Baek and Jung [3] compared three different machine learning techniques to classify single- and dual-pol SAR images, showing that the deep neural network presented the best performance.
Different authors have used transfer learning techniques to classify SAR images. These methods aim to solve the problem of having limited labeled area information to train deep convolutional neural networks (CNNs). Kang and He [4] applied this technique using a CNN trained on the CIFAR-10 dataset to extract a mid-level representation. They showed that it is adequate for overcoming the limited amount of labeled SAR data by comparing the results against a CNN trained without this technique and against a Support Vector Machine (SVM) combined with a Gabor filter or with gray-level co-occurrence matrices. Lu and Li [5] implemented this methodology using several popular pre-trained models and proposed a new method of data augmentation. They also made a comparison with related works and showed that their proposed method outperformed the others. Huang et al. [6] proposed to transfer the knowledge obtained from a large number of unlabeled SAR images by incorporating a reconstruction path with stacked convolutional autoencoders in the network architecture. Their proposal was competitive on the MSTAR dataset using all training samples, and had the best performance when the training dataset was small.
Transfer learning was also implemented by Rostami et al. [7]. They proposed to transfer the knowledge from the electro-optical domain to SAR by learning a shared embedding space, and they showed that their approach is effective when applied to a ship classification problem. Huang et al. [8] proposed another deep transfer learning method to solve the land cover classification problem with highly unbalanced classes, geographic diversity, and noisy labels. They showed that the proposed model, which uses cross-entropy, generalizes and can be applied to other SAR domains.
Several approaches have been developed to obtain expressive and tractable features from SAR data. In particular, entropy measures have been widely used for this purpose. Parameter estimation [9], classification [10], procedures for constructing confidence intervals and contrast measures [11,12], edge detection [13], and noise reduction filters [14] are among their applications.
Sundry authors have tackled the problem of segmenting and classifying SAR images using information theory measures. Nobre et al. [15] used Rényi's entropy for monopolarized SAR image segmentation. Ferreira and Nascimento [16] derived a closed-form expression for the Shannon entropy based on the G0 law for intensity data and proposed a new entropy-based segmentation method. Carvalho et al. [10] employed stochastic distances for unsupervised classification of Polarimetric Synthetic Aperture Radar (PolSAR) images. Shannon entropy has been applied to analyze SAR imagery in several approaches, from inference [11] to classification [16]. Therefore, its estimation deserves attention.
The parametric expression of the Shannon entropy of a system characterized by a continuous random variable is the well-known

H(Z) = −∫ f(z) log f(z) dz, (1)

where f is the probability density function that characterizes the distribution of the real-valued random variable Z. Several procedures can be applied to obtain an estimate of H(Z) given a random sample Z = (Z_1, Z_2, . . . , Z_n). The most direct family of estimators of H(Z) given Z consists of obtaining an estimator of θ, the parameter that indexes the distribution of Z, say θ̂, and using it in (1). This approach yields the families of maximum likelihood, moments, and robust estimators, to name a few. This is the "parametric approach".
"Non-parametric" approaches do not use θ as a proxy. Instead, they rely on the equivalent expression for the Shannon entropy

H(Z) = ∫_0^1 log{ (d/dp) F^{−1}(p) } dp, (2)

where F is the cumulative distribution function, which also characterizes the distribution of the random variable [17]. Such alternative approaches compute estimates of F in Equation (2) from the observed sample.
Vasicek [17] replaced the distribution function F by the empirical distribution function F_n and used a difference operator in place of the differential operator. van Es [18] studied an entropy estimator based on differences between order statistics. Correa [19] proposed a new entropy estimator determined from local linear regression. Al-Omari [20] and Noughabi and Noughabi [21] presented modified versions of the estimator introduced by Ebrahimi et al. [22]. It is important to mention that these estimators have been studied in different contexts. Maurizi [23] studied the works by Vasicek [17] and van Es [18] to estimate the entropy H(Z) when the random variable has support [0, 1]. Noughabi and Park [24] considered them to propose goodness-of-fit tests for the Laplace distribution. Suk-Bok et al. [25] assessed the proposal by [17] to estimate H(Z) for the double exponential distribution in the framework of multiple type-II censored sampling. More recently, Al-Labadi et al. [26] considered these estimators to propose a new Bayesian non-parametric estimation of entropy. Additionally, Lopes and Machado [27] considered Ref. [22] as a reference in their review of other entropy estimators.
In this paper, we study the performance of parametric and non-parametric estimators of the entropy in the context of supervised and unsupervised classification. In the parametric case, we use the relationship between the G 0 and Fisher distributions to obtain an expression of the entropy. In the non-parametric case, we assess these estimators in terms of bias, mean square error, computational time, and accuracy.

The G 0 Model
The multiplicative model defines the return Z in a monopolarized SAR image as the product of two independent random variables: one corresponding to the backscatter X, and the other to the speckle noise Y. In this manner, Z = XY represents the return in each pixel of the image.
The G0 distribution is an attractive model for Z because of its flexibility to adequately model areas with all types of roughness [28,29]. For intensity SAR data, this family arises from considering the speckle noise Y modeled as a Γ-distributed random variable with unitary mean and shape parameter L ≥ 1, the number of looks. We also assume that the backscatter X obeys a reciprocal gamma law. Thus, the density function for intensity data is given by

f(z; α, γ, L) = [ L^L Γ(L − α) / ( γ^α Γ(L) Γ(−α) ) ] z^{L−1} (γ + Lz)^{α−L}, (3)

where −α, γ, z > 0 and L ≥ 1. The r-order moment is E(Z^r) = (γ/L)^r Γ(L + r) Γ(−α − r) / ( Γ(L) Γ(−α) ), provided α < −r, and infinite otherwise. Mejail et al. [28] proved a relationship between the G0 distribution and the Fisher-Snedecor F law, which states that the cumulative distribution function F_{α,γ,L} of the return satisfies

F_{α,γ,L}(z) = Υ_{2L,−2α}(−αz/γ) (4)

for every z > 0, where Υ_{2L,−2α} is the cumulative distribution function of a Fisher-Snedecor random variable with 2L and −2α degrees of freedom. This connection is helpful to obtain a closed formula for the entropy.
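The relationship (4) also yields a convenient way of sampling from and evaluating the G0 law. The following sketch (assuming NumPy and SciPy; the function names are ours, not from any SAR library) draws G0 intensity deviates as the product of a reciprocal-gamma backscatter and a unit-mean gamma speckle, and evaluates the CDF through the Fisher-Snedecor law:

```python
import numpy as np
from scipy.stats import f as fisher_snedecor

def g0_cdf(z, alpha, gamma, L):
    # Equation (4): F_{alpha,gamma,L}(z) = Upsilon_{2L,-2alpha}(-alpha z / gamma)
    return fisher_snedecor.cdf(-alpha * z / gamma, 2 * L, -2 * alpha)

def g0_rvs(alpha, gamma, L, size, rng):
    # Z = X * Y: reciprocal-gamma backscatter times unit-mean gamma speckle
    y = rng.gamma(shape=L, scale=1.0 / L, size=size)           # speckle Y
    x = gamma / rng.gamma(shape=-alpha, scale=1.0, size=size)  # backscatter X
    return x * y

rng = np.random.default_rng(0)
z = g0_rvs(alpha=-3.0, gamma=2.0, L=2, size=50_000, rng=rng)
# with gamma = -alpha - 1 = 2, the first moment gives E(Z) = 1
```

The empirical distribution of the simulated deviates can then be checked against g0_cdf, which exercises relationship (4) directly.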

Shannon Entropy
Shannon's contribution to the creation of what is known as information theory is well known. Shannon [30] proposed a new way of measuring the transmission of information through a channel, thinking of information as a statistical concept. The entropy of the G0 distribution can be obtained using (4). Denote by H_F the entropy under the Fisher-Snedecor model; then the G0 entropy for intensity data H_{G0} is

H_{G0}(α, γ, L) = H_F(2L, −2α) + log γ − log(−α). (5)

Using (5), the expression of H_{G0} is

H_{G0}(α, γ, L) = (1 − L)ψ^{(0)}(L) − (1 − α)ψ^{(0)}(−α) + (L − α)ψ^{(0)}(L − α) + log B(L, −α) + log γ − log L, (6)

where ψ^{(0)} and B are the digamma and beta functions, respectively. Figure 1 shows the theoretical entropy H_{G0}(α, γ, L) as a function of α and γ with L = 2. It can be shown that, for each fixed γ, H_{G0} is an injective function; the same behavior repeats if we consider α as a constant.
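The closed-form expression (6) can be checked numerically against the definition (1). The sketch below (assuming SciPy; the function names are ours) evaluates (6) with digamma and log-beta functions and compares it with the numerical integral of −f log f:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import betaln, gammaln, psi

def g0_entropy(alpha, gamma, L):
    # Equation (6): closed-form Shannon entropy of the G0 intensity law
    return ((1 - L) * psi(L) - (1 - alpha) * psi(-alpha)
            + (L - alpha) * psi(L - alpha)
            + betaln(L, -alpha) + np.log(gamma) - np.log(L))

def g0_logpdf(z, alpha, gamma, L):
    # log of the G0 intensity density, Equation (3)
    logc = (L * np.log(L) + gammaln(L - alpha) - alpha * np.log(gamma)
            - gammaln(L) - gammaln(-alpha))
    return logc + (L - 1) * np.log(z) + (alpha - L) * np.log(gamma + L * z)

# H(Z) = -integral of f log f over (0, inf), cf. Equation (1)
H_num, _ = quad(lambda z: -np.exp(g0_logpdf(z, -3.0, 2.0, 2))
                * g0_logpdf(z, -3.0, 2.0, 2), 0.0, np.inf)
H_closed = g0_entropy(-3.0, 2.0, 2)
```

The two values agree to numerical precision, which also provides a quick sanity check when porting (6) to other software.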

Shannon Entropy Estimators
Several authors have proposed entropy estimators using (2). Most of them are based on order statistics of the sample. Al-Omari [20] presented an overview of these estimators and also proposed a new one. From a parametric point of view, it is natural to consider the maximum likelihood estimator (ML) of the entropy (H ML ).
In what follows, we describe the entropy estimators studied in this paper.

Maximum Likelihood Entropy Estimator
Let Z = (Z_1, . . . , Z_n) be an independent random sample of size n from the G0(α, γ, L) distribution, and assume that L is known. The maximum likelihood estimator of (α, γ), denoted (α̂_ML, γ̂_ML), consists of the values in the parametric space R⁻ × R⁺ that maximize the reduced log-likelihood function

ℓ(α, γ) = n[ log Γ(L − α) − log Γ(−α) − α log γ ] + (α − L) Σ_{i=1}^n log(γ + L Z_i). (7)

Solving (7) requires numerical maximization routines that, under certain circumstances, do not converge [31]. We use the L-BFGS-B version of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [32], which allows box constraints. This algorithm belongs to the quasi-Newton family; it requires only the gradient, not the Hessian matrix. The optimal asymptotic properties of the ML estimator are well known. The ML entropy estimator [33] is

Ĥ_ML = H_{G0}(α̂_ML, γ̂_ML, L). (8)

This estimator inherits all of the good properties of ML estimators (consistency and asymptotic normality), but also their pitfalls: sensitivity to the initial value, lack of convergence due to flatness of (7), and lack of robustness. Convergence problems, which are more prevalent with small samples and with data from textureless areas, were identified by Frery et al. [31] and mitigated with a line-search algorithm. Refs. [9,34,35] studied robust alternatives to (7).
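A minimal sketch of this estimation step, assuming NumPy and SciPy (the helper names are ours), maximizes the reduced log-likelihood (7) with L-BFGS-B under box constraints on α < 0 and γ > 0:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def neg_reduced_loglik(theta, z, L):
    # negative of the reduced log-likelihood (7)
    alpha, gamma = theta
    ll = (z.size * (gammaln(L - alpha) - gammaln(-alpha) - alpha * np.log(gamma))
          + (alpha - L) * np.sum(np.log(gamma + L * z)))
    return -ll

def g0_ml_fit(z, L, x0=(-2.0, 1.0)):
    # box-constrained quasi-Newton search, as in the text; bounds are illustrative
    return minimize(neg_reduced_loglik, x0=x0, args=(z, L), method="L-BFGS-B",
                    bounds=[(-20.0, -0.1), (1e-3, 50.0)])

# synthetic check: true parameters alpha = -3, gamma = 2, L = 2
# (Z = Y/U with Y ~ Gamma(L, rate L) and U ~ Gamma(-alpha, rate gamma))
rng = np.random.default_rng(42)
z = rng.gamma(2, 1.0 / 2, 5000) / rng.gamma(3.0, 1.0 / 2.0, 5000)
res = g0_ml_fit(z, L=2)
alpha_hat, gamma_hat = res.x
```

In practice, the initial value matters: as discussed above, the search may fail to converge for small samples or textureless areas.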

Non-Parametric Entropy Estimators
Assume that Z = (Z_1, Z_2, . . . , Z_n) is a random sample from the law characterized by the distribution function F(z), with order statistics Z_(1) ≤ Z_(2) ≤ · · · ≤ Z_(n). Vasicek [17] proposed the entropy estimator

H_V = (1/n) Σ_{i=1}^n log[ (n/(2m)) (Z_(i+m) − Z_(i−m)) ], (9)

with m < n/2 a positive integer, Z_(i) = Z_(1) for i < 1, and Z_(i) = Z_(n) for i > n. The author proved that this estimator is weakly consistent for H(Z) when m/n → 0 and n, m → ∞. The only possible numerical problem with this estimator and its variants is having zero as the argument of the logarithm, a situation that can easily be checked and solved. Their computational complexity reduces to adding differences of order statistics. These estimators are robust by nature, since they do not depend on any particular model. Differently from the approaches discussed in Refs. [9,34,35], achieving such robustness does not impose a heavy computational burden.
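A direct implementation of (9), assuming NumPy (the function name is ours), clamps the order statistics at both ends of the sample:

```python
import numpy as np

def vasicek_entropy(z, m):
    # Equation (9); order statistics outside [1, n] are clamped to the extremes
    z = np.sort(np.asarray(z))
    n = z.size
    i = np.arange(n)
    upper = z[np.minimum(i + m, n - 1)]  # Z_(i+m), clamped to Z_(n)
    lower = z[np.maximum(i - m, 0)]      # Z_(i-m), clamped to Z_(1)
    return np.mean(np.log(n / (2.0 * m) * (upper - lower)))

rng = np.random.default_rng(7)
gauss = rng.normal(size=20_000)  # true entropy 0.5 log(2 pi e), about 1.419
unif = rng.uniform(size=20_000)  # true entropy 0
```

With large samples and m of the order of the square root of n, the estimates land close to the known theoretical values.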
Several authors introduced modifications to Vasicek's estimator. In this work, we consider the following entropy estimator variants, surveyed by Al-Omari [20].
Under the same conditions as in Al-Omari [37], the authors proved convergence to H(Z) when m, n → ∞ and m/n → 0. The same applies to the Noughabi-Arghami estimator.

Estimator Tuning
The choice of the spacing parameter m in this type of estimator is an important task that is still open. Wieczorkowski and Grzegorzewski [38] proposed the heuristic formula

m_WG = ⌊√n + 0.5⌋. (16)

Our goal is to find a value of m that performs well over a range of texture parameters α and sample sizes n when estimating the entropy under the G0 model. To achieve this goal, we assess the performance of (16) with a Monte Carlo study for each of the entropy estimators presented in Section 2.3.2 under the G0 model. We considered a parameter space comprised of:
• Sample sizes n ∈ {9, 25, 49, 81, 121}, which represent squared windows of sides 3, 5, 7, 9 and 11;
• Texture values α ∈ {−8, −5, −3, −1.5} to depict areas with different levels of roughness, and L = 2 (the L = 1 case was studied by Cassetti et al. [39]).
Since γ is a scale parameter, we based the forthcoming analysis on the condition E(Z) = 1, which links texture and brightness by γ* = −α − 1. To simplify the notation, we write α_j with j = 1, 2, 3, 4, where α_1 = −1.5, α_2 = −3, α_3 = −5, and α_4 = −8. For each fixed n and j, we drew 1000 independent samples z_1, . . . , z_n from G0(α_j, γ*_j, 2), used m = m_WG, and calculated all of the estimators from Section 2.3.2. For each estimator, we thus obtained the sample bias B̂ = (1/1000) Σ_r (Ĥ^(r) − H_j), where H_j is the true entropy from (6), and the sample mean squared error MSE = (1/1000) Σ_r (Ĥ^(r) − H_j)². Then, we analyzed the performance of these estimators in terms of bias and MSE.
In order to improve on the spacing (16), we implemented another strategy to choose, for each sample size n, the best value of m to be used for all textures α. We considered m ∈ {1, 2, . . . , ⌊n/2⌋}, as indicated in (9). We repeated the same methodology as before for each m and each j, obtaining {B̂_1j, . . . , B̂_⌊n/2⌋j}; this vector is represented in the jth column of Table 1. We then calculated, in each row of the table, the average of the absolute value of the bias (shown in the last column of Table 1). The best value is m* = arg min_s |B̄_s·|. Table 1 shows the schema of the methodology employed for a fixed n and a given entropy estimator; each entry B̂_sj represents the bias for m = s and α = α_j. Table 1. Selection criteria for the best m for each n and each entropy estimator, with α_1 = −1.5, α_2 = −3, α_3 = −5, and α_4 = −8. Section 3.1 presents the results of this approach. The spacing values we obtained differ from those of the heuristic formula (16), and they lead to better estimates in terms of bias and mean squared error.
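The selection procedure can be sketched as follows. As an illustration only (not the G0 setting of the paper), the snippet below tunes m for the Vasicek estimator on Exp(1) samples, whose true entropy is exactly 1, by minimizing the absolute Monte Carlo bias over m ∈ {1, . . . , ⌊n/2⌋}:

```python
import numpy as np

def vasicek_entropy(z, m):
    # Vasicek's estimator (9) with clamped order statistics
    z = np.sort(z)
    n = z.size
    i = np.arange(n)
    upper = z[np.minimum(i + m, n - 1)]
    lower = z[np.maximum(i - m, 0)]
    return np.mean(np.log(n / (2.0 * m) * (upper - lower)))

def best_spacing(n, true_entropy, sampler, reps=300, seed=0):
    # pick the m in {1, ..., n//2} whose Monte Carlo bias is smallest in magnitude
    rng = np.random.default_rng(seed)
    ms = np.arange(1, n // 2 + 1)
    bias = np.array([
        np.mean([vasicek_entropy(sampler(n, rng), m) for _ in range(reps)])
        - true_entropy
        for m in ms])
    return ms[np.argmin(np.abs(bias))], bias

# illustration: Exp(1) samples of size n = 25; the true entropy is exactly 1
m_star, bias = best_spacing(25, 1.0, lambda n, rng: rng.exponential(1.0, n))
```

In the paper's setting, the sampler draws from G0(α_j, γ*_j, L), the true entropy comes from (6), and the bias is averaged over the four textures before taking the arg min.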

Classification
To study the performance of the selected entropy estimators in terms of SAR image classification, we divided the analysis into simulated and actual images. We used unsupervised and supervised techniques to choose the three estimators that led to the best classification quality. For the former, we applied the k-means algorithm, which groups data into k classes by setting k centroids and minimizing the variance within each group. This non-hierarchical clustering technique has been applied in many studies in SAR image processing, cf. the works by Niharika et al. [40] and by Liu et al. [41].
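For scalar features such as the estimated entropy, k-means reduces to a simple one-dimensional procedure. A minimal sketch, assuming NumPy (the function name is ours):

```python
import numpy as np

def kmeans_1d(x, k, iters=100, seed=0):
    # minimal k-means for scalar features (e.g., an entropy map flattened to 1-D)
    rng = np.random.default_rng(seed)
    centroids = np.sort(rng.choice(x, size=k, replace=False))
    for _ in range(iters):
        # assign each value to its nearest centroid
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        # recompute each centroid as the mean of its group
        updated = np.array([x[labels == j].mean() if np.any(labels == j)
                            else centroids[j] for j in range(k)])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

# two well-separated synthetic "classes"
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 0.3, 200), rng.normal(10.0, 0.3, 200)])
labels, centroids = kmeans_1d(x, k=2)
```

On well-separated groups, the procedure recovers the two class means and a clean partition.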
For the latter approach we implemented a support vector machine (SVM) algorithm, which is a supervised machine learning technique [42] whose objective is to define, given a set of features, the best possible separation between classes by finding a hyperplane that maximizes the margin of separation between these classes. It is common to accept some misclassification to obtain a better overall performance; this is achieved through the penalizing parameter c.
When data cannot be separated by a hyperplane, they are transformed to a higher-dimensional feature space through a suitable non-linear transformation called a "kernel function". Given x, x′ ∈ Rⁿ, the linear and radial kernels are respectively defined by K_L(x, x′) = ⟨x, x′⟩ and K_R(x, x′) = exp(−g ‖x − x′‖²), for g > 0.
We randomly selected 1000 pixels in each of the four regions, far enough away from the boundaries, to find the best kernel and hyper-parameters. This reference sample was divided into two sets: training and validation (80% of the sample), and testing (20%). We considered linear and radial kernels, with the penalizing parameters c ∈ {0.001, 0.01, 0.1, 1, 5, 10} and g ∈ {0.01, 0.1, 1, 1.5, 2}. With the training-validation set, we performed 5-fold cross-validation and computed the mean and the standard deviation of the F1-scores. Recall that F1 = 2 · TPR · PPV/(TPR + PPV), where TPR is the True Positive Rate and PPV is the Positive Predictive Value.
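This model selection step can be sketched with scikit-learn (an assumption on our part; the paper does not name its software), using the same kernel and hyper-parameter grids on stand-in features:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# stand-in scalar "entropy features" for two classes (values are illustrative)
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.9, 0.15, (300, 1)),
                    rng.normal(1.6, 0.15, (300, 1))])
y = np.repeat([0, 1], 300)

# the same kernel and hyper-parameter grids as in the text
param_grid = [
    {"kernel": ["linear"], "C": [0.001, 0.01, 0.1, 1, 5, 10]},
    {"kernel": ["rbf"], "C": [0.001, 0.01, 0.1, 1, 5, 10],
     "gamma": [0.01, 0.1, 1, 1.5, 2]},
]

# 5-fold cross-validation, scored with the F1 measure
search = GridSearchCV(SVC(), param_grid, scoring="f1_macro", cv=5)
search.fit(X, y)
```

search.best_params_ then plays the role of the selected kernel and hyper-parameters reported in Table 5.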
This approach has been applied in different areas, such as sea oil spill monitoring [43], pattern recognition [44], and classification of polarimetric SAR data [2], among other applications.
We used different measures of quality depending on the type of classification. In the unsupervised case, we used the Calinski-Harabasz (CH) [45] and Davies-Bouldin (DB) [46] indexes, while we present the Kappa coefficient for the supervised classification. We also show the accuracy of both algorithms. All of these measures should be interpreted as "bigger is better", except for the DB index, for which "lower is better".

Figure 2 presents the bias and the MSE under the Wieczorkowski and Grzegorzewski [38] criterion, for the L = 2 case and for all of the estimators analyzed except the Al-Omari (14) and Ebrahimi (15) estimators; these two presented large bias and were thus discarded from further analysis. No single estimator performs best for all α values, but H_C and H_AO1 present low bias and low MSE in all of the cases studied except α = −1.5. The bias of the other estimators converges to zero more slowly in all of the cases studied. Table 2 shows the best m chosen according to the methodology used for L = 1 and L = 2, for samples coming from the G0 distribution. Notice that, with few exceptions, the optimal spacing m is smaller than the one given by the empirical formula m_WG.

Performance of the Nonparametric Estimators for the Selected m Value
In order to study the behavior of our proposal for the selection of m, we performed a Monte Carlo simulation as described in Section 2.4. Figure 3 shows the results, in terms of bias and MSE, for the chosen m and the L = 2 case; we also plotted the H_ML estimator. There is an improvement in entropy estimation, in terms of bias and MSE, with our methodology compared with the heuristic formula (16) for all of the estimators studied. All of them show a faster convergence of the bias to zero and are competitive with H_ML in terms of bias and MSE for sample sizes larger than 81. As mentioned, the optimized spacing leads, in most cases, to the use of more samples than the (16) criterion. This suggests that the latter takes an optimistic view of the information content of each sample, at least when dealing with G0 deviates. In other words, these observations are less informative for the estimation of the entropy; because of this, a smaller spacing, i.e., larger samples, is required to achieve good estimation quality.
In the following, we present empirical results classifying a simulated SAR image.

Simulated Image
We generated two 300 × 300 images with observations coming from G0 distributions with L = 1, 2, γ = 0.1, and four different classes: α ∈ {−1.5, −3, −5, −8}. Figure 4a shows the image obtained with L = 2, where the brightest area corresponds to α = −1.5, i.e., extremely textured observations. As the brightness decreases, the texture changes from heterogeneous (α = −3 and −5) to a homogeneous zone corresponding to the darkest area (α = −8). As the performance measures were similar in the L = 1 and L = 2 cases, i.e., single and multi-look, we only show results for the latter. We computed a map of estimated entropies (Ĥ) with each estimator by sweeping the image with sliding windows of sizes s × s, for s ∈ {3, 5, 7, 9, 11}; these are the sample sizes studied in Section 2.4. Then, we used Ĥ as a feature to classify with both the unsupervised and supervised techniques. Figure 4b shows the result of classifying, with the k-means algorithm, the map of H_C values obtained with s = 9. Figure 4c shows the accuracy as a function of the sample size. It can be observed that a 9 × 9 window presents the best accuracy. It can also be seen that the H_ML estimator has the worst performance, whereas H_C, H_NA and H_VE show the best performance. These results are corroborated by the values shown in Table 3, in which the best performances are shown in bold font. Table 4 presents the CH and DB values for the best sample size (s = 9, n = 81). According to CH, H_C, H_NA, and H_VE have the best performance, whereas DB selected H_C, H_NA, and H_V as the best.
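The entropy-map computation can be sketched as follows, assuming NumPy; the window sweep and the Vasicek estimator are spelled out explicitly (names are ours), and border pixels without a full window are left undefined:

```python
import numpy as np

def vasicek_entropy(z, m):
    # Vasicek's estimator (9) with clamped order statistics
    z = np.sort(z)
    n = z.size
    i = np.arange(n)
    upper = z[np.minimum(i + m, n - 1)]
    lower = z[np.maximum(i - m, 0)]
    return np.mean(np.log(n / (2.0 * m) * (upper - lower)))

def entropy_map(img, s, m):
    # sweep an s x s window over the image; borders without a full window stay NaN
    r = s // 2
    h, w = img.shape
    out = np.full((h, w), np.nan)
    for i in range(r, h - r):
        for j in range(r, w - r):
            win = img[i - r:i + r + 1, j - r:j + r + 1].ravel()
            out[i, j] = vasicek_entropy(win, m)
    return out

rng = np.random.default_rng(3)
img = rng.exponential(1.0, (30, 30))  # speckle-like stand-in; true entropy 1
hmap = entropy_map(img, s=9, m=4)
```

The resulting map is then fed, pixel by pixel, to the unsupervised or supervised classifier as a scalar feature.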
We also provide, for the sake of comparison, quality measures obtained by H_ML. Table 5 shows the selected kernels and hyper-parameters that maximize the mean and minimize the variance of the F1-score. The best models were trained using the whole reference sample and applied to classify the complete image. The accuracy and κ coefficient were computed, and the results are shown in Figure 5. The best accuracy values are shown in Table 6, as well as the models that achieved them. The optimal value for L = 1 was obtained with a sliding window of size 9 × 9, while for L = 2 the sizes 9 × 9 and 7 × 7 presented similar (best) values. With the purpose of providing a unified criterion, we chose a sliding window of size 9 × 9 to perform the analysis. Table 5. Best kernel (L: linear, R: radial) and hyper-parameters for SVM applied to simulated SAR data. Tables 7 and 8 show the confusion matrices when the models are applied to the simulated images with L = 1, 2. It can be observed that, for L = 1, H_C, H_NA, and H_V outperformed H_ML for extremely high, high, and middle textured areas, respectively. For L = 2, H_ML performed better than the other models, except for regions with a very high level of texture, in which H_V and H_AO1 produced better results. Table 7. Confusion matrices for synthetic data with L = 1 (in percentage). Best values marked in bold.


Actual Images
We assessed our proposal with two SAR images. First, we considered an image of the surroundings of Munich, Germany, of size 459 × 494 pixels, acquired in L-band, HV polarization, and single-look complex format. Second, we used a subsample of 500 × 645 pixels of a full PolSAR image of California's San Francisco bay area, taken by the NASA/JPL AIRSAR L-band instrument in intensity format.
We applied the SVM algorithm to both actual images, replicating the procedure described in the study of simulated data, using the entropy estimator as a feature for classification of the three polarizations.
The Equivalent Number of Looks (ENL) for uncorrelated data is defined as ENL = 1/CV², the reciprocal of the squared sample coefficient of variation CV = σ̂/μ̂, where σ̂ is the sample standard deviation and μ̂ is the sample mean [47]. In order to find the ENL in each polarization band of the image of San Francisco, we manually selected samples from homogeneous areas in each band and calculated the ENL as an average weighted by the sample size per band. The final ENL is the average of the estimates over the polarizations: we obtained 2.53, 3.41, and 3.41 in the HH, HV, and VV bands, respectively, so we considered the ENL equal to 3.12 for the whole image. We then used the same spacings m for L = 2 and L = 3. Figures 6 and 7 show the training samples selected to perform the supervised classification in both images. In the first case, we worked with three types of regions: urban (red), forest (dark green), and pasture (light green). In the other case, we selected five areas: water (blue), urban zone (red), vegetation (green), pasture (yellow), and beach (orange). We subsequently included the CV as a feature in the classification process. In this case, the best performance for the image of San Francisco was achieved with the linear kernel and a cost of 10, except for H_AO1 and H_V, which performed best with a radial kernel using c = 5 and g = 1, respectively, and H_ML, with a radial kernel using c = 10 and g = 1.5. On the other hand, the radial kernel produced the best results for the image of Munich. Tables 9 and 10 present the test accuracy and Kappa index. We also show the validation accuracy, which was computed using cross-validation with five folds; these values are similar to the test accuracy, showing that there is no evidence of overfitting. In addition, we show that including the CV coefficient as a feature in the classification problem improved the results.
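The ENL computation over a homogeneous area can be sketched as follows (assuming NumPy): for unit-mean Gamma speckle with L looks, CV² = 1/L, so the estimate should recover L:

```python
import numpy as np

def enl(sample):
    # ENL = 1 / CV^2, the reciprocal of the squared coefficient of variation
    cv = sample.std(ddof=1) / sample.mean()
    return 1.0 / cv ** 2

# over a homogeneous area, intensity speckle is ~ unit-mean Gamma(L, L); take L = 4
rng = np.random.default_rng(1)
homogeneous = rng.gamma(4.0, 1.0 / 4.0, 100_000)
```

In the paper's procedure, this estimate is computed per manually selected homogeneous patch, averaged with weights proportional to patch size within each band, and then averaged across polarizations.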
If we only consider the entropy, H_VE showed the best performance in both the single and multilook cases. However, if we add the CV as a feature, then H_C appears as the best classifier, followed by H_NA and H_ML, for the single-look case, and H_AO1, followed by H_NA and H_V, for the multilook case. Figures 8 and 9 exhibit the classification of the whole images when our proposal is applied. It can be observed that, in the case of the image of San Francisco, the classifiers distinguished the beach and, with the addition of the CV, some roads surrounded by trees were better classified. The processing time is an important feature when proposing a new estimator. Table 11 shows the processing time, measured in minutes, needed to compute a map of estimated entropies by moving through the Munich and San Francisco images with sliding windows of size 9 × 9, for each one of the estimators. It can be seen that H_V had the shortest processing time, followed by H_VE and H_NA. We conclude this section by comparing the results of classifying with entropy estimates against those obtained with a classical approach. Table 12 compares our best models against the technique that applies the improved Lee filter [48] and then classifies with SVM. Figure 10 shows the classification of the whole images with the alternative method. It can be observed that our proposal offers advantages that the alternative method does not.

Conclusions
We assessed the performance of six non-parametric entropy estimators in conjunction with the ML estimator in terms of bias, MSE and image classification for single and multilook cases.
On the one hand, the advantage of these non-parametric estimators is that they are very simple to implement, since they do not assume any model and do not need optimization algorithms. On the other hand, they depend on a spacing parameter m. Although the literature recommends a heuristic value, we proposed a criterion for choosing the value of m that presents the smallest bias in the entropy estimation over all of the texture values and sample sizes analyzed. This criterion performs better than the one proposed by Wieczorkowski and Grzegorzewski [38].
With these values for m, we applied unsupervised (k-means) and supervised (SVM) classification algorithms to both simulated and actual data, and compared their performance with the H ML entropy estimator. We showed evidence that H VE presents the best performance in terms of accuracy and kappa index for both single and multilook cases, when it is applied to actual images. However, when we added the coefficient of variation as a feature used by the classifier, both measures improved and the best estimators changed. H C and H AO 1 performed the best for the single and multilook cases, respectively, showing an improvement of 1% for the former and of 3% for the latter. However, these two estimators require longer processing times than the others.
We completed the analysis by comparing our proposal with another technique that combines the improved Lee filter with an SVM classifier, showing that the entropy-based approach presents better accuracy indexes.
Hence, we strongly recommend considering these non-parametric estimators because of the simplicity of their implementation and their good performance.