Remote-Sensing Image Classification Based on an Improved Probabilistic Neural Network

This paper proposes a hybrid classifier for polarimetric SAR images. The feature sets consist of span image, the H/A/α decomposition, and the GLCM-based texture features. Then, a probabilistic neural network (PNN) was adopted for classification, and a novel algorithm proposed to enhance its performance. Principle component analysis (PCA) was chosen to reduce feature dimensions, random division to reduce the number of neurons, and Brent’s search (BS) to find the optimal bias values. The results on San Francisco and Flevoland sites are compared to that using a 3-layer BPNN to demonstrate the validity of our algorithm in terms of confusion matrix and overall accuracy. In addition, the importance of each improvement of the algorithm was proven.


Introduction
The classification of different objects, as well as different terrain characteristics, with single channel monopolarisation SAR images can carry a significant amount of error, even when operating after multilooking [1]. One of the most challenging applications of polarimetry in remote sensing is landcover classification using fully polarimetric SAR (PolSAR) images.

OPEN ACCESS
The Wishart maximum likelihood (WML) method has often been used for PolSAR classification [2]. This method uses the amplitudes of the elements in the covariance or coherency matrices. However, it does not explicitly take into consideration the phase information within polarimetric data, which plays a direct role in the characterization of a broad range of scattering processes. Furthermore, the covariance or coherency matrices are determined after spatial averaging and therefore can describe only stochastic scattering processes, while certain objects, such as man-made objects, are better characterized at a pixel-level [3].
To overcome above shortcomings, polarimetric decompositions were introduced with an aim to establish a correspondence between the physical characteristics of the considered areas and the observed scattering mechanisms. There are seven famous decomposition methods: Pauli [4], Krogager [5], Freeman [6], Huynen [7], Barnes [8], Cloude [9] and Holm [8]. The most effective method among these is the Cloude decomposition, also known as the H/A/α method. method. Section 7 applied our method to crop classification on Flevoland site. Section 8 discusses the significances of combined feature sets, random division, and PCA. Finally, Section 9 concludes this paper.

Basic Introduction
The features are derived from the multilook coherence matrix of the polarimetric SAR data. Suppose S stands for the measured scattering matrix: where S qp represents the scattering coefficients of the targets, p the polarization of the incident field, q the polarization of the scattered field. S hv equals to S vh since reciprocity applies in a monostatic system configuration. The Pauli decomposition expresses the scattering matrix S in the so-called Pauli basis, which is given by the following three 2×2 matrices: 1 0 1 0 0 1 1 1 1 , , Thus, S can be expressed as: An RGB image could be formed with the intensities |a| 2 , |b| 2 , |c| 2 . The meanings of S a , S b , and S c are listed in Table 1. Table 1. Pauli bases and their corresponding meanings.

Pauli Bases
Meaning S a Single-or odd-bounce scattering S b Double-or even-bounce scattering S c Those scatterers which are able to return the orthogonal polarization to the one of the incident wave (forest canopy)

Coherence Matrix
The coherence matrix is obtained as: 11

T T T T a b c a b c T T T T T T
The average of multiple single-look coherence matrices is the multi-look coherence matrix. (T 11 ,T 22 ,T 33 ) usually are regarded the channels of the polarimetric SAR images.

Feature Extraction
The proposed features can be divided into three types, which are explained below.

Span
The span or total scattered power indicates the received power by a fully polarimetric system and is given by:

H/A/Alpha Decomposition
Cloude and Potter [9] proposed an algorithm to identify in an unsupervised way polarimetric scattering mechanisms in the H-α plane. The method extends the two assumptions of traditional ways: 1) azimuthally symmetric targets; 2) equal minor eigenvalues λ 2 and λ 3 .
T can be rewritten as: 1  2  3   3  1  1  1  2  2  2  3  3  3   1  1  1  2  2  2  3  3  3 cos cos cos sin cos exp( ) sin cos exp( ) sin cos exp( ) sin sin exp( ) sin sin exp( ) sin sin exp( ) Then, the pseudo-probabilities of the T matrix expansion elements are defined as: The entropy indicates the degree of statistical disorder of the scattering phenomenon. It can be defined as: For high entropy values, a complementary parameter (anisotropy) is necessary to fully characterize the set of probabilities. The anisotropy is defined as the relative importance of the second scattering mechanisms [21]: The four estimates of the angles are easily evaluated as: Thus, vectors from coherence matrix can be represented as (H, A, α , β , δ , γ ).

Texture Features
The Gray level co-occurrence matrix (GLCM) is a text descriptor which takes into account the specific position of a pixel relative to another. The GLCM is a matrix whose elements correspond to the relative frequency of occurrence of pairs of gray level values of pixels separated by a certain distance in a given direction [22]. Formally, the elements of a GLCM G(i,j) for a displacement vector (a,b) is defined as ( , ) |{( , ),( , ) : Where (t,v) = (x+a, y+b), and |•| is the cardinality of a set. The displacement vector (a,b) can be rewritten as (d, θ) in polar coordinates.
The four features are extracted from normalized GLCMs, the sum of which is equal to 1. Suppose the normalized GLCM value at (i,j) is p(i,j), and their detailed definition are listed in Table 2. Table 2. Properties of GLCM.

Property Description Formula
Contrast Intensity contrast between a pixel and its neighbor

Total Features
The texture features consist of 4 GLCM-based features, which should be multiplied by 3 since there exist three channels (T 11 ,T 22 ,T 33 ). In addition, there are one span feature, and six H/α parameters. In all, the total features are 1 + 6 + 4 × 3 = 19.

Mechanism of PNN
Neural networks are widely used in pattern classification since they do not need any information about the probability distribution and the a priori probabilities of different classes. PNNs are basically pattern classifiers. They combine the well known Bayes decision strategy with the Parzen non-parametric estimator of the probability density functions (PDF) of different classes. PNNs have been of interest because they yield a probabilistic output and are easy to implement.
Taking a two categories situation as an example, we should decide the known state of nature θ to be either θ A or θ B . Suppose a set of measurements is obtained as p-dimensional vector x = [x 1 , …, x p ], the Bayes decision rule becomes: Here, f A (x) and f B (x) are the PDF for categories A and B, respectively. l A is the loss function associated with the wrong decision d(x) = θ B when θ = θ A , l B is the loss function associated with the wrong decision d(x) = θ A when θ = θ B , and the losses associated with correct decisions are taken to be zero. h A and h B are the a priori probability of occurrence of patters from category A and B, respectively.
In a simple case that assumes the loss function and a priori probability are equal, the Bayes rule classifies an input pattern to the class with higher PDF. Therefore, the accuracy of the decision boundaries depends on what the underlying PDFs are estimated. Parzen's results can be extended to estimate in the special case where the multivariate kernel is a product of univariate kernels. In the particular case of the Gaussian kernel, the multivariate estimates can be expressed as: Here, m is the number of training vectors in category A, p is the dimensionality of the training vectors, x Ai is the ith training vector for category A, and σ is the smoothing parameter. It should be noted that f A (x) is the sum of small multivariate Gaussian distributions centered at each training sample, but the sum is not limited to being Gaussian. Figure 1 shows the outline of PNN. When an input is presented, the first layer computes distances from the input vector to the input weights (IW), and produces a vector whose elements indicate how close the input is to the IW. The second layer sums these contributions for each class of inputs to produce as its net output a vector of probabilities. Finally, a compet transfer function on the output of the second layer picks the maximum of these probabilities, and produces a 1 for that class and a 0 for other classes.

PNN Structure
The mathematical expression of PNN can be expressed as: In this paper, the radbas is selected as: The compet function is defined as: This type of setting can produce a network with zero errors on training vectors, and obviously it does not need any training.

Figure 1.
Outline of PNN (R, Q, and K represent number of elements in input vector, input/target pairs, and classes of input data, respectively. IW and LW represent input weight and layer weight, respectively).

Shortcomings of Traditional PNN
Suppose P and T denote the set of training vector x and corresponding target vector y, namely, P = [x 1 , x 2 , …x Q ], and T = [y 1 , y 2 , y Q ]. IW and LW are set traditionally as follows: However, it is obvious that Q is usually very large, and then the net will be too big and consume too much computation time. On the other hand, to simplify the setting of bias b, all of its components are considered as equal [23]. Even so, the setting of b is still a challenge. Although errors on training vectors are always zero, the errors on test vectors are greatly dependent with the value of b.
If it is too small, the spread of each radial basis layer function becomes too large, and the network will take too many nearby design vectors into account, moreover, the radial basis neurons will output large values (near 1) for all the inputs used to design the network. If it is too larger, the spread becomes near zero, and the network will degrades as a nearest neighbor classifier.

A Novel Method of Weights/Biases Setting
Here we propose a novel method to solve the above two problems. The main idea is shown in Figure 2. Our improvement lies in the PCA, the random division, and the single variable optimization.

Feature Reduction
Excessive features increase computation times and storage memory. Furthermore, they sometimes make classification more complicated, which is called the curse of dimensionality. It is necessary to reduce the number of features.
Principal component analysis (PCA) is an efficient tool to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining most of the variations. It is achieved by transforming the data set to a new set of ordered variables. This technique has three effects: it orthogonalizes the components of the input vectors so that uncorrelated with each other, it orders the resulting orthogonal components so that those with the largest variation come first, and eliminates those components contributing the least to the variation in the data set.
It should be noted that the input vectors should be normalized to have zero mean and unity variance before performing PCA, which is shown in Figure 3. The normalization is a standard procedure. Details about PCA can be found in Ref. [24].

Random Division
Realistic sample numbers Q are generally very large, which leads to a quite large PNN. Thus, we divide the available data into two subsets: training subset and validation subset. The ratio of each is called trainRatio and validRatio respectively. In order to save the storage room of the net and to fasten the computation, the trainRatio is set as small as possible, and meanwhile it should not affect the accuracy of the NN.

Optimization by Brent's Search
The goal of finding optimal b can be obtained by solving this problem: find the minimum MSE on validation subset of the corresponding b. This can be depicted as a single-variable optimization problem in the dash-line rectangle in Figure 2. Brent's Search (BS) method is adopted to solve this optimization problem.
BS is a linear search, a hybrid of the golden section search and a quadratic interpolation. Golden section search has a first-order rate of convergence, while polynomial interpolations have an asymptotic rate faster than super-linear. On the other hand, the rate of convergence for the golden section search starts when the algorithm is initialized, whereas the asymptotic behavior for the polynomial interpolations can take many iterations to become apparent. BS attempts to combine the best features of both approaches. BS has the advantage that it does not require computation of the derivative, which greatly fits the optimization problem.

Terrain Classification
The NASA/JPL AirSAR L-band data for the San Francisco (California, USA) area was used for the experiments. Its size is 1,024 × 900. In order to reduce the computations, the sub-area with size 600 × 600 was extracted from the left-upper point of original image. The ground truth of the test site can be found at Ref. [2].
Quantitative information about the experiment is described as follows, where '•' denotes parameters known before simulation and '♦' denotes the parameters obtained at the initial stage of the experiment. Number

Denoising by Lee Filter
The sub-area (600 × 600) is shown in Figure 4(a). The Refined Lee filter (Window size = 7) is used to reduce the speckle noise and the results are shown in Figure 4(b). The Lee filter adapts the amount of filtering to the local statistics. Homogeneous areas are filtered with the maximum strength where point scatterers are let unfiltered. The refined filter could use directional windows to preserve edges and heterogeneous features [25].

Full Features Set
Then, the basic span image and three channels (T 11 ,T 22 ,T 33 ) are easily obtained and shown in

Feature Reduction by PCA
The curve of cumulative sum of variance with dimensions of reduced vectors via PCA is shown in Figure 10. The detailed data are listed in Table 3. It shows that only 11 features, half the original features only, could preserve 96.36% of variance.

Training Preparation
The classification is run over three classes, the sea, the urban areas and the vegetated zones. The training and testing areas are selected manually shown in Figures 11(a)-(b), respectively. Each square has a size of 60×60. In total, there are 21,600 pixels for training, and 10,800 pixels for testing. In this experiment, trainRatio is adjusted finally as 0.01, namely, the validRatio equals 0.99. In this way, the network only has 1% neurons of that constructed by traditional approach. The training subset and validation subset of the training area are divided randomly.

Weights/Biases Setting
The IW and LW are easily set according to our novel approach, and the number of neurons decreases from 21,600 to only 216. The b is estimated by BS method. Its initial range is set as [0.01, 20], which is large enough to contain the optimal point. The curve of classification error versus the steps is shown in Figure 12. It is evident that the classification error converges at only three steps shown in the red dot. However, BS will continue to search the best b value since the tolerance of b is set as small as 1e-3. The whole process of the change of b is shown in Figure 13.  Step b The optimal b is found as 4.73, with the smallest error 1.557%, namely, the highest classification accuracy 98.44%.

Application to the Whole Image
We use the trained PNN to classify the whole image, and the results are shown in Figure 14. The brims of length 3 are not calculated considering the local area of GLCM, so the size here is only 594 × 594. From Figure 14 it makes clear that the sea can be classified perfectly, while the vegetated and urban areas are easily inter-confused. The next section will calculate the confusion matrix which reflects the degree of confusion between the three classes.

Comparison with Other Approaches
Finally, our method is compared to the 3-layer BPNN [16]. The confusion matrices (CM) by each methods on training area and testing area are listed in Table 4. The element of ith row and jth column in the 3 × 3 matrix represents the amount of pixels belonging to class j as user defined are assigned to class i after the supervised classification.  It is obvious that the classification accuracies of our proposed method in training area are all higher than 32.5% (33.3% denotes the perfect classification). For the testing area, classification accuracies are all higher than 30.1%. The main drawback is around 3.3% of vegetated zones are misclassified as urban area. The overall accuracies are calculated as CM 11 + CM 22 + CM 33 and listed in Table 5, that demonstrates our method has a higher overall accuracy in both training area and testing area than those of 3-layer BPNN. The reason our method outperforms the 3-layer BPNN lies in not only the fact that PNN is adept at predicting the probabilistic results, but also the selected features sets are more discernable.

Crop Classification
Flevoland, an agricultural area in The Netherlands, was chosen as another example. The site is composed of strips of rectangular agricultural fields. The scene is designated as a supersite for the earth observing system (EOS) program, and is continuously surveyed by the authorities. The ground truth of the test site can be seen in Ref [26].

Refine Lee Filter
The Pauli image of Flevoland is shown in Figure 15(a), and the refined Lee filtered image (Window Size = 7) is shown in Figure 15(b).

Full Features
The basic span image and three channels (T 11 ,T 22 ,T 33 ) are easily obtained and shown in Figure 16.

Feature Reduction
The curve of cumulative sum of variance with dimensions of reduced vectors via PCA is shown in Figure 21. The detailed data are listed in Table 6. It shows that only 13 features, which are only half the original features, could preserve 98.06% of variance.

Training Preparation
The classification is run over 11 classes, bare soil 1, bare soil 2, barley, forest, grass, Lucerne, peas, potatoes, rapeseed, stem beans, and sugar beet. They are selected manually according to the ground truth [26]. The training set and testing set are shown in Figure 22. Each square has a size of 20 × 20. In total, there are 5,200 pixels for training, and 5,200 pixels for testing.
Since the types of classes increase (3 to 13) while the available data decrease (21,800 to 5,200), thus, we should dispose more data to the training subset. Finally, trainRatio is adjusted as 0.2, while validRatio is set as 0.8. In this way, the network only has 1/5 neurons of that constructed by traditional approach. The training subset and validation subset of the training area are divided randomly.

Weights/Biases Setting
The IW and LW are easily set according to our novel approach, and the number of neurons decreases from 5200 to only 1040. The b is estimated by BS method. Its initial range is set as before. The curve of classification error versus the steps is shown in Figure 23. It is evident that the classification error reaches the minimum at 17th steps shown in the red dot. The whole process of the change of b is shown in Figure 24. The optimal b is found as 1.0827, with the smallest error 7.8%, namely, the highest classification accuracy on the validation subset of training area is 98.44%.

Classification Results
The confusion matrices on training area and testing area are calculated and listed in Figure 25 and Figure 26. The overall accuracy of our method on training area and test area are 93.71% and 86.2% respectively.
We apply our method on the whole image. The results are shown in Figure 27. From Figure 27 it is clear that our method can classify most of areas correctly.

Discussion
The BS has an important effect on our algorithm as shown in Figure 12 and Figure 23. It can guide users to find the optimal b value in quite short steps. Otherwise, the users will take a long time with the help of exhaustive search algorithm. The function of the combined feature, random division and PCA will be discussed in detail at following paragraphs.

Single Type of Feature Set versus Combined Feature Sets
The feature sets can be divided into two types. One is the polarimetric feature set, which contains the span, the six H/A/α parameters; the other is the texture feature set, which contains the properties extracted from the GLCM. Table 7 lists the classification accuracy of classifiers using polarimetric feature set, texture feature set, and combined feature set. It indicates that the polarimetric features contribute most to the classification while the texture feature contribute less. Then, we can find that the combined feature set performs better than each single. Thus, our classifier using combined feature set can be regarded as an feature fusion method.

With and without Random Division
If we do not use the random division, the structure of PNN will increase 1/trainRatio times. Consequently, the computation will become a burden with very little improve on classification overall accuracy. Taking the San Francisco area as the example, four square areas of different size are picked out randomly from the image, and are classified by PNNs with and without random division. The computation time and overall accuracy of each are listed in Table 8.  Table 8 indicates that the computation time of traditional method is only 46 times of that of our method for 10 × 10 area, however, the ratio rockets to 516 for 40 × 40 area. Moreover, for a larger size area, such as 50 × 50, it cannot work because of the lack of memory.
From another point of view, the overall accuracy of traditional method was expected to be much higher than that of our method since it uses a great many neurons, whereas, in fact, they are nearly the same. The reason may consist of the optimization of b in our method. Accordingly, our method of weights/biases setting is valid and effective, and it is superior to traditional method in terms of computation time and storage room while it can maintain a high overall accuracy.

With and without PCA
PNNs with and without PCA are investigated in the same manner as in Section 7.1. Their computation times are depicted in Figure 28, which indicates that PNN with PCA enjoys a less computation time than that of PNN without PCA. Their time differences are gradually becoming large as the width of the randomly selected area is increasing.
In addition, the overall accuracies of these two PNNs are observed. It should be noted that input data of the PNN without PCA still should be normalized although the PCA is omitted, otherwise the performance of PNN will decrease rapidly.
The overall accuracies obtained by the two PNNs are pictured in Figure 29. It demonstrates that the PNN with PCA outperforms PNN without PCA on the small test area (width < 40). As the area becomes large (40 < width < 47), the PNN without PCA is better. Finally, as the area becomes large enough (width > 47), these performances of the two PNNs are nearly equivalent. Therefore, our method embedding PCA can performs faster, and has no loss of overall accuracy.

Conclusions
In this paper, a hybrid feature set has been introduced which is made up of the span image, the H/A/α decomposition, and the GLCM-based texture features. Then, a probabilistic neural network has been established. We proposed a novel weights/biases setting method based on Brent's method and PCA. The method can decrease the feature set, reduce the number of neurons, and find optimal bias values.
Experiments of terrain classification on a San Francisco site and a crop classification on Flevoland show that our method can obtain good results which are more accurate than those of 3-layer BPNN. Afterwards, combined feature set, random division and PCA are assumed to be omitted in turn, and the results prove the indispensability of each improvement.