Crop Classification Using MSCDN Classifier and Sparse Auto-Encoders with Non-Negativity Constraints for Multi-Temporal, Quad-Pol SAR Data

Abstract: Accurate and reliable crop classification information is a significant data source for agricultural monitoring and food security evaluation research. It is well known that polarimetric synthetic aperture radar (PolSAR) data provides ample information for crop classification. Moreover, multi-temporal PolSAR data can further increase classification accuracy, since crops show different external forms as they grow. In this paper, we distinguish crop types with multi-temporal PolSAR data. First, due to the "dimension disaster" of multi-temporal PolSAR data caused by excessive scattering parameters, a sparse auto-encoder network with non-negativity constraint (NC-SAE) was employed to compress the data, yielding efficient features for accurate classification. Second, a novel crop discrimination network with multi-scale features (MSCDN) was constructed to improve the classification performance, which proved superior to the popular convolutional neural network (CNN) and support vector machine (SVM) classifiers. The performance of the proposed method was evaluated and compared with traditional methods using simulated Sentinel-1 data provided by the European Space Agency (ESA). For the final classification results of the proposed method, the overall accuracy and kappa coefficient reached 99.33% and 99.19%, respectively, which were almost 5% and 6% higher than those of the CNN method. The classification results indicate that the proposed methodology is promising for practical use in agricultural applications.


Introduction
Crop classification plays an important role in remote sensing monitoring of agricultural conditions, and it is a premise for further monitoring of crop growth and yields [1,2]. Once the categories, areas and spatial distribution of crops have been acquired in a timely and accurate manner, they can provide scientific evidence for the reasonable adjustment of agricultural structure. Therefore, crop classification has great significance for the guidance of agricultural production, the rational distribution of farming resources and the guarantee of national food security [3][4][5].
With the continuous advancement of remote sensing technology and its theory, it has been extensively applied in agricultural fields such as crop census, growth monitoring, yield prediction and disaster assessment [6][7][8][9]. Over the past several years, optical remote sensing has been widely applied in crop classification due to its objectivity, accuracy, wide monitoring range and low cost [10]. For example, Tatsumi et al. adopted a random forest classifier with time-series Landsat 7 ETM+ data to classify eight crop classes in southern Peru; the final overall accuracy and kappa coefficient were 81% and 0.70, respectively [11]. However, optical remote sensing data are susceptible to cloud and shadow interference during collection, so it is difficult to obtain effective continuous optical data in the critical periods of crop morphological change. In addition, optical remote sensing data only reflect the spectral signature of the target surface. Given the wide variety of ground objects, there exists the phenomenon of "same object with different spectra and different objects with the same spectrum". Therefore, crop classification accuracy based on optical remote sensing data is limited to a certain extent. Unlike optical remote sensing, PolSAR is an active microwave remote sensing technology whose operation is not restricted by weather and climate. Meanwhile, besides the signature of the target surface, SAR data provide additional signatures of the target due to the penetrability of microwaves. Therefore, increasing attention has been paid to research on crop classification with PolSAR data [12,13]. However, constrained by the development level of radar technology, the majority of crop classification research has used single-temporal PolSAR data, and a single-temporal PolSAR image offers only limited information for crop identification.
It is thus very difficult to identify different crop categories that exhibit the same external appearance in a certain period, especially during the sowing period [14]. Therefore, it is necessary to collect multi-temporal PolSAR data to further improve crop classification accuracy.
In the recent two decades, an increasing number of satellite-borne SAR systems have been launched successfully and operate on-orbit, making it possible to acquire multi-temporal remote sensing data for a desired target [15][16][17]. At present, there are several representative systems available for civilian applications, such as the L-band Uninhabited Aerial Vehicle Synthetic Aperture Radar (UAVSAR) [18], the C-band Sentinel-1 [19,20], GF-3, RADARSAT-2 and the Radarsat Constellation Mission (RCM) [21], and the X-band Constellation of Small Satellites for Mediterranean basin observation (COSMO) and COSMO-SkyMed 2nd Generation (CSG) [22]. Through these on-orbit SAR systems, a number of multi-temporal PolSAR images of the same area can be readily acquired for crop surveillance and other related applications. Additionally, crops show different scattering characteristics in different growing periods, which greatly improves their classification accuracy [23][24][25].
Recently, a number of classification algorithms with PolSAR data have been presented in the literature, which can be roughly divided into three categories: (1) algorithms based on statistical models [26]; for example, Lee et al. proposed a classical classifier based on the complex Wishart distribution [27]; (2) algorithms based on the scattering mechanisms of polarization [28], where points with the same physical meaning are classified using polarization scattering parameters obtained by coherent and incoherent decomposition algorithms (such as Pauli decomposition [29], Freeman decomposition [30], etc.) [31][32][33][34][35]; and (3) classification schemes based on machine learning [36], e.g., the support vector machine (SVM) [37] and various neural networks [38]. For instance, Zeyada et al. used the SVM to classify four crops (rice, maize, grape and cotton) in the Nile Delta, Egypt [39].
With the collection of multi-temporal PolSAR data, various classification algorithms based on time-series information have also been developed. For example, the long short-term memory (LSTM) network has been exploited to recognize and classify multi-temporal PolSAR images [40]. Zhong et al. classified the summer crops in Yolo County, California using the LSTM algorithm with Landsat Enhanced Vegetation Index (EVI) time series [25]. It can be seen that the research and application of multi-temporal PolSAR data are constantly progressing. The performance of the LSTM network mainly depends on its input features, so a large number of decomposition algorithms have been developed to extract polarization scattering characteristics [41][42][43][44]. However, the direct use of polarization features results in the so-called "dimension disaster" problem for the various classifiers. Therefore, dimension reduction of the extracted multi-temporal features has become significant work. Some methods, such as principal component analysis (PCA) [45] and locally linear embedding (LLE) [46], are popular for feature compression to solve the "dimension disaster" problem. For instance, the PCA method provides the optimal linear solution for data compression in the sense of minimum mean square error (MMSE) [47]. The advantage of PCA lies in the fast restoration of the original data by subspace projection at the cost of minimum error. However, it cannot be guaranteed that the principal components extracted by PCA provide the most relevant information for crop type discrimination. Independent component analysis (ICA) is a generalization of PCA that yields statistically independent components. Bartlett M.S. et al. adopted ICA to recognize the face images of the FERET face database [48]. Tensor decomposition is often used to extract certain elementary features from image data. Dehghanpoor G. et al.
used a tensor decomposition method to achieve feature learning on satellite imagery [49]. Non-negative matrix factorization (NMF) is based on non-negative constraints, which allows learning parts-based representations of objects. Ren J.M. et al. applied NMF for dimensionality reduction as a preprocessing step in remote sensing image classification [50]. However, these methods are not suitable for dimensionality reduction of PolSAR data of crops. Additionally, the LLE method can automatically extract low-dimensional nonlinear features from high-dimensional data, but it is very sensitive to outliers [51]. In recent years, with the development of deep learning, the convolutional neural network (CNN) has gradually been applied in remote sensing data analysis [52]. At present, some successful network structures (e.g., the auto-encoder [53,54] and the sparse auto-encoder (SAE) [17,55]) have been presented, yielding excellent performance in feature compression and image classification. However, the sparsity of the SAE network has not been fully exploited to extract efficient features for classification, and the existing CNN-based classifiers do not utilize the multi-scale features of the compressed data. Due to these disadvantages, crop classification performance still cannot achieve a level sufficient for practical use.
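As a point of reference for the linear baseline discussed above, a minimal numpy sketch of PCA compression via SVD might look as follows; the random stand-in data and the 252-to-9 dimensions are illustrative assumptions, not the paper's actual features.

```python
import numpy as np

def pca_compress(X, k):
    """Compress rows of X (n_samples x n_features) to k principal components."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # SVD of the centered data; rows of Vt are the principal directions
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                 # projection matrix (n_features x k)
    Z = Xc @ W                   # compressed k-dimensional features
    X_hat = Z @ W.T + mu         # minimum-MSE linear reconstruction
    return Z, X_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 252))  # stand-in for 252-dim multi-temporal features
Z, X_hat = pca_compress(X, 9)    # compress to 9 dimensions
```

Keeping the top-k singular directions is what makes the reconstruction optimal in the MMSE sense noted above, but nothing forces those directions to separate crop classes.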
Therefore, the main purpose of this study is to propose a new method to improve the performance of crop classification for better application in agricultural monitoring. Firstly, we adopted various coherent and incoherent scattering decomposition algorithms to extract particular parameters from multi-temporal PolSAR data. Secondly, a sparse auto-encoder network with non-negativity constraint (NC-SAE) was built to perform feature dimension reduction, which extracts the polarimetric features more efficiently. Finally, a classifier based on a crop discrimination network with multi-scale features (MSCDN) was proposed to implement the crop classification, which greatly enhanced the classification accuracy. The main contributions of this paper are the NC-SAE for data compression and the MSCDN for crop discrimination.
The remainder of this paper is organized as follows. Section 2 is devoted to our methodology, including the structure of PolSAR data, the polarimetric feature decomposition and dimension reduction with the proposed NC-SAE network, as well as the architecture of the proposed MSCDN classifier. In Section 3, the experimental results of crop classification for the proposed method are evaluated and compared with traditional methods using simulated Sentinel-1 data. Finally, Section 4 concludes the study.

Methodology
In order to use multi-temporal PolSAR data to classify crops, the NC-SAE neural network was employed to compress the data, and then a novel crop discrimination network with multi-scale features (MSCDN) was constructed to perform the classification. The flowchart of the whole method is shown in Figure 1, which mainly includes three steps: polarization feature decomposition, feature compression and crop classification.


PolSAR Data Structure
The quad-pol SAR receives target backscattering signals and measures the amplitudes and phases of four combinations: HH, HV, VH and VV, where H represents horizontal mode and V represents vertical mode. For each pixel, a 2 × 2 complex matrix S collects the scattering information; these complex numbers relate the incident and the scattered electric fields. The scattering matrix S usually reads

$$ S = \begin{bmatrix} S_{HH} & S_{HV} \\ S_{VH} & S_{VV} \end{bmatrix}, $$

where $S_{VH}$ denotes the scattering factor of vertical transmitting and horizontal receiving polarization, and the others have similar definitions. The target feature vector can be readily obtained by vectorizing the scattering matrix. The reciprocal backscattering assumption is commonly exploited, so $S_{HV}$ is approximately equal to $S_{VH}$ and the polarimetric scattering matrix can be rewritten as the lexicographic scattering vector

$$ \mathbf{k}_L = \left[ S_{HH},\; \sqrt{2}\, S_{HV},\; S_{VV} \right]^T, $$

where the superscript T denotes the transpose of a vector. The scale factor $\sqrt{2}$ on $S_{HV}$ ensures consistency in the span computation. Then, a polarimetric covariance matrix C can be constructed in the following format:

$$ C = \left\langle \mathbf{k}_L \mathbf{k}_L^{*T} \right\rangle, $$

where the superscript * denotes the conjugate of a complex number. Alternatively, the Pauli-based scattering vector is defined as

$$ \mathbf{k}_P = \frac{1}{\sqrt{2}} \left[ S_{HH} + S_{VV},\; S_{HH} - S_{VV},\; 2 S_{HV} \right]^T. $$

By using the vector $\mathbf{k}_P$, a coherency matrix T can be constructed as follows:

$$ T = \frac{1}{M} \sum_{m=1}^{M} \mathbf{k}_{P,m}\, \mathbf{k}_{P,m}^{*T}, $$

where M indicates the number of looks. The coherency matrix T is usually spatially averaged to reduce the inherent speckle noise in the SAR data, which preserves the phase information between the polarization channels.
The covariance matrix C has been proved to follow a complex Wishart distribution, while the coherency matrix T contains equivalent information about the same PolSAR data. They can be easily converted into each other by a bilinear transformation, $T = N C N^{*T}$, where N is the constant unitary matrix

$$ N = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 0 & 1 \\ 1 & 0 & -1 \\ 0 & \sqrt{2} & 0 \end{bmatrix}. $$
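The relationships above can be checked numerically. The following numpy sketch builds single-look C and T matrices from an illustrative scattering matrix (the complex values are arbitrary placeholders) and verifies the unitary transformation between the two representations.

```python
import numpy as np

# Toy single-pixel scattering coefficients (reciprocal case: S_hv == S_vh)
S_hh, S_hv, S_vv = 0.8 + 0.2j, 0.1 - 0.3j, 0.5 + 0.4j

# Lexicographic and Pauli scattering vectors
k_L = np.array([S_hh, np.sqrt(2) * S_hv, S_vv])
k_P = np.array([S_hh + S_vv, S_hh - S_vv, 2 * S_hv]) / np.sqrt(2)

# Single-look covariance and coherency matrices (outer products)
C = np.outer(k_L, k_L.conj())
T = np.outer(k_P, k_P.conj())

# Constant unitary matrix linking the two representations: T = N C N^H
N = np.array([[1, 0, 1],
              [1, 0, -1],
              [0, np.sqrt(2), 0]]) / np.sqrt(2)

assert np.allclose(T, N @ C @ N.conj().T)
assert np.isclose(np.trace(C).real, np.trace(T).real)  # span is preserved
```

Because N is unitary, the trace (span) of C and T is identical, which is exactly the consistency the $\sqrt{2}$ scale factor guarantees.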

Polarization Decomposition and Feature Extraction
Processing and analyzing the PolSAR data can effectively extract the polarization scattering features, and further achieve classification, detection and identification of quad-Pol SAR data. Therefore, polarization decomposition for PolSAR data is usually adopted to obtain multi-dimensional features. Here, we propose to consider the 36-dimensional polarimetric scattering features, which were derived from a single temporal PolSAR image using various methods. Some of these features can be directly obtained from the measured data, and others were computed with incoherent decomposition (i.e., Freeman decomposition [32], Yamaguchi decomposition [33], Cloude decomposition [34] and Huynen decomposition [35]) and Null angle parameters [52]. The 36-dimensional scattering features obtained from a single temporal PolSAR image are summarized in Table 1. Then, higher dimensional scattering features can be obtained from multiple temporal PolSAR images. The resulting features involve all the potential information of the primitive PolSAR data.
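The assembly of the multi-temporal feature stack described above can be sketched as follows; the image size and random values are placeholders, with 36 features per date and 7 dates giving a 252-dimensional descriptor per pixel.

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, F, T = 64, 64, 36, 7     # image size, features per date, number of dates

# Stand-in for the per-date 36-dimensional polarimetric feature maps
features_per_date = [rng.normal(size=(H, W, F)) for _ in range(T)]

# Stack along the feature axis: each pixel gets a 7 x 36 = 252-dim descriptor
multi_temporal = np.concatenate(features_per_date, axis=-1)

# Flatten to a sample matrix for the feature-compression stage
X = multi_temporal.reshape(-1, T * F)
```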

Feature Compression
Directly classifying the crops with the higher-dimensional features above is cumbersome: it involves complicated computations and a large amount of memory to store the features, and such enormous feature sets suffer from the dimensionality disaster. Therefore, to make full use of the wealth of multi-temporal PolSAR data, dimension reduction of the resulting features is indispensable and crucial. In the past few years, the auto-encoder and sparse auto-encoder methods have attracted more and more attention and have commonly been used to compress high-dimensional data [17,55–57]. Here, the sparse auto-encoder with non-negativity constraint is proposed to further improve the sparsity of the auto-encoder.

Auto-Encoder
An auto-encoder (AE) is a neural network for unsupervised data representation learning whose aim is to set the output values approximately equal to the inputs. The basic structure of a single-layer AE network consists of three parts: encoder, activation and decoder, as shown in Figure 2, where the input layer (x), hidden layer (y) and output layer (z) have n, m and n neurons, respectively. The hidden layer is commonly used to implement the encoding of the input data, while the output layer performs the decoding operation. Figure 2. Single-layer AE neural network structure.
The weighted input $a_j$ of each neuron in the encoder is defined as

$$ a_j = \sum_{i=1}^{n} w^{(1)}_{ji} x_i + b^{(1)}_j, \quad j = 1, \ldots, m, $$

where $w^{(1)}_{ji}$ represents the encoder weight coefficient and $b^{(1)}_j$ is the bias of neuron j. Then, the encoder output y can be written as the nonlinear activation of the weighted input a as follows:

$$ y_j = f(a_j), $$

where $f(\cdot)$ is a sigmoid function, usually chosen as the logsig function

$$ f(a) = \frac{1}{1 + e^{-a}}. $$

If m < n, the output y can be viewed as a compressed representation of the input x, so the encoder plays the role of data compression. The decoder is the reverse process of reconstructing the compressed data y, which achieves the restoration of the original data; i.e., the output z represents an estimate of the input x. The weighted input of the decoder is defined as

$$ a^{(2)}_i = \sum_{j=1}^{m} w^{(2)}_{ij} y_j + b^{(2)}_i, \quad i = 1, \ldots, n, $$

where $w^{(2)}_{ij}$ is the decoding weight coefficient and $b^{(2)}_i$ is the bias of neuron i. The decoder output reads

$$ z_i = g\!\left(a^{(2)}_i\right). $$

Here, $g(\cdot)$ is the sigmoid function for the decoder neurons, commonly chosen the same as $f(\cdot)$.
The training process of the AE is based on optimizing a cost function to obtain the optimal weight coefficients and biases. The cost function measures the error between the input x and its reconstruction z at the output, which can be written as

$$ E_0 = \frac{1}{Q} \sum_{q=1}^{Q} \left\| \mathbf{x}_q - \mathbf{z}_q \right\|^2, $$

where Q is the number of samples. Furthermore, a weight-decay restriction term is usually incorporated into the cost function to regulate the degree of weight attenuation, which helps to effectively avoid overfitting and remarkably improves the generalization capacity of the network. Hence, the overall cost function of the AE commonly reads

$$ E = E_0 + \lambda \Omega_w, $$

where $\Omega_w$ is a regularization term on the weights and λ is its coefficient. The most commonly used restriction is the $L_2$ regularization term, defined as

$$ \Omega_w = \sum_{l=1}^{L} \sum_{i} \sum_{j} \left( w^{(l)}_{ij} \right)^2, $$

where L = 2 is the number of layers. The weight coefficients and biases are optimized and trained by the steepest descent algorithm via the classical error back-propagation scheme.
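As a minimal illustration of the cost function above, the following numpy sketch evaluates the reconstruction error of a randomly initialized single-layer AE with L2 weight decay; the dimensions, sample count and λ value are illustrative assumptions.

```python
import numpy as np

def logsig(a):
    """The logsig activation used for both encoder and decoder."""
    return 1.0 / (1.0 + np.exp(-a))

def ae_cost(X, W1, b1, W2, b2, lam):
    """Overall AE cost: mean squared reconstruction error + L2 weight decay."""
    Y = logsig(X @ W1.T + b1)          # encoder: n -> m
    Z = logsig(Y @ W2.T + b2)          # decoder: m -> n
    recon = np.mean(np.sum((X - Z) ** 2, axis=1))
    l2 = np.sum(W1 ** 2) + np.sum(W2 ** 2)
    return recon + lam * l2

rng = np.random.default_rng(0)
n, m, Q = 252, 9, 100                  # input dim, hidden dim, sample count
X = rng.uniform(0, 1, size=(Q, n))
W1, b1 = 0.01 * rng.normal(size=(m, n)), np.zeros(m)
W2, b2 = 0.01 * rng.normal(size=(n, m)), np.zeros(n)
E = ae_cost(X, W1, b1, W2, b2, lam=1e-4)
```

With m = 9 < n = 252, the hidden activations Y are exactly the compressed representation that the later classification stage consumes.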

Sparse Auto-Encoder with Non-Negativity Constraint
A sparse auto-encoder (SAE) is derived from the auto-encoder (AE). Based on the AE, the SAE network enforces a sparsity constraint on the output of the hidden layer, which realizes inhibitory effects and yields a fast convergence speed for the training process using the back-propagation algorithm [17,55]. Hence, the cost function of the SAE is given by

$$ E = E_0 + \lambda \Omega_w + \beta \Omega_s, $$

where β is the coefficient of the sparsity regularization term and $\Omega_s$ is the sparsity regularization term, usually represented by the Kullback-Leibler (KL) divergence [17,55]. A part-based representation of the input data usually exhibits excellent performance for pattern classification. The sparse representation scheme breaks the input data into parts, while the original input data can be readily reconstructed by combining the parts additively when necessary. Therefore, the input in each layer of an auto-encoder can be divided into parts by enforcing the weight coefficients of both the encoder and the decoder to be positive [56]. To achieve a better reconstruction performance, we propose the sparse auto-encoder with non-negativity constraint (NC-SAE): the network decomposes the input into parts via the encoder and combines them in an additive manner via the decoder. This is achieved by replacing the $L_2$ regularization term $\Omega_w$ in the cost function with a new non-negativity constraint

$$ \Omega_{nc} = \sum_{l=1}^{L} \sum_{i} \sum_{j} r\!\left( w^{(l)}_{ij} \right), \quad \text{where} \quad r(w) = \begin{cases} w^2, & w < 0 \\ 0, & w \geq 0 \end{cases}. $$

Therefore, the proposed cost function for the NC-SAE is defined as

$$ E = E_0 + \alpha \Omega_{nc} + \beta \Omega_s, $$

where α ≥ 0 is the parameter of the non-negativity constraint. By minimizing this cost function, the number of non-negative weights in each layer and the sparsity of the hidden-layer activation are both increased, and the overall average reconstruction error is reduced. Further, the steepest descent method is used to update the weights and biases as follows:

$$ w^{(l)}_{ij}(k+1) = w^{(l)}_{ij}(k) - \eta \frac{\partial E}{\partial w^{(l)}_{ij}}, \qquad b^{(l)}_{i}(k+1) = b^{(l)}_{i}(k) - \eta \frac{\partial E}{\partial b^{(l)}_{i}}, $$

where k is the number of the iteration and η denotes the learning rate.
Then, we adopt the error back-propagation algorithm to compute the partial derivatives in the update rule. In order to clarify the computation of the derivatives, we define the neuronal error δ as the derivative of the cost function with respect to the weighted input of each neuron, i.e., $\delta \triangleq \partial E / \partial a$. For the decoder, the neuronal error can be calculated using the chain rule as follows:

$$ \delta^{(2)}_i = (z_i - x_i)\, g'\!\left(a^{(2)}_i\right). $$

Similarly, the neuronal error $\delta^{(1)}_j$ of the encoder is computed as

$$ \delta^{(1)}_j = f'(a_j) \sum_{i=1}^{n} \delta^{(2)}_i\, w^{(2)}_{ij}. $$

The partial derivative of the cost function with respect to the decoding weight then reads

$$ \frac{\partial E}{\partial w^{(2)}_{ij}} = \delta^{(2)}_i\, y_j + \alpha\, r'\!\left(w^{(2)}_{ij}\right), \quad \text{where} \quad r'(w) = \begin{cases} 2w, & w < 0 \\ 0, & w \geq 0 \end{cases}, $$

and the partial derivative with respect to the encoding weight reads

$$ \frac{\partial E}{\partial w^{(1)}_{ji}} = \delta^{(1)}_j\, x_i + \alpha\, r'\!\left(w^{(1)}_{ji}\right). $$

The partial derivatives with respect to the biases of the encoder and decoder are computed in a compact form as

$$ \frac{\partial E}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}, \quad l = 1, 2. $$
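To make the update rule concrete, the numpy sketch below performs steepest-descent steps for a single-layer NC-SAE on one training sample. The KL sparsity term is omitted for brevity, and the network sizes, learning rate and α value are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def logsig(a):
    return 1.0 / (1.0 + np.exp(-a))

def nc_penalty_grad(W, alpha):
    """Gradient of the non-negativity penalty: 2*alpha*w for negative weights."""
    return np.where(W < 0, 2.0 * alpha * W, 0.0)

def nc_sae_step(x, W1, b1, W2, b2, alpha=0.1, eta=0.5):
    """One steepest-descent step for a single training sample; mutates the
    weights/biases in place and returns the reconstruction error."""
    a1 = W1 @ x + b1;  y = logsig(a1)          # encoder forward pass
    a2 = W2 @ y + b2;  z = logsig(a2)          # decoder forward pass
    d2 = (z - x) * z * (1 - z)                 # decoder neuronal error (logsig')
    d1 = y * (1 - y) * (W2.T @ d2)             # encoder neuronal error
    W2 -= eta * (np.outer(d2, y) + nc_penalty_grad(W2, alpha))
    W1 -= eta * (np.outer(d1, x) + nc_penalty_grad(W1, alpha))
    b2 -= eta * d2
    b1 -= eta * d1
    return np.sum((x - z) ** 2)

rng = np.random.default_rng(0)
n, m = 36, 9                                   # input and hidden dimensions
x = rng.uniform(0.1, 0.9, size=n)
W1, b1 = 0.1 * rng.normal(size=(m, n)), np.zeros(m)
W2, b2 = 0.1 * rng.normal(size=(n, m)), np.zeros(n)
errs = [nc_sae_step(x, W1, b1, W2, b2) for _ in range(200)]
```

Because r'(w) vanishes for non-negative weights, the penalty only pushes negative weights toward zero, leaving the additive parts-based combination intact.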

The Crop Discrimination Network with Multi-Scale Features (MSCDN)
In the deep learning field, the convolutional neural network (CNN) has become increasingly powerful at dealing with complicated classification and recognition problems. Recently, CNNs have been widely adopted in remote sensing, for example, in image classification, target detection, and semantic segmentation. However, most classical CNNs only use a single convolution kernel size to extract feature maps; the resulting single-scale feature map in each convolutional layer makes it difficult to distinguish similar crops, so the overall crop classification performance degrades. As in our previous work [17], the poor overall performance is attributed to the minor categories of crops that possess similar polarimetric scattering characteristics. Therefore, in this paper, a new multi-scale deep neural network called MSCDN is proposed, attempting to further improve the classification accuracy. The MSCDN not only extracts features at different scales by using multiple kernels in some convolution layers, but also captures the tiny distinctions between multi-scale feature maps.
The architecture of the proposed MSCDN classifier is shown in Figure 3. The MSCDN network mainly contains three parts: multi-scale feature extraction, feature fusion and classification. First, multiple convolutional layers, with multiple kernels within certain convolution layers, extract feature maps at different scales. Second, the feature information of these diverse scales is fused together as the basis to feed the classification layer. Finally, the softmax layer is adopted to perform the classification.

As shown in Figure 3, the MSCDN comprises seven convolutional layers, two max-pooling layers, four fully connected layers, one concat layer, and a softmax classifier. Rectified Linear Unit (ReLU) and Batch Normalization (BN) layers are successively connected after Conv_1 to Conv_5. The aim of the ReLU layer is to avoid the problems of gradient explosion and gradient dispersion and to further improve the efficiency of gradient descent and back propagation. As for the BN layer, it normalizes each batch of internal data so that the output follows a normal distribution with zero mean and unit variance, which accelerates convergence. The branches of Conv_6 and Conv_7 aim to reduce the depth of the output feature images from Conv_3 and Conv_4 and decrease the computational complexity. The detailed parameters of the convolution kernel for each layer and the other parameters of the MSCDN structure are listed in Table 2.
M denotes the number of categories of crops.
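The multi-scale-then-concatenate idea can be illustrated with a minimal numpy sketch: feature maps produced by kernels of two different sizes are fused along a channel axis before classification. The fixed mean-filter kernels below stand in for learned convolution kernels and are purely illustrative, as is the 15 × 15 patch size.

```python
import numpy as np

def conv2d_same(img, kernel):
    """Naive 'same'-padded 2-D cross-correlation for a single-channel image
    (identical to convolution for the symmetric kernels used here)."""
    k = kernel.shape[0]
    p = k // 2
    padded = np.pad(img, p)
    out = np.empty_like(img, dtype=float)
    for r in range(img.shape[0]):
        for c in range(img.shape[1]):
            out[r, c] = np.sum(padded[r:r + k, c:c + k] * kernel)
    return out

rng = np.random.default_rng(0)
img = rng.normal(size=(15, 15))       # one compressed-feature channel

# Stand-in kernels of two scales (learned by the network in practice)
k3 = np.full((3, 3), 1 / 9.0)
k5 = np.full((5, 5), 1 / 25.0)

f3 = conv2d_same(img, k3)             # fine-scale feature map
f5 = conv2d_same(img, k5)             # coarse-scale feature map

# Concatenate the multi-scale maps along a channel axis before classification
fused = np.stack([f3, f5], axis=-1)
```

A classifier fed with `fused` sees both fine and coarse responses at every pixel, which is the property that helps separate crops with subtly different textures.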

PolSAR Data
An experimental site, which was established by the European Space Agency (ESA), was used to evaluate the performance of the proposed method. The experimental area was an approximately 14 km × 19 km rectangular region located in the town of Indian Head (103°66′87.3″ W, 50°53′18.1″ N) in southeastern Saskatchewan, Canada. This area has 14 classes of different crop types and an 'unknown' class including urban areas, transport corridors and areas of natural vegetation. The number of pixels and total area for each crop type are summarized in Table 3. The location maps from Google Earth and the ground truth maps of the study area are shown in Figure 4. The experimental PolSAR data sets were simulated with Sentinel-1 system parameters from real RADARSAT-2 data by ESA before the launch of the real Sentinel-1 systems [58]. The real RADARSAT-2 datasets were collected on 21 April, 15 May, 8 June, 2 July, 26 July, 19 August and 12 September 2009. The multi-temporal PolSAR data from these 7 periods almost covered the entire growth cycle of the major crops in the experimental area, from sowing to harvesting. The polarization decomposition of each single temporal PolSAR image yields 36-dimensional features. Therefore, 252-dimensional features were acquired from the 7 time-series PolSAR images.

Evaluation Criteria
For evaluating the performance of different classification methods, the recall rate, overall accuracy (OA), validation accuracy (VA) and kappa coefficient (Kappa) are considered for comparison.
The overall accuracy can be defined as

$$ OA = \frac{M}{N}, $$

where M is the total number of correctly classified pixels and N is the total number of pixels. Similarly, VA is the proportion of validation samples that are correctly classified among all validation samples. The recall rate can be written as

$$ Recall = \frac{X}{Y}, $$

where X is the number of correctly classified samples of a certain class and Y is the total number of samples of that class. The kappa coefficient arises from the consistency test and is commonly used to evaluate classification performance; it measures the consistency between the predicted output and the ground truth. Here, we use the kappa coefficient to evaluate the overall classification accuracy of the model. Unlike OA and the recall rate, which only involve correctly predicted samples, the kappa coefficient accounts for the missing and misclassified samples located off the diagonal of the confusion matrix. The kappa coefficient can be calculated as

$$ Kappa = \frac{N \sum_{i=1}^{M} s_{ii} - \sum_{i=1}^{M} s_{i:}\, s_{:i}}{N^2 - \sum_{i=1}^{M} s_{i:}\, s_{:i}}, $$

where N is the total number of samples, M is the number of crop categories, and $s_{i:}$ and $s_{:i}$ are, respectively, the sums of the i-th row and i-th column elements of the confusion matrix.
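All three criteria can be computed directly from the confusion matrix. The following numpy sketch does so for an illustrative 3-class confusion matrix (the numbers are made up for demonstration).

```python
import numpy as np

def metrics(conf):
    """OA, per-class recall, and kappa from a confusion matrix
    (rows = true class, columns = predicted class)."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()                                   # total samples N
    oa = np.trace(conf) / n                          # overall accuracy
    recall = np.diag(conf) / conf.sum(axis=1)        # per-class recall X/Y
    chance = (conf.sum(axis=1) * conf.sum(axis=0)).sum()
    kappa = (n * np.trace(conf) - chance) / (n ** 2 - chance)
    return oa, recall, kappa

conf = np.array([[50,  2,  0],
                 [ 3, 40,  5],
                 [ 0,  4, 46]])
oa, recall, kappa = metrics(conf)
```

Note that kappa discounts the agreement expected by chance (the row-sum/column-sum products), which is why it penalizes off-diagonal confusion more strictly than OA does.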

Results and Analysis
We now report the comparison of our method with other data compression schemes and classifiers. First, 9-dimensional compressed features were derived from the original 252-dimensional multi-temporal features using various methods, namely LLE, PCA, stacked sparse auto-encoder (S-SAE) and the proposed NC-SAE. Then, the compressed 9-dimensional features were fed into the SVM, CNN and the proposed MSCDN classifiers. The ratio of the training samples for each classifier was 1%.

Comparison of the Dimensionality Reduction Methods
Firstly, for the dimensionality reduction, the reconstruction error curves of SAE and NC-SAE during training are shown in Figure 5. It can be seen that the reconstruction error of NC-SAE is slightly less than that of SAE. Moreover, the standard deviation within the same crop class was calculated and plotted in Figure 6a,b for different categories. The six main crops (i.e., lentil, spring wheat, field pea, canola, barley and flax), which have relatively larger cultivated areas, shown in Figure 6a, were chosen to evaluate the standard deviation. Meanwhile, we also chose six easily confused crops shown in Figure 6b (i.e., durum wheat, oat, chemical fallow, mixed hay, barley, mixed pasture) for performance evaluation. We can see that the standard deviation of the proposed NC-SAE method is the smallest. Therefore, a better crop classification performance is expected by using the features extracted through NC-SAE. Additionally, using the CNN classifier, the OA, VA, Kappa coefficient and CPU time performances for the different dimension reduction methods are listed in Table 4, and the predicted results of the classifier and their corresponding error maps are illustrated in Figure 7. In this experiment, the size of the input data for the CNN classifier was set to 15 × 15. We can see that the dimensionality reduction methods S-SAE and NC-SAE are both superior to PCA and LLE. For the CNN classifier, the OA and Kappa of S-SAE and NC-SAE are approximately 6~8% higher than those of PCA and LLE. The performances of S-SAE and NC-SAE are nearly equal. However, keep in mind that these two neural networks have different structures.
The proposed NC-SAE is a single-layer network, while the S-SAE uses three auto-encoders to sequentially perform the feature compression. Comparing the CPU time that required computing the compressed features, it can be seen that NC-SAE takes almost one tenth as long as S-SAE.

Comparison of the Classifier with Different Classification Methods
In this section, we compare the classification performance obtained by feeding the 9-dimensional features extracted by NC-SAE into the SVM, CNN and MSCDN classifiers. The classification results and error maps for the above classifiers are shown in Figure 8. It can be readily seen that the proposed MSCDN classifier achieves the best performance. To provide insight into this result, we further show the OA performances of the different classifiers, along with the recall rates for each crop, in Table 5. One sees that the OA of MSCDN is 24% and 5% higher than that of SVM and CNN, respectively. Observing the recall rate for each crop in Table 5, we see that the poorer OA of SVM and CNN is mainly due to the low recall rates of several individual crops (namely Duw: Durum Wheat, Mip: Mixed Pasture, Mih: Mixed Hay, and Chf: Chemical Fallow). By further analyzing the categories of these crops in Table 3, we find that the above-mentioned crops are easily confused with others because they share the same growth cycle or similar external morphologies. For example, Duw (Durum Wheat) is similar to Spw (Spring Wheat) in terms of external morphology, and Mip (Mixed Pasture) is more easily confused with Gra (Grass) and Mih (Mixed Hay). We conjecture that the poorer OA of SVM and CNN arises from the less distinguishable features extracted by their network architectures.
Note: The bold numbers in the columns indicate the improvements in the recall rates and OA.
From the above analysis, we see that the accurate classification of these easily confused crops is key to enhancing the overall accuracy. To better understand the improvement of our MSCDN classifier, the confusion matrices of the crops Duw, Mip, Mih and Chf for CNN and MSCDN are shown in Table 6. One sees that, compared to CNN, MSCDN greatly improves the recall rates of these easily confused crops, whose averaged recall rate increased by more than 31%. This is not surprising because MSCDN is a multi-scale neural network, whose architecture enables it to extract features at different scales by using multiple kernels in the convolution layers, and hence MSCDN is able to capture the tiny distinctions between the feature maps. Moreover, it should be pointed out that the above easily confused crops have very few samples in our crop data (only 7.3% of all samples). Therefore, the improvement in OA performance for MSCDN is to be expected. Note: The bold numbers represent the accuracy of the easily confused crops.
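The per-crop recall rates and overall accuracy discussed above all derive from a confusion matrix. The following sketch computes them for a small hypothetical label set; the numbers are illustrative and are not the paper's Table 5 or Table 6 values.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_recall(cm):
    # Recall for class i: correct predictions of i / all true samples of i
    return np.diag(cm) / cm.sum(axis=1)

def overall_accuracy(cm):
    return np.trace(cm) / cm.sum()

# Hypothetical example with two easily confused classes (0 and 1)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0, 2, 2])
cm = confusion_matrix(y_true, y_pred, 3)
rec = per_class_recall(cm)    # class 0 suffers: half its samples go to class 1
oa = overall_accuracy(cm)
```

A low row-wise recall with large off-diagonal mass, as for class 0 here, is exactly the confusion pattern reported for Duw, Mip, Mih and Chf.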

The Performance for Different Input Sample Sizes
The size of the input sample for the classifiers also affects crop classification performance. After compressing the data with NC-SAE, Table 7 gives the classification results of the MSCDN classifier with different sample sizes; the corresponding training curves are shown in Figure 9. Firstly, we set the size of the input samples for the MSCDN classifier to 15 × 15. In this scenario, slight over-fitting was observed when training the MSCDN, as shown in Figure 9a. This problem was ultimately solved by increasing the size of the input sample. Figure 9b shows the training curve for input samples of size 35 × 35. We see that the over-fitting can be completely eliminated by expanding the input size. Observing Table 7, the classification accuracy also improves as the input size increases. For the CNN classifier, the same conclusion can be made. Table 8 further demonstrates the effect of different input sample sizes on the classification results. In addition, by comparing the results in Tables 7 and 8, we can see that the classification performance of MSCDN is always better than that of CNN under the same sample size.
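In this per-pixel classification setting, the "input sample size" refers to the spatial patch centered on each pixel. The following is a minimal sketch of extracting such patches from a 9-channel feature image (the names, edge-padding choice and sizes are assumptions, not the paper's exact preprocessing).

```python
import numpy as np

def extract_patch(image, row, col, size):
    """Extract a size x size patch centered at (row, col) from an
    H x W x C feature image, using edge padding near the borders."""
    half = size // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="edge")
    # Pixel (row, col) of the original image sits at (row + half, col + half)
    # in the padded image, so the window below is centered on it.
    return padded[row:row + size, col:col + size, :]

# Hypothetical 100 x 100 image of 9 compressed features per pixel
img = np.arange(100 * 100 * 9, dtype=float).reshape(100, 100, 9)
p15 = extract_patch(img, 50, 50, 15)   # 15 x 15 input sample
p35 = extract_patch(img, 0, 0, 35)     # 35 x 35 sample at a corner pixel
```

Larger patches give the classifier more spatial context per labeled pixel, which is one plausible reason the over-fitting observed at 15 × 15 disappears at 35 × 35.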

Comparison of Overall Processing Procedures
The overall processing procedures and their performance evaluation are listed in Table 9. For the traditional SVM and CNN classifiers and the proposed MSCDN, data compression methods such as PCA, LLE, S-SAE and NC-SAE were used to obtain the compressed 9-dimensional features. Different from the above methods, the LSTM in Zhong et al. [25] can directly perform the classification on the 36 × 7 feature maps of a single pixel. Although the LSTM method avoids the feature compression procedure, its classification accuracy was poor, whereas the combination of a data compressor and a trained classifier can achieve remarkable crop classification performance. From Table 9, we can conclude that: (1) the combination of the proposed NC-SAE and MSCDN obtained the best performance; (2) with the expansion of the input size for CNN and MSCDN, the classification accuracy of these two classifiers increased remarkably. However, it is worth noting that over-fitting appears in NC-SAE + MSCDN for the 15 × 15 sample case, as shown in Figure 9, so its classification accuracy is somewhat inferior to its competitors. Note: The bold numbers represent the best performance.
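The OA and kappa coefficients reported in Table 9 follow the standard definitions over a confusion matrix. As a reference, here is a short sketch of Cohen's kappa on a hypothetical two-class matrix (the numbers are illustrative, not the paper's results).

```python
import numpy as np

def kappa(cm):
    """Cohen's kappa from a confusion matrix: (po - pe) / (1 - pe),
    where po is the observed agreement (OA) and pe the chance agreement."""
    n = cm.sum()
    po = np.trace(cm) / n
    # pe: sum over classes of (predicted marginal * true marginal) / n^2
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
    return (po - pe) / (1 - pe)

cm = np.array([[45,  5],
               [10, 40]])   # hypothetical counts: rows true, columns predicted
k = kappa(cm)
```

Because kappa discounts chance agreement, it is always at most the OA, which matches the pattern of the paper's reported pairs (e.g., 99.33% OA vs. 99.19% kappa).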

Discussion
Extensive experiments and analysis show that the performance of crop classification can be improved remarkably using multi-temporal quad-pol SAR data. Nowadays, the large number of spaceborne SAR systems in orbit around the Earth shortens the revisit period of satellite constellations and yields a growing amount of real data, offering a tremendous opportunity for multi-temporal data analysis. Additionally, the wide application of neural networks in remote sensing has demonstrated great potential. Motivated by these two trends, this paper divided crop classification into two steps: dimensionality reduction based on NC-SAE, followed by classification with MSCDN. The experimental results of Section 3 are summarized and discussed in the following.

The Effect of NC-SAE
In this paper, NC-SAE was used to reduce the dimension of the features obtained from polarimetric decomposition. The experimental results in Section 3.3.1 show that NC-SAE achieved the best performance among the compared methods. Compared to the traditional dimension reduction methods PCA and LLE, the classification accuracy using the NC-SAE compressed features improved by more than 6%, while being nearly the same as that of S-SAE. However, S-SAE has three hidden layers with an intricate structure and more nodes in each layer, whereas the structure of NC-SAE is simple: it has only one hidden layer with 9 nodes. The hyper-parameters λ, β and ρ of NC-SAE were set to 0.1, 2.5 and 0.45, respectively, directly inherited from the empirical values of S-SAE. Therefore, NC-SAE is a computationally cheaper alternative to S-SAE.
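To illustrate how the hyper-parameters λ, β and ρ enter the objective, the following sketches one plausible form of a non-negativity-constrained sparse auto-encoder loss: reconstruction error, a KL-divergence sparsity term weighted by β with target activation ρ, and a λ-weighted quadratic penalty applied only to negative weight entries. This form is an assumption modeled on common NC-SAE formulations in the literature, not the paper's exact equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nc_sae_loss(X, W1, b1, W2, b2, lam=0.1, beta=2.5, rho=0.45):
    """Sketch of a sparse auto-encoder loss with a non-negativity penalty.
    X: (n, d) inputs in [0, 1]; W1: (d, h) encoder; W2: (h, d) decoder."""
    H = sigmoid(X @ W1 + b1)              # hidden activations
    Xr = sigmoid(H @ W2 + b2)             # reconstruction
    recon = 0.5 * np.mean(np.sum((Xr - X) ** 2, axis=1))
    # Sparsity: KL divergence between target rho and mean activation rho_hat
    rho_hat = np.clip(H.mean(axis=0), 1e-6, 1 - 1e-6)
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    # Non-negativity: quadratic penalty on negative weight entries only
    neg = sum(np.sum(np.minimum(W, 0.0) ** 2) for W in (W1, W2))
    return recon + beta * kl + 0.5 * lam * neg

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(32, 252))     # 252-dim multi-temporal features
W1 = rng.normal(0, 0.05, size=(252, 9)); b1 = np.zeros(9)   # 9 hidden nodes
W2 = rng.normal(0, 0.05, size=(9, 252)); b2 = np.zeros(252)
loss = nc_sae_loss(X, W1, b1, W2, b2)
```

Penalizing negative weights drives the learned basis toward a parts-based, non-negative representation, which is the usual motivation for this family of constraints.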

The Effect of MSCDN Classifier
MSCDN was employed to classify the features extracted by the NC-SAE dimensionality reduction method, where the configuration parameters were empirically determined. The MSCDN network differs from the classical CNN in its concatenated multi-scale features extracted by multiple kernels of different sizes. Slight over-fitting was observed in the training process of MSCDN when the input size was set to 15 × 15 × M, where M is the dimension of the input features. This problem is readily resolved by expanding the input size to 35 × 35 × M. Moreover, the classification accuracy is greatly improved compared to the other classifiers. In general, the MSCDN classifier combined with the NC-SAE feature compression method obtained the best performance, and its overall accuracy is about 5% higher than our previous work [17].
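The multi-kernel idea behind MSCDN can be sketched in simplified form: filter the same input at several kernel sizes and concatenate the resulting feature maps along the channel axis. The sketch below uses fixed averaging kernels instead of learned ones and a single input channel, so it only mimics the concatenated multi-scale structure, not the actual trained network.

```python
import numpy as np

def conv2d_same(x, kernel):
    """2-D correlation of a single-channel image with zero 'same' padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    H, W = x.shape
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def multi_scale_features(x, kernel_sizes=(3, 5, 7)):
    """Filter x at several scales and concatenate the maps channel-wise,
    mimicking the multi-kernel branches of MSCDN (fixed averaging kernels)."""
    maps = []
    for k in kernel_sizes:
        kern = np.full((k, k), 1.0 / (k * k))   # averaging kernel stand-in
        maps.append(conv2d_same(x, kern))
    return np.stack(maps, axis=-1)              # shape (H, W, n_scales)

patch = np.random.default_rng(2).normal(size=(15, 15))
feats = multi_scale_features(patch)
```

Because each branch sees a different receptive field, the concatenated output retains both fine and coarse context, which is the stated reason MSCDN separates crops with only tiny feature-map differences.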

Future Work
First of all, the slight over-fitting observed when training the MSCDN network may be resolved by adding a dropout layer to MSCDN. Secondly, this study used a two-stage processing network for crop classification (feature compression followed by classification). A more elegant single network that implements crop classification with multi-temporal quad-pol SAR data can be foreseen, which would further simplify the network and reduce the computational burden.
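For reference, the dropout regularization suggested above amounts to randomly zeroing activations during training and rescaling the survivors, so the expected activation is unchanged at inference. A minimal sketch (the inverted-dropout variant; all names are illustrative):

```python
import numpy as np

def dropout(activations, rate, rng, train=True):
    """Inverted dropout: zero a fraction `rate` of activations during
    training and rescale the rest so the expected value is unchanged."""
    if not train or rate == 0.0:
        return activations
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

rng = np.random.default_rng(3)
h = np.ones((4, 9))                    # hypothetical hidden activations
h_train = dropout(h, 0.5, rng)         # entries become 0.0 or 2.0
h_eval = dropout(h, 0.5, rng, train=False)   # identity at inference
```

Inserting such a layer between the convolution and fully connected stages is the usual placement when over-fitting appears on small input patches.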

Conclusions
In this paper, we proposed a novel classification method, namely MSCDN, for multi-temporal PolSAR data classification. To solve the problem of the dimension disaster, we first constructed a sparse auto-encoder with non-negativity constraints (NC-SAE), which has improved sparsity, to reduce the data dimension of the scattering features extracted from multi-temporal PolSAR images. Meanwhile, the simulated multi-temporal Sentinel-1 data provided by the ESA and the established ground truth map of the experimental site were used to evaluate the performance of the proposed methodology. Comparing the classification results, we can see that the OA of the MSCDN classifier is approximately 20%