Synergistic Use of Multi-Temporal RADARSAT-2 and VENµS Data for Crop Classification Based on 1D Convolutional Neural Network

Abstract: Annual crop inventory information is important for many agricultural applications and government statistics. The synergistic use of multi-temporal polarimetric synthetic aperture radar (SAR) and available multispectral remote sensing data can reduce temporal gaps and provide both the spectral and the polarimetric information of the crops, which is effective for crop classification in areas with frequent cloud interference. The main objectives of this study are to develop a deep learning model to map agricultural areas using multi-temporal fully polarimetric SAR and multispectral remote sensing data, and to evaluate the influence of different input features on the performance of deep learning methods in crop classification. In this study, a one-dimensional convolutional neural network (Conv1D) was proposed and tested on multi-temporal RADARSAT-2 and VENµS data for crop classification. Compared with the Multi-Layer Perceptron (MLP), Recurrent Neural Network (RNN) and non-deep learning methods including XGBoost, Random Forest (RF), and Support Vector Machine (SVM), the Conv1D performed best when the multi-temporal RADARSAT-2 data (Pauli decomposition or coherency matrix) and VENµS multispectral data were fused by the Minimum Noise Fraction (MNF) transformation. The Pauli decomposition and the coherency matrix gave similar overall accuracy (OA) for the Conv1D when fused with the VENµS data by the MNF transformation (OA = 96.65 ± 1.03% and 96.72 ± 0.77%, respectively). The MNF transformation improved the OA and F-score for most classes when the Conv1D was used. The results reveal that the coherency matrix has great potential in crop classification and that the MNF transformation of multi-temporal RADARSAT-2 and VENµS data can enhance the performance of the Conv1D.


Introduction
Annual crop inventory information is important for many agricultural applications and government statistics. Remote sensing satellite imagery provides an efficient means for crop classification. Traditionally, optical data have been widely used because they provide spatial and spectral information on land covers. For crop classification, remotely sensed time series data have proven beneficial because different crop types exhibit different temporal features. However, due to weather conditions, continuous optical time series data may be difficult to acquire. Polarimetric synthetic aperture radar (SAR) time series data can provide not only structural information but also continuous temporal changes of crops, owing to their capability of penetrating clouds and light rain. Therefore, multi-temporal polarimetric SAR data have been adopted for crop classification [1][2][3][4]. The availability of various sources of satellite imagery makes it possible to provide spatial, temporal, spectral and even structural features of land covers. It has been reported that the integration of optical and SAR data can reduce temporal gaps [5] and provide both spectral and structural features of land covers, which is beneficial for crop classification [6]. Previous studies have shown that the synergistic use of polarimetric SAR and optical data can increase classification accuracy in cropland areas [5,7,8].
A common method for the synergistic use of multi-source remote sensing data in land cover classification is data fusion, the process of combining images obtained by different sensors into a composite image. Data fusion mainly focuses on improving spatial resolution and structural and textural detail [9], and most data fusion studies operate at the pixel level [10]. The three most commonly used optical-radar fusion methods at pixel level are Principal Component Analysis (PCA) [11], intensity-hue-saturation (IHS) [12], and the discrete wavelet transform [13]. According to previous studies, PCA is the preferred method of the three [14]. The Minimum Noise Fraction (MNF) transformation performs two separate standard PCA transformations of the noise-whitened data [15], and aims to produce principal components that maximize the signal-to-noise ratio of the data [16].
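As a concrete illustration, the MNF transform described above can be sketched in a few lines of NumPy: estimate the noise covariance, whiten the data with it, then run a standard PCA on the whitened data. This is a minimal sketch on synthetic data, not the implementation used in the study; the shift-difference noise estimate is one common choice, assumed here for illustration.

```python
import numpy as np

def mnf(X, noise):
    """Minimum Noise Fraction transform (sketch).

    X     : (n_pixels, n_bands) stacked SAR + optical features
    noise : (n_pixels, n_bands) noise estimate, e.g. differences
            between adjacent pixels
    Returns components ordered by decreasing signal-to-noise ratio.
    """
    # 1) noise whitening: eigendecomposition of the noise covariance
    Sn = np.cov(noise, rowvar=False)
    evals, evecs = np.linalg.eigh(Sn)
    W = evecs / np.sqrt(evals)            # whitening matrix
    Xw = (X - X.mean(axis=0)) @ W
    # 2) standard PCA on the noise-whitened data
    Sx = np.cov(Xw, rowvar=False)
    evals2, evecs2 = np.linalg.eigh(Sx)
    order = np.argsort(evals2)[::-1]      # highest SNR first
    return Xw @ evecs2[:, order]

# toy example: 4 random bands with a simple shift-difference noise estimate
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
noise_est = np.diff(X, axis=0)
components = mnf(X[:-1], noise_est)
print(components.shape)                   # (499, 4)
```

After the transform, the leading components carry most of the signal, which is why the last, noise-dominated components can be discarded before classification.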
Traditionally, supervised machine learning approaches such as Random Forest (RF), Support Vector Machine (SVM), k-Nearest Neighbor (kNN), Neural Networks (NN) and Decision Trees (DT) were adopted for crop classification using the integration of optical and SAR data. In recent years, deep learning (DL) has drawn attention in the remote sensing community due to the large amount of available data and improved hardware resources. Convolutional neural networks (CNNs) and Recurrent Neural Networks (RNNs) are two deep learning architectures that have been successfully applied to remote sensing data for crop classification [9,17]. RNNs are designed for sequential or temporal data analysis such as signal processing, natural language processing and speech recognition [11], and they have shown success in remote sensing time series applications [9,17-20]. CNNs have been widely used in various remote sensing applications such as land cover mapping [21], change detection [22], and building extraction [23]. CNNs include one-dimensional CNNs (1D-CNNs), two-dimensional CNNs (2D-CNNs) and three-dimensional CNNs (3D-CNNs). 1D-CNNs are usually applied to pixel-based hyperspectral or multi-temporal remote sensing data [9]. 2D-CNNs are generally adopted to extract features in the spatial dimension, for example for object detection [24] and semantic segmentation [25]. 3D-CNNs consider both the spatial and the temporal/spectral dimensions. 1D-, 2D-, and 3D-CNNs have all been applied to remote sensing images for cropland classification [9,26,27]. A few studies have demonstrated that CNNs are superior to RNNs in crop classification using time series data [9,18,28]. In this study, the croplands can be regarded as homogeneous areas, so we focus on the 1D-CNN at pixel level.
Owing to their free access, Sentinel-1 SAR time series data have been the most used dataset for crop classification with deep learning methods in recent years. However, Sentinel-1 data have only two polarizations (VV + VH). When RADARSAT-2 fully polarimetric SAR data were used for cropland classification, the polarimetric SAR parameters were usually extracted from the coherency matrix using decomposition methods such as the Pauli decomposition, Cloude-Pottier decomposition, Freeman-Durden decomposition [29], Neumann decomposition [3], and the optimum power [30]. It has also been shown that the elements of the coherency matrix of fully polarimetric SAR data perform well in crop classification, since the coherency matrix is the basic matrix representing the information content of the polarimetric SAR data [15].
To the best of our knowledge, no study has tested deep learning methods on the combination of multi-temporal RADARSAT-2 fully polarimetric SAR and optical data for cropland classification. The main objectives of this study are (1) to develop a deep learning model to map cropland areas using multi-temporal fully polarimetric SAR and multispectral remote sensing data, and (2) to evaluate the influence of different input features on the performance of deep learning methods in crop classification.

Study Site
The study site is located in an agricultural area of the Mixedwood Plains Ecozone in Southwestern Ontario (Figure 1), characterized by an abundant water supply and productive soils. The dominant crops are winter wheat, corn, soybeans, and forage, including alfalfa and grass. The crop type in each field changes every year due to crop rotation. Generally, corn and soybean are seeded in May and harvested in October, while winter wheat is seeded in October of the previous year and harvested in July of the following year.



Ground Truth Data
From July to October 2018, intensive field surveys were conducted nearly every week. Field data including crop type, crop phenology, crop height, leaf area index (LAI), and soil moisture were collected. In addition, a general land cover survey was conducted in October. Soybean, corn, winter wheat, alfalfa, grass, tobacco and squash were the crop types found in the field surveys at this study site. Forest and built-up classes were delineated from Google Maps. The ground truth data were digitized into polygons (Figure 1); field boundaries were excluded during digitizing to ensure that no mixed pixels were included. Alfalfa and grass were aggregated into a single Forage class, and tobacco and squash were aggregated into an Other class due to their limited sample sizes. Seven classes were therefore defined to represent the land cover of the study site. The polygons were then rasterized at 10 m spatial resolution to be consistent with the pixel size of the processed remote sensing data. The ground truth data were split into training, validation and testing datasets. The training and testing sets were used to train and test the individual classification algorithms, while the validation set was used to select the optimal hyper-parameters of the deep learning methods. As the pixels within the same field are homogeneous and highly correlated [18], the three datasets must not come from the same fields. The ground truth data were therefore split randomly into five mutually exclusive folds at the polygon level, and the same number of pixels was then sampled from each fold, since the number of pixels per polygon was not equal. One fold (20%) was used as validation data, another fold (20%) as testing data, and the remaining three folds (60%) as training data. This combination was repeated five times, so each algorithm was evaluated on five different train/test splits.
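The polygon-level five-fold split described above can be sketched as follows. The polygon IDs and pixel counts here are hypothetical stand-ins, not the actual ground truth data; the point is that the split is made over polygons, so pixels from one field never appear in two subsets.

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical ground-truth table: one polygon id per labelled pixel
n_pixels = 10_000
polygon_id = rng.integers(0, 200, size=n_pixels)   # 200 digitized polygons

# split at POLYGON level into five mutually exclusive folds
polygons = np.unique(polygon_id)
rng.shuffle(polygons)
folds = np.array_split(polygons, 5)

for k in range(5):                                 # repeated five times
    val_polys  = folds[k]
    test_polys = folds[(k + 1) % 5]
    val_mask   = np.isin(polygon_id, val_polys)
    test_mask  = np.isin(polygon_id, test_polys)
    train_mask = ~(val_mask | test_mask)           # remaining three folds
    # no pixel is shared between the three subsets
    assert not np.any(train_mask & val_mask)
    assert not np.any(val_mask & test_mask)

print(train_mask.sum(), val_mask.sum(), test_mask.sum())
```

Equalizing the number of pixels drawn from each fold (as the study does) would be an additional sampling step on top of these masks.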
Table 1 shows the number of fields and pixels used for the training, validation and testing datasets.

RADARSAT-2 Data
A total of 10 fine-quad wide beam mode (FQW) RADARSAT-2 polarimetric SAR images were acquired throughout the 2018 growing season from July to October (Table 2). The RADARSAT-2 data are in single look complex (SLC) format containing four polarizations: HH, HV, VV, and VH. The revisit time for the same beam mode of RADARSAT-2 is 24 days. The coherency matrices (T3) were extracted from the RADARSAT-2 data, and a 9 × 9 Boxcar filter was applied to suppress the inherent speckle noise; this window size was selected to retain a sufficient Equivalent Number of Looks (ENL) while preserving as much detail as possible. A Digital Elevation Model (DEM) of Ontario, Canada with a resolution of 30 m was used for geocoding, and the output spatial resolution is 10 m in the UTM coordinate system. The linear polarizations, Pauli decomposition, Freeman-Durden decomposition, and Cloude-Pottier decomposition were derived from each geocoded coherency matrix. The overlapping area of the multi-temporal RADARSAT-2 images covering the study site was then selected, and all the RADARSAT-2 data were clipped to this extent.
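For reference, the relationship between the scattering matrix, the Pauli decomposition and the coherency matrix T3 can be illustrated for a single pixel. The complex amplitudes below are illustrative values; the study works with multi-looked, speckle-filtered T3 matrices rather than single-pixel ones.

```python
import numpy as np

def pauli_vector(S_hh, S_hv, S_vv):
    """Pauli scattering vector for a reciprocal target (S_hv = S_vh)."""
    return np.array([S_hh + S_vv,          # surface (odd-bounce) component
                     S_hh - S_vv,          # double-bounce component
                     2.0 * S_hv]) / np.sqrt(2.0)

# one SLC pixel (illustrative complex scattering amplitudes)
S_hh, S_hv, S_vv = 0.6 + 0.2j, 0.1 - 0.05j, 0.4 - 0.1j

k = pauli_vector(S_hh, S_hv, S_vv)
T3 = np.outer(k, k.conj())                 # single-look coherency matrix

# Pauli intensities are the diagonal of T3; their sum is the total power (span)
pauli_intensities = np.real(np.diag(T3))
span = abs(S_hh)**2 + 2 * abs(S_hv)**2 + abs(S_vv)**2
print(np.isclose(pauli_intensities.sum(), span))   # True
```

This is why the Pauli decomposition and the coherency matrix carry closely related information: the Pauli intensities are simply the diagonal elements of T3, while the full matrix also retains the off-diagonal correlation terms.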

VENµS Data
The Vegetation and Environment monitoring on a New Micro-Satellite (VENµS), launched in August 2017, is a near-polar sun-synchronous microsatellite developed jointly by the Israel Space Agency (ISA) and the French space agency (CNES). It provides images with 12 narrow spectral bands ranging from 420 nm to 910 nm at high spatial and temporal resolutions (5-10 m every 2 days). As our study site is one of the selected VENµS acquisition areas, the data can be downloaded free of charge from the Theia Data Center (https://www.theia-land.fr/en/data-and-services-for-the-land/). In this study, two cloud-free VENµS level 2 (L2A) surface reflectance products with 10 m spatial resolution, acquired on 11 June 2018 and 9 July 2018, were utilized to fill the temporal gap and provide spectral information. The reflectance values were divided by 1000 so that they ranged between 0 and 1; as the polarimetric SAR backscattering values also range between 0 and 1, the optical and SAR data were thus on the same scale.

Data Preparation
First, the coherency matrix, the backscattering coefficients at linear polarizations, and the polarimetric features from the three polarimetric decompositions (Pauli, Cloude-Pottier, and Freeman-Durden) of the multi-temporal RADARSAT-2 data were each stacked and tested as inputs for all the classifiers. Then, each set of polarimetric SAR parameters was combined with the VENµS multispectral data, and the MNF transformation was conducted on the combination of the two data sources. As the values of the backscattering coefficients and spectral reflectance both range between 0 and 1, no other normalization was applied to the original features.
Labels were created from the training, validation and testing ground truth datasets. The input features were split into training, validation and testing features according to the label datasets. They were shuffled in order to reduce variance and overfitting.

Methods
In this study, a one-dimensional convolutional neural network (Conv1D) was proposed for cropland classification. For comparison purposes, a multi-layer perceptron (MLP) and a recurrent neural network (RNN) were built and tested as deep learning methods, and XGBoost, RF and SVM were tested as non-deep learning benchmark classifiers. The hyperparameters of all the classifiers were trained and optimized using the MNF transformation of the multi-temporal RADARSAT-2 coherency matrix and VENµS multispectral data. Then, different scenarios of remote sensing datasets and the ground truth data were used to train and validate the optimized classifiers based on a five-fold cross-validation process. The final accuracy assessment was based on the average and standard deviation over the five folds. A flowchart of the methodology is presented in Figure 2. There are four basic steps: (1) data acquisition and preprocessing; (2) training of hyperparameters for the deep learning methods; (3) training and cross-validation of the optimized classifiers using the different remote sensing datasets and ground truth data; (4) final classification map generation and accuracy assessment using the trained classifiers.

Figure 2.
Flowchart of the methodology. The black arrow denotes data processing; the blue arrow indicates training of the architecture and hyperparameters; the red arrow represents training and cross-validation of the optimized classifier; the green arrow indicates the final classification map generation and accuracy assessment using the trained classifiers and ground truth data.

Neural Network Classifiers
The Conv1D deals with one-dimensional features and adopts one-dimensional convolution filters. It can be used for hyperspectral or time series data by capturing temporal or spectral features of the input data at pixel level [18]. A Conv1D classifier generally contains convolution layers, pooling layers, dense layers, and an output layer. By applying different convolution filters within each convolution layer, the Conv1D can extract different one-dimensional features from different layers. The pooling layers are generally used for dimension reduction and are optional for a Conv1D classifier. Dense layers are simple neural network layers. Dropout is a technique that randomly drops some neurons in the hidden layers during training to prevent overfitting [31]; a dropout rate is therefore usually applied to the convolution and dense layers. To build a Conv1D architecture for this study, the VGG16 [32] convolutional neural network, one of the well-known models for 2D image classification, was modified for one-dimensional data. VGG16 contains five blocks of convolutional and pooling layers combined with three fully connected layers. The first two blocks each have 2 padding layers, 2 convolution layers, and 1 pooling layer; the next three blocks each have 3 padding layers, 3 convolution layers, and 1 pooling layer. The numbers of convolution filters for the five blocks are 64, 128, 256, 512 and 512, and the pooling layers were fixed as max-pooling with a pooling size of 2. A flatten layer follows the last pooling layer, and is followed in turn by two dense layers with 4096 neurons and one output layer. To find the best Conv1D for our study, the 1D-VGG16 was tested by removing 1, 2, 3, or 4 of the blocks and by removing the pooling layers. Convolution filter widths of 3, 5, and 7 were tested, and values of 256, 512, 1024, 2048, and 4096 were tested for the number of neurons in the two dense layers.
Values of 0, 0.2, 0.5, and 0.8 were also tested for the dropout rate of the three fully connected layers.
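A 1D-VGG16-style Conv1D of the kind searched for here can be sketched in Keras as below, using the optimized settings reported in the results (last VGG block with 512 filters, no pooling, 512-neuron dense layers, dropout 0.5, Adam with the stated parameters). The kernel width of 3 and the input length are assumptions within the tested ranges, not the authors' exact configuration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features, n_classes = 40, 7     # e.g. stacked MNF bands; 7 land-cover classes

# Last VGG16 block adapted to 1D (three conv layers, 512 filters, no pooling),
# then two 512-neuron dense layers with dropout 0.5 and a softmax output.
model = keras.Sequential([
    layers.Input(shape=(n_features, 1)),
    layers.Conv1D(512, 3, padding="same", activation="relu"),
    layers.Conv1D(512, 3, padding="same", activation="relu"),
    layers.Conv1D(512, 3, padding="same", activation="relu"),
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(n_classes, activation="softmax"),
])

# Training setup as reported: Adam with fixed parameters and
# sparse_categorical_crossentropy for integer class labels.
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.00015,
                                    beta_1=0.9, beta_2=0.999),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
print(model.output_shape)   # (None, 7)
```

Each training sample is a single pixel's stacked feature vector reshaped to `(n_features, 1)`, so the convolutions slide along the feature/temporal axis rather than over space.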
The MLP is a simple deep feedforward neural network [18] with at least three layers of neurons (input, hidden, and output). For the MLP, the number of hidden layers was tested from 1 to 5. The number of neurons was set to the same value in each layer, and values of 64, 128, 512, and 1024 were tested. Values of 0, 0.2, 0.5, and 0.8 were tested for the dropout rate. Long short-term memory (LSTM) is a special RNN unit capable of learning long-term dependencies or capturing long-distance connections in sequence prediction problems [33]. It is composed of memory cells that remember information over long periods of time. For the LSTM, the number of LSTM layers was tested from 1 to 5. The number of neurons was set to the same value in each layer, and values of 64, 128, 256, and 512 were tested. Values of 0, 0.2, 0.5, and 0.8 were tested for the dropout rate. As with the Conv1D, three dense layers were added after the last LSTM layer, and values of 256, 512, 1024, 2048, and 4096 were tested for the number of neurons in the two dense layers.
The three deep learning architectures were trained using the Adam optimizer [34]. The maximum number of epochs was set to 20, with early stopping and a patience value of zero. Values of 32, 320, and 3200 were tested for the batch size. The parameters of Adam were fixed as: learning rate = 0.00015, β1 = 0.9, β2 = 0.999. As the classes are represented as integers, the sparse_categorical_crossentropy loss function was adopted. The three deep learning classification models were built and evaluated using the Keras library [35] on top of TensorFlow [36], and were trained on NVIDIA GeForce RTX 2080 Ti Graphics Processing Units (GPUs).

Other Classifiers
Three efficient machine learning classifiers, XGBoost, RF and SVM, were used as benchmarks. XGBoost is a state-of-the-art algorithm that has been growing in popularity in data science due to its accuracy and scalability. It is an implementation of the gradient tree boosting technique designed for high efficiency and performance [37]. Zhong et al. [18] tested this algorithm for crop classification using remote sensing time series data.
The RF classifier is an ensemble of decision tree classifiers. Each tree is grown to the maximum depth independently using a random combination of the input features [15]. It has been widely used in remote sensing image classification due to its high performance and resistance to overfitting.
The SVM classifier finds separating hyperplanes and can perform non-linear classification using kernel functions [38]. SVM has also been extensively applied in remote sensing classification tasks [18,39,40]. The hyperparameters of the three classifiers were selected after running a grid search on the training dataset. The hyperparameters tested and adopted in this study are shown in Table 3.
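A grid search of the kind described can be sketched with scikit-learn's `GridSearchCV`, shown here for RF with an illustrative parameter grid and synthetic data (not the grid in Table 3):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# toy stand-in for the training features/labels (hypothetical shapes)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 20))
y_train = rng.integers(0, 7, size=300)            # 7 land-cover classes

# illustrative hyper-parameter grid; the paper's actual grid is in Table 3
param_grid = {"n_estimators": [100, 200], "max_features": ["sqrt", "log2"]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
```

The same pattern applies to XGBoost and SVM by swapping in the corresponding estimator and grid.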

Evaluation
To assess the performance of each algorithm, the confusion matrix was generated using the testing dataset. The producer's accuracy (PA), user's accuracy (UA), overall accuracy (OA) and the Kappa coefficient were computed. In addition, the F1 score, the harmonic mean of producer's and user's accuracy, was used as an indicator of classification accuracy for each class [9,18]. Table 4 shows the hyper-parameters of the three deep learning methods. The Conv1D performed best when only the last block of the VGG16 network was kept, without the pooling layer and the original fully connected layers. The number of neurons in each dense layer is 512 instead of 4096, and the dropout rate for the dense layers is 0.5. The proposed Conv1D architecture is shown in Figure 3. The optimized MLP architecture includes 1 input layer, 3 hidden layers and 1 output layer; the input layer and each hidden layer have 512 neurons, and the dropout rate is 0.5. The optimized LSTM-based RNN model contains three LSTM units with 256 output channels each, followed by a dropout of 0.5; similarly to the Conv1D, there are 512 neurons in each dense layer and the dropout rate is 0.5. All the polarimetric SAR parameters (coherency matrix, linear polarizations, Pauli decomposition, Cloude-Pottier decomposition, Freeman-Durden decomposition) described in Section 2.4 were tested with all the classifiers. The Pauli decomposition gave the best OA among the polarimetric parameters. To explore the potential of the coherency matrix, the coherency matrix and the Pauli decomposition were each combined with the multispectral data. The results of the Freeman-Durden decomposition, Cloude-Pottier decomposition, and linear polarizations are not listed here in order to focus on the coherency matrix and the best polarimetric SAR parameter.
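All of the accuracy measures above (PA, UA, OA, Kappa and per-class F1) can be derived directly from the confusion matrix; a minimal NumPy sketch with a toy 3-class matrix:

```python
import numpy as np

def accuracy_metrics(cm):
    """PA, UA, OA, Kappa and per-class F1 from a confusion matrix.

    cm[i, j] = number of pixels of reference class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    pa = np.diag(cm) / cm.sum(axis=1)          # producer's accuracy (recall)
    ua = np.diag(cm) / cm.sum(axis=0)          # user's accuracy (precision)
    oa = np.trace(cm) / n                      # overall accuracy
    pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    f1 = 2 * pa * ua / (pa + ua)               # harmonic mean of PA and UA
    return pa, ua, oa, kappa, f1

# toy 3-class confusion matrix (rows = reference, columns = prediction)
cm = np.array([[90,  5,  5],
               [10, 80, 10],
               [ 0, 10, 90]])
pa, ua, oa, kappa, f1 = accuracy_metrics(cm)
print(round(oa, 3))    # 0.867
```

The same computation on the real testing-set confusion matrices yields the OA, Kappa and F-score values reported in Tables 6-8.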
All the classifiers were run on the ten scenarios (Table 5). In the proposed Conv1D architecture (Figure 3), three convolutional layers are applied consecutively, followed by three dense layers and the output layer, which predicts the class. Table 6 presents the average OA (± standard deviation), Kappa coefficient and training time over the five folds of training datasets for all methods from Scenario 1 to Scenario 10.

Overall Classification Accuracy and Training Time
The results show that: (1) The Pauli decomposition gave the best OA among the polarimetric SAR parameters. (2) The combination of multi-temporal RADARSAT-2 polarimetric SAR data and VENµS optical data performed better than either data source alone. This is because the multi-temporal RADARSAT-2 polarimetric SAR captures the structural information of the land covers at the temporal scale, while the VENµS optical images provide rich spectral information; together, the two sources provide spectral, structural and temporal features of the land covers. The MNF transformation further improved the classification accuracies because the noise in the raw data was segregated: the information and noise were reordered, so the first band contains the most information and the last few bands are basically noise. (3) The MNF transformation of Pauli + VENµS and of coherency matrix + VENµS gave similar, and the highest, OA when the Conv1D was applied (96.65 ± 1.03% vs. 96.72 ± 0.77%), which indicates that the MNF transformation is able to extract information from the raw data and that the Conv1D has the best learning capability. (4) The Conv1D also performed best among all the classifiers when the coherency matrix only (OA = 91.85 ± 2.51%) or the VENµS data only (OA = 93.15 ± 2.06%) were utilized. In terms of efficiency, among the deep learning methods the MLP needs the least training time over the five folds (mostly less than 1 min), while the LSTM needs the most (13-44 min); the Conv1D needs about 3 to 11 min. Among the three non-deep learning methods, the RF is the most efficient classifier (2-0 min), while the SVM is the least efficient (20 min-17.5 h).

Classification Accuracy of Individual Land Cover Class
The F-score of each class was calculated for the different classifiers and input datasets (Scenarios 1-10) according to Equation (1) (Table 7). The highest F-score values for Soybean (97.23 ± 0.66%), Corn (97.60 ± 0.87%), and Other (86.37 ± 9.14%) were generated by the Conv1D, and the highest F-score values for Wheat (98.69 ± 0.92%), Forage (94.06 ± 1.97%), and Built-up (99.79 ± 0.22%) were given by the LSTM. The RF gave the best accuracy for Forest (98.91 ± 1.15%). However, the accuracies of individual classes vary considerably with the input datasets, and the performance of the LSTM appears less stable than that of the Conv1D. The F-score values of Soybean, Corn, Forest and Built-up are higher than 90%, except for the LSTM classifier when the optical data were utilized. The class Other has the worst classification accuracy among all the classes due to its limited number of training samples. The combination of multi-temporal RADARSAT-2 and VENµS data improved the accuracies of all classes except Built-up for all the classifiers except the LSTM. The MNF transformation improved the accuracies of most classes except Forest when the Conv1D was applied, with the class Other improved the most. From Scenario 7 to Scenario 10, the F-score values of Soybean and Corn, two of the three main crops in this study site, are similar regardless of whether the coherency matrix or the Pauli decomposition was utilized. The confusion matrix of the best overall accuracy result is shown in Table 8. Misclassifications mainly occur between corn and soybean, between wheat and forage, and between soybean and other. All the classifiers were applied to the MNF transformation of the Pauli decomposition and VENµS data (Scenario 10) to classify the whole study area. The classification maps generated by the six classifiers Conv1D, MLP, LSTM, SVM, XGBoost, and RF are shown in Figure 4a-f, respectively. Three major differences are marked using black boxes.
The MLP shows more misclassifications between Other and Soybean than the Conv1D as the small cyan patches spread in the purple soybean fields. The non-deep learning methods show more misclassification between winter wheat and forage. This is reasonable because the forage and winter wheat started to green up at the same time of the growing season. The Conv1D also gave misclassification between these classes. However, the accuracy of the marked areas cannot be validated since there are no accurate reference data in those areas.

Discussions
In this study, multi-temporal RADARSAT-2 polarimetric SAR data and VENµS data were acquired. It is worth noting that the RADARSAT Constellation Mission has been launched, and multi-temporal fully polarimetric SAR data will be available free to users (subject to security restrictions set out in Canadian legislation). The VENµS data are currently only available for a few specific areas in the world; they were utilized here for crop classification for the first time and show good potential. Sentinel-2 data can be used instead if this approach is applied in other study areas.

Comparisons between Conv1D and Other Classifiers
In this study, we found that the use of a pooling layer lowers the performance of the Conv1D, which confirms the findings of [9]. For example, when one pooling layer was added to the Conv1D architecture, the average overall accuracy was 96.06 ± 1.53% for Scenario 10, while the average overall accuracy of the Conv1D without the pooling layer was 96.72 ± 0.77%. The Conv1D gave the highest overall classification accuracy when the MNF transformation of the multi-temporal RADARSAT-2 data (both coherency matrix and Pauli decomposition) and the VENµS data was utilized. In terms of execution time, the MLP is the most efficient classifier, and the Conv1D is more efficient than the LSTM and the three non-deep learning methods. The LSTM-based RNN performs the worst among the deep learning and non-deep learning methods, and it needs more time to be trained than the other two deep learning methods and the RF classifier. Therefore, the LSTM seems unsuitable for the datasets in this specific classification task. The SVM also performs well on the combination of multi-temporal RADARSAT-2 and VENµS data, but it needs the most training time among all the classifiers. A paired t-test was performed to compare the mean OA of the Conv1D against the other classifiers when the MNF transformation of the multi-temporal RADARSAT-2 and VENµS data was used as input. The p-value is about 0.1 for the Conv1D and MLP, which indicates that the Conv1D outperforms the MLP but not significantly. However, the MLP performs significantly worse than the Conv1D on the Cloude-Pottier decomposition, the Freeman-Durden decomposition, and the two VENµS datasets. The p-values are lower than 0.05 for the Conv1D versus the other classifiers, which means that the Conv1D significantly outperforms the LSTM, XGBoost, RF, and SVM. These results indicate that the Conv1D has great potential in classification tasks using the combination of multi-spectral, multi-temporal, and multi-modal data.
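A paired t-test of this kind compares matched per-run accuracies of two classifiers evaluated on the same train/test splits. A minimal self-contained sketch is shown below; the OA values are made-up illustrations, not the study's actual per-run results:

```python
import math
import statistics as st

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two matched samples."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = st.mean(diffs)
    sd_d = st.stdev(diffs)  # sample standard deviation of the differences
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Illustrative per-run overall accuracies (%) on ten repeated splits
oa_conv1d = [96.1, 97.2, 96.8, 95.9, 97.3, 96.5, 96.9, 97.0, 96.2, 96.8]
oa_mlp    = [95.8, 96.5, 96.1, 95.5, 96.9, 96.0, 96.3, 96.4, 95.7, 96.2]
t, df = paired_t(oa_conv1d, oa_mlp)
```

The two-sided p-value then follows from the t distribution with df degrees of freedom, e.g., `scipy.stats.t.sf(abs(t), df) * 2`; pairing on the splits removes split-to-split variance from the comparison.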

Influence of Input Features on the Performance of Conv1D
This study shows that the performance of the deep learning methods still depends on feature selection. The coherency matrix contains all the raw information of the polarimetric SAR data, but it also contains noise, which affects the performance of the classifiers. The two multi-spectral datasets can produce higher classification accuracy than the RADARSAT-2 data. This may be due to the speckle noise in the RADARSAT-2 data and the lack of RADARSAT-2 acquisitions in May and June. The combination of RADARSAT-2 and VENµS data significantly improved the accuracy for the Grass, Forest, Built-up, and Other classes, and slightly improved the accuracy of Soybean.
Among the four polarimetric SAR parameters, the overall classification accuracy of the Pauli decomposition is the highest, which confirms the finding of a previous study [15]. The components of the Pauli decomposition can be represented by the three diagonal elements of the coherency matrix, which represent single-bounce scattering (e.g., bare soil), double-bounce scattering (e.g., soil-stalk), and volume scattering (e.g., crop canopies), respectively [41]. The linear polarizations, in contrast, can be represented by the three diagonal elements of the covariance matrix. The Freeman-Durden decomposition decomposes each polarimetric SAR observation into three components that are modeled as first-order Bragg surface scattering, scattering from a dihedral corner reflector, and canopy scattering from randomly oriented dipoles, respectively [42]. The Cloude-Pottier decomposition decomposes the coherency matrix into entropy (H), anisotropy (A), and alpha angle (α) [43]. These parameters reflect the crop structural features.
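The relationship between the scattering matrix, the Pauli vector, and the coherency-matrix diagonal can be sketched as follows; the complex values are arbitrary examples, and reciprocity (S_hv = S_vh) is assumed:

```python
import numpy as np

def pauli_components(s_hh, s_hv, s_vv):
    """Pauli scattering vector from complex scattering-matrix elements
    (monostatic case, S_hv == S_vh)."""
    k1 = (s_hh + s_vv) / np.sqrt(2)  # single-bounce (surface) term
    k2 = (s_hh - s_vv) / np.sqrt(2)  # double-bounce term
    k3 = np.sqrt(2) * s_hv           # volume (cross-pol) term
    return k1, k2, k3

# Arbitrary example values for one pixel
s_hh, s_hv, s_vv = 0.6 + 0.2j, 0.1 - 0.05j, 0.4 - 0.1j
k = pauli_components(s_hh, s_hv, s_vv)
# |k1|^2, |k2|^2, |k3|^2 are the diagonal elements of the coherency matrix T
intensities = [abs(c) ** 2 for c in k]
```

The three intensities sum to the total backscattered power (span), |S_hh|² + 2|S_hv|² + |S_vv|², so the Pauli channels are a lossless repartition of the pixel's power into the three scattering mechanisms.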
When the multi-temporal RADARSAT-2 coherency matrix elements and the VENµS multi-spectral data were directly stacked, the performance of the three deep learning methods was similar or even inferior to that of the three classical machine learning methods. However, when the MNF transformation was applied to the original data, the overall classification accuracies improved by about 1% for the Conv1D and MLP. For the Conv1D, the OAs of the two MNF-transformed datasets are similar, indicating that the MNF transformation of the raw data can extract as much information as the MNF transformation of the best manually selected features. Manual feature selection is time-consuming, and there is no guarantee that the features we arbitrarily select are the best.
In this study, we also compared the MNF transformation with the PCA transformation and found that the result of the MNF (OA = 96.65%) is superior to that of the PCA (95.36%) when the Conv1D was applied. This demonstrates that the MNF transformation can also be used as a multi-source data fusion method in land use/land cover (LULC) classification applications.
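The MNF can be understood as noise-whitened PCA: it orders components by signal-to-noise ratio rather than by variance. A minimal numpy sketch is given below; the noise covariance is estimated from differences of horizontally adjacent pixels, which is one common choice and not necessarily the exact procedure used in this study:

```python
import numpy as np

def mnf_transform(cube):
    """MNF of an image cube of shape (rows, cols, bands): solve the
    generalized eigenproblem Cs v = lambda Cn v and project the data
    onto the eigenvectors, ordered by decreasing signal-to-noise."""
    rows, cols, bands = cube.shape
    X = cube.reshape(-1, bands).astype(float)
    X -= X.mean(axis=0)
    # Noise estimate: differences of horizontally adjacent pixels
    noise = (cube[:, 1:, :] - cube[:, :-1, :]).reshape(-1, bands) / np.sqrt(2)
    Cn = np.cov(noise, rowvar=False)
    Cs = np.cov(X, rowvar=False)
    # Whiten the noise via Cholesky, then use an ordinary symmetric
    # eigendecomposition of the noise-whitened signal covariance
    L = np.linalg.cholesky(Cn)
    Linv = np.linalg.inv(L)
    evals, U = np.linalg.eigh(Linv @ Cs @ Linv.T)
    order = np.argsort(evals)[::-1]   # descending SNR
    V = Linv.T @ U[:, order]          # back-transform the eigenvectors
    return (X @ V).reshape(rows, cols, bands)
```

Because the leading MNF bands concentrate the high-SNR information, a classifier can be fed the first few components of the stacked SAR-plus-optical cube instead of the full raw stack, which is how MNF acts as a fusion step here.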

Extended Experiments on Datasets with Different Number of RADARSAT-2 and VENµS Data
To test the robustness of the Conv1D classifier and the contributions of the MNF transformation and the coherency matrix, the proposed architecture was tested on datasets with different numbers of RADARSAT-2 and VENµS acquisitions. This simulates the situation in which as many acquisition dates as used in this study cannot be acquired, especially for the RADARSAT-2 data. We selected the RADARSAT-2 data acquired on 1 July, 25 July, 18 August, 1 September, and 15 September and the VENµS data acquired on 9 July as Sub-dataset 1, and the RADARSAT-2 data acquired on 8 July, 1 August, 25 August, 8 September, and 5 October and the VENµS data acquired on 11 June as Sub-dataset 2. The results (Table 9) show that the combination of the Pauli decomposition and VENµS multispectral data gave slightly better overall classification accuracy than the combination of the coherency matrix and VENµS data for the two sub-datasets (93.86% vs. 93.58% and 92.00% vs. 91.59%). However, the MNF transformations of the two datasets gave similar overall accuracies (95.85% vs. 95.85% and 92.77% vs. 92.70%). These results confirm the findings that the coherency matrix has great potential in crop classification and that the MNF transformation of multi-temporal RADARSAT-2 and VENµS data can improve the classification accuracy when the Conv1D is adopted. In addition, this experiment indicates that the OA and the increase in OA from the MNF transformation depend on the dates of the acquired remote sensing data. If there is only one date of optical data, an acquisition at a time when all the crops are present (e.g., the middle of the growing season), with more separable spectral features, would be better than one at the beginning or end of the growing season. In this experiment, the VENµS data acquired on 9 July are better than those acquired on 11 June.
Table 9. Average OA (± standard deviation) of the Conv1D using datasets with different numbers of remote sensing acquisitions.

Future Work
This study shows that, among pixel-based classification methods, the proposed Conv1D performs best on the MNF transformation of the multi-temporal RADARSAT-2 and VENµS data. However, the proposed Conv1D may not perform well when the number of input datasets is larger than the number used to tune the hyperparameters. Moreover, the Conv1D does not consider spatial features. 3D CNNs have been shown to be a more effective deep learning architecture than 1D CNNs in crop classification using multi-temporal optical images [18], since they consider not only the spatial features but also the temporal features. Future work could focus on applying 3D CNNs to the combination of multi-temporal polarimetric SAR and optical data.
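To make the pixel-based nature of the Conv1D concrete: each pixel's stacked multi-temporal features form a one-dimensional sequence, learned filters slide along that sequence, and spatial neighbors are never seen. A minimal forward-pass sketch of one convolutional layer follows; the filter count and width are illustrative, not the paper's exact architecture:

```python
import numpy as np

def conv1d_forward(x, kernels, bias):
    """Valid-mode 1D convolution over a single pixel's feature vector.
    x: (length,) stacked multi-temporal features for one pixel.
    kernels: (n_filters, width) learned filters; bias: (n_filters,)."""
    n_filters, width = kernels.shape
    out_len = len(x) - width + 1
    out = np.empty((n_filters, out_len))
    for j in range(out_len):
        window = x[j:j + width]          # local slice of the sequence
        out[:, j] = kernels @ window + bias
    return np.maximum(out, 0.0)          # ReLU activation

# Illustrative: 24 stacked features, 4 filters of width 3
rng = np.random.default_rng(42)
x = rng.normal(size=24)
k = rng.normal(size=(4, 3))
b = np.zeros(4)
features = conv1d_forward(x, k, b)       # shape (4, 22)
```

A full network would stack several such layers, flatten the result, and end in a softmax over the class labels; a 3D CNN instead convolves over spatial and temporal axes jointly.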

Conclusions
In this study, a one-dimensional convolutional neural network (Conv1D) was proposed for cropland classification using multi-temporal fully polarimetric SAR and optical data. It was compared with two other deep learning methods, the MLP and the LSTM-based RNN, and three non-deep learning methods: XGBoost, RF, and SVM. We also evaluated the influence of different input features on the performance of the Conv1D by comparing it with the benchmark methods. The results show that the performance of all the methods varies with the input datasets; the Conv1D does not always perform better than the other methods. When the VENµS data were directly combined with the coherency matrix, the Conv1D performed slightly worse than the MLP, XGBoost, and SVM. However, the MNF transformation gave the best OA and improved the F-score values for most classes when the Conv1D was applied. In addition, the MNF transformation of the combination of the multi-temporal RADARSAT-2 coherency matrix and VENµS spectral data gives a similar OA to the MNF transformation of the combination of the Pauli decomposition and VENµS data. These findings indicate that the coherency matrix has great potential in crop classification and that the Conv1D can learn features from the MNF transformation of multi-temporal RADARSAT-2 and VENµS data better than the other classifiers.