Crop Classification Method Based on Optimal Feature Selection and Hybrid CNN-RF Networks for Multi-Temporal Remote Sensing Imagery

Abstract: Although efforts and progress have been made in crop classification using optical remote sensing images, it is still necessary to make full use of the high spatial, temporal, and spectral resolutions of remote sensing images. However, with the increasing volume of remote sensing data, a key emerging issue in the field of crop classification is how to find useful information in massive data while balancing classification accuracy and processing time. To address this challenge, we developed a novel crop classification method combining an optimal feature selection method (OFSM) with hybrid convolutional neural network-random forest (CNN-RF) networks for multi-temporal optical remote sensing images. This research used 234 features, including spectral, segmentation, color, and texture features, from three scenes of Sentinel-2 images to identify crop types in the Jilin province of northeast China. To extract effective features from the remote sensing data with lower time requirements, OFSM was proposed and its results were compared with two traditional feature selection methods (TFSM): random forest feature importance selection (RF-FI) and random forest recursive feature elimination (RF-RFE). Although the time required for OFSM was 26.05 s, between RF-FI at 1.97 s and RF-RFE at 132.54 s, OFSM outperformed RF-FI and RF-RFE in terms of the overall accuracy (OA) of crop classification by 4% and 0.3%, respectively. On the basis of the effective feature information obtained, to further improve the accuracy of crop classification we designed two hybrid CNN-RF networks that combine the advantages of one-dimensional convolution (Conv1D) and the Visual Geometry Group (VGG) network, respectively, with random forest (RF). Based on the optimal features selected using OFSM, four networks were tested for comparison: Conv1D-RF, VGG-RF, Conv1D, and VGG.
Conv1D-RF achieved the highest OA of 94.27%, compared with VGG-RF (93.23%), Conv1D (92.59%), and VGG (91.89%), indicating that the Conv1D-RF method with optimal feature input provides an effective and efficient time series representation for multi-temporal crop-type classification.


Introduction
In recent years, with the increase in satellites at different spatial, temporal, radiometric, and spectral resolutions, remote sensing techniques have emerged as optimal tools to identify crop types over large areas. Timely and accurate crop-type classification is essential for estimating crop yields, strengthening crop production management, and supporting crop insurance [1].
The main contributions of this paper are as follows. (1) One of the main innovations of this paper is OFSM, which differs from traditional feature selection methods; these fall into the filter, embedded, wrapper, and hybrid categories. The filter method selects features independently of the model used and is generally robust against overfitting and efficient in computation time. The wrapper method evaluates multiple subsets of the features and chooses the subset that gives the model the highest accuracy. Since the classifier must be trained many times, the computation time of the wrapper method (e.g., RFE) is usually much greater than that of the filter method. The embedded method (e.g., RF and XGBoost) interacts with the classifier and is less computationally intensive than the wrapper method, but it ignores the correlations among multiple features. OFSM is a hybrid of the filter, embedded, and wrapper methods and has advantages in both processing time and recognition accuracy. By considering the correlations among multiple features and the processing time during feature selection, the features selected by OFSM are mutually independent and the required processing time is acceptable. The experimental results demonstrate that OFSM performs optimally and that classification using the selected features is more accurate than sending the original image directly to the classifier. Thus, we show that feature selection is a critical preprocessing step prior to classification.
(2) Considering the advantages of multiple classifiers, we propose two hybrid CNN-RF networks that integrate the advantages of Conv1D and the Visual Geometry Group (VGG) network, respectively, with RF. A traditional CNN uses an FC layer to make the final classification decision, which is prone to overfitting, especially with inadequate samples, and is neither sufficiently robust nor computationally efficient. Using RF instead of the FC layer to make the final decision can effectively alleviate overfitting. At the same time, we aim to provide a reasonable scheme for selecting a CNN network structure for crop mapping based on multi-temporal remote sensing images; selecting the optimal hyperparameters for the CNN can further improve crop identification accuracy. The results demonstrate that the proposed hybrid networks integrate the advantages of the two classifiers and achieve better crop classification results than the original deep-learning networks.
In particular, the combination of the temporal feature representation network (Conv1D) and RF achieves the best crop classification results. Compared with mainstream networks (e.g., LSTM-RF, ResNet, and U-Net), the proposed Conv1D-RF still obtains better crop recognition results, indicating that the Conv1D-RF framework can mine more effective and efficient time series representations and achieve more accurate crop identification in multi-temporal classification tasks.
The remainder of this paper is organized as follows. Section 2 introduces the study area and data used in this work. Section 3 details the specific workflow of research, including OFSM and traditional feature selection methods (TFSM), classification methods based on hybrid CNN-RF networks and original deep-learning networks, and evaluation. Section 4 compares various classification results using the proposed method and other traditional methods. Discussions and conclusions are presented in Sections 5 and 6, respectively.

Study Area
The study area was in the Jilin province of northeast China (Figure 1) and is a major area for agricultural production. The climate is a temperate continental semi-humid monsoon pattern, warm and rainy in the summer and cold and humid in the winter, with an annual mean temperature of 7 °C. The field investigation was conducted from June to September 2017, the main growing season for crops. In the study area, there were 83 experimental measurements of four land cover types, including rice, urban, corn, and soybean, as shown in Table 1.

Data
Sentinel-2 imagery (L2C-level) was selected and processed by radiometric calibration and atmospheric correction. Sentinel-2 delivers high-resolution optical images for land monitoring, emergency response, and security services. The imagery provides a versatile set of 13 spectral bands spanning the visible, red-edge, near-infrared (NIR), and shortwave infrared (SWIR) regions, featuring four bands (B2, B3, B4, B8) at a 10 m spatial resolution, six bands (B5, B6, B7, B8A, B11, B12) at a 20 m spatial resolution, and three bands (B1, B9, B10) at a 60 m spatial resolution, as listed in Table 2. In this study, three scenes of Sentinel-2 imagery (28 June 2017, 18 July 2017, 11 September 2017) without cloud cover were collected in the crop growing season. Excluding the three "atmospheric" bands with a spatial resolution of 60 m, the other 10 bands of each scene were selected and resampled to a 10 m spatial resolution [34,35].

Table 2. Sentinel-2 band names, central wavelengths (µm), and spatial resolutions (m).

In Figure 2, by combining the field investigation samples with higher-resolution optical remote sensing images, four labels were created for the corresponding land cover types to form the reference dataset [36] for training, validation, and testing. Applying the training dataset, the individual classification model was trained by setting the parameters of the classifier. The validation dataset was used to select the optimal parameters for the model. The test dataset was used to evaluate the performance of the final classification. Here, the training, validation, and test datasets were independent of each other and randomly assigned in a ratio of 25%:25%:50%. There were 45,136 pixels in the training dataset, 44,312 in the validation dataset, and 87,898 in the test dataset.

Figure 3 presents a flowchart outlining the method used in this study. First, five types of features from the spatial and spectral information of multi-temporal remote sensing images are extracted as introduced in Section 3.1. Then, TFSM and the proposed OFSM are described in Section 3.2. In Section 3.3, two hybrid CNN-RF networks (Conv1D-RF and VGG-RF) are designed in detail. Finally, six parameters are introduced to evaluate the performance of the crop classification methods.
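The 25%:25%:50% random split described above can be sketched as follows; this is an illustrative reconstruction, not the paper's code, and the fixed seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
n_pixels = 177_346                     # 45,136 + 44,312 + 87,898 labeled pixels
idx = rng.permutation(n_pixels)        # shuffle all labeled pixel indices

n_train = n_pixels // 4                # 25% training
n_val = n_pixels // 4                  # 25% validation; remaining 50% testing
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

# The three index sets are disjoint and together cover every labeled pixel
assert len(set(train_idx) | set(val_idx) | set(test_idx)) == n_pixels
print(len(train_idx), len(val_idx), len(test_idx))
```

Shuffling once and slicing guarantees that the three datasets are mutually independent, as required for an unbiased accuracy evaluation.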

Feature Extraction
Feature extraction transforms the original data into a group of features with obvious physical or statistical significance. In this study, raw spectral features, color features, segmentation features, spectral index features, and texture features are extracted. These features combine the spectral and spatial information of land cover, and their combination can greatly improve crop recognition ability and accuracy in remote sensing images. As shown in Figure 4, a total of 234 features were extracted from the three scenes of Sentinel-2 images to identify crop types: 30 raw spectral features, 30 segmentation features using a graph-based segmentation algorithm [37], 9 color features extracted from HSI color space [38], 45 spectral index features [39] listed in Table 3, and 120 texture features [40] listed in Table 4. In this study, "seg" denotes a segmentation feature, "H" hue, "S" saturation, "I" intensity, "CON" contrast, "ENT" entropy, "ASM" angular second moment, and "HOM" homogeneity.
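As an illustration of the spectral index features, the following minimal sketch computes one common index (NDVI) from resampled Sentinel-2 bands; the toy reflectance arrays are hypothetical stand-ins for the B8 (NIR) and B4 (red) rasters:

```python
import numpy as np

def ndvi(nir, red, eps=1e-10):
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red)."""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red + eps)   # eps guards against zero denominators

# Toy 2 x 2 reflectance patches standing in for Sentinel-2 B8 (NIR) and B4 (red)
b8 = np.array([[0.40, 0.35], [0.50, 0.45]])
b4 = np.array([[0.10, 0.15], [0.10, 0.05]])
print(ndvi(b8, b4))   # vegetated pixels approach 1.0
```

The other 44 spectral index features are computed analogously, as per-pixel band-ratio formulas over the three image dates.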


Table 3. Spectral indices and their calculation formulas.

Table 4. Texture features and their statistical characteristics.
Homogeneity (HOM): measures local homogeneity.
Contrast (CON): measures the difference between the maximum and minimum values in the neighborhood.
Entropy (ENT): measures the disorder of the image.
Angular Second Moment (ASM): ASM = Σᵢ Σⱼ (f(i, j))², describing local stationarity.

Feature Selection
The basic types of approaches exploited in feature selection and reduction are filter, wrapper, embedded, and hybrid. First, two traditional feature selection methods (TFSM) are introduced, RF-FI and RF-RFE, where RF-FI is an embedded method and RF-RFE is a hybrid of the embedded and wrapper methods. On the basis of TFSM, OFSM is proposed as a hybrid of the filter, embedded, and wrapper methods. In the RF-FI method [41], the features are first sorted according to their importance scores, and then the unimportant features are eliminated. Here, we use the prediction performance of RF to quantify feature importance, including the out-of-bag (OOB) error and the feature importance (FI) measure, which are the key elements of the feature selection strategy. The features are ranked by sorting the FI in descending order, and the less important features are eliminated under a given threshold. Here, M denotes the number of retained features whose averaged FI exceeds this threshold. The RF-RFE selection method [42] is a recursive process that ranks features according to the FI measure given by RF. At each iteration, the least important feature is eliminated. The recursion is necessary during the stepwise elimination process because the relative importance of each feature can change substantially when evaluated over a different subset of features. The final ranking is constructed in the inverse order of feature elimination, and the selected set comprises the first M features of this ranking.
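The two TFSM baselines can be sketched with scikit-learn as follows; the synthetic data and parameter values here are illustrative, not those used in the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))            # 200 pixels, 20 candidate features
y = (X[:, 0] + X[:, 3] > 0).astype(int)   # labels driven by features 0 and 3

rf = RandomForestClassifier(n_estimators=50, random_state=0)

# RF-FI: rank features by the embedded RF importance and keep the top M
rf.fit(X, y)
M = 5
top_fi = np.argsort(rf.feature_importances_)[::-1][:M]

# RF-RFE: recursively drop the least important feature until M remain
rfe = RFE(estimator=rf, n_features_to_select=M, step=1).fit(X, y)
top_rfe = np.flatnonzero(rfe.support_)

print(sorted(top_fi), sorted(top_rfe))   # both selections should contain 0 and 3
```

RF-FI fits the forest once, while RFE refits it for every eliminated feature, which is why the paper measures RF-RFE at roughly two orders of magnitude more runtime than RF-FI.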

Optimal Feature Selection Method (OFSM)
In order to gain the advantages provided by the different feature selection methods, this study developed a hybrid method to increase the efficiency and provide a higher accuracy. The implementation steps of OFSM are as follows and the structure of OFSM is shown in Figure 5.
Step 1: Calculate the Spearman rank correlation coefficient [43,44] between the input features and the labels. The Spearman rank correlation coefficient tests the direction (negative or positive) and strength of the relationship between two variables. First, the measurements of a feature and of the labels are assigned ranks according to their descending positions in the total measurements. Then, the Spearman correlation coefficient ρs(FL) between the F-th feature (F) and the labels (L) is calculated by:

ρs(FL) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² × Σᵢ (yᵢ − ȳ)² )

where n is the number of measurements of each of the two variables (the feature and the label), xᵢ and yᵢ represent the ranks of the i-th measurements of the two variables, and x̄ and ȳ represent the average ranks of the two variables, respectively. ρs(FL) lies between 1.0 (a perfect positive correlation) and −1.0 (a perfect negative correlation). The larger the Spearman correlation coefficient, the stronger the monotonic relationship between the two variables. We then rank the features by sorting ρs(FL) in descending order and eliminate the less relevant features whose ρs(FL) value is smaller than a given threshold T1. By setting T1, the features with a strong monotonic relationship with the labels are retained. Here, M1 denotes the number of features retained after Step 1.
Step 2: The M1 features are ranked by sorting ρs(FL) in descending order, and we further calculate the rank correlation coefficients between these features. The measurements of two features are first assigned ranks based on their descending positions in the total measurements. Then, the Spearman correlation coefficient ρs(FF′) between any two features is calculated by:

ρs(FF′) = Σᵢ (xᵢ − x̄)(x′ᵢ − x̄′) / √( Σᵢ (xᵢ − x̄)² × Σᵢ (x′ᵢ − x̄′)² )

where n is the number of correlated measurements of the two selected features (F and F′), xᵢ and x′ᵢ represent the ranks of the i-th measurements of the two features, and x̄ and x̄′ represent their average ranks, respectively. A nested loop of ρs(FF′) between the baseline feature (F) and the other features (F′) is constructed to eliminate the strongly correlated features whose ρs(FF′) value is higher than a given threshold T2. Here, M2 denotes the number of features retained after Step 2.
Step 3: Given the number of final retained features M, we construct a nested collection of RF models involving K features for K = M2 down to M, eliminating the feature with the smallest feature importance (FI) at each iteration. Finally, the remaining M features compose the optimal feature combination.
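The three steps above can be sketched as follows. This is an illustrative re-implementation on synthetic data; the thresholds T1 and T2 and the target count M are chosen arbitrarily here, and a library routine such as scipy's spearmanr could replace the hand-rolled rank correlation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank(a):
    """1-based average ranks of a 1-D array, ties sharing the mean rank."""
    order = np.argsort(a, kind="stable")
    r = np.empty(len(a))
    r[order] = np.arange(1, len(a) + 1)
    for v in np.unique(a):              # average ranks over tied values
        mask = a == v
        r[mask] = r[mask].mean()
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    rx -= rx.mean(); ry -= ry.mean()
    return (rx * ry).sum() / np.sqrt((rx**2).sum() * (ry**2).sum())

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 12))
y = (X[:, 0] > 0).astype(int)                     # toy labels driven by feature 0
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=300)   # near-duplicate of feature 0

T1, T2, M = 0.2, 0.9, 3
# Step 1: keep features monotonically related to the labels
kept = [j for j in range(X.shape[1]) if abs(spearman(X[:, j], y)) >= T1]
# Step 2: drop features strongly correlated with an already-kept feature
step2 = []
for j in kept:
    if all(abs(spearman(X[:, j], X[:, k])) < T2 for k in step2):
        step2.append(j)
# Step 3: recursively drop the feature with the smallest RF importance
while len(step2) > M:
    rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[:, step2], y)
    step2.pop(int(np.argmin(rf.feature_importances_)))
print(step2)   # feature 0 survives; its near-duplicate feature 1 is removed in Step 2
```

Steps 1 and 2 are cheap filter passes, so the expensive RF refits of Step 3 start from the already-reduced M2 features, which is the source of OFSM's runtime advantage over pure RFE.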

Deep-Learning Classification
A traditional CNN usually uses an FC layer as the final decision. This section introduces two hybrid CNN-RF networks. The designed networks use original deep-learning networks to extract high-dimensional features and combine the advantages of RF instead of the FC layer to make the final classification decision.
3.3.1. Visual Geometry Group Combined with Random Forest (VGG-RF)
Figure 6 shows the architectures of the VGG-RF and VGG [45], including the convolutional layers, pooling layers, fully connected layers, and dropout in detail. This paper studies a pixel-based deep-learning method of crop classification, which is limited by the number of bands, so the width of the convolutional filter was set to 2. Consecutive 2 × 2 convolution kernels were selected to replace larger convolution kernels, deepening the network under the same receptive field. In the network structure of the VGG combined with random forest (VGG-RF), we tested the hyperparameters and selected the optimal values for training the network. The channel number of the first convolutional layer was tested at 32, 64, and 128; the optimal value was 64, the same as that in [46]. During training, the pooling layers were fixed to max-pooling with a window size of 2 × 2. Dropout is a regularization technique that randomly drops some neurons; the proportion of dropped neurons was set to 50%. VGG contains three fully connected layers at the output end, and the last fully connected layer contains four neurons, corresponding to the probabilities of the four classes: rice, urban, corn, and soybean. The 1024 × 1 feature vector output by the Fc8 layer was extracted and fed into random forest (RF) for classification. As a hybrid CNN-RF network, the designed VGG-RF used the high-dimensional features extracted by VGG and leveraged the advantages of RF, replacing the fully connected (FC) layer to make the final decision.


One-Dimensional Convolution Combined with Random Forest (Conv1D-RF)
The hybrid Conv1D-RF network uses the high-dimensional features extracted by Conv1D and leverages the advantages of RF, replacing the fully connected (FC) layer to make the final decision. This is similar to the hybrid CNN-SVM network [33], in which the CNN is used for feature extraction and the SVM classifier for classification. The proposed Conv1D-RF network uses one-dimensional convolution (Conv1D), which has great potential in temporal feature representation. Thus, we aim to combine the high-dimensional features extracted by the FC1 layer of Conv1D with RF. A traditional CNN uses an FC layer to make the final classification decision, which is prone to overfitting, especially with inadequate samples, and is neither sufficiently robust nor computationally efficient. Using RF instead of the FC layer to make the final decision can effectively alleviate overfitting [47]. Owing to its low requirement on sample size, the RF classifier can still obtain satisfactory decision results when the samples are insufficient. Similarly, other studies have attempted to replace the FC layer with other structures; for example, a convolutional layer is used in place of the FC layer in FCN [48]. As shown in Figure 7, Conv1D is a special form of CNN implemented with convolutional layers, pooling layers, fully connected layers, and dropout. The convolutional filter width was set to 3. The number of channels in the first convolutional layer was set to 64, and the channel number increased with depth. The proportions of dropped neurons were set to 40% and 50%. We designed and tested an inception module to concatenate convolutional and pooling layers of different sizes. The input of this module had three branches: two convolutional layers with filter widths of 3 and 5, respectively, and one max-pooling layer with a filter width of 2. Convolution was performed simultaneously on multiple scales, and features at different scales were extracted.
The obtained multi-scale features were concatenated to make the subsequent classification decisions more accurate. Conv1D contains two fully connected layers at the output end, and the last fully connected layer contains four neurons, corresponding to the probabilities of the four classes. The 512 × 1 feature vector output by the first fully connected layer was extracted and fed into random forest (RF) for classification. As a hybrid CNN-RF network, the designed Conv1D-RF used the high-dimensional features extracted by Conv1D and leveraged the advantages of RF, replacing the FC layer to make the final decision.
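The hybrid idea of a convolutional feature extractor feeding a random forest can be sketched as follows. This is a deliberately simplified stand-in: in the paper the 512 × 1 FC1 activations of the trained Conv1D are used, whereas here an untrained NumPy Conv1D layer with random kernels plays the extractor role, and the data and labels are synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def conv1d_relu(x, kernels):
    """Valid 1-D convolution of a feature sequence with a bank of kernels, then ReLU."""
    n, width = len(x), kernels.shape[1]
    out = np.array([[np.dot(x[i:i + width], k) for i in range(n - width + 1)]
                    for k in kernels])
    return np.maximum(out, 0.0).ravel()   # flattened feature map as the RF input

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                  # 200 pixels, 16 temporal features each
y = (X[:, :8].sum(axis=1) > 0).astype(int)      # toy two-class labels
kernels = rng.normal(size=(4, 3))               # 4 filters of width 3 (cf. width 3 above)

feats = np.array([conv1d_relu(x, kernels) for x in X])
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(feats[:150], y[:150])
acc = rf.score(feats[150:], y[150:])            # RF makes the final decision
print(round(acc, 2))
```

The design choice is that the convolutional stage produces a fixed-length high-dimensional representation of each pixel's time series, while the forest, rather than an FC softmax head, casts the final vote, which is less sensitive to small training sets.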

Evaluation
The crop classification accuracy of each network was evaluated on the test dataset. We applied six pixel-based evaluation metrics: overall accuracy (OA), K coefficient, precision, recall, F1 score, and intersection-over-union (IoU).
Figure 7. Architecture of the one-dimensional convolution combined with random forest (Conv1D-RF) and the Conv1D model.

OA is expressed as:

OA = \frac{\sum_{i=1}^{n} p_{i,i}}{\sum_{i=1}^{n} \sum_{j=1}^{n} p_{i,j}}

where p_{i,j} represents the total number of pixels that belong to class i and are assigned to class j, and n represents the number of categories. The K coefficient is:

K = \frac{N \sum_{i=1}^{n} p_{i,i} - \sum_{i=1}^{n} a_i b_i}{N^2 - \sum_{i=1}^{n} a_i b_i}

where N denotes the total number of samples, a_1, a_2, \ldots, a_n are the numbers of real samples of each type, and b_1, b_2, \ldots, b_n are the numbers of samples predicted as each type.
By comparison with the ground truth (GT), true positive (TP), false positive (FP), and false negative (FN) represent the numbers of correctly extracted, incorrectly extracted, and missed pixels of a class, respectively. Using these counts, precision and recall are defined as:

Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}

The F1 score is the harmonic mean of precision and recall:

F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}

IoU describes the overlap rate between the crop classification result and the GT:

IoU = \frac{TP}{TP + FP + FN}
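All six metrics can be computed directly from a confusion matrix. The small function below implements the definitions above; the toy matrix is illustrative only.

```python
import numpy as np

def metrics(cm):
    """Compute OA, kappa, and per-class precision/recall/F1/IoU from a
    confusion matrix with rows = ground truth and columns = prediction."""
    cm = np.asarray(cm, dtype=float)
    N = cm.sum()
    oa = np.trace(cm) / N
    # Kappa: agreement corrected for chance, from the row/column marginals.
    a, b = cm.sum(axis=1), cm.sum(axis=0)
    kappa = (N * np.trace(cm) - (a * b).sum()) / (N**2 - (a * b).sum())
    tp = np.diag(cm)
    fp = b - tp            # predicted as class i but belonging elsewhere
    fn = a - tp            # belonging to class i but predicted elsewhere
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return oa, kappa, precision, recall, f1, iou

# Toy 3-class confusion matrix (not results from the paper).
cm = [[50, 2, 3], [4, 40, 1], [2, 3, 45]]
oa, kappa, p, r, f1, iou = metrics(cm)
print(round(oa, 3), round(kappa, 3))
```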

Feature Selection Comparison
In order to effectively compare OFSM with TFSM, all feature selection methods used the same classifier and importance evaluation criteria. The random forest (RF) classifier was used, and its parameters (n_estimators, max_depth, min_samples_leaf, min_samples_split, and max_features) were obtained by grid search.
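A grid search over these RF parameters can be written as follows. The grid values and data here are illustrative placeholders, not the settings used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class stand-in for the per-pixel feature table.
X, y = make_classification(n_samples=300, n_features=16, n_informative=8,
                           n_classes=4, random_state=0)

# Small illustrative grid over the five parameters named in the text.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, None],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [2, 4],
    "max_features": ["sqrt", None],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```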

Features from OFSM
The OFSM presented in Section 3.2 was used to select the optimal feature combination from the 234 input features. First, the Spearman rank correlation coefficient ρ_s^FL was used to calculate the correlation between the input features and the labels (e.g., rice, corn, soybean, urban). Then, we ranked the features by sorting ρ_s^FL in descending order and eliminated the unimportant features satisfying ρ_s^FL < 0.2, leaving 88 features. The top 88 correlation coefficients between the input features and the labels are shown in Figure 8a. Comparative analysis showed a stronger correlation between the June features and the labels, indicating that the June features contribute greatly to identifying different crops. Moreover, the features with stronger correlation were concentrated in the raw spectral features and segmentation features of the red-edge bands (e.g., B7 and B8A), the SWIR bands (B11 and B12), and the NIR band (B8), indicating that these bands can provide specific spectral information to improve crop identification. The correlation coefficients between the 88 features are shown in Figure 8b; the results demonstrated that much redundant information still existed among the remaining features.
We then calculated the correlation ρ_s^FF between the 88 features and constructed nested loops to continuously eliminate strongly correlated features satisfying ρ_s^FF > 0.9. Thus, 33 features were retained, which greatly reduced the time consumption of subsequent processing. The correlation between the 33 features is shown in Figure 9; the redundancy between the retained 33 features was greatly reduced. Finally, we computed the feature importance (FI) of the RF in nested models starting from the 33 features, ranked the features by sorting the FI in descending order, and eliminated the feature with the smallest FI until 16 features were retained. Table 5 shows the optimal feature combination selected by OFSM.
From the resulting feature combination, it can be seen that segmentation features contribute greatly to the classification of crops. OFSM also selected the saturation feature in September as an important feature for effectively identifying crops. Similar to the raw spectral feature selection results using RF-RFE, the red-edge band centered at 705 nm (B5) and the SWIR band centered at 2190 nm (B12) were also found by OFSM to be the most important spectral bands for identifying crops, showing that the SWIR and red-edge bands can indeed provide effective spectral information for crop identification. OFSM and RF-RFE both selected the spectral index Green Atmospherically Resistant Vegetation Index (GARI) in June as an important feature. These results demonstrate that the combination of spatial, spectral, and color information is of great significance for the classification of crops, whereas the contribution of texture information is not obvious. Figure 10 shows the Spearman rank correlation coefficients between the 16 features selected by each of the three feature selection methods.
The features are arranged in the order of raw spectral features, segmentation features, spectral index features, color features, and texture features. The 16 features selected by TFSM, including RF-FI and RF-RFE, were still highly correlated, meaning that redundant information remained between the selected features. Compared with the RF-FI method, the features selected by RF-RFE contained more redundant information. Moreover, the correlation between the segmentation features and the spectral index features was high for the RF-FI method, and the correlation between the raw spectral features and the segmentation features was high for the RF-RFE method. By contrast, the features selected by OFSM were relatively independent, and the redundant information between them was small.
Remote Sens. 2020, 12, x FOR PEER REVIEW
Table 6 lists the time consumption of the three feature selection methods. The time consumption of OFSM was at an intermediate level, while those of RF-FI and RF-RFE were the smallest and largest, respectively. According to the designed feature selection strategy, OFSM greatly reduced the time consumption. In summary, the features selected by OFSM were independent of each other and the time consumption was acceptable.
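The three-stage OFSM filter described in this subsection (label-correlation threshold 0.2, feature-correlation threshold 0.9, then recursive RF-importance elimination) can be sketched as follows. The thresholds come from the text; the synthetic data, feature counts, and target size are illustrative only.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))          # 300 samples, 40 candidate features
y = rng.integers(0, 4, size=300)        # 4 classes, e.g. rice/corn/soybean/urban
X[:, 0] += y                            # make two features informative
X[:, 1] -= y
X[:, 2] = X[:, 0] + rng.normal(scale=0.1, size=300)  # redundant copy of feature 0

# Stage 1: drop features weakly correlated with the labels (|rho_FL| < 0.2).
rho_fl = np.array([abs(spearmanr(X[:, j], y).correlation) for j in range(X.shape[1])])
keep = np.where(rho_fl >= 0.2)[0]
keep = keep[np.argsort(-rho_fl[keep])]  # strongest correlation first

# Stage 2: among survivors, drop the weaker of any pair with |rho_FF| > 0.9.
selected = []
for j in keep:
    if all(abs(spearmanr(X[:, j], X[:, k]).correlation) <= 0.9 for k in selected):
        selected.append(j)

# Stage 3: repeatedly drop the least RF-important feature until the target
# count remains (16 in the paper; 2 here for the toy data).
target = 2
while len(selected) > target:
    rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[:, selected], y)
    selected.pop(int(np.argmin(rf.feature_importances_)))

print(selected)
```

Note how the redundant copy (feature 2) is eliminated at stage 2 even though its label correlation is as strong as feature 0's; a pure importance ranking would keep both.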

Deep-Learning Network Hyperparameter Selection
The hyperparameter settings of deep-learning networks usually affect the training results. To select the optimal hyperparameters for the hybrid CNN-RF networks, we tested the common hyperparameters, including num_filter1, convolution kernel_size, pooling kernel_size, learning_rate, dropout, max_iterations, and batch_size. The tested and optimal hyperparameters of VGG-RF and Conv1D-RF are listed in Table 7. The experimental results show that, with the optimal hyperparameters, the hybrid CNN-RF networks achieved the best training efficiency and accuracy.
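The sweep described above amounts to iterating over a hyperparameter grid and keeping the configuration with the best validation score. In this hedged sketch the grid values are hypothetical and `train_and_evaluate` is a placeholder for the actual network training and validation step.

```python
import itertools

# Hypothetical grid over the hyperparameters named in the text.
grid = {
    "num_filter1": [32, 64],
    "kernel_size": [3, 5],
    "learning_rate": [1e-3, 1e-4],
    "dropout": [0.4, 0.5],
    "batch_size": [64, 128],
}

def train_and_evaluate(cfg):
    # Placeholder standing in for training the CNN-RF and returning its
    # validation accuracy; here it just prefers one arbitrary configuration.
    return -abs(cfg["learning_rate"] - 1e-3) - abs(cfg["dropout"] - 0.4)

best_cfg, best_score = None, float("-inf")
for values in itertools.product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    score = train_and_evaluate(cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score

print(best_cfg)
```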

Classification and Accuracy Assessment
Based on the selected optimal features using OFSM and TFSM, four networks were tested for comparison: Conv1D-RF, VGG-RF, Conv1D, and VGG. Since the most satisfactory classification results were achieved by the designed hybrid Conv1D-RF network, we further compared Conv1D-RF with three mainstream networks, including LSTM-RF, ResNet, and U-Net.

Comparison of the Hybrid CNN-RF Networks with the Original Deep-Learning Networks
In this study, the training and validation datasets were used to train the deep-learning networks and select their optimal parameters. The classification results of Conv1D and VGG represent the performance of popular deep-learning algorithms, while Conv1D-RF and VGG-RF, as presented in Section 3.3, represent the fused performance of deep-learning and machine-learning algorithms. Table 8 shows the classification results of the hybrid CNN-RF networks and the original deep-learning networks based on the features selected using OFSM and TFSM. The Conv1D and VGG classifiers performed worse than Conv1D-RF and VGG-RF, producing many areas with greater "speckle" (i.e., more class heterogeneity) across the landscape. Compared with the TFSM results, the OFSM classification results contained less noise, and the misclassification and omission of the hard-to-distinguish corn and soybean were reduced. Conv1D-RF based on OFSM in particular improved the crop-type classification results compared with the other three networks.
To evaluate the effectiveness of the hybrid CNN-RF networks and the original deep-learning networks, comparisons were conducted on the test dataset. As shown in Table 9, the OA and K coefficient obtained with OFSM were superior to those obtained with TFSM. The classification results of the hybrid CNN-RF networks were better than those of the original deep-learning networks with softmax: the OA of Conv1D-RF was 1.7% higher than that of Conv1D, and the OA of VGG-RF was 1.3% higher than that of VGG. Feeding the features extracted by the deep-learning networks into the random forest (RF) classifier thus better extracts and identifies useful crop information. The experimental results demonstrate that the hybrid networks can make full use of the advantages of the two classifiers to effectively identify crops.

Although the time consumption of RF-RFE was the largest, the OA of RF-RFE was clearly higher than that of RF-FI. The OA of OFSM was slightly higher than that of RF-RFE (by 0.3%) and much higher than that of RF-FI (by 4%). Conv1D and VGG showed distinct capabilities: Conv1D employed one-dimensional filters to capture the temporal patterns or shape features of the input sequence. For multi-temporal features, the OA of Conv1D-RF was 1% higher than that of VGG-RF, and the OA of Conv1D was 0.7% higher than that of VGG. Conv1D-RF based on OFSM had the highest accuracy (94.27%) and K coefficient (0.917) among all the networks, so this hybrid network is considered the best choice in this study.

Comparison of Conv1D-RF with Mainstream Networks
This study compared the hybrid Conv1D-RF network with popular deep-learning networks, including LSTM-RF [49], ResNet [50], and U-Net [51]. LSTM-RF uses RF instead of the FC layer to make the final classification decision; ResNet uses global average pooling (GAP) instead of the FC layer; and U-Net is a fully convolutional network without FC layers.

OFSM
To evaluate the effectiveness of the hybrid CNN-RF networks and original deep-learning networks, comparisons were conducted based on the test dataset. As shown in Table 9, the OA and K coefficient of OFSM were superior to those of TFSM. The classification results of the hybrid CNN-RF networks were better than those of the original deep-learning networks with softmax. The OA of Conv1D-RF was higher than that of Conv1D by 1.7%, and OA of VGG-RF was higher than that of VGG by 1.3%. The features extracted by the deep-learning networks were further input into random forest (RF) classifier for analysis, which can better extract and identify useful crop information. The experimental results demonstrate that the hybrid networks can make full use of the advantages of the two classifiers to effectively identify crops.
Although the time consumption of RF-RFE was the largest, the OA of RF-RFE was obviously higher than that of RF-FI. The OA of OFSM was slightly higher than that of RF-RFE by 0.3%, and much higher than that of RF-FI by 4%. Conv1D and VGG showed a distinct capability. Conv1D employed one-dimensional filters to capture the temporal pattern or shape features of the input sequence. The OA of Conv1D-RF was higher than that of VGG-RF by 1%, and the OA of Conv1D was higher than that of VGG by 0.7% for multi-temporal features. The Conv1D-RF based on OFSM had the highest accuracy (94.27%) and K coefficient (0.917) among all types of networks, so this hybrid network is assumed to be the best choice in this study.

Comparison of Conv1D-RF with Mainstream Networks
This study compared the hybrid Conv1D-RF network with popular deep-learning based networks, including LSTM-RF [49], ResNet [50], and U-Net [51]. LSTM-RF uses RF instead of the FC layer to make the final classification decision. ResNet uses global average pooling (GAP) instead of the FC layer and U-Net is a fully convolutional network without FC layers.

OFSM
To evaluate the effectiveness of the hybrid CNN-RF networks and original deep-learning networks, comparisons were conducted based on the test dataset. As shown in Table 9, the OA and K coefficient of OFSM were superior to those of TFSM. The classification results of the hybrid CNN-RF networks were better than those of the original deep-learning networks with softmax. The OA of Conv1D-RF was higher than that of Conv1D by 1.7%, and OA of VGG-RF was higher than that of VGG by 1.3%. The features extracted by the deep-learning networks were further input into random forest (RF) classifier for analysis, which can better extract and identify useful crop information. The experimental results demonstrate that the hybrid networks can make full use of the advantages of the two classifiers to effectively identify crops.
Although the time consumption of RF-RFE was the largest, the OA of RF-RFE was obviously higher than that of RF-FI. The OA of OFSM was slightly higher than that of RF-RFE by 0.3%, and much higher than that of RF-FI by 4%. Conv1D and VGG showed a distinct capability. Conv1D employed one-dimensional filters to capture the temporal pattern or shape features of the input sequence. The OA of Conv1D-RF was higher than that of VGG-RF by 1%, and the OA of Conv1D was higher than that of VGG by 0.7% for multi-temporal features. The Conv1D-RF based on OFSM had the highest accuracy (94.27%) and K coefficient (0.917) among all types of networks, so this hybrid network is assumed to be the best choice in this study.

Comparison of Conv1D-RF with Mainstream Networks
This study compared the hybrid Conv1D-RF network with popular deep-learning based networks, including LSTM-RF [49], ResNet [50], and U-Net [51]. LSTM-RF uses RF instead of the FC layer to make the final classification decision. ResNet uses global average pooling (GAP) instead of the FC layer and U-Net is a fully convolutional network without FC layers.

OFSM
To evaluate the effectiveness of the hybrid CNN-RF networks and original deep-learning networks, comparisons were conducted based on the test dataset. As shown in Table 9, the OA and K coefficient of OFSM were superior to those of TFSM. The classification results of the hybrid CNN-RF networks were better than those of the original deep-learning networks with softmax. The OA of Conv1D-RF was higher than that of Conv1D by 1.7%, and OA of VGG-RF was higher than that of VGG by 1.3%. The features extracted by the deep-learning networks were further input into random forest (RF) classifier for analysis, which can better extract and identify useful crop information. The experimental results demonstrate that the hybrid networks can make full use of the advantages of the two classifiers to effectively identify crops.
Although the time consumption of RF-RFE was the largest, the OA of RF-RFE was obviously higher than that of RF-FI. The OA of OFSM was slightly higher than that of RF-RFE by 0.3%, and much higher than that of RF-FI by 4%. Conv1D and VGG showed a distinct capability. Conv1D employed one-dimensional filters to capture the temporal pattern or shape features of the input sequence. The OA of Conv1D-RF was higher than that of VGG-RF by 1%, and the OA of Conv1D was higher than that of VGG by 0.7% for multi-temporal features. The Conv1D-RF based on OFSM had the highest accuracy (94.27%) and K coefficient (0.917) among all types of networks, so this hybrid network is assumed to be the best choice in this study.

Comparison of Conv1D-RF with Mainstream Networks
This study compared the hybrid Conv1D-RF network with popular deep-learning based networks, including LSTM-RF [49], ResNet [50], and U-Net [51]. LSTM-RF uses RF instead of the FC layer to make the final classification decision. ResNet uses global average pooling (GAP) instead of the FC layer and U-Net is a fully convolutional network without FC layers.

OFSM
To evaluate the effectiveness of the hybrid CNN-RF networks and original deep-learning networks, comparisons were conducted based on the test dataset. As shown in Table 9, the OA and K coefficient of OFSM were superior to those of TFSM. The classification results of the hybrid CNN-RF networks were better than those of the original deep-learning networks with softmax. The OA of Conv1D-RF was higher than that of Conv1D by 1.7%, and OA of VGG-RF was higher than that of VGG by 1.3%. The features extracted by the deep-learning networks were further input into random forest (RF) classifier for analysis, which can better extract and identify useful crop information. The experimental results demonstrate that the hybrid networks can make full use of the advantages of the two classifiers to effectively identify crops.
Although the time consumption of RF-RFE was the largest, the OA of RF-RFE was obviously higher than that of RF-FI. The OA of OFSM was slightly higher than that of RF-RFE by 0.3%, and much higher than that of RF-FI by 4%. Conv1D and VGG showed a distinct capability. Conv1D employed one-dimensional filters to capture the temporal pattern or shape features of the input sequence. The OA of Conv1D-RF was higher than that of VGG-RF by 1%, and the OA of Conv1D was higher than that of VGG by 0.7% for multi-temporal features. The Conv1D-RF based on OFSM had the highest accuracy (94.27%) and K coefficient (0.917) among all types of networks, so this hybrid network is assumed to be the best choice in this study. This study compared the hybrid Conv1D-RF network with popular deep-learning based networks, including LSTM-RF [49], ResNet [50], and U-Net [51]. LSTM-RF uses RF instead of the FC layer to make the final classification decision. ResNet uses global average pooling (GAP) instead of the FC layer and U-Net is a fully convolutional network without FC layers.

OFSM
To evaluate the effectiveness of the hybrid CNN-RF networks and original deep-learning networks, comparisons were conducted based on the test dataset. As shown in Table 9, the OA and K coefficient of OFSM were superior to those of TFSM. The classification results of the hybrid CNN-RF networks were better than those of the original deep-learning networks with softmax. The OA of Conv1D-RF was higher than that of Conv1D by 1.7%, and OA of VGG-RF was higher than that of VGG by 1.3%. The features extracted by the deep-learning networks were further input into random forest (RF) classifier for analysis, which can better extract and identify useful crop information. The experimental results demonstrate that the hybrid networks can make full use of the advantages of the two classifiers to effectively identify crops.
Although the time consumption of RF-RFE was the largest, the OA of RF-RFE was obviously higher than that of RF-FI. The OA of OFSM was slightly higher than that of RF-RFE by 0.3%, and much higher than that of RF-FI by 4%. Conv1D and VGG showed a distinct capability. Conv1D employed one-dimensional filters to capture the temporal pattern or shape features of the input sequence. The OA of Conv1D-RF was higher than that of VGG-RF by 1%, and the OA of Conv1D was higher than that of VGG by 0.7% for multi-temporal features. The Conv1D-RF based on OFSM had the highest accuracy (94.27%) and K coefficient (0.917) among all types of networks, so this hybrid network is assumed to be the best choice in this study. This study compared the hybrid Conv1D-RF network with popular deep-learning based networks, including LSTM-RF [49], ResNet [50], and U-Net [51]. LSTM-RF uses RF instead of the FC layer to make the final classification decision. ResNet uses global average pooling (GAP) instead of the FC layer and U-Net is a fully convolutional network without FC layers.

OFSM
To evaluate the effectiveness of the hybrid CNN-RF networks and original deep-learning networks, comparisons were conducted based on the test dataset. As shown in Table 9, the OA and K coefficient of OFSM were superior to those of TFSM. The classification results of the hybrid CNN-RF networks were better than those of the original deep-learning networks with softmax. The OA of Conv1D-RF was higher than that of Conv1D by 1.7%, and OA of VGG-RF was higher than that of VGG by 1.3%. The features extracted by the deep-learning networks were further input into random forest (RF) classifier for analysis, which can better extract and identify useful crop information. The experimental results demonstrate that the hybrid networks can make full use of the advantages of the two classifiers to effectively identify crops.
Although the time consumption of RF-RFE was the largest, the OA of RF-RFE was obviously higher than that of RF-FI. The OA of OFSM was slightly higher than that of RF-RFE by 0.3%, and much higher than that of RF-FI by 4%. Conv1D and VGG showed a distinct capability. Conv1D employed one-dimensional filters to capture the temporal pattern or shape features of the input sequence. The OA of Conv1D-RF was higher than that of VGG-RF by 1%, and the OA of Conv1D was higher than that of VGG by 0.7% for multi-temporal features. The Conv1D-RF based on OFSM had the highest accuracy (94.27%) and K coefficient (0.917) among all types of networks, so this hybrid network is assumed to be the best choice in this study. This study compared the hybrid Conv1D-RF network with popular deep-learning based networks, including LSTM-RF [49], ResNet [50], and U-Net [51]. LSTM-RF uses RF instead of the FC layer to make the final classification decision. ResNet uses global average pooling (GAP) instead of the FC layer and U-Net is a fully convolutional network without FC layers.

OFSM
To evaluate the effectiveness of the hybrid CNN-RF networks and original deep-learning networks, comparisons were conducted based on the test dataset. As shown in Table 9, the OA and K coefficient of OFSM were superior to those of TFSM. The classification results of the hybrid CNN-RF networks were better than those of the original deep-learning networks with softmax. The OA of Conv1D-RF was higher than that of Conv1D by 1.7%, and OA of VGG-RF was higher than that of VGG by 1.3%. The features extracted by the deep-learning networks were further input into random forest (RF) classifier for analysis, which can better extract and identify useful crop information. The experimental results demonstrate that the hybrid networks can make full use of the advantages of the two classifiers to effectively identify crops.
Although the time consumption of RF-RFE was the largest, the OA of RF-RFE was obviously higher than that of RF-FI. The OA of OFSM was slightly higher than that of RF-RFE by 0.3%, and much higher than that of RF-FI by 4%. Conv1D and VGG showed a distinct capability. Conv1D employed one-dimensional filters to capture the temporal pattern or shape features of the input sequence. The OA of Conv1D-RF was higher than that of VGG-RF by 1%, and the OA of Conv1D was higher than that of VGG by 0.7% for multi-temporal features. The Conv1D-RF based on OFSM had the highest accuracy (94.27%) and K coefficient (0.917) among all types of networks, so this hybrid network is assumed to be the best choice in this study. This study compared the hybrid Conv1D-RF network with popular deep-learning based networks, including LSTM-RF [49], ResNet [50], and U-Net [51]. LSTM-RF uses RF instead of the FC layer to make the final classification decision. ResNet uses global average pooling (GAP) instead of the FC layer and U-Net is a fully convolutional network without FC layers.

OFSM
To evaluate the effectiveness of the hybrid CNN-RF networks and original deep-learning networks, comparisons were conducted based on the test dataset. As shown in Table 9, the OA and K coefficient of OFSM were superior to those of TFSM. The classification results of the hybrid CNN-RF networks were better than those of the original deep-learning networks with softmax. The OA of Conv1D-RF was higher than that of Conv1D by 1.7%, and OA of VGG-RF was higher than that of VGG by 1.3%. The features extracted by the deep-learning networks were further input into random forest (RF) classifier for analysis, which can better extract and identify useful crop information. The experimental results demonstrate that the hybrid networks can make full use of the advantages of the two classifiers to effectively identify crops.
Although the time consumption of RF-RFE was the largest, the OA of RF-RFE was obviously higher than that of RF-FI. The OA of OFSM was slightly higher than that of RF-RFE by 0.3%, and much higher than that of RF-FI by 4%. Conv1D and VGG showed a distinct capability. Conv1D employed one-dimensional filters to capture the temporal pattern or shape features of the input sequence. The OA of Conv1D-RF was higher than that of VGG-RF by 1%, and the OA of Conv1D was higher than that of VGG by 0.7% for multi-temporal features. The Conv1D-RF based on OFSM had the highest accuracy (94.27%) and K coefficient (0.917) among all types of networks, so this hybrid network is assumed to be the best choice in this study. This study compared the hybrid Conv1D-RF network with popular deep-learning based networks, including LSTM-RF [49], ResNet [50], and U-Net [51]. LSTM-RF uses RF instead of the FC layer to make the final classification decision. ResNet uses global average pooling (GAP) instead of the FC layer and U-Net is a fully convolutional network without FC layers.

OFSM
To evaluate the effectiveness of the hybrid CNN-RF networks and original deep-learning networks, comparisons were conducted based on the test dataset. As shown in Table 9, the OA and K coefficient of OFSM were superior to those of TFSM. The classification results of the hybrid CNN-RF networks were better than those of the original deep-learning networks with softmax. The OA of Conv1D-RF was higher than that of Conv1D by 1.7%, and OA of VGG-RF was higher than that of VGG by 1.3%. The features extracted by the deep-learning networks were further input into random forest (RF) classifier for analysis, which can better extract and identify useful crop information. The experimental results demonstrate that the hybrid networks can make full use of the advantages of the two classifiers to effectively identify crops.
Although the time consumption of RF-RFE was the largest, the OA of RF-RFE was obviously higher than that of RF-FI. The OA of OFSM was slightly higher than that of RF-RFE by 0.3%, and much higher than that of RF-FI by 4%. Conv1D and VGG showed a distinct capability. Conv1D employed one-dimensional filters to capture the temporal pattern or shape features of the input sequence. The OA of Conv1D-RF was higher than that of VGG-RF by 1%, and the OA of Conv1D was higher than that of VGG by 0.7% for multi-temporal features. The Conv1D-RF based on OFSM had the highest accuracy (94.27%) and K coefficient (0.917) among all types of networks, so this hybrid network is assumed to be the best choice in this study. This study compared the hybrid Conv1D-RF network with popular deep-learning based networks, including LSTM-RF [49], ResNet [50], and U-Net [51]. LSTM-RF uses RF instead of the FC layer to make the final classification decision. ResNet uses global average pooling (GAP) instead of the FC layer and U-Net is a fully convolutional network without FC layers.


Comparison of Conv1D-RF with Mainstream Networks
This study compared the hybrid Conv1D-RF network with popular deep-learning networks, including LSTM-RF [49], ResNet [50], and U-Net [51]. LSTM-RF uses RF instead of the FC layer to make the final classification decision, ResNet uses global average pooling (GAP) instead of the FC layer, and U-Net is a fully convolutional network without FC layers. Table 10 shows the overall accuracy (OA) and Kappa (K) coefficient of the four deep-learning networks using the three feature selection methods. The proposed hybrid Conv1D-RF network combined with OFSM achieved the highest OA of 94.27%, compared with ResNet (93.55%), LSTM-RF (92.91%), and U-Net (91.92%). To further analyze the superiority of Conv1D-RF, we applied four pixel-based evaluation parameters, precision, recall, F1 score, and IoU, to evaluate the per-class performance of the four deep-learning networks under the three feature selection methods. Figure 11 shows the analysis of the evaluation parameters of the four deep-learning networks using OFSM and TFSM. The experimental results demonstrate that Conv1D-RF outperformed the other three networks for the same feature combinations obtained by OFSM and TFSM. The performance of the four deep-learning networks for rice recognition was similar, but for corn and soybean, which are difficult to distinguish, Conv1D-RF achieved the best recognition results among the four networks. In particular, the combination of Conv1D-RF and OFSM obtained the best crop recognition results and performed well on all four evaluation parameters.
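All of the evaluation parameters above (OA, Kappa, and per-class precision, recall, F1, and IoU) can be derived from a single confusion matrix. The following sketch shows the standard definitions; the two-class counts in the example are illustrative, not from the paper.

```python
import numpy as np

def per_class_metrics(cm):
    """cm[i, j] = number of pixels of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # predicted as the class but wrong
    fn = cm.sum(axis=1) - tp          # belonging to the class but missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)         # intersection over union per class
    oa = tp.sum() / cm.sum()          # overall accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / cm.sum() ** 2
    kappa = (oa - pe) / (1 - pe)      # Kappa coefficient
    return precision, recall, f1, iou, oa, kappa

# Toy two-class example (counts are illustrative only).
p, r, f1, iou, oa, kappa = per_class_metrics([[40, 10], [20, 30]])
```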

Analysis of Feature Selection Using OFSM
Current studies often input the raw bands of remote sensing images directly into deep-learning models. For example, Kussul et al. [52] input the raw bands of multi-temporal Landsat-8 and Sentinel-1A images into 1D CNN and 2D CNN networks for training, and Ji et al. [3] input the raw bands of multi-temporal GF-2 images into a 3D CNN network for crop recognition. To illustrate the necessity of feature selection, we compared the crop recognition accuracy and the network training time obtained with the optimal feature combination against those obtained with the raw spectral bands.
The 30 raw spectral bands of multi-temporal Sentinel-2 images and the 16 feature bands selected by OFSM were each input into the Conv1D-RF and VGG-RF models. The OA and training time of the two hybrid networks are shown in Table 11. The features selected by OFSM were superior to the raw spectral bands in both time consumption and OA, indicating that the proposed method can be applied for crop identification with high efficiency and accuracy.
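For reference, the two TFSM baselines compared against OFSM throughout this study (RF-FI and RF-RFE) can both be reproduced with scikit-learn. The sketch below reduces 30 toy bands to 16, mirroring the band counts in Table 11; the data are random stand-ins, not the study's imagery.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 30))      # toy stand-in for 30 raw spectral bands
y = rng.integers(0, 4, 300)             # 4 crop/land-cover classes

# RF-FI: rank bands by random forest impurity importance, keep the top 16.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
fi_idx = np.argsort(rf.feature_importances_)[::-1][:16]

# RF-RFE: recursively drop the weakest band until 16 remain.  This refits
# the forest at every elimination step, which explains the much larger
# time consumption reported for RF-RFE relative to RF-FI.
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
          n_features_to_select=16).fit(X, y)
rfe_idx = np.flatnonzero(rfe.support_)
```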


Conv1D Feature Map Visualization
The characteristics of a Conv1D-based network can be inspected by visualizing the feature maps of its different layers. For example, Zhong et al. [26] inspected the behavior of a Conv1D-based model by visualizing the activations on different layers: the shallow layers of the Conv1D model captured local feature variations, while the higher layers focused on the overall feature patterns. The Conv1D layers act as a multi-level feature extractor in crop classification tasks, automatically extracting features from the input time series during training. We used visualization techniques to examine what the deep-learning network learns and how it understands the optimal input features from the time series. Figure 12 visualizes the output feature maps obtained from the training dataset for the four classes using Conv1D, including the first Conv1D layer, the inception module, the second Conv1D layer, and the third Conv1D layer. The output feature map size of the first Conv1D layer is 14 × 64, that of the inception module is 29 × 128, that of the second Conv1D layer is 27 × 128, and that of the third Conv1D layer is 25 × 256, where 14, 29, 27, and 25 refer to the feature size and 64, 128, and 256 refer to the number of channels. As shown in Figure 12, there are significant differences between the features extracted from the various classes (urban, corn, rice, and soybean) in the same layer. Conv1D layers can be stacked so that lower layers focus on certain temporal patterns, whereas higher layers aggregate simple patterns into complex shapes. Therefore, the neurons in the shallow layers of the network extract low-level features, and as network depth increases, the network can still efficiently extract more holistic features from the higher layers.
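The feature-map sizes quoted above follow directly from "valid" 1-D convolution, where each size-3 convolution shortens the sequence by kernel_size − 1 = 2 (e.g., the 16 OFSM features become a length-14 map, and 29 → 27 → 25 through the last two layers). The numpy sketch below only checks these shape relationships with random weights; it does not reproduce the paper's trained network, and the inception module's 29 × 128 output (produced by parallel branches and concatenation) is taken as given.

```python
import numpy as np

def conv1d_valid(x, n_filters, kernel_size, seed=0):
    """Valid 1-D convolution over a (channels, length) input, plus ReLU.
    Output length shrinks by kernel_size - 1."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_filters, x.shape[0], kernel_size))
    L = x.shape[1] - kernel_size + 1
    out = np.empty((n_filters, L))
    for f in range(n_filters):
        for i in range(L):
            out[f, i] = np.sum(W[f] * x[:, i:i + kernel_size])
    return np.maximum(out, 0.0)

x = np.random.default_rng(0).standard_normal((1, 16))   # 16 OFSM features
fmap1 = conv1d_valid(x, 64, 3)    # 16 -> 14: the "14 x 64" first-layer map
# From the inception module's 29 x 128 output, two further size-3 valid
# convolutions give 29 -> 27 -> 25:
incep = np.random.default_rng(1).standard_normal((128, 29))
fmap2 = conv1d_valid(incep, 128, 3)     # "27 x 128"
fmap3 = conv1d_valid(fmap2, 256, 3)     # "25 x 256"
```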


Crop Distribution Analysis
Two important agricultural commodities, corn and soybean, are commonly difficult to distinguish due to their phenological similarity, and in recent years many studies have addressed the mapping of corn and soybean [36]. Zhong et al. [53] used a decision tree classifier and vegetation phenology information to distinguish corn and soybean, achieving an overall accuracy of 87.2% and a K coefficient of 0.804 for crop mapping in the state of Paraná, Brazil, for the 2012 crop year; the results showed that some corn was mistakenly classified as soybean. When the data used for training the classifier came from the same year as the mapping, a classification accuracy of more than 88% was achieved [54]. The main factors affecting the accuracy of crop classification are commonly caused by mixed pixels. For example, mixed-pixel classification bias is affected by the complexity of the terrain, and sometimes even a small amount of sub-pixel natural vegetation will affect the phenological detection of crops.
In 2018, the cultivated areas planted with corn and soybean in the study area were approximately 906 ha (44.8% of the area) and 259 ha (12.8%), respectively. We analyzed the effects of the different feature selection methods and networks on the mapped crop distribution, especially for corn and soybean, which are difficult to distinguish. The comparison of crop-type distributions is shown in Table 12. In the blue columns, the Conv1D-RF network is used with the features selected by OFSM and TFSM (RF-RFE and RF-FI), and the last column shows the crop-type distribution obtained using VGG-RF with the optimal features selected by OFSM. Compared with TFSM, the classification results based on OFSM were much closer to the reference dataset in terms of crop-type distribution, which further illustrates the effectiveness of the proposed method. Conv1D-RF based on OFSM also mapped the areas of corn and soybean more accurately than VGG-RF based on OFSM. Figure 13 shows more clearly the differences in the crop classification results obtained by the four methods. Compared with the reference dataset, the changes in rice distribution obtained by all of the methods were not significantly different, while the distribution changes for corn and soybean varied greatly. A negative percentage change in the corn area indicates that some corn is underestimated, while a positive percentage change in the soybean area indicates that soybean is overestimated. Table 13 shows the confusion matrix of the classification results of Conv1D-RF based on OFSM. The number of omitted corn pixels was the highest, and most of the missing corn was mistakenly classified as soybean (5903 pixels), resulting in a relatively high commission error for soybean. The omitted pixels for rice and urban were relatively few. Part of the missing soybean was mistakenly classified as corn (1205 pixels), and another part was mistakenly classified as urban (1078 pixels).
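The area percentage changes and omission/commission errors discussed above come straight from the row and column sums of the confusion matrix. In the sketch below, only the three off-diagonal counts quoted from Table 13 (corn → soybean 5903, soybean → corn 1205, soybean → urban 1078) are from the study; every other entry is an invented placeholder, so the exact percentages are illustrative only.

```python
import numpy as np

classes = ["urban", "corn", "rice", "soybean"]
# Rows = reference class, columns = predicted class, values in pixels.
cm = np.array([
    [20000,   300,   200,   400],    # urban   (placeholder counts)
    [  500, 80000,   300,  5903],    # corn -> soybean: 5903 (from Table 13)
    [  200,   300, 30000,   100],    # rice    (placeholder counts)
    [ 1078,  1205,   150, 21000],    # soybean -> urban 1078, -> corn 1205
])

ref_pixels  = cm.sum(axis=1)                      # area by reference class
pred_pixels = cm.sum(axis=0)                      # area by mapped class
pct_change  = 100.0 * (pred_pixels - ref_pixels) / ref_pixels
omission    = 100.0 * (ref_pixels  - np.diag(cm)) / ref_pixels
commission  = 100.0 * (pred_pixels - np.diag(cm)) / pred_pixels
```

With these counts the corn area change is negative (corn underestimated) and the soybean area change is positive (soybean overestimated), matching the pattern described for Figure 13.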
According to the field survey, corn planting in the study area is relatively regular, while soybean planting is more scattered and some areas are mixed with corn, causing the mixed-pixel phenomenon. In addition, the spectral reflectance of soybean is high, which may shift the reflectance of mixed pixels toward that of soybean, resulting in some corn being mistakenly classified as soybean.

Figure 13. Comparison of the crop-type distribution changes based on the reference dataset.

Conclusions
In this study, a novel crop classification method was developed by combining optimal feature selection with hybrid CNN-RF networks, using multi-temporal Sentinel-2 images to classify summer crops in the Jilin province of northeast China. Regarding the spectral information from feature selection, case studies of the traditional feature selection methods and the optimal feature selection method (OFSM) confirmed that the red-edge bands (e.g., B5) and shortwave infrared bands (e.g., B12) are the best spectral bands for crop mapping. Based on the optimal features selected by OFSM, which include information in both the temporal and spatial dimensions, the most satisfactory classification results in terms of overall accuracy (OA) (94.27%) and K coefficient (0.917) were achieved by a hybrid CNN-RF network built with one-dimensional convolution (Conv1D) and RF. The hybrid networks can exploit the complementary advantages of the two classifiers to identify crops effectively. In terms of identifying individual crop types, Conv1D-RF had a 1% greater OA than VGG-RF, a 1.7% greater OA than Conv1D, and a 2.4% greater OA than VGG. Since the hierarchical architecture of Conv1D takes time series as classification input, it can effectively extract features of crop-growth dynamics during model training. In summary, the proposed hybrid CNN-RF network based on the features selected by OFSM is a promising approach that exploits the advantages of two complementary classifiers, achieving higher crop-identification accuracy with lower time consumption. Applying hybrid deep-learning models to classify remote sensing imagery is still at the stage of continuous practice and exploration. Given sufficient data, however, hybrid deep-learning models can be used to learn the most appropriate band combination for a specific task, possibly eliminating the input of redundant bands.
Therefore, what information is needed and how to transform it for classification by deep-learning models is worth exploring. To achieve higher accuracy in future applications, model architectures based on three-dimensional (3D) spatiotemporal convolution should be considered. Since Sentinel-2 imagery can be disturbed by clouds, future work could focus on developing a multi-source remote sensing imagery fusion approach for crop classification; combining optical images with synthetic-aperture radar images may improve the accuracy of crop classification and better capture the distribution of crops. In addition, to improve the accuracy of agricultural mapping, it would be helpful to identify the components of mixed pixels in heterogeneous regions during the modeling and inversion processes of agricultural remote sensing methods, which would support the strategic needs of sustainable agricultural development.