Mapping Maize Area in Heterogeneous Agricultural Landscape with Multi-Temporal Sentinel-1 and Sentinel-2 Images Based on Random Forest

: Accurate estimation of crop area is essential to adjusting the regional crop planting structure and the rational planning of water resources. However, it is quite challenging to map crops accurately by high-resolution remote sensing images because of the ecological gradient and ecological convergence between crops and non-crops. The purpose of this study is to explore the combining application of high-resolution multi-temporal Sentinel-1 (S1) radar backscatter and Sentinel-2 (S2) optical reﬂectance images for maize mapping in highly complex and heterogeneous landscapes in the middle reaches of Heihe River, northwest China. We proposed a new two-step method of vegetation extraction and followed by maize extraction, that is, extract the vegetation-covered areas ﬁrst to reduce the inter-class variance by using a Random Forest (RF) classiﬁer based on S2 data, and then extract the maize distribution in the vegetation area by using another RF classiﬁer based on S1 and/or S2 data. The results demonstrate that the vegetation extraction classiﬁer successfully identiﬁed vegetation-covered regions with an overall accuracy above 96% in the study area, and the accuracy of the maize extraction classiﬁer constructed by the combined multi-temporal S1 and S2 images is signiﬁcantly improved compared with that S1 (alone) or S2 (alone), with an overall accuracy of 87.63%, F1_Score of 0.86, and Kappa coefﬁcient of 0.75. In addition, with the introduction of multi-temporal S1 and/or S2 images in crop growing season, the constructed RF model is more beneﬁcial to maize mapping.


Introduction
Accurately estimating the spatial distribution and planting area of agricultural objects is of great significance for adjusting crop planting structure and water and soil resource management. Satellite remote sensing (RS) has the unique ability of large-area observation, and great achievements have been made in crop mapping by RS. It has become a common way to obtain large-scale crop planting area information by using RS satellites, such as Landsat satellite, Terra and Aqua satellite, Gaofen-1 satellite, and Sentinel satellite [1,2], with the availability of a series of remote sensing cloud computing platforms, such as google earth engine and Amazon web services. Limited by the spatial and temporal resolution of high-resolution images and the processing ability of massive remote sensing images, low and moderate spatial resolution RS images [3], such as MODIS, AVHRR images, are dominant data sources in the previous study of crop classification and extraction, since these low and moderate remote sensing images hold considerable promise for large area crops mapping given their global coverage, daily temporal resolution, and free access. These low and moderate RS images are beneficial to preliminary understanding of the spatial distribution of crops. However, limited by the coarse spatial resolution of the RS image, it is unable to meet the need of fine mapping requirements of small farmland [4]. Sentinel-2 (S2) is a European wide-format, high-resolution, multi-spectral imaging mission, including two polarities-orbiting satellites S2A and S2B, in the same orbit. The two satellites are complementary, with a wide cover (290 km), moderate revisit period (5 days) and high spatial resolution. It can support the monitoring of vegetation growth season changes [5], significantly reducing the problem of mixed pixels.
Nevertheless, optical RS images are inevitably affected by cloud cover, resulting in the lack of information in space and time, which greatly limits the accuracy of crop mapping. With the outstanding advantage of not being affected by clouds of Synthetic Aperture Radar (SAR) and the strong sensitivity of C-band backscattering signals to crop phenological cycles, SAR images have also been used to monitor the crop distribution extent, plant pattern and growth stage, and achieved high accuracy [6,7]. Sentinel-1 (S1) SAR images with a relatively short revisit time (i.e., 6 days) and high spatial resolution, which can be used to accurately track the crop growth and extract the phenological phase [8]. Previous studies have confirmed the great potential of S1 SAR for crop classification in areas with widespread cloudiness [9]. In addition, recent studies have indicated that the SAR data can replace optical images with the same resolution in the cloudy region [10], and improve the identification of certain crops [11]. Therefore, the potential of combining optical and SAR images to improve the mapping accuracy of crops in highly heterogeneous areas is a topic worth exploring [12].
One of the key points of high-precision crop mapping by using optical RS images is to select the appropriate spectral index. So far, a large number of spectral indices have been constructed for vegetation growth monitoring. Among them, the red edge point (REP) is very sensitive to the change of chlorophyll concentration [13] and the red edge inflection point (REIP) has been used to indicate vegetation stress and senescence [14]. Normalized difference vegetation index (NDVI) quantifies photosynthetic capacity, water stress and vegetation productivity [15]. Meanwhile, the time series of NDVI can be used to observe phenological information of different land cover types [16] and vegetation characteristics [17]. However, NDVI is prone to saturation in areas with high vegetation coverage. Enhanced vegetation index (EVI) has tried to compensate for this by using more spectral information, which further enhances the high vegetation regional sensitivity and reduces the disturbance of the soil and atmosphere [18]. Furthermore, crop growth is also affected by human activities, such as irrigation and fertilization. Normalized difference water index (NDWI) and Land Surface Water Index (LSWI) are two vegetation indices sensitive to surface moisture content [19]. The spatial distribution of green chlorophyll vegetation index (GCVI) is of great significance to the response of fertilizer (nitrogen) [20]. In addition, soil tillage index (STI) and Normalized difference tillage index (NDTI) are useful indices for identifying cultivated land [21]. Crop Residue Cover (CRC) is a vegetation index indicating the density of straw mulch in the field. The normalized burn ratio (NBR) has been used to estimate fire severity [22]. The Normalized difference building index (NDBI) index is also used to discriminate built up areas [23]. For more information on the construction of vegetation indices based on optical RS images, please refer to [24]. In view of the problem of crop mapping in areas with highly complex and heterogeneous vegetation structures, it is almost impossible to distinguish complex variation relationships among different vegetation types only by using single or a few vegetation feature variables. In contrast, if too many vegetation feature variables are used, it will easily lead to information redundancy and a large amount of computation, which can reduce the crop mapping accuracy [25].
Despite the above obstacle, the optimal feature variable combination analysis is used to effectively improve the classification accuracy and reducing data redundancy and time consumption in crop mapping research [26]. It has become a hot topic to search for optimal features from a large number of variables through a special feature selection [27]. At present, many feature selection methods have been developed in crop mapping research, such as stepwise discriminant analysis method [28], Jeffries-Matusita distance [29], separability analysis method [12], mutual information rate [25], the feature selection method based on the variable importance of random forest algorithm [11,30]. In addition, the combination of the above methods has also been proposed to strike a balance between separability and relevance, such as the spectrum-dimensional optimization algorithm [31], the automatic spectral-temporal feature selection method [32], and the combined method of time series correlation analysis, random forest feature importance, and univariate statistical tests with mutual information criteria to select features [33].
In recent years, using machine learning methods to extract crop spatial distribution information has become a research hotspot. The pixel-by-pixel supervised learning algorithm, based on decision tree (DT), maximum likelihood classification (MLC), artificial neural network (ANN), support vector machine (SVM) and random forest (RF), were the most widely used methods for crop spatial distribution information extraction, e.g., Basukala et al., (2007) compared MLC, SVM and RF methods to classify cotton, wheat and rice in the Khorezm Region of Uzbekistan and southern part of Autonomous Republic of Karakalpakstan, the results indicate that the RF performed much better than MLC and SVM, especially for the cases when the number of training samples is limited [34]. Moumni et al., (2021) used three machine learning classifier algorithms, ANN, SVM, and MLC, to map the crop types in irrigated perimeter, their results show that combined images acquired in C-band and the optical range improved clearly the crop types classification performance compared to the classification results of optical or SAR data alone [35]. Peña et al. (2014) evaluated the C4.5 DT, logistic regression (LR), ANN, and SVM, both as single classifiers and combined in a hierarchical classification, for the mapping of nine major summer crops from ASTER satellite images [36]. Zhang et al. (2020) compared the classification and regression tree (CART) DT, SVM, and RF, to map different crop types by using time-series S2 images, and they also further analyzed the effectiveness of the temporal and spectral features [37]. By using ANN, SVM and RF methods, multiple crop types classification was performed by using single Landsat 8, S1, and S2 image, and combined S1, S2, and Landsat 8 images [38]. These studies demonstrate that the RF algorithm achieves relatively higher accuracy in crop types extraction, compared with other machine learning algorithms. The RF method can not only easily handle different data types, but also quantitatively calculate the importance of variables. Meanwhile, unsupervised classification algorithms based on k-means clustering and iterative self-organized class (ISOCLASS) cluster algorithm also have certain advantages when there is no crop label sample [39,40]. However, Biggs et al. (2007) mapped rainfed and irrigated crops using this ISOCLASS cluster algorithm with MODIS NDVI time series images over Krishna Basin, India, and they found the classification accuracy of this method is disappointing in heterogeneous landscapes [41]. In addition, Wardlow and Egbert (2008) proposed a hierarchical classification scheme of combining supervised and unsupervised classification to extract the spatial distribution information of crops in the US Central Great Plains [42], the unsupervised classification method (ISODATA) was applied first to produce the crop/non-crop map, then the supervised decision tree algorithm was applied to produce the three crop maps, and the high classification accuracy (>84%) was attained by this method. Therefore, with enough label samples, supervised learning may be a more direct and effective way to identify the detailed distribution information of crops.
Our goal in this study is to explore more accurate crop classification under complex and heterogeneous vegetation landscapes by combining high-resolution multi-temporal optical and radar images based on RF algorithm. Specifically, a two-step method is proposed in this study, the first step is to extract vegetation cover information based on S2 images, and the second step is to extract maize coverage by using S1 (alone), S2 (alone) and combined data of S1 and S2. Numerous feature variables that may be beneficial to maize classification have been pre-selected in advance. A feature selection process through quantitatively evaluating the importance of the variables based on RF has been implemented to find the best combination of feature variables in the proposed two-step crop mapping.
Finally, the accuracy of classifiers for S1 (alone), S2 (alone) and combined data are assessed using the ground reference dataset.
The following section describes the study area and data used in this study and the pre-processing of RS data. Section 3 describes the detailed methodology of maize extraction. Section 4 shows the vegetation extraction results based on S2 and maize extraction results based on S1 (alone), S2 (alone) and combined data, respectively. Section 5 discusses the influence of image pre-processing and multi-temporal images on crop mapping and the advantages of the feature selection program in classification applications. Section 6 is the conclusion of the study.

Study Area
The study area is located in the middle reaches of the Heihe River basin, which is the central part of the Hexi Corridor Plain in central Gansu Province, China ( Figure 1) [43]. It is a national demonstration area of modern agriculture and the largest corn seed production area in China with flat terrain, fertile soil, sufficient sunshine and convenient water diversion [44]. The climate in this region is arid, with little precipitation but high evapotranspiration [45]. Therefore, the oasis in the study area is dominated by artificial oasis, and the planting structure of crops is highly heterogeneous and diverse [46]. Some areas have realized modern agricultural management, while other areas are small-scale agriculture with landscape fragmentation structure. Crops are primarily one-season planting in this area, and the growth period is mainly from April to September. The crop types in the study area include food crops (i.e., wheat, rice, maize, millet, soybeans and potatoes), cash crops (i.e., cotton, oilseed, hemp, beets, medicinal herbs, vegetables and melons) and green fodder (hybrid seed maize production). Maize is the crop with the largest planting area in this region, with large water requirements and high evapotranspiration during the growth period [47]. 2.2. RS Images and Processing 2.2.1. Sentinel-1 Images S1 level-1 Ground Range Detected (GRD) products are generated by C-band SAR to detect targets in the interferometric wide swath mode for both VV and VH polarizations with 10 m × 10 m pixel size and a swath width of 290 km. The GRD images have been pre-processed into an analysis-ready format using border noise removal, thermal noise removal, radiometric calibration, and orthorectification. As speckle noises impaired the performance of S1 on crop identification [48], further pre-processing included spatial filtering using a 7 × 7 Refined Lee speckle filter for both VV and VH polarizations [49,50]. Two additional indicators (Note, we also regard them as two bands in the rest of the paper), the difference between VV and VH (VV-VH) and the ratio of VV and VH (VV/VH), were also calculated [51].
We collected 67 S1 images with four bands (VV, VH, VV-VH, VV/VH) ( Table 1) from a full growing season (i.e., April 1 to September 30) in 2019. As the temporal composite with regular time intervals can overcome the spatial heterogeneity of observations, the monthly median composition method is used to construct S1 data aggregation on a monthly scale [52].

Sentinel-2 Images
The corresponding S2 multi-spectral instrument (MSI) Level-1C (L1C) top-ofatmosphere (TOA) reflectance products, which have been radiated and geometrically corrected [53], are used in this study. A total of 428 images are collected. The pre-processing process of S2 images mainly includes re-projection, resampling, cloud mask and moving median composition. First, re-project the image dataset to the WGS 1984 projected coordinate system. Second, all the bands with the lower spatial resolution are resampled to 10 m with cubic convolution interpolation. Then, the QA60 band is used to evaluate the quality of images and automatically hide the cloud and cirrus pixels in each image [25]. Furthermore, the monthly median composition method is also used for constructing S2 data aggregation on a monthly scale. Finally, the moving median synthesis method is used to smooth and gap-fill the image dataset [52].
In addition to the above four bands, a set of spectral indicators used in many previous studies to indicate vegetation growth have also been calculated. A summary of these spectral indices and their definitions are described in Table 1. In addition to the 11 spectral indices mentioned in Section 1, NBR1 and NBR2 are evolved from the NBR index, and there are four indices, RDNDVI1, RDNDVI2, RDGCVI1 and RDGCVI2, which are respectively extended from NDVI and GCVI indices.

Ground-Based Reference Dataset
A ground-based reference dataset is collected for model construction and accuracy evaluation. This ground-based reference dataset covers the typical land cover types in the study area, which is a point dataset with spatial and attributes information. A total of 2896 ground-based reference sample points were collected, including 623 maize samples, 731 non-maize vegetation samples, and 1542 non-vegetation samples. It should be noted that the non-maize vegetation samples include forest, wheat, pepper, oats, alfalfa, oil cabbage, onion, broccoli, bamboo shoots, potatoes, and beets. The non-vegetation samples include buildings, water bodies, Gobi, and bare land. The ground-based reference samples are derived from: (1) Field survey. A field survey was conducted in the study area from October 22 to November 1, 2019. A total of 780 samples were collected, including 623 maize samples and 157 non-maize vegetation samples. Limited by sampling time, we only recorded non-maize vegetation samples in areas with obvious vegetation types.
(2) High-resolution images from Google Earth. A total of 241 greenhouse cultivation samples were manually collected, and it is assumed that these points are non-maize vegetation samples. Besides, 333 grassland/forest samples, 529 building samples, and 677 other non-vegetation land cover samples were collected from Google Earth. Since Google earth images are mosaic from multi-source and high spatial resolution RS images with different transit times, the actual acquisition time of these samples may be different. However, theoretically speaking, the land cover types of these samples will not change significantly in a short time, so we think that these samples can be used as the ground-based reference samples in this study.
(3) S2 images. A total of 334 water samples were collected from S2 images. The purpose of this is to avoid deviations caused by narrow water system in certain geographical areas between Google earth images and S2 images.

Method
The goal of identifying maize pixels is split into two tasks: one to distinguish vegetation pixels from non-vegetation pixels, and one to identify maize pixels from vegetation pixels. Both tasks are implemented by RF classifier. The brief introduction of RF classification algorithm, the specific methodology of maize planting area extraction, and the performance evaluation index of the classifier defined in this study are as follows:

Random Forest
RF is an ensemble learning method proposed by Breiman [54], which is widely used in statistical classification and non-parametric regression problems. RF adopts the bootstrap sampling strategy with replacement to generate several independent training sets and build a decision tree on each training set by randomly selecting features or linear combinations of features. RF is an integrated model combining multiple decision trees [55].
RF algorithm has the characteristics of fast, easy parameterization and strong robustness [56,57]. Therefore, it has high accuracy in classifying large-scale data sets with many different characteristics and can handle high-dimensional and noisy input data [58]. At present, this method has shown excellent performance in the studies of land cover classification and crop mapping [59].
Another essential characteristic of RF is that it can quantify the variable importance (VI), which enables it to be used for feature ranking or selection [60]. Many studies have demonstrated that it is crucial to select a feature subset from over-abundant feature variables to prevent over-fitting and reduce the computational complexity of the model [58,61,62]. The RF uses "out-of-bag" (OOB) samples to generate an internal unbiased estimate of its generalization error, which is similar to K cross-validation to evaluate the accuracy of the classification [55].
It should be noted that the random forest classifier in this paper is implemented by using the Scikit-learn Library in Python environment.

The Methodology of Maize Planting Area Extraction
The methodology based on random forest classification is proposed to extract maize planting area, the detailed technical flow is shown schematically in Figure 2. The proposed methodology mainly consists of two steps: Figure 2. Technical flow of the maize planting area extraction scheme. "k = k − 5" represents the elimination of 5 smallest importance features in each iteration process. It includes three parts: (1) Data preparation and preprocessing; (2) Extraction of vegetation area using S2 images; (3) Extraction of corn planting area using S1(alone), S2(alone) and combined dataset, respectively.
Step I: Vegetation extraction. This is mainly to separate vegetation from non-vegetation to reduce the inter-class variance between maize/non-maize vegetation and non-vegetations. In this step, the RF classification model is constructed by using S2 time series and groundbased reference data, and the number of input indicators of RF is 30 (i.e., five spectral indices of GCVI, NDTI, NDVI, STI and RDGCVI1, each with six different dates from April to September). Only these five spectral indicators of S2 were selected because we found that these five indicators are more sensitive to vegetation and contribute the most to the extraction of vegetation regions in our previous study [63].
Step II: Maize extraction. The new RF classification model is trained by using S1 and S2 time-series images and ground-based reference data to distinguish between maize and non-maize vegetation in vegetation regions. In order to illustrate the necessity of introducing S1 data and the advantage of combined S1 and S2 data in crop mapping on highly complex and heterogeneous vegetation landscapes, three new RF models are developed by using indicators from S1 (alone), S2 (alone), and combination of S1 and S2 as the input of RF, respectively. The developed RF by using S1 (alone), S2 (alone) and the combined data are named as RF_S1, RF_S2, RF_S1&S2, respectively. The number of input indicators of these three RF models is 24 (i.e., four polarization characteristics of VV, VH, VV-VH, and VV/VH from S1 images, each with six different dates from April to September), 180 (i.e., all 30 indicators from S2 images, each indicator has six different dates from April to September), and 204 (i.e., combined indicators from both S1 and S2), respectively.
Considering that high-dimension input information may contain noise and redundancy, it will lead to over-fitting of a machine learning model. Thus, a specific feature selection procedure is essential to reduce information redundancy and prevent over-fitting when S2 (alone) or combined data are used to train a RF model. Embedded feature selection technique is embedding the feature selection process in the training of prediction models, which not only investigates the importance of input variables, but also gives the final prediction model [27], such as the feature selection strategy based on RF [30], in which the recursive variables elimination is used to reduce the computational complexity [64]. We adopt this RF-based embedded feature selection strategy to select a set of optimal feature variables with the smallest OOB error, wherein each iteration eliminates five features with the lowest importance until the number of features is less than 10 (Here, we set the threshold to 10 because we found that when the number of features is less than 10, the OOB error of the RF model begins to increase significantly), to establish the final RF maize extraction model with good generalization ability.
In our study, the sample dataset is divided into training set and validation set by a ratio of 8:2. The training set is used to train the RF classification model, and the independent validation set is used to examine the generalization performance of the developed RF model. Considering that the distribution of samples may be unbalanced, the stratified sampling method is adopted in the sample dataset division. The objective of RF model training process is to determine the optimal structure and hyperparameters to obtain the highest classification accuracy [65]. However, some previous studies have shown that different parameterization schemes have a limited influence on classification accuracy [57,66]. To reduce the computation cost and achieve a relatively good classification accuracy, the number of decision trees in the RF was set to 200, and the number of random feature variables was set to the square root of the total number of variables. Besides, the minimum number of terminal nodes was set to 10.

Accuracy Assessment
For quantitative evaluation of classification accuracy, most of the literature define a confusion matrix to calculate some evaluation indicators, such as overall accuracy, recall, precision, Kappa coefficient [67]. The confusion matrix was also constructed in this study, and five evaluation indexes, including recall (R), precision (P), overall accuracy (OA), F1_Score and Kappa coefficient, are calculated.
where N represents the number of training samples, n represents the number of classification categories, and i represents a specific category. P quantifies the number of positive samples that actually belong to the positive category, and R quantifies the number of positive samples that have been correctly predicted. OA refers to the probability that a sample will be correctly classified. F1_Score is used to quantify the output quality of the model, which is the harmonic mean of P and R. The value of F1_score ranges from 0 to 1, and the closer it is to 1, the higher the classification accuracy of the model is. Kappa coefficient is an indicator for detecting and verifying the consistency between the actual categories and the predicted categories of samples. The value of Kappa is usually between 0 and 1, and the closer the value is to 1, the higher the consistency is.

Vegetation Extraction Result
The RF classification model based on 30 input indicators from S2 can effectively classify vegetation and non-vegetation regions in the study area, as shown in Figure 3. It can be seen that 31% of the study area is covered by vegetation. The OA and Kappa coefficients of the developed RF classification model are 97.36% and 0.95, respectively. In addition, the OA, P, R, Kappa coefficient, and the F1_Scores of the model are 96%, 0.94, 0.98, 0.92, and 0.96, respectively, in the independent validation set. Meanwhile, the importance of the 30 input variables is also calculated by the RF model ( Figure 4). The results indicate that the NDVI, NDTI and STI are the three most important variables in the vegetation extraction process, especially NDVI from July to September (i.e., NDVI_Jul, NDVI_Aug, and NDVI_Sep), and their importance values reach 9.50%, 9.28% and 13.47%, respectively. Thus, the NDVI is the dominant indicator to identify vegetation and non-vegetation. Moreover, the other seven relatively important indicators are NDTI_Aug, NDTI_Sep, STI_ Aug, STI_ Sep, STI_Jul, NDVI_Jun and RDGCVI1_ Aug. On the time scale, the importance of indicators is more vital in July, August, and September. This is consistent with the actual situation as there are obvious differences in spectral signals between vegetation and non-vegetation during the vegetation flourishes in these months.

The Optimal Features
The results of feature selection show that the number of optimal features with the minimum OOB error is 30 and 40 for RF_S2 and RF_S1&S2 classification model, respectively. The 24 feature indicators of RF_S1 and the optimal features of RF_S2 and RF_S1&S2 are displayed in Figure 5.
Obviously, the RF_S1 model is extremely dependent on VV and VH polarization, and the top-10 features are the VV and VH polarization signals from May to September during the main growth period of crops, such as VV_May, VH_Aug, VV_Jun, VV_Sep, VH_Sep, VV_Aug, VH_May, VH_Apr, VV_Apr and VH_ Jun. The top-10 features of RF_S2 model include both band information and spectral indices from S2 images, where three band features are B2_Sep, B10_Sep and B8A_Jun, and seven spectral index features are NBR2_Apr, NBR1_Apr, NBR1_May, RDGCVI1_Aug, RDNDVI1_Jul, NDVI_Jul and NDBI_Apr. The optimal features of RF_S2 are relatively large, which may be related to the complex and heterogeneous surface coverage in the study area. The number of optimal features of RF_S1&2 model is the most, even up to 40, including polarization characteristics, band information and spectral indices, which may be related to the abundant spectral information that can more fully reflect complicated surface coverage. The top 10 features of RF_S1&S2 are 6 spectral indicators (i.e., STI_Apr, NBR1_Apr, NDVI_Jul, RDNDVI1_Jul, NDBI_Apr and NDTI_Jul), 2 polarization characteristics (i.e., VH_Aug and VV_Jun from S1), and 2 band indicators (i.e., B8A_Jun and B12_Apr from S2).
VV polarization contributed the most compared with other polarization in this study of maize planting area extraction based on S1 image. It can detect crop information better because of its strong penetrability. In the maize planting area extraction research based on S2 image, the contribution of NBR index is dominant, especially in April, which is related to the local cultivated land activities. Before planting crops in that year, farmers usually loosen the soil of the farmland and burn the roots left in the farmland in the previous year, which makes almost all farmland have their own unique NBR response. In general, the contribution of characteristic variables of S2 is greater than that of S1, which indicates that the bands and spectral indices from S2 image are more sensitive to crop information.
However, unlike S1 (alone) and S2 (alone), the top three indicators are VH, STI, and NBR in descending order of importance. Combined with more spectral features and polarization patterns, the importance of variables will become difficult to explain.  The feature indicators that used to construct three different RF maize extraction models of RF_S1, RF_S2, and RF_ S1&S2, respectively. (a) the 24 feature indicators from S1 (alone); (b) the 30 optimal feature indicators from S2 (alone); (c) the optimal 40 feature indicators from combined S1 and S2 data.

Comparison of Maize Extraction Results from Three RF Models
The performance quantitative evaluation results of RF_S1, RF_S2 and RF_S1&S2 maize extraction models on the training set and the independent test set are shown in Table 2. On the training set, the OAs of these models are 86.42%, 88.39% and 89.46%, and Kappa coefficients are 0.72, 0.76 and 0.79, respectively. The OA, R, P, Kappa coefficients and F1_Score of RF_S1 are 80.14%, 0.76, 0.81, 0.6 and 0.78, respectively. Compared with RF_S1, the performance of RF_S2 has been improved significantly, and the OA, R, P, Kappa and F1_Score of RF_S2 are 85.51%, 0.82, 0.87, 0.71 and 0.84, respectively. This indicates that the multi-temporal S2 optical images are more suitable for maize extraction than the microwave S1 images in the study area. The various spectral indicators provided by S2 images can better identify the different phenological characteristics of different vegetation, which is beneficial to distinguish maize from other vegetation. The performance of RF_S1&S1 has been further improved, with the OA, R, P, Kappa and F1_Score are 87.63%, 0.84, 0.87, 0.75 and 0.86. This indicates that combining S1 and S2 data, the spatial information of the two can complement each other, so as to provide more comprehensive features and extract the maize cover more robustly and accurately.
The distribution map of maize planting area extracted by RF_S1&S2 (the maize distribution maps extracted by RF_S1 and RF_S2 models are very close to RF_S1&S2, so we will not show them here) in the study area is shown in Figure 6. It can be seen from the figure that maize is mainly distributed in areas with flat terrain, close to water sources and well-developed irrigation systems (Figure 6a). In addition, there are a small number of mazes distributing in the northern, southwest and southern of the study area, which is consistent with the results of our field investigation. According to our field investigation, the cultivated land in the southern part of the study area is mainly terraced fields. Because these areas are difficult to irrigate, they are planted with forage maize that needs less water and is more drought resistant. Owing to the agricultural planting structure of the study area is the combination of modern agriculture and traditional agriculture, one of the important characteristics of corn distribution in the study area is relatively fragmented, such as the typical desert-oasis zone in the study area (i.e., local enlarged region as shown in Figure 6c), mixed distribution of corn, residential land, and greenhouse planting is found. In addition, the true color composite map of S2 image in the local enlarged area (Figure 6b) also shows that the surface heterogeneity of the area is very strong, and the surface coverage is very fragmented. Table 2. The performance quantitative evaluation results of RF_S1, RF_S2 and RF_S1&S2 for training set and test set for using S1 (alone), S2 (alone) and combination of S1 and S2.

Effects of Pre-processing on Maize Mapping
The quality of optical images and radar images is one of the keys to accurate extraction of maize planting area in this study. Because of the different dates of adjacent tracks and cloud contamination [52], the S2 time series images with different time intervals were composited using cloud mask algorithm and monthly median synthesis method. The QA60 band of S2 dedicated to monitoring cloud information was used in cloud masking processing. However, for time aggregation of S2 time series images, each scene will affect the final estimate [49]. Thus, the number of good observations (i.e., not polluted by cloud cover) on each pixel in the study area from April to September were recorded ( Figure 7). As can be seen from the Figure 7, the number of good observations from July to September was significantly larger than from April to June. In particular, the number of good observations in April was less than or equal to 3. In addition, it is worth noting that a very small fraction of the pixels in the monthly median composite image of April and June were missing. To improve image quality and realize information reconstruction, moving median processing was adopted for S2 monthly median synthesis images [52]. To verify the effectiveness of the moving median processing, the RF classification models were constructed by using S2 images before and after moving median processing per month in April, May, and June. The accuracy of maize/non-maize classification is shown in Table 3. The results indicate that the classification accuracy is improved for RF models that were constructed by the moving median processed S2 data, with higher OA, Kappa, and F1_Score values in most cases. Figure 7. The number of good observations in the study area from April to September. Banded areas had a higher value due to the more S2 images are overlapped. The color in the graph represents the amount of data available on the pixel during the time interval. From "blue" to "red" indicates that the number of good observations increases gradually. Table 3. Comparison of maize classification accuracy before and after S2 and S1 image pre-processing. Preprocessing refers to moving median synthesis for S2 images and 7 × 7 Refined Lee spectral filtering for S1 images. Currently, S1 is the only radar satellite system that can provide a dense time series with global coverage. It is an excellent alternative data source for agricultural land cover maps. We used the 7 × 7 Refined Lee speckle filter, which can preserve the edges and present the preserved details [68] when filtering the image in uniform and non-uniform areas to reduce the speckle noise. We still take April, May, and June as examples to construct the corresponding RF classification models per month by using S1 data before and after the speckle filter, the maize/non-maize classification accuracy results (Table 3) indicate that the performance of constructed RF models is significantly improved in April and May after speckle filter. Although there was no improvement for RF model after speckle filter in June, there was no significant deterioration, which indicates that the Refined Lee speckle filter can improve the quality of S1 data effectively. Inglada et al. (2016) also discussed the influence of spectral filtering (e.g., with a simple 3 × 3 pixels filter window) on classification accuracy in the early crop type recognition, they found that the speckle filtering introduced a statistically significant small improvement [11], which is consistent with our research results.

Effects of Multi-Temporal Images on Maize Mapping
There are some limitations in extracting ground information from a single image. Due to the strong heterogeneity, high degree of fragmentation and various land cover types in the study area, there will always be the phenomenon of the same object with different spectra and different objects with the same spectrum for RS images. In addition, extracting crop distribution information from high-resolution RS imagery is much difficult than from the other resolution because high-resolution imagery exhibits more complex spectral character. This undoubtedly increases the difficulty of crops classification and makes the capability of applying single RS data on crops classification is reduced. Multi-temporal images can provide more spectral information [69], which seems to reduce misclassification greatly.
In this study, the incremental learning strategy is adopted to explore the impact of multi-temporal images on maize mapping accuracy. The incremental learning trains the initial model on the data subset and then continuously adds new data to update the model [30,70]. Thus, we train the initial RF models with the data (including S1 alone, S2 alone, and combined S1 and S2) of April as the data subset and validate these models to obtain the initial classification accuracy. Then, the next month's data is introduced as the data subset in turn and retrain the RF models classification, and until the data of September is also introduced. The experiment results indicate that the classification accuracy of the RF models is steadily improved with the introduction of more temporal information (Figure 8), especially for the RF model that constructed by using S1 data alone, which is improved significantly wherein the information of May, August, June, and September is introduced. This may be due to the fact that these months are the seedling stage, milky stage, jointing stage and mature stage of maize, and the difference of SAR signals between maize and other surrounding ground objects are relatively large. The classification accuracy of the RF model is gradually improved from 72%, 0.41 and 0.64 of the initial models to 80.14%, 0.6, and 0.78 of the final models for OA, Kappa coefficient, and F1_Score, respectively, with the availability of more data. With respect to the impact of S1 images in early crop classification, our result is consistent with the study of Inglada et al. (2016), which shows that the classification accuracy is gradually improved with the evolution of seasons and more data available [11]. However, the classification accuracy of their study is slightly higher than ours (the maximum value of Kappa coefficient is about 0.675), this may be due to the use of many variables related to crop texture features. We also found that S1 images do not perform very well in the early maize planting area extraction, this is similar to the study of Demarez et al., (2019), which indicates that the classification accuracy of early crop is the lowest with Kappa ≈ 0.25, as their study did not conduct monthly synthesis processing for the multi-temporal images [30]. Nguyen et al., (2017) found that the dense SAR time series and phenological characteristics are very critical in the study of rice cropland monitoring with Sentinel-1 Data, and the Kappa coefficient can reach 0.76~0.87 [71]. Bargiel (2017) pointed that the detailed phenological information can be regarded as the foundation of crop classification in large regions, as the phenology varies with the different climate conditions in different regions [72]. In our study, the classification accuracy is somewhat low, which is related to the complex and heterogeneous vegetation structure in the study area. In the further classification study, we will consider the texture features of crops in this region. Figure 8. Performance comparison of different RF models constructed by incremental learning and feature selection. "noFSP" means that the feature selection procedure is not used. In the figure, the month marked on the abscissa means that the input features of the RF model include the data of that month and previous months.
The corresponding classification accuracy of RF model constructed by S2(alone) and combined S1&S2 data are significantly improved than that of S1(alone), which may be related to image characteristics since optical images generally understand the relationship between observations and vegetation phenology better than radar images [11]. Marais Sicre et al. (2016) indicate that the impact of crop classification accuracy mainly comes from available data and it is difficult to distinguish crop types accurately in the early stage of crop growth [73]. These results are consistent with the conclusions of Vuolo et al. (2018), indicating that classification accuracy is very poor (OA ≈ 50%) in the early stages of the crop growth [69]. However, relatively high accuracy (OA ≈ 81% ) was achieved at the very beginning of the crop growing season in our study, this may be related to the large number of spectral features we used is beneficial to crop classification. With the evolution of seasons, the maize classification accuracy increased significantly, and the accuracy of S2 based RF classification model achieves the best performance when the image information from May to August are all used as the model input, the OA, Kappa coefficient, and F1_Score values are 86.27%, 0.72, and 0.85, respectively. The classification accuracy of the corresponding RF model constructed by combined S1 and S2 data is further improved compared with that S2 (alone), this is consistent with the conclusion of Demarez et al., (2019), which indicates that the combination of optical data and SAR data can improve the precision of crop acreage extraction in highly heterogeneous regions [30]. The RF classification model achieves the best performance while all the available information is introduced, with the higher OA, Kappa coefficient, and F1_Score values even reach 87.63%, 0.75, and 0.86, respectively. Multi-temporal images of both optical S2 and/or S1 SAR can better capture the phenological information of different growth stages of maize, which is beneficial to crop classification in areas with complex vegetation structure and strong heterogeneity, this is also confirmed by Mercier et al., 2020, in which they indicate that the combined use of S1 and S2 was more accurate in identifying principal and secondary phenological stages for wheat and rapeseed mapping than using S1 or S2 data alone [74]. However, Mercier et al. (2019) also pointed out that the combination of S1 and S2 data has only a little improvement compared to those of S2 data alone, with Kappa coefficient appropriate 0.89 versus 0.87, in the study of land cover classification of forest-agriculture mosaic by using RF classifier [70], this study gives us a warning that whether S1 or S2 should be combined for classification research still needs specific analysis.

Effects of Feature Selection Procedure on Maize Mapping
The previous study of maize planting area extraction in Tanzania and Kenya [25] showed that the B1, STI, NDTI and shortwave infrared bands are the most important variables, which is consistent with the conclusion in our study area. In contrast, the importance of NBR1, NDVI and RDNDVI1 are stronger in our study area. In addition, the indicators from S1 (e.g., VV_May, VH_aug, VV_Jun, VV_Sep, and VH_Sep) are also of great importance in maize classification. The indicators from the near infrared band of S2 are the dominant contribution factors of maize classification in the study area. The importance of VV polarization is the most significant, and this conclusion can also be drawn from other crop classification studies [70]. However, some studies have shown that the index VH/VV can play an important role in crop classification due to it can reduce the impact of soil moisture and detect post-harvest spontaneous regrowth [48].
To explore the effect of the feature selection on maize mapping, we compared the RF models constructed by using S2 (alone) and combined S1 and S2 data with and without feature selection (Figure 8). We called the constructed RF model without feature selection procedure is a standard RF model, in which all the feature variables are involved in the training. Overall, the performance of RF models with the feature selection is slightly better than corresponding standard RF models. Meanwhile, the performance of standard RF model constructed by combined S1 and S2 is superior to that RF model constructed by S2 data alone with feature selection although the overall trend of classification accuracy for the RF models constructed by the combined S1 and S2 data is similar to that S2 data.
However, as can be seen from the trend of Figure 8, the RF models with feature selection procedure are superior to the standard RF models. In short, it is very important to select the best feature and construct the appropriate RF model to improve the accuracy of maize classification in complex heterogeneous environment.

Uncertainty and Future Enhancement of Maize Mapping
Accurate ground truth data are crucial for training and validating classification algorithms [25], especially the quality of the training data is the key to obtaining satisfactory classification results [49]. The collection of sampling data is a major limitation of this study. Due to the limitation of sampling time in the field survey, there are fewer non-maize samples. Although non-maize samples were supplemented by Google earth image and S2 images, these samples may be biased and not representative enough, which may cause classification uncertainty.
Image quality is another key to obtaining satisfactory classification results. Cloud masking and moving median of processing of optical images and spectral filtering of radar images can significantly improve image quality, as described in Section 5.1. The Cloud or cirrus cover of S2 images was generally masked using the QA60 band [48,50]. Recent research found that clouds were masked more thoroughly using the Landsat simple cloud score algorithm than merely utilizing the QA60 quality assessment band [52]. In this algorithm, four bands and two spectral indices were used to compute cloud scores and detect clouds for S2 data. In our study, only moving median processing was used to eliminate the influence of clouds or cirrus clouds. However, the limitations of cloud masking may also still have contributed to the uncertainty of maize classification. Therefore, more attention should be paid to image pre-processing, especially to the cloud removal of optical images.
In the previous Sections 4.2.2 and 5.2, we have confirmed that the RF models constructed by the combination of optical S2 data and S1 radar data have better classification accuracy than using either S1 or S2 data alone. This can be attributed to the complementary interaction of spatiotemporal information from multiple sources, which makes it possible to capture the surface ground objects' feature features more completely. However, most of the previous studies have excluded the "atmospheric" bands, e.g., B1, B9, B10, of S2 [74], even though that information may also contribute to crop classification. In this study, we found that bands B1 and B10 of S2 or combined data have a non-negligible contribution in the process of maize classification (it can be seen in Section 4.2.1). This indicates that spectral reflectance from some atmospheric band is also related to crop growth and should be taken into account when establishing a crop classification model. Besides, the bands and artificial spectral indicator from S1 and S2 images used in this study may not represent the influence of complex factors, such as weather conditions and farming patterns. Thus, we need to further explore and extract other useful features in future maize classification study.
In addition, more advanced deep learning methods, such as convolutional neural networks, recursive neural networks, and long short-term memory networks, are also very popular methods in agricultural crop classification study recently [75,76]. Judging from the performance of existing research, deep learning seems to have more potential advantages than classic machine learning methods. This is because the features of traditional machine learning are manually screened, whereas deep learning technology can automatically mine the spatiotemporal information contained in the RS images. However, there is also a study that confirms that the long short-term memory network is similar to the random forest without significantly improvement for maize planting area mapping in the Liangzhou district of Gansu province, China [77]. Therefore, both RF and deep learning techniques need further in-depth study in the application of crop mapping.

Conclusions
This study explores how to use the combined high-resolution multi-temporal optical S2 and/or S1 SAR images to map the maize planting area in the middle reaches of the Heihe River basin via RF classification model with a feature selection procedure. The results revealed the multi-source RS images have the potential to provide more complementary feature information and improve the accuracy of maize mapping under the condition of complex and heterogeneous landscapes. The maize mapping accuracy of RF model constructed by using the combined S1 and S2 data was higher than that either S1 (alone) or S2 (alone), with OA of 87.63%, Kappa coefficient of 0.75, and F1_Sccore of 0.86. Compared with single temporal image, multi-temporal images can improve the abundance of input information of RF model and then improve the accuracy of classification. Besides, the optimal feature selection is also the key to improve the accuracy of maize mapping in the study area, which can effectively eliminate the redundant information led by a large number of similar or related input features, and reduce the possibility of over-fitting and then improve the classification accuracy of RF model. The proposed method in this study is suitable for fine classification of crops in highly complex and heterogeneous areas.