Potential of Hybrid CNN-RF Model for Early Crop Mapping with Limited Input Data

Abstract: When sufficient time-series images and training data are unavailable for crop classification, features extracted through convolutional neural network (CNN)-based representation learning may not provide enough information to discriminate crops with similar spectral characteristics, leading to poor classification accuracy. In particular, limited input data are the main obstacle to obtaining reliable classification results for early crop mapping. This study investigates the potential of a hybrid classification approach, CNN-random forest (CNN-RF), which combines the automatic feature extraction capability of a CNN with the superior discrimination capability of an RF classifier, in the context of early crop mapping. Two incremental crop classification experiments with unmanned aerial vehicle images were conducted to compare the performance of CNN-RF with that of CNN and RF with respect to the length of the time-series and the training data size. When sufficient time-series images and training data were used for classification, the accuracy of CNN-RF was slightly higher than or comparable with that of CNN. In contrast, when fewer images and the smallest training dataset were used at the early crop growth stage, CNN-RF was substantially beneficial, and the overall accuracy increased by up to 6.7%p and 4.6%p in the two study areas, respectively, compared with CNN. This is attributed to its ability to discriminate crops from features with insufficient information by applying a more sophisticated classifier. The experimental results demonstrate that CNN-RF is an effective classifier for early crop mapping when only limited input images and training samples are available.


Introduction
The demand for timely and accurate agricultural thematic information has been increasing in the agricultural community owing to growing food security needs and climatic disasters [1][2][3]. In particular, crop type information is considered vital for sustainable agricultural monitoring and management [4]. Conventional official agricultural statistics, including cultivation areas and crop yields, have been widely used to support agricultural planning and management. However, from a practical perspective, such statistics may not be useful for the preemptive establishment of national food policies, as they are usually released either after the crop season or on a simple annual basis. Therefore, crop type information should be provided to decision-makers either before crop seasons or at a time suitable for effective agricultural management [5,6].
Remote sensing technology has been widely employed to generate crop type maps at both local and regional scales owing to its ability to provide useful information for the periodic monitoring and effective management of agricultural fields [7,8]. However, the accurate classification of crop types using remote sensing imagery still poses challenges, particularly for early crop mapping, as several important aspects of crop classification must be properly considered during the classification procedure.

Thus, CNN-RF is promising when limited input data are available for early crop mapping. However, to the best of our knowledge, the full potential of CNN-RF for early crop mapping with both limited input images and insufficient training data has not yet been thoroughly explored in existing studies. Instead of applying advanced models, most previous studies on early crop mapping have focused on either the construction of a sufficient time-series SAR image set [11] or the fusion of optical and SAR images [39]. Yang et al. [35] applied CNN-RF to crop classification with multi-temporal Sentinel-2 images but focused on the selection of optimal features. Furthermore, the lack of sufficient training data has not been investigated in conjunction with different time-series lengths.
The objective of this study is to evaluate the potential of a hybrid CNN-RF model for early crop classification. This study differs from previous ones on the classification using CNN-RF by considering practical issues frequently encountered in early crop mapping, including both the incomplete time-series set and the insufficient training data. The benefit of CNN-RF is quantitatively evaluated by incremental classification experiments using multi-temporal unmanned aerial vehicle (UAV) images and different sizes of training data in two crop cultivation areas in Korea, with an emphasis on early crop mapping.

Study Areas
Crop classification experiments were conducted in two major crop cultivation areas in Korea. The first case study area is located in the central part of Anbandegi, a major highland Kimchi cabbage cultivation area in Korea (Figure 1a). The study area lies at a higher altitude than its surroundings (approximately 1 km above mean sea level) because low temperatures are required during summer for growing highland Kimchi cabbage [4]. Two other crops, cabbage and potato, are also grown in the study area, and some fields are kept fallow, left unsown for the following year's cultivation. The total area of all crop parcels is 27.13 ha, and the average size of each crop field is 0.6 ha.
The second case study area is a subarea of Hapcheon, a major onion and garlic cultivation area in Korea (Figure 2a). Onion and garlic are grown in the relatively warm southern parts of Korea in spring and are harvested prior to rice planting [40]. Barley is also grown alongside onion and garlic in the study area, and several fallow fields are maintained, as in Anbandegi. The total crop cultivation area is 43.98 ha, and the average size of each crop field is 0.24 ha, which is smaller than that of Anbandegi. Figure 3 shows the phenological stages of the crops in the two study areas obtained by field surveys. In Anbandegi (Figure 3a), all crops are grown during the normal growth period for summer crops but have different sowing and harvesting times at intervals of approximately two weeks to one month. Highland Kimchi cabbage is sown later than cabbage and potato and is harvested in mid-September. Potato and cabbage were sown in early June, but the harvesting time of potato (mid-August) was earlier than that of cabbage. All crops in Hapcheon are sown from the fall of the previous year (Figure 3b). Onion and garlic are managed with plastic mulching for winter protection until mid-March. Thus, the growing stages of onion and garlic can be monitored from the end of March or early April.

Datasets
UAV images were used as inputs for crop classification in consideration of the small fields of the two study areas. The UAV images were taken using a fixed-wing eBee unmanned aerial system (senseFly, Cheseaux-sur-Lausanne, Switzerland) equipped with a Canon IXUS/ELPH camera (Canon U.S.A., Inc., Melville, NY, USA) with blue (450 nm), green (550 nm), and red (625 nm) spectral bands. The raw UAV images were preprocessed using Pix4Dmapper (Pix4D S.A., Prilly, Switzerland). By considering the growth cycles of the major crops, eight and three multi-temporal UAV images with a spatial resolution of 50 cm were used for crop classification in Anbandegi and Hapcheon, respectively (Table 1). It should be noted that only three UAV images acquired from early April were used for crop classification in Hapcheon, in consideration of the plastic mulching period of the onion and garlic fields (until mid-March). Ground truth maps prepared by field surveys (Figures 1b and 2b) were used to select both the training data for supervised classification and the reference data for accuracy assessment. Furthermore, a land-cover map provided by the Ministry of Environment [41] was used to extract crop field boundaries within the study area and mask out non-crop areas.

RF
RF [42] was used as a conventional ML classifier in this study. RF can reduce overfitting and maximize diversity through the random selection of input variables and tree ensembles.
Furthermore, RF is not significantly affected by outliers [38]. Bootstrap aggregating (bagging), which randomly extracts a certain portion of training data, is first applied to build each decision tree [42]. The remaining training data, referred to as out-of-bag data, are subsequently used for cross-validation to evaluate the performance of the RF classifier. The Gini index is used as a measure of heterogeneity to determine the conditions for partitioning nodes in each decision tree.
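The bagging and out-of-bag mechanism described above can be sketched with scikit-learn; the toy data, sizes, and parameter values here are illustrative assumptions, not the study's actual inputs:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in data: 200 samples, 24 features (e.g., 3 spectral bands
# x 8 acquisition dates); sizes are illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 24))
y = (X[:, 0] + X[:, 5] > 0).astype(int)  # two synthetic classes

# oob_score=True scores each tree on its out-of-bag samples, giving the
# built-in cross-validation estimate described above.
rf = RandomForestClassifier(
    n_estimators=200,      # ntree: number of trees grown in the forest
    max_features="sqrt",   # mtry: variables tried at each node split
    criterion="gini",      # Gini index as the node-impurity measure
    oob_score=True,
    random_state=0,
)
rf.fit(X, y)
print(f"OOB accuracy estimate: {rf.oob_score_:.3f}")
```

Because the out-of-bag estimate comes for free from the bootstrap samples, no separate validation split is needed to get a rough sense of classifier performance.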

CNN
CNN is a specific DL algorithm for image classification using 2D images as inputs [43,44]. Like ANN, the CNN model has a network structure comprising many layers, with the output of the previous layer sequentially connected to the input of the subsequent layer. A spatial feature extraction stage applying a series of convolutional filters is combined with a conventional ANN structure. CNNs are classified into 1D, 2D, and 3D models according to the dimension of the convolution filter. In this study, the 2D-CNN model was employed for crop classification for the following reasons: In our previous crop classification study using multi-temporal UAV images [33], the classification performance of 2D-CNN was similar to that of 3D-CNN. Furthermore, from a computational perspective, 2D-CNN is more efficient than 3D-CNN in that it has fewer parameters to optimize than 3D-CNN.
The architecture of the CNN model typically consists of three major interconnected layers: convolutional, pooling, and fully connected layers. The convolutional layer first computes the weights by applying a convolution filter that conducts a dot product on the local area to either the two-dimensional input data or the outputs of previous layers. A nonlinear activation function, such as a rectified linear unit (ReLU) or sigmoid, is then applied to generate feature maps. The pooling layer is applied to simplify the extracted features as representative values (maximum or mean). The max pooling layer has been widely used in the CNN-based classification [45]. The convolutional and pooling layers are alternately stacked until high-level features are extracted. After the high-level features are extracted through the convolutional and pooling operations, the output feature maps are transformed into a 1D vector and transferred to the fully connected layer. The last fully connected layer generally normalizes the network output to obtain probability values over the predicted output classes using a softmax function. Finally, the classification result is obtained by applying the maximum probability rule.
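As a concrete illustration, a minimal 2D-CNN of this kind can be sketched in Keras; the layer counts, filter numbers, and patch size below are illustrative assumptions, not the exact architecture used in the study:

```python
from tensorflow.keras import layers, models

def build_cnn(patch_size=9, n_bands=3, n_classes=4):
    """Minimal 2D-CNN: alternating convolution/pooling layers followed
    by fully connected layers and a softmax output."""
    return models.Sequential([
        layers.Input(shape=(patch_size, patch_size, n_bands)),
        # 3x3 convolution + ReLU activation -> feature maps
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),    # max pooling keeps representative values
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),               # feature maps -> 1D vector
        layers.Dense(64, activation="relu"),            # fully connected layer
        layers.Dense(n_classes, activation="softmax"),  # class probabilities
    ])

model = build_cnn()
```

Applying the maximum-probability rule then amounts to taking the argmax over the softmax outputs for each input patch.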

Hybrid CNN-RF Model
In the hybrid CNN-RF model, RF is applied to classify high-level features extracted from the CNN by considering the advantages of both CNN and RF. CNN-RF uses spatial features extracted from the optimal structure and parameters of the CNN as input for the classification. Therefore, no additional feature extraction or selection stages are required prior to the RF-based classification. Compared with the fully connected layer as a classifier in CNN, RF uses more sophisticated classification strategies, such as bagging [38]. Furthermore, the advantages of RF, including its robustness to outliers and its ability to reduce overfitting, can improve the classification performance, even when proper or informative spatial features cannot be extracted from CNN because of insufficient training data and input images.
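A minimal sketch of this hand-off, assuming a toy Keras CNN whose penultimate dense layer (named "features" here, an assumption of this sketch) supplies the inputs to the RF classifier:

```python
import numpy as np
from tensorflow.keras import layers, models
from sklearn.ensemble import RandomForestClassifier

# Toy CNN; sizes and the layer name "features" are assumptions of this sketch.
cnn = models.Sequential([
    layers.Input(shape=(9, 9, 3)),
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(32, activation="relu", name="features"),
    layers.Dense(4, activation="softmax"),
])
# (In practice the CNN is trained first; its softmax head is then bypassed.)

# Feature extractor: the network up to the high-level feature layer.
extractor = models.Model(inputs=cnn.inputs,
                         outputs=cnn.get_layer("features").output)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9, 9, 3)).astype("float32")
y = rng.integers(0, 4, size=100)

features = extractor.predict(X, verbose=0)   # CNN features, shape (100, 32)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(features, y)                          # RF replaces the softmax classifier
pred = rf.predict(features)
```

The design choice is simply to swap the fully connected softmax classifier for a bagged tree ensemble while reusing the learned spatial features unchanged.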

Training and Reference Data Sampling
This study focused on investigating the applicability of CNN-RF in cases where limited input data are used for early crop classification. To consider the case wherein limited training data are available for early crop mapping, the classification performance was compared and evaluated with respect to different sizes of training data.
Training datasets of different sizes and reference data were first extracted from the ground truth maps shown in Figures 1b and 2b, and were subsequently used for classification and accuracy evaluation, respectively (Table 2). Five training datasets of different sizes were prepared to analyze the effect of training data size on the classification performance. The ratio of the number of training pixels between classes was defined based on the size of the crop fields within each study area. For both areas, 20,000 randomly extracted pixels that did not overlap the training pixels were used as reference data.

Optimization of Model Parameters
As the parameters of ML and DL models greatly affect the classification accuracy, determining the optimal parameters is vital to achieving a satisfactory classification accuracy and generalization capability. Detailed information on the parameters tested for the three classifiers is presented in Table 3.

Table 3. Hyper-parameters of the three models applied in this study (p is the number of input variables).

Unlike other ML models such as SVM, RF requires relatively few parameters to be set: the number of trees grown in the forest (ntree) and the number of variables considered for node partitioning (mtry) [16]. The optimal values of ntree and mtry were determined through a grid search over various parameter combinations.
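A grid search of this kind can be sketched with scikit-learn's GridSearchCV, where n_estimators and max_features play the roles of ntree and mtry; the candidate values and toy data are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the classifier inputs (sizes are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 24))
y = (X[:, 0] > 0).astype(int)

p = X.shape[1]  # number of input variables, as denoted p in Table 3
param_grid = {
    "n_estimators": [100, 300, 500],               # ntree candidates
    "max_features": [int(np.sqrt(p)), p // 3, p],  # mtry candidates
}
# Five-fold cross-validation scores every (ntree, mtry) combination.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
best = search.best_params_
```

GridSearchCV exhaustively evaluates each combination with cross-validation, which matches the grid-search-over-combinations procedure described above.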
In a CNN, several parameters, including the number of layers, image patch size, and number and size of convolution filters, must be carefully determined to obtain a satisfactory classification performance. The optimal structure of the CNN model depends heavily on both the parameters and inputs. In particular, the image patch size can significantly affect the classification performance [9,24,25,46]. Using a small patch size may result in overfitting of the model [43], whereas a large patch size may generate over-smoothed classification results [9]. In this study, eight different image patch sizes from 3 to 17, with an interval of 2, were first examined by considering the crop field scale within the study area. Based on our previous study [24], the size of the convolution filter was set to 3 × 3 to avoid the over-smoothing of the feature maps. Another important parameter is the depth of the network (the number of layers). The depth significantly affects the classification accuracy as the level or information content of the trained feature maps varies according to the number of layers [47]. Based on our DL-based classification results using UAV images [24,25], the number of layers was set to six to balance the complexity and robustness of the network structure. ReLU was applied as an activation function for all layers except for the last fully connected layer.
Two distinct regularization strategies were applied to prevent overfitting while training the CNN model. Dropout regularization, which randomly drops certain neurons during the training phase, was employed to reduce inter-dependent learning among neurons. Early stopping was applied as the second regularization strategy to stop training at the specified number of iterations (epochs) when the model performance no longer improved during the iterative training process. The Adam optimizer with a learning rate of 0.001 and a cross-entropy loss was used for model training, as its effectiveness in time-series data classification has been demonstrated previously [11,45].
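These two regularization strategies, together with the Adam optimizer, can be sketched in Keras as follows; the dropout rate, patience value, and architecture are illustrative assumptions, not the study's settings:

```python
from tensorflow.keras import layers, models, callbacks
from tensorflow.keras.optimizers import Adam

model = models.Sequential([
    layers.Input(shape=(9, 9, 3)),
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.5),   # dropout: randomly deactivates neurons in training
    layers.Dense(4, activation="softmax"),
])
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Early stopping: halt training once validation performance stops improving.
stopper = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                  restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=500, callbacks=[stopper])
```

The commented fit call shows where the early-stopping callback would plug into training; X_train, y_train, X_val, and y_val are hypothetical placeholders.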
The CNN-RF model shares the network structure of the CNN model, but some parameters must be adjusted prior to the RF-based classification. As features extracted from the fully connected layer are used as inputs for the RF classifier, mtry and ntree are determined using a grid search procedure similar to that in the RF model. Figure 4 shows the architecture of the CNN-RF model developed in this study. Five-fold cross-validation was employed for optimal model training. Of the five partitions of training data, four were used for the model training. The remaining partition was used as validation samples to seek the optimal hyper-parameters of the classification models.

Incremental Classification
An incremental classification procedure [11,39] was employed to seek the optimal time-series set with the fewest images for early crop mapping. Supervised classification was first conducted with the RF, CNN, and CNN-RF classifiers using the initially acquired UAV image presented in Table 1 (A1 and H1 for Anbandegi and Hapcheon, respectively). Classification was then conducted incrementally using time-series sets in which each subsequent UAV image was progressively added to the images used for the previous classification. In the incremental classification, A8 and H3 in Table 1 indicate that eight and three images were used for classification in Anbandegi and Hapcheon, respectively. This procedure facilitates both the analysis of variations in classification performance with respect to the growing cycles of the crops and the determination of the classifier that identifies crop types effectively as early as possible.
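The incremental procedure amounts to a simple loop over growing time-series subsets; in this sketch, classify() is a hypothetical placeholder returning dummy accuracies, not a real classifier:

```python
# Acquisition order of the UAV images, as labeled in Table 1.
dates = ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"]

def classify(image_subset):
    """Placeholder for training/evaluating a classifier on the stacked
    time series; returns a dummy overall accuracy here."""
    return 0.5 + 0.05 * len(image_subset)

results = {}
for k in range(1, len(dates) + 1):
    subset = dates[:k]              # A1, then A1-A2, ..., up to A1-A8
    results[subset[-1]] = classify(subset)

# The earliest subset whose accuracy approaches the full-series accuracy
# marks the optimal date for early crop mapping.
```

Each loop iteration reuses all previously acquired images plus one new one, mirroring the A1 through A8 (or H1 through H3) cases.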

Analysis Procedures and Implementation
The processing steps applied in this study for crop classification are shown in Figure 5. After preparing the datasets for classification, temporal variations in a vegetation index (VI) for each crop type were first analyzed to identify the optimal dates for early crop mapping and to obtain supporting information for the interpretation of the results. The most commonly used VI is the normalized difference vegetation index (NDVI), which is based on reflectance values from the red and near-infrared spectral bands. However, NDVI could not be computed because the UAV system is not equipped with a near-infrared sensor. Instead, the modified green-red vegetation index (MGRVI), a VI based on the visible bands [48], was computed for the study areas. The average MGRVI value per crop type was then computed using the ground truth maps. After the optimal hyper-parameters for CNN were selected using five-fold cross-validation, the spectral features and the spatial features extracted by the CNN were visualized using t-distributed stochastic neighbor embedding (t-SNE) for a qualitative comparison of class separability. t-SNE is a nonlinear dimensionality reduction technique for visualizing high-dimensional data in 2D space [49,50]. It was employed to visually compare the relative differences in class separability in the feature space with respect to the different training data and input images. All training samples in each class were projected onto the 2D space using t-SNE.
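The MGRVI is commonly defined as (G² − R²)/(G² + R²) on the green and red bands [48]; a per-pixel computation with per-crop averaging might look like the following sketch, where the toy arrays and label map are illustrative assumptions:

```python
import numpy as np

def mgrvi(green, red, eps=1e-12):
    """Modified green-red vegetation index, (G^2 - R^2) / (G^2 + R^2),
    computed per pixel from green and red reflectance arrays."""
    g2 = green.astype(float) ** 2
    r2 = red.astype(float) ** 2
    return (g2 - r2) / (g2 + r2 + eps)   # eps guards against division by zero

# Toy 2x2 scene: vegetated pixels tend to have green > red, so MGRVI > 0.
green = np.array([[0.4, 0.3], [0.2, 0.1]])
red   = np.array([[0.2, 0.3], [0.3, 0.1]])
vi = mgrvi(green, red)

# Average MGRVI per crop type from a ground-truth label map.
labels = np.array([[1, 1], [2, 2]])
mean_vi = {c: float(vi[labels == c].mean()) for c in np.unique(labels)}
```

Repeating this per acquisition date yields the temporal MGRVI profiles per crop type used in the analysis.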
All three classifiers with the optimized hyper-parameters were then applied to generate incremental classification results for the different combinations of training data sizes and time-series lengths. The classification performance of the three classifiers was quantitatively evaluated with respect to these factors. After preparing a confusion matrix using the independent reference data, the overall accuracy (OA) and F-measure were calculated as quantitative accuracy measures. The F-measure is defined as the harmonic mean of precision and recall, where precision and recall correspond to the user's accuracy and producer's accuracy, respectively [11]. As DL models are stochastic in nature, considerably different classification results may be obtained even with the same input images and training data. To ensure a fair comparison of the three classifiers, the classification was repeated five times for each classifier, and the average and standard deviation of the five accuracy values were used for the quantitative comparison. For each independent classification run, different training samples were randomly selected using different random seeds, but the total number of training samples for each of the five cases in Table 2 was fixed. Based on the quantitative accuracy measures, the time-series analysis of the VI, and the qualitative analysis of class separability, the classification results were finally interpreted in the context of early crop mapping.
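Under the standard definitions, OA and the per-class F-measure can be computed from a confusion matrix with scikit-learn; the labels below are toy values for illustration only:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy reference (ground truth) and predicted labels.
y_ref  = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 2, 1, 1])

cm = confusion_matrix(y_ref, y_pred)   # rows: reference, columns: prediction
oa = accuracy_score(y_ref, y_pred)     # overall accuracy

# F-measure: harmonic mean of precision (user's accuracy) and
# recall (producer's accuracy), computed per class.
f_per_class = f1_score(y_ref, y_pred, average=None)
```

Repeating this over the five classification runs and averaging the resulting OA and F-measure values reproduces the evaluation scheme described above.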
Classification using RF, CNN, and CNN-RF was implemented with the Scikit-learn [51], TensorFlow [52], and Keras [53] libraries on Python 3.6.7. All three models were run on the CentOS 7.0 operating system with an Intel XEON E5-2630 v4 @ 2.2 GHz CPU and two NVIDIA GTX1080ti GPUs with 11 GB of memory.

Time-Series Analysis of Vegetation Index
Figure 6 shows the temporal variations in the average MGRVI for the individual crop types in the two study areas. As shown in Figure 6a, the June UAV images in Anbandegi provide useful information for distinguishing highland Kimchi cabbage from the other crops because of the difference in sowing time. As the difference in MGRVI between potato and cabbage increased from the end of June, the image from late June (A3) or mid-July (A4) is useful for distinguishing between potato and cabbage. The MGRVI value of fallow was consistently high throughout the entire crop growth cycle; hence, it is relatively easy to discriminate fallow from crops. Although the time at which the optimal OA is achieved for early crop mapping may vary annually depending on the agricultural environment, the optimal time in 2018 for early crop mapping in Anbandegi was the end of June (A3) or the middle of July (A4). In Hapcheon, the MGRVI values of all crops increased gradually from April to May, and only slight differences in the variation of MGRVI were observed, except for onion (Figure 6b). Consequently, it may be difficult to discern the crops even if all time-series images are used for classification. In contrast, it is easy to identify fallow, which is consistently maintained as bare soil while the crops are growing.

Such different behaviors of vegetation vitality in the two study areas imply that limited input images affect the classification accuracy differently in each area.

Comparison of Class Separability in the Feature Space
The final optimal hyper-parameters determined from the five-fold cross-validation are given in Tables S1 and S2 for Anbandegi and Hapcheon, respectively. Figures 7 and 8 show the embeddings of the features for combinations of two training data sizes (T1 vs. T5) and two input image sets (first image vs. all images) for Anbandegi and Hapcheon, respectively. These representative cases were visualized for comparison purposes. The spectral features are the three spectral bands of each UAV image. In Anbandegi (Figure 7), the spatial features extracted from CNN separate all crops better than the spectral features. The spatial features extracted from CNN with all eight images (A8) and the largest training dataset (T5) maximize the inter-class variability and minimize the intra-class variability, thereby clearly separating each crop type. When using all eight images (A8) with the smallest training dataset (T1), each crop type is better distinguished in the CNN-based feature space than in the spectral feature space. The CNN-based features also show much better class separability than the spectral features when using A1 and T1, particularly for highland Kimchi cabbage, the major crop type in Anbandegi. All crops are widely spread out and overlap significantly in the spectral feature space, whereas the highland Kimchi cabbage and fallow samples are well separated in the CNN-based spatial feature space. However, the cabbage and potato samples slightly overlap in the CNN-based feature space because of their similar sowing times, indicating that CNN-based features may not provide useful information for classification in the worst case of data availability.
In Hapcheon, as shown in Figure 8, better class separability was obtained with the spatial features extracted from CNN than with the spectral features, similar to the results in Anbandegi. For example, the barley and garlic samples largely overlap in the spectral feature space, regardless of the training data size and the length of the time-series. As shown in the temporal profiles of MGRVI in Figure 6b, the small changes in the MGRVI of the barley and garlic samples during the entire growth period lead to large overlaps in the spectral feature space. However, the onion samples were well separated owing to the gradual change in MGRVI, unlike the other crops. It should be noted that in the case of T1, the difference in class separability between garlic and barley for H1 and H3 is insignificant, unlike in Anbandegi.
The visual comparison of the feature embeddings in the two study areas indicates that the CNN-based spatial features are more useful for crop classification than the spectral features. However, some confusion persists between certain crop types in the CNN-based feature space when using the smallest training dataset (T1) and the fewest input images. This implies the necessity of applying a more sophisticated classifier.

Results in Anbandegi
Figure 9 shows the variations in the average OA of the three classifiers in Anbandegi with respect to the five training data sizes (T1 to T5) and the eight time-series lengths (A1 to A8) used for classification. The standard deviations of OA from the five classification runs for the different combination cases are presented in Table S3.
As expected, the OA values of all classifiers increased as more input images and training data were used for classification. Using all eight images (A8) and the largest training dataset (T5) yielded the highest classification accuracy; in this case, CNN and CNN-RF achieved a similar accuracy. For all combination cases, the best and worst classifiers were CNN-RF and RF, respectively. A substantial improvement in the OA of CNN-RF over RF and CNN was obtained when only one or two images (A1 and A2) were used with small training datasets (T1 and T2). The maximum increases in the OA of CNN-RF over RF and CNN were 21.3%p (A2 with T2) and 6.7%p (A1 with T2), respectively. Given the much larger number of reference samples (20,000) than training samples (80 to 1280), even the small difference in OA between CNN-RF and CNN can be considered meaningful: when accuracy statistics are calculated from a large reference set, even a small difference in OA corresponds to a large number of correctly classified samples. Furthermore, the difference in OA between CNN-RF and CNN for each classification run with limited input data was statistically significant at the 5% significance level according to the McNemar test [54]. When input images were added up to the A3 case (the optimal date identified from Figure 6a), the greatest increase in OA was achieved for all classifiers, regardless of the training data size. Conversely, the incremental classification results using more input images (A4 to A8) yielded a slight increase in OA for T1 and T2 but little change for T3 to T5. The similar OA values for A3 and A4 indicate that the optimal time in 2018 for early crop mapping was the end of June (A3), as the fewest possible images are preferred for early crop mapping.
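For reference, a McNemar test comparing two classifiers against the same reference samples can be sketched as follows; the continuity-corrected chi-square form and the synthetic predictions are assumptions of this sketch, not the study's exact procedure:

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_test(ref, pred_a, pred_b):
    """McNemar test on the disagreements of two classifiers evaluated
    against the same reference labels (chi-square form with a
    continuity correction)."""
    a_ok = pred_a == ref
    b_ok = pred_b == ref
    n01 = int(np.sum(a_ok & ~b_ok))   # classifier A correct, B wrong
    n10 = int(np.sum(~a_ok & b_ok))   # classifier A wrong, B correct
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    p_value = chi2.sf(stat, df=1)     # upper tail of chi-square, 1 d.o.f.
    return stat, p_value

# Synthetic predictions with different error rates, for illustration only.
rng = np.random.default_rng(0)
ref = rng.integers(0, 2, size=1000)
pred_a = np.where(rng.random(1000) < 0.9, ref, 1 - ref)  # ~90% accurate
pred_b = np.where(rng.random(1000) < 0.8, ref, 1 - ref)  # ~80% accurate
stat, p = mcnemar_test(ref, pred_a, pred_b)
```

A p-value below 0.05 would indicate that the two classifiers' error patterns differ significantly at the 5% level.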
When the three images acquired up to the end of June (A3) were used with relatively small training samples (T2), the OA of CNN-RF was 14.4%p and 2.1%p higher than that of RF and CNN, respectively. In contrast, the cases using relatively large training samples (T4 and T5) yielded a similar classification accuracy for CNN and CNN-RF. Therefore, CNN-RF is more beneficial than CNN for early crop mapping in Anbandegi when only a few images acquired up to the optimal date and small training samples are used. Figure 10 shows the F-measure for each crop type of specific incremental classification results generated using T1 (a case with a large difference in OA between classifiers) and T5 (the case with the highest OA for all classifiers). The largest difference in OA was observed for T2; however, the incremental classification result for T1 was selected for comparison because T1 yielded a similar temporal variation in OA and showed more distinctive class separability of the CNN-based spatial features relative to the spectral features (Figure 7). When using T1, the F-measure of highland Kimchi cabbage increased from A1 to A3 and varied only slightly from A4 to A8. This is consistent with the temporal variations in MGRVI, which suggest that the optimal time for classifying all crops in Anbandegi is the end of June (A3), as shown in Figure 6a. The best classifier for highland Kimchi cabbage was CNN-RF, whose F-measure for A1 with T1 was 18.5%p and 3.5%p higher than that of RF and CNN, respectively. The F-measure of cabbage and potato for T1 increased significantly for all three classifiers from A1 to A4, except for the F-measure of potato for RF. This increase is mainly due to the growing difference in MGRVI between cabbage and potato from the end of June (A3) to the middle of July (A4) (Figure 6a). Similar to highland Kimchi cabbage, the best classifier for cabbage and potato was CNN-RF.
However, when a single image was used for classification (A1 with T1), the F-measure of CNN was lower than that of RF. As shown in Figure 7, both the cabbage and potato samples are spread widely and overlap in both the spectral and CNN-based spatial feature spaces. Meanwhile, despite a similar or slightly greater overlap of these crops in the spectral feature space, RF achieved better classification accuracy for potato and cabbage than CNN. This indicates that CNN-based spatial features cannot provide useful information for classifying cabbage and potato in the worst case of data availability. Consequently, the poor classification performance of CNN is caused by both the failure to extract informative spatial features and the application of the fully connected layer as a classifier. Conversely, this result demonstrates the better classification capability of RF compared with CNN. By applying RF as a sophisticated classifier to the CNN-based spatial features, CNN-RF can distinguish between cabbage and potato more accurately than either CNN or RF.
When using the largest training samples (T5), the F-measure values of all crops increased from A1 to A3 and varied only slightly from A4 to A8. The highest F-measure values for all crops were achieved by CNN-RF when fewer images were used for classification (A3). The F-measure values of CNN and CNN-RF were very similar when more images were used for classification (A4 to A8).
The quantitative accuracy assessment results shown in Figures 9 and 10 indicate that CNN-RF is the most accurate classifier for crop mapping in the study area, regardless of the training data size and length of the time-series. Furthermore, the superiority of CNN-RF over RF and CNN is more prominent for the worst data availability with limited training data and input images; this confirms the potential of CNN-RF for early crop mapping.
To illustrate the advantage of CNN-RF for early crop mapping, the specific classification results using T2, which showed significant differences in OA between classifiers, are shown in Figure 11 for visual comparison. Highland Kimchi cabbage fields are properly classified when one image (A1) and three images (A3) are used with small training samples, except for certain isolated pixels at field boundaries. However, most cabbage and potato fields are misclassified when only one image (A1) is used as the input. As the sowing time of cabbage and potato is early June, a single June image is not adequate for their discrimination. Conversely, adding the images acquired until the end of June (A1 to A3) substantially reduced the misclassified pixels in the cabbage and potato fields. Although noise effects still exist, the overall classification result using A3 is similar to that using all eight images (A8), thus demonstrating the benefit of CNN-RF for early crop mapping.

Results in Hapcheon
The OA in Hapcheon increased for all classifiers as more input images and training samples were used for classification, similar to the results in Anbandegi (Figure 12). Average OA and standard deviation values from five classification runs are presented in Table S4. The maximum OA values for RF, CNN, and CNN-RF were 82.6%, 91.7%, and 92.1%, respectively, with CNN-RF achieving the highest OA when using all three images (H3) along with the largest training data (T5). Furthermore, even with limited inputs, CNN-RF was the best classifier in the study area, and the difference in the OA between CNN-RF and CNN for each classification run was statistically significant according to the McNemar test.
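For reference, the McNemar test compares two classifiers evaluated on the same test pixels through their disagreement counts. A minimal sketch (the prediction arrays are illustrative placeholders, not the study's actual outputs) with the standard continuity correction is:

```python
import math

def mcnemar_test(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar test for paired classifier comparison."""
    # b: pixels classifier A labels correctly but B does not; c: the reverse.
    b = sum(a == y != p for y, a, p in zip(y_true, pred_a, pred_b))
    c = sum(p == y != a for y, a, p in zip(y_true, pred_a, pred_b))
    if b + c == 0:
        return 0.0, 1.0  # no discordant pixels: classifiers indistinguishable
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # With 1 degree of freedom, the chi-square survival function reduces
    # to the complementary error function.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

A p-value below 0.05 indicates that the two classifiers' error patterns differ significantly, which is the criterion applied to the CNN-RF versus CNN comparison above.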
Unlike in Anbandegi, variations in the OA in Hapcheon were affected more by the training data size than by the length of the time-series. The minor impact of the length of the time-series can be explained by the small number of input images and the slight difference observed in the temporal variation in MGRVI among the three images. When all three images (H3) were used for classification with CNN-RF, its OA for T5 increased by 13.1%p compared with that for T1. However, for T5, the improvement in the OA of H3 over H1 was only 4.9%p. In some cases, a slight decrease in OA was observed for RF and CNN (RF with H3 and T2, and CNN with H2 and T2). The similar patterns in the feature space for H1 and H3 in Figure 8 explain the slight improvement in the OA. Based on these results, early to middle April is estimated to be the optimal date in 2019 for early crop mapping in Hapcheon when sufficient training samples are unavailable. The best classifier for early crop mapping is CNN-RF, as in Anbandegi.
The variations in the class-wise F-measure for the two different training data sizes (T1 and T5) are shown in Figure 13. In the case of T1 (the smallest training samples), the F-measure values for garlic and barley were lower than those for onion for all three classifiers. In particular, the F-measure of barley for RF and CNN was very low when using H1, and using more images did not improve the accuracy of barley. The low accuracy for garlic and barley is mainly due to the spectral similarity in Figure 6b and the low class separability in Figure 8. The F-measure of barley for CNN-RF was not high (0.68) in the case of T1, but CNN-RF increased the F-measure by approximately 9.6%p compared with that for CNN. Furthermore, CNN-RF increased the F-measure for garlic when all three images were used. The accuracy of onion and fallow for H3 was lower than that for H2 because of the confusion between the two classes (1.9%p and 2.6%p lower for onion and fallow, respectively), but the OA of CNN-RF was still 4.2%p higher than that of CNN (Figure 12). In the case of T5 (the largest training samples), a small variation in accuracy for onion and fallow was observed for all three classifiers. The class-wise accuracy values of CNN-RF are similar to those of CNN, regardless of the number of input images. The F-measure of barley was improved by approximately 10%p for all three classifiers.
As the classification accuracy is less affected by the length of the time-series, collecting sufficient training samples is vital for early crop mapping in Hapcheon. Given the difficulty of collecting sufficient training samples, CNN-RF is the best classifier for the early-stage classification of highly similar crops using limited training samples. Figure 14 shows the incremental classification results of CNN-RF with T1 for different input images; T1 (the smallest training samples) was selected for the visual comparison because collecting training samples is difficult in practice. The confusion between garlic and barley is observed in all classification results. When only one image (H1) is used, barley is misclassified more frequently; however, the other crops are identified as well as in the classification results using more images. As the major crop types in the study area are onion and garlic, it can be concluded that CNN-RF can generate reliable classification results in Hapcheon even with limited inputs for early crop mapping (T1 and H1).

Discussion
Unlike most previous studies on early crop mapping, which focused on the effects of various time-series image combinations for the selection of optimal dates [10,11,55,56], the main contribution of this study is the assessment of classification accuracy under the data availability conditions specific to early crop mapping, namely the training sample sizes and the length of the time-series, together with the benefit of CNN-RF as a sophisticated classifier within the same experimental framework.
From a methodological viewpoint, CNN-RF is attractive for supervised classification. Using informative high-level spatial features extracted by a CNN is more promising than using only spectral information, and end-to-end learning, i.e., the automatic extraction of spatial features that account for spatial contextual information, is the main merit of CNN. However, a large amount of training data is usually required to extract informative high-level spatial features in CNN-based classification [37]; CNN models trained with limited training data may therefore produce poor classification accuracy. Moreover, in the worst case of data availability, a simple classifier such as the fully connected layer with a softmax activation function in the CNN model may fail to achieve satisfactory classification performance. In this case, more sophisticated conventional ML classifiers can be applied to the incomplete spatial features extracted by the CNN. Once input features are prepared for classification, RF is one of the most promising ML classifiers, as it applies more sophisticated classification strategies to avoid overfitting; however, considerable effort is required to prepare input features prior to RF-based classification. The strength of CNN-RF is its ability to combine the complementary characteristics of CNN (a simple classifier with an automatic spatial feature extractor) and RF (a sophisticated classifier without a feature extractor). As RF requires few user-defined parameters, CNN-RF enables end-to-end learning with satisfactory classification performance. Therefore, CNN-RF is a promising classifier for supervised classification using limited training data, particularly for early crop mapping.
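To make this division of labor concrete, the following minimal sketch illustrates the hybrid idea: convolutional features are pooled and then classified by an RF rather than by a fully connected softmax head. It is an illustration only, not the network used in this study; the random 3×3 kernels, toy 8×8 patches, and RF settings are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def conv_feature(patch, kernel):
    """One valid 2-D convolution + ReLU + global average pooling -> scalar."""
    h, w = patch.shape
    kh, kw = kernel.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(patch[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0).mean()

# Toy "image patches" for two crop classes (illustrative data, not UAV imagery).
patches, labels = [], []
for cls in (0, 1):
    for _ in range(50):
        patches.append(rng.normal(cls, 0.5, size=(8, 8)))
        labels.append(cls)

# Stand-in convolutional filters; in CNN-RF these are learned by the CNN.
kernels = [rng.normal(size=(3, 3)) for _ in range(8)]
X = np.array([[conv_feature(p, k) for k in kernels] for p in patches])

# RF replaces the fully connected classification head.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
```

The key design point is that the RF consumes whatever feature vector the convolutional stage produces, so the extractor and the classifier can be chosen independently.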
Despite the promising results of CNN-RF for early crop mapping, some aspects can be improved through future research. With CNN-RF, once the optimal time for early crop mapping has been determined, the number of training samples is critical to achieving satisfactory classification accuracy [56]. In this study, using more training samples with images acquired until the optimal date yielded higher classification accuracy (Figures 9 and 12). Thus, developing strategies to add informative samples to the training data is necessary. The first possible strategy is data augmentation (DA), which artificially increases the quantity of training data [10,57]. However, if DA is applied to the training data collected during the early growth stage, wherein confusion between crops may exist, the classification accuracy may not significantly improve, as ambiguous samples with questionable labels are likely to be selected. Another possible strategy is to select informative unlabeled pixels through the learning process and subsequently add them to the training data. Semi-supervised learning [58], active learning [59], or self-learning [18] can be applied to extract informative pixels as candidates for new training data. CNN-RF can be employed as a basic module for the feature extraction and classification within an iterative framework. However, the iterative addition of new informative pixels incurs high computational costs, particularly in DL-based classification. It should be noted that, whatever strategy is applied to expand the training data, the focus should be on increasing the diversity of the training data to improve the generalization ability of the classification model.
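As an illustration of the iterative self-learning idea, the following sketch adds high-confidence pseudo-labeled samples to the training set over a few rounds. The toy Gaussian data, the 0.95 confidence threshold, and the use of a plain RF as a stand-in for the full CNN-RF module are all assumptions made for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy features: two well-separated classes (illustrative, not real pixels).
X_train = np.vstack([rng.normal(0, 1, (10, 4)), rng.normal(3, 1, (10, 4))])
y_train = np.array([0] * 10 + [1] * 10)
X_unlab = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(3, 1, (100, 4))])

for _ in range(3):  # a few self-training rounds
    if len(X_unlab) == 0:
        break
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    rf.fit(X_train, y_train)
    proba = rf.predict_proba(X_unlab)
    keep = proba.max(axis=1) >= 0.95  # assumed confidence threshold
    if not keep.any():
        break
    pseudo = rf.classes_[proba.argmax(axis=1)]  # pseudo-labels
    X_train = np.vstack([X_train, X_unlab[keep]])
    y_train = np.concatenate([y_train, pseudo[keep]])
    X_unlab = X_unlab[~keep]  # remove absorbed pixels from the pool
```

Each round retrains the classifier from scratch, which is exactly where the high computational cost noted above arises in a DL-based implementation.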
In this study, a feature selection procedure was not employed because few features are available for early crop mapping with limited input data. As not all CNN-based spatial features contribute equally to the identification of different crop types, assigning greater weights to informative features can improve the classification performance. Within the CNN-RF framework, a weighting scheme such as squeeze-and-excitation networks [60] can be incorporated to assign feature-specific weights to the CNN-based spatial features, which are then fed into the RF classifier. Another possible approach is a metric learning network, in which the similarity between high-dimensional features is evaluated during the training of the CNN model [61]. Thus, it is worthwhile to investigate the potential of such improved feature learning schemes in future work.
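A minimal numpy sketch of such squeeze-and-excitation-style gating on flat feature vectors might look as follows. The two weight matrices are random placeholders here; in practice they would be learned jointly with the CNN, and the reduction ratio of 4 is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def se_reweight(feats, w1, w2):
    """Squeeze-and-excitation-style gating for feature vectors.

    feats: (n_samples, n_features). Each sample's features are scaled by
    per-feature gates in (0, 1), so informative channels can be emphasized
    before the vectors are passed to the RF classifier.
    """
    hidden = np.maximum(feats @ w1.T, 0.0)          # reduction FC + ReLU
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2.T)))  # expansion FC + sigmoid
    return feats * gates

n_features, reduction = 16, 4
w1 = rng.normal(size=(n_features // reduction, n_features))
w2 = rng.normal(size=(n_features, n_features // reduction))
feats = rng.normal(size=(32, n_features))           # stand-in CNN features
weighted = se_reweight(feats, w1, w2)               # input to the RF stage
```

Because the gates lie strictly between 0 and 1, the scheme attenuates rather than amplifies features, leaving the RF to exploit the resulting relative emphasis.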
In this study, ultra-high spatial resolution UAV images were used as inputs for the crop classification as most crops in Korea are usually cultivated in small fields. If remote sensing images with coarser spatial resolutions are used for early crop mapping, the mixed pixel effects of training samples are inevitable and may greatly affect the classification performance as they may fail to provide the representative signature of individual crops. Recently, Park and Park [62] reported that the classification performance of CNN-based classification depends on the class purity within a training patch. Thus, this aspect should be considered in future studies on satellite images with coarser spatial resolutions than UAV images to generalize the key findings of this study.

Conclusions
The provision of early information on crop types and distributions is critical for the timely evaluation of crop yield and production for food security. This study quantitatively evaluated the potential of the hybrid classification model (CNN-RF), which leverages the advantages of both CNN and RF, to improve the classification performance for early crop mapping. Crop classification experiments using UAV images in the two study areas in Korea demonstrated the benefits of CNN-RF for early crop mapping. The superiority of CNN-RF over CNN and RF is prominent when fewer images and smaller training samples are used for classification. Only a few time-series images are available for early crop mapping, and collecting sufficient training samples is often difficult during the early growth stage of crops. Such limited input datasets are also common in general supervised classification owing to cloud contamination of optical images and the difficulty of collecting sufficient training samples. Therefore, the hybrid CNN-RF classifier presented in this study is expected to be a promising classifier for general supervised classification tasks using limited input data as well as for early crop mapping. Furthermore, the optimal time for early crop mapping determined in this study can be effectively used to plan future UAV image acquisitions in the study area.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/rs13091629/s1, Table S1: Optimal hyper-parameter values determined for combination cases of different training data sizes and the lengths of time-series in Anbandegi (p is the number of input variables). Table S2: Optimal hyper-parameter values determined for combination cases of different training data sizes and the lengths of time-series in Hapcheon (p is the number of input variables). Table S3: Average overall accuracy with standard deviation of five classification results with respect to combination cases of different training data sizes and the lengths of time-series in Anbandegi (the best case is shown in bold). Table S4: Average overall accuracy with standard deviation of five classification results with respect to combination cases of different training data sizes and the lengths of time-series in Hapcheon (the best case is shown in bold).