Generating Up-to-Date Crop Maps Optimized for Sentinel-2 Imagery in Israel

: The overarching aim of this research was to develop a method for deriving crop maps from a time series of Sentinel-2 images between 2017 and 2018 to address global challenges in agriculture and food security. This study is the ﬁrst step towards improving crop mapping based on phenological features retrieved from an object-based time series on a national scale. Five main crops in Israel were classiﬁed: wheat, barley, cotton, carrot, and chickpea. To optimize the object-based classiﬁcation process, different characteristics and inputs of the mean shift segmentation algorithm were tested, including vegetation indices, three-band combinations, and high/low emphasis on the spatial and spectral characteristics. Four known vegetation indices (VIs)-based time series were tested. Additionally, we compared two widely used machine learning methods for crop classiﬁcation, support vector machine (SVM) and random forest (RF), in addition to a newer classiﬁer, extreme gradient boosting (XGBoost). Lastly, we examined two accuracy measures—overall accuracy (OA) and area under the curve (AUC)—in order to optimally estimate the accuracy in the case of imbalanced class representation. Mean shift best performed when emphasizing both the spectral and spatial characteristics while using the green, red, and near-infrared (NIR) bands as input. Both accuracy measures showed that RF and XGBoost classiﬁed different types of crops with signiﬁcantly greater success than achieved by SVM. Nevertheless, AUC was better able to represent the signiﬁcant differences between the classiﬁcation algorithms than OA was. None of the VIs showed a signiﬁcantly higher contribution to the classiﬁcation. However, normalized difference infrared index (NDII) with XGBoost classiﬁer showed the highest AUC results (88%). This study demonstrates that the short-wave infrared (SWIR) band with XGBoost improves crop type classiﬁcation results. Furthermore, the study emphasizes the importance of addressing imbalanced classiﬁcation datasets by using a proper accuracy measure. Since object-based classiﬁcation and phenological features derived from a VI-based time series are widely used to produce crop maps, the current study is also relevant for operational agricultural management and informatics at large scales.


Introduction
Agriculture plays a crucial role in both water management and food security. Irrigated agriculture is one of the most significant water consumers [1], and agricultural production is key to food supply. Due to the impact of agriculture on the world's resources, it must be optimally managed to secure our future. Identifying which crop is grown where and when is essential for better decision-making regarding water and fertilization. For example, crop maps can facilitate crop-specific water consumption estimates [2,3] and provide a basis for regional yield forecasting [4]. Therefore, decision-makers may rely on crop maps as a reliable tool to facilitate better agricultural management. and evaluation. Each one of those steps was performed in several ways. Here, we compare the performance of different combinations of parameters (i.e., high/low spectral or spatial) and inputs (i.e., three optical bands or NDVI greyscale band) to the Mean Shift segmentation technique. In addition, to determine the best input for classification, we compare time series of different VIs, such as NDVI, OSAVI, NDRE, and NDII, in addition to different phenology features. Moreover, we compare three machine learning methods-random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGBoost)-for crop classification. Lastly, we test and discuss two accuracy measures, AUC and OA, to evaluate imbalanced classification.

Sentinel-2 Data Acquisition, Preprocessing, and VIs Calculation
A time series of Sentinel-2 images from 119 dates between October 2017 and September 2019 was acquired from the Copernicus Open Access Hub (https://scihub.copernicus.eu/ accessed on 1 September 2021). The data were obtained for seven Sentinel-2 L2A tiles (36SXB, 36SYB, 36SXA, 36SYA, 36RXV, 36RXU, 36RXT) in order to achieve nationwide coverage of Israel ( Figure 1). The tiles were preprocessed using ERDAS Imagine 2018 (Hexagon Geospatial, Peachtree Corners Circle Norcross). The resolution of 20 m bands was resampled into 10 m using the nearest neighbor algorithm. All bands from the same tile were stacked into one tile to create a 12-band tile, and all tiles from the same date were mosaicked to create an image of the total area of Israel. Finally, clouds were masked to create a cloudless mosaic using the Sentinel-2 level 2A product that includes a cloud detection flag. The images were also masked by the borders of Israel's agricultural area [39]. This way, we created 10 m cloud-free mosaics for a period of two years ( Figure 2A). Then, four VIs: NDVI, OSAVI, NDRE, and NDII (Appendix A) were calculated for each one of the 119 mosaics ( Figure 2B).

Segmentation
Object-based classification begins with the creation of objects in a process named segmentation. The segmentation process groups similar adjacent pixels into an object for a joint classification. In crop classification, ideally, each object represents the boundaries of a different agricultural field.
Mean shift segmentation [40] is a nonparametric iterative technique that uses kernel density estimation (KDE). This algorithm has two bandwidth parameters-spectral and spatial. Satellite images are represented as a three-dimensional matrix, employing length, height/width, and spectral bands. The space of the matrix is known as the spatial domain, and the spectral information is known as the range domain. The values of the spectral and spatial parameters determine the spectral and spatial distance between pixels in the range and spatial domains.
Mean shift segmentation was performed using ArcGIS Pro (ESRI, West Redlands, CA, USA) ( Figure 2C). In this platform, the spectral and spatial parameters are represented according to the level of importance (in the range of [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20] given to the spectral differences and the importance given to the proximity, respectively. In this study, these parameters are referred to as spectral and spatial characteristics of pixels within an object. When the parameter value is closer to 20, the emphasis of the parameter is greater, and the importance given to the spectral or spatial characteristics is greater. Therefore, high importance will result in greater similarity of pixels within an object compared with a parameter value closer to 1. Six different combinations of characteristics importance and inputs (Table 1) were assessed for the mean shift algorithm. These inputs were chosen due to the relatively high spatial resolution of the bands used to calculate them (10 m), and the emphasis on the characteristics was selected since this ratio performed well for Sentinel-2 image segmentation [31]. As reference data, polygons of field boundaries were digitized. To generate these polygons, we used six Sentinel-2 images and created 350 polygons for each one of them. To evaluate the segmentation accuracy, segmentation outputs were compared to the compatible reference polygons using the intersection over union (IoU) method [41]. IoU measures the similarity between two segments by dividing the area of overlap by the area of the union of the two segments. This process was repeated six times for the following dates: 31 January 2018, 16 April 2018, 9 August 2018, 31 January 2019, 11 April 2019, and 9 August 2019. To test if there is a significant difference between at least two of the combinations, analysis of variance (ANOVA) was performed, and to test whether there is one combination that is significantly different than the others, a Tukey's post hoc analysis [42,43] was performed. The statistical tests were performed using the Python StatsModels statistics package [44]. Finally, 24 images, one per month with minimal cloud coverage, were segmented into objects using input Combination 1.

Time Series Analysis
For each one of the 119 images and each one of the VIs, the average pixel value inside the area of each segment was calculated to create a VI time series for each field. The results were four time series for every field-one for each VI. Noise was removed from the VI time series ( Figure 2D) using the Python Pandas package [45]. First, it was determined whether one crop was grown during the growing season or two crops were grown one after the other during the growing season in each of the fields. A field-specific sample was considered as a single crop sample if the VI time series contained a single maximum point, and a double-crop sample if it contained two maximum points that were at least 75 days apart and had a local minimum point between them. In the case of the latter, the VI time series of the sample was split into two; one for each crop cycle. Then, all outliers (VI values that were out of trend) were removed. Finally, a Whittaker interpolation was performed for gap filling, followed by Whittaker smoothing to smooth the data [46], using RStudio version 1.2.5033 [47]. The VIs time series were used as the input for the feature extraction process ( Figure 2E). Eleven phenological features that represent the unique crops pattern were extracted: the crop start of season (SOS) and end of season (EOS) dates and VI values, the growing season length, the maximum VI date and value, the crop enhancement and reduction periods, the seasonal integral, and the geographic location of the plot. The essential features that served as the basis for most of the other features are the crop start of season (SOS) and end of season (EOS) dates and VI values. To detect those features automatically, we used a modified approach based on a moving average developed by Reed et al. [48]. This method uses a five-point moving average such that the first point is the average of the first five points from the VI time series; the next point is the average of five points following a shift by one point forward, and so on. In this method, the SOS and EOS are the points where the VI time series crosses the moving-average curve or the reverse moving-average curve, respectively.
Once the SOS and EOS dates and the corresponding VI values were extracted, we extracted the remaining features: the growing season length, which is the distance between the SOS and EOS dates; the maximum VI value of the growing season, as well as the date when the growing season reaches its maximum; the crop enhancement and reduction periods, which are the distances from the SOS and EOS dates to the maximum VI date, respectively; the seasonal integral, which is the area below the VI curve between the SOS and EOS (Figure 3), and lastly, the geographic location of the plot ( Figure 4). We hypothesized that some crops would be more abundant in specific regions and used the K-means [49] clustering technique to assign every sample to a cluster based on the sample's coordinates. At the end of the process, four datasets, one for each VI, were created. Each one of the datasets contained the phenological features of all the fields. Figure 3. Phenology features. The x-axis value of the left red dot represents the SOS date, and the y-axis value of the left red dot represents the SOS VI value. The x-axis value of the right red dot represents the EOS date, and the y-axis value of the right red dot represents the EOS VI value. The x-axis value of the green dot represents the maximum date, and the y-axis value of the red dot represents the maximum VI value. The orange, green, and blue lines represent the growing season length, the length of the enhancement period, and the length of the reduction period, respectively.
The phenological signature of each crop was calculated to compare how different features were pronounced in every crop. Therefore, we computed the mean standard score (the distance of the result from the mean in standard deviation units) of each crop by each feature using the Python Scikit-learn package. Since the standard score has slightly different interpretations depending on whether the feature's distribution is normal or not, the Kolmogorov-Smirnov test was performed using the Python Scipy package to determine the normality of the distribution.

Classification and Accuracy Assessment
Using the Python Scikit-learn package, three classification algorithms (i.e., RF, SVM, XG Boosting) were tested to determine the best VI dataset and classification algorithm for crop classification ( Figure 2F): 1361 fields of five crops in Israel were selected to train and test the model. We used crop type polygons from the Ministry of Agriculture [39] that demark the boundaries of each field and the crop type that is grown in that field to assign the type of crop to each segment, using the k-dimensional tree (k-d tree) algorithm [50]. These polygons were created manually from an annual survey carried out by the Ministry of Agriculture [39]. Comparing these polygons to information collected directly from farmers shows that the polygons are reliable but do not cover all of the agricultural areas of Israel.
The four VI datasets were split into 75% train dataset and 25% test dataset ( Figure 4). The effect of each phenology feature on the RF and XGBoost classifiers was assessed using the Python Scikit-learn package. The importance was determined by checking how randomizing each feature (and therefore excluding its impact on the classification) affected the accuracy and then standardizing the result to a scale of 0-1. The present Python libraries cannot find the feature importance for nonlinear SVM because the importance of the features is not directly related to the input features.
Evaluation of the accuracy of each crop classification method is often performed using the overall accuracy (OA) statistic. One of the challenges with assessing the accuracy of the classification is the imbalanced classification. This issue occurs when there is a different number of samples from each class. This uneven distribution causes the classification to be biased or skewed. Since the OA can change according to the data distribution in different columns in the confusion matrix, even when the classification performance does not change [51], it is necessary to use an estimation technique that is insensitive to the imbalance in the dataset. One such method is derived from the receiver operating characteristics (ROCs) by relating to the area under the ROC curve. ROC is a graphic method for presenting different classification results. The y-axis of the ROC graph shows the true positive rate, and the x-axis shows the false positive rate. Points generate the curve; each one of them represents the classification results using a different threshold. The AUC accuracy measure is the calculation of the area under the ROC curve. The traditional use of the ROC curve and the AUC measure is for binary classification, as described above. In order to modify this method to imbalanced multiclass classification problems, a ROC curve is produced for each pair of classes. The AUC in such a case would be the mean AUC from all the pairs [51][52][53][54][55]. Therefore in order to evaluate the classification results, we used both the OA and AUC measures. These measures were used for train, validation, and cross-validation evaluation.
The user's accuracy (UA) and producer's accuracy (PA) measurements were computed for each class, for each VI, and for each classification method. Additionally, OA and AUC measurements were calculated for each VI and classification method with train, test, and 10-fold cross-validation estimations. ANOVA and post hoc Tukey statistical tests at α < 0.05 level were computed to determine whether there was a significant difference between the accuracy results of the VIs input and the classifiers ( Figure 2F).

Segmentation
Six combinations of two spectral and spatial characteristics and two types of inputs (Table 1) were compared on six different dates in order to determine the optimal way to use the mean shift segmentation algorithm to segment agricultural plots in Sentinel-2 imagery. We found a statistically significant difference as yielded by one-way ANOVA for the six combinations (F (5,30) = 6, MSE = 150, p = 0.0006). Subsequently, Combination 1 was chosen as the best combination according to its high IoU result and low standard deviation ( Figures 5 and 6). This result demonstrates that both the spectral and spatial characteristics of the blue, red, and NIR bands are important for mean shift object segmentation.  Table 1) applied to the mean shift segmentation algorithm. The box represents 50% of the data, from the 25th percentile to the 75th percentile. The horizontal line inside the box represents the median of the data. The lines extending from the box (whiskers) represent the lower or upper quartiles. The gray lines represent the standard deviation, and the gray squares represent the average of each combination. For both NDVI and three-band combination, when the emphasis on both the spectral and spatial characteristics was high (combinations 1 and 4), the number of output segments was similar to that obtained when the emphasis on the spectral characteristic was high and the emphasis on the spatial characteristic was low (combinations 2 and 5). However, when the emphasis on the spatial characteristic was high and the emphasis on the spectral characteristic was low (combinations 3 and 6), the number of segments decreased by more than a half (Figure 7). This trend is presented in Figure 8.  Table 1).

Figure 8.
An example of mean shift segmentation output using the blue, red, and NIR bands as input and high/low spectral and spatial parameters. (A) Input image (the green, red, and NIR bands). (B) Segmentation output using high emphasis on the spectral characteristic and low emphasis on the spatial characteristic (Combination 2). (C) Segmentation output using low emphasis on the spectral characteristic and high emphasis on the spatial characteristic (Combination 3). (D) Segmentation output using high emphasis on the spectral and spatial characteristics (Combination 1).

Phenology Features
All the date-based features (max date, SOS date, and EOS date) showed a similar trend of crop mean standard score; cotton had a distinctly higher mean standard score compared with the other crops, chickpea and wheat had a rather discrete mean standard scores as well, but barley and carrot had similar mean standard scores. Cotton also showed a particularly high mean standard score of the seasonal integral, the enhancement period, the EOS NDII value, and the max NDII value (Figure 9).
The importance of the phenology features in the RF and XGBoost classifiers is shown in Figure 10. The most important feature of both classification methods was the max date. Additionally, in both methods, the EOS and SOS dates were very important as well. The least important feature of both classifiers was the reduction periods. The rank of the rest of the features was not consistent. For example, growing season length and the geographic location of the field were important in XGBoost but not in RF.

Segmentation
The segments in this study were created using the mean shift algorithm with different inputs and several combinations of emphasis on the spectral and spatial characteristics. A high emphasis on the spectral or spatial characteristics should lead to the creation of more segments. Likewise, a low emphasis on the spectral or spatial characteristics will create a smoother output with fewer segments [56]. Therefore, it was expected that segmentation with a high emphasis on the spectral and spatial characteristics output would have more segments than that obtained when the emphasis on one of the characteristics was low. Additionally, we expected the segmentation output obtained when the emphasis on one of the characteristics was low to have a similar number of segments. The results showed that for both inputs, the number of segments resulting from a high emphasis on the spectral and spatial characteristics was similar to the number of segments obtained when using a high emphasis on the spectral characteristic and low emphasis spatial characteristic. In contrast, the number of segments obtained when using a low emphasis on the spectral characteristic and a high emphasis on spatial characteristic was much lower. This result indicates that the spectral characteristic has more influence on the mean shift segmentation algorithm than has the spatial characteristic. The main segmentation challenges occur in cases of under segmentation where the fields are very similar to one another (such as plowed fields; Figure 6B) or in cases of over segmentation where there is large variability within one field (patches of soil and vegetation or vegetation in different growth stages in one field; Figure 6C). As shown in Figure 8 and from the IoU results, the best and relatively balanced result between over and under segmentation was obtained using a high emphasis on the spectral and spatial characteristics. Therefore, even though the spectral characteristic is more influential, both the spectral and spatial characteristics are important for achieving optimal segmentation results.
Even though segmentation is very common in crop classification, there is little information regarding the segmentation accuracy. Immitzer et al. [31] used a modification of the mean shift segmentation algorithm to determine the boundaries of agriculture fields. By visually evaluating the segmentation results by a trial-and-error approach, they concluded that the pixels in each segment should be less spectrally similar (i.e., they should correspond to a low emphasis on the spectral characteristic) and more spatially similar (i.e., they should correspond to a high emphasis on the spatial characteristic). This result does not agree with the results of this study. This might be explained by Immitzer et al.'s use of a minimum segment size (smaller than the one used in the current study). The opposite conclusion was obtained by Ming et al. [57], presenting a method for optimal selection of the spatial and spectral characteristics. They concluded that for farmland segmentation, the pixels in each segment should be more spectrally similar (corresponding to a high emphasis on the spectral characteristic) and less spatially similar (corresponding to a low emphasis on the spatial characteristic). The minimum optimal segment size in their study was larger than in our study, influencing these results. Since there is no agreement about the ideal emphasis of the spectral and spatial characteristics, this topic warrants further investigation. The results obtained in this study may be pertinent to such an endeavor.

Phenology Features
The importance of the date-based classification features was the highest for both RF and XGBoost classifiers. More importantly, the importance of those features was higher than that of the corresponding values-based features; for example, the EOS date was more important than the VI value at the EOS. These results were also reflected in the crops' mean standard score for the date-based and value-based features: for cotton, chickpea, and wheat, date-based features had more distinct mean standard scores than did value-based features. These results demonstrate that most of the crops have similar VI values, but different crops reach those values at different dates. Some features, such as the season's length and the field's geographic location (that is represented as a cluster), were less important for the RF classifier and more important for the XGBoost classifier. Even though the season's length was an important feature for XGBoost, its standard score was relatively similar for all the crops. This result demonstrates the complexity of machine learning-based decisionmaking. It was expected that the geographic location of the field would be one of the most important features since in Israel, specific crops are commonly grown in different areas, e.g., barley is mainly grown in the south of Israel. Using an automated coordinates-based clustering technique to create the agricultural areas (clusters) could allow different but close agricultural areas to be represented by one cluster. That could decrease the optimal performance of this feature.

Classification
Since SVM and RF are commonly used in remote sensing classification, we anticipated that the two methods would function well (Table 2). In contrast to these classifiers, XGBoost is newer and has only recently been used in crop classification. Both the RF and XGBoost classifiers achieved significantly better AUC and OA results than the SVM classifier did. This result is in agreement with Ghazaryan et al. [36] and Inglada et al. [58], who also concluded that SVM was less accurate for crop classification than were decision-tree-based classifiers. In addition, Peterson et al. [59] reported that boosted trees was the best learning algorithm, RF was second, and SVM was ranked fourth. In previous studies, XGBoost showed the highest performance in crop classification among other machine learning classifiers [60], and specifically with Sentinel-2 [61]. OA values in this study significantly differed between the classifiers with less certainty than did the AUC (i.e., the OA p-value was higher than the AUC p-value). One explanation for this could be that OA is sensitive to imbalanced data, and thus, the largest group (in this case, wheat) had more impact on OA results. The imbalanced data might also explain the difference between the accuracy results calculated by the two measures; AUC results were higher than OA results. The reason could be that cotton, which was relatively well classified, had the smallest number of samples and therefore did not greatly increase OA results and had a more positive effect on AUC results. This highlights the preferability of AUC to OA as an accuracy measure, as was recommended in a previous study by Ling et al. [62]. Comparing the OA accuracy results in previous crop classification studies (Table 5) showed that this study achieved lower results than those of the other studies, possibly since OA is not a suitable measure for this dataset.
NDII was shown to be a slightly better input for the time series analysis by AUC and with XGBoost classifier. This result demonstrates the importance of SWIR bands for crop classification, specifically when using Sentinel-2 images, and is in agreement with previous studies that also reported the importance of the SWIR bands for crop classification with various satellites such as Sentinel-2, Landsat, and ASTER [8,10,31]. The SWIR region is related to water, nitrogen, lignin, and cellulose content [63][64][65]. The relationship between those plant components and the SWIR band can be the reason for SWIR importance for crop identification.
Wheat had the highest UA and PA values, while the mean standard scores were similar to those of the other crops. We assumed that the reason for this result was that wheat had the largest number of training samples. Therefore, the algorithm had more data with which to create and learn distinctive phenology features for wheat. However, this may be the reason for the misclassification of most crops; barley, carrot, and chickpea were misclassified as wheat. The second-best-classified crops were cotton and chickpea. Since cotton is a summer crop and chickpea is a spring crop, in contrast to the other crops, they could be separated from the other crops by the date-based features, as was shown by their distinct mean standard score. Carrot was poorly classified and did not have a unique mean standard score, possibly due to the fact that it has more than one growing cycle, in both winter and spring, in addition to having a small number of samples. Barley was not well classified, most likely due to the fact that barley and wheat belong to the same family and are very similar crops, as reflected in their mostly similar mean standard scores, and therefore it is a great challenge to correctly classify them. In a study performed by [66] Bargiel with radar data, the majority of spring barley was misclassified as wheat. Additionally, Chakhar et al. [67] concluded that the confusion between wheat and barley in crop classification is due to the similarity of their growing cycles. Overall, having a large number of samples and a distinct mean standard score for date-based features seems to greatly improve classification accuracy.

Conclusions
In this study, a method for crop classification was developed for a time series of Sentinel-2 images. The best combination for this task was an XGBoost object-based classification of phenological features extracted from an NDII time series. Based on the results and analyses, we reached the following conclusions: 1.
High emphasis on the spectral and spatial characteristics is important when using the mean shift segmentation algorithm; nevertheless, the spectral parameter is more dominant.

2.
Date-based phenological features had the most influence on the classification models 3.
NDII with XGboost showed the highest classification results. 4.
RF and XGboost classified different types of crops with significantly greater success than did SVM. XGboost showed a better accuracy trend than did RF. 5.
The AUC was able to assess the significant differences between the classification algorithms in contrast to the OA.
It has been demonstrated in this study that crop maps can be created accurately, without human intervention, and at a reduced computational time in comparison to handmade maps. The method used in this study can be applied to other areas worldwide, as well as to other crops in Israel, using data from additional spaceborne sensors. Further research can explore other methods and parameters such as texture for the segmentation process. Moreover, more accuracy measures for imbalanced data, such as the F1 score and precisionrecall curve, could be considered for this type of classification. Finally, it is suggested to incorporate additional data (e.g., meteorological data) to improve classification accuracy.  Normalized Difference Infrared Index (NDII) NDI I = N IR − SW IR N ID + SW IR Hardisky et al., 1983 [28] VIs that were calculated from Sentinel-2 data and were employed in this study. * were NIR (band 8), RED (band 4), RED EDGE (band 5), and SWIR (band 12) are the spectral reflectance measurements acquired in the near-infrared band, the red band, the red edge band, and the short wave infrared band from MSI onboard Sentinel-2.