Harmonized Landsat 8 and Sentinel-2 Time Series Data to Detect Irrigated Areas: An Application in Southern Italy

: Lack of accurate and up-to-date data associated with irrigated areas and related irrigation amounts is hampering the full implementation and compliance of the Water Framework Directive (WFD). In this paper, we describe the framework that we developed and implemented within the DIANA project to map the actual extent of irrigated areas in the Campania region (Southern Italy) during the 2018 irrigation season. For this purpose, we considered 202 images from the Harmonized Landsat Sentinel-2 (HLS) products (57 images from Landsat 8 and 145 images from Sentinel-2). Such data were preprocessed in order to extract a multitemporal Normalized Di ﬀ erence Vegetation Index (NDVI) map, which was then smoothed through a gap-ﬁlling algorithm. We further integrated data coming from high-resolution (4 km) global satellite precipitation Precipitation Estimation from Remotely Sensed Information using Artiﬁcial Neural Networks (PERSIANN)-Cloud Classiﬁcation System (CCS) products. We collected an extensive ground truth in the ﬁeld represented by 2992 data points coming from three main thematic classes: bare soil and rainfed (class 0), herbaceous (class 1), and tree crop (class 2). This information was exploited to generate irrigated area maps by adopting a machine learning classiﬁcation approach. We compared six di ﬀ erent types of classiﬁers through a cross-validation approach and found that, in general, random forests, support vector machines, and boosted decision trees exhibited the best performances in terms of classiﬁcation accuracy and robustness to di ﬀ erent tested scenarios. We found an overall accuracy close to 90% in discriminating among the three thematic classes, which highlighted promising capabilities in the detection of irrigated areas from HLS products.


Introduction
Availability of data about irrigated areas is of extreme importance in many different applications including management of water resources [1], modeling of water exchange between atmosphere and land surface [2], and impact of climate change on irrigation water supplies [3]. Despite the great effort conducted in the literature in this direction, there is a general lack of accurate and up-to-date maps about irrigated areas and related irrigation amounts, which is hampering the full implementation and compliance of the Water Framework Directive (WFD). Current monitoring practices rely on acquiring periodical surveys meant to give pictures at national scales. However, they are rather imprecise when regional or local scale resolutions need to be achieved, as the case of management of water resources in The HLS project [10,11] is meant to harmonize data coming from both satellites programs to ease their combined use. The program aims at removing the biases between the two sensors in terms of different spectral band ranges and view geometries with the final goal to obtain a global land surface coverage at 30 m spatial resolution with a time gap of 2-3 days.
The workflow for the HLS data is schematized in [11]. The process that leads to the final HLS outputs starts by applying the same atmospheric correction algorithm to both Landsat 8 OLI (L1T) and Sentinel-2 MSI (L1C) products. Then, Landsat 8 data are geographically divided in accordance with Sentinel-2 data tiling, and both are geometrically resampled using the Automated Registration and Orthorectification Package (AROP) [21]. This step arises from the fact that the view angles between sensors can change for a given ground target, and therefore the angle effect (bidirectional reflectance distribution function, or BRDF) needs to be normalized. The last step consists in applying the band pass adjustment (based on a linear fit between equivalent spectral bands) to the Sentinel-2 data only using the OLI spectral band passes as reference. The resulting outputs are the L30 (OLI NBAR 30m), the S10 (MSI SR 10m), and the S30 (MSI NBAR 30m) data products, but only the first and third types of products were considered in this work.

Dataset
The study area is the Campania region (Southern Italy), monitored through the 2018 irrigation season (from April to October).
For this purpose, we considered the HLS data (version 1.4) that resulted in four distinct tiles (T33TVE, T33TVF, T33TWE, and T33TWF) ( Figure 1). We collected an initial number of 650 images that spanned the four tiles in the January-December time span of year 2018. This number was reduced to 210 images by removing those images that were partially or entirely covered by clouds or that overlapped only marginally (<20%) with the area of interest. We further removed 12 images in which both Sentinel-2 and Landsat 8 images were acquired at the same date, and the Sentinel-2 data was kept in this case. We finally added one more image for each tile from the end of year 2017 in order to prevent issues in applying the gap-filling algorithm due to cloud-covered data as we will describe later. This resulted in a final number of 202 images, with 141 of them coming from Sentinel-2 and 57 from Landsat 8. The inclusion of Landsat 8 data allowed us for more than 40% increase in the number of images available in our study. Such images were spread across the four tiles as follows: T33TVE (N = 42), T33TVF (N = 74), T33TWE (N = 45), and T33TWF (N = 41). The whole data selection process is summarized in Table 1. The mean effective revisit time spanned was 4.9 and 8.9 for T33TVF and T33TWF, respectively (Table 1), which highlighted a temporal resolution that was appropriate for the application involved in this study.   Figure 1. The four tiles on Campania region from Harmonized Landsat Sentinel-2 (HLS) products that we considered in this work. Colored points represent ground truth samples that will be further described in Figure 7 and Figure 8.

Satellite Data Collection and Preprocessing
The full set of bands and associated wavelengths acquired by Sentinel-2 and Landsat 8 are listed in [11]. We considered here the red (0.64-0.67 µm) and NIR narrow (0.85-0.88 µm) bands, which were used to compute the NDVI [22]. The NDVI can be used as a good indicator for irrigated areas since it represents the amount of green biomass that varies in response to changes in vegetation conditions [23].

Satellite Data Collection and Preprocessing
The full set of bands and associated wavelengths acquired by Sentinel-2 and Landsat 8 are listed in [11]. We considered here the red (0.64-0.67 µm) and NIR narrow (0.85-0.88 µm) bands, which were used to compute the NDVI [22]. The NDVI can be used as a good indicator for irrigated areas since it represents the amount of green biomass that varies in response to changes in vegetation conditions [23].
The raw NDVI values were processed through a gap-filling algorithm in order to overcome the presence of missing data generated from cloud covers. For this purpose, we first considered the Quality Assessment (QA) layer derived from the cloud masks [11] (Table 2). From this, we extracted integer values (Table 3), and derived the Binary Mask to discriminate between Land (1) and no-Land (0) classes, with the latter one dominated by the information on the cloud cover.
At this point, the NDVI value of pixels belonging to the no-Land class was estimated through the smoothing and gap-filling process based on the Whittaker smoother (WS) [24] function available in the MODIS package [25] into the R environment [26]. In the WS filtering process, we set the main parameters empirically as follows: (i) the number of iteration = 1, which implied to perform two running cycles; (ii) the lambda value (i.e., the smoothing factor) = 10, which guaranteed to preserve the temporal variability of specific phenological patterns; and (iii) the number of days = 5. Finally, in order to avoid artifacts in the NDVI values resulting from undetected cloud and poor atmospheric conditions, we forced their values to be constrained in a range defined on the cloud-free multitemporal NDVI series. An example of the derived product is shown in Figure 2.

Integer value
At this point, the NDVI value of pixels belonging to the no-Land class was estimated through the smoothing and gap-filling process based on the Whittaker smoother (WS) [24] function available in the MODIS package [25] into the R environment [26]. In the WS filtering process, we set the main parameters empirically as follows: i) the number of iteration = 1, which implied to perform two running cycles; ii) the lambda value (i.e., the smoothing factor) = 10, which guaranteed to preserve the temporal variability of specific phenological patterns; and iii) the number of days = 5. Finally, in order to avoid artifacts in the NDVI values resulting from undetected cloud and poor atmospheric conditions, we forced their values to be constrained in a range defined on the cloud-free multitemporal NDVI series. An example of the derived product is shown in Figure 2.  At this stage, we obtained a multitemporal, filtered, and cloud-free NDVI series data with a 5-day time step. This resulted in a total of 75 observations (features) per pixel spanning the entire year 2018, which were able to preserve crop phenologies and limit differences in terms of revisit time existing across the four tiles covering the Campania region. An example of the NDVI time series product is shown in Figure 3 for different crop types. In such an example, we can observe the distinctive NDVI patterns that distinguish the herbaceous (two cases in panels a and b) from the tree class (one case in panel c). Of relevance is also the importance of the WS algorithm. The raw NDVI decreases when clouds are present with an intensity proportional to the density of the cloudy body. On the other hand, it increases in shadowed areas created by clouds as the values in the red band have larger decreases than in the NIR band. Therefore, the application of the WS algorithm is relevant to smooth such inconsistencies that may occur in the data due to the presence of clouds.  In the NDVI panel, the red line represents the raw NDVI data, the blue line shows the NDVI after applying smoothing and gap-filling based on the by Whittaker smoother (WS) algorithm on the original dates (i.e., using the acquisition dates with variable time sampling), and the green line depicts the NDVI by running WS with a constant sampling of 5 days. The WS algorithm was run by applying two filtering iterations (niter = 1) in order to limit the effect of undetected clouds and poor atmospheric conditions. The smoothing parameter (lambda) was limited to 10 in order to preserve the fidelity of the input data in the final resulting curve, which was particularly relevant for the alfalfa herbaceous class as shown in panel b. The gray line indicates the presence (value = 0) or absence (value = 1) of clouds for a specific date. In the NDVI panel, the red line represents the raw NDVI data, the blue line shows the NDVI after applying smoothing and gap-filling based on the by Whittaker smoother (WS) algorithm on the original dates (i.e., using the acquisition dates with variable time sampling), and the green line depicts the NDVI by running WS with a constant sampling of 5 days. The WS algorithm was run by applying two filtering iterations (niter = 1) in order to limit the effect of undetected clouds and poor atmospheric conditions. The smoothing parameter (lambda) was limited to 10 in order to preserve the fidelity of the input data in the final resulting curve, which was particularly relevant for the alfalfa herbaceous class as shown in panel (b). The gray line indicates the presence (value = 0) or absence (value = 1) of clouds for a specific date.
As final step, we applied a regional-scale spatial aggregation algorithm. The four tiles were merged together into a single mosaic layer using GDAL [27] for each time step (Figure 4). product is shown in Figure 3 for different crop types. In such an example, we can observe the distinctive NDVI patterns that distinguish the herbaceous (two cases in panels a and b) from the tree class (one case in panel c). Of relevance is also the importance of the WS algorithm. The raw NDVI decreases when clouds are present with an intensity proportional to the density of the cloudy body. On the other hand, it increases in shadowed areas created by clouds as the values in the red band have larger decreases than in the NIR band. Therefore, the application of the WS algorithm is relevant to smooth such inconsistencies that may occur in the data due to the presence of clouds.
As final step, we applied a regional-scale spatial aggregation algorithm. The four tiles were merged together into a single mosaic layer using GDAL [27] for each time step (Figure 4).

Satellite Precipitation Data Collection and Preprocessing
The NDVI product described in the previous section was integrated with satellite precipitation data in order to ease the discrimination between irrigated and not irrigated areas. For this purpose, we considered the Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks (PERSIANN)-Cloud Classification System (PERSIANN-CCS), a real-time global high-resolution (0.04° × 0.04° or 4 km × 4 km) satellite precipitation product developed by the Center for Hydrometeorology and Remote Sensing (CHRS) at the University of California, Irvine (UCI) that reports the cumulative daily rainfall [28]. The raw PERSIANN-CCS data was processed to be consistent with the NDVI data. More specifically, we first performed a temporal aggregation with a time interval of 5 days, and then cropped and reprojected it in accordance with the HLS grid (merging of tiles T33TVF, T33TWF, T33TVE, and T33TWE) derived as described in the previous sections. An example of the PERSIANN-CSS product is shown in Figure 5.

Satellite Precipitation Data Collection and Preprocessing
The NDVI product described in the previous section was integrated with satellite precipitation data in order to ease the discrimination between irrigated and not irrigated areas. For this purpose, we considered the Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks (PERSIANN)-Cloud Classification System (PERSIANN-CCS), a real-time global high-resolution (0.04 • × 0.04 • or 4 km × 4 km) satellite precipitation product developed by the Center for Hydrometeorology and Remote Sensing (CHRS) at the University of California, Irvine (UCI) that reports the cumulative daily rainfall [28]. The raw PERSIANN-CCS data was processed to be consistent with the NDVI data. More specifically, we first performed a temporal aggregation with a time interval of 5 days, and then cropped and reprojected it in accordance with the HLS grid (merging of tiles T33TVF, T33TWF, T33TVE, and T33TWE) derived as described in the previous sections. An example of the PERSIANN-CSS product is shown in Figure 5.
We further reported an example of the NDVI time series product overlapped with the accumulated rainfall data ( Figure 6). Such example highlighted the improved capabilities in discriminating between rainfed and irrigated areas when the two data types were integrated. When the NDVI trend was in phase with the accumulated rainfall, the area was considered as "not irrigated or rainfed" as the crop growth was strictly correlated to the rainfall events ( Figure 6a). On the other hand, when the NDVI trend was not synchronized with the accumulated rainfall, the area was considered as "irrigated" as the crop growth resulted independent from the accumulated rainfall ( Figure 6b). We further reported an example of the NDVI time series product overlapped with the accumulated rainfall data ( Figure 6). Such example highlighted the improved capabilities in discriminating between rainfed and irrigated areas when the two data types were integrated. When the NDVI trend was in phase with the accumulated rainfall, the area was considered as "not irrigated or rainfed" as the crop growth was strictly correlated to the rainfall events ( Figure 6a). On the other hand, when the NDVI trend was not synchronized with the accumulated rainfall, the area was considered as "irrigated" as the crop growth resulted independent from the accumulated rainfall ( Figure 6b).   We further reported an example of the NDVI time series product overlapped with the accumulated rainfall data ( Figure 6). Such example highlighted the improved capabilities in discriminating between rainfed and irrigated areas when the two data types were integrated. When the NDVI trend was in phase with the accumulated rainfall, the area was considered as "not irrigated or rainfed" as the crop growth was strictly correlated to the rainfall events ( Figure 6a). On the other hand, when the NDVI trend was not synchronized with the accumulated rainfall, the area was considered as "irrigated" as the crop growth resulted independent from the accumulated rainfall ( Figure 6b).

Ground Truth Acquisition
An extensive ground truth was collected in the study area by field inspection using the app mapitGIS [29]. More specifically, we collected a total of 2992 samples spanning the three thematic classes: bare soil and rainfed (class 0, N = 336), herbaceous (class 1, N = 1897), and tree crop (class 2, N = 759). The acquired points are shown in Figure 7 and more detailed in Table 4.

Ground Truth Acquisition
An extensive ground truth was collected in the study area by field inspection using the app mapitGIS [29]. More specifically, we collected a total of 2992 samples spanning the three thematic classes: bare soil and rainfed (class 0, N = 336), herbaceous (class 1, N = 1897), and tree crop (class 2, N = 759). The acquired points are shown in Figure 7 and more detailed in Table 4.  We note that we acquired one single label/sample per parcel at best in order to limit the spatial correlation among the collected labeled samples. This a relevant aspect to take into consideration in order to avoid an overestimation of the classification accuracies. An example is represented in Figure  8, in which, we show some of the collected samples acquired in an agricultural area in northern Campania and spanning multiple parcels. We note that we acquired one single label/sample per parcel at best in order to limit the spatial correlation among the collected labeled samples. This a relevant aspect to take into consideration in order to avoid an overestimation of the classification accuracies. An example is represented in Figure 8, in which, we show some of the collected samples acquired in an agricultural area in northern Campania and spanning multiple parcels.

Machine Learning-Based Supervised Classification
We employed a machine learning-based supervised classification approach to discriminate among the three classes from the time series NDVI and satellite precipitation data. For this purpose, we considered and compared six classifiers that are commonly used in satellite image classification

Machine Learning-Based Supervised Classification
We employed a machine learning-based supervised classification approach to discriminate among the three classes from the time series NDVI and satellite precipitation data. For this purpose, we considered and compared six classifiers that are commonly used in satellite image classification applications. The classifiers are summarized in Table 5 and comprises random forests (RF), support vector machines (SVM), single decision trees (DT), boosted decision trees (DT), artificial neural networks (ANN), and k-nearest neighbors (k-NN). More specifically, SVMs aim at finding the hyperplane that maximizes the distance among classes and using a limited number of samples (called support vectors) that lie close to the separation margin [30]. Transformation from a linear to a nonlinear problem is done by mapping samples from the original to an high-dimensional feature space through a kernel function (such as the radial basis function (RBF) used in this work [31,32]), an operation that is known as kernel trick. DT classifier is based on a sequential decision process [33]. Starting from the root, a generic feature is evaluated, and one of the two branches is selected. This procedure is repeated for every branch until a final leaf is reached. In case of classification, the leaf values usually correspond to the target classes. RFs are a set of decision trees built on random samples [34]. In this case, a different criterion is used to split nodes; for each tree, a random subset of features is selected in order to find the best splits. As a result, there is the generation of multiple weak trees, whose predictions are combined through a fusion strategy. A common fusion strategy relies on the majority vote rule, although more advanced algorithms based on weighting the posterior probabilities can also be exploited. In addition, RFs provide the relative importance of each feature which can be exploited to understand which features contribute more in the creation of the classification model. Boosted DTs are based on an ensemble of classifiers in which the training set is updated in an iterative way in order to focus more on that samples that tend to be misclassified [35]. ANNs are based on a collection of nodes (called artificial neurons) which mimic the model of neurons in a biological brain [36]. The network is usually composed by an input layer that receives the input features, one or more hidden layers, and an output layer that gives the final predictions. A generic node receives n input channels, which are weighted, summed up, and sent to the output after being processed by an activation function. Finally, k-NN departs from the previous methods since the class of each unknown sample is predicted directly from the training set without really building a classification model [37]. The label is assigned by applying a majority rule on the k labels associated with the k closest training samples.

Experimental Setting for Irrigated Area Classification Assessment
We first conducted a sensitivity analysis devoted to compare the six classification algorithms introduced in Section 2.5. We used such different supervised classifiers that are widely used in remote sensing image classification applications and evaluated their effectiveness to deal with the classification problem specifically addressed in this paper.
We also compared four preprocessing procedures in order to evaluate the sensitivity of the classification approach to different training data selection approaches (Table 6). More specifically, in the scenario 1, the classifier was applied to the original training data and the initial number of input features (d = 150, which included both NDVI and precipitation data). The scenario 2 implied a classification performed on a balanced training dataset, in which random oversampling was applied to have the same number of training samples per class. Scenario 3 implied a RF-based recursive feature elimination process that reduced the number of features used as input. Finally, scenario 4 incorporated both the balanced training dataset and the recursive feature elimination processes described in scenarios 2 and 3. We, therefore, compared a total of 24 classification scenarios associated with 6 types of classifiers and 4 training data selection strategies. The entire set of ground truth samples (N = 2992, Table 4) was used in the classification process. Two scenarios were explored in order to evaluate the sensitivity of the framework to different sample sizes. In the first one, 25% of the labeled samples were randomly chosen as training set (stratified by class) and the remaining 75% as validation set. In the second case, 75% of the samples were used for training set and 25% for validation set. In both cases, free parameters of the classifiers were estimated automatically on the training data only using a 10-fold cross-validation approach. More specifically, such parameters were estimated for each type of classifier: (i) number of drawn candidate variables in each split (mtry) for RF; (ii) cost (C) and sigma for the RBF kernel (σ) for SVM; (iii) pruning parameter (cp) for DT; iv) number of boosting iterations (trials), indication whether the trees should be collapsed into rules (rules), and activation of the feature selection step before model building (winnow) for Boosted DT; (v) number of units in hidden layer (size) and regularization parameter (decay) for ANN; and (vi) number of neighbors for k-NN. In addition, the tuneLength parameter was set to 10, while the other parameters were kept to their default values. After estimating the best parameters through cross-validation, the classification model was built on the entire training set and evaluated on the independent validation set. The analysis was done using the free statistical software R [26]. Within R, we used the caret package [43], which provides a standard syntax for running a variety of machine learning algorithms, thus simplifying the process of systematic comparison of different algorithms and approaches.
Accuracies of the classification models were evaluated by considering multiple metrics derived from the confusion matrix associated with the validation set. These metrics were useful to compare the classification performances obtained across classifiers and preprocessing scenarios. More specifically, we considered the overall accuracy (OA), producer's accuracy (PA), user's accuracy (UA), and Cohen's kappa coefficient (denoted as Kappa statistic in the manuscript) [44]. For OA, a confidence level of 95% was considered through the normal approximation method altogether with a continuity correction.
A summary of the overall experimental setting and the different types of analyses conducted as described in the previous paragraph is shown in Figure 9.
the classification performances obtained across classifiers and preprocessing scenarios. More specifically, we considered the overall accuracy (OA), producer's accuracy (PA), user's accuracy (UA), and Cohen's kappa coefficient (denoted as Kappa statistic in the manuscript) [44]. For OA, a confidence level of 95% was considered through the normal approximation method altogether with a continuity correction.
A summary of the overall experimental setting and the different types of analyses conducted as described in the previous paragraph is shown in Figure 9.

Experimental Results
We compared the six types of classifiers and applied them in conjunction with the four preprocessing scenarios. We first focused on the scenario in which 25% of the labeled samples were used for training. By considering multiple evaluation metrics, we found that RF, SVM, and Boosted DT models exhibited in general better results than the other three classifier types (Table 7). In general, RF and Boosted DT were robust to the different preprocessing steps and comparable results were obtained in the four scenarios (OA ranging from 86.3% to 87.8% for RF and OA ranging from 86.0% to 86.6% for Boosted DT). On the other hand, SVM exhibited a higher variability (OA ranging from 82.7% to 87.8%) and the necessity to employ a proper preprocessing strategy to maximize the performances in terms of classification accuracy. Moreover, RF exhibited the highest accuracies for scenario 1, scenario 2, and scenario 4, whereas SVM gave the best results for scenario 3, which maximized the accuracy at all, both in terms of OA and Kappa statistic ( Figure 10). In this case, we obtained an OA = 87.8% and a Kappa statistic equal to 0.770, which highlighted promising capabilities of the proposed approach to detect irrigated areas.

Experimental Results
We compared the six types of classifiers and applied them in conjunction with the four preprocessing scenarios. We first focused on the scenario in which 25% of the labeled samples were used for training. By considering multiple evaluation metrics, we found that RF, SVM, and Boosted DT models exhibited in general better results than the other three classifier types (Table 7). In general, RF and Boosted DT were robust to the different preprocessing steps and comparable results were obtained in the four scenarios (OA ranging from 86.3% to 87.8% for RF and OA ranging from 86.0% to 86.6% for Boosted DT). On the other hand, SVM exhibited a higher variability (OA ranging from 82.7% to 87.8%) and the necessity to employ a proper preprocessing strategy to maximize the performances in terms of classification accuracy. Moreover, RF exhibited the highest accuracies for scenario 1, scenario 2, and scenario 4, whereas SVM gave the best results for scenario 3, which maximized the accuracy at all, both in terms of OA and Kappa statistic ( Figure 10). In this case, we obtained an OA = 87.8% and a Kappa statistic equal to 0.770, which highlighted promising capabilities of the proposed approach to detect irrigated areas. We dissected the best model identified, which corresponded to SVM classifier applied in conjunction with the preprocessing step employed in scenario 3. This was based on a recursive feature selection process that, starting from the original 150 features, found the optimal set and number of variables by maximizing the Kappa statistic on the training set using a cross-validation approach.
This was obtained for a number of features equal to 41 (Kappa statistic = 0.802 and OA = 90.0% in cross-validation), which were therefore selected to build the final classification model (Figure 11). We dissected the best model identified, which corresponded to SVM classifier applied in conjunction with the preprocessing step employed in scenario 3. This was based on a recursive feature selection process that, starting from the original 150 features, found the optimal set and number of variables by maximizing the Kappa statistic on the training set using a cross-validation approach. This was obtained for a number of features equal to 41 (Kappa statistic = 0.802 and OA = 90.0% in cross-validation), which were therefore selected to build the final classification model ( Figure 11). The same analysis was repeated by changing the size of the training set and considering 75% of the samples as training set (N = 2244) and the remaining ones (N = 748) for validation. In this setting, we confirmed that RF, SVM, and Boosted DT gave better results than the other types of classifiers (Table 8 and Figure 12). Again, RF and Boosted DT resulted quite robust to the preprocessing step in comparison with SVM that required, also in this case, a feature selection process in order to maximize the accuracy. Overall, RF seemed the more robust method, and the highest accuracies were obtained for scenarios 1 and 4. Boosted DT was instead associated with the best models for scenarios 2 and 3, which suggested the effectiveness of this approach when an enough number of training samples was available. In particular, the model related to scenario 3 gave the best accuracy across all classifiers and scenarios tested with this training sample size. SVM gave slightly worse results, which highlighted the superiority of ensemble models such as RF and Boosted DT when the training set was sufficiently large. On the other hand, we recall that SVM gave the best accuracy when the training set was smaller ( Table 7). Figure 11. Classification model generated using SVMs as classifier and by applying a recursive feature selection process (scenario 3), which maximized the accuracies on the validation set as shown in Table 7. The plot reports the Kappa statistic obtained in cross-validation in function of the number of variables/features used to build the model. A number of features equal to 41 maximized the Kappa statistic, which is highlighted with the filled circle.
The same analysis was repeated by changing the size of the training set and considering 75% of the samples as training set (N = 2244) and the remaining ones (N = 748) for validation. In this setting, we confirmed that RF, SVM, and Boosted DT gave better results than the other types of classifiers (Table 8 and Figure 12). Again, RF and Boosted DT resulted quite robust to the preprocessing step in comparison with SVM that required, also in this case, a feature selection process in order to maximize the accuracy. Overall, RF seemed the more robust method, and the highest accuracies were obtained for scenarios 1 and 4. Boosted DT was instead associated with the best models for scenarios 2 and 3, which suggested the effectiveness of this approach when an enough number of training samples was available. In particular, the model related to scenario 3 gave the best accuracy across all classifiers and scenarios tested with this training sample size. SVM gave slightly worse results, which highlighted the superiority of ensemble models such as RF and Boosted DT when the training set was sufficiently large. On the other hand, we recall that SVM gave the best accuracy when the training set was smaller (Table 7). Table 8. OA (%) and Kappa statistic results for each tested classifier and preprocessing scenario. Bold denotes the best accuracies for each preprocessing scenario. The 75% of the labeled samples were used for training and the remaining ones for validation.

Preprocessing
Accuracy   We further analyzed the best model obtained with this training size, which resulted to be Boosted DT with the feature selection preprocessing step employed in scenario 3. In this case, a quite large number of variables (d = 126) were required to maximize the cross-validation Kappa statistic ( Figure 13). We report the feature relevance score estimated by the feature selection process in Figure 13. Interestingly, larger weights were overall attributed to NDVI features with respect to those derived from satellite precipitation data. Among the NDVI features, greater importance was attributed to data spanning the June-October time frame, which was consistent with the relevance of these acquisitions in discriminating among the different thematic classes. It was also evident the presence of two local maximum located at the end of July and in September, which was again in line with the classification problem involved here. Finally, Figure 14 shows the classification map obtained on the full study area associated with the best classification model.   We further analyzed the best model obtained with this training size, which resulted to be Boosted DT with the feature selection preprocessing step employed in scenario 3. In this case, a quite large number of variables (d = 126) were required to maximize the cross-validation Kappa statistic ( Figure 13). We report the feature relevance score estimated by the feature selection process in Figure  13. Interestingly, larger weights were overall attributed to NDVI features with respect to those derived from satellite precipitation data. Among the NDVI features, greater importance was attributed to data spanning the June-October time frame, which was consistent with the relevance of these acquisitions in discriminating among the different thematic classes. It was also evident the presence of two local maximum located at the end of July and in September, which was again in line with the classification problem involved here. Finally, Figure 14 shows the classification map obtained on the full study area associated with the best classification model.

Cross-Validation with Spatially Separated Folds
The experimental analysis conducted in Section 3.2 was done by considering a random crossvalidation approach in which labeled samples were randomly split into training and validation sets. Although this represents a quite common scenario in evaluating and comparing classification accuracies in the remote sensing domain, this may bring to an overestimation of the accuracies as discussed recently [45,46]. More specifically, remote sensing images are characterized by spatial

Cross-Validation with Spatially Separated Folds
The experimental analysis conducted in Section 3.2 was done by considering a random cross-validation approach in which labeled samples were randomly split into training and validation sets. Although this represents a quite common scenario in evaluating and comparing classification accuracies in the remote sensing domain, this may bring to an overestimation of the accuracies as discussed recently [45,46]. More specifically, remote sensing images are characterized by spatial structures that are ignored when performing a traditional cross-validation with the result of underestimating predictive errors. To account for this, we conducted an additional analysis in which we validated the model using a block cross-validation approach in which samples were split strategically rather than randomly. We considered the recently developed blockCV package [47], which was specifically proposed to generate spatially or environmentally separated folds. We considered the spatial blocking strategy (function "spatialBlock"), which aims at building square spatial blocks of a specified size, by setting the parameters as follows: (i) size of the blocks (parameter "theRange") was set to 10,000, (ii) allocation of blocks to folds (parameter "selection") was done in a random way in order to find the block-to-fold allocations that achieved most even spread of classes across folds, and (iii) number of folds (parameter "k") equal to 4 in order to split the labeled samples with the same percentages considered in Section 3.2. We show in Figure 15, the partition of the region in multiple spatially disjoint blocks and their subsequent random assignment to different folds.
Remote Sens. 2020, 12, x FOR PEER REVIEW 17 of 21 which was specifically proposed to generate spatially or environmentally separated folds. We considered the spatial blocking strategy (function "spatialBlock"), which aims at building square spatial blocks of a specified size, by setting the parameters as follows: i) size of the blocks (parameter "theRange") was set to 10,000, ii) allocation of blocks to folds (parameter "selection") was done in a random way in order to find the block-to-fold allocations that achieved most even spread of classes across folds, and iii) number of folds (parameter "k") equal to 4 in order to split the labeled samples with the same percentages considered in Section 3.2. We show in Figure 15, the partition of the region in multiple spatially disjoint blocks and their subsequent random assignment to different folds. We considered the best scenarios found in the analysis exploited in Section 3.2. In the case that 75% of the labeled samples were used for training, the OA in random cross-validation was maximized using a Boosted DT and was equal to 90.8% (Table 8). Using the same configuration, spatially distinct cross-validation gave an OA equal to 87.2%. Although, as expected, the results shown a decrease of the accuracies with respect to the random cross-validation case, accuracies remained quite satisfactory, which confirmed the validity of our proposed framework. A similar pattern was verified when only 25% of the labeled information was used for training. In this case, OA slightly decreased from 87.8% (Table 8, using SVM in conjunction with feature selection) to 84.2% when moving from the random to the spatially distinct cross-validation. Figure 15. Partition of the region in spatially disjoint blocks (with size of 10,000 m) and their random assignment to four folds. This was accomplished using the R package blockCV [47].

Discussion and Conclusion
In this paper, we have presented the framework that we have developed, implemented, and validated within the DIANA project to map the extent of irrigated areas in the entire Campania region (Southern Italy) during the 2018 irrigation season. The procedure described here is currently employed in the Campania region to assess the annual water volumes that are actually used in the irrigated areas, which it is required to be compliant with the EU Water Framework Directive. Figure 15. Partition of the region in spatially disjoint blocks (with size of 10,000 m) and their random assignment to four folds. This was accomplished using the R package blockCV [47].
We considered the best scenarios found in the analysis exploited in Section 3.2. In the case that 75% of the labeled samples were used for training, the OA in random cross-validation was maximized using a Boosted DT and was equal to 90.8% (Table 8). Using the same configuration, spatially distinct cross-validation gave an OA equal to 87.2%. Although, as expected, the results shown a decrease of the accuracies with respect to the random cross-validation case, accuracies remained quite satisfactory, which confirmed the validity of our proposed framework. A similar pattern was verified when only 25% of the labeled information was used for training. In this case, OA slightly decreased from 87.8% (Table 8, using SVM in conjunction with feature selection) to 84.2% when moving from the random to the spatially distinct cross-validation.

Discussion and Conclusion
In this paper, we have presented the framework that we have developed, implemented, and validated within the DIANA project to map the extent of irrigated areas in the entire Campania region (Southern Italy) during the 2018 irrigation season. The procedure described here is currently employed in the Campania region to assess the annual water volumes that are actually used in the irrigated areas, which it is required to be compliant with the EU Water Framework Directive.
We have considered multiple data types that have spanned the area of interest along the entire year. We have collected 202 images coming from the HLS data product in conjunction with global satellite precipitation PERSIAN-CCS data. We have also acquired an extensive ground truth in the field associated with three thematic classes, namely, bare soil and rainfed, herbaceous, and tree crop. We have compared six machine learning classification algorithms and demonstrated that random forests, support vector machines, and boosted decision trees have given the best results in terms of classification accuracy and robustness to different scenarios. In the best case, the three classes have been distinguished with an overall accuracy close to 90%, which has highlighted the relevance of this recent data product for the detection and characterization of irrigated areas from satellite observations. This has demonstrated promising capabilities in using this set of optical time series, meteorological, and in-situ collected ground truth data to map irrigated areas that may be used in operational applications for water management purposes.
The framework presented here has been specifically developed for the detection of irrigated areas with the goal of being able to distinguish between trees and herbaceous crops without requiring prior knowledge of inter-and intraclass characteristics. While several papers based on machine learning approaches have been proposed in the literature to deal with land cover classification problems, a few works have been specifically devoted to the detection and characterization of irrigated areas. Irrigated and not irrigated croplands were estimated with high accuracy (Kappa statistic > 0.9) by considering the historical evolution of irrigated croplands for the post monsoon (rabi) and summer cropping seasons in the Berambadi watershed of Kabini River basin, southern India. In this case, 30 m spatial resolution Landsat satellite images spanning the 1990-2016 periods were classified with SVM [48]. Landsat imagery (Thematic Mapper, Enhanced Thematic Mapper Plus, and Operational Land Imager) was again processed in conjunction with SVM to quantify the changes in irrigated land areas surrounding the Mogtedo water reservoir, Burkina Faso, between 1987 and 2015. Overall accuracy and Kappa statistic ranged from 94.2% to 95.6% and from 0.92 to 0.94, respectively [49]. Annual maps at 30 m spatial resolution of irrigated corn and soybean areas in southwestern Michigan, United States, were generated in the 2001-2016 timeframe with an OA = 82% by considering Landsat surface reflectance products and RF classification [50]. The relevance in using Sentinel-1 images jointly with Landsat 8 optical imagery was studied and a digital elevation model of the Shuttle Radar Topography Mission (SRTM) was investigated in [51]. The combined use of the different data types in conjunction with RF was able to improve the early classification of irrigates crops from 0.84 to 0.89 (Kappa statistic) in a dataset involving the detection of irrigated maize crops in a temperate zone in South West France. A novel classification-based irrigation mapping procedure that utilized MODIS time series data coupled with ancillary data on climate and agricultural extent was proposed in [52]. It gave moderate Kappa statistic equal to 0.36 and 0.65 for Eastern US and Western US, respectively, suing a tree-based method for supervised classification. Finally, a methodology to identify irrigated crops in a semiarid zone in La Mancha, Spain, was proposed using Landsat TM imagery and a multitemporal supervised classification approach [53]. Overall accuracy was equal to 93.1% and 90.2% during 1996 and 1997 growing seasons, respectively.
While we have focused in this paper on mapping areas at regional scale, a similar procedure can be applied to detect and characterize larger regions. Further methodological advancements in this direction can include transfer learning approaches in order to take into account the spatial variability of the class distributions [54] and deep learning strategies to deal with large training set scenarios [55]. Finally, future research lines can be also devoted to the inclusion of additional indices such as the Normalized Difference Water Index (NDWI) or the Modified NDWI. It would be also interesting to introduce into the framework additional information based on the VIS-NIR and SWIR bands, which have been recently exploited by novel algorithms such as the OPtical TRApezoid Model [56] and that have been used to extract additional indices such as the Soil Moisture Index.