National Scale Land Cover Classiﬁcation for Ecosystem Services Mapping and Assessment, Using Multitemporal Copernicus EO Data and Google Earth Engine

: Land-Use / Land-Cover (LULC) products are a common source of information and a key input for spatially explicit models of ecosystem service (ES) supply and demand. Global, continental, and regional, readily available, and free land-cover products generated through Earth Observation (EO) data, can be potentially used as relevant to ES mapping and assessment processes from regional to national scales. However, several limitations exist in these products, highlighting the need for timely land-cover extraction on demand, that could replace or complement existing products. This study focuses on the development of a classiﬁcation workﬂow for ﬁne-scale, object-based land cover mapping, employed on terrestrial ES mapping, within the Greek terrestrial territory. The processing was implemented in the Google Earth Engine cloud computing environment using 10 m spatial resolution Sentinel-1 and Sentinel-2 data. Furthermore, the relevance of di ﬀ erent training data extraction strategies and temporal EO information for increasing the classiﬁcation accuracy was also evaluated. The di ﬀ erent classiﬁcation schemes demonstrated di ﬀ erences in overall accuracy ranging from 0.88% to 4.94% with the most accurate classiﬁcation scheme being the manual sampling / monthly feature classiﬁcation achieving a 79.55% overall accuracy. The classiﬁcation results suggest that existing LULC data must be cautiously considered for automated extraction of training samples, in the case of new supervised land cover classiﬁcations aiming also to discern complex vegetation classes. The code used in this study is available on GitHub and runs on the Google Earth Engine web platform.


Introduction
Information on the spatial explicit distribution of ecosystem services (ES) is important for their conservation and management, policy design, implementation and monitoring, fulfillment of reporting 1.
The evaluation of the relevance of different training data extraction strategies (i.e., manual and automated) to increase the classification accuracy. 2.
The evaluation of seasonal and monthly EO information to increase the classification accuracy. 3.
The assessment of the relative importance of the different features in the classification process, facilitating data dimensionality reduction and computational time and resources optimization.

Study Area
The study area is the terrestrial territory of Greece, including nearby coastal areas (2 km buffer). Greece covers an approximate area of 131,957 km 2 , with one of the largest coastlines in Europe of 15,021 km long, [19]. Geomorphologically, 78% of Greece's area is covered by mountains [36] and has a highly diverse landscape. The climate, according to the Köppen-Geiger climate classification, is mainly temperate Mediterranean with large areas in northern Greece being classified as semi-arid and fewer areas, mostly in higher altitudes, classified as humid continental [37]. The combination of landscape and climate conditions together with Greece's biological assets has led to the existence of a large ecosystem type variety, with forest habitats being the predominate habitat group covering 37% of the natural habitats in Greece [38].
The only currently-available, nation-wide, LULC information is the Corine Land Cover (CLC) dataset with a spatial resolution of 25 ha, while, in regard to ecosystem types, the new enhanced Ecosystem Type Map of Europe (v. 3.1) [39] with a spatial resolution of 1 ha is now available. The new version of the ecosystem map represents the terrestrial European Nature Information System (EUNIS) level 2 habitat classes, for the European Environment Agency (EEA) EEA-39 members. It was based on Figure 1. The general processing chain with its data inputs, processing steps and final products. Items in grey boxes indicate processing done outside Google Earth Engine (GEE) and more specifically in R or QGIS software environments.
The general algorithm workflow followed is presented in Figure 1. Specifically, sections 1 (Composites creation), 3 (Feature creation), and 4 (Sample generation) differ in each scenario accordingly.
In all six workflows, temporal (seasonal or monthly), composites of S1 and S2 imagery were created at first, followed by image segmentation for object creation within Google Earth Engine (GEE). In the feature set creation step, different feature sets were generated per-object, considering either all available EO data or a reduced subset of features excluding the S2 original bands. Consequently, sample objects were identified based on a fully automated and a manual sample identification strategy. Finally, a Random Forest (RF) classification algorithm was used for class discrimination in the respective training objects set (six in total). The six classification schemes were The general processing chain with its data inputs, processing steps and final products. Items in grey boxes indicate processing done outside Google Earth Engine (GEE) and more specifically in R or QGIS software environments. The general algorithm workflow followed is presented in Figure 1. Specifically, sections 1 (Composites creation), 3 (Feature creation), and 4 (Sample generation) differ in each scenario accordingly.
In all six workflows, temporal (seasonal or monthly), composites of S1 and S2 imagery were created at first, followed by image segmentation for object creation within Google Earth Engine (GEE). In the feature set creation step, different feature sets were generated per-object, considering either all available EO data or a reduced subset of features excluding the S2 original bands. Consequently, sample objects were identified based on a fully automated and a manual sample identification strategy. Finally, a Random Forest (RF) classification algorithm was used for class discrimination in the respective training objects set (six in total). The six classification schemes were evaluated by analyzing their confusion matrices and the derived standard accuracy measures, to find the best performing workflow. The JavaScript code used in this study is available on GitHub [41] and runs on the Google Earth Engine Code Editor, after signing-in to the service and following the instructions.

Satellite Data and Preprocessing
Sentinel-2 optical data and Sentinel-1 active data were jointly used in this study. The Sentinel-2 (S2) constellations carry the MSI multispectral optical sensor, which records 13 spectral bands in the visible, near-infrared, and short-wave infrared parts of the spectrum, with a spatial resolution of up to 10 m, and a revisit time of 5 days; in addition, compared to other free satellite imagery (e.g., Landsat imagery), it offers three bands in the red-edge region of the spectrum, that have been proven to play an important role in vegetation types species identification and vegetation status monitoring [31,42]. For the S2 data, both level-1C (L1C) and level-2A (L2A) processed data from the GEE data catalog were used in this study. The L1C products are top of atmosphere reflectance images while the L2A are produced from the L1C products, after applying atmospheric correction algorithms, providing bottom of atmosphere reflectance images. The L1C products were only used for computing the Tasseled Cap transformation [43], since the coefficients for the L2A Sentinel-2 products have not yet been calculated in literature.
The Sentinel-1 (S1) satellites are C-band active microwave Synthetic Aperture Radars (SAR), configured with dual polarization (VV, VH) interferometric wave modes, with an effective revisit of 6 days. The Ground Range-Detected (GRD) products are radiometrically calibrated and ortho-corrected, using SRTM 30 or ASTER DEM, and offered in geo-coded backscattering coefficients (σ 0 ) in GEE [44].
Data from both satellites are available in Analysis Ready Data (ARD) format, through GEE, and were therefore retrieved from the corresponding collections. Regarding S2 L2A data, all images from May 2019 to October 2019 were used, along with May 2018 image acquisitions, to compensate for the increased cloud coverage observed in May 2019. S2 L1C data were only used for the 2019 summer period (June to August 2019). Finally, all S1 images in ascending and descending orbit from May 2019 to October 2019 were used. The one-year timeframe is considered as an adequate range for capturing yearly phenology and has been used in other national-scale LULC studies [17,33,45,46]. Subsequently, clouds were masked from optical images and all images (optical and SAR) were transformed to seasonal and monthly composites to achieve spatially homogenous and temporally equidistant images allowing a uniform processing framework for the whole county [47].
Clouds and cirrus in the S2 L1C images were masked using the QA60 band [48], while cloud shadow was masked using the solar azimuth, zenith, and possible cloud heights [49]. Clouds, cirrus, and cloud shadows from the S2 L2A images were masked using the scene classification map resulting from the application of the sen2cor processing algorithm [50]. S2 images temporal aggregation was based on the median reflectance values which has been proven to lead to more accurate LULC classifications [51,52]. S1 images composite creation was based on the minimum value instead of the median value, in order to overcome terrain and humidity effects. In GEE, S1 products are provided as sigma-naught (σ 0 ) backscatter images, so, although terrain correction has already been applied in GEE using SRTM30, performing additional terrain flattening was not feasible.

Image Segmentation
The Simple Non-Iterative Clustering (SNIC) super-pixel segmentation, which is available in GEE was used in this study. The SNIC algorithm starts from seeds created on a regular grid and then, groups pixels into super-pixel clusters based on their distance from cluster centroids considering the normalized five-dimensional space of CIELAB color and spatial coordinates [53,54]. Subsequently, all image pixels are added to priority queues based on their connectivity to existing super-pixel clusters. Each time a pixel with the smallest distance to the cluster's centroid is added, an updated centroid value and priority queue is generated for each growing cluster [53].
Parameters that need to be set in the SNIC segmentation are the seed distance, segment compactness, connectivity, and the neighborhood size [55]. Although inadvisable, numerous studies that apply segmentation for LULC mapping, rely on trial-and-error and visual inspection for finding optimal segmentation parameters [56]. Based on such an approach, SNIC parameters were selected after various experiments, based on visual inspection to produce the most homogeneous and meaningful objects within the experiment test sites and in accordance with the concept that "a good segmentation is one that shows little over-segmentation and no under-segmentation" by [29]. SNIC was applied on the 10 m S2 L2A July composite bands (B2, B3, B4, B8), for achieving the finest possible object detail since these were the bands with the highest spatial resolution. The July composite was chosen as it is one of the driest months in Greece, accompanied by vegetation growth peak. The SNIC parameters were set to a) seed distance of 10 pixels, b) segment compactness = 3, c) 8-pixel connectivity, and d) neighborhood size of 100 pixels.

Feature Extraction
A wide range of spectral and spatial features were considered in the classification process of the study and used to calculate descriptive statistics for each object ( Table 2). Sentinel-1 bands, representing polarized (VV,VH) backscatter signals, along with the VV/VH ratio feature were incorporated in the classification process [22,54,57]. In addition to raw multispectral bands of Sentinel-2 imagery (considered only in the full feature set), spectral indices and multispectral transformations were employed to enhance class separability while limiting band correlation and data redundancy [21].
Besides spectral properties, the high spatial resolution of Sentinel imagery facilitates the use of texture features such as the gray-level co-occurrence matrix (GLCM) metrics [58] for improving LULC classifications. For enhancing urban areas, the so-called PANTEX index was employed, which highlights local contrast variations by keeping the minimum value of the energy computed on the GLCM in various directions [59].
Based on previous studies, the GLCM "correlation" metric [60] was calculated for enhancing separability of vegetation-covered classes over selected bands. The GLCM correlations for the Sentinel-2 blue band (B2) [61,62], the Biophysical Composition Index (BCI), and the VV and VH Sentinel-1 bands were used in order to improve separability in classes of this study which contain the same vegetation Remote Sens. 2020, 12, 3303 7 of 24 type and vary only in terms of height and density (i.e., classes "Sclerophyllous vegetation" and "Mediterranean sclerophyllous forests") [63].
To further explore the strengths of the OBIA approach, for each object, the area and perimeter were extracted, along with shape metrics such as the form factor, square pixel metric, fractal dimension, and shape index [64] in order to describe polygon complexity.
Finally, ancillary information on the topography was derived from the EU-DEM v1.1, a hybrid pan-European product based mainly on SRTM and ASTER GDEM, with a 25 m spatial resolution [65]. More specifically, the variables used herein were elevation, slope and the Topographic Solar-Radiation Index (TRASP), which is a linear transformation of the circular aspect variable [66].

Classification Scheme
For the identification and mapping of ecosystem types, a typology allowing the correspondence of all terrestrial habitat types to the MAES Level 3 ecosystem type classes is used on the basis of the habitat types' interpretation [4,38,75]. Marine ecosystems were represented in MAES level 1 category and were mapped in a distance of 2 km away from the shore. The final classification scheme consisted of 21 classes and is presented in Table 3. Because of the ecological interpretation of most classes, it was challenging to harmonize them with the LULC classification schemes of the different reference data sets used. Urban classes were straightforward and based on the Impervious HRL density and the two subclasses were split according the thresholds used in the Corine Land Cover nomenclature [76] (i.e., density higher than 30% and lower than 30%). Cropland classes were assigned to homogenous LPIS classes that also agreed in the CLC in LULC type. Complex and heterogeneous agricultural parcels from the LPIS where not used in this study. Woodland and forest classes were split into multiple sub classes following their ecological characteristics of the different forest types of Greece and based on the corresponding habitat types [38]. On the other hand, grasslands engulf a broad range of ecological and land use categories; in terms of land use, they span from natural grasslands to managed rangelands and pastures and furthermore they vary based on their tree-cover densities. These aspects imply difficulties in separating grasslands from other land cover classes based on their spectral and temporal characteristics and were therefore depicted in one single class, in align with [46].

Reference Data
Relying only on in-situ procedures for reference data collection would require tremendous efforts for national-scale mapping. However, existing geospatial national and EU's Copernicus datasets, can be used for collecting reference data with a proper class correspondence and a careful sampling procedure. The Natura 2000 (N2K) product is a pan-European LULC dataset providing information about both status and change, for the protected areas under the N2K network, covering approximately 630,000 sq km [77]. The detailed N2K classification scheme, including 55 thematic LC/LU classes, allowed us to identify class correspondence with MAES level 3 classes with natural cover (Table 3). A large scale (1:5000) habitat map of Natura 2000 terrestrial areas by the National Cadastre and Mapping Agency was used for sample identification, assuming that land cover inside protected areas remains stable throughout a 2-year period [17].
The Land Parcel Information System (LPIS) used to validate eligibility of agricultural subsidies under the EU's Common Agriculture Policy, has been applied for generating crop-related samples similar to previous studies [33,46]. In this study, the 2018 Greek LPIS layer (scale 1:5000) was used to generate crop land cover samples by making the assumption that minor changes happen in major agriculture type classes from year to year. Samples extracted from LPIS, were further refined using information from the Corine Land Cover 2018 dataset in order to identify and select only the areas with a high probability of belonging to the assigned class [52] (Table 3).
Reference samples regarding built-up areas were initially extracted from the Copernicus Imperviousness High Resolution Layer (HRL) [17,46]. The Imperviousness layer is 20 m spatial resolution product depicting the proportion of sealed soil per pixel. The product is generated using a semi-automated classification of maximum Normalized Difference Vegetation Index (NDVI) mosaics. The 2015 version is used herein, with the assumption that built-up areas rarely convert back to natural surfaces [59].
A 3-step sampling was implemented, for reference data extraction, relying on the N2K, LPIS, and HRL maps. In step 1, a reference sample collection, for training and validation, was created, by performing a stratified random sampling, in order to identify 650 reference objects per class.
In order to derive the validation sub-sample from the reference dataset (i.e., step 2), the approach of Mack et al. [45] was followed. More specifically, we applied a random shuffling of the reference samples selected in step 1 for each class, in order to identify a sub-sample of 150 objects per class for the validation task. Subsequently, these objects were visually checked and corrected using Google Earth Engine imagery and ESA's VHR IMAGE 2018 dataset (providing VHR orthorectified multispectral products over Europe) in order to create a highly reliable validation dataset [78]. The reliability was ensured through a sequential visual check for each validation object to verity its purity and homogeneity relative to the reference class assigned. In case of erroneous or ambiguous class assignments, the object was replaced from the original sample until 150 validation samples were selected for each class.
Finally, for rare classes (e.g., Peat bogs-7.2.1), where the total number of sample objects did not reach 150 objects, only half of the samples were kept for validation, allowing the rest to be assigned as training and thus having a balanced dataset.
The validation objects from step 2 were then excluded from the initial reference sample collection and the remaining objects were used for developing two different training datasets (i.e., step 3). The first training dataset (automated) contained the objects withheld outside the reference sample collection, without any modification. On the other hand, for the second training dataset (manual), a manual correction procedure to the withheld objects was followed, by interpreting the available very high-resolution imagery. In the case of training objects with erroneous or ambiguous class assignments, the respective object was replaced by introducing to the training sample, the nearest object belonging to the same class. Finally, few additional sample objects were added to classes that were noticed to require a large number of replacements (e.g., floodplain forest-class 3.2.1).

Classification Algorithm and Accuracy Assessment
Random Forests (RF) is a popular supervised machine learning algorithm that is nonparametric and fits several decision tree classifiers on various subsamples of the dataset using averaging to improve the predictive accuracy and control over-fitting [79]. The model uses only a part of the samples for training and keeps the remaining (called out-of-the bag samples) for an internal error estimate, called out-of-bag (OOB) error. RF has been proven to be robust to reference data errors up to 15%, works fairly well with small sample sizes and highly dimensional feature spaces and, in addition, has a small execution time compared to other classification methods [33,80]. For the above reasons, RF has been widely used in LULC classification studies and has shown the best performance in object-based classification [56].
In the present study, RF classification was accomplished within the GEE environment. Important parameters that must be specified for the implementation of the RF algorithm are (a) the number of trees in the forest (Ntree) and (b) the number of input variables randomly chosen at each split (Mtry). Optimal values of Ntree and Mtry were calculated locally, instead of GEE, in the R environment [81], using the "tuneRF" function in the "randomForest" package.
An additional by-product of the RF classification, which is calculated internally, is the variable importance. RF in GEE works by using the so-called Gini Index as a measure for the best split selection. Variable importance is measured by the OOB error increase and the Gini Index decrease, when switching one of the input random variables while keeping the rest constant. This measure helps identify the most influential predictor features in the classification and can sometimes be useful for model reduction [82,83].
For each classification experiment the validation objects were used for generating a confusion matrix and the derived standard accuracy measures, i.e., overall accuracy (OA), producer's (PA) accuracy, user's accuracy (UA) [84]. The OA shows the proportion of the total correctly classified objects, providing a general understanding of the overall classification algorithm effectiveness. The PA expresses how well the samples for each class were identified (omission error) while the UA indicates the probability that a sample actually represents that class on the ground (commission error) [33,57]. An additional metric used for assessing individual class accuracy and classification effectiveness was the Individual Classification Success Index (ICSI), [85], for each as class i:

Classification Schemes
With respect to the achieved accuracy levels under different training data sampling approaches (Table 4), the manual sampling procedures presented the highest overall accuracy (77.33%-79.55%) for all feature sets evaluated. Table 4. Overall, producer's (%) and user's (%) accuracy, and ICSI (%) for the six different classification experiments. Higher red color saturation indicate ICSI values further below the ICSI average, while higher green color saturation indicate ICSI values further above the average. On the other hand, the automation of the training sample extraction procedure for all three feature/composite sets, resulted to the lowest overall accuracies (74.89%-75.64%).

M-S-F M-S-R M-M-R A-S-F A-S-R M-M-R
Regarding the different feature sets evaluated, high accuracy was achieved when the monthly features both using the manual (M-M-R) and automated (A-M-R) training sample approaches (overall accuracy 75.64% and 79.55%, respectively) were employed. On the other hand, the use of all features, including S2 original bands through seasonal composites presented the least satisfactory results for both manual (M-S-F) and automated (A-S-F) training (overall accuracy 74.89% and 77.33%, respectively).
The highest accuracy was noted in the classification scheme considering the monthly indices set and the manually identified training samples (M-M-R) (Figure 2).
The classification accuracy of individual classes (PAs and UAs) are provided in Table 4. The manual training sample extraction provides a better balance of commission and omission errors, presenting lower mean absolute differences between producer's and user's accuracy across all three feature sets (7.84%), compared to the automated (13.85%) approaches. Again, the classification experiment relying on the monthly indices feature set and visually identified training samples (M-M-R) classification experiment, presented the lowest mean absolute difference (7.43%) between PA and UA. Nevertheless, in all classification schemes, moderate to low accuracies (lower than 30%) in terms of the ICSI, were attained for the sclerophyllous vegetation  Nevertheless, in all classification schemes, moderate to low accuracies (lower than 30%) in terms of the ICSI, were attained for the sclerophyllous vegetation and saline marshes (7.1.1). All these misclassifications were minimized with the manual sampling and monthly feature approach (M-M-R) ( Table 5).  1.1 1.1.2 2.1.1 2.2.1 3.1.1 3.1.2 3.2.1 3.3.1 3.3.2 3.4.1 3.5.1 4.1.1 5.1.1 5.2.1 6.1.1 6.2.1 6.3.1 7.1.1 7.2.1 7.3.1 8.1 Table 6. Confusion matrix for the worst performing classification (A-S-F). Values are given in percentage of reference objects (normalized per row).  1 1.1.2 2.1.1 2.2.1 3.1.1 3.1.2 3.2.1 3.3.1 3.3.2 3.4.1 3.5.1 4.1.1 5.1.1 5.2.1 6.1.1 6.2.1 6.3.1 7.1.1 7.2.1 7.3.1

Visual Assessment of the Results
Landscape complexity in large-scale LULC mapping generates errors that are not always evident in the classification quantitative assessments using the confusion matrix and derived standard accuracy measures. Visual assessments are always essential for gaining an insight on the final product accuracy and for revealing errors that were not highlighted in the accuracy measures [33].
In general, the visual comparison of classifications revealed minor differences between different feature sets, but considerable differences for classifications of different sampling techniques.
Particularly, the automated sampling classifications resulted to an overestimation of mixed forests (5.3.1) and Mediterranean coniferous forests (3.3.2) (Figure 3a-c). On the contrary, low density urban fabric (1.1.2) was underestimated, an observation also consistent with the PA and UA of the automated sampling classification ( Table 4). As it can be seen in image subsets included in the second row of Figure 3, areas surrounding the dense urban fabric, are classified as permanent crops (2.2.1), in the A-M-R experiment, instead of low density urban fabric, in the M-M-R approach. This effect is mainly caused by the automated LPIS samples containing greenhouses, which mixes the crop spectra with artificial materials.
Manual samples classifications, although reaching higher accuracy values, also demonstrated problems in specific areas. Particularly, in the manual sampling classifications, arable land (2.1.1) was frequently classified as floodplain forests (3.2.1), an effect noticed mostly in high moisture agriculture areas, such as river estuaries and can be explained by the additional number of manual floodplain samples obtained, for solving the spectral confusion with riverbanks in the automated sampling (Figure 3g-i).
Riverbank areas identification was challenging in both automated and manual sampling approaches, being classified as either dense urban fabric (1.1.1) or floodplain forests (3.2.1). In Figure 3a-c the riverbank of the river crossing the forest in the south, is mistakenly classified as floodplain forests in the manual sampling classification and as dense urban fabric in the automated sampling classification. Conversely, in Figure 3d-f one can see the riverbank in the west, mistakenly classified as dense urban fabric in the manual sampling classification and as floodplain forests in the automated sampling classification. These errors can be explained by the large temporal variations of moisture in riverbanks, throughout seasons and between years, caused by differences in precipitation.
Regarding the visual inspection between the different feature set classifications, a noticeable effect was the classification of arable land areas (2.1.1) as inland freshwater and saline marshes (7.1.1), evident in several high-moisture agriculture areas (Figure 4). This effect was minimized with the use of seasonal features, excluding the S2 L2A bands (M-S-R and A-S-R classifications).
Finally, besides the errors produced by the classifiers, artefacts created by the preprocessing steps were also observed. These errors had a negligible effect in the different classification schemes' overall accuracy; however, they are noticeable on the local level. The cloud masking using the Cloud Probability and Scene Classification of S2 L2A images resulted in no-data values in artificial surfaces, due to constant misclassification of artificial areas in the Scene product and therefore constant masking in the compositing process. In addition, some mountain areas which were probably covered by clouds and snow most of the time also had no-data values due to constant masking in the compositing process.

Variable Importance
Variable importance was measured in GEE by the OOB error increase and the Gini Index decrease, when permuting one of the input random variables while keeping the rest constant. The 20 most important predictor features are presented in Figure 5, averaged over the two sampling methods, for each feature set. In all feature set cases, the feature mean values instead of the standard deviations for each object were among the most important features. Moreover, in all cases, the two most important variables were topographic features, namely the mean value of elevation and slope, followed by TRASP in the third and fourth position. Object properties, in specific, area, form factor, and fractal dimension were also found to be among the 20 most important variables in all cases. SAR VV/VH ratio was found in all feature combinations among the top 10 features, unlike SAR bands which appeared much lower in the total feature lists and therefore do not exist in these figures. Regarding the original S2 spectral bands, only the first Short-Wave InfraRed (SWIR) (B11) was found to be an important feature, while in respect to texture features, only the PANTEX index was found important, though only for seasonal composites. According to the output of the variable importance, the most informative spectral indices in all classifications were BCI, GRVI, MSI, the TC components, reNDVI, and NDRBI. Finally, features from all the available months and seasons were selected as optimal in all six classification experiments.
Remote Sens. 2020, 12, x FOR PEER REVIEW 16 of 26 sampling classification. Conversely, in Figure 3 (sections d, e, f) one can see the riverbank in the west, mistakenly classified as dense urban fabric in the manual sampling classification and as floodplain forests in the automated sampling classification. These errors can be explained by the large temporal variations of moisture in riverbanks, throughout seasons and between years, caused by differences in precipitation.
Regarding the visual inspection between the different feature set classifications, a noticeable effect was the classification of arable land areas (2.1.1) as inland freshwater and saline marshes (7.1.1), evident in several high-moisture agriculture areas (Figure 4). This effect was minimized with the use of seasonal features, excluding the S2 L2A bands (M-S-R and A-S-R classifications).   Finally, besides the errors produced by the classifiers, artefacts created by the preprocessing steps were also observed. These errors had a negligible effect in the different classification schemes' overall accuracy; however, they are noticeable on the local level. The cloud masking using the Cloud Probability and Scene Classification of S2 L2A images resulted in no-data values in artificial surfaces, due to constant misclassification of artificial areas in the Scene product and therefore constant masking in the compositing process. In addition, some mountain areas which were probably covered by clouds and snow most of the time also had no-data values due to constant masking in the compositing process.

Variable Importance
Variable importance was measured in GEE by the OOB error increase and the Gini Index decrease, when permuting one of the input random variables while keeping the rest constant. The 20 most important predictor features are presented in Figure 5, averaged over the two sampling methods, for each feature set. In all feature set cases, the feature mean values instead of the standard deviations for each object were among the most important features. Moreover, in all cases, the two most important variables were topographic features, namely the mean value of elevation and slope, followed by TRASP in the third and fourth position. Object properties, in specific, area, form factor, and fractal dimension were also found to be among the 20 most important variables in all cases. SAR VV/VH ratio was found in all feature combinations among the top 10 features, unlike SAR bands which appeared much lower in the total feature lists and therefore do not exist in these figures. Regarding the original S2 spectral bands, only the first Short-Wave InfraRed (SWIR) (B11) was found to be an important feature, while in respect to texture features, only the PANTEX index was found important, though only for seasonal composites. According to the output of the variable importance, the most informative spectral indices in all classifications were BCI, GRVI, MSI, the TC components, reNDVI, and NDRBI. Finally, features from all the available months and seasons were selected as optimal in all six classification experiments. . Averaged feature importance (for manual and automated sampling), for the full, seasonal (S-F), for the reduced, seasonal (S-R) and the monthly, reduced (M-R) feature sets. Letters "S" and "M" stand for "season" and "month" respectively, while the number next to them indicates the season and month in ascending order starting from S2=spring and M5=May, till S4=autumn and M10=October.

Classification Accuracy
The overall goal of this study was the development of a classification workflow for fine scale object-based land cover mapping, using a 21-class scheme, facilitating ecosystem condition and ES mapping, within the Greek territory, towards MAES implementation in Greece. Efforts for national scale LULC mapping using RS for Greece have been made in the past [19,86], using a lower thematic resolution that cannot harmonize with the EUNIS habitat classes and relying on manually interpretation of samples. A more complex classification scheme of 27 classes was used by Karakizi et al. [87], for mapping a relatively small area over Northern Greece, covered by a single Landsat tile.
Object-based image analysis, which mimics the way the human brain interprets images, by grouping pixels into homogenous and meaningful objects through image segmentation [29] has been predominantly employed for image classification having pixel sizes larger than 2m and small (i.e., Figure 5. Averaged feature importance (for manual and automated sampling), for the full, seasonal (S-F), for the reduced, seasonal (S-R) and the monthly, reduced (M-R) feature sets. Letters "S" and "M" stand for "season" and "month" respectively, while the number next to them indicates the season and month in ascending order starting from S2 = spring and M5 = May, till S4 = autumn and M10 = October.

Classification Accuracy
The overall goal of this study was the development of a classification workflow for fine scale object-based land cover mapping, using a 21-class scheme, facilitating ecosystem condition and ES mapping, within the Greek territory, towards MAES implementation in Greece. Efforts for national scale LULC mapping using RS for Greece have been made in the past [19,86], using a lower thematic resolution that cannot harmonize with the EUNIS habitat classes and relying on manually interpretation of samples. A more complex classification scheme of 27 classes was used by Karakizi et al. [87], for mapping a relatively small area over Northern Greece, covered by a single Landsat tile.
Object-based image analysis, which mimics the way the human brain interprets images, by grouping pixels into homogenous and meaningful objects through image segmentation [29] has been predominantly employed for image classification having pixel sizes larger than 2 m and small (i.e., smaller than 300 ha) study areas [56]. However, it has been used recently in LULC classification with lower spatial resolution images including data from Sentinel-1 and Sentinel-2 satellites [22,54,55]. An object-based approach can account for geolocation offset errors that in linear shaped objects (such as roads and rivers), can affect classification results when adopting a pixel-based approach [88]. Furthermore, OBIA methods have been found to improve classification accuracy by reducing "salt and pepper" noise that occurs in pixel-based LULC classifications [54]. This effect is amplified by artifacts caused by inadequate cloud masking and atmospheric correction, in best pixel compositing strategies, like the median method used in this study [46]. Lastly, although all processing was performed on the powerful GEE cloud environment, OBIA helps reduce computational complexity caused mainly due to the large study area extent [22]. OBIA applied in this study would not be fully beneficial without the exploitation of object spatial properties. Among the few recent national-scale LULC studies that applied OBIA [54,55,89], there is little evidence of object shape properties use. Specifically, only Tu et al. [55] used the area property of each object as an input feature in the classification. The appearance of object shape properties in the most important variables in all the classification experiments, underlines the strength of the object-based approach used in this study.
Six classification schemes were developed, using automated and manual sampling techniques and different multi-temporal feature combinations. Among the six schemes, 0.88% to 4.94% differences in OA were observed. The most accurate classification scheme was the classification considering the monthly features (excluding S2 original bands) and manually identified samples (M-M-R), achieving an OA of 79.55%. Previous nation-wide land cover studies achieved higher accuracies than our study, which can be likely related to the less detailed classification scheme incorporating fewer classes [48]. However, comparing the accuracy attained in the current study with earlier national scale land cover maps is not straightforward, not only due to the complex classification scheme used here, but also due to the phyto-sociological criteria used for defining some of the classes.
In all classification experiments, almost all land cover classes (17 out of 21 classes) exhibited high accuracies in terms of PA and UA. Lower accuracies are particularly noticeable for Mediterranean sclerophyllous forests (3.4.1), mixed forest (3.5.1), moors and heathland (5.1.1), and sclerophyllous vegetation (5.2.1). The lower accuracy observed for these classes is mainly explained by the large similarities in spectral responses these ecosystem types have. It is not a surprise that mapping of moors and heathland which cover such an broad ecological and land use gradient is challenging [18]. Likewise, mixed forest was hard to capture because of the different types of vegetation combinations in this category. The confusion of Mediterranean sclerophyllous forests with sclerophyllous vegetation was expected, since conceptually, the only differences between the two are vegetation height, structure, and density. Ecological modeling, regarding each forest class presence is needed to meliorate classification into the different MAES level 3 classes (e.g., presence in specific phytogeographical regions of Greece, altitude, slope, distance from shore etc.).

Automated Sampling Potential with Available LULC Products
Besides mapping land cover in large scales in the finest MMU and highest accuracy possible, there is still a need for a framework for mapping in regular intervals. One of the key aspects of regular large-scale land cover mapping, relies on the ability to generate samples in an automated manner. Several studies, have used available national or regional products as a basis to obtain samples, thus avoiding the tedious work of manual sample collection for nation-wide LULC. Automated sampling is often followed by a "data cleansing" procedure. However, a simpler classification scheme is observed in these studies. In specific, Griffiths et al. [46] used automated sampling deriving from LPIS for national crop mapping, while Mack et al. [45] used only LUCAS points in a 9-class classification scheme. Inglada et al. [33] relied on a 17-class scheme, though they obtained samples from additional local products.
In our study, the visual comparison of different classification schemes as well as the examination of the respective confusion matrices, revealed minor differences among classification schemes using different feature sets, but considerable differences in the maps resulting from the different sampling procedures. Specifically, the manual sampling procedure presented higher accuracy when compared with the automated, regardless the feature set evaluated. Although the Copernicus program and national scale readily available LULC products (i.e., N2K) provide detailed thematic information datasets at large map scales, it seems that visual interpretation is a prerequisite for training sample extraction and mapping when it comes to complex and ecological-relevant classification schemes. For example, the intrinsic heterogeneity of N2K habitats types and the spectral variability of the different land cover components of each habitat type, result in non-systematic high spectral intra-variability of habitat patches, as well as the low inter-variability between different habitat types, [90,91].
Even governmental, official products, with less complex nomenclature, created through on-screen digitization, such as LPIS, can contain errors e.g., due to false claims or digitization errors [46]. In addition, several national sub-contractors are employed from governmental organizations for LPIS and N2K creation across EU countries, likely resulting in inconsistencies and product heterogeneity nationwide. Furthermore, in the case of the Copernicus datasets, previous studies had observed underestimates of imperviousness density and omission errors in the HRL Imperviousness dataset [92]. Furthermore, introduction of location-based constraints during automated sample extraction (i.e., creation of buffers of a certain distance around polygon limits), does not guarantee for the assignment of correct land cover labels or the prevention of mixed objects or pixels [17]. Training data characteristics and quality, therefore, are of key importance in LULC classification, an observation also in line with [93] and [52].

Multi-Temporal Features and Variable Importance
Despite using a large number of features (multi-temporal and multi-sensor) evaluated in the study's classification schemes, the relatively high accuracy attained at all cases, supports the RF classifier ability in the handling and identification of the most important features when used with highly correlated multidimensional data [80].
The RF variable importance analysis revealed that in all cases, the most important features were elevation and slope, followed by spectral indices, shape variables, and texture variables.
Topography-derived variables (i.e., elevation, slope), which had the highest rank in importance, have been related in numerous studies to physical and environmental factors that influence vegetation distribution [94,95]. In particular, elevation has been proven to be one of the most important parameters affecting vegetation phenology in [96] and forest mapping in [97] and has been found to be considerably important for wetland identification [57]. Slope also influences vegetation by affecting soil, water, and nutrient quantities [98].
With respect to the highest ranking spectral indices, the GRVI has been reported also in other studies using multitemporal data for its effectiveness in vegetation mapping, for instance in [99] for crop mapping and in [100] for monitoring forest phenological variations. Furthermore, the large significance of the reNDVI, which in this study uses B5 of S2, can be explained with the reported importance of B5 among the red-edge bands, for crop and tree species classification [31,101]. The MSI index appeared as most important in all classification experiments due to being one of the few water-sensitive indices employed, with the advantage of differentiating total chlorophyll and water concentration at the canopy level [71]. The appearance of all the TC components in the highest ranking features, verifies a statement by [21], that dimensionality reduction methods is crucial for retaining important information in data compression, especially when utilizing time series. The importance of multiple month NDRBI can be explained by its capability of separating vegetative from non-vegetative areas [102]. The variable importance measures, in terms of temporal-interval choice, further revealed the value of multi-temporal data in LULC mapping [21], since all seasons appeared in the 20 most important variables for classification.

Conclusions
In this paper, different classification schemes were evaluated for fine-scale, land-cover mapping at a national scale, following a fine-scale (21 classes) classification nomenclature.
Our approach integrates cloud-based analysis, a machine learning classification algorithm and freely available multi-sensor and multi-seasonal earth observation data, in order to develop a cost-effective, country-level, land-cover map that will be subsequently used for ecosystem condition and ES mapping and assessment, i.e., for the MAES implementation in Greece, at the national scale.
We have analyzed different approaches considering different feature sets and training sample extraction methods. The random forest classification algorithm is assumed to be relatively robust to reference data inaccuracies; however, our results also support earlier research findings that the refinement of the automated selected training samples leads to higher classification accuracies.
The processing chain of the different classification schemes evaluated, adopts an object-based perspective instead of the traditional per-pixel classification paradigm. This approach, not only facilitates integration of earth observation data from sensors with different characteristics, such as the active Sentinel-1 and passive Sentinel-2 satellite images used in our study, but also addresses challenges related to excessive computational and time demands, an issue that must be treated even in cloud-computing environments.
Assessment of variable importance in all classification schemes highlighted that EO data alone, are not adequate for predicting and mapping the complex Mediterranean landscape, especially when it comes to broad, diverse area mapping. Incorporation of auxiliary data related to the ecological factors seemed to have the highest impact in the classification process, indicating that this information is necessary for discriminating ambiguous habitat patches.
Seasonal and monthly spectral information features were also very important, quantifying phenological differences among map classes. In the future, additional spectral-temporal features, capturing habitat phenology, should be evaluated for further improving classification accuracy.
The present study acts as a roadmap for the further development of an end-to-end, automated workflow, for annual land-cover mapping in Greece, providing up-to-date estimates of ecosystem extent, condition, ES supply (or potential supply), and ES bundles, nationwide. The resulting map will be available to the public through the webGIS portal of the LIFE-IP 4 Natura project (currently under development).