Detailed Land Cover Mapping from Multitemporal Landsat-8 Data of Different Cloud Cover

Detailed, accurate and frequent land cover mapping is a prerequisite for several important geospatial applications and the fulfilment of current sustainable development goals. This paper introduces a methodology for the classification of annual high-resolution satellite data into several detailed land cover classes. In particular, a nomenclature with 27 different classes was introduced based on CORINE Land Cover (CLC) Level-3 categories and further analysing various crop types. Without employing cloud masks and/or interpolation procedures, we formed experimental datasets of Landsat-8 (L8) images with gradually increased cloud cover in order to assess the influence of cloud presence on the reference data and the resulting classification accuracy. The performance of shallow kernel-based and deep patch-based machine learning classification frameworks was evaluated. Quantitatively, the resulting overall accuracy rates differed within a range of less than 3%; however, maps produced based on Support Vector Machines (SVM) were more accurate across class boundaries and the respective framework was less computationally expensive compared to the applied patch-based deep Convolutional Neural Network (CNN). Further experimental results and analysis indicated that employing all multitemporal images with up to 30% cloud cover delivered relatively higher overall accuracy rates as well as the highest per-class accuracy rates. Moreover, by selecting 70% of the top-ranked features after applying a feature selection strategy, slightly higher accuracy rates were achieved. A detailed discussion of the quantitative and qualitative evaluation outcomes further elaborates on the performance of all considered classes and highlights different aspects of their spectral behaviour and separability.


Introduction
Land cover is one of the essential terrestrial climate variables [1] and therefore land cover information in terms of regularly updated, detailed and accurate land cover maps arises as a significant input for several scientific communities working on climate change studies, geomorphology, sustainable development and social sciences, natural resources management and agricultural monitoring. At the same time, open data policies in both the USA and EU are delivering an unprecedented volume of satellite imagery data with an increasing level of spatial, spectral and temporal resolution. Currently, the availability of Landsat-8 (L8) and Sentinel-2 data significantly increases the capacity for high-resolution land cover mapping using dense time series while demanding efficient and cost-effective classification methods. Recent research efforts have focused on the use of finer resolution satellite imagery for land cover map production, offering products at a global scale such as the updated version of FROM-GLC [2], the FROM-GLC-agg [3] product with nine general land cover classes at 30 m and an estimated accuracy of 65.51%, and the recent GlobeLand 30 product [4], providing land cover information at 30 m with 10 land cover classes. Landsat archive data were also employed for national-level product approaches such as the 30 m map of 16 classes from the USA National Land Cover Database [5] as well as the recent land cover map of France with 17 classes and an achieved accuracy of over 80% [6]. Additionally, focusing on agricultural landscapes, Xiong et al. [7] produced a nominal 30-m cropland extent product of the entire African continent using Sentinel-2 and L8 time series for the year 2015, resulting in a weighted overall accuracy of 94.5% for binary classification of cropland versus non-cropland areas.
Beyond the achieved and reported overall accuracy rates, a key aspect of the produced map, in terms of usability and applicability, is the number of land cover classes considered. Most of the aforementioned products do not incorporate a highly analytic classification system and nomenclature in order to deliver detailed information regarding several different land cover classes. On the other hand, mapping products that deliver a rich class nomenclature, like the CORINE Land Cover (CLC), are not regularly updated since they are based mainly on laborious image interpretation and manual digitization procedures. Furthermore, regarding the challenging case of agriculture, even in the detailed CLC product, crop areas like arable land are mapped depicting the implied agricultural practice and not the specific crop cultivated. Recent studies [8][9][10][11] have indicated that crop mapping is quite a challenging problem requiring high spatial and temporal resolution data but also a significant amount of crop-type reference data.
Regarding the production of annual land cover maps based on multitemporal datasets, a number of approaches have employed data from two different seasons [12,13], while the general trend is to employ frameworks that can exploit all the available annual data (below a given cloud presence threshold). However, how and to what extent the presence of clouds affects the accuracy and effectiveness of mapping and monitoring applications has not been studied adequately, and further insights are required [14]. In particular, for discriminating between different crop types, more multitemporal observations are required in order to note key periodic events in the life cycle of different species (phenology), sufficiently capturing spatial, spectral and temporal features [9,10,15]. However, in most geographical regions and climate zones it is rare to find several absolutely clean, cloud-free images [14,16]. The norm is that even the clearest images of a Landsat path/row or Sentinel tile will contain a number of pixels that are affected by different type of clouds. Although, recent efforts are employing cloud masks and interpolation procedures to tackle invalid values and produce synthesized data [6,8,9], cloud detection approaches (e.g., f-mask) fail to detect all clouds/cloud types in a scene (especially cirrus, warm or thin clouds) and therefore even with cloud masks multitemporal datasets include noisy pixels (e.g., clouds, shadows). Moreover, all additional processes (e.g., interpolation) require/cost a significant amount of CPU and/or GPU hours.
Another critical issue concerning land cover map production is the selection of the classification method and algorithm. Supervised classification methods are considered superior to unsupervised for this type of study [6,17,18], but they require the use of accurate and sufficient training data. Additionally, the use of machine learning techniques like Support Vector Machines-SVM [19,20], Random Forests-RF [21] and artificial Neural Networks-NN have gained rapid recognition for classification studies. SVM have been widely used in recent land cover and crop type classification literature [22][23][24][25][26], while in many cases they have been reported to outperform algorithms like RF and NN [18,27,28]. SVM offers the ability to attain high classification accuracies even if using small training sets [27,29,30], and has also proven robust for low noise levels in the presence of mislabelled training data [22]. An important advantage of the SVM classifier is its ability to manage well a large feature space, which is the case when classifying multi-temporal multispectral imagery [31]. RF classifiers have been also reported to achieve high accuracy metrics in similar classification tasks [9,22,32], while studies have indicated that RF may present lower sensitivity to mislabelled training data than SVM [22]. On the other hand, deep learning is currently one of the fastest-growing trends in remote sensing data analysis, successfully tackling a variety of challenging problems and benchmark datasets in object recognition, semantic segmentation and image classification [33][34][35][36]. In particular, Convolutional Neural Networks (CNNs) are currently heavily used for classification and semantic segmentation tasks in a variety of remote sensing datasets [28,[37][38][39]. However, compared to more shallow architectures (e.g., SVM), the operational and complexity aspects of CNN-based techniques are directly associated with the considered data dimensionality, which is a key limitation when employing several spectral bands and/or multitemporal datasets [25,40].
Towards addressing the aforementioned challenges, in this paper we introduce a framework for systematic, detailed land cover mapping from annual L8 satellite data. In particular, we consider 27 classes based on CLC Level-3 nomenclature, focusing on the cover type and not the use type classes, while analysing further the crop type classes. Without employing cloud masks or interpolation procedures to recover cloudy pixels, we form different spatio-spectro-temporal datacubes composed of images with gradually increasing levels of cloud cover. Two types of machine learning classifiers, i.e., one with a shallow (SVM) and one with a deep architecture (CNN) were applied and compared. An assessment of the impact on classification accuracy and the robustness of the classifier was performed by classifying the different datacubes. The influence of clouds on the reference data was also evaluated by comparing cloud cover percentages and resulting per-class accuracy. Different sets of classification features were tested by applying a standard feature selection strategy. A detailed analysis and discussion of the validation results is also provided in the last section.

Study Area
The study area is located in northern Greece and includes the greatest part of Central Macedonia Region, the east part of West Macedonia Region, the northern part of Thessaly Region and also a small area belonging to the former Yugoslav Republic of Macedonia-FYROM (Figure 1, left). It covers an area of about 22,000 km 2 without including the marine region of the Aegean Sea in the southeast. We selected this region as the study area since it presents highly heterogeneous landscapes, significant land cover diversity including metropolitan centres, vast agricultural areas, high-altitude mountains and water bodies as well as different climate zones, i.e., from sea level to almost 3000 m altitude. Generally, the climate of the region is characterized by rainy winters and dry, warm to hot summers, according to the Köppen climate classification [41,42]. Terrain relief in the whole scene varies greatly, including plains but also several mountain masses scattered across the region such as Mount Olympus, the highest in the country at 2917 m above sea level, the Voras Mountains at 2524 m and the Pierian Mountains at 2190 m. (c) (d) Figure 1. The study area is located in northern Greece, as can be seen in a natural color Landsat-8 (L8) composite from year 2016 (left) and covers an area of about 26,000 km 2 . The study area was partitioned in order to form independent training and testing datasets: training blocks are pictured in yellow, while testing ones with a cyan (right).
The main land cover types are natural vegetation and agriculture. Uplands and mountainous areas are covered mainly by broadleaved forests and to a smaller extent by coniferous forests. Lowlands are occupied mostly by Mediterranean-type evergreen sclerophyllous vegetation of bushes, scrubs and maquis. Agricultural land consists principally of permanent crops like vineyards, olive groves and fruit trees (e.g., citrus, apricot, pear, quince, cherry, apple, nut trees). The latter are mainly cultivated in plains around Edessa, Naoussa and Veria. Rice cultivation is also practiced around the Axios River and Aliakmonas River deltas in Central Macedonia. Arable land includes mainly cereals (e.g., wheat, barley, rye, triticale), maize and grass fodder (e.g., clover, alfalfa) fields. Artificial land is typically comprised of cities and towns, including the metropolitan area of Thessaloniki, the second-largest city in Greece. Smaller towns and villages are scattered across the region. In addition, major water bodies of the country are situated in our study area such as Volvi and Koroneia Lakes and Haliacmon, Axios and Strymonas Rivers in Central Macedonia and also Vegoritida and Polyfytos Lakes in West Macedonia. Low-lying marches land, usually flooded, can be found near water bodies but also on river estuaries to the Aegean Sea.

Landsat-8 Annual Datasets
All surface reflectance L8 products for the 182/34 path/row of the year 2016 were downloaded from the USGS (United States Geological Survey) EarthExplorer platform. The L8 Surface Reflectance product is generated at 30-m spatial resolution on a Universal Transverse Mercator (UTM) projection including seven multispectral bands from the Ultra Blue to the Shortwave Infrared (SWIR) spectral regions. Based on our previous research efforts [43,44], we selected all annual data with less than 60% of cloud cover (over the land) for the experiments. In order to address our main goal for assessing the impact of integrating images with gradually increased cloud cover, four different datasets were formed including images with: #1. Less than 20% (11 dates); #2. Less than 30% (14 dates); #3. Less than 40% (16 dates) and #4. Less than 60% (18 dates) land cloud cover. Moreover, a smaller dataset formed with six images (Dataset #0) of approximately less than 10% land cloud cover was used for the comparison between the two different classification architectures (see Section 3.1). It is important to mention that all datasets include dates from all four seasons and not just summer or spring (Table 1). In Figure 2, cumulative cloud maps are presented, indicating per pixel the frequency of cloud cover per dataset in the study area. The cloud presence per-image was derived from the F-mask algorithm [45] and the Quality Assessment (QA) L8 band. In particular, for the first dataset (Dataset #1), a large continental part of the study area is cloud-free (white) across the corresponding 11 dates/images. Moreover, pixels that are occasionally covered by clouds mainly have a cloud frequency of up to three dates (beige). Regarding Dataset #2 (14 images in total), a large part of the study area is covered in cloud for up to three out of 14 dates. However, there are parts that are reported with up to six (and even up to nine) cloudy dates, mainly from northwest to south following the mountain massifs formations of the study area. This is also the case and in a more intense frequency for Dataset #3 (16 dates), where generally the regions that had up to three dates frequency in Dataset #1 are reported in #3 with a frequency of up to six cloudy dates. The most cloudy Dataset #4 (18 dates) is mostly covered at three to six dates, while certain mainly mountainous regions are covered by cloud for up to 12 dates (magenta) and up to 15 dates (purple) out of the 18 dates/images of the dataset.

Land Cover Classes and Reference Data
One of the main contributions of this paper is that the proposed methodology has been designed, developed and validated for a detailed land cover classification into more than 25 land cover classes. In particular, the developed nomenclature was mainly derived from the CLC nomenclature [46] while the main crop types were based on the available geospatial data of the Greek Payment Authority of the Common Agricultural Policy (CAP). The introduced land cover class nomenclature for this study was formed starting from the CLC2012 Level-3 subclasses and by either grouping, i.e., merging classes (e.g., 4.1.1 Inland and 4.2.1 Salt marches) or splitting further classes (e.g., Dense and Sparse 3.2.3 Sclerophyllous vegetation) when needed.
Additionally, certain exclusions or modifications were applied mainly with the use type classes (e.g., 1.4.2 Sport and leisure facilities), with classes that were not present in our study area (e.g., 3.3.5 Glaciers and perpetual snow) and also with classes that are defined for the CLC-specific scale (e.g., 2.4.2 Complex cultivation patterns), since CLC uses a minimum mapping unit of 25 hectares. Agriculture classes were further analysed based on available geospatial data. The introduced nomenclature, including 27 different classes, is presented in Table 2. An intensive manual annotation procedure was carried out for the production of reference data. Two image interpretation experts who are also familiar with the CLC product and the study area manually digitized polygons for the different land cover classes. Since representative training samples are one of the most critical components in supervised classification, the experts studied the area thoroughly and intensively and noted as many variations of each class as possible. The annotation of each sample followed the principle of avoiding mixtures of the defined land cover classes [2]. Different data sources and multitemporal images were employed for the photointerpretation and annotation process, including Sentinel 2 and L8 images for the year of 2016, Google Earth Pro data for 2016 using Historical Imagery, the CLC2012 product and available CAP geospatial data for the crop fields of 2016. CAP geospatial data are collected by the agency through field surveys on a yearly basis. In Table 2 one can observe the number of available polygons for each class and the number of pixels used for training and testing for each category, as described in Section 2.4.2.

Training and Testing Areas
Reference data were created through an intensive manual annotation procedure and were divided into two independent sets: one for the training and one for the validation procedure. To avoid any spatial correlation between the training and validation datasets, the study area was partitioned in nine different blocks. Six of them were used for the training procedure and the remaining three for the testing one ( Figure 1, right). Strips of 900 m around the borders of the blocks were excluded in order to avoid correlation between the training and validation phases when applying the patch-based deep CNN classification. Regarding the sample size, we applied as minimum reference dataset size required for accuracy assessment, the rule of thumb proposed by [54], whereby 75-100 testing sample units (pixels in our case), per thematic class should be regarded for large areas and complex maps of more than 12 classes [25,55]. Moreover, mainly for computational reasons, the maximum number of pixels per class was limited to 15,000 pixels for training and 10,000 for testing. The pixel's number limitation was performed with a random selection on the available reference data of each class that exceeded these limits. In addition, another random selection was performed, in order to ensure that the training set, per-class, included a 60% to 80% of all reference pixels and the testing set a 20% to 40%, respectively. This was due to the fact that the annotated polygons did not cover the study in a uniform way.

Experimental Setup
Workflow: First, we ran two comparative experiments, for evaluating the performance of the SVM and CNN frameworks. Due to the high computational power and memory that was required for the CNN implementation, the initial experiments were performed on a spatiotemporal datacube with six dates (Dataset #0) and also on Dataset #1 with 11 dates (Table 1), but using a smaller subset of the reference data (i.e., approximately 30%). Based on the quantitative and qualitative evaluation, the SVM classifier was chosen to be applied for the main four classification experiments towards assessing the impact of integrating images of gradually increased cloud cover in the datasets (Section 3.2). Note, that for these experiments the computational requirements were gradually increased since for the first dataset (Dataset #1, 11 dates), the datacube contained 121 layers/features, for Dataset #2 (14 dates) contained 154, for Dataset #3 (16 dates) 176 and for Dataset #4 (18 dates) contained 198 features. Furthermore, additional experiments were performed by employing a subset of the most significant features for all datasets based on a feature selection strategy (Section 3.3). The pipeline of the developed classification framework and performed experiments are illustrated in the flowchart in Figure 3. Classification Algorithms: We compared the performance of two classification frameworks that have demonstrated high accuracy rates in the recent literature, i.e., one shallow kernel-based classifier (SVM) and one patch-based deep learning approach (CNN). In particular, a SVM classifier [19,20] from the LIBSVM library using the C-SVC type was employed [56]. For the SVM application different kernel functions have been proposed in the literature [22,57]. In this study, we employ two kernel functions: a linear one that has been proposed and tested as optimal when the number of features is large [58] and has resulted in relatively high accuracy rates in our previous studies [43,44], and the Radial Basis Function (RBF). The tuning of parameter C, referring to the cost of error penalty, is required for both kernels, while the RBF kernel also needs the definition of a second parameter g > 0, which denotes the width of the Gaussian kernel function. For our experiments, the classifiers parameters have been optimized by a five-fold cross validation grid search process: C = {0.01, 0.1, 1, 10, 100, 1000}, g = {0.0001, 0.001, 0.01, 0.1, 1, 10}.
For the deeper architecture among those that have been employed recently for classification tasks [33][34][35]37,59] we employed the relatively simple CNN architecture of the ConvNet Network. The network consists of 10 layers: two convolutional, two max pooling, two batch normalization, two activation functions (rectified linear unit-ReLU) and two fully connected ( Figure 4). We used a patch of 29 × 29 pixels centred in the pixel of interest. For the training, we randomly selected patches centred on the training pixels allowing them to overlap. During the validation phase, for each pixel we centred and extracted a patch. Then the predicted label was assigned to this specific pixel. More specifically, the raw input patch for Dataset #1 of 11 dates with size 121 × 29 × 29 is given as input to the first convolutional layer, which applied convolutional kernels of size 5 × 5. Next comes a transfer function layer that applies a ReLU element-wise to the input tensor and is followed by a batch normalization layer. Then a max-pooling layer follows, which downsamples with a size of 2 × 2 the input and lightens the computational burden. The next four layers follow the same pattern, with exactly the same parameters, and result in two final, fully connected layers that produce the final outputs classified in 27 different classes. For Dataset #0 of six dates the architecture remains the same, except for the size of the input patch, which changes to 66 × 29 × 29. The network had been trained using the stochastic gradient descent (SDG) optimizer for 30 epochs and early stopping had been used to prevent overfitting. We set the batch size to 100, the learning rate to 0.1 while every 3 epochs had been reduced by 0.01, the momentum to 0.9 and the weight decay parameter to 5 × 10 −5 . For the extraction of the training patches, the only selection criterion was the central pixel of the patch to belong to the specific class. For testing, a patch was extracted in each pixel of the image and the predicted value was assigned to the central pixel. Feature Selection: In order to further evaluate the contribution of the employed features, a dimensionality reduction procedure was applied in all four datasets. Generally speaking, feature selection methods can be categorized into three types: filter, wrapper, and embedded [60,61]. The filter methods that rank the discriminative power of each feature are independent of the learning process. The other two incorporate the supervised learning algorithm in their approaches for estimating the contribution of each feature. Nonetheless, the last two are also computationally expensive, thereby preventing their use in tasks where the dimensionality and the amount of the data are large [61]. To this end, for our experiments, where the volume of features is large, we selected a computationally efficient filter-based method [62,63], independently of the classification algorithm to be applied. In particular, we employed the Fisher score [60] selection strategy, which calculates the ratio of interclass separation and intraclass variance for each feature and thus expresses the discriminative power of each feature [61,64]. The Fisher score is one of the most widely used supervised feature selection methods and has been combined with SVM classifications in many recent studies [62,[65][66][67][68]. In detail, the Fisher score of the rth feature is computed by the formula where n w is the number of samples in class w, µ r w and σ r w are the mean and standard deviation of the w-th class for the rth feature, µ r the mean of the whole data set for the rth feature while the serial number or classes goes from 1 to c. Based on the Fisher score, all features across the temporal cube were ranked and two extra classification experiments for each of the four datasets were performed using the 70% and the 50% of the top ranked features.
Validation Accuracy Metrics: All performed experiments were evaluated qualitatively through a thorough visual inspection as well as quantitatively forming confusion matrices, i.e., double entry tables where row entries are the actual classes corresponding to the testing data and column entries are the predicted classes from the classifier. The standard accuracy metrics of Overall Accuracy (OA), User's (UA), Producer's Accuracy (PA) and Kappa coefficient (Kappa) were calculated for each case [69]. Additionally the combined metric of per class F-measure (F1) score is also provided, calculated as the harmonic mean of UA and PA: Additionally, to assess the statistical significance of the accuracy differences between pairs of the different classification algorithms, we employed the McNemar test, since we have used identical testing sets (related samples) for the accuracy assessment [25,70].

Experimental Results and Discussion
In this section, experimental results are presented regarding the performance of the classification algorithms (Section 3.1), the influence of the different cloud cover conditions (Section 3.2) on accuracy metrics, the importance of the number and type of features (Sections 3.3 and 3.4 ), as well as the per-class performance analysis and overall quantitative and qualitative evaluation of the mapping (Section 3.5).

Comparing the Performance of the SVM and CNN Frameworks
The initial comparative experiments, for evaluating the performance of the SVM with linear and RBF kernels and the CNN framework, were performed on Dataset #0 and Dataset #1 (Section 2.2). Both training and testing datasets were kept exactly the same for both the SVM and the CNN classification frameworks. Table 3 summarizes the resulting quantitative results for all classifiers regarding the average PA, UA and F1 metrics as well as the OA and Kappa accuracy metrics. For all experiments, the OA and Kappa differ with less than 2.3%. Regarding the per-class accuracy metrics, the average PA, UA and calculated F1 metrics highlighted that the SVM classifier with both kernel choices reported higher accuracy rates (up to 5.4%) when comparing with the CNN results. Overall, the linear SVM scored the highest rates for the majority of validation metrics on both datasets. In order to objectively assess the aforementioned differences, the McNemar test was employed [70]. In particular, the conventional threshold for declaring statistical significance of 5% was adopted, i.e., a p-value of less than 0.05 declares significance. Table 4 presents the recorded p-values of the McNemar test for all combinations of the two experiments performed, along with the level of significance stated for significant differences. Apart from the SVM-RBF with CNN pair on both datasets, all other differences were found to be significant at the 5% level of significance, while differences between different kernel SVM classifiers were also found to be statistically significant for the stricter 0.1% level of significance. Furthermore, in Figure 5 (top) the resulting land cover maps for the CNN and linear SVM are presented. Their disagreement (in black) can be also observed ( Figure 5, bottom) as well as the reference data for this particular region. After a closer look, one can observe that the resulting land cover map from the CNN, due to its patch-based nature, tends to generalize the prediction, especially across the class boundaries. Consequently, it resulted in a coarse generalized output including certain omission cases especially for linear objects like roads and rivers as well as for regions with high heterogeneity (e.g., small parcels of different crop type). Regarding the spatial distribution of the differences ( Figure 5, bottom left), they are scattered in the entire region mainly across class boundaries, apparently due to the generalized CNN output effect. In particular, these observations are obvious in areas of heterogeneous texture like mixed agricultural areas and other natural areas with frequent cover variation. On the other hand, large areas with continuous single-class coverage, e.g., sea, rice fields and urban regions, present less disagreement between the two land cover products. In comparison with the reference data ( Figure 5, bottom right), both resulting maps present relatively high agreement in most cases.  In overall, SVM delivered higher reliability and accuracy for individual classes and more crisp and accurate boundaries. Taking also into consideration the high computational cost of the CNN framework, we chose to perform the main experiments for assessing the impact of different cloud conditions and number of features based on the linear SVM classification framework.

Assessing the Influence of Different Cloud Cover Conditions
In order to assess the influence of cloud cover conditions in classification, we performed several experiments on the four datasets (Table 1), which were formed by adding more images with incrementally increasing cloud cover over the land. In all experiments the same training and testing sets were employed. For the linear SVM, the C parameter value was set to 0.1. The main question here was whether the addition of more dates/features would help the classifier by adding more observations and potentially strengthening certain phenological characteristics, especially for crop type discrimination, or impede performance by deteriorating class statistics due to the increased presence of clouds.
In Figure 6, the resulting accuracy scores after classifying the four different datasets are presented. In all cases the OA and Kappa coefficient indicated a quite good performance, with rates of over 72%. The highest score was achieved for Dataset #1 (with the lower amount of cloud cover, i.e., <20%) with an OA at 80.5% and Kappa at 78.0%. For the second dataset, when adding three more images with up to 30% clouds, the resulting OA and Kappa rates decreased slightly (~1%). By adding more cloudy images (Datasets #3 and #4) the OA and Kappa rates decreased further by approximately 4.5%. Regarding the average PA, UA and F1 scores, the second dataset resulted in slightly higher accuracy rates than the first one, with rates of 69.1%, 67.6% and 66.4%, respectively. By adding more images with higher than 30% cloud cover, the PA, UA and F1 scores dropped below 66%. Additionally, in Figure 7 (bars) all the resulting per-class F1 scores for all four datasets are presented. After a close look, one can observe that certain classes, e.g., BLF, KWP, CRL, RCF, WBD and CWT, stand out as the most accurately classified, since they achieve F1 scores of over 80% for all datasets. In 13 of 27 cases, the highest F1 rates were achieved for Dataset #2, while Dataset #1 follows, holding the best rate for nine classes. More specifically, natural vegetation classes like CNF, NGR, SSV, SVA and MRS present higher rates for Dataset #2, like most of the agriculture classes (OLG, RCF, PTT, GRF, GRH). Dataset #4 (more images, more clouds) delivered the higher F1 scores for four classes among those three agricultural varying cycle ones, i.e., FRT, KWP and MAZ. Influence of clouds in the reference data: In order to evaluate the influence of cloud presence in the reference data (training and testing sets) we estimated the cloud masks for all images based on the F-mask algorithm [45] and the QA band. In Figure 7 (lines), the average per-class cloud presence in the reference data for all datasets is presented. As expected, the cloudy pixels in the reference data are increasing from Dataset #1 to #4. The reference data for the PTT class is an exception since Dataset #2 presents a lower proportion of cloudy pixels than Dataset #1, indicating that the three additional images offered a significant amount of non-cloudy observations. Generally, most classes do not exceed the 20% of cloud cover in the reference data.
The highest rates of cloudy pixels in the reference data are recorded for the classes MNL (48.1-55.5%) and SVA (18.7-33.3%), which are usually found in alpine areas of high altitudes. By comparing both lines and bars of Figure 7, one can observe that for most classes that achieved higher scores in Dataset #2 (like DUF, RAN, SSV, OLG, RCF, GRF, GRH, MRS and WCR), the cloud presence in Dataset #2 did not exceed 10% but was slightly higher than in Dataset #1. This fact indicates that the trade-off between more observations, in terms of more images and less reliable observations, i.e., more cloudy pixels, was favourable for these cases. Another interesting point is that most classes (except for KWP) that, as analysed in the previous paragraph, achieved high rates of over 80% for all datasets, were not the ones that reported the lowest presence of clouds in the reference data.
It should also be highlighted that a significant increase in the cloud presence (in the reference data) between the less cloudy dataset and the cloudiest one does not result in an equivalent decrease in the reported accuracy (F1) rates (Figure 7). For example, the land cover classes of ICU and RCF have an almost 20% difference in cloud presence (comparing the 1st and 4th dataset), while the reported F1 scores differ by less than 7%. This is also the case for the largest differences on the reported F1 scores. For example, classes MES and GRH resulted in a 53% (1st-4th datasets) and a 31% (2nd and 3rd datasets) difference in the F1 scores, respectively, while the corresponding differences of cloud presence on the reference data were 13% and 0.1%, respectively.
These observations can be also justified from Table 5, which presents the highest F1 scores per class, the corresponding dataset and the average cloud presence in the reference data of this dataset.
In particular, the highest F1 scores (except for KWP) were not achieved from classes with the lowest rates of cloudy pixels (marked with green). Generally, cloud cover of more than 25% (e.g., MNL, SVA) resulted in relatively low accuracy rates. Overall, the combined analysis did not find a direct relationship between cloud presence and F1 scores achieved, indicating that the combination of the spatial, spectral and temporal resolution of the data along with the intrinsic characteristic and phenological behaviour of each class determines how successful the classification will be.

Assessing the Influence of Feature Selection
In order to assess the impact of dimensionality reduction on the classification accuracy, additional experiments were performed based on the SVM classification framework (Section 2.4.3), applying a feature selection strategy. In particular, classification results derived by employing all available features were compared with the ones derived by using the 70% and the 50% of the top ranked features based on the calculated Fisher score per dataset.
In Table 6, quantitative results regarding the classification accuracy of the total 12 experiments are presented. After a close look, one can observe that for the first two less cloudy datasets, the 70% selection leads to increased accuracy metric rates, while further removal of the 50% of the lower ranked features gives slightly less accurate results. On the other hand, for the more cloudy datasets, #3 and #4, classifications on top 70% and 50% features present a significant improvement in all accuracy metrics, in some cases up to approximately 10% (average F1 for #4 and Top 50%). In particular, the 50% selection on those datasets in fact excluded all features of the cloudiest dates forming datacubes composed of 15 and 16 dates for Datasets #3 and #4, respectively.
Among all performed experiments, the highest OA and kappa rates (81.38% and 79.05%, respectively) were achieved when the top 70% of the features were employed for Dataset #1 (in bold in Table 6). On the contrary, the lowest rates (double underlined in Table 6) were reported for Dataset #4 (with up to 60% of clouds) and all available features (75.25% and 72.28%, respectively). Concerning the average PA, UA and F1 metrics, higher rates were achieved for top 70% feature selection in Dataset #2 (71.70%, 69.94% and 69.96%). It is worth mentioning that classification on top 70% of Dataset #2 presents only four cases of low rates (<40%) for PA, UA and F1 while having the highest achieved number (28) of high rates (>80%) for those metrics. Although in all experiments the OA and kappa metrics differed by less than 7%, the per-class PA, UA and F1 metrics varied significantly and up to 70% (e.g., F1 for MNL class).
More specifically, concerning the individual land cover classes' performance on the different experiments, certain classes i.e., BLF, KWP, CRL, RCF, WBD and CWT reported high (in most cases >80%, marked bold) UA, PA and F1 rates. As can be observed in Table 6, some of these classes have a large sampling size (BLF, RCF, WBD), while others have a smaller one (e.g., KWP, CRL). On the other hand, other classes (e.g., NGR, DSV, SVA) resulted in overall lower accuracy rates (i.e., lower than 40%, marked with red). Again in this case, those classes correspond to both larger and smaller training sample sizes. The feature selection strategy for the first two datasets significantly improves the performance of MNL, OLG and FRT classes, even up to 46%. On the contrary, for the same datasets the exclusion of 50% of the lowest ranked features had a negative effect on the accuracy metrics, mainly for certain man-made classes (e.g., DUF, SUF, ICU, RAN, GRH) and water classes (WCR, WBD). For Datasets #3 and #4 with more dates and cloud cover, increases are observed after the feature selection on the top 50% for agriculture classes VNY and PTT and significant rises (up to 59%) for high brightness classes MES, MNL and GRH.

Assessing the Contribution of Spectral Features
The analysis of the selected, top-rank features in all experiments indicated that certain spectral indices as well as the spectral bands after 600 nm were proven to be the most useful for the classification procedure. In particular, in all dataset cases the overall ranking was the following: (1) NDWI; (2) NDVI; (3) NDBI; (4) NIR; (5) SWIR 1 or 2; (6) SWIR 2 or 1; (7) Red; (8) MSAVI; (9) Green; (10) Blue; (11) Ultra Blue. These observations are, in general, in accordance with the recent related literature, where spectral indices are widely used and proven to contribute significantly in remote sensing classification tasks [6,8,9,47,71,72] since their calculation reduces variability and enhances the separability of terrain classes [18]. NIR and SWIR bands as well as the Red one follow in high ranking, as expected, since they are specifically designed on these wavelength ranges to enhance the discrimination of different vegetation types, highlighting moisture and water content while the SWIR bands also provide thin cloud penetration capabilities [73]. All these attributes can be highly useful when separating, as in our case, numerous classes (more than 25) of different land cover type that include many natural vegetation and crop categories. On the other hand, Ultra Blue and Blue bands were the features found to have less discriminative power in all cases.

Per-Class Analysis, Validation and Discussion
Following the comparative analysis between the different algorithms, datasets and features, in this section per-class performance and separability aspects are discussed based on the SVM experiment on the Dataset #2 with the top 70% of the employed features. This specific experiment presented the minimum number of classes with low PA, UA and F1 per-class scores (see Section 3.3 and Table 6). The resulting classification map is presented in Figure 8. Note that a water mask was computed based on the L8 QA band. It was then applied for marine waters with a buffer of 5 km in order to exclude the non-coastal marine waters from the prediction. One can observe in Figure 8 that urban areas like the metropolitan area of Thessaloniki, as well as smaller towns like Katerini, Veria, Naoussa, Kozani and Ptolemaida have been accurately detected by the proposed classification framework (i.e., classes DUF, SUF and ICU) with a small extent of misclassification cases. Moreover, large inland water bodies such as Vegoritida, Polyfytos and Koroneia lakes have been successfully detected as well. In a few cases, certain pixels have been wrongly labelled as WCR (Pikrolimni Lake) or CWR (Volvi Lake). Regarding the performance over agricultural areas, in the northern region of Central Macedonia, which is characterized by vast cereal crop cultivation, the resulting detailed (in terms of number of classes) land cover map managed to correctly detected the CRL class as well as the extended rice cultivation regions (RCF class) around Axios and Aliakmonas River deltas.
In a similarly successful manner, the classification framework managed to detect and correctly label fruit tree cultivation areas (FRT class) around Naoussa, Veria and Edessa regions. Forest classes (BLF, CNF) were correctly reported on upland and mountainous areas, while Mediterranean-type evergreen sclerophyllous vegetation classes (DSV, SSV) were mapped on lowlands. Mountain massifs are also spotted with the presence of MNL, SVA and RNS classes. Certain classification errors are observed around Olympus Mountain where parts of bareland rocks usually covered by snow are classified as GRH. More detailed cases with classification errors are further discussed in Section 3.5.2.

Per-Class Quantitative Assessment
The aforementioned qualitative observations are in accordance with the reported quantitative results from the corresponding confusion matrix (Table 7). Firstly, regarding the artificial terrain classes (i.e., DUF, SUF, ICU and RAN) that consist of sealed man-made surfaces, they overall achieved medium to high PA rates (57.38-88.15%) and in most cases they are confused between each other. The artificial terrain class MES resulted, also, into a number of classification errors mainly with ICU (23%) and RNS (21%). The spectral behaviour distinct characteristics among MES (open-pit extraction of minerals/inert materials) and RNS (bare land areas, rock, cliffs and sand) were proven challenging to separate.
Concerning natural areas, BLF was detected with relatively high accuracy (PA: 86.24%) and was mainly confused with CNF, while the latter also achieved a high PA rate of approximately 88%, with a few cases of confusion with DSV and BLF. NGR areas were successfully detected with a PA rate of approximately 75%. The DSV class, corresponding to areas of dense bushy sclerophyllous vegetation, reported a lower PA rate. Approximately 34% and 41% of its testing pixels were classified as BLF and CNF, respectively. These classification errors involve three classes of dense woody vegetation that possess spectral similarities across the spectral-temporal datacube at a 30-m spatial scale. SSV achieved an average PA score (55.47%), presenting several mixing cases with DSV and SVA. High-altitude moor areas with moss and lichen, represented by class MNL, reported a PA rate of 63.05% and were mainly confused with areas of scattered vegetation (SVA). The latter hold a PA rate of 68.30% and in certain cases the testing samples (17%) were confused with SSV, a class also representing sparse vegetation.
The agricultural classes of KWP, CRL, MAZ, RCF, GRH were accurately classified at rates over 85%. VNY and OLG reported a medium PA rate of approximately 57%, since the first one presented mixings mainly with classes SUF and FRT and the second with SSV and FRT. Furthermore, FRT, which represents a heterogeneous land cover class including a variety of fruit trees and therefore patterns, texture, etc., was classified with a PA rate of 64.10%.  In particular, FRT presented confusions with a number of classes like KWP, VNY, OLG, GRF and SUF. The fact that permanent crop classes (i.e., VNY, OLG, FRT) presented misclassification error mixings not only with crop classes but also with SUF and SSV, can be attributed to similar spectral behaviour at this scale since those specific cultivation covers at a 30 m pixel include both vegetated (plants and trees) and non-vegetated (mainly soil) parts, similar to classes SUF and SSV. Another heterogeneous class, GRF, which includes grass fodder cultivations like clover and alfalfa, also presented some confusion (PA: approx. 51%) with FRT, VNY, CRL and NGR. On the other hand, PTT was detected with a relatively high PA rate (approx. 70%) and presented certain mixings with MAZ.
Bare land class RNS was evaluated with a lower PA of 45.75% as it presented mixings with many classes, especially with SVA and MNL, all mainly found in bare/semi-bare mountainous areas, but also to a smaller extent with high brightness man-made land covers like ICU and GRH. The MRS class, which includes low-lying land usually flooded by water and partially covered with herbaceous vegetation, presented a relatively low PA rate (46.5%). MRS was mainly confused with certain vegetation classes (e.g., NGR, BLF, SSV, FRT) but also with a few water classes, especially WCR. On the contrary, all water classes (i.e., WCR, WBD and CWT) were detected with high classification accuracy rates of over 90%.

Discussing Classification Errors in Challenging Cases
In this section, certain classification difficulties and misclassification cases are discussed in more detail. In particular, Figure 9a illustrates the challenges in separating MES areas that present high spatial, spectral and temporal heterogeneity and similar behaviour with a variety of terrain objects, materials and mixed covers. In this case (Figure 9a), one of the largest lignite reserves in Greece, the lignite mine of Ptolemais and the Ptolemais power plant area, are presented with a natural colour composite (RGB) from a L8 and Google Earth imagery. As can be observed in the resulting land cover map, apart from the MES class, the mining areas were also misclassified to other artificial terrain classes like DUF, SUF, ICU, RAN and also RNS due to similar spectral characteristics.
Another example with extended classification errors is presented in Figure 9b in the region of mountain Olympus. In particular, apart from MNL, SVA and RNS classes, all to be expected in high-altitude terrains, we also observe mapping of man-made classes MES, ICU and GRH. In particular, Mount Olympus, as the highest mountain of Greece and of our study area, presents snow and cloud coverage for a long period of the year, and therefore the higher intensity values across the spectrum for this area may have been confused with similar values of stable artificial surfaces classes. Moreover, this particular region was one of the areas participating in testing after the partition and thus no training data incorporating this seasonal information for alpine classes MNL, SVA and RNS were fed to the classifier. The latter emerges as a possible drawback of the strict approach of spatial partition of training and testing areas for the map production, especially for cases like this, i.e., semi-arid areas where the influence of lithological differences on the spectral response of a land cover class is not negligible. Nonetheless, partition was required for a direct quantitative comparison of the SVM with the CNN approach. In general, if we assessed solely the SVM framework validation, although a not-partitioned sampling procedure will not involve exactly the same samples in the learning and the testing process, it would undoubtedly include neighbours, probably leading to an overestimation of the calculated accuracy.
Another case of resulted classification errors is presented in Figure 9c. In particular, an agricultural area near Lake Vegora is presented where a variety of crop types are cultivated including vineyards, fruit trees, cereals, potatoes and maize. After a closer look, one can observe that several fruit parcels (FRT, marked with blue in the RGB images) have been wrongly classified as VNY. At this particular spatial scale (30 m), where linear against dot pattern characteristics cannot be effectively recorded, the crop types of similar spectral and temporal (phenological) behaviour were not distinguished successfully by the classifier in many cases and thus presented mixings between them.

Conclusions
In this paper, we introduced a framework for the systematic, detailed land cover mapping from annual high-resolution satellite data. The main contribution of our approach lies in the assessment of the influence of cloud cover on the classification of multitemporal datasets, composed of images with different cloud cover levels. Moreover, we performed detailed mapping in terms of targeting to discriminate between more than 25 land cover classes. The developed nomenclature contained 27 classes, based mainly on CLC Level-3 classes and several additional crop types. The annual satellite L8 data were formed into different spatio-spectro-temporal datacubes and a machine learning classifier with either a shallow (SVM) or a deeper architecture (CNN) was applied for the classification procedure. Quantitatively, the resulting overall accuracy rates differed by less than 3%. However, the SVM classification framework was less computationally demanding and produced more accurate results across class boundaries. The highest overall accuracy rates (above 80%) were derived for multitemporal datasets with up to 20-30% cloud cover. The feature selection procedure was proven beneficial for most cases. Regarding the influence of cloud presence on the training and testing data, our experiments indicated that a cloud cover of over 25% per class resulted in lower F1 rates. However, a direct correlation between cloud presence (in the reference data) and the resulting per-class accuracy was not observed. This fact highlights that the combination of the employed features along with the distinct spectral and/or phenological characteristics of each class determine how successful the classification would be. These promising results indicate that operational classification frameworks can be developed for detailed land cover mapping on an annual basis. Towards the latter, the optimal integration of topographic, climate and microclimate information in the framework is an open matter and can address the different phenological patterns of vegetation, which currently are modelled during the training procedure as a variation among labelled features. The contribution of additional temporal features (e.g., day of year with the max value of a vegetation index) that are not correlated with clouds will be also assessed.
Author Contributions: C.K. and K.K. conceived and designed the experiments; C.K., M.V. and G.A. performed the experiments; C.K. and K.K. evaluated the results and wrote the paper.