Validation of Visually Interpreted Corine Land Cover Classes with Spectral Values of Satellite Images and Machine Learning

We analyzed the Corine Land Cover 2018 (CLC2018) dataset to reveal the correspondence between land cover categories of the CLC and the spectral information of Landsat-8, Sentinel-2 and PlanetScope images. Level 1 categories of the CLC2018 were analyzed in a 25 km × 25 km study area in Hungary. Spectral data were summarized by land cover polygons, and the dataset was evaluated with statistical tests. We then performed Linear Discriminant Analysis (LDA) and Random Forest classifications to reveal if CLC L1 level categories were confirmed by spectral values. Wetlands and water bodies were the most likely to be confused with other categories. The least mixture was observed when we applied the median to quantify the pixel variance of CLC polygons. RF outperformed the LDA’s accuracy, and PlanetScope’s data were the most accurate. Analysis of class level accuracies showed that agricultural areas and wetlands had the most issues with misclassification. We proved the representativeness of the results with a repeated randomized test, and only PlanetScope seemed to be ungeneralizable. Results showed that CLC polygons, as basic units of land cover, can ensure 71.1–78.5% OAs for the three satellite sensors; higher geometric resolution resulted in better accuracy. These results justified CLC polygons, in spite of visual interpretation, can hold relevant information about land cover considering the surface reflectance values of satellites. However, using CLC as ground truth data for land cover classifications can be questionable, at least in the L1 nomenclature.


Introduction
Land use/land cover (LULC) mapping is one of the most important tasks of environmental assessments and environmental monitoring [1,2]. LULC mapping is based on remotely sensed images and has a history stretching back about 40 years [3]. Land cover categories can be identified with automatic image classification using machine learning algorithms using training data to develop a model, which provides the best fit for the known data points; another alternative is a visual interpretation and land cover patch is 25 ha, land cover polygons reflect better accuracy: CLC1990 was the first product; since then, the change layers have been produced with 5 ha MMU, and only the changes are implemented in the actual mapping using the last revised version of the CLC map (in 2018 the CLC2012 was updated) [26]. CLC has a detailed hierarchical nomenclature on three levels. The most detailed level is Level 3, with a maximum of 44 categories. Level 2 has a maximum of 15 categories, and it aggregates the detailed categories to a less detailed system. Level 1 has a maximum of 5 categories. The three levels are connected to each other since the level 1 nomenclature is a hierarchical stepwise aggregation of level 2 and level 3 categories [27]. We used the most up-to-date CLC2018 dataset, which was based on 2017 and 2018 satellite images (mainly interpreted from Sentinel-2 images), and we compared the pixel characteristics of CLC polygons with satellite images (Landsat-8, Sentinel-2 and PlanetScope) from 2018 and analyzed the L1 categories: artificial surfaces (Art), agricultural areas (Agr), forests and semi-natural areas (F), wetlands (Wet) and water bodies (Wat).
We obtained the satellite images for 20 August, 20 August and 23 August 2018 taken by three different satellites: Landsat-8 [28], Sentinel-2B [29] and PlanetScope Dove [30], respectively. The close dates of captures (a few days of difference) were the most important prerequisite to exclude differences of changes in the phenophase of vegetation or land cover change. August is the driest month in Hungary and, according to the meteorological data [31], there was no rain in the region in the studied period between 13 and 23 August (i.e., there was no rain even in the period preceding the images; thus, reflectance values were not biased by the weather). All images were cloud-free.

Study Area
The study area was a 25 × 25 km area in the surroundings of the town of Tokaj, NE-Hungary. The main purpose was to examine an area where as many CLC categories are present as possible. This area contained 19 level 3 categories out of the possible 44. Thus, the area represented a heterogeneous landscape mosaic concerning the landscape elements (i.e., patches), as it is located at the junction of five different microregions with various land types of different characteristics, including floodplains along the Tisza and Bodrog rivers, the hilly areas of the Tokaj hill in the center of the study area, and the sandy plains in the southeast part of the study area [40]. We aggregated the categories in the area into 5 main categories according to the CLC level 1 nomenclature [27] (Figure 2).

Study Area
The study area was a 25 × 25 km area in the surroundings of the town of Tokaj, NE-Hungary. The main purpose was to examine an area where as many CLC categories are present as possible. This area contained 19 level 3 categories out of the possible 44. Thus, the area represented a heterogeneous landscape mosaic concerning the landscape elements (i.e., patches), as it is located at the junction of five different microregions with various land types of different characteristics, including floodplains along the Tisza and Bodrog rivers, the hilly areas of the Tokaj hill in the center of the study area, and the sandy plains in the southeast part of the study area [40]. We aggregated the categories in the area into 5 main categories according to the CLC level 1 nomenclature [27] (Figure 2).

Image Processing
The workflow started with the tasks related to GIS processing: reprojection of the CLC2018 vector dataset from the ETRS89 to the UTM34N coordinate system and clipping to match the study area extent. Then we extracted the descriptive statistical values (minimum, mean, median, maximum, standard deviation (STD)) of the pixels of each satellite

Image Processing
The workflow started with the tasks related to GIS processing: reprojection of the CLC2018 vector dataset from the ETRS89 to the UTM34N coordinate system and clipping to match the study area extent. Then we extracted the descriptive statistical values (minimum, mean, median, maximum, standard deviation (STD)) of the pixels of each satellite band (blue, green, red (RGB), near-infrared (NIR) bands and the calculated NDVI layers) covered by CLC polygons. This resulted in 15 variables (3 sensors × 5 layers). This part of the workflow was performed in QGIS 3.10 [41].

Statistical Analysis
The extracted dataset was processed with R software [42] using "ggplot2" [43] and "multcomp" [44] packages. Distributions of the variables were tested for normality with the Shapiro-Wilk test. We applied the -2-way ANOVA test based on 20% trimmed values to test the median of the reflectance values (dependent variable) against two independent variables. In addition to nominal factors (the independent variables: CLC categories, satellites), we also included the analysis of interactions to reveal if a factor is influenced in the function by another factor [45]. Satellites had different spatial resolutions; thus, in increasing order, it was suitable for the main effect and the abscissa of the interaction plot, while the land cover categories formed the other main effect.
We applied General Linear Modeling (GLM) to determine the most important biasing factors of the reflectance values. Independence of data was ensured by using the aggregated values (i.e., mean and STD) of the CLC polygons, variance homogeneity was checked by the Levene test, and the normal distribution was checked by the Shapiro-Wilk test. We performed two GLMs; in the first model, the median of the reflectance was the dependent variable, while in the second model, the STD, the independent variables were the following factors: land cover category (L1), the satellite type (SAT) and the band of the satellites (band). We also determined the effect size (ω 2 ) to estimate the contribution of the given factors to the variance of the dependent variable [46]; the effect can be categorized as small (ω 2 : 0.01-0.06), medium (ω 2 : 0.06-0.14), or large (ω 2 > 0.14) [45]. Thus, we were able to evaluate the significance and contribution of the given influencing factors and the statistical interactions. GLMs were conducted in jamovi 1.2.16 [47].
We conducted classifications to reveal whether the CLC L1 level classes can be discriminated by the spectral values of satellite bands and the NDVI. We applied a traditional and parametric method, the Linear Discriminant Analysis (LDA), and a robust nonparametric classifier, the Random Forest (RF).
LDA is a popular method in multivariate statistical analysis and is powerful but has several assumptions. It requires a balanced set of cases within different categories and multivariate normal distribution [48][49][50]. Accordingly, if there are outlier data within the categories, the number of misclassifications will increase. The RF algorithm works with hundreds of classification decision trees: the dataset is sampled as many times as the number of decision trees with bootstrapping (i.e., the same data can be used in the same dataset several times). The number of variables is the square root of the total number of variables. As a consequence of the procedure, there are no limitations or prerequisites; moreover, the approach is efficient [48,51].
We performed the classifications involving all variables, and we then repeated all classifications with the most important variables determined by the Recursive Feature Elimination (RFE). Fewer variables can result in good accuracy and help to avoid overfitting. The RFE works as follows: we need to specify an algorithm (e.g., LDA or RF), and the RFE performs a classification; next, the classification runs again, omitting one variable and recording the validation loss. The variable omission is repeated until all variables are omitted and, based on the contribution of the variables, a rank is produced [52]. The RFE was conducted with the RF algorithm with repeated 5-fold cross-validation with 10 repetitions (i.e., based on 50 models) using the "caret" package of R [53].

Analysis of Data Representativeness
The representativeness of data is always among the main questions in all studies. Although we had 984 CLC polygons, the water body category had only 39 cases. We aimed to reach 85% accuracy (OA) with 5% allowable error, which requires at least 200 cases according to Foody [54] (Equation (1)), practically 50 cases per CLC-class. We aimed to use a balanced design with the same data size, so we omitted the water category from this analysis. Accordingly, we performed an analysis using stratified random sampling (10 random datasets) and 2-fold cross-validation with 25 repetitions for all random samples; therefore, all randomizations were based on 50 models: where n is the required number of data, z is the standard error associated with the chosen level of confidence, P is the planned value of accuracy (%), and h is the allowable error (%). In our test z was 1.96 (95% confidence), P was 85%, and h was 5%. We applied only the RF classifier, including all variables, and applied complete random models (samples came from random sampling, and the RF models were also randomized, as R makes it possible to ensure similar or different model runs). The models' OA values were evaluated by satellites using ANOVA. Our H 0 was that the means of different RF models based on random samples were equal. The Tukey-test was used in the post hoc analysis. We supposed that if the model means were equal, data representativeness is ensured, as reflected in the 10 randomized models; therefore, the results can be general and be true for other areas as well.

Accuracy Assessment
Classification accuracy was tested with two methods. Overall Accuracy (OA) was tested with a repeated 5-fold cross-validation with 10 repetitions (RCV) [55]. Here, we had split the whole dataset into five subsets and used 4 of them as train data and 1 as a test set; then, another subset was used as the test set and the rest 4 as a train. The process was concluded when all subsets were used as a test set, and the whole procedure was repeated 10 times. Finally, we obtained 50 models, and we could calculate the medians, quartiles, minimums, and maximums of OAs for each model. Classification accuracy on the category level was evaluated using the traditional approach of Congalton [13], based on the confusion matrices and the derived accuracy metrics: User's Accuracy (UA) and Producer's Accuracy (PA). We used the 10 randomizations of data representativeness to determine the ranges of UA and PA (instead of reporting only one confusion matrix). Classification accuracy was determined with the "caret" package of R [53].

Differences of Spectral Characteristics of CLC Categories
The spectral bands of different satellites had significant differences (p < 0.05 according to Tukey's test), i.e., CLC polygons of the same category were represented by different statistical parameters.
Regarding the visual bands, the median and minimum of the blue and the minimum of the red bands in the PlanetScope image had significantly higher (p < 0.05) values compared to the other two sensors. In either blue, green or red median graphs, artificial and agricultural categories showed significantly higher values with all sensors than the forest, wetland and water bodies categories. STDs were significantly higher (p < 0.05) in the case of Sentinel-2 data for almost all categories in RGB bands ( Figure 3).

Reflectance Values and Nominal Factors
The 2-way ANOVA revealed that both satellite type (SAT) and land cover category (L1) had significant effects in all cases. Furthermore, there was a significant interaction effect in the case of red, NIR bands and NDVI, i.e., CLC categories were affected by the satellite sensors differently in these cases (Table 1; Figure 4). Regarding the visible bands, the agricultural and artificial areas and the water, wetland and forest categories formed groups. In the case of the NIR band, the groups were almost the same, but the highest values were found in the forests and wetlands, and water was placed at the bottom of the diagram ( Figure 4). Regarding the NDVI, agricultural, artificial and water categories were in one group with lower values and forests and wetlands in another group with higher values.

Reflectance Values and Nominal Factors
The 2-way ANOVA revealed that both satellite type (SAT) and land cover category (L1) had significant effects in all cases. Furthermore, there was a significant interaction effect in the case of red, NIR bands and NDVI, i.e., CLC categories were affected by the satellite sensors differently in these cases (Table 1; Figure 4). Regarding the visible bands, the agricultural and artificial areas and the water, wetland and forest categories formed groups. In the case of the NIR band, the groups were almost the same, but the highest values were found in the forests and wetlands, and water was placed at the bottom of the diagram ( Figure 4). Regarding the NDVI, agricultural, artificial and water categories were in one group with lower values and forests and wetlands in another group with higher values.  GLMs revealed that the variance of reflectance values' medians and STDs could be explained by the independent factors involved. When medians were the dependent variable (Table 2), the adjusted R 2 was 0.972, and all factors and their interactions were significant (p < 0.001). Satellite bands made the largest contribution to the explained variance; the next largest was the interaction of the bands and land cover classes (L1), and the L1 alone also made a small contribution. The model with the STD as the dependent variable was also strong; the adjusted R 2 was 0.895 (Table 3). Regarding the contributions, similarly to medians, satellite bands had the largest effect, and the interaction of bands and LCs (band × L1) and the LCs alone (L1) was also in the second and third place. However, in this case, the satellite type (SAT) and its interaction with the bands (SAT × bands) had a small effect. GLMs revealed that the variance of reflectance values' medians and STDs could be explained by the independent factors involved. When medians were the dependent variable (Table 2), the adjusted R 2 was 0.972, and all factors and their interactions were significant (p < 0.001). Satellite bands made the largest contribution to the explained variance; the next largest was the interaction of the bands and land cover classes (L1), and the L1 alone also made a small contribution. The model with the STD as the dependent variable was also strong; the adjusted R 2 was 0.895 (Table 3). Regarding the contributions, similarly to medians, satellite bands had the largest effect, and the interaction of bands and LCs (band × L1) and the LCs alone (L1) was also in the second and third place. However, in this case, the satellite type (SAT) and its interaction with the bands (SAT × bands) had a small effect.

CLC Categories as Reflected in Classification Algorithms
Classifications revealed that the best classification accuracies (OAs) belonged to the PlanetScope satellite, while the Landsat provided the worst; however, the medians of 50 models (for each algorithm and satellite data from the 10-fold cross-validation with 5 repetitions) ranged between 71.1% (LDA with Landsat data) and 78.5% (LDA with PlanetScope data; Figure 5). Although LDA provided the best median of OAs, RF was only 1% worse; furthermore, the median was only one parameter: the minimum value was 65.1% for LDA, and 69.2% for RF and the maximums were equal (86.5%). In the case of Sentinel, the medians of OAs were almost equal: 75.3% (LDA) and 76.1% (RF), but the RF's minimum and maximum were both higher than the same parameters of LDA. Classification with Landsat data showed the worst performance, and this was especially true of the LDA models where the maximum (79.1%) was worse than the upper quartile of RF.s, LDA.p and RF.p ( Figure 5). We repeated the analysis with all available bands in the case of Landsat-8 and Sentinel-2, and the OAs improved by 3-6%. We observed the largest (6%) increase of medians in the case of Sentinel-2 classified with LDA, but for Landsat-8, the increase was only 2% (Table 4). Class level accuracy measures revealed a heterogeneous result ( Figure 6). PAs indicated greater accuracies (~90%) for agricultural areas and forests; artificial areas had moderate accuracies (about 60%), while the performance of wetlands was the poorest (even below 40%). Usually, PAs were similar regarding the satellites; the only divergent case occurred in the artificial areas where the models running with Landsat data had the lowest values. UAs were larger than PAs even in the case of the poorest results, which belonged to the agricultural areas (75-55%). Artificial areas and wetlands had the highest values We repeated the analysis with all available bands in the case of Landsat-8 and Sentinel-2, and the OAs improved by 3-6%. We observed the largest (6%) increase of medians in the case of Sentinel-2 classified with LDA, but for Landsat-8, the increase was only 2% (Table 4). Class level accuracy measures revealed a heterogeneous result ( Figure 6). PAs indicated greater accuracies (~90%) for agricultural areas and forests; artificial areas had moderate accuracies (about 60%), while the performance of wetlands was the poorest (even below 40%). Usually, PAs were similar regarding the satellites; the only divergent case occurred in the artificial areas where the models running with Landsat data had the lowest values. UAs were larger than PAs even in the case of the poorest results, which belonged to the agricultural areas (75-55%). Artificial areas and wetlands had the highest values (about 90%), and forest UAs were around 80%. Regarding the satellites, agricultural areas and wetlands had decreasing accuracy as they moved towards coarser spatial resolution (i.e., from the PlanetScope to Landsat). Two-way ANOVAs where the UA and PA were the dependent and the L1 and satellite type were the independent variables showed that the main effect of L1 was significant (F = 17.105, df = 3, p < 0.001), but the main effect of satellites was insignificant (F = 0.642, df = 2, p = 0.527); furthermore, their interaction was also insignificant (F = 0.789, df = 6, p = 0.579).

Data Representativeness
A crucial question is how the data represent the study area. Our randomized sampling, using randomizations combined with 50 classifications per satellite (10 × 50 × 3, 1500 models), showed that results had minor differences (Figure 8). Our hypothesis was that if the OA means of 50 models do not differ significantly by randomizations, the results are representative because input data do not influence classification accuracy. In the case of Landsat and Sentinel satellites, classification accuracies were statistically equal in 7 cases out of the 10 randomizations. Landsat had two groups (7 and 3 randomizations), and the difference between the best and worst OA medians was 10% (L.4 and L.8 according to Figure 8). The difference between the statistically different groups (i.e., the threshold) was 3.3%. Sentinel had 3 groups, and the difference between the poorest and the best performance was 7.8%. Although the difference was significant (p < 0.05) between S.8 and S.9 (according to Figure 8) in statistical terms, the difference between OA medians was only 1.42%; therefore, the effect (the magnitude of differences) was low. Regarding the other threshold (between S.5 and S.1), the difference of medians was only 0.2. In the case of PlanetScope, differences were more enhanced, as the 10 randomizations formed 4 groups. The differences between the best and worst model were the highest among the satellites, reaching 14.4%; however, differences between the two sides of the thresholds were 2.9% (P.9 vs. P.3), 1.1% (P.4 vs. P.2) and 2.2% (P.8 vs. P.1). A change in the minimums regarding P.4 and P.2, from 51.1% to 46.4%, can explain the statistical difference; the remaining groups, however, had theoretical relevance.

Data Representativeness
A crucial question is how the data represent the study area. Our randomized sampling, using randomizations combined with 50 classifications per satellite (10 × 50 × 3, 1500 models), showed that results had minor differences (Figure 8). Our hypothesis was that if the OA means of 50 models do not differ significantly by randomizations, the results are representative because input data do not influence classification accuracy. In the case of Landsat and Sentinel satellites, classification accuracies were statistically equal in 7 cases out of the 10 randomizations. Landsat had two groups (7 and 3 randomizations), and the difference between the best and worst OA medians was 10% (L.4 and L.8 according to Figure 8). The difference between the statistically different groups (i.e., the threshold) was 3.3%. Sentinel had 3 groups, and the difference between the poorest and the best performance was 7.8%. Although the difference was significant (p < 0.05) between S.8 and S.9 (according to Figure 8) in statistical terms, the difference between OA medians was only 1.42%; therefore, the effect (the magnitude of differences) was low. Regarding the other threshold (between S.5 and S.1), the difference of medians was only 0.2. In the case of PlanetScope, differences were more enhanced, as the 10 randomizations formed 4 groups. The differences between the best and worst model were the highest among the satellites, reaching 14.4%; however, differences between the two sides of the thresholds were 2.9% (P.9 vs. P.3), 1.1% (P.4 vs. P.2) and 2.2% (P.8 vs. P.1). A change in the minimums regarding P.4 and P.2, from 51.1% to 46.4%, can explain the statistical difference; the remaining groups, however, had theoretical relevance.

CLC Classes and the Mixture of Spectral Features
CLC maps were visually interpreted, and the reported thematic overall accuracy (OA) was ≥ 85% [56]. However, OA, as a general index, cannot express the errors on the level of categories. Although validation data of photointerpretation is available [57], it is also important to verify whether CLC categories have a direct relation to LC maps derived from satellite images. Although there is only limited information about the class level accuracy on CLC classification, several studies use it as ancillary reference data, e.g., [58,59]. However, other studies used the CLC directly as reference data. Gudmann et al. [19] performed a LULC classification using Sentinel-2 and Landsat-8 images combined with OBIA (object-based image analysis) and the landscape metric approach [60] and found good correspondence. Their study relied on the accuracy assessment with the CLC2018; thus, this research had the most similarity to our work. The OAs of the two study areas were higher in a homogenous area (91%) and almost the same (76%) in the case of the complex one. Although their approach was exactly the inverse of our research question (i.e., they classified images and used the CLC as a reference, while we intended to justify classes of CLC polygons using the reflectance values of satellites), the final result of the classification was almost the same in case of the complex area having the most similarities with our study area. Reyes et al. [61] also used the CLC maps as reference data, and they gained 85.3% OA with OBIA, combining the image classification with landscape metrics. Ceccarelli et al. [62] reported 87-88% OAs for two areas with segmentation and spectral indices and textural information. OAs of our classifications were almost as good as in these studies, but we used the CLC polygons, which was a relevant difference. Verde et al. [63] elaborated a national-level land cover mapping scheme for ecosystem services. They applied superpixel segmentation and, similarly to our approach, used the RF classifier and their best result was an OA of 79%.
Previous studies did not analyze how well CLC polygons are represented by reflectance values of satellite images as LULC classes; instead, pixel or segmentation-based classification was performed on the images. Although these studies had valuable results, their approach was only similar to ours: segments are based on pixel similarity [64], the variance inside the segments is minimal related to the neighboring segments, while CLC pol-

CLC Classes and the Mixture of Spectral Features
CLC maps were visually interpreted, and the reported thematic overall accuracy (OA) was ≥85% [56]. However, OA, as a general index, cannot express the errors on the level of categories. Although validation data of photointerpretation is available [57], it is also important to verify whether CLC categories have a direct relation to LC maps derived from satellite images. Although there is only limited information about the class level accuracy on CLC classification, several studies use it as ancillary reference data, e.g., [58,59]. However, other studies used the CLC directly as reference data. Gudmann et al. [19] performed a LULC classification using Sentinel-2 and Landsat-8 images combined with OBIA (object-based image analysis) and the landscape metric approach [60] and found good correspondence. Their study relied on the accuracy assessment with the CLC2018; thus, this research had the most similarity to our work. The OAs of the two study areas were higher in a homogenous area (91%) and almost the same (76%) in the case of the complex one. Although their approach was exactly the inverse of our research question (i.e., they classified images and used the CLC as a reference, while we intended to justify classes of CLC polygons using the reflectance values of satellites), the final result of the classification was almost the same in case of the complex area having the most similarities with our study area. Reyes et al. [61] also used the CLC maps as reference data, and they gained 85.3% OA with OBIA, combining the image classification with landscape metrics. Ceccarelli et al. [62] reported 87-88% OAs for two areas with segmentation and spectral indices and textural information. OAs of our classifications were almost as good as in these studies, but we used the CLC polygons, which was a relevant difference. Verde et al. [63] elaborated a national-level land cover mapping scheme for ecosystem services. They applied superpixel segmentation and, similarly to our approach, used the RF classifier and their best result was an OA of 79%.
Previous studies did not analyze how well CLC polygons are represented by reflectance values of satellite images as LULC classes; instead, pixel or segmentation-based classification was performed on the images. Although these studies had valuable results, their approach was only similar to ours: segments are based on pixel similarity [64], the variance inside the segments is minimal related to the neighboring segments, while CLC polygons are not segments but products of a visual interpretation. Thus, while segments represent homogenous pixel groups [64] (can be regarded as raw data), CLC polygons show real information about the delineated areas.
The CLC Technical Report [57] confirmed that 78% of all classification errors occurred on Level 2 and 3 nomenclatures; thus, the most simplified level 1 was the most accurate. The report also stated that most misclassifications occurred between "agricultural" and "forest and semi-natural" categories. However, a comprehensive accuracy assessment for the CLC2018 map is not available yet. OA of CLC2012 was 90.5% in the L2 and 89.5% in the L3 nomenclature [65]. We demonstrated the issues identified in a smaller part of the study area (Figure 9) in detail. Water bodies (rivers and lakes) are delineated without aiming to find the edges, i.e., trees on the shorelines are also included in the patches. In our example, an oxbow lake (Figure 9/a1) was shown where the aquatic vegetation cover was relevant due to the eutrophic state of the lake [66], and the PlanetScope image was the most efficient to show it. Aquatic vegetation with the surrounding trees can relevantly bias the statistical parameters of this polygon of the "water" class, which can be observed both in the case of a fishpond (Figure 9/a2) and the Tisza River (Figure 9/a3). The wetland areas #b1 and #b2 in Figure 9 were also oxbow lakes, but in the next phase of the succession, the proportion of open water was lower, and according to the CLC classification, these lakes were interpreted as wetlands (as these smaller lakes can run dry during dry summers [67], the classification can be accepted). The wetland area #b3 ( Figure 9) was a set of swales and point bars (i.e., negative and positive fluvial forms) with different water cover and height of vegetation, where the reflectance pattern can be similar both to grasslands and eutrophic lakes [68,69]. The wetland area #b4 (Figure 9) was also a set of swales and point bars (consisting of two different parts, of which the northern part was the older with transformed forms), while the fifth wetland area (Figure 9/b5) was a floodplain lake with its surroundings (open water with aquatic vegetation and marshlands). This means that there were five wetlands, with four types of appearance, i.e., with different patterns, and suffering from varying errors regarding the right patch contours. In addition, the agricultural areas were mostly pastures within the floodplain (Figure 9/c1-2), but the #c2 patch, for example (in the triangle formed by #b1-#b3-#b5, Figure 9), was also a set of swale-point bar series with lesser water cover related to wetlands #b1, #b2 or #b3, with water patches and mixed with bushes and trees. In dry summers, all wetlands run dry; thus, when this happens, the difference between these pastures and wetlands is small. While #c1 and #c2 were grasslands (coded as 231 in CLC), #c3 and #c4 were intensive agricultural areas (coded as 211 according to the CLC nomenclature), with different spectral features with bare soil among the plants (or even harvested in August), but also consisting of dense plant cultures. Even the forests were different: #d1 was a semi-natural floodplain forest, the patch of #d2 was a plantation with regular rows and smaller density. Settlement patches also represented varying land cover, while #e1 ( Figure 9) was a village (Rakamaz), #e2 was only a resort area with some summer houses and many trees and grassland areas. These issues arose from the fact that CLC in some cases is a land-use map rather than a land cover one; the MMU of 25 ha is also a limitation, and the aim of the mapping is to provide a consistent map of a whole country (and finally of the EU). Accordingly, we cannot state that the CLC-interpretations were wrong but merely different from a satellite-based land cover classification. LC classes can be mixed without the problem of the CLC polygons due to similar pixel values and patterns, but these identified issues can help to understand the mixture of reflectance values of the CLC classes: differences are not significant, and the classifications never reach 85%. The visually interpreted polygons cover different surfaces, and as the MMU is 25 ha, there are patches smaller than the MMU, and these should be merged into larger patches. That was the case with the smaller oxbow lakes along the Tisza and Bodrog Rivers: instead of water bodies, the lakes were categorized as wetlands with their surroundings. Furthermore, interpretation of CLC classes does not require an accurate delineation, and pixels at the edge of neighboring patches can be mixed. The best examples were the rivers and lakes: aquatic vegetation and floodplain forests were mixed into the water category, and this caused higher maximums in the case of the NIR band and NDVI. According to the CLC2018 validation documentation, the overall thematic accuracy is above 85%, and spatial accuracy is better than 100 m for the whole dataset [56], but regarding the categorylevel thematic accuracies, the only available documents refer to other European countries, not Hungary (e.g., Finland: [70]). A possible 100 m shift could cause serious errors with satellite images even with a 30 m spatial resolution related to reference land cover, but our visual analysis showed a very good fit among the CLC and the images; inaccurate patch contours had a larger bias in the analysis.
NDVI layers experienced much less category mixture in the case of minimum, standard deviation and maximum values. Due to the fact that the NDVI is a widely-used robust index for distinguishing main land cover classes [71][72][73][74], it may have become helpful in separating the main CLC categories. Miomir et al. [75] investigated opportunities for a good practice of updating the national forest inventory in Serbia by comparing CLC and NDVI-based solutions based on Landsat images from different years. They performed comparisons both with CLC forest areas (based on various CLC categories) and forest areas derived from NDVI maps to official statistical forest data, where they identified a much larger difference in the case of CLC products. Diaz-Pacheco and Gutiérrez (2014) [76] examined the accuracy of CLC products through the analysis of local study areas focusing on urban land cover classes and found that CLC is not appropriate for local studies but rather for global and regional scale analyses. We also found that in our dataset, the artificial class showed a mixture in the case of minimum values with all other categories and in the case of median values, characteristically with the agricultural class. In order to make more detailed land cover datasets even using CLC, there are initiatives to produce new variations of land cover databases [77,78]. There are examples of the use of the CLC data solely to perform land change analysis [79], where the real thematic accuracy of the The visually interpreted polygons cover different surfaces, and as the MMU is 25 ha, there are patches smaller than the MMU, and these should be merged into larger patches. That was the case with the smaller oxbow lakes along the Tisza and Bodrog Rivers: instead of water bodies, the lakes were categorized as wetlands with their surroundings. Furthermore, interpretation of CLC classes does not require an accurate delineation, and pixels at the edge of neighboring patches can be mixed. The best examples were the rivers and lakes: aquatic vegetation and floodplain forests were mixed into the water category, and this caused higher maximums in the case of the NIR band and NDVI. According to the CLC2018 validation documentation, the overall thematic accuracy is above 85%, and spatial accuracy is better than 100 m for the whole dataset [56], but regarding the category-level thematic accuracies, the only available documents refer to other European countries, not Hungary (e.g., Finland: [70]). A possible 100 m shift could cause serious errors with satellite images even with a 30 m spatial resolution related to reference land cover, but our visual analysis showed a very good fit among the CLC and the images; inaccurate patch contours had a larger bias in the analysis.
NDVI layers experienced much less category mixture in the case of minimum, standard deviation and maximum values. Due to the fact that the NDVI is a widely-used robust index for distinguishing main land cover classes [71][72][73][74], it may have become helpful in separating the main CLC categories. Miomir et al. [75] investigated opportunities for a good practice of updating the national forest inventory in Serbia by comparing CLC and NDVI-based solutions based on Landsat images from different years. They performed comparisons both with CLC forest areas (based on various CLC categories) and forest areas derived from NDVI maps to official statistical forest data, where they identified a much larger difference in the case of CLC products. Diaz-Pacheco and Gutiérrez (2014) [76] examined the accuracy of CLC products through the analysis of local study areas focusing on urban land cover classes and found that CLC is not appropriate for local studies but rather for global and regional scale analyses. We also found that in our dataset, the artificial class showed a mixture in the case of minimum values with all other categories and in the case of median values, characteristically with the agricultural class. In order to make more detailed land cover datasets even using CLC, there are initiatives to produce new variations of land cover databases [77,78]. There are examples of the use of the CLC data solely to perform land change analysis [79], where the real thematic accuracy of the dataset is less important since the changes are derived from the comparison of consecutive CLC datasets.

CLC Classes in the Light of Statistical Tests
The two-way ANOVA revealed that both the L1 level CLC categories and the satellite types had significant differences. This is important, but the interaction effect is also important because if the two main effects have a common influence, differences in surface reflectance values are determined by both, i.e., in the case of blue and green bands, there is no difference; thus, input data from the satellites do not have an influence on the reflectance values delineated by the CLC categories. However, in the case of red, with the NIR and the NDVI, the interaction effect was significant; thus, the satellite data type can be considered important from the perspective of land cover representation. GLM also pointed to the relevance of bands and interactions with the LCs, but there was a magnitude difference in the contributions. In the case of medians, dependent variable bands had an ω 2 of 0.416, while the interaction of bands and LCs was only 0.023. STDs as dependent variables had a similar result, but the ω 2 was 0.515 for the bands and 0.009 for LCs, and there was a ω 2 of 0.005 for the interaction of satellites and bands. This means that medians are less biased by the satellite types, but STD can point to the differences among the sensors. On one hand, this may be the reason for the details determined by the spatial resolution. Especially on Landsat images, details can disappear, e.g., in the case of the oxbow lake ( Figure 9/a1), aquatic vegetation hardly appears, but PlanetScope revealed that there was a relevant amount of floating aquatic vegetation on the water surface. Thus, PlanetScope can show a more realistic picture of the area because of its 3 m resolution. Due to this high spatial resolution, new biasing factors become visible, including the shadows of trees and buildings. On the other hand, the wavelength ranges of the bands also differ according to the satellite. Sentinel-2 has the narrowest band ranges (usually a half or a third compared to PlanetScope), except for the NIR band, where it was almost three times the Landsat's range. As the NIR range is important to identify vegetation, it can make Sentinel-2 better in finding forest and agricultural areas, which confirms the results of a class level accuracy assessment ( Figure 6) with high PA and low UA values. Sentinel-2 data resulted in a lower error of omission and a higher error of commission regarding the vegetation, as under similar circumstances, it collects more photons than other sensors in the NIR range, influencing the outcomes of classifications. However, the effects of spatial resolution and the wavelength ranges cannot be separated with statistical tests: ranges have only 12 combinations (3 satellites × 4 bands), which is equivalent to the results of the 2-way ANOVA test and GLM. Both tests justified the common effect (interaction), but this proves that the bands involved (RGB + NIR) have different effects according to satellite, and the effect derives from the different spatial resolution and the band reflectance values, which also vary by the wavelength ranges.

CLC Classes and Classification Algorithms
Based on the classifications, the OAs reflected the fact that~70% accuracy can be reached using five level 1 CLC categories. LDA as a classifier with several presumed prerequisites did not perform relevantly worse than the robust RF, at least regarding the same satellite data: differences were even below 3% in the worst case (Landsat) for all descriptive statistics except for the minimum (LDA's minimum was 7% worse than RF's). Furthermore, LDA, although it did not have all prerequisites ensured, outperformed the RF in the case of PlanetScope data's classification. We experienced several cases in which LDA's performance was high in the OA level, but on a class level, it was not so effective [48,51], and the error of commission was especially high. Moreover, LDA had a worse performance with Sentinel and Landsat data, and it reflected the general observations that RF performs well with any kind of training data [80][81][82]. The repeated k-fold cross-validation was a powerful tool to assess classification efficiency, as 30-50 models can provide enough data to have a distribution of the outcomes and to determine the range of outcomes' accuracy.
In our case, the numerous models indicated that in the case of LDA, the worst models could perform with an OA of at least 59.1%, while in the case of RF, the lowest accuracy was 65.1%. This 6% difference can be important; therefore, we suggest using the robust RF for this kind of classification.
Better OAs belonged to higher geometric resolution: 71.1% was the median OA for Landsat and 78.5% for the PlanetScope when we applied the overlapping spectral bands. However, both Landsat-8 and Sentinel-2 have more bands in the infra-red range, and involving these bands into the classification, OAs increased, and Sentinel-2 overperformed the best OA of PlanetScope, i.e., better spectral resolution helped to overcome the disadvantage of coarser geometric resolution. Szabó et al. [82] and Underwood et al. [83] also came to a conclusion to the importance of spectral information, but with the remark of aims of given studies and the involved images (i.e., both hyperspectral and Landsat-8 images can provide eligible outcomes for given purposes, but a very high-resolution hyperspectral image is always a better input for specified tasks).
Randomized sampling from the whole dataset resulted in worse classifications; minimum OAs were between 40 and 44%, medians between 60 and 66% and maximums between 82 and 89%. However, the rank changed compared to the previous results when all data had been used for the classifications: PlanetScope data provided the lowest accuracies, and the Landsat, with coarser spatial resolution, had the best results. The difference between the results was the number of cases in the input data; the omission of the water category could also have an effect on the accuracies and the fact that only 50 data were used per category to train the models. Water body is a usually well distinguishable land cover class [84][85][86], but due to the low number of cases, we excluded it from the analysis, which could reduce the OA by having fewer true positive polygons. The other possible reason is that the different spatial resolutions could also have a bias through the number of pixels involved in a given CLC-category: statistical parameters were calculated by polygons with 9 times more data with the Sentinel and with 100 times more with the PlanetScope satellites compared to Landsat. As previous studies found that finer resolutions should contain spectrally clearer pixels, which have a lesser influence on heterogeneous areas [1,87,88], we can suppose a considerable effect for the missing water category and the lower number of training polygons.
UA and PA values can provide further clues, highlighting that there were large commission errors in the case of agricultural areas and large omission errors in wetlands and artificial areas, which corresponded to the statistical analysis of spectral bands (Section 3.1). Both types of errors can be understood when we are aware of the advantages of visual interpretation: interpreting personnel can make decisions based on several other options than the spectral features, such as the extent/size, relative situation, texture, and pattern of a given object. In this study, we were able to involve only the spectral characteristics, and the classifications were within an acceptable range. There can be relevant intermixing among agricultural areas, grasslands, and forests, and even with wetlands in special situations [36]. Furthermore, artificial surfaces contain built-in areas (buildings, roads, pavements, parking lots), parks (relating to forests or grasslands) and even water features (fountains and lakes, sometimes streams, rivers), so this category can intermix with several natural categories, too. Spectral features are naturally similar in these spectral ranges; thus, we cannot expect clear discrimination. In this study, we supposed that better resolution provides better classifications, but it was true only in the case of OAs; according to the two-way ANOVA conducted on the UA and PA, the satellite type (i.e., the spatial resolution and the different wavelength ranges of the bands) had no influence on the class level accuracy measures of the final products. This finding can only be explained by the visual interpretations and limitation of the MMU size.
Regarding the variables, RFE with the repeated k-fold cross-validation effectively pointed to the relevance of medians (the first four variables in the models) and that green and red bands were more important than NIR and NDVI. Similarly, Chiang and Valdez (2019) [89] also found that green and red bands of Landsat-8 were important in identifying tree species. However, Puletti et al. [90] found that blue and green bands of higher wavelengths, red-edge and NIR, were the most important in the classification of Mediterranean forests. The source of the differences may be that we applied a method, which can be regarded as a special segmentation approach with a different statistical-based technique; furthermore, Puletti et al. [90] applied several dates to gain the best accuracy, while in our study, we used only one. Grabska et al. [91] also found that red-edge, blue and green bands were among the first four most relevant variables. Both studies dealt with forests, and in our case, a further reason for the difference can be found in the categories: artificial areas, agricultural areas, wetlands, water bodies require different sets of involved bands. Sentinel performed 3-6% better when involving all possible bands (red edge and middle-infrared), but in the case of Landsat, the increase of OAs was only 1-2%.
A final major question is how the results can be generalized. Our randomization with 10 randomized sub-samplings showed that not all classifications were similar. The ANOVA model using Tukey's test as a post hoc test revealed that 3 of the 10 models were different in both Landsat and Sentinel, which may be the basis for concluding a possible generalization. Although statistical tests can find significant differences, the magnitude of the differences may be small (i.e., the differences are not relevant; [92]). Furthermore, Tukey's test is more liberal (i.e., it finds more statistical differences than other post hoc tests [93]). In this study, median differences between the groups of accuracies were only 1-2%; thus, statistics did not show a real picture. In the case of PlanetScope, models can be divided into four groups; nevertheless, with no great differences in the OAs, two groups can be distinguished with a relatively larger (4%) difference. Thus, the results suggest that PlanetScope's reflectance values can provide different accuracies when different training datasets are used; CLC polygons and the generalization were not as successful as in the case of Landsat's and Sentinel's data.
Validation of remotely sensed data using CLC datasets as a reference may be a misleading step due to the conclusions drawn in contemporary literature concerning the usage recommendations for the data. Our results revealed that the main categories of CLC could not have been discriminated spectrally from each other in satellite images. However, by paying attention to the descriptive statistics, it is possible to involve this information in a classification that takes pixel statistics into consideration [94,95], and it may provide helpful information to know the mixture of categories in the image we intend to use as an input for the classification. Accordingly, it is critical to know the possible mixing behavior of the categories in order to achieve a successful classification.

Conclusions
We analyzed the polygons of the CLC2018 database with respect to pixel information derived from PlanetScope, Sentinel-2 and Landsat-8 images. We calculated median, minimum, maximum and standard deviation for all polygons of the CLC2018 concerning the visible and near-infrared bands and a calculated NDVI layer for the three satellites.
-Medians of CLC polygons provided the least mixture among the LC classes, while the maximums were the worst input parameters without significant differences. Wetlands and water bodies categories were the most frequently mixing categories of CLC based on reflectance values; -Bivariate statistical tests cannot provide enough information to conclude on the spectral separability of LC classes, but classification algorithms involving several variables can be efficient techniques. Generally, LDA and RF classifiers had similar OAs, but in the case of coarser resolutions (Sentinel and Landsat), RF outperformed the LDA. Data derived from PlanetScope provided 7% better OAs (78%) than those of Landsat (71%) regarding the model medians; thus, better spatial resolution ensured better classification performance. >80% OA was gained with using all available bands of the Sentinel-2; accordingly, more spectral information in the infra-red range can counterbalance the coarser geometric resolution; -We applied a randomization-based technique to gain 10 repetitions of class-level metrics (UA and PA), which showed that satellites had no direct effect on the accuracy. UAs were the lowest in agricultural areas, while PAs were the lowest among wetlands; -Variable importance of statistical parameters showed that usually the medians were the most important statistical layers, and the green, red and near-infrared bands were the first three most important bands; - We provided an approach to prove the possibility of the generalization of the results with multiple randomized subsampling and found that the results of Landsat and Sentinel data can be generalized, but in the case of PlanetScope, a larger area with more CLC polygons would be desirable; -Generally, using the overlapping bands (RGB + NIR) of Landsat-8, Sentinel-2 with the PlanetScope, the best OAs were >70% OAs, but the most accurate was the PlanetScope with the highest spatial resolution (78.5%). Higher OAs (~80%) have also been acquired with the higher spectral accuracy of the Sentinel-2, which means a cost-efficient solution in spite of the coarser spatial resolution; - As we found several studies where CLC maps were used as ground truth data to quantify thematic accuracy, 70-80% OAs do not seem satisfactory. Nevertheless, our experiment was performed with the CLC L1 classes; further investigations can reveal if CLC is more appropriate for ground truth with the more detailed L2 or L3 nomenclature.
CLC maps have many features limiting their use as validation data. The 25 ha MMU size creates several issues, which influence validation by satellite images: land cover polygons do not follow the exact borders of patches, smaller objects are merged into larger ones to reach the 25 ha area, and even the supposedly homogenous patches also have a nonhomogeneous texture (e.g., lakes, rivers).

Conflicts of Interest:
The authors declare no conflict of interest.