1. Introduction
Natural and ecological processes extend beyond the limited, human-defined space delineated by administrative boundaries. Feeding models and scenarios therefore demands standardized products that go beyond locally generated land cover products. For example, tele-couplings (e.g., in the form of large-area acquisitions or climatic changes) are discussed and measured across the globe for their consequences on land use and local societies, along with their value return to the global market [1]. Activities on land and sea increasingly depend upon frequently updated, high-quality land cover products.
Recent advances in the frequency and accessibility of data provision by the global scientific community, progress in Earth Observation techniques, and big data handling and processing [2] have enabled the generation of numerous Continental and Global Land Cover (C/GLC) products with increasing spatial resolution and frequency. These developments are accompanied by issues and challenges, mainly in data interpretation and categorization, as well as in the delineation and exact location of surface objects. Compatibility and interpretation issues are already being addressed by working groups, such as the European Environment Information and Observation Network (EIONET) Action Group on Land monitoring in Europe (EAGLE). At the same time, global exercises and fora are being initiated by the C/GLC producers themselves, in coordination with international remote sensing associations, in order to validate the land cover products locally and regionally. Such validation is necessary to account for the products’ dependence on a huge variety of geographical and climatic conditions and to enhance credibility towards policy makers, stakeholders, and entrepreneurs. Confidence needs to be built up.
C/GLC maps represent the most important sources of cumulative and homogenized information about the surface of the earth and are used in several policy and scientific applications, such as environmental monitoring, water monitoring, biodiversity, urban planning, and change detection of global land cover [5]. Several C/GLC maps exist, such as IGBP-DISCover, GlobCover, MODIS GLC, LC-CCI and FROM-GLC [4]. Currently, many organizations produce C/GLC maps with higher resolution, namely the Land Cover-CCI (LC-CCI) maps at 300 m, GIO High Resolution Layers at 20 m, and Globeland at 30 m [7]. They have all been produced by remote sensing analysis using various optical data and methods. However, they are produced as independent datasets with different class hierarchies and varying semantic similarity between classes, and considerable disagreement among them has been reported as a consequence [
Comparative accuracy assessment of C/GLC maps, either one against another or against very high resolution ground data, is crucial but challenging because of the lack of reference data. Several studies have assessed C/GLC products to analyze their weaknesses and strengths [11]. A few studies compared accuracy estimates by harmonizing confusion matrices, but it remains unclear how they compared the C/GLC maps with the same reference dataset [8]. Reference datasets suitable for multiple maps have been developed and used for the validation of C/GLC maps [12]. However, although these studies report the spatial agreement between C/GLC maps, they neither compare the very recent high resolution products nor estimate confidence levels during validation.
Triggered by the aforementioned challenges and recent developments, this study presents a quantitative and qualitative evaluation and inter-comparison of three recently produced C/GLC products: CORINE Land Cover 2012 (CLC2012), GIO High Resolution Layers (HRLs) and Globeland30 (GLC30). The focus was on a representative landscape of the Northern Mediterranean basin, the area of Thessaly in Greece. The confidence levels of the experts were incorporated during the validation through a weighted overall accuracy assessment using manually annotated reference data derived from existing Google Earth images. In addition, the type of reported errors among the semantic classes is discussed, revealing further qualitative aspects of the considered C/GLCs.
3. Experimental Results and Validation
In this Section, results from the validation of the three LC datasets against the manually annotated reference data are presented. In particular, Section 3.1 presents the results of the accuracy assessment and the weighted accuracy metrics for all three products. Then, in Section 3.2, the contribution of the confidence level that the experts assigned to each interpretation is analyzed. Lastly, an inter-comparison between the three C/GLC products is performed in Section 3.3 by estimating the overlapping fraction of corresponding classes for each pair of products.
3.1. Validation Against the Reference Data
The three LC datasets were validated against the reference data through a quantitative accuracy assessment procedure. The confusion matrices for all three datasets against the manually annotated reference data of all confidence levels are presented in Table 3, Table 4 and Table 5. As can be observed in the error matrices of all products, OA rates higher than 89% were recorded for the majority of samples, which were annotated with confidence level #1, while the OA for samples of lower confidence levels exceeded 71%.
By integrating all confidence levels based on the proposed methodology (see Section 2.3.5), the resulting weighted OA reached 89% for CLC2012, 90% for the HRLs and 86% for GLC30, while the weighted kappa coefficient was estimated at 0.81, 0.79, and 0.74, respectively. Additional insight into the performance of specific LC classes is provided by the weighted UA and PA rates of each class, presented in Figure 2. In particular, one can observe that in all cases the Agriculture and Forest classes had PA and UA above 85%, indicating high accuracy and reliability, respectively.
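The weighted OA and kappa reported above can be illustrated with a minimal sketch. It assumes that each reference sample contributes its confidence weight, rather than a count of one, to the confusion matrix, and that kappa is then computed on that weighted matrix. The weight values and the six samples below are hypothetical; the weighting scheme actually used is the one defined in Section 2.3.5.

```python
from collections import defaultdict

# Hypothetical confidence weights (level 1 = most certain).
# These values are an assumption made for illustration only.
WEIGHTS = {1: 1.0, 2: 0.5, 3: 0.25}


def weighted_confusion(samples, weights=WEIGHTS):
    """Accumulate sample weights into matrix[(reference, mapped)]."""
    matrix = defaultdict(float)
    for reference, mapped, level in samples:
        matrix[(reference, mapped)] += weights[level]
    return dict(matrix)


def overall_accuracy(matrix):
    """Share of the total weight lying on the matrix diagonal."""
    total = sum(matrix.values())
    agree = sum(v for (ref, mapped), v in matrix.items() if ref == mapped)
    return agree / total


def kappa(matrix):
    """Cohen's kappa computed on the (weighted) confusion matrix."""
    total = sum(matrix.values())
    classes = {c for pair in matrix for c in pair}
    p_observed = overall_accuracy(matrix)
    p_expected = 0.0
    for c in classes:
        row = sum(v for (ref, _), v in matrix.items() if ref == c)
        col = sum(v for (_, mapped), v in matrix.items() if mapped == c)
        p_expected += (row / total) * (col / total)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy samples: (reference class, mapped class, confidence level).
samples = [
    ("Forest", "Forest", 1),
    ("Forest", "Forest", 1),
    ("Agriculture", "Agriculture", 1),
    ("Agriculture", "Forest", 2),
    ("Water", "Water", 1),
    ("Water", "Agriculture", 3),
]

matrix = weighted_confusion(samples)
print(round(overall_accuracy(matrix), 3))  # 0.842
print(round(kappa(matrix), 3))             # 0.753
```

In this toy case the unweighted OA would be 4/6, whereas the weighted OA is 4/4.75, because both errors fall on samples of lower confidence.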
For CLC2012, the PA was 85% for Artificial Surfaces and 100% for Water; however, lower UA rates were recorded, i.e., 67% and 79%, respectively. GLC30 presented relatively high accuracy for the class Artificial Surfaces in both the PA (74%) and UA (75%) metrics. However, GLC30 presented misclassification cases for the class Water: the majority of the reference data samples (six out of eight) were classified as Agriculture (Table 5), resulting in a weighted PA of 26%. For HRLs, several reference data samples of Artificial Surfaces were attributed to the class Unclassified, leading to a relatively low PA rate of 55%. For the class Water, the validation of HRLs resulted in a UA rate of 100% and a PA rate of 74%.
In Table 6, the highest weighted scores recorded for the PA and UA rates are presented per LC class. As can be observed, in five out of eight cases the CLC2012 product validation achieved the highest values.
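For reference, the per-class PA and UA discussed in this Section can be derived from a confusion matrix as sketched below. The function name and the nested-dictionary layout are illustrative choices. The Water row mirrors the six-out-of-eight omission case reported for GLC30, while the Agriculture count is invented to complete the example. The same function applies unchanged to a matrix of accumulated confidence weights.

```python
def producers_users_accuracy(matrix):
    """Return {class: (PA, UA)} for matrix[reference][mapped].

    PA = correctly mapped samples / all reference samples of the class.
    UA = correctly mapped samples / all samples mapped to the class.
    """
    classes = list(matrix)
    result = {}
    for c in classes:
        correct = matrix[c].get(c, 0)
        reference_total = sum(matrix[c].values())
        mapped_total = sum(matrix[r].get(c, 0) for r in classes)
        pa = correct / reference_total if reference_total else None
        ua = correct / mapped_total if mapped_total else None
        result[c] = (pa, ua)
    return result


# Illustrative two-class matrix (rows: reference, columns: map).
matrix = {
    "Water": {"Water": 2, "Agriculture": 6},
    "Agriculture": {"Water": 0, "Agriculture": 40},
}

rates = producers_users_accuracy(matrix)
print(rates["Water"])  # (0.25, 1.0)
```

The Water result shows how a class can combine a low PA (many omissions) with a perfect UA (no commissions), which is the pattern reported for Water in GLC30.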
3.2. The Contribution of the Confidence Indicator
In this Section, the OA rates are compared with the weighted OA rates, i.e., without and with taking into account the confidence level that the expert assigned to every reference data point during the image interpretation. In Table 7, the OA rates per confidence level for all LC maps are presented along with the number of samples per confidence level. The last two rows present the OA and the weighted OA rates. The latter are 1–3% higher than the former. In particular, as the confidence level decreases, the OA rate decreases too. This is to be expected, since annotations with lower confidence levels, which indicate difficulty in the labelling decision, usually involve particular regions, terrain types, and complex land cover/use cases, most probably lying on the border between two LC classes, and are therefore more often associated with classification errors.
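The relation between the per-level OA rates and the weighted OA can be sketched as follows. The sketch assumes that the weighted OA is the weight-and-count-weighted mean of the per-level OA rates; the weights and samples are hypothetical and do not reproduce Table 7.

```python
def oa_by_confidence(samples):
    """Return {level: (n_samples, OA)} from (reference, mapped, level) tuples."""
    counts = {}
    for reference, mapped, level in samples:
        n, hits = counts.get(level, (0, 0))
        counts[level] = (n + 1, hits + (reference == mapped))
    return {level: (n, hits / n) for level, (n, hits) in counts.items()}


def combine(per_level, weights):
    """Weighted OA as the weighted mean of the per-level OA rates."""
    numerator = sum(weights[l] * n * oa for l, (n, oa) in per_level.items())
    denominator = sum(weights[l] * n for l, (n, _) in per_level.items())
    return numerator / denominator


# Hypothetical samples: 10 at level 1 (9 correct), 4 at level 2 (2 correct).
samples = (
    [("Agriculture", "Agriculture", 1)] * 9
    + [("Agriculture", "Forest", 1)]
    + [("Forest", "Forest", 2)] * 2
    + [("Forest", "Agriculture", 2)] * 2
)

per_level = oa_by_confidence(samples)
plain_oa = combine(per_level, {1: 1.0, 2: 1.0})     # equal weights: 11/14
weighted_oa = combine(per_level, {1: 1.0, 2: 0.5})  # assumed weights: 10/12
```

With equal weights the function reduces to the standard OA. Down-weighting the less certain level raises the result here because that level also has the lower OA, which matches the direction of the effect described above.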
Similar conclusions are derived when comparing the resulting PA and UA rates per class for the validation of all products. In Table 8, one can also observe the differences between the standard and weighted PA and UA. In most cases, the weighted PA and UA are around 1–4% higher. The classes Water and Agriculture present the smallest differences. A greater difference occurs for the UA of Artificial Surfaces in the CLC2012 product validation, with a UA rate of 58% and a weighted UA rate of 67%. This difference of nine percentage points arises because a rather large number of Agriculture and Forest samples in the reference dataset (Table 3), characterized with a confidence level of 2, were annotated as Artificial Surfaces in the CLC2012 map. The contribution of these errors decreased when the weighted UA metric was calculated, since these confidence level #2 observations are given a reduced weight in the calculation.
3.3. Inter-Comparison between the C/GLC Products
Apart from the comparison with the truth (reference data), useful insights into the studied products can be derived from their inter-comparison. To this end, the agreement between the three C/GLC products was assessed by comparing the overlapping fractions for the studied classes.
In Figure 3, the resulting images for the three difference maps, i.e., CLC2012-GLC30, CLC2012-HRLs, and GLC30-HRLs, are presented, along with the proportional fraction of pixels attributed to the examined class: (i) in both products, (ii) only in the first product and (iii) only in the second product.
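The three fractions listed above can be computed per class as in the following sketch. It assumes that both products have already been resampled to a common grid and harmonized to a common legend, and that the fractions are expressed relative to the union of pixels assigned to the class in either product. The function name and the toy pixel labels are hypothetical.

```python
def overlap_fractions(first, second, target):
    """Fractions of `target` pixels found in both, only first, only second.

    `first` and `second` are equally long sequences of per-pixel class
    labels. Fractions are relative to the union of pixels labelled
    `target` in at least one product (an assumption of this sketch).
    """
    both = only_first = only_second = 0
    for a, b in zip(first, second):
        in_first, in_second = a == target, b == target
        if in_first and in_second:
            both += 1
        elif in_first:
            only_first += 1
        elif in_second:
            only_second += 1
    union = both + only_first + only_second
    if union == 0:
        return None  # class absent from both products
    return both / union, only_first / union, only_second / union


# Toy example with six co-registered pixels.
product_a = ["Forest", "Forest", "Forest", "Agriculture", "Water", "Agriculture"]
product_b = ["Forest", "Forest", "Agriculture", "Forest", "Agriculture", "Agriculture"]

print(overlap_fractions(product_a, product_b, "Forest"))  # (0.5, 0.25, 0.25)
```

Under this definition the three fractions of a class always sum to one, so any two of them determine the third.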
The comparison between CLC2012 and GLC30 reveals a high percentage of agreement between the two products for the Agriculture class (91%) and a fairly high rate of 67% for the Forest class. These classes also scored high PA and UA rates of above 85%, as analyzed in the previous paragraphs. Still, 28% of all Forest pixels are attributed to the Forest class only in GLC30. These areas, located in the northern part of the study area, are annotated as classes of the Level-2 Scrub and/or Herbaceous Vegetation Associations category in CLC2012. Similarly, 29% of all Artificial Surfaces pixels are attributed to this class only in GLC30, while both products share a common Artificial Surfaces area of 54%. The comparison for the Water class recorded a shared fraction of only 26%. This can also be linked with the low PA rates for this class in GLC30 (see Section 3.1), since, as one can observe in Figure 3, this product annotates Lake Karla as Cultivated Land.
The comparison between CLC2012 and HRLs presents high discrepancy for the class Artificial Surfaces (only 38% shared fraction). Many disagreement cases are mainly observed as a result of the scale difference between the two products, but can also be attributed to semantic differences between the corresponding classes and subclasses of Imperviousness (HRLs) and Artificial Surfaces (CLC2012). Furthermore, regarding the Forest class, 43% of all Forest pixels are characterized as Forest only in the HRLs product. These dissimilarity cases are mainly located in areas where CLC2012 subclasses of the Level-2 category Scrub and/or Herbaceous Vegetation Associations can be found. For the Water class, a shared fraction of 57% is recorded between the two products, since, as can be observed in the map, the northwestern part of Lake Karla has not been attributed to the Water class in the HRLs product. Conversely, only a small percentage (7%) is characterized as Water only in the HRLs product, which can be attributed to the detection of narrow linear parts of water courses that may not be recorded in the coarser 100 m CLC2012 product.
The difference images for the comparison between GLC30 and HRLs also present high disagreement for the class Artificial Surfaces (only 36% shared fraction), which can also be attributed to the semantic and methodological differences between the corresponding classes of the two products. As in the previous comparison, regarding the Forest class, 31% of all pixels are characterized as Forest only in the HRLs product. The areas of difference are mainly located in regions characterized in the GLC30 product as Cultivated Land and Shrubland. Lastly, the disagreement recorded for the Water class (only 37% shared fraction) is associated with the omission of Lake Karla from the GLC30 product.
4. Discussion
Although various validation frameworks for the qualitative and quantitative assessment of C/GLCs have been presented in several studies [29], they neither used a common reference layer for the comparison nor incorporated a confidence level during the reference data production. Regarding the sampling design strategy, there are several arguments for increasing the sample size as much as possible in relation to the reference layer. The spatial extent of an area is the crucial determining factor, as a larger extent is associated with higher spectral variability, especially in view of the image mosaicking required and the bidirectional reflectance distribution effects. Research studies on optimal sampling design and size determination [23] guided the present work towards a cost-effective stratified sampling approach. Thus, the four Level-1 CLC2012 classes were broken down into their 25 Level-3 constituents present in the study area, to account for the spectral variability of the four prevailing classes. In this way, the percentiles of the extent of each of the 25 subclasses defined the variation and number of samples assigned to the four Level-1 classes, approximating the existing spectral variability through the occurrence of the subclasses.
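One way to picture such an allocation is a sample budget distributed over the Level-3 subclasses in proportion to their extent. The sketch below uses largest-remainder rounding. Both the proportional rule and the rounding method are assumptions made for illustration, not a description of the exact procedure followed, and the areas and sample budget are hypothetical.

```python
def allocate_samples(areas, total_samples):
    """Distribute `total_samples` over strata in proportion to their area.

    Uses largest-remainder rounding so that the allocation sums exactly
    to `total_samples`. `areas` maps a subclass code to its extent.
    """
    total_area = sum(areas.values())
    quotas = {k: total_samples * a / total_area for k, a in areas.items()}
    allocation = {k: int(q) for k, q in quotas.items()}
    leftover = total_samples - sum(allocation.values())
    # Hand the remaining samples to the strata with the largest remainders.
    by_remainder = sorted(
        quotas, key=lambda k: quotas[k] - allocation[k], reverse=True
    )
    for k in by_remainder[:leftover]:
        allocation[k] += 1
    return allocation


# Hypothetical extents (arbitrary units) for four Level-3 subclass codes.
areas = {"211": 520, "311": 290, "112": 140, "512": 50}
allocation = allocate_samples(areas, 10)
print(allocation)  # {'211': 5, '311': 3, '112': 1, '512': 1}
```

A purely proportional rule can leave very small subclasses without samples, so a minimum number of samples per subclass may be needed in practice.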
Moreover, incorporating the confidence level of the image interpreter in this process took boundary effects into account in a weighted and non-negligible way, thus bringing the assessment closer to reality. Using solely the points of certain location would have biased the result towards a not fully objective judgement, producing a higher OA for the three studied layers, e.g., 95, 91, and 90% instead of 89, 90, and 86% (see Table 3, Table 4, Table 5 and Table 7). These reported quantitative results are similar to the ones presented in [26] and contrast with the relatively lower OA rates (<50%) presented in [29]. The integration of the expert’s confidence into the reference data annotation procedure has already been proposed and highlighted in the literature as a good practice for accuracy assessment [30]. Low, moderate and high confidence rates were used in the analysis in order to subset the results by confidence [31]. Visual interpretation proved here to depend strongly on the interpreter, as also stated in similar studies [30]. Therefore, for every sample interpreted with a lower confidence level, or for cases where the two experts disagreed on the labelling, a second round of interpretation took place in order to reach consensus. It should also be noted that most of the automatically and randomly selected samples were annotated with high or medium confidence levels, and only 5% of the samples were annotated with the lowest confidence level #3. Similar studies have indicated that OA results from a confusion matrix should be interpreted with caution, as the matrix records the degree of agreement between the reference data and the map data, which are in some cases less than perfect [32]. It is therefore suggested here that the integration of confidence levels during annotation can effectively address these concerns.
Producers’ and users’ accuracy rates from the weighted accuracy assessment, along with the direct pairwise inter-comparison of the products, indicated that the classes Agriculture and Forest scored high in accuracy (PA) and reliability (UA) while also presenting the highest overlapping fractions between the products. Although the class Water presented a high level of reliability for all products, omission errors in the GLC30 product for the water body of Lake Karla were reported and further highlighted by the difference images of the inter-comparisons. It should be noted that Lake Karla was listed among the shallow lakes in Greece up to 1962, when it was completely drained to gain land for agriculture [33]. Lake Karla’s restoration project was launched in 2000, while its refilling started in 2009 [34]. Google Earth and Landsat imagery close to the raw image acquisition date for the GLC30 product (July 2009) show the lake covered with water over most of its current extent. The weighted accuracy assessment and inter-comparisons also indicated many commission errors and low overlapping fractions for the class Artificial Surfaces. The semantic definition of this particular class differs among the corresponding classes of the different products. In particular, HRLs include only sealed, impervious areas, while the other two products also account for open sites, e.g., mines and quarries, and partially vegetated areas.
5. Conclusions
The validation procedure proposed in this study provides a qualitative and statistical outcome comparing the accuracy of three C/GLC products in detecting artificial, forest, agriculture, and (inland) water land cover classes for a study area in Thessaly, central Greece. The experts’ confidence level per sample was recorded during reference data annotation and integrated in the evaluation process through a weighted accuracy assessment procedure. Validation results recorded high OA rates of 89, 90, and 86% for the CORINE Land Cover 2012, GIO High Resolution Layers and Globeland30 datasets, respectively. Further analysis of the reported PA and UA identified certain classes, i.e., Agriculture and Forest, as the most accurate and reliable; these classes also had the highest overlapping fractions in the inter-comparison. Lower accuracy rates were recorded for Artificial Surfaces, most probably due to differences in spatial resolution and semantic definitions. The main conclusion of this work is that different quality aspects of the considered LC maps were noted and highlighted more transparently and objectively as a result of the employed confidence levels per sample, the stratified sampling and the weighted OA calculation.