Map of Land Cover Agreement: Ensambling Existing Datasets for Large-Scale Training Data Provision

: Land cover information plays a critical role in supporting sustainable development and informed decision-making. Recent advancements in satellite data accessibility, computing power, and satellite technologies have boosted large-extent high-resolution land cover mapping. However, retrieving a sufﬁcient amount of reliable training data for the production of such land cover maps is typically a demanding task, especially using modern deep learning classiﬁcation techniques that require larger training sample sizes compared to traditional machine learning methods. In view of the above, this study developed a new benchmark dataset called the Map of Land Cover Agreement (MOLCA). MOLCA was created by integrating multiple existing high-resolution land cover datasets through a consensus-based approach. Covering Sub-Saharan Africa, the Amazon, and Siberia, this dataset encompasses approximately 117 billion 10m pixels across three macro-regions. The MOLCA legend aligns with most of the global high-resolution datasets and consists of nine distinct land cover classes. Noteworthy advantages of MOLCA include a higher number of pixels as well as coverage for typically underrepresented regions in terms of training data availability. With an estimated overall accuracy of 96%, MOLCA holds great potential as a valuable resource for the production of future high-resolution land cover maps.


Introduction
In today's world, precise and comprehensive land cover (LC) mapping is becoming increasingly crucial for sustainable development and well-informed decision-making. Beyond its relevance in climate studies [1], LC information finds utility in other fields as well. For instance, in ecology, LC data aids in estimating habitat fragmentation and predicting International Union for Conservation of Nature (IUCN) Red List categories for species [2]. Additionally, LC serves as a crucial variable in hydrological investigations, as exemplified by studies conducted in the upper Crepori river basin in Brazil and the Gumara catchment in Ethiopia [3,4].
The applications of LC data extend to monitoring various phenomena across different regions. Examples include monitoring the desertification process in the Qubqi desert in China [5], tracking urbanization progress in Abuja, Nigeria [6], and observing agricultural expansion in the Mato Grosso state of Brazil [7]. These cases illustrate the diverse range of uses for LC data in monitoring and understanding our changing environment.
The inclusion of open data policies by some providers of satellite imagery has undeniably accelerated the progress of LC mapping [8][9][10]. This favorable development, along with advancements in computing capabilities and satellite technologies, has made significant contributions to the field. Nevertheless, there are still persistent challenges in the domain of LC mapping.
To fully harness the potential of satellite Earth observation resources for land cover mapping, it is crucial to address a significant challenge: the availability of appropriate In this paper, we present the training benchmark dataset that was generated by borrowing two concepts of training sample generation techniques: reuse of existing data [31][32][33][35][36][37] and consensus among multiple annotators in the case of photo interpretation [23,41]. During the human labeling of training samples, human error is often mitigated by having multiple annotators. If there is no consensus among them, or at least among the majority, the sample is rejected.
Adhering to the above principles, we reuse existing HRLC datasets, but only those portions in which there is exclusive consensus among multiple datasets. From a practical standpoint, multiple HRLCs are combined using the intersection method to retain only the areas where all datasets agree on LC classes while disregarding areas of disagreement. Accordingly, the dataset obtained is named Map Of Land Cover Agreement (MOLCA). The main purpose of MOLCA is to serve as a reference training dataset, from which to extract samples that are functional for the creation of new HRLC maps. MOLCA was designed to provide training samples mainly for large-scale HRLC mapping using ML and deep learning techniques, which typically demand extensive training data for satellite imagery classification. This dataset was produced within the Climate Change Initiative HRLC (CCI HRLC) project of the ESA. MOLCA has 117 billion 10 m pixels (11.7 million km 2 ) distributed over an area of 19 million km 2 . Classes included in the MOLCA legend are Bareland, Builtup, Cropland, Forest, Grassland, Shrubland, Water, Wetland, and Permanent ice and snow, which depicts LC in the period between 2016-2020. The accuracy estimate of MOLCA shows an OA of 96%. MOLCA offers distinct advantages over alternative methods of training data collection, including a substantially larger number of available pixels and coverage for regions that are frequently underrepresented in existing benchmark training datasets, such as Africa and Siberia [17,[42][43][44][45][46][47][48][49]. Standing on the analyzed literature, MOLCA outperforms other existing open-access training datasets in terms of spatial coverage and precautions taken to ensure a high level of accuracy, due to the consideration of multiple-instead of individual-HRLC maps for the generation of training samples. These key features of MOLCA are promising to foster its use in future HRLC map production. Furthermore, the availability of MOLCA as open data further enhances its potential for widespread use.
The structure of this paper is as follows: Section 2 outlines the region considered for MOLCA generation, input datasets, data generation concepts and methodology, and the validation approach. Section 3 presents statistical information and the accuracy evaluation of the generated dataset. The analysis and interpretation of the results are discussed in Section 4, while the concluding remarks are provided in Section 5.

Materials and Methods
MOLCA was produced in the context of the CCI HRLC project of the ESA. The region of interest for the project encompasses three macro-regions of the world: Amazon, Siberia, and Sub-Saharan Africa (see Figure 1).
The region of interest extends over 19,163,868 km 2 -4,526,839 km 2 in the Siberia, 6,203,824 km 2 Amazon, and 8,433,205 km 2 Sub-Saharan macro-region. The objective of the CCI HRLC project was to determine the impact of the increased spatial resolution of land cover data on climate models. Besides the selected regions being only partially represented by existing LC training datasets, they are also landmarks for climate change. For these reasons, they were selected for the first implementation of MOLCA.

Input Datasets
In the derivation of MOLCA, multiple global HRLCs were used as common input across all three regions of interest. However, within each region, an additional regional HRLC was incorporated into the MOLCA computation. The global datasets employed included two general-purpose HRLCs, namely FROM-GLC and GL30, along with two thematic HRLCs specific to the built-up class (WSF and GHS BU Sentinel-1-GHS BU S1NODSM), one thematic HRLC for water (GSW), and one thematic HRLC for forests (FNF), as indicated in Table 1. As for the regional HRLCs, MapBiomas was used for the Amazon region, CCI Africa Prototype for Africa, and ESA DUE (Data User Element) GlobPermafrost for Siberia. All regional datasets fall under the general type. The baseline year for these datasets ranged from 2016 to 2020, and the spatial resolution varies from 10 m to 30 m. The most used CRS is WGS84, while a few datasets are supplied in UTM or Web Mercator (see Table 1). In this work, we utilized the 2017 map from FROM-GLC, which is a collection of irregular time series of general-purpose land cover (LC) maps developed by Tsinghua University [23][24][25]. The map has a resolution of 10 m and consists of 10 classes in its legend. It is provided in World Geodetic System 1984 (WGS84) Coordinate Reference System (CRS) in the form of 10°× 10°tiles (http://data.ess.tsinghua.edu.cn, accessed on 28 June 2023). The reported OA of this map is 73%.
As for the GL30 dataset, it is a regular time series of general-purpose LC maps at a resolution of 30 m, developed by the National Geomatics Center of China (NGCC) [30].
The legend of GL30 consists of 10 classes. For this work, we used the 2020 product version. The reported OA for this map is 86%, as mentioned on the GL30 website (http: //globeland30.org, accessed on 28 June 2023). The distribution of the GL30 product is based on the Universal Transverse Mercator (UTM) projection. The tile size of GL30 varies depending on the location, with most tiles (between 60°N and 60°S) being 5°× 6°in size, although some tiles can be 5°× 12°or even larger.
A comprehensive set of thematic maps called the Global Human Settlement Builtup (GHS BU) distinguishes between built-up and non-built-up surfaces [50,51]. These maps were developed by the Joint Research Centre (JRC) of the European Commission. Various GHS BU products exist, each one differing in terms of input imagery, baseline year, and production method. In this study, the GHS BU S1NODSM product, which is based on Sentinel-1 imagery from 2016, was employed. It consists of two classes: Built-up and nonbuilt-up. The product is distributed as a compressed file folder containing 2°× 2°tiles that cover the entire globe (https://jeodpp.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL, accessed on 28 June 2023). The original CRS of the tiles is Web Mercator projection (EPSG:3857). The accuracy of GHS BU S1NODSM is described qualitatively in comparison to another LC dataset [50].
Another thematic LC product specifically focused on built-up areas is the WSF from the German Aerospace Center-DLR [27]. It encompasses two classes: Settlements and nonsettlements. The product includes two maps with a spatial resolution of 10 m, representing 2015 and 2019. The 2019 map was utilized in this research. WSF is available as 2°× 2°t iles in the WGS84 CRS (https://download.geoservice.dlr.de/WSF2019, accessed on 28 June 2023). The WSF map for 2019 exhibits an OA of 84% and a Kappa value of 0.65, although information regarding User's Accuracy (UA) and Producer's Accuracy (PA) is currently unavailable [52].
The GSW family comprises a collection of multi-temporal thematic LC maps that focus on inland water bodies [28]. Produced by JRC, these annual maps span 37 years, from 1984 to 2021. The GSW product offerings include various aspects such as monthly water history, seasonality, yearly history, water occurrence, change intensity, recurrence, transitions, maximum water extent, monthly recurrence, and metadata. They are available for download at https://global-surface-water.appspot.com/download, accessed on 28 June 2023. For this study, the yearly history for 2019 was employed. The yearly history combines two water classes, seasonal and permanent. The UA and PA for the entire time series, including the 2019 map, exceed 95%.
The FNF map is a thematic LC map that classifies forested regions worldwide [29]. Developed by the Japan Aerospace Exploration Agency (JAXA), it provides a multi-temporal representation of forest areas with irregular time intervals. The map covers the periods from 2007 to 2010 and from 2015 to 2020, categorizing areas as forest, water, or not water, with an approximate resolution of 25 m. The FNF map for 2019 was used in this research. The accuracy of this specific product is not specified. The product is distributed in two tile sizes: 1°× 1°or 5°× 5°from https://www.eorc.jaxa.jp/ALOS/index_e.htm, accessed on 28 June 2023.
MapBiomas project focuses on generating maps for six Brazilian biomes, namely the Amazon, Atlantic Forest, Cerrado, Caatinga, Pampa, and Pantanal [53]. These maps provide a general overview of LC types at 30 m of spatial resolution annually, dating back to 1985. MapBiomas utilizes a hierarchical legend with three levels of classification. The first level consists of six broad classes, which are further subdivided into more specific classes at the second and third levels. Since its inception in 2016, the project has undergone several collections with different data processing methodologies. The MOLCA creation specifically used the map from Collection 7 for the year 2019, which achieved an OA of 89% [54]. MapBiomas is licensed under the Creative Commons CC-BY-SA license, which means it is freely available and can be accessed through various means, including GoogleEarthEngine (GEE), the GEE app-Toolkit, the MapBiomas dashboard, the QGIS plugin, or direct download in GeoTiff format via a provided link on https://mapbiomas.org/en/colecoes-mapbiomas-1?cama_set_language=en, accessed on 28 June 2023. The default CRS used is WGS84.
The CCI Africa Prototype is a general-purpose LC map with a resolution of 20 m, produced by the ESA CCI LC team, representing the LC state in Africa for the year 2016. The legend of the CCI Africa Prototype consists of Tree-covered areas, Shrub-covered areas, Grassland, Cropland, Vegetation aquatic or regularly flooded, Lichen and mosses/sparse vegetation, Bare areas, Built up areas, and Snow and/or ice and open water. The product can be downloaded as a single GeoTiff file in WGS84 CRS for the entire African continent from https://2016africalandcover20m.esrin.esa.int, accessed on 28 June 2023. Accuracy assessments of the CCI Africa Prototype were conducted for four countries: Kenya, Gabon, Ivory Coast, and South Africa [55]. The OA was found to be 44% for South Africa, 47% for Ivory Coast, 56% for Kenya, and 91% for Gabon.
The ESA DUE GlobPermafrost map describes the LC of permafrost regions, including Western Siberia (Russia), Barrow (Alaska), Teshekpuk (Alaska), Mackenzie Delta (Canada), Umiuaq (Canada), Kytalyk (Russia), Lena Delta (Russia), Seward Peninsula (Alaska), and Yukon Delta (Alaska) [56]. The legend of ESA DUE GlobPermafrost is very detailed on polar LC types (21 in total). Each permafrost region has a corresponding GeoTiff file, which can be downloaded from the PANGAEA data publisher under the Creative Commons Attribution 4.0 International license [57]. The CRS used for ESA DUE GlobPermafrost is the UTM projection, and the OA of the map is estimated to be 83%.

MOLCA Methodology Concepts
The creation of MOLCA involves intersecting multiple HRLC datasets to determine areas of agreement. Only the areas where all the HRLCs agree are retained, while pixels showing LC class discrepancies among the intersected HRLCs are designated as null. From a theoretical standpoint, there is a high probability that the MOLCA has high accuracy, because a manyfold agreement increases the odds of pixels being accurate [58]. Pixels that are accurately classified have a high likelihood of being found in corresponding positions across different datasets, as correct classification is a primary objective during the classification process. Conversely, errors in the LC derivation result from undesired factors associated with the classification process. These errors can be influenced by various factors, such as the quality and quantity of training data, the suitability of the classification algorithm, the accuracy and quality of satellite imagery, the complexity of the LC types being classified, as well as the presence of atmospheric phenomena such as clouds. Since different agencies and procedures are responsible for producing most of the existing HRLCs, it is expected that errors in different datasets are independent and not replicated across them. Thus, the MOLCA methodology's primary benefit arises from its utilization of multiple HRLC maps to create the training dataset. This approach is anticipated to improve the classification accuracy compared to relying solely on training samples extracted from individual HRLC-existing maps.

MOLCA Generation Procedure
A schema of the MOLCA generation procedure is shown in Figure 2. Different parts of the procedure are grouped into preparation, data harmonization, and MOLCA generation.
The procedure of creating MOLCA started by downloading the identified HRLC products (see Table 1) for regions of interest. This was conducted automatically with Python [59] when feasible, otherwise, it was conducted manually. The legends of these datasets were carefully compared to determine common classes across them. The classes that consistently appeared across multiple datasets were chosen as the target classes for MOLCA. Details on MOLCA legend are included in Table A1 of Appendix A. To align the legends of the existing datasets with the target legend of MOLCA, a correspondence table was created for each dataset and stored in a textual file. Based on correspondence tables, txt files with reclassification rules were created to be used in later steps. The reclassification rules contain information about the original raster value, the target raster value, and the target class label. The legend harmonization was performed manually because a single class might have a different name and code in different HRLCs. Since different datasets used different tiling systems, it was necessary to find matching tiles across the datasets. This matching process was automated using Python pandas, shapely, rasterio, and geopandas libraries. The rest of the procedure was automatized by using a combination of GRASS GIS and Python. Principal GRASS GIS modules used for MOLCA generation included r.import, g.region, r.stats, r.reclass, r.patch, r.category, and r.cross [60]. These modules were run through the grass.script library of Python. Some Python libraries independent of GRASS GIS were also used, such as numpy.
A reference dataset was selected to serve as a guidance for the CRS, extent, and spatial resolution of MOLCA tiles. A tile from the ESA CCI HRLC product was chosen, which had a size of 100 km × 100 km, a resolution of 10 m, and used the WGS84 CRS. Information about spatially matching tiles of reference dataset with non-reference datasets was stored in a CSV file.
Prior to importing data into the GRASS GIS database, its default CRS was set to WGS84 CRS. Then, the datasets were imported and automatically reprojected if their source CRS was not WGS84. The extent and resolution of the imported non-reference data tiles were adjusted to the ones of reference tiles. These non-reference tiles were clipped or merged to match the extent of the reference tile. Furthermore, the non-reference tiles were reclassified in accordance with the MOLCA target legend, following reclassification rules. Resampling to target resolution was conducted on the fly when processing operations were executed. The next step was to extract areas of agreement. A cross-product was computed from the non-reference datasets, which generated a raster map with different values representing combinations of class values found within the input layers. The cross-product labels were analyzed to identify agreement labels, which were defined as labels that appeared consistently across all input HRLCs or at least two HRLCs, with other labels considered null. The agreement labels and their associated values were converted into reclassification rules. Finally, the cross-product was reclassified into the MOLCA.

MOLCA Validation
The accuracy of MOLCA was evaluated against photo-interpreted samples collected by the authors in one part of the African region. The accuracy metrics were determined using a conventional error matrix [61] which was filled with classes derived from photointerpretation, along with their corresponding MOLCA classes found at the same sampling locations. The number of samples was estimated based on Cochran's equation [62]. The sample count was determined to be 1068; an additional 130 samples were preven-tively included to take into account the chance of discarding some samples due to photointerpretation uncertainties.
Each class within the MOLCA was considered a distinct stratum. An equal number of samples was selected in each stratum, with the exception of Bareland and Wetland because their count in MOLCA was low. Consequently, the number of samples for these classes was set to match the maximum number of pixels present in MOLCA, specifically 22 for Bareland and 6 for Wetland. The samples within each stratum, except for Bareland and Wetland, were randomly selected, while all pixels belonging to Bareland and Wetland were converted into samples.
The sampling survey was designed in Open Foris Collect platform [63], while photo interpretation was conducted in Open Foris Collect Earth software [63] where the photointerpreter could use either Google imagery or temporal profiles of vegetation indices from Landsat 7/8, Sentinel-2, and MODIS imagery to assign a class label to a sample. A total of 148 samples were discarded due to low confidence in the photo-interpretation deriving from poor image quality, clouds, or a high degree of similarity with other classes. The remaining 1050 samples were used for MOLCA validation.

Results
A total of 2075 MOLCA tiles were created and made publicly available on Zenodo (https://doi.org/10.5281/zenodo.8071675, accessed on 28 June 2023). The datasets include 893 tiles in the African, 658 tiles in the Amazonian, and 524 tiles in the Siberian macroregion of interest. The tiling grid used for MOLCA follows the Sentinel-2 Level-1C product tiling grid. Accordingly, the identifier assigned to each MOLCA tile corresponds to the identifier of the corresponding Sentinel-2 Level-1C tile. Figure 3 shows an example of MOLCA tile in the Amazon region. In addition to the tile data, two supplementary files are also provided. The first file contains the vector representation of MOLCA tile extents along with statistics such as the number of pixels per class, the total number of pixels, and the proportion of valid values for each tile in the attribute table. The second file in the CSV format includes the class codes and labels for MOLCA. Table 2 presents an overview of MOLCA statistics in terms of the class-wise number of pixels and the number of HRLCs participating in the generation of each class in each region, as well as the total number of pixels and the proportion of MOLCA in each region of interest regardless of class. Table 2. Statistics of MOLCA: number of pixels extracted for each class and the total number of pixels per region, the proportion of the MOLCA in a region of interest, and number of existing HRLCs that participated in the creation of a specific class. The number of HRLCs participating in the generation of MOLCA in Siberia is denoted by "*" because ESA DUE GlobPermafrost is not covering the entire region. "#" stands for "number of".

Siberia
Amazon

Accuracy
The result of the validation procedure based on 1050 samples is represented by the error matrix and accuracy indexes of MOLCA, reported in Table 3. The accuracy indexes include UA, PA, F1 score, OA, Kappa, and False Discovery Rate (FDR) The rows of the error matrix represent MOLCA classes, while the columns represent classes of photo-interpreted reference samples and accuracy indexes.

Discussion
MOLCA dataset has billions of LC pixels for the three regions of interest. Its legend and temporal representatives are determined based on the characteristics of input HRLCs.
To be included in MOLCA, a class must appear in at least two input HRLCs. Additionally, the other input HRLCs should either have the same class or no class at all. If these conditions are not met, the class will not be part of MOLCA. Moreover, the intersection procedure eliminates small differences in legends between input HRLC datasets. When there is a variation in the definition of a specific class between different HRLCs, the MOLCA derivation procedure ensures that only the common characteristics are retained. To illustrate, if one HRLC defines Forest as an area of trees with at least 2 m height, while another HRLC sets the threshold at 5 m, MOLCA will exclude any Forest patches with trees shorter than 5 m during the intersection procedure. This exclusion occurs because there is no agreement on the representation of Forest with trees below 5m in the second dataset. If classes with the same name are significantly different in their definition, it might happen that they do not constitute any agreement during MOLCA derivation, and therefore they will be eliminated.
Similarly, if there is a difference in the baseline years of input HRLCs, and a land cover change happened between these years if the HRLC with a more recent baseline year captures the changes, it will cause disagreement among the HRLCs for pixels affected by the change, and consequently, such pixels will not be present in MOLCA dataset. Hence, MOLCA's temporal representativeness falls between the most recent and the least recent baseline year of the input HRLCs.
By  (2) aquatic. The third level considers the artificiality of LC. In MOLCA, the vegetation classes are not fully differentiated as per the third level of FAO LCCS, i.e., most classes are not discriminated by artificiality. This drawback hampers the effectiveness of MOLCA, as FAO LCCS is currently the solely available system that facilitates the interoperability of legends of different LC datasets through a hierarchical approach. However, the inherent nature of MOLCA restricts the control over the legend. Despite this, it is worth noting that the legend remains compatible with the majority of existing HRLCs, which is a de facto standard legend.
Since not all classes are derived from the same HRLCs (e.g., water is present in five input HRLCs, while Grassland is present in three of them), the temporal representatives of each class vary. Nonetheless, MOLCA provides an approximate but reliable representation of land cover during the timeframe of 2016-2020, as explained above. Details about temporal representatives of each class in each region are included in Appendix B (see Tables A2-A4). MOLCA statistics (see Table 2) show that the Forest class emerges as the most abundant class within MOLCA. Among the HRLCs employed, the Built-up and Water classes have the highest representation with five HRLCs, followed by the Forest class with four HRLCs. Other classes primarily rely on three HRLCs, except for Cropland and Permanent ice and snow in Siberia.
Accuracy results (see Table 3) indicate a general high accuracy given that that the OA of MOLCA is 96%, Kappa index is 95%, and FDR is 4%. Regarding the classes, UA and PA scores exceed 85%, except for the Cropland class. Cropland has a UA of 73%; thus, it exhibits moderate overestimation. Unfortunately, no confident samples were available for the Wetland class, making it impossible to estimate its accuracy. It should be noted that the Bareland class had only three samples, which may not accurately reflect its classification accuracy. F1 score is very high for the majority of classes (>90%), which indicates high accuracy of classes, and a good balance between UA and PA. Being derived from UA and PA, the F1 score is slightly lower for the Cropland class (i.e., 80%) than for other classes, and equal to 0% in the case of Bareland class, for the above-mentioned reasons.
One limitation of MOLCA is the lack of pixels of Permanent ice and snow in Africa. It is present in high mountain peaks, but it is not identified in MOLCA. There are two possible reasons for this issue. Firstly, the area of the class is extremely small in the African region of interest; it significantly reduced the possibility of existing HRLCs having a consensus on such a narrow area. Secondly, it could be that the class is not detected by one of the input HRLCs, and consequently, consensus among HRLCs was not possible.

Conclusions and Outlook
The overall objective of MOLCA is to establish a benchmark framework for deriving training datasets from existing HRLC datasets. The HRLCs are combined by the intersection method which ensures that only regions where all datasets align in terms of classes are retained, while conflicting areas are eliminated. Such a manyfold agreement increases the probability that MOLCA retained only correct portions of input HRLCs. Currently, MOLCA covers three macro-regions of the world, two of which are in Siberia and Africa which are rarely included in existing training benchmark datasets [17,[42][43][44][45][46][47][48][49]. Another advantage of MOLCA is that it provides 117 billion of 10-m pixels, or 43% of coverage of the region of interest, which is, to our best knowledge, significantly more compared to any other existing training benchmark datasets. Such a large number of pixels is suitable to support deep learning techniques that are gaining popularity and require extensive training datasets. Nevertheless, it can also support traditional ML approaches.
The results of the accuracy evaluation demonstrated an OA of 96%, Kappa index equals to 95%, and low FDR (i.e., 4%), all of which indicate very high accuracy. Among the seven categories evaluated, four of them exhibited an accuracy rate surpassing 90% for both UA and PA. The Grassland category demonstrated UA and PA values nearing 90%. On the other hand, UA for the Cropland category implies a potential overestimation of the Cropland class. The F1 score for each class was in line with UA and PA results. While MOLCA may not achieve perfect accuracy for certain classes, a study by Rolnick et al. [64] shows that the noisy training samples do not significantly affect the performance of deep neural networks as long as the training dataset is sufficiently large. Moreover, some of the currently available HRLCs were based on other LC products, and in some cases without taking into account that each LC product contains some degree of error. Therefore, we argue that MOLCA might be more suitable for training data than an individual HRLC dataset.
The MOLCA legend is similar to most of the worldwide HRLCs and consists of different types of LC such as Bareland, Built-up, Cropland, Forest, Grassland, Shrubland, Water, Wetland, and Permanent ice and snow (in Siberia only). Although it does not fully align with FAO LCCS, it shows promise in aiding HRLC production if the current legend trend continues.
As a future development of this work, we plan to incorporate other recently published global HRLCs into the MOLCA derivation procedure, since they were not available at the time of generation of this version. On one hand, this would be useful for further refining MOLCA and increasing its accuracy, and on the other hand, it would allow the exploration of a suitable combination of existing HRLCs to ensure the representation of extremely small classes in MOLCA that currently are an issue (e.g., Permanent ice and snow in Africa). Furthermore, we also plan to expand MOLCA availability to other regions of the world.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:  Table A1).   Table A2 is an L-shaped diagram that displays the relationship between the HRLC datasets (rows) and the unique LC classes (columns) represented in those datasets. If a particular HRLC dataset includes a particular LC class, the corresponding cell in the table is highlighted. The table also includes information about the baseline year for each HRLC dataset in Siberia. The MOLCA represents the LC in a specific time period, which is determined by the minimum and maximum baseline years of the datasets used to create it. The period of representativeness for each class may vary because the classes are not derived from the same datasets.

Appendix B
For the MOLCA in Siberia, the Cropland, Permanent ice and snow, and Shrubland classes are representative of the LC in 2017, while the Bareland, Forest, Grassland, and Wetland classes are representative of the period between 2016 and 2017. The Built-up and Water classes are representative of the period between 2016 and 2019. Table A2. L-diagram of existing HRLCs (rows) and their classes (columns) in Siberia.

Existing HRLCs in Siberia
Year The L-shaped diagram for the Amazon region of interest is displayed in Table A3, and the one for Africa in Table A4