Land Cover Classification using Google Earth Engine and Random Forest Classifier—The Role of Image Composition

Abstract: Land cover information plays a vital role in many aspects of life, from science and economics to politics. Accurate land cover information affects the accuracy of all subsequent applications; therefore, accurate and timely land cover information is in high demand. In land cover classification studies over the past decade, higher accuracies were produced when using time series satellite images than when using single-date images. Recently, the availability of Google Earth Engine (GEE), a cloud-based computing platform, has drawn the attention of remote-sensing-based applications in which temporal aggregation methods derived from time series images are widely applied (i.e., using metrics such as the mean or median) instead of the time series images themselves. In GEE, many studies simply select as many images as possible to fill gaps, without considering how images from different years/seasons might affect the classification accuracy. This study aims to analyze the effect of different composition methods, as well as different input images, on the classification results. We use Landsat 8 surface reflectance (L8sr) data with eight different combination strategies to produce and evaluate land cover maps for a study area in Mongolia. We implemented the experiment on the GEE platform with a widely applied algorithm, the Random Forest (RF) classifier. Our results show that all eight datasets produced moderately to highly accurate land cover maps, with overall accuracies above 84.31%. Among the eight datasets, the two time series datasets of summer scenes (images from 1 June to 30 September) produced the highest accuracies (89.80% and 89.70%), followed by the median composite of the same input images (88.74%). The differences between these three classifications were not significant based on the McNemar test (p > 0.05). However, significant differences (p < 0.05) were observed for all other pairs involving one of these three datasets.
The results indicate that temporal aggregation (e.g., median) is a promising method that not only significantly reduces data volume (resulting in an easier and faster analysis) but also produces accuracies as high as those of time series data. The spatial consistency among the classification results was relatively low compared to the generally high accuracies, showing that the selection of the dataset used in any classification on GEE is an important and crucial step, because the input images for the composition play an essential role in land cover classification, particularly for snowy, cloudy and expansive areas like Mongolia.


Introduction
The most apparent indicator of surficial changes of the Earth, no matter the type, is land cover [1]. Recent studies have reported that the ongoing land use/cover change (LUCC) is having an increasingly negative impact on various aspects of the Earth's surface, such as terrestrial ecosystems and the water balance.

Google Earth Engine (GEE), a cloud-based computing platform, can solve the most significant problems with respect to land cover mapping of large areas. Users can analyze all available remotely sensed images using a web-based Integrated Development Environment (IDE) code editor without downloading these data to a local machine. In this way, users can easily access, select and process large volumes of data for a large study area [40]. Besides the fast processing, another important aspect that makes GEE more and more popular is the availability of several packages with numerous algorithms, simplifying access to remote sensing tools for both experts and non-experts. According to Tamiminia et al. [41], the number of publications using GEE has steadily increased since 2013. Among the datasets available in GEE, data from optical satellites, and particularly the approximately 40-year-long time series of Landsat, have been the most frequently used. GEE has been applied in various areas, ranging from agriculture, forestry and ecology to economics and medicine [41,42]. Among these, forest and vegetation studies were the most frequent application disciplines, followed by land use and land cover studies.
As mentioned above, Mongolia is covered by 125 Landsat tiles; however, more than 800 scenes are required to create a cloud- and snow-free composite image (e.g., Figure 1, Jun–Sep composite imagery). Therefore, GEE is considered the best option and presents great opportunities for mapping land cover in Mongolia.
After identifying all remotely sensed scenes which could provide data for a specific land use/cover study, the first critical step is the combination of these datasets. In the literature, two composition methods have been widely applied for land cover classification using multi-temporal Landsat images. One is the temporal aggregation method, that is, using metrics such as the mean, median and min/max derived from time series images [43][44][45][46][47]. The other is making a composition of time series data from all the available (cloud-free) Landsat images [48][49][50]. Obviously, these two methods are physically different. However, the effects of these composition methods on land cover classification accuracy have not been fully explored to date, and many studies simply select as many images as possible until the gaps are filled, without considering how images from different years/seasons might affect the classification accuracy. For instance, the most popular strategy for selecting input images for an annual cloud-free composite is to use images acquired over three years [44][45][46][47].

In the recent literature, to the best of our knowledge, no study has compared the effect on classification accuracy of different selection strategies for annual image composites, or the effect of different composition methods (i.e., median metric versus single-date or time series image data).
Consequently, our aim is to provide the reader with a summary of the effects of different composition methods (metrics, time series and different annual composition strategies) and of input feature selection (spectral and auxiliary variables) on the accuracy of land cover classification. A pilot test site (approximately one Landsat tile) over Ulaanbaatar (Mongolia) with varying cloud/snow cover conditions and seasonally changing land cover was selected. We used Landsat 8 surface reflectance data and the GEE cloud computing platform to carry out these comparisons.

Study Area
To show more detail about the different effects of image selection, we use one Landsat footprint (path 131, row 27) over Ulaanbaatar (Mongolia) as a case study (Figure 2). It should be mentioned that, to reduce edge effects, we buffered 3 km inside the footprint to delineate the study area.

To focus on the effect of image selection on land cover classification, we minimized all other effects, encompassing, for example, those arising from differences between remote sensing sensors (e.g., between Landsat OLI and previous Landsat sensors [51]), the effects of preprocessing (e.g., atmospheric correction), and the effects of different footprints or acquisition dates [52]. Consequently, we selected all Landsat 8 atmospherically corrected surface reflectance scenes available on the GEE platform for the year 2019 (except Dataset 4, which also encompassed images from June to September 2018).

Our study area is covered by either a single scene (path 131, row 27) or can be merged from nine scenes (Figure 2). It is worth mentioning that, in GEE, if users do not specifically select the path and row of the scenes, the platform automatically selects all images that intersect with the study area's boundary. That is why the number of images differs strongly between the datasets (Table 1). Obviously, the selection method (automatic vs. specific selection of the path and row of scenes) has an effect on the quality and characteristics of the composite images that are used for classification. Ten popular auxiliary variables, namely the Normalized Difference Vegetation Index (NDVI), Enhanced Vegetation Index (EVI), Soil Adjusted Vegetation Index (SAVI), Modified Secondary Soil-Adjusted Vegetation Index (MSAVI2), Normalized Difference Water Index (NDWI), modified NDWI (mNDWI), Normalized Difference Water Body Index (NDWBI), Normalized Difference Built-Up Index (NDBI), simple ratio (SR) and entropy, were derived from the L8sr data in order to increase the accuracy of the land cover classification [53,54]. The definitions and formulas of these variables are presented in Table A1. In addition, topographic variables such as elevation, slope and aspect have been shown to be related to the distribution of land cover types in many studies [55,56]. Therefore, in this study, the 30 m elevation, slope and aspect derived from the Shuttle Radar Topography Mission (SRTM) data were used as auxiliary variables for the classification. The number of auxiliary variables used within each dataset is shown in Table 1.
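Several of the indices listed above are simple normalized band differences. The following is a minimal sketch (not the GEE implementation used in the study) of three of the indices defined in Table A1, computed here with NumPy on hypothetical reflectance values:

```python
import numpy as np

def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)

def ndwi(green, nir):
    """Normalized Difference Water Index (McFeeters): (Green - NIR) / (Green + NIR)."""
    return (green - nir) / (green + nir)

def ndbi(swir, nir):
    """Normalized Difference Built-Up Index: (SWIR - NIR) / (SWIR + NIR)."""
    return (swir - nir) / (swir + nir)

# Hypothetical Landsat 8 surface reflectance values (B5 = NIR, B4 = Red)
nir = np.array([0.45, 0.30])
red = np.array([0.05, 0.10])
print(ndvi(nir, red))  # → [0.8 0.5]
```

In GEE these indices are computed per pixel over the whole composite in the same band-wise manner.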

Training and Validation Sample Data
Eight land cover types dominate the area of investigation: (1) Agriculture (AG), (2) Burned Area (BA), (3) Bare land (BL), (4) Grassland (GA), (5) Mixed Grassland (GRm), (6) Residence (RE), (7) Forest (FR) and (8) Water. All training and validation samples were collected based on manual visual interpretation of high-resolution images from Google Earth and Planet Lab [57]. This method is widely applied and reported in the literature [58][59][60]. Furthermore, to minimize the effect of spatial autocorrelation yet still capture the gradient of land cover types within each class (e.g., shallow versus deep water; low, medium and high grassland cover), we selected training samples as small polygons (i.e., each polygon contains a number of relatively homogeneous pixels of a given land cover type) and validation samples as points. If validation samples are selected close to the training polygons, it is more likely that overfitting goes undetected. Consequently, we randomly selected validation points with the constraint of being at least 100 m away from the closest training polygon to reduce spatial autocorrelation. In the next step, the (initially) selected training polygons and validation points were downloaded from GEE and uploaded to Planet Lab. These samples were further adjusted and corrected by the co-authors. Finally, we had 513 training polygons and 1039 validation points.
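The 100 m minimum-distance constraint can be sketched as follows. This is a simplified illustration with hypothetical coordinates (the actual sampling was done interactively in GEE and Planet Lab): polygons are given as vertex lists in a metric coordinate system, and candidate points are kept only if they are at least 100 m from the nearest polygon edge:

```python
import math

def point_segment_dist(p, a, b):
    """Euclidean distance from point p to segment ab (coordinates in metres)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1]
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def far_enough(point, polygons, min_dist=100.0):
    """True if the candidate validation point lies at least min_dist from every
    training polygon boundary (polygons given as closed vertex lists)."""
    for poly in polygons:
        for a, b in zip(poly, poly[1:] + poly[:1]):
            if point_segment_dist(point, a, b) < min_dist:
                return False
    return True

training = [[(0, 0), (60, 0), (60, 60), (0, 60)]]   # one hypothetical training polygon
candidates = [(80, 30), (300, 300)]
valid = [p for p in candidates if far_enough(p, training)]
print(valid)  # only the candidate at least 100 m away is kept
```

The same filtering is what the 100 m rule above accomplishes, regardless of the tool used to apply it.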

Methods
We created eight datasets on the GEE platform with respect to their temporal aggregation and time series stacking methods (Figure 3). Dataset 1 to Dataset 6 are median composite images; Dataset 7 and Dataset 8 are time series data that have the same input images as Dataset 2 and Dataset 6, respectively. The input images (for temporal aggregation and time series) were selected based on different strategies in order to assess the effect of different selections on classification accuracy (Table 1). These datasets, as well as the auxiliary variables, have the same spatial resolution (30 m) as the original Landsat 8 images.
For example (Table 1):
- Dataset 2: only images acquired between 1 June and 30 September 2019 were selected to calculate the median image.
- Dataset 3: only images with cloud cover less than 30% were used for the median calculation.
- Dataset 5: the best single scene (p131r27) covering the entire study area was selected based on the lowest cloud cover percentage.
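The pixel-wise median aggregation underlying Dataset 1 to Dataset 6 can be sketched as follows. This NumPy illustration, with a hypothetical 5-scene stack for one band and masked (cloudy/snowy) pixels set to NaN, stands in for the per-band reducer that GEE applies:

```python
import numpy as np

# Hypothetical stack of 5 co-registered scenes for one band: shape (time, rows, cols).
# Cloudy/snowy observations are masked as NaN before compositing.
stack = np.array([
    [[0.10, 0.12], [0.30, np.nan]],
    [[0.11, np.nan], [0.31, 0.50]],
    [[np.nan, 0.13], [0.29, 0.52]],
    [[0.12, 0.14], [0.33, 0.51]],
    [[0.95, 0.12], [0.30, 0.49]],   # e.g., an undetected cloud produces an outlier
])

# Pixel-wise median over time ignores the masked NaNs and is robust
# to residual outliers such as undetected clouds.
median_composite = np.nanmedian(stack, axis=0)
print(median_composite)
```

A time series dataset (Dataset 7 or 8), by contrast, would keep the full time axis as additional feature bands instead of reducing it.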

Random Forest Classifier
To date, RF is considered one of the most widely used algorithms for land cover classification using remote sensing data [55][56][57][61][62][63][64][65][66]. According to Mahdianpari et al. [67] and Xia et al. [68], the reasons for RF receiving considerable interest over the last two decades are: (1) good handling of outliers and noisy datasets; (2) good performance with high-dimensional and multi-source datasets; (3) higher accuracy than other popular classifiers, such as SVM, kNN or MLC, in many applications [69,70]; and (4) increased processing speed through the selection of important variables [71].
Another factor making RF more popular than other machine learning algorithms (e.g., SVM) is that only two parameters (ntree and mtry) need to be optimized, facilitating the application of RF [72]. A meta-analysis of 349 GEE peer-reviewed articles over the last 10 years shows that the RF algorithm is the most frequently used classification algorithm for satellite imagery [41]. Considering all these reasons, we chose RF for the present study.
Based on the recommendations of previous studies [62,73] and pretests from our data, we selected 100 trees (ntree = 100), while mtry was set to the default value (square root of the total number of features).
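With these settings, the classifier can be sketched as follows. This uses scikit-learn's RandomForestClassifier on hypothetical data rather than GEE's implementation, but the two parameters map directly (ntree → n_estimators, mtry → max_features):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Hypothetical training data: 200 pixels x 20 features
# (7 spectral bands + 10 spectral indices + 3 topographic variables).
X = rng.normal(size=(200, 20))
y = rng.integers(0, 8, size=200)          # 8 land cover classes

# ntree = 100; mtry defaults to the square root of the number of features.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)

# Per-feature importances can be ranked to obtain a Figure-4-style ordering.
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking[:3])
```

The same two parameters are exposed by GEE's `ee.Classifier.smileRandomForest`, so the sketch transfers directly to the platform used in the study.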

Accuracy Assessment, Comparison and Statistical Testing
Foody [74] strongly recommends that the kappa coefficient should not be routinely used in the assessment and comparison of the accuracy of thematic maps derived from image classification. Therefore, in the present study, for each classification accuracy assessment we used the popular measures extracted from confusion matrix reports, such as overall accuracy (OA), producer accuracy (PA) and user accuracy (UA). However, according to Foody [75], a confusion matrix (i.e., OA, PA and UA) only provides an "estimate" of classification accuracy, and thus only a tentative conclusion can be made. This is especially the case when we compare different classification results with small accuracy differences. Janssen & van der Wel [76] state that it is necessary to compare the accuracy of different classifications in a statistically rigorous manner. For example, if two classification maps, M1 and M2, of a region were created by two different classifiers (or different input datasets), it is necessary to assess whether the difference in the accuracy of these maps is significant. In the literature, a number of studies use kappa coefficients to assess the significance of the difference between classifications via the z value:
z = (κ₁ − κ₂) / √(δ²κ₁ + δ²κ₂), (1)

where κ₁, κ₂, δκ₁ and δκ₂ are the estimated kappa coefficients and the associated estimates of the standard error of kappa for M1 and M2, respectively [77,78]. According to Foody [66], Equation (1) should be replaced by:

z = (ρ₀₁ − ρ₀₂) / √(δ²ρ₀₁ + δ²ρ₀₂), (2)

where ρ₀ and δρ₀ are the proportion correct and its standard error, respectively (subscripts 1 and 2 refer to M1 and M2). It is worth mentioning that Equations (1) and (2) can only be applied if the samples used are independent. However, in remote sensing applications, when comparing the difference between maps, the same ground truth (validation/reference) dataset is usually used. In this case, the McNemar test [79] can indicate whether the difference in classification results is significant [75]. Though the non-parametric McNemar test is based on a 2 × 2 confusion matrix, the confusion matrix of a remote sensing classification (which often has more than two classes) can be collapsed to a 2 × 2 matrix by focusing on the correctly and incorrectly classified pixels of the output [80]. In doing so, the McNemar test calculates the z value:

z = (f₁₂ − f₂₁) / √(f₁₂ + f₂₁), (3)

where f₁₂ is the number of samples correctly classified in result one but incorrectly classified in result two, and f₂₁ is the number of samples correctly classified in result two but incorrectly classified in result one. The squared statistic z² follows the χ² distribution with one degree of freedom; this test was used to estimate the significance of differences in accuracy among the classifications in our study (95% confidence level).
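A minimal sketch of the McNemar z statistic described above, using hypothetical off-diagonal counts f12 and f21:

```python
import math

def mcnemar_z(f12, f21):
    """McNemar z statistic from the off-diagonal counts of the collapsed
    2 x 2 matrix: f12 = samples correct in map 1 only, f21 = correct in map 2 only."""
    return (f12 - f21) / math.sqrt(f12 + f21)

# Hypothetical counts for two classifications validated on the same points.
z = mcnemar_z(f12=45, f21=25)
# |z| > 1.96 indicates a significant difference at the 95% confidence level.
print(round(z, 3), abs(z) > 1.96)
```

Because both maps are scored against the same validation points, only the disagreement counts enter the statistic; points where both maps are correct or both are wrong cancel out.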

Effects of Differences Among Classifications on the Spatial Estimation of Land Use Classes
Assuming that each classification differs with respect to the spatial occurrence of the different land use classes, we calculated the mode (majority class) of each pixel across all 8 classifications. The mode classification image was then compared to the classification featuring the highest overall accuracy as a reference. In addition, starting from the mode classification, we calculated the number of pixels for each class that were assigned in all classifications only to this class, as well as the number of pixels that were assigned to this class and one other class, and so forth. This provides a class-wise estimation of the uncertainty arising from the combination strategy used to create cloud- and snow-free images in GEE.
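The mode (majority class) computation can be sketched as follows, using a hypothetical stack of class labels from the 8 classifications. Counting the votes per class also yields the number of distinct classes assigned to each pixel, which underlies the class-wise uncertainty estimate described above:

```python
import numpy as np

# Hypothetical class labels (0-7) from 8 classifications of a 2 x 2 window:
# shape (n_classifications, rows, cols).
labels = np.array([
    [[3, 1], [6, 0]],
    [[3, 1], [6, 0]],
    [[3, 4], [6, 0]],
    [[3, 4], [2, 0]],
    [[3, 1], [6, 7]],
    [[3, 1], [6, 0]],
    [[5, 1], [6, 0]],
    [[3, 1], [6, 0]],
])

def pixelwise_mode(stack, n_classes=8):
    """Majority class per pixel, plus how many distinct classes occur there."""
    # Vote counts per class: shape (n_classes, rows, cols)
    counts = np.stack([(stack == c).sum(axis=0) for c in range(n_classes)])
    mode = counts.argmax(axis=0)
    n_distinct = (counts > 0).sum(axis=0)
    return mode, n_distinct

mode, n_distinct = pixelwise_mode(labels)
print(mode)        # majority-vote image
print(n_distinct)  # 1 = all classifications agree; >1 = disagreement
```

Pixels with n_distinct equal to 1 correspond to the "assigned in all classifications only to this class" category; higher values correspond to increasing disagreement among the eight maps.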

Overall Accuracy of Different Datasets With and Without Auxiliary Variables
We applied the same training samples and validation points to classify and assess the accuracy of all land cover maps.
As mentioned in the Methods section, aside from the spectral bands of Landsat images, we also used additional variables to test whether they increase the accuracy of the land cover maps.
The results of the classification accuracy assessment (Table 2) show that if only the spectral bands from L8sr were used, a moderate to high agreement with the reference data can be achieved (OA ranges from 77.6% to 85.27%). However, when the additional auxiliary variables (10 spectral indices + 3 topographic variables) were included in the model, the OAs increased by approximately 4.1% to 7.7%. This is consistent across all datasets. The highest increases, of more than 7.5%, were observed for Dataset 1, Dataset 2, Dataset 5 and Dataset 6. This increase can be explained by the order of variable importance (Figure 4) of the 20 input feature bands (Dataset 1 to Dataset 6) and the 41 feature bands of Dataset 7 and Dataset 8. In all datasets, elevation was ranked as the most important variable in the classification, followed by entropy, B7 and B1 (except in Dataset 6, where B7 was ranked in 8th position).
Regarding the time series data (Dataset 7 and Dataset 8), the three most important variables were elevation, entropy and Band 5 of the July composite image. The spectral indices were ranked very low in all datasets. Regarding the three topographic variables, elevation always featured high importance values in all datasets, whereas aspect and slope contributed only marginally to the models.


The Effect of Different Composition Datasets on Land Cover Classification Accuracy
In the following, we only report results based on the spectral bands (1 to 7) and auxiliary variables to focus on the effects of different composition strategies on the classification accuracy. In general, the maps show that all classifications have a consistent pattern of land cover types, with some slight differences (Figure 5). For instance, there is more agricultural land in the southwest corner in Datasets 1, 3, 4 and 6. Some additional agricultural land occurs in the lower right corner of the classification of Dataset 5. Based on Datasets 1, 3, 4 and 6, the models also classified more grassland in the southwest corner. In general, all datasets produced high accuracies (OA ranges from 84.31% to 89.80%, Table 3). On average, burned area (BA) and residence area were classified with the highest accuracy, followed by grassland, mixed grassland and water. The lowest accuracies were observed for agriculture and forest, caused by the very low PA of Dataset 4, Dataset 5 and Dataset 6 for the agriculture and forest classes.
Regarding the difference between Dataset 1 and Dataset 2, which were median composites of all images in 2019 (196 images) and of the images from June to September 2019 (61 images), respectively, the accuracy of Dataset 2 (88.74%) was higher than that of Dataset 1 (85.95%) and the difference was significant (p < 0.05, Table 4). Comparing the results of Dataset 2 and Dataset 4, which were composited from all images between June and September of one year (2019, 61 images) and of two years (2018 + 2019, 126 images), respectively, Dataset 2 produced higher accuracies for all land cover types, regarding both PA and UA (except the PA of the burned area and residence classes; Table 3), and this difference is significant (p < 0.05). Therefore, if images within one year (or season) can fill all the gaps (e.g., due to cloud), the Dataset 2 strategy should be applied instead of the Dataset 4 strategy. Comparing the median composite images with the time series images that use the same input images (Dataset 2 versus Dataset 7, Dataset 6 versus Dataset 8; Table 3), the results of Dataset 2 and Dataset 7 were only slightly different (OA = 88.74% and 89.80%, respectively). The largest differences were observed for the water, agriculture and bare land classes: water has higher accuracy in Dataset 2, while agriculture and bare land have higher accuracies in Dataset 7. However, this difference is not significant (p = 0.31). In contrast, the difference between Dataset 6 and Dataset 8 is significant (p < 0.05). This indicates that the automatic selection (by study area, i.e., Dataset 1) and the manual selection of scenes (by path and row, i.e., Dataset 6), together with the composition method (median vs. time series), have an effect on the classification results. Therefore, in applications, both the trade-off between median and time series composition and the images included in the collection should be taken into account.
An interesting result is the comparison between Dataset 5 (single-date imagery), Dataset 6 (median composite of the 7 images) and Dataset 8 (time series of the 7 images) of the single scene (path 131, row 27). Although there is an apparently small difference between the OA of Dataset 5 and Dataset 6 (85.27% vs. 85.66%), they are significantly different. Looking at all eight datasets (Table 4), there are significant differences (p < 0.05) between the classification results of the time series data (Dataset 7 and Dataset 8) and all other compositions, except Dataset 2. Furthermore, the classification result of Dataset 2 is also significantly different from the other compositions' classification results (except those of Dataset 7 and Dataset 8). This suggests that the input images for the compositions play a crucial role in producing significantly higher accuracies of land cover maps. In this case, Dataset 2, Dataset 7 and Dataset 8 were composited from images between June and September within one year. It is worth recalling that Dataset 4 was also composited from all images between June and September, but additionally included images from June to September 2018. That is why it produced significantly lower (p < 0.05) accuracy in comparison to Dataset 2, Dataset 7 and Dataset 8.

Variation of Land Cover Types Derived from Different Datasets
From all 8 single classifications, we calculated the mode classification, representing for each pixel the class with the highest agreement among the single classifications (Figure 6a). Only small differences occurred between the mode classification and the classification of Dataset 7, which featured the highest accuracy (Figure 6b, showing differences in the spatial subset around Ulaanbaatar). If the differences between the mode classification and all other classifications are assessed, large discrepancies become obvious (Figure 6c for the same spatial subset and Figure 6d for a graphical analysis based on the entire area). Especially at the borders of Ulaanbaatar, the classification results differed substantially. All classifications agreed widely on the spatial estimation of forest occurrence. Overall, 76% of the pixels assigned to the forest class in the mode image were also assigned to forest in all other classifications. Another 22% of the forest pixels in the mode classification were assigned to another class in at least one classification. For all other classes, the agreement among the classifications was much lower. The lowest agreement was observed for agriculture, where only 12% of the mode classification pixels were assigned to agriculture in all other classifications.

Discussion
In this study, we test different composition methods (metrics, time series and the different annual/seasonal composition strategies) to generate spectral input data for land cover classifications and investigate how this procedure affects the classification accuracy. Our results show that accuracies were generally high, but, although all classifications were trained with and validated against the same reference dataset, the strategies used to select images led to significantly different results.
Prior to the availability of GEE, many different composition methods were proposed to solve the problem of cloud cover. For example, Senf et al. [81] used fusion techniques to simulate data for the missing areas. Inglada et al. [82] used linear interpolation to fill invalid pixels, using the previous and/or following cloud-free-date pixels, to map land cover over France. However, since GEE became available, the most popular method for filling gaps in cloudy images has been the use of median metrics (temporal aggregation). Carrasco et al. [39] report that the advantage of this method is the significant reduction in data volume, resulting in an easier and faster analysis. For our study, the volume of the median data was approximately 4.3 GB, whereas the volumes of the time series data (Dataset 7 and Dataset 8) were 20.8 GB and 13.5 GB, respectively. Obviously, such large data volumes are not easy to handle and analyze on a personal computer (PC) with remote sensing software. On the GEE platform, however, this task can be completed easily; hundreds of images can be rapidly processed [83]. Using the median composition method, the input images (typically an annual composition) are created in a pixel-wise manner by taking the median value (i.e., DN, TOA or surface reflectance) from all cloud-free pixels of the image collection. It is worth mentioning that, besides the median, other reducers are also available in GEE, such as mean, minimum, maximum, standard deviation and percentile. However, in the literature, the median reducer is the most popular. In our case, we also tested the mean composition (results not shown), but this method produced lower accuracy than the median composition. Therefore, the median composition was selected in this study.
Most importantly, if the input images used for composition are optimally selected, the median composite image (with low data volume) can produce results which are as good as those based on time series composites (Dataset 2 vs. Dataset 7, Table 4), whereas a significantly different classification result could be achieved if the median image is not optimally created (e.g., Dataset 6 and Dataset 8).
The strategies for selecting images in a collection vary between studies as a consequence of cloud cover and the types of land cover. Richards & Belcher [47] used images over three years to create composites (i.e., images from 1999-2001 and 2014-2016 to map land cover in 2000 and 2015, respectively). Although the land cover map had only two classes (Vegetated Cover and Unvegetated Cover) and the accuracy was assessed based on 293 validation points, the overall error rate was quite high (9.2% and 7.5% for the classifications in 2000 and 2015, respectively). This is consistent with our results, in that using all images available in a year produces lower accuracy than a seasonal image composition (i.e., Dataset 1 and Dataset 3 produced lower accuracy than Dataset 2); in addition, composites from multiple years' data produce land cover maps with lower accuracy than those from a single year (i.e., Dataset 2 vs. Dataset 4). This is consistent with the recommendation of Frantz et al. [84] that the trade-off between using different years (land cover might change) and widely differing dates within the same year (phenological consistency might be lost) should be taken into consideration. In this case, using data from the same year is preferable. Nevertheless, the most popular approach to selecting images is seasonal image composition. For example, to map land cover for Central Asia, Hu & Hu [46] selected only images between April and July in the span of three years (target year ±1 year) for mapping land cover in 2001 and 2017. Meanwhile, Nyland et al. [44] selected images from 1 July to 30 September (to minimize the effect of snow cover) for land cover mapping in the transitional zone between physiographic provinces in the West Siberian Plains and on the Central Siberian Plateau.
As far as we know, no study has investigated the difference between selecting images by season and selecting as many images as possible to fill all pixels missing due to cloud and snow cover. Our results (Tables 3 and 4) show that applying the same composition method (i.e., median) to different input image collections can produce significantly different classification accuracies. For example, among Dataset 1, Dataset 2 and Dataset 4, which have different input images, Dataset 2 produced the highest accuracy (88.74%) and was significantly more accurate than Dataset 1 and Dataset 4. The classification accuracy of Dataset 1 was in turn higher than that of Dataset 4. These differences can be explained by the strong seasonality of most land cover types in our study area, such as grassland, mixed grassland and agriculture. Even water bodies change seasonally; they may be snow/ice in winter and bare land in summer. Furthermore, because cloud conditions vary between years, a median composite of input images from different years (i.e., Dataset 4) can distort phenological information; for example, missing pixels in June 2018 could be filled with pixels from July 2019. Such effects are large for study areas where land cover types have strong seasonality, as in our study site. That is why the classification result of Dataset 2 was significantly more accurate than that of Dataset 4. Another input selection strategy is shown in Dataset 1 and Dataset 3, where images were selected based on a cloud cover percentage threshold. In our case, we selected all scenes over the study area in 2019 with cloud cover below 30%. This selection depends heavily on cloud cover conditions and is therefore not recommended, particularly when the study aims to produce seasonal land cover maps.
For example, in Dataset 1 and Dataset 3 of the present study (Table 1), there were 196 and 130 images, of which 61 and 37, respectively, were acquired between June and September 2019. This means that there were 135 useless images in Dataset 1 and 93 useless images in Dataset 3. That is why Dataset 1 and Dataset 3, despite their very high numbers of input images, produced significantly lower accuracies than Dataset 2.
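The significance statements above rest on the McNemar test, which compares two classifications on the same validation points using only the discordant cases. A minimal sketch in pure Python; the discordant counts used in the example are invented for illustration, not taken from the study:

```python
def mcnemar(n01, n10):
    """McNemar chi-square statistic with continuity correction.

    n01: validation points classifier A got right and B got wrong
    n10: validation points B got right and A got wrong
    With one degree of freedom, a statistic above 3.841
    corresponds to p < 0.05.
    """
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Hypothetical discordant counts for two classifications
chi2 = mcnemar(40, 18)
significant = chi2 > 3.841  # p < 0.05 threshold for df = 1
```

Because both classifications are scored on the same validation points, only the disagreements carry information, which is why the concordant counts never enter the statistic.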
The selection of images is also crucial with regard to the use of images from multiple paths and rows. To our knowledge, this has not yet been discussed in the literature. In our study area, the final composite can be covered either by data from a single path/row combination or by merging data from multiple paths and rows (Figure 2). In GEE, if images are selected based on a study area, the engine automatically considers data from 9 Landsat footprints (Dataset 1, Dataset 2, Dataset 3, Dataset 4 and Dataset 7). For our other datasets, we specifically selected the input images so that all scenes come from the same Landsat footprint (path 131, row 27). The results (Table 4) show that when these image collections (automatic vs. manual selection) were used for time series composition, the classification results were similar (Dataset 7 vs. Dataset 8). However, when they were used for median composition, the classification results were significantly different (Dataset 2 vs. Dataset 6). Therefore, in GEE applications, where the median composition method is widely applied, this selection issue should be taken into account when preparing input images for land cover classification.
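The two selection strategies (area-based vs. footprint-based) can be mimicked in pure Python over a list of scene metadata. The `path`/`row` fields mirror the WRS_PATH/WRS_ROW properties that Landsat scenes carry, but the scene list itself and the `covers_aoi` flag are invented for illustration:

```python
# Each dict stands in for one Landsat scene's metadata.
scenes = [
    {"id": "LC08_131027_20190615", "path": 131, "row": 27, "covers_aoi": True},
    {"id": "LC08_131026_20190615", "path": 131, "row": 26, "covers_aoi": True},
    {"id": "LC08_132027_20190622", "path": 132, "row": 27, "covers_aoi": True},
    {"id": "LC08_133025_20190629", "path": 133, "row": 25, "covers_aoi": False},
]

# Strategy 1: area-based selection (what filterBounds does in GEE):
# every footprint intersecting the study area is kept.
by_area = [s for s in scenes if s["covers_aoi"]]

# Strategy 2: footprint-based selection: only one path/row combination,
# so all pixels in the composite share the same acquisition dates.
by_footprint = [s for s in scenes if s["path"] == 131 and s["row"] == 27]
```

The footprint-based collection is smaller but temporally homogeneous, which is why the median composites built from it behaved differently from the area-based ones in our comparison.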
Another point worth mentioning is the improvement of classification accuracy after incorporating auxiliary variables as additional predictors (Table 2). In this study, we used 13 auxiliary variables (10 spectral and 3 topographic indices). The 10 spectral indices were calculated from the corresponding spectral data input (Figure 3); for example, "auxiliary variable 1" was calculated from "spectral data 1." In Dataset 7 and Dataset 8 in particular, we tested "time series indices" (i.e., monthly auxiliary variable indices); however, these indices did not perform as well as the indices calculated from the median images. Therefore, Dataset 7 and Dataset 8 used the same auxiliary variables as Dataset 2 and Dataset 6, respectively. The results (Tables 3 and 4) indicate that careful selection of both the input images and the auxiliary variables can significantly improve the classification results. Popular spectral indices (e.g., NDVI, EVI, SAVI) have been reported to increase the accuracy of land cover classification from remotely sensed images. In this study, however, these indices were consistently ranked low compared to other feature bands (Figure 4). The reason is that vegetation is the dominant cover type in the study area and it changes (green to brown) quickly. Due to the climatic gradient across our study area, these changes are not consistent in time. Furthermore, agricultural planting and harvesting times vary across the area according to crop and climate. This is consistent with the studies of Abdi [85] and Zha et al. [86].
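The vegetation indices named above follow standard band-ratio formulas; a NumPy sketch with the usual SAVI/EVI constants (L, G, C1, C2) and toy surface-reflectance values, not the study's data:

```python
import numpy as np

def ndvi(nir, red):
    # Normalized Difference Vegetation Index
    return (nir - red) / (nir + red)

def savi(nir, red, L=0.5):
    # Soil Adjusted Vegetation Index; L is the soil-brightness factor
    return (nir - red) * (1 + L) / (nir + red + L)

def evi(nir, red, blue, G=2.5, C1=6.0, C2=7.5, L=1.0):
    # Enhanced Vegetation Index; blue band corrects for aerosols
    return G * (nir - red) / (nir + C1 * red - C2 * blue + L)

# Toy reflectance values for two pixels
nir = np.array([0.40, 0.50])
red = np.array([0.10, 0.20])
blue = np.array([0.05, 0.08])
```

All three respond to the same NIR/red contrast, which helps explain why they rank similarly (and, in our area, similarly low) as Random Forest predictors.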
NDBI and mNDWI are two other popular indices that also ranked very low. For NDBI, this can be explained by the residential cover in Mongolia, which is difficult to separate from bare land. In the case of mNDWI, the index could not help separate water from shadowed surfaces (Figure A1). This is in line with the study of Feyisa et al. [87].
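NDBI and mNDWI use the same normalized-difference form, swapping in the SWIR and green bands. A sketch with toy reflectance values, with comments noting why each index struggles in our study area:

```python
import numpy as np

def ndbi(swir1, nir):
    # Normalized Difference Built-up Index: built-up surfaces and bare
    # land both reflect SWIR more than NIR, hence the confusion
    # between the two classes noted in the text.
    return (swir1 - nir) / (swir1 + nir)

def mndwi(green, swir1):
    # Modified Normalized Difference Water Index: water absorbs SWIR,
    # but shadowed surfaces are also dark in SWIR, so both score high.
    return (green - swir1) / (green + swir1)

# Toy reflectance values for two pixels
swir1 = np.array([0.30, 0.05])
nir = np.array([0.25, 0.04])
green = np.array([0.10, 0.09])
```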
In this study, Elevation and Entropy were consistently the two most important variables regardless of the dataset (Figure 4). As shown in Figures 5 and A1, the distributions of land cover types (Figure 5) and elevation (Figure A1) are closely aligned, suggesting that elevation helps separate grassland from other vegetation, grassland from bare land, and bare land from residential areas. Entropy helps distinguish between grassland, agricultural land and mixed grassland.
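The entropy predictor is a texture measure. As a simplified stand-in for a GLCM-style entropy band, the sketch below computes Shannon entropy of the value histogram in a window; the window contents are invented toy class labels:

```python
import numpy as np
from collections import Counter

def window_entropy(window):
    """Shannon entropy (bits) of the value histogram in a window.

    Homogeneous windows (e.g., uniform grassland) give low entropy;
    heterogeneous windows (e.g., mixed grassland/cropland edges) give
    high entropy, which is what makes the band discriminative.
    """
    counts = Counter(window.ravel().tolist())
    n = window.size
    probs = [c / n for c in counts.values()]
    return -sum(p * np.log2(p) for p in probs)

uniform = np.zeros((3, 3), dtype=int)                 # one class only
mixed = np.array([[0, 1, 2], [1, 2, 0], [2, 0, 1]])   # three classes evenly

e_uniform = window_entropy(uniform)
e_mixed = window_entropy(mixed)
```

Applied as a moving window over an image, this yields a texture band in which uniform grassland stays near zero while heterogeneous mixed-grassland and cropland mosaics score high.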
In addition to the accuracies reported in the confusion matrices, a visual comparison of the classification results (land cover maps) shows more grassland and agricultural land (Figure 5, southwest corner) in the results of Dataset 1, Dataset 3 and Dataset 4. Such differences could lead to high uncertainties for any subsequent application (e.g., biomass estimation) using the map as input. This again confirms that image selection plays a crucial role in land cover classification accuracy, particularly when a specific land cover class is targeted.
If the consistency of the classifications over space is assessed directly, we find surprisingly large differences among them. Although the OAs of all single classifications were above 84%, for many classes the majority of pixels were assigned to another class in at least one classification. This causes high uncertainties arising solely from how the input data are preprocessed and from how many and which predictors are used. It clearly shows that generating a cloud- and snow-free input dataset for LUCC in GEE is an important and crucial step, whose effects on the classification results are generally underestimated.

Conclusions
In a vast country like Mongolia, both cloud and snow cover are present and the land cover changes over the course of the season. It is not possible to map land cover with a single-date image or even a monthly composite. Mosaicking, stacking and filtering (selecting) the images are always required to create a satisfactory land cover map from remotely sensed images. These steps have become much more accessible with the availability of GEE; otherwise they would be very costly and labor intensive. Our pilot study shows that different strategies for selecting the input images produce different classification results (with OA ranging from 77.66% to 89.80%). The results show that to achieve a highly accurate and realistic land cover map, it is not enough simply to select all available satellite images (i.e., Dataset 1), to filter images by a cloud cover threshold (i.e., Dataset 3), or to fill cloud-covered pixels with pixels from the same season of another year (i.e., Dataset 4). The optimal way to select images for the collection (the input for the median or time series composition) depends on cloud/snow cover and the land cover types. Regarding the composition methods (i.e., median vs. time series), our results show that if the input images (image collection) are optimally selected (i.e., collection 2 of Dataset 2), the median composite image (low data volume) can produce accuracy as high as time series composite data (Dataset 2 vs. Dataset 7, Table 4). Therefore, we highly recommend that when implementing land cover classification on the GEE platform, the median composition method be given priority over time series composition, because it not only fills the gaps due to cloud and snow cover but also reduces data volume (i.e., easier and faster analysis) while producing accuracy as high as that of time series (multi-temporal) image data.
To date, both Landsat (TM, ETM+ and OLI) and Sentinel-2 (A and B) data are available in GEE, meaning that all the issues mentioned above (i.e., cloud cover, snow cover, and the lack of temporally close acquisitions when using Landsat 8 alone) can be addressed by integrating Landsat 7, Landsat 8 and Sentinel-2. However, it should be noted that integrating images from these different sensors requires accounting for the differences between their band wavelengths. Only a handful of studies in the literature combine Landsat 7, Landsat 8 and Sentinel-2 for land cover classification while considering these differences. Therefore, future studies should investigate the optimal method of integrating these different sensor images to produce the best land cover maps for large areas like Mongolia.