remote Land Use and Land Cover Mapping Using Sentinel-2, Landsat-8 Satellite Images, and Google Earth Engine: A Comparison of Two Composition Methods

: Accurate and real-time land use/land cover (LULC) maps are important to provide precise information for dynamic monitoring, planning, and management of the Earth. With the advent of cloud computing platforms, time series feature extraction techniques, and machine learning classiﬁers, new opportunities are arising in more accurate and large-scale LULC mapping. In this study, we aimed at ﬁnding out how two composition methods and spectral–temporal metrics extracted from satellite time series can affect the ability of a machine learning classiﬁer to produce accurate LULC maps. We used the Google Earth Engine (GEE) cloud computing platform to create cloud-free Sentinel-2 (S-2) and Landsat-8 (L-8) time series over the Tehran Province (Iran) as of 2020. Two composition methods, namely, seasonal composites and percentiles metrics, were used to deﬁne four datasets based on satellite time series, vegetation indices, and topographic layers. The random forest classiﬁer was used in LULC classiﬁcation and for identifying the most important variables. Accuracy assessment results showed that the S-2 outperformed the L-8 spectral–temporal metrics at the overall and class level. Moreover, the comparison of composition methods indicated that seasonal composites outperformed percentile metrics in both S-2 and L-8 time series. At the class level, the improved performance of seasonal composites was related to their ability to provide better information about the phenological variation of different LULC classes. Finally, we conclude that this methodology can produce LULC maps based on cloud computing GEE in an accurate and fast way and can be used in large-scale LULC mapping.


Introduction
Land use maps are fundamental data sources for land planning and management [1,2]. Accurate and up-to-date land use/land cover (LULC) mapping has always been of interest to geoscience and remote sensing societies [3][4][5], mainly because it is a provider of valuable information to understand human-environment relationships [6,7]. The starting point for LULC mapping was that of using mono-temporal and mono-source satellite images [8]. For these approaches, spectral and textural features played a substantial role in improving the classification results [9,10] and the efforts continued with the combined use of multi-source datasets, such as those integrating optical and radar satellite images [11,12]. Previous studies attempted to integrate different types of remotely sensed data and to use their unique capabilities to produce accurate LULC maps. For example, Clerici et al. [13] reported that optical and radar Sentinel data provide supplementary information; therefore, LULC classification can take advantage of the integration of both data sources leading to an increase images in LULC mapping, and (3) identifying the most suitable variables for LULC prediction.
The rest of the paper is organized as follows. In Section 2, the study area, datasets, and methodologies of LULC classification and accuracy assessment are described. The results and analyses are presented in Section 3. A discussion is presented in relation to other studies in Section 4, and finally, the paper concludes in Section 5.

Study Area
This study was carried out in the Tehran Province, including Tehran city and its suburbs, covering an area of 14,000 Km 2 located in the northcentral part of Iran at the southern face of the Alborz Mountains (Figure 1), spanning over 34° to 36°5′N and 50° to 53°E. This region is the most industrialized region in the country and has the highest population density (11,800 individuals/km 2 ) which is mainly concentrated in 10 cities. In general, this province is characterized by an arid climate [7,50] being cold and semi-humid in the northern areas and cold with long winters in the higher regions. Grasslands are located in the northern and western parts, while croplands and bare lands are mainly found in the southern and eastern parts of the region. The boundary of the study area was chosen in a way that could well represent a complex landscape and involved densely built-up and suburban residential areas, industrial cities, croplands, woodlands, grasslands, and bare lands.  In this study, the S-2 and L-8 Operational Land Imager (OLI) time series data were used for mapping the LULC classes. The L-8 OLI consists of nine spectral bands (coastal: 443 nm, blue: 485 nm, green: 563 nm, red: 655 nm, panchromatic: 640, NIR: 865 nm, short-wave infrared 1 (SWIR1): 1610 nm, SWIR2: 2200 nm, and cirrus: 1375 nm) (https:// www.usgs.gov/landsat-missions/landsat-8; accessed on 14 December 2021). S-2 provides high temporal resolution data with a rich spectral configuration, including 13 spectral bands. It has six land monitoring bands that are comparable with Landsat-8 (blue: 490 nm, green: 560 nm, red: 665 nm, NIR: 842 nm, SWIR1: 1910 nm, and SWIR2: 2190 nm) and three additional bands covering the red-edge part of the spectrum which are centered at 705 nm, 740 nm, and 783 nm, and a NIR narrow band at 865 nm (https://sentinel.esa.int/ web/sentinel/missions/sentinel-2; accessed on 14 December 2021).

Digital Elevation Model
Shuttle radar topography mission (SRTM) digital elevation model (DEM) with a resolution of 1 arc second (approximately 30 m) was used to extract the elevation and slope bands. The SRTM resulted from international cooperation between the National Aeronautics and Space Administration (NASA), the National Geospatial-Intelligence Agency (NGA), and German and Italian space agencies. SRTM provides a near-global DEM between 60 • N and 56 • S latitude, built on the data collected by a specially modified radar system onboard the Space Shuttle Endeavour (SSE) during 11 days in February 2000 (https://lpdaac.usgs.gov/products/srtmimgmv003/; accessed on 14 December 2021). Since all L-8 30 m and SRTM 30 m spatial resolution datasets were used in the analysis, all the bands were resampled to 10 m (S-2 resolution) and registered to match the S-2 georeferenced images. Moreover, in GEE, the scaling is executed automatically and all the bands are overlaid perfectly.

Reference Datasets and LULC Classes
In this study, the attempt was to develop an appropriate methodology to achieve the specific research objectives outlined above. Four datasets were prepared so as to be characterized by different time series feature set configurations based on S-2 and L-8 satellite images and two common composition methods. Generally, a high classification accuracy of the remotely sensed datasets requires large sets of training and validation samples. Therefore, a second step was that of generating a high number of training and validation samples to properly manage the issues of insufficient sample sizes and large numbers of dimensions [51,52]. Based on the above, an RF classifier was used to produce LULC maps and to evaluate the classification accuracy by a set of metrics. Since the accurate mapping of LULC classes based on machine learning methods requires a sufficient number of training samples [53], a visual inspection of high-resolution satellite imagery is a typical method used to extract training and validation samples [54,55]. In this study, a number of 3800 ground polygon samples (33,530 pixels) were defined based on a random distribution within LULC classes which included artificial land, cropland, woodland, grassland, bare land, and water bodies; this was done by a visual interpretation of the Google Earth highresolution satellite imagery ( Table 1). The characteristics of LULC classes are described in Table 1. Some studies indicated that the reference datasets should represent approximately 0.25% of the total study area [56,57]. Therefore, we used this proportion to collect our samples in each LULC class and there was an imbalance between them. All the samples were divided into training (60%) and validation (40%) subsets.  [16] was used to create image collections and process the time series. All of the 2020 S-2A/B level 2A and L-8 OLI surface reflectance products over the study area were processed to extract spectral-temporal metrics as predictors for LULC classification. All images with less than 30% cloud cover were filtered as a beginning step. After that, we used the S2cloudless algorithm (for S-2 images) and the function of mask (FMASK; for L-8 images) for masking pixel-wise cloud, cloud shadow, and snow from image collections. During the masking process for both satellites, the results were visually inspected and all parameters were redefined until the best result was obtained [58].
For the extraction of spectral-temporal metrics, two methods were considered to meet the research goals:

1.
Seasonal composites method: The median reducer was used to generate cloud-free seasonal composites [59]. Satellite images were filtered based on the climatological regime from the North of Iran, and took into consideration three seasons: spring (March, April, and May), summer (June, July, and August), and autumn (September, October, and November). Images from winter were discarded because of the high amounts of clouds and snow cover. This method was aimed at including the phenological information in LULC classification [60].

2.
Percentile metrics method: For each image collection, the percentile metric method constructs the histogram of feature collection and then calculates the specified percentiles of the feature distribution [61]. In this study, all the images from 2020 (March to November) were used to produce the 0.1, 0.25, 0.5, 0.75, and 0.95 percentile-based metrics for all spectral bands and indices.
In addition to the spectral bands, the following spectral indices were calculated: normalized difference vegetation index (NDVI; [62,63]), normalized difference built-up index (NDBI; [64,65]), and green normalized difference vegetation index (GNDVI; [66,67]). The spectral-temporal metrics were calculated (Table 2) by using the aforementioned strategy. In the classification process, these topography-based features were used to include the terrain attributes of the LULC classes [68].

Land Cover Classification and Accuracy Assessment
A number of 2280 ground polygon samples (Section 2.2.3) were used to extract perband pixel values of the four composited datasets. These samples were used to train the RF classifier. RF is an ensemble learning method based on a combination of decision trees [69]. This classifier was also found to perform better compared with other machine learning (ML) classifiers such as support vector machine (SVM) [70]. RF requires less processing time, fewer parameters, and minimal manual intervention compared with SVM [71,72]. It can also cope properly with multi-modal data [73] and implicitly performs spectral selection due to its underlying principle [69]. Based on previous findings [74,75], a RF classifier with 500 decision trees (i.e., ntree) was trained and tested on each dataset described in Table 2 to create LULC classifications. The assessment of classification accuracy was carried out by comparing the LULC classes resulted from the training phase with data yielded by the testing phase (numbers of ground polygon samples = 1520) using for this purpose confusion matrices. Based on the confusion matrices, global quality metrics such as the overall accuracy (OA) and kappa coefficient (K) were calculated (Equations (1) and (2)) to evaluate the impact of composition methods on LULC classification.
Overall Accuracy (OA) = Number of Correctly Classified Samples Number of Total Samples (1) Moreover, the class level consumer's accuracy (CA), producer's accuracy (PA), and F1-score were calculated (Equations (3)- (5)). The F1-score is the harmonic mean between producer's and user's accuracies and can be used to evaluate the accuracy at class level [76].
PA is the probability that a pixel was correctly classified in a given class. CA is the probability that a pixel classified in a given class of the map represents that class on the ground [77]. The F1 was found to be the best performance metric and is widely used in previous research, which gives equal importance to both PA (as a precision) and CA (as a recall) by combining them into a single model performance metric [78,79]. The methodology adopted in this study is provided in the flowchart shown in Figure 2.

Variable Importance
Variable importance stands for the variables' contribution to distinguish between LULC classes, which helps by improving the classification accuracy while reducing data redundancy and processing workload. In this study, variable importance was derived from the RF model to estimate the contribution of variables (i.e., spectral bands and indices) to the obtained accuracy of the model [80]. Remote Sens. 2022, 14, x FOR PEER REVIEW 7 of 19

Variable Importance
Variable importance stands for the variables' contribution to distinguish between LULC classes, which helps by improving the classification accuracy while reducing data redundancy and processing workload. In this study, variable importance was derived from the RF model to estimate the contribution of variables (i.e., spectral bands and indices) to the obtained accuracy of the model [80].    Table 3. Based on the results, the overall accuracy of all datasets was relatively close. The highest overall accuracy and K coefficient were reached by the S-2 seasonal composites (OA = 95.48%, K = 0.9387), closely followed by S-2 percentile metrics (OA = 95.34%, K = 0.9365), L-8 seasonal composites (OA = 94.30%, K = 0.9220), and L-8 percentile metrics (OA = 93.87%, K = 0.9116). Therefore, in terms of satellite images, the highest overall accuracy and K coefficient were reached by S-2 using seasonal and percentile composites. Moreover, seasonal composites produced slightly higher accuracies than percentile metrics (Table 3).  Figure 4 provides some false color composites with their associated seasonal composites to better evaluate the effect of phenological information for an accurate LULC classification. As observed, the phenological variation of LULC classes, particularly of croplands and woodlands, was provided effectively via seasonal composites.  The accuracy figures reached by the four datasets characterizing different composition methods are provided in Table 3. Based on the results, the overall accuracy of all datasets was relatively close. The highest overall accuracy and K coefficient were reached  Figure 4 provides some false color composites with their associated seasonal composites to better evaluate the effect of phenological information for an accurate LULC classification. As observed, the phenological variation of LULC classes, particularly of croplands and woodlands, was provided effectively via seasonal composites.

Class Level Accuracy Assessment
For a more in-depth evaluation, per-class producer (PA) and consumer (CA) accuracies and F1-scores are provided for all datasets in Table 4. Regarding the role of time series composition methods, for both S-2 and L-8 datasets, it can be observed that the CA, PA, and F1 scores were higher in all LULC classes when using seasonal composites rather than percentile metrics. The lowest CA, PA, and F1-score were calculated for bare lands and the highest ones were observed for water bodies in all datasets.

Class Level Accuracy Assessment
For a more in-depth evaluation, per-class producer (PA) and consumer (CA) accuracies and F1-scores are provided for all datasets in Table 4. Regarding the role of time series composition methods, for both S-2 and L-8 datasets, it can be observed that the CA, PA, and F1 scores were higher in all LULC classes when using seasonal composites rather than percentile metrics. The lowest CA, PA, and F1-score were calculated for bare lands and the highest ones were observed for water bodies in all datasets. The highest CA, PA, and F1-scores for all LULC classes were produced by dataset-1 (except grasslands). In contrast, dataset-4 had lower CA, PA, and F1 score values of all LULC classes (except grasslands). As observed in Table 4, the CA, PA, and F1 score values of artificial lands (AL) in all datasets were very high (F1-score ranged between 97.62 and 94.48%) and there were no contrasting differences between different datasets. Similar results were also observed in terms of woodland (WO) classification (F1-score varied between 95.54 and 94.52%), with the exception that PA values for all datasets were quite low (ranged between 93.18 and 92.16%). There were some omission errors in the woodland classification. Regarding the cropland (CR) classification, S-2 based datasets (seasonal composites and percentiles metrics) achieved higher CA, PA, and F1-score values than the L-8 datasets. A similar difference in accuracy was observed by comparing the values returned by the composition methods used for each satellite time series. The F1-scores produced for cropland based on S-2 and L-8 datasets (seasonal and percentiles) were of 95.55 and 94.06%, respectively. Based on the results (Table 3), the lowest CA, PA, and F1-score values were calculated for bare land in all datasets. The results also showed that the S-2 datasets had a higher capability of mapping bare lands than L-8 datasets. For example, the CA and PA values produced for bare land using S-2 seasonal composites (dataset-1) increased by nearly 8% and 7%, respectively, as opposed to L-8 seasonal composites (dataset-3). Moreover, an important difference was observed in CA, PA, and F-1 scores of bare land classification between these datasets. These differences were also observed between dataset-2 (S-2 percentiles) and dataset-4 (L-8 percentiles). In terms of grassland classification, the best result was obtained with dataset-2, in which we used S-2 percentiles. But there were no remarkable differences between all datasets in grassland mapping accuracy. Figure 5 provides some finer-scaled partitions of the maps to better evaluate the differences between the datasets in identifying LULC classes. As shown, there were problems in all datasets to distinguish between bare and artificial land, grassland, and harvested croplands, which remain challenging to differentiate due to similar spectral properties.

Variable Importance
In all datasets, elevation, slope, and vegetation indices were found to be the most important variables in RF models ( Figure 6). In regard to S-2 seasonal composites, S-2 new spectral bands such as B8A, B5, and B6 were among the most important variables. In the case of L-8 seasonal datasets (datasets 3 and 4), in addition to vegetation indices, spectral bands including B5, B4, and B2 had a higher importance than other bands in LULC pre-

Variable Importance
In all datasets, elevation, slope, and vegetation indices were found to be the most important variables in RF models ( Figure 6). In regard to S-2 seasonal composites, S-2 new spectral bands such as B8A, B5, and B6 were among the most important variables. In the case of L-8 seasonal datasets (datasets 3 and 4), in addition to vegetation indices, spectral bands including B5, B4, and B2 had a higher importance than other bands in LULC prediction.

Discussion
Spatial distribution of LULC classes is often related to topographic factors [9,81]. Therefore, we considered elevation and slope factors in training and testing the RF models. As a result, in all RF models, elevation factors obtained high importance scores. The importance of topographic variables in large-scale LULC classification was also reported in similar studies [39,82]. For example, Rufin et al. [83] used topographic factors (elevation and slope) in their LULC model and indicate their high importance in increasing the overall accuracy. In another study, Htitiou et al. [30] used elevation and slope layers in the attempt to map croplands at national scale. Based on their results, these variables improved the ability to distinguish croplands, particularly in regions characterized by high elevation and steep slopes. Phan et al. [84] defined several datasets based on L-8 time series and auxiliary bands such as topographic factors and reported that the elevation was the most important variable in all of the models. Therefore, the high importance of topographic factors used in this study was consistent with the results reported by previous studies.
Regarding the composition methods used for S-2 and L-8 time series, the results showed that seasonal composites outperformed percentile metrics in distinguishing different LULC classes. In terms of accurate LULC mapping, the strategy used to generate composites depends on the types of LULC classes and the density of available time series [44]. The higher performance of seasonal composites in this study can be related to the selected LULC classes which covered cropland, woodland, and grassland. The use of seasonal composites enabled the inclusion of phenological information. These kinds of seasonality features were a piece of sufficient and valuable information for distinguishing the LULC classes with similar spectral attributes. The importance of phenological information was also reported in similar studies. Xie et al. [44] defined two datasets based on monthly median composites and percentile metrics to classify different LULC classes including various types of vegetation such as evergreen forests, deciduous forests, shrublands, croplands, etc., using Landsat TM/ETM+ time series and reported that the highest overall accuracy was produced by using monthly median composites.
When comparing the two satellite time series, accuracy assessment results indicated that S-2 composition methods (seasonal composites and percentiles metrics) produced higher accuracies than their L-8 counterparts did. The L-8 time series used in this study had similar spectral features to S-2 time series, excluding the red-edge bands of S-2. Therefore, the better results provided by the S-2 could be related to the red-edge bands and their spectral-temporal metrics. The feasibility and efficiency of red-edge bands were also present in variable importance scores. As Figure 6 illustrates, in both seasonal and percentile composites, red edge bands (B-7, 6, and 5) played a significant role in the performance of the RF classifier. Forkuor et al. [85] defined several datasets with different feature configurations based on mono date S-2 and L-8 bands. Based on their results, the S-2 dataset produced the highest accuracy. To evaluate the effect of red edge bands on the classification results, they also added S-2 red edge bands to the L-8 dataset and reported a 4% improvement compared with L-8 bands. Moreover, Immitzer et al. [86] reported that red-edge bands, particularly B-5 (RE1), are among the most important data features in mapping vegetation types such as croplands and woodlands. Ghayour et al. [87] compared the performance of S-2 and L-8 satellite images in LULC mapping by using classifiers such as SVM, artificial neural network (ANN), maximum likelihood classification (MLC), minimum distance (MD), and Mahalanobis distance (MHD). According to their findings, S-2 produced higher accuracies compared with the L-8 datasets irrespective of the classifier used.
Analyzing the confusion matrices indicated that bare lands were poorly identified and most of them were classified as grasslands. Moreover, some pixels of the bare land were misclassified as artificial lands. This problem occurred more often in the L-8 based datasets (dataset-3 and dataset-4). Therefore, the largest difference between the S-2 and L-8 composites was that observed in the classification of bare lands. The classification of bare lands, on the other hand, is a common challenge in LULC mapping studies. Zhao and Zhu [88] reported that some LULC classes such as artificial lands, bare lands, deserts, and harvested croplands have similar spectral signatures and tried to develop a spectral index to better distinguish these classes. They introduced the artificial surface index (ASI) by considering the artificial surface enhancement, vegetation, and bare soil suppressing and reported that ASI could increase the separability of artificial lands from other types of land use. Ettehadi Osgouei et al. [89] faced the same challenge and tried to use spectral indices, including the normalized difference tillage index (NDTI), red-edge based NDVI, and modified normalized difference water index (MNDWI). They reported that these spectral indices could improve the performance of the SVM classifier to distinguish bare land from urban areas. We used similar spectral indices in this study, including the NDVI, GNDVI, and NDBI, but it seems that this challenge still remains. Nevertheless, these spectral indices showed a good performance in differentiating other classes with similar spectral signatures, such as woodland and dense croplands.
In this study, croplands were composed of irrigated and rain-fed crops, which hold heterogeneous spectral patterns [90]. Considering all these spectral similarities and heterogeneous patterns, our approach and the used spectral-temporal metrics could enhance the classification accuracy. In some studies, the integrated use of textural and spectral features was found to be a solution to distinguish LULC classes with similar spectral properties. For example, Petrushevsky et al. [91] integrated S-2 multispectral bands, Sentinel-1 texture features, and object-based image analysis to generate an urban mask and reached a 90% overall accuracy. The radar signal is sensitive to geometry (e.g., roughness, texture, and internal structure), while physiology influences optical reflectance [92]. As such, these two spectral domains may provide complementary information [93]. Hence, the integration of radar time series and textural feature is recommended for future research [94,95].
Under the GEE platform, the satellite image processing time can be greatly shortened, which helps users to store decades of data and removes the need to download the satellite images one by one. However, one of the limitations of our study was the limited availability of data with suitable resolution in the Google Earth catalog, whose improvement in this direction is desirable for both researchers and practitioners. The second limitation of this study was the lack of precise training and validation samples, making it impossible for us to conduct our study on a much larger scale. These samples were collected automatically from existing reference datasets such as the land use/cover area frame survey (LUCAS) in similar studies. There was no access to accurate reference datasets or reliable LULC maps in our study area. Therefore, we used very high resolution (VHR) satellite images provided by Google Earth and visual interpretation, which are very time-consuming and challenging, particularly when dealing with large areas. Moreover, based on the previous studies and our results, topographic bands (such as elevation and slope) are very important variables in increasing the overall accuracy. In this study, we used freely available SRTM DEM with a resolution of approximately 30 m. Considering the importance of topographic variables, using more accurate DEMs could considerably increase the classification results. Unfortunately, we didn't have access to such high-resolution DEM in this study (third limitation). Therefore, future studies could use more accurate DEM, especially when dealing with mountainous landscapes. Future research may compare the existing land use products, such as those from this paper, with global products (e.g., ESA WorldCover and GlobCover maps). Since the classification accuracy may be affected by the type of machine learning model used, future studies could compare the results of other popular classifiers for LULC mapping such as SVM, ANN, and deep learning models. Finally, we suggest comparing balanced and imbalanced reference datasets in terms of classification accuracy as well as considering larger study areas with different climate conditions.

Conclusions
This study aimed at evaluating the potential of S-2 and L-8 time series and the RF classifier to produce accurate LULC maps. The S-2 and L-8 time series were used to generate spectral-temporal metrics based on two composition methods (seasonal and percentiles) to achieve the research objectives. Datasets, including S-2 and L-8 spectral temporal metrics and vegetation indices were defined, then their performance in identifying six common LULC classes (artificial land, water bodies, woodland, cropland, bare land, and grassland) was compared. The study concludes that medium resolution satellite time series and the extracted spectral-temporal metrics are robust sources for accurate LULC mapping. However, some differences among the datasets were observed. For instance, the LULC maps generated from the S-2 time series were more accurate than those generated from the L-8 time series. The comparison between composition methods indicated that seasonal median composites outperformed percentile metrics in both S-2 and L-8 time series. The results proved the efficiency of vegetation indices, particularly of NDVI and NDBI and S-2 red-edge bands, concerning the variable importance. In short, the approach used in this research and the generated spectral-temporal metrics can be used to produce accurate LULC maps based on the GEE cloud computing platform. These results highlight that cloud computing platforms such as GEE and new Earth-observing satellites such as S-2 have contributed significantly in improving LULC mapping and monitoring.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: