A Hybrid Data Balancing Method for Classification of Imbalanced Training Data within Google Earth Engine: Case Studies from Mountainous Regions

The distribution of Land Cover (LC) classes in mountainous areas is mostly imbalanced, with a few majority classes dominating the minority classes. Although standard Machine Learning (ML) classifiers can achieve high accuracies for majority classes, they largely fail to provide reasonable accuracies for minority classes, mainly because of this class imbalance problem. In this study, a hybrid data balancing method, called Partial Random Over-Sampling and Random Under-Sampling (PROSRUS), was proposed to resolve the class imbalance issue. Unlike most data balancing techniques, which seek to fully balance datasets, PROSRUS uses a partial balancing approach with hundreds of fractions for majority and minority classes. For this, time-series of Landsat-8 imagery and SRTM elevation data, along with various spectral indices and topographic products derived from them, were used over three mountainous sites within the Google Earth Engine (GEE) cloud platform. It was observed that PROSRUS performed better than several other balancing methods and increased the accuracy of minority classes without reducing the overall classification accuracy. Furthermore, adopting complementary information, particularly topographic data, considerably increased the accuracy of minority classes in mountainous areas. Finally, the results obtained from PROSRUS indicated that every imbalanced dataset requires its own fraction(s) for addressing the class imbalance problem, because datasets differ in their characteristics.


Introduction
Mountains, covering a quarter of the Earth's land surface, are globally distributed from the Tropics to the poles and from maritime to continental environments [1]. Obtaining up-to-date and accurate information on Mountain Land Cover (MLC) types is important for various applications, including studies of global warming and environmental change [2][3][4]. Moreover, MLC data are a vital part of the assessment and management of natural hazards (e.g., landslides and wildfires) [5][6][7]. Considering the large extent and limited accessibility of mountainous areas, Remote Sensing (RS) datasets are well-suited for mapping MLC classes. This is mainly due to the global coverage, the availability of various spatial and spectral resolutions, and the frequent observations of RS systems [8][9][10].
The RS community has so far examined various datasets and methodologies to meet users' requirements for generating accurate MLC maps [11][12][13][14]. The advent of state-of-the-art Machine Learning (ML) techniques has particularly helped the RS community to improve the accuracy of MLC

Overall Workflow
As illustrated in Figure 2, the overall workflow of this study, which was implemented in GEE, consists of five main steps: (1) acquiring time-series of Landsat-8 OLI imagery and SRTM data and generating complementary data (i.e., spectral indices and topographic products) within the GEE platform; (2) generating reference samples and splitting them into three groups based on the number of samples (i.e., majority classes, middle classes, and minority classes); (3) selecting the best spectral and topographic features for LC classification and assessing the effects of various features on the accuracy of minority classes using the RF classifier; (4) applying the PROSRUS method using 200 different fractions; and (5) assessing the accuracy of PROSRUS and comparing it with those of the Random Over-Sampling (ROS) [24], Random Under-Sampling (RUS) [37], Synthetic Minority Over-sampling Technique (SMOTE) [21], and Geometric SMOTE (G-SMOTE) [38] techniques.

Acquiring Landsat and Elevation Data, and Generating Complementary Data Within the GEE Platform
Time-series of Landsat-8 surface reflectance Tier 1 products (ID: LANDSAT/LC08/C01/T1_SR) with less than 10% cloud coverage, acquired between May and October 2019, were used in this study. A total of 9, 13, and 15 Landsat-8 scenes were processed for Site-1, Site-2, and Site-3, respectively (refer to Appendix A for more information). Of the available Landsat-8 spectral bands, six (i.e., Bands 2-7) were used. A median function, which can remove noisy, very dark, and very bright pixels [39], was applied to produce a single Landsat-8 mosaic image for each experiment site. Several spectral indices, including the Normalized Difference Vegetation Index (NDVI), Normalized Difference Water Index (NDWI), Soil-Adjusted Vegetation Index (SAVI), and Normalized Difference Built-up Index (NDBI) (see Table 1), were also generated from the Landsat-8 imagery to investigate the effect of spectral indices on the overall classification accuracy and on the accuracies of the minority classes. NDVI, which yields an image showing relative biomass, has been widely applied in LC mapping [40]. NDBI, together with NDVI, has been shown to be effective for identifying urban built-up areas and discriminating them from other land cover types (e.g., trees and grassland) [41]. NDWI helps to distinguish water bodies from other objects, such as soil and terrestrial vegetation [42]. SAVI helps to discriminate soil-vegetation systems [43]. Furthermore, the Shuttle Radar Topography Mission (SRTM) data, which are available in the GEE platform (ID: USGS/SRTMGL1_003), were used to generate complementary topographic information, including elevation, slope, and aspect. The effects of these topographic products on the classification accuracy were also investigated.
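The compositing and index steps described above can be illustrated for a single pixel with the following minimal sketch. It is not the GEE code used in the study. Table 1 is not reproduced in this text, so the index formulas below are the commonly used definitions and are an assumption here (in particular, NDWI is taken in its green/NIR form and SAVI uses a soil factor of 0.5). The function names and the reflectance values are hypothetical, and reflectances are assumed to be already scaled to the 0-1 range.

```python
from statistics import median


def median_composite(series):
    """Per-band median over a pixel's time series (list of {band: reflectance})."""
    bands = series[0].keys()
    return {band: median(obs[band] for obs in series) for band in bands}


def spectral_indices(pixel, soil_factor=0.5):
    """Commonly used index definitions (assumed; Table 1 is not shown here)."""
    # Landsat-8 OLI: B3 = green, B4 = red, B5 = NIR, B6 = SWIR1
    green, red, nir, swir1 = pixel["B3"], pixel["B4"], pixel["B5"], pixel["B6"]
    return {
        "NDVI": (nir - red) / (nir + red),
        "NDWI": (green - nir) / (green + nir),
        "SAVI": (1 + soil_factor) * (nir - red) / (nir + red + soil_factor),
        "NDBI": (swir1 - nir) / (swir1 + nir),
    }


# Hypothetical observations of one vegetated pixel; the third is very bright
# (e.g., residual cloud) and is suppressed by the median.
series = [
    {"B3": 0.08, "B4": 0.06, "B5": 0.40, "B6": 0.20},
    {"B3": 0.10, "B4": 0.05, "B5": 0.45, "B6": 0.25},
    {"B3": 0.60, "B4": 0.60, "B5": 0.60, "B6": 0.60},
]
composite = median_composite(series)
indices = spectral_indices(composite)
print(composite)
print(indices)
```

Because the median is taken band by band, a single very bright or very dark observation does not shift the composite, which is the property the text attributes to the median function.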

Generating Reference Samples and Splitting Them into Majority, Middle, and Minority Groups
Collecting in-situ samples in mountainous areas is often labor-intensive and expensive. However, a reliable reference dataset is a basic requirement for accurate supervised Land Cover (LC) classification. Therefore, the reference samples over the experiment sites were generated by careful visual interpretation of very high spatial resolution Google Earth images (Figure 3). The specifications of the LC Classification System, developed by the Food and Agriculture Organization of the United Nations [46], were followed when generating the reference samples. This system includes nine LC types: Forest, Grassland, Shrub land, Cultivated land, Artificial land, Water bodies, Wetland, Permanent snow/ice, and Bare land. Based on the distributions of MLC types, 1089, 970, and 1044 samples were generated for Site-1, Site-2, and Site-3, respectively. It should be noted that MLC classes covering larger areas received proportionally more samples. Finally, the generated reference samples were randomly divided into two groups, training and validation (50% each).
To split the generated reference samples into three groups (i.e., Majority, Middle, and Minority), the Highest Number of Samples (HNS) among the classes in each experiment site was first identified (i.e., the Forest class with 244 samples in Site-1, the Bare land class with 326 samples in Site-2, and the Forest class with 280 samples in Site-3). Then, the class(es) with between 70% and 100% of the HNS were grouped as the Majority Class; the class(es) with between 35% and 70% of the HNS were grouped as the Middle Class; and the class(es) with between 0% and 35% of the HNS were grouped as the Minority Class.
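The grouping rule can be written as a short function. This is a minimal sketch rather than the authors' implementation. The text does not say to which group a class lying exactly on the 70% or 35% boundary belongs, so assigning it to the higher group is an assumption here. The function name is hypothetical, and apart from the Forest count of 244 for Site-1, the sample counts are hypothetical.

```python
def group_classes(sample_counts, majority_cut=0.70, minority_cut=0.35):
    """Assign each class to a group by its share of the Highest Number of Samples."""
    hns = max(sample_counts.values())
    groups = {"majority": [], "middle": [], "minority": []}
    for name, count in sample_counts.items():
        share = count / hns
        if share >= majority_cut:      # boundary handling is an assumption
            groups["majority"].append(name)
        elif share >= minority_cut:    # boundary handling is an assumption
            groups["middle"].append(name)
        else:
            groups["minority"].append(name)
    return groups


# Forest = 244 is the HNS reported for Site-1; the other counts are hypothetical.
counts = {"Forest": 244, "Bare land": 180, "Shrub land": 120, "Wetland": 40}
groups = group_classes(counts)
print(groups)
```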
Remote Sens. 2020, 11, x FOR PEER REVIEW 5 of 20

Selecting Best Classification Scenario Based on the Optimum Features
Among the available ML algorithms, RF has drawn considerable attention in LC mapping [8,47]. This is mainly due to its high performance, its availability in different computing environments, and its low sensitivity to noisy data [48,49]. RF combines multiple decision trees to classify the input data [50,51]. Moreover, it resamples the input dataset several times to avoid overfitting [5,50]. To obtain the most accurate RF model, two main parameters should be carefully tuned: (1) the number of trees in the forest (ntree), and (2) the number of variables available for splitting at each tree node (mtry). In this study, after several rounds of trial and error to find the optimum values of these parameters, ntree and mtry were set to 500 and the square root of the total number of input features, respectively.
Four well-known spectral indices (NDVI, NDWI, SAVI, and NDBI), along with topographic features (elevation, slope, and aspect), were used to identify the best classification scenario. The optimum spectral and topographic features were selected based on the results of RF classifications applied to four scenarios. Additionally, the effects of the complementary datasets (i.e., spectral indices and topographic features) on the accuracy of MLC mapping, especially on the accuracies of the minority classes, were investigated. After comparing the results of the four scenarios and selecting the optimal input features (i.e., the scenario with the best result), the proposed PROSRUS method was implemented to address the class imbalance problem.
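As a small illustration of the parameter setting, the sketch below lists one plausible composition of the four scenarios and the resulting mtry. The scenario contents are inferred from how the scenarios are described in the Results (Scenario-1 with spectral bands only, Scenario-2 adding spectral indices, Scenario-3 adding topographic features, Scenario-4 adding both) and should be read as an assumption. Rounding the square root to the nearest integer is also an assumption, since the text does not state how non-integer values were handled. All names in the code are hypothetical.

```python
import math

BANDS = ["B2", "B3", "B4", "B5", "B6", "B7"]
INDICES = ["NDVI", "NDWI", "SAVI", "NDBI"]
TOPOGRAPHY = ["elevation", "slope", "aspect"]

# Assumed scenario composition, inferred from the Results section.
SCENARIOS = {
    "Scenario-1": BANDS,
    "Scenario-2": BANDS + INDICES,
    "Scenario-3": BANDS + TOPOGRAPHY,
    "Scenario-4": BANDS + INDICES + TOPOGRAPHY,
}

NTREE = 500  # value reported in the text


def rf_settings(features):
    """ntree as reported; mtry = sqrt(number of features), rounded (assumed)."""
    return {"ntree": NTREE, "mtry": max(1, round(math.sqrt(len(features))))}


settings = {name: rf_settings(features) for name, features in SCENARIOS.items()}
for name, params in settings.items():
    print(name, len(SCENARIOS[name]), "features", params)
```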

Applying PROSRUS Method
In this study, a hybrid data balancing method, called PROSRUS, was proposed. PROSRUS combines two well-known data-level balancing methods, ROS [24] and RUS [37]. ROS, a straightforward oversampling technique, randomly duplicates samples from the minority class(es) to balance the class distribution. Fully balancing an imbalanced dataset with this method can cause the classifier to overfit because of the duplication [52]. RUS, on the other hand, randomly deletes samples from the majority class(es) to adjust the class distribution. The main shortcoming of fully balancing a dataset with RUS is that valuable information may be lost [23].
The proposed hybrid method not only takes advantage of both ROS and RUS, but also limits their disadvantages by examining 200 different fractions in the balancing scheme. More specifically, as shown in Figure 4, the original data were first divided into three groups based on the number of samples in the different LC classes: Group-1 (minority classes), Group-2 (middle classes), and Group-3 (majority classes). Subsequently, 200 different fractions, chosen after several rounds of trial and error (any other preferred set of fractions can be defined), were applied to balance the LC classes, and the optimal fraction(s) among them were identified. In this partial balancing approach, ROS was used to oversample Group-1 and RUS was used to under-sample Group-3, while the samples of Group-2 were left unchanged. For example, in fraction-1, 10% of the samples of Group-3 (90% of the samples removed using RUS), 100% of Group-2 (unchanged), and 110% of Group-1 (10% new samples added using ROS) contributed to the balancing process. The code for applying PROSRUS in the GEE platform is available in the Supplementary Material.
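The partial balancing step for one fraction can be sketched in plain Python as follows. This is only an illustration of the idea under stated assumptions, not the GEE code from the Supplementary Material. The function name, its arguments, the fixed random seed, the rounding of sample counts, and the toy sample lists are all hypothetical.

```python
import random


def prosrus(samples_by_class, groups, over_fraction, under_fraction, seed=0):
    """Partially balance one dataset for a single (over, under) fraction pair.

    over_fraction:  target size of each minority class relative to its
                    original size (e.g., 1.10 means 10% duplicated samples).
    under_fraction: share of each majority class that is kept (e.g., 0.10).
    """
    rng = random.Random(seed)
    balanced = {}
    for name, samples in samples_by_class.items():
        if name in groups["minority"]:
            # ROS: duplicate randomly chosen existing samples (with replacement)
            n_extra = round(len(samples) * (over_fraction - 1.0))
            balanced[name] = list(samples) + rng.choices(samples, k=n_extra)
        elif name in groups["majority"]:
            # RUS: keep a random subset (without replacement)
            n_keep = round(len(samples) * under_fraction)
            balanced[name] = rng.sample(samples, k=n_keep)
        else:
            # Middle classes are left unchanged
            balanced[name] = list(samples)
    return balanced


# Toy data: integers stand in for training samples (hypothetical counts).
samples_by_class = {
    "Forest": list(range(200)),
    "Shrub land": list(range(100)),
    "Wetland": list(range(40)),
}
groups = {"majority": ["Forest"], "middle": ["Shrub land"], "minority": ["Wetland"]}

# Fraction-1 from the text: 10% of Group-3, 100% of Group-2, 110% of Group-1.
balanced = prosrus(samples_by_class, groups, over_fraction=1.10, under_fraction=0.10)
print({name: len(samples) for name, samples in balanced.items()})
```

In this sketch the oversampled class contains only copies of existing samples, so no synthetic samples are created, which matches the description of ROS above.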

Accuracy Assessment and Comparison
The accuracy of the MLC maps obtained using the proposed PROSRUS method was evaluated using the OA, User's Accuracy (UA), and Producer's Accuracy (PA) measures. Since OA is affected more by the majority classes than by the minority ones [25], the Geometric Mean (G-Mean) index was also applied for accuracy assessment. G-Mean is particularly suitable for evaluating a classification with a class imbalance problem because it places more weight on the accuracy of the minority classes [53]. Accordingly, the G-Mean of PA (GM-PA) and the G-Mean of UA (GM-UA) were also calculated.
The results of PROSRUS were also compared with those of four well-known balancing techniques: ROS, RUS, SMOTE, and G-SMOTE. To this end, RF along with each of these data balancing techniques was applied to the optimum features (i.e., the best scenario discussed in Section 3.4). For comparison purposes, the methods were named RF-PROSRUS, RF-ROS, RF-RUS, RF-SMOTE, and RF-G-SMOTE.
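The following minimal sketch shows how these measures could be computed from a confusion matrix. It assumes that G-Mean is the k-th root of the product of the k per-class accuracies and that rows of the matrix hold reference classes while columns hold predicted classes; both conventions, the function names, and the toy matrix are assumptions rather than details taken from the study.

```python
import math


def geometric_mean(values):
    """k-th root of the product of k values."""
    return math.prod(values) ** (1.0 / len(values))


def accuracy_metrics(matrix):
    """matrix[i][j] = validation samples of reference class i predicted as class j."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(k))
    # Producer's accuracy: correct / reference total (row sum)
    pa = [matrix[i][i] / sum(matrix[i]) for i in range(k)]
    # User's accuracy: correct / predicted total (column sum)
    ua = [matrix[i][i] / sum(matrix[r][i] for r in range(k)) for i in range(k)]
    return {
        "OA": correct / total,
        "PA": pa,
        "UA": ua,
        "GM-PA": geometric_mean(pa),
        "GM-UA": geometric_mean(ua),
    }


# Hypothetical two-class case: one majority class (100 reference samples)
# and one minority class (10 reference samples).
matrix = [
    [90, 10],
    [5, 5],
]
metrics = accuracy_metrics(matrix)
print(metrics)
```

In this toy case the OA is about 0.86 although only half of the minority samples are recovered, whereas GM-PA drops to about 0.67, which illustrates why a G-Mean measure is reported alongside OA. The sketch does not handle a class that is never predicted (zero column sum).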

Results
After grouping the LC classes based on the number of samples over each experiment site (Table 2), the impacts of the complementary information and of the balancing techniques on MLC classification were investigated as follows.
Table 2. Grouping the land cover classes based on the number of samples over each experiment site.

Optimum Classification Scenario
The effects of different complementary information, such as spectral indices (see Table 1) and topographic data (elevation, slope, and aspect), on the accuracy of minority classes in mountainous areas were investigated using the four classification scenarios explained in Section 3.2. As is clear from Figure 5, including complementary information considerably improved the accuracy of MLC classification, particularly for minority classes. Scenario-4 (time-series of Landsat images + original imbalanced data + topographic features + spectral indices) resulted in the highest accuracy. The OAs, GM-UAs, and GM-PAs of this scenario ranged between 87.3%-93.8%, 85.6%-91.6%, and 82.6%-89.4%, respectively, over the three experiment sites. As shown in Figure 5, all three overall accuracy metrics (i.e., OA, GM-UA, and GM-PA) generally had their highest values under Scenario-4. For example, in Site-1, OA, GM-UA, and GM-PA increased from 80% to 87.3%, 76.3% to 85.6%, and 70.7% to 82.6%, respectively, compared to when only the spectral bands of Landsat-8 were used (i.e., Scenario-1).
Although both topographic features (Scenario-3) and spectral indices (Scenario-2) improved all three accuracy metrics relative to the simple RF (Scenario-1), topographic data had a higher impact than spectral indices on the MLC classification results (see Figure 5). The OAs, GM-UAs, and GM-PAs of Scenario-3 ranged between 86.2%-92.7%, 85.3%-90.8%, and 81.7%-88.9%, respectively, over the three experiment sites. Those of Scenario-2 ranged between 83.4%-88%, 79.1%-85.3%, and 76.3%-79.7%, respectively.
Regarding different MLC types, minority classes responded more strongly to the inclusion of topographic and spectral features (Table 3). For example, regarding UA values, the highest improvements compared to Scenario-1 were observed for minority classes in two (out of three) experiment sites: the Grassland class in Site-1 (18.9%) and the Wetland class in Site-3 (17.8%). According to the PA values, the highest improvements were also achieved by minority classes: the Wetland class in Site-1 (31.2%), the Wetland class in Site-2 (18.7%), and the Artificial land class in Site-3 (28.5%). The results indicated that including complementary information in the classification procedure was necessary to improve not only the overall classification accuracy but also the individual class accuracies, especially those of the minority MLC types.
Table 3. Effects of Scenario-4 on User's Accuracy (UA) and Producer's Accuracy (PA) values over the three experiment sites. Increases and decreases in the accuracies are indicated by + and − signs, respectively (refer to Appendix B for more detailed information).

Comparison of Balancing Techniques
The proposed method, along with four other balancing techniques (i.e., ROS, RUS, SMOTE, and G-SMOTE), was applied over the three experiment sites to study the impact of different balancing techniques on the accuracy of MLC classification. The results of these investigations are discussed below.

Site-1
In Site-1, the proposed PROSRUS with fraction number 190 showed the best performance (Figure 6). This fraction used 210%, 100%, and 100% of Group-1 (minority classes), Group-2 (middle classes), and Group-3 (majority classes), respectively. As is clear from Figure 6, compared with Scenario-4 with imbalanced samples, it improved the GM-PA, GM-UA, and OA values by approximately 3.5%, 1.2%, and 1.2%, respectively. This demonstrated the potential of the proposed balancing method to provide high accuracies for both majority and minority classes. RF-G-SMOTE yielded the second-best result, with OA = 86.6%, GM-UA = 84.03%, and GM-PA = 83.01%. Unlike PROSRUS-190 and RF-G-SMOTE, which increased all three overall metrics, the other three resampling techniques (i.e., RF-SMOTE, RF-RUS, and RF-ROS) increased the accuracy of minority classes at the price of a reduction in OA. For example, although RF-ROS increased GM-PA by approximately 1.6%, it reduced OA by approximately 1.1%. The reduction was even larger for RF-SMOTE (i.e., a decrease of 1.5% in OA).
[...] (Figure 7). However, two classes, Artificial land (0.9%) and Grassland (1.9%), experienced decreases. RF-G-SMOTE, as the second-best method, improved the UA values of the Bare land class by 0.5% and the Shrub land class by 2.8%, while it decreased the UA values of four classes: Artificial land (1.3%), Grassland (10%), Cultivated land (0.1%), and Wetland (7.5%). Regarding the PA values, the proposed method improved the values of three classes (Artificial land = 5.9%, Grassland = 18.5%, and Wetland = 6.3%). However, the PA values of the Bare land, Cultivated land, and Shrub land classes decreased by 1.9%, 3.3%, and 1.3%, respectively, using the proposed RF-PROSRUS method.

Site-2
In Site-2, fraction number 26, which used 130% of Group-1 (minority classes), 100% of Group-2 (middle classes), and 70% of Group-3 (majority classes), showed the best performance (Figure 8). The proposed method increased the GM-PA, GM-UA, and OA values by approximately 4.5%, 4.2%, and 1.2%, respectively, in comparison to Scenario-4. This confirmed the high potential of PROSRUS in dealing with the class imbalance problem. Similar to Site-1, RF-G-SMOTE showed the second-best results, with OA = 93.86%, GM-UA = 91.67%, and GM-PA = 92.88%.

Site-3
In Site-3, fraction number 74, which used 180% of Group-1 (minority classes), 100% of Group-2 (middle classes), and 50% of Group-3 (majority classes), had the best performance in improving MLC classification using the proposed method (Figure 10). Reaching OA = 92.15%, GM-UA = 91.53%, and GM-PA = 88.85%, the proposed method increased these overall accuracy metrics by approximately 1.6%, 1.7%, and 5.3%, respectively, in comparison to Scenario-4. RF-G-SMOTE outperformed the remaining three resampling methods and took second place.

Discussion
GEE has markedly advanced LC mapping studies by providing a huge number of geospatial datasets, in particular the archive of Landsat data [54,55]. In this study, 37 Landsat-8 OLI scenes and SRTM data were used to study the potential of balancing methods for MLC classification. The GEE platform made the classification process faster and easier because of several factors, such as the availability of atmospherically corrected time-series of Landsat data, high-performance computing capability, image-based functions, and the RF algorithm integrated into the GEE API.
The experiments demonstrated that adopting complementary information improves the accuracy of MLC classification. Using all spectral and topographic features (i.e., slope, elevation, aspect, NDVI, NDWI, NDBI, and SAVI) increased the average OA, GM-UA, and GM-PA by 7%, 7.2%, and 10.2%, respectively. This can be explained by the fact that both topographic data and spectral indices provided important information, which in turn improved the MLC mapping accuracy [56,57]. Comparing the four scenarios over the three experiment sites showed that, although Scenario-4 (i.e., integrating spectral indices and topographic data) achieved the highest accuracies, topographic data had a larger impact on MLC classification than the spectral indices (see Figure 12). This corresponds well to multiple studies, such as [29,58,59]. Based on Figure 5, among the three accuracy metrics, GM-PA showed the highest improvement (Site-1 = 11.9%, Site-2 = 12.4%, and Site-3 = 6.5%) after adopting the complementary information. It can be concluded that including spectral and topographic features had a larger effect on the accuracy of minority classes.
It was also observed that PROSRUS outperformed all other data balancing techniques, including ROS, RUS, SMOTE, and G-SMOTE. PROSRUS combined with the RF algorithm improved the average OA by approximately 1.3% across all experiment sites (Figure 13). Even higher improvements were observed in the GM-UA and GM-PA values after adopting the proposed method (approximately 1.8% and 4.6%, respectively). This might be attributed to two main factors: (1) PROSRUS only duplicated samples from minority classes and did not generate artificial samples, whereas generating artificial samples, as some balancing methods do, can sometimes lead to misclassification [24]; (2) PROSRUS partially balanced the dataset to find the optimal fraction(s) for addressing the class imbalance problem, which reduced the drawbacks of fully balancing datasets using ROS and RUS (e.g., overfitting for full ROS [52] and loss of critical information for RUS [60]).
Based on previous studies, improving the accuracy of minority classes usually leads to a decrease in OA. For example, Waldner et al. [25] reported that "the price for increasing the accuracy of minority classes was a decrease in OA". However, among all five resampling methods, PROSRUS was the only one that improved the accuracy of minority classes without a reduction in OA at any experiment site (i.e., Site-1 = 1.57%, Site-2 = 1.23%, and Site-3 = 1.17%). Our experiments also confirmed that G-SMOTE outperformed SMOTE in most cases, in agreement with [27]; that ROS achieved higher accuracies than RUS, confirming the findings of [61]; and that ROS achieved lower accuracies than SMOTE, in agreement with [16].
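The two factors discussed above (duplicating existing minority samples rather than synthesizing new ones, and balancing only partially) can be illustrated with a minimal Python sketch. It is not the PROSRUS implementation used in this study, which runs within GEE. The function name `partial_resample`, the parameters `over_frac` and `under_frac`, the use of the mean class size as the balancing target, and the toy class names are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def partial_resample(samples, labels, over_frac, under_frac, seed=0):
    """Partially balance a labelled sample set (illustrative sketch only).

    Assumed fraction semantics (hypothetical, not taken from the paper):
      - a class smaller than the mean class size is grown toward the mean
        by `over_frac` of the gap, by duplicating its own samples;
      - a class larger than the mean is shrunk toward the mean by
        `under_frac` of the gap, by randomly dropping samples.
    Fractions of 1.0 give a fully balanced set; 0.0 leaves it unchanged.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)

    target = len(labels) / len(by_class)  # mean class size
    keep = []
    for lab in sorted(by_class):
        idx = by_class[lab]
        n = len(idx)
        if n < target:
            n_new = round(n + over_frac * (target - n))
        else:
            n_new = round(n - under_frac * (n - target))
        if n_new > n:
            # Random over-sampling: duplicate existing samples, no synthesis.
            idx = idx + rng.choices(idx, k=n_new - n)
        elif n_new < n:
            # Random under-sampling: keep a random subset without replacement.
            idx = rng.sample(idx, n_new)
        keep.extend(idx)
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Toy imbalanced labels (made-up class names and counts).
labels = ["forest"] * 100 + ["grass"] * 20 + ["water"] * 6
samples = list(range(len(labels)))

_, half = partial_resample(samples, labels, over_frac=0.5, under_frac=0.5)
print(sorted(Counter(half).items()))
# [('forest', 71), ('grass', 31), ('water', 24)]
```

With both fractions set to 0.5, each class moves halfway toward the mean class size of 42, so the majority class keeps most of its samples and the minority classes contain only duplicates of real samples.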
The experiments showed that no single balancing ratio provides optimal results for all datasets and settings. For example, among the 200 applied fractions, fraction numbers 190, 74, and 26 showed the best results over Site-1, Site-2, and Site-3, respectively. The reason that different datasets react differently to various fractions can be related to the fact that the imbalance ratio differs from one dataset to another [25]. Therefore, it is necessary to investigate different fractions to achieve the most accurate MLC map.
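How a best fraction might be picked from a set of candidates can be sketched as follows. The sketch assumes that GM-PA denotes the geometric mean of the per-class producer's accuracies (its definition is given elsewhere in the paper, not in this section), that confusion matrices have reference classes in rows and predicted classes in columns, and that the selection rule is "highest GM-PA among candidates whose OA is not below the unbalanced baseline". The function names, the selection rule, the fraction identifiers, and all matrix values are hypothetical and are not results from this study.

```python
import math


def overall_accuracy(cm):
    """OA: correctly classified samples divided by all samples."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def gm_producers_accuracy(cm):
    """Geometric mean of per-class producer's accuracy (rows = reference)."""
    pas = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
    return math.prod(pas) ** (1 / len(pas))


def select_fraction(candidates, baseline_cm):
    """Return the fraction id with the highest GM-PA and no loss in OA.

    `candidates` maps a fraction id to the confusion matrix obtained after
    training on data resampled with that fraction. Returns None when every
    candidate lowers OA relative to the baseline.
    """
    base_oa = overall_accuracy(baseline_cm)
    admissible = {
        frac: cm for frac, cm in candidates.items()
        if overall_accuracy(cm) >= base_oa
    }
    if not admissible:
        return None
    return max(admissible, key=lambda f: gm_producers_accuracy(admissible[f]))


# Made-up two-class confusion matrices (majority class first).
baseline = [[95, 5], [6, 4]]
candidates = {
    10: [[93, 7], [3, 7]],    # higher OA and higher minority accuracy
    20: [[85, 15], [1, 9]],   # best minority accuracy, but OA drops
    30: [[94, 6], [5, 5]],    # OA equal to the baseline
}

print(select_fraction(candidates, baseline))  # 10
```

In this toy case, fraction 20 yields the highest GM-PA but is rejected because it lowers OA, which mirrors the trade-off between minority-class accuracy and OA discussed above.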

Conclusion
In this study, a hybrid data balancing technique was proposed to address the class imbalance problem, which is common in LC classification using ML algorithms. Additionally, the role of complementary information in MLC mapping was investigated. All investigations were conducted over three different experiment sites using time series of Landsat-8 OLI data within the GEE cloud computing platform. The study revealed the feasibility and reliability of improving the accuracy of LC classes in mountainous areas by adopting the RF classification algorithm, using both spectral and topographic features, and applying PROSRUS as a data balancing technique. The experiments also showed that topographic data, including elevation, slope, and aspect, had a higher impact than spectral indices on improving the accuracy of MLC maps. Moreover, higher accuracies could be obtained for both minority and majority classes using an appropriate balancing ratio. It was also concluded that every dataset requires a specific balancing ratio to obtain the optimal result, because imbalance ratios and complexity levels differ between datasets. In summary, since the proposed balancing method performed better than the RF with imbalanced data and four rebalancing techniques (i.e., ROS, RUS, SMOTE, and G-SMOTE), it was concluded that integrating complementary information with the PROSRUS method is a valid alternative practice that should be considered for LC classification in mountainous areas.
Supplementary Materials: The following are available online at www.mdpi.com/xxx/s1, S1: Scripts for investigating the role of different complementary information on the accuracies of MLC classes. S2: Scripts for implementing PROSRUS based on time-series of Landsat and the GEE platform.