A Review of Ensemble Learning Algorithms Used in Remote Sensing Applications

Abstract: Machine learning algorithms are increasingly used in various remote sensing applications due to their ability to identify nonlinear correlations. Ensemble algorithms have been included in many practical applications to improve prediction accuracy. We provide an overview of three widely used ensemble techniques: bagging, boosting, and stacking. We first identify the underlying principles of the algorithms and present an analysis of current literature. We summarize some typical applications of ensemble algorithms, which include predicting crop yield, estimating forest structure parameters, mapping natural hazards, and spatial downscaling of climate parameters and land surface temperature. Finally, we suggest future directions for using ensemble algorithms in practical applications.


Introduction
Remote sensing is widely used in applications such as military reconnaissance, analysis of natural hazards, detection of land use and land cover change, measurement of sea ice distribution, precision agriculture, and estimation and mapping of carbon stocks [1,2]. The primary properties of the target detected by the sensor in these applications take the form of spatial, spectral, temporal, and polarization signatures [3]. Many models of physical processes have been developed that quantify the relationships between sensor data and target properties. These models include radiative transfer models, radiosity models, ray-tracing models, dynamic global vegetation models, and land surface microwave emission models, among others [4][5][6][7][8][9]. However, it is generally difficult to obtain the parametric input required for a physical process model, especially at a large scale. Some studies have coupled a physical process model with a statistical method or used a statistical method with a machine learning algorithm to extract crucial predictions for a specific application. Since remote sensing signals often provide a nonlinear representation of the target, machine learning and deep learning algorithms have been widely used in remote sensing applications, because they make no assumptions about the underlying distribution of the data and they have a strong capacity to capture nonlinear correlations [10,11]. Support vector machines (SVM) and random forests (RF) are two well-known nonparametric machine learning algorithms that are used in many studies [12][13][14].
Some studies have claimed that no single algorithm can outperform every other machine learning algorithm under all situations (sometimes called the no free lunch theorem) [15]. The ensemble learning technique has been developed in response to this claim [16,17]. Ensemble learning techniques create multiple hypotheses and combine them to solve a given problem, in contrast to conventional machine learning techniques that aim to learn a single hypothesis from training data [18,19]. By combining multiple learners and taking full advantage of these learners, ensemble algorithms have produced improved results and reduced the overfitting problem, and they possess the flexibility to deal with different tasks [17,20,21]. Bagging, boosting, and stacking are three well-established ensemble techniques, although there exist some variants and other ensemble algorithms that have been used in practical applications [17]. Current implementations of these algorithms in R and Python are shown in Table 1. We present a general overview of the bagging, boosting, and stacking ensemble techniques used in remote sensing. We first review the principles underlying the three types of ensemble algorithms in Section 2 and describe our literature search for publications about bagging, boosting, and stacking ensembles in Section 3. In Section 4, we focus on several fields of study that widely use ensemble techniques for objectives such as crop yield prediction, estimation of forest structure parameters, mapping natural hazards, and spatial downscaling, as well as other applications. Finally, we examine the issues related to ensemble algorithms and future directions.

Bagging Algorithms
Bagging is an ensemble technique that combines multiple learners trained on distinct subsamples of the original data. To build a bagging model, we generate multiple datasets by bootstrapping the training data, develop a model on each dataset, and make predictions using these models. All the predictions are then combined into a single representative value, by majority vote for classification or by averaging (e.g., the mean or median) for regression, depending on the problem to be solved. Since an individual learner is often sensitive to noise in the training data, bagging, by aggregating multiple results into a single prediction, should provide stable and improved results with decreased variance [22].
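The bootstrap-then-aggregate procedure described above can be illustrated with a minimal sketch. It assumes a one-dimensional regression problem and uses a 1-nearest-neighbour rule as a stand-in high-variance base learner; the function names (`fit_nearest_neighbor`, `bagging_fit`, `bagging_predict`) and the toy data are hypothetical and are not taken from any of the reviewed studies.

```python
import random
from statistics import mean


def fit_nearest_neighbor(sample):
    """Hypothetical high-variance base learner: 1-nearest-neighbour regression."""
    def predict(x):
        _, nearest_y = min(sample, key=lambda pair: abs(pair[0] - x))
        return nearest_y
    return predict


def bagging_fit(data, n_learners=25, seed=0):
    """Train one base learner per bootstrap replicate of `data`."""
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        # Bootstrap: draw len(data) samples with replacement.
        replicate = [rng.choice(data) for _ in data]
        learners.append(fit_nearest_neighbor(replicate))
    return learners


def bagging_predict(learners, x):
    """Aggregate by averaging (regression); classification would use a majority vote."""
    return mean(learner(x) for learner in learners)


# Toy data (assumed for illustration): y = 2x plus Gaussian noise.
noise = random.Random(1)
data = [(i / 10, 2 * i / 10 + noise.gauss(0, 0.3)) for i in range(50)]
learners = bagging_fit(data)
print(bagging_predict(learners, 2.5))  # averaged prediction near x = 2.5
```

Averaging over the bootstrap replicates smooths out the noise that any single nearest-neighbour prediction would inherit from its one neighbour, which is the variance reduction the paragraph refers to.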
RF is a modified version of bagging, in which the classification and regression trees (CART) technique is often used as the individual learner. RF uses bootstrap subsamples of the original data to build trees and randomly selects a subset of variables to determine each split in the tree [23]. Because of bootstrapping, roughly one-third of the training samples are left out when any given tree is built; these out-of-bag (OOB) samples are often used to compute the OOB prediction error.
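The share of samples that a bootstrap draw leaves out can be checked numerically. The sketch below assumes plain sampling with replacement of n items from n, in which case the probability that a given sample is never drawn is (1 - 1/n)^n, which tends to 1/e (about 0.368) as n grows. The helper name `oob_fraction` is hypothetical.

```python
import math
import random


def oob_fraction(n_samples, seed=0):
    """Empirical share of samples never drawn in one bootstrap replicate."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples


n = 10_000
print(oob_fraction(n))      # empirical out-of-bag share for one replicate
print((1 - 1 / n) ** n)     # probability that a given sample is never drawn
print(math.exp(-1))         # limiting value 1/e
```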
Many studies have shown that RF outperforms other machine learning and statistical regression algorithms [13]. It has many advantages, some of which are the following. RF can handle both classification and regression problems. In addition, RF is relatively insensitive to noise or outliers in the training data because tree splits effectively bin them. RF trains rapidly since it considers only part of the features at each split. RF is easy to use since only one or two hyperparameters (i.e., the number of trees in the forest and the number of predictor variables to be randomly selected at each split) need to be tuned. In addition, an intrinsic estimate of the generalization error (the OOB error) is given. RF also provides a measure of variable importance, such as that based on the Gini index [24]. Despite these advantages, RF disregards the spatial correlation of nearby observations. Thus, RF kriging (RFK), which couples RF with residuals interpolated by conventional kriging, was proposed and successfully used for forest biomass mapping [25] and prediction of PM2.5 concentrations [26].
Extremely randomized trees (ERT) is also a tree-based ensemble method. ERT differs from RF in the following ways: ERT uses all samples to train the base learner instead of using bootstrap resampling; and ERT randomly selects cutting points instead of calculating local optimal cutting points when splitting nodes, which injects more randomness than RF [22]. The results from different individual trees are therefore more diverse. Some studies have suggested that ERT produces more accurate predictions than RF [27]. ERT has been used in a wide variety of applications such as aboveground biomass estimation [28], prediction of ground level PM2.5 concentrations [29], modeling olive tree phenology [30], retrieval of downward longwave radiation [31], streamflow modeling [32], and estimation of terrestrial latent heat flux [33].

Boosting Algorithms
Boosting algorithms use a forward stagewise process to transform weak learners into strong learners: in each successive iteration, the weights of training samples that were misclassified or poorly predicted in the previous iteration are increased. The final output of boosting is obtained by combining the results from all iterations using a weighted vote for classification or a weighted sum for regression [34]. Widely used boosting ensemble algorithms include adaptive boosting (AdaBoost), gradient boosting machine (GBM), extreme gradient boosting (XGBoost), LightGBM, and categorical boosting (CatBoost).
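The reweighting idea can be sketched with discrete AdaBoost for binary labels in {-1, +1}. This is an illustrative sketch only: it assumes one-dimensional inputs and decision stumps as weak learners, the function names are hypothetical, and gradient boosting variants fit residuals or gradients rather than reweighting samples in this way.

```python
import math


def stump_predict(x, threshold, polarity):
    """Decision stump: returns `polarity` when x >= threshold, otherwise -polarity."""
    return polarity if x >= threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost_fit(xs, ys, n_rounds=3):
    n = len(xs)
    weights = [1 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, threshold, polarity))
        # Increase the weights of misclassified samples, decrease the others.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Weighted vote over all stumps."""
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1


# Toy data (assumed): the positive class occupies an interval, so no single stump fits it.
xs = list(range(10))
ys = [-1, -1, -1, 1, 1, 1, 1, -1, -1, -1]
ensemble = adaboost_fit(xs, ys, n_rounds=3)
print([adaboost_predict(ensemble, x) for x in xs])
```

On this toy interval problem every single stump misclassifies at least three samples, whereas the weighted vote of three stumps reproduces all training labels.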
AdaBoost was initially developed to increase the efficiency of binary classifiers. In GBM, all the weak learners are decision trees. XGBoost is an improved version of GBM that parallelizes processing at the node level, which makes it faster than GBM. XGBoost also introduces a variety of regularization techniques to reduce overfitting [35]. Studies have shown that XGBoost performs very well in mapping plant species diversity [36], predicting PM2.5 concentrations [37], forest biomass estimation [38], and risk analysis for flash floods [39]. LightGBM has many of XGBoost's advantages, such as parallel training, regularization, and sparse optimization. The major difference between the two lies in the method of constructing trees. LightGBM uses a leaf-wise split: after the first split, the node with the higher delta loss is selected for the next split [40]. This technique enables LightGBM to handle huge amounts of data easily. A histogram-based method is often adopted to select the best split in LightGBM to reduce the time used in training. However, LightGBM may not perform well with a small number of samples, for which leaf-wise growth tends to overfit.
CatBoost is an improved gradient boosting regression tree (GBRT) algorithm and an alternative to XGBoost. As the name suggests, CatBoost can internally handle categorical variables in the data, and is thus suitable for dealing with machine learning tasks involving categorical and heterogeneous data [41,42]. Some studies have found that CatBoost is superior to other machine learning algorithms, and it has been used to estimate forest biomass [28,43] and reference evapotranspiration [44].

Stacked Generalization
Stacked generalization, also known as stacking, was proposed by Wolpert in 1992 [45]. It is a heterogeneous learning technique that combines diverse base learners by training a model, unlike the homogeneous bagging and boosting methods, which directly aggregate the outputs of several learners to obtain the final prediction. If properly designed, stacking can take full advantage of different base learners and should perform better than an individual base learner, whether using majority voting or weighted averaging [20,46,47,48].
Generally, stacking consists of several base learners (level 0) and a meta learner (level 1), in which the outputs of base learners serve as inputs of the meta learner. Both the precision and variety of base learners affect the performance of a stacking algorithm [49,50].
Diversity is a measure of the dependence or complementariness among learners [51]. That base learners have high diversity implies that they are skilled in different ways, and thus stacking them could lead to improved results. There are many measures of diversity, such as Q-statistics in classification, and the variance of ensemble predictions around their weighted mean in regression [52]; no one measure has been shown to be the best [51,53,54]. In addition to the diversity and accuracy of base learners, the number of base learners also affects the performance of stacking. More base learners are not always associated with more accurate predictions, but they do require additional memory usage and computing time [55]. Some studies have suggested that three to four base learners stacked together are a suitable choice [56], and some action should be taken to ensure that an optimal subset of base learners is used in stacking.
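The two diversity measures named above can be written down compactly. The sketch below assumes the usual definition of the pairwise Q-statistic from the counts of samples that both, one, or neither of two classifiers get right, and takes the regression measure to be the weighted variance of member predictions around their weighted mean. The function names and the example inputs are hypothetical.

```python
def q_statistic(correct_a, correct_b):
    """Pairwise Q-statistic from two per-sample correctness vectors (True = correct).

    Q = (N11*N00 - N01*N10) / (N11*N00 + N01*N10); values near 1 mean the two
    classifiers err on the same samples, values below 0 mean they err on different ones.
    """
    pairs = list(zip(correct_a, correct_b))
    n11 = sum(1 for a, b in pairs if a and b)
    n00 = sum(1 for a, b in pairs if not a and not b)
    n10 = sum(1 for a, b in pairs if a and not b)
    n01 = sum(1 for a, b in pairs if not a and b)
    denominator = n11 * n00 + n01 * n10
    if denominator == 0:
        raise ValueError("Q-statistic is undefined for these counts")
    return (n11 * n00 - n01 * n10) / denominator


def ensemble_ambiguity(predictions, weights):
    """Weighted variance of member predictions around their weighted mean (one sample)."""
    weighted_mean = sum(w * p for w, p in zip(weights, predictions))
    return sum(w * (p - weighted_mean) ** 2 for w, p in zip(weights, predictions))


print(q_statistic([1, 1, 1, 0, 0, 1], [1, 1, 0, 0, 1, 1]))  # 0.5
print(ensemble_ambiguity([1.0, 2.0, 3.0], [1 / 3, 1 / 3, 1 / 3]))
```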
A two-layer stacking model is often adopted in practical applications, although stacking models with more than two layers are sometimes used to further improve results [57]. A two-layer stacking model is constructed as follows (Figure 1). First, k-fold cross-validation is applied to every base learner, so that each training sample receives a prediction from a model that did not see it during training. These cross-validated predictions are then used as new training data for the meta learner in the second layer; the number of base learners in the first layer is therefore equal to the number of predictors in the second layer. The test data for the meta learner are created by averaging the predictions of the k fold models of each base learner [47,58]. In some classification studies, class probabilities were used instead of predicted classes as input attributes for the meta learner, which provided an effective way of combining base model confidences [20]. Including the original or engineered features alongside the base learner outputs has been shown to give better performance [58,59]. There have also been studies that used weighted stacking to improve results [60].
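A minimal sketch of the two-layer construction follows. It is deliberately simplified and rests on several assumptions: a one-dimensional regression problem, two hypothetical base learners (a least-squares line and a 1-nearest-neighbour rule), and a meta learner reduced to a single blend weight chosen by grid search. Studies in the literature typically use a full regression or tree-based model as the meta learner.

```python
from statistics import mean


def fit_linear(train):
    """Base learner 1: ordinary least squares on one predictor."""
    mx = mean(x for x, _ in train)
    my = mean(y for _, y in train)
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sum((x - mx) * (y - my) for x, y in train) / sxx
    return lambda x: my + slope * (x - mx)


def fit_nearest(train):
    """Base learner 2: 1-nearest-neighbour regression."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


BASE_FITTERS = [fit_linear, fit_nearest]


def out_of_fold_predictions(data, k=5):
    """Level 0: every sample is predicted by fold models that never saw it."""
    oof = [None] * len(data)
    fold_models = []
    for fold in range(k):
        train = [data[i] for i in range(len(data)) if i % k != fold]
        models = [fit(train) for fit in BASE_FITTERS]
        fold_models.append(models)
        for i in range(fold, len(data), k):  # indices held out in this fold
            oof[i] = [m(data[i][0]) for m in models]
    return oof, fold_models


def fit_meta(oof, targets, steps=100):
    """Level 1: blend weight w that minimises squared error on the OOF predictions."""
    def sse(w):
        return sum((w * p[0] + (1 - w) * p[1] - y) ** 2 for p, y in zip(oof, targets))
    return min((s / steps for s in range(steps + 1)), key=sse)


def stack_predict(fold_models, w, x):
    """Meta-learner inputs at test time: base predictions averaged over the k fold models."""
    avg = [mean(models[j](x) for models in fold_models) for j in range(len(BASE_FITTERS))]
    return w * avg[0] + (1 - w) * avg[1]


# Toy, noise-free data (assumed for illustration): y = 3x + 1.
data = [(float(x), 3.0 * x + 1.0) for x in range(20)]
oof, fold_models = out_of_fold_predictions(data, k=5)
w = fit_meta(oof, [y for _, y in data])
print(w, stack_predict(fold_models, w, 10.5))
```

Because the meta learner only ever sees out-of-fold predictions, it weights each base learner by how well it generalises rather than by how well it memorises the training data.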

Literature Search and Analysis
We conducted a literature survey of the ISI Web of Science database using the search term TOPIC ensemble AND TOPIC remote sensing. The search was then refined to include research areas Remote Sensing, Geography, Agriculture, Forestry and Environmental Sciences Ecology and document types Articles, Meeting, and Review articles. This yielded a total of 2247 results that we refer to as ensemble literature.
Research articles accounted for 84% of the records returned, conference papers for 14%, and review articles for 2% (Figure 1). The publication years of the refined search records ranged from 1979 to 2022. In 2015-2022, more than 100 papers meeting the search criteria were published annually (Figure 1).

We then constructed a co-occurrence network by identifying keywords in the ensemble literature, calculating the frequencies of their co-occurrences, and analyzing the network to find central words and clusters of themes (Figure 2). The constructed network showed that studies of ensemble learning algorithms were mainly related to classification, and studies of ensemble method use or ensemble models were primarily concerned with RF. In addition, the co-occurrence network suggested that applications concerning crop yield, biomass, rainfall, landslides, and satellite products were connected to ensemble algorithms.
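The counting step behind a keyword co-occurrence network can be sketched as follows. This is a generic illustration, not the tool used for Figure 2: the keyword lists are invented, the function names are hypothetical, and centrality is reduced to a weighted degree (the sum of co-occurrence counts attached to each keyword).

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(keyword_lists):
    """Count how often each unordered keyword pair appears in the same record."""
    counts = Counter()
    for keywords in keyword_lists:
        unique = sorted({k.lower() for k in keywords})
        counts.update(combinations(unique, 2))
    return counts


def weighted_degree(counts):
    """Simple centrality: total co-occurrence weight attached to each keyword."""
    degree = Counter()
    for (a, b), weight in counts.items():
        degree[a] += weight
        degree[b] += weight
    return degree


# Invented example records, one keyword list per publication.
records = [
    ["random forest", "classification", "biomass"],
    ["Random Forest", "classification"],
    ["stacking", "crop yield", "random forest"],
]
edges = cooccurrence_counts(records)
print(edges[("classification", "random forest")])  # 2
print(weighted_degree(edges).most_common(1))
```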

Applications of Ensemble Learning Algorithms
The co-occurrence network (Figure 2) combined with our reading of the ensemble literature shows that typical applications of ensemble learning algorithms were mainly concerned with the following issues: yield prediction, forest structure and biomass estimation, mapping of natural hazards (e.g., land susceptibility to natural disasters), and spatial downscaling of land surface temperature and rainfall (Table 2). RF was the most frequently used ensemble algorithm (30 times) in the 43 applications listed in Table 2, followed by stacking (13 times), while boosting algorithms were less frequently used (Table 3). Only seven studies used XGBoost; GBRT and AdaBoost were used four times. MLP and kNN often served as reference models to evaluate the performance of an ensemble algorithm.


Yield Prediction
Machine learning algorithms have been widely used in crop yield prediction [98]. Van Klompenburg et al. [99] analyzed 50 machine learning papers and 30 deep learning papers concerning crop yield prediction. A neural network was the most frequently used machine learning technique in the machine learning papers; RF was used in 12 studies, and GBRT was used four times. Ruan et al. [89] included eleven statistical and machine learning techniques for predicting field-scale wheat yield and found that the two ensemble learning models RF and XGBoost were the most accurate, with R² in the range 0.74-0.78 and RMSE in the range 0.78-0.85 t/ha. Cao et al. [66] adopted MLR, XGBoost, RF, and SVR algorithms and three datasets, including MODIS EVI, climate data from the Climatic Research Unit, and subseasonal-to-seasonal (S2S) atmospheric prediction data, to estimate winter wheat yield at the grid level. The results showed that among the four models, XGBoost reached the highest skill with the S2S prediction as inputs, with an R² of 0.85 and RMSE of 0.78 t/ha. Ang et al. [61] employed the RF and AdaBoost algorithms to estimate oil palm yield from multi-temporal remote sensing data. The results revealed that the RF model (RMSE = 0.384; MSE = 0.148; MAE = 0.147) outperformed the AdaBoost model (RMSE = 0.410; MSE = 0.168; MAE = 0.176). Kamir et al. [82] used nine base learners and two ensemble methods (average ensemble and Bayesian fusion) to estimate wheat yields across the Australian wheat belt from climate data and satellite image time series. The results showed that SVR with a radial basis function emerged as the single best learner, with an R² of 0.77 and RMSE of 0.55 t/ha at the pixel level, and the ensemble techniques did not show a significant advantage over the single best model.
Stacking models are also increasingly used in crop yield prediction. Feng et al. [74] predicted alfalfa yield from unmanned aerial vehicle (UAV)-based hyperspectral images using RF, SVR, k-nearest neighbors (kNN), and stacking ensemble algorithms. Comparison of the results indicated that stacking outperformed all the base learners, with R² = 0.874. Fei et al. [75] combined four base learners (RF, SVM, Gaussian process (GP), and ridge regression (RR)) to predict grain yield across growth stages and found that stacking improved prediction accuracy in both full and limited irrigation scenarios, with respective R² values of 0.625 and 0.628 at the mid grain filling stage. Li et al. [84] developed four base models (RF, SVM, GP, and RR) as well as a stacking model to predict winter wheat yields from UAV-based hyperspectral image data. They found that SVM outperformed the other base learners and that the stacking model was more accurate than each base model.
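For reference, the two accuracy measures quoted throughout this section can be computed as in the sketch below. It assumes the common definitions: RMSE as the square root of the mean squared error, and R² as the coefficient of determination, 1 - SS_res/SS_tot. Some studies instead report the squared Pearson correlation as R², so values are not always directly comparable. The numbers are invented for illustration.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error, in the units of the target (e.g., t/ha)."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Invented yields in t/ha.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(rmse(observed, predicted), r_squared(observed, predicted))
```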

Estimation of Forest Structure Parameters and Biomass
Forest structure parameters (e.g., forest height) and forest biomass are critical indicators of carbon stocks and are increasingly important in fields related to the carbon cycle and climate change [100]. García-Gutiérrez et al. [76] compared a multiple linear regression (MLR) model, a neural network, SVR, kNN, and RF in estimating forest parameters in the province of Lugo (Galicia, Spain) from lidar data and found that MLR was outperformed by the machine learning algorithms and that SVR with Gaussian kernels outperformed the other algorithms. Corte et al. [69] used SVR, ANN, RF, and XGBoost to estimate tree dendrometry parameters such as volume, height, and diameter at breast height from UAV-lidar point clouds. Their results showed that all models were robust, with relative RMSE <29% for volume, <9% for height, and <15% for diameter at breast height; SVR performed the best, with the lowest error rates. In both studies, SVR outperformed the ensemble algorithms and gave the most accurate predictions.
Various ensemble algorithms have been developed to estimate forest parameters. For example, Cartus et al. [67] used a two-stage model to derive forest canopy height and growing stock volume (GSV) in Chile from a combination of airborne laser scanner (ALS), PALSAR, and Landsat data. They developed an RF model of canopy height and GSV using ALS-derived values that were validated with in situ measurements and then used a second RF model that used multitemporal PALSAR and Landsat data as predictor variables and ALS-based estimates as response variables. At three test sites, the retrieval of canopy height and GSV reached good accuracies, with R² of 0.70-0.85. Xu et al. [96] developed a two-stage ensemble approach to increase the accuracy of forest GSV estimation. They selected variables using a collinearity test, ran four base learners (CART, kNN, SVR, and ANN), and combined them first using bagging and then using AdaBoost to generate eight ensemble models. The eight ensemble models were then aggregated using a weighted average method in which the weights were determined by the validated relative RMSE values of the eight ensemble models. The experimental results showed that the combined ensemble approach significantly reduced the uncertainty of GSV mapping from the Sentinel-1A and Sentinel-2A data, with relative RMSE values in the range 18.89-21.34%.
Dube et al. [73] used stochastic gradient boosting (SGB) to estimate the stand volume of eucalyptus plantations in South Africa from multisource data and found that SGB accurately predicted stand volume, with R² = 0.78 and RMSE = 33.16 m³/ha. These results were more accurate than the results given by RF or stepwise regression. Zhang et al. [28] comprehensively assessed the performance of eight algorithms (SVR, MARS, MLP, RF, ERT, SGB, GBRT, and CatBoost) in predicting forest biomass using several remote sensing datasets. Their results indicated that the five ensemble algorithms (RF, ERT, SGB, GBRT, and CatBoost) produced more accurate predictions than the other three individual algorithms and that CatBoost obtained slightly more accurate results than the other four ensemble algorithms, with R² = 0.72 and RMSE = 45.63 t/ha. In a subsequent study, they developed a stacking model to combine several accurate base learners to further increase the accuracy of biomass prediction, and the results indicated that the stacking ensemble increased prediction accuracy; in particular, it decreased the biases [58]. Ghosh et al. [77] used a stacked set of ensemble algorithms (RF, GBM, and XGBoost) to predict the aboveground biomass (AGB) of Indian mangroves from Sentinel-1 and Sentinel-2 time series. The results indicated that stacking increased AGB prediction accuracy, with RMSE = 72.864 t/ha and relative RMSE = 11.38%.
Du et al. [72] developed a CNN model using ALS data and Landsat imagery and found that CNN was more accurate than an extreme learning machine (ELM), a backpropagation neural network, a regression tree, RF, SVR, kNN, and other standard machine learning techniques. The stacking algorithm significantly increased prediction accuracy when compared with base models.

Natural Hazards
Natural hazards such as droughts, hurricanes, tornadoes, floods, and landslides can affect human life and property [101]. Accurately predicting or mapping the probabilities of natural disasters is therefore of great importance for human survival. Machine learning algorithms, particularly ensemble algorithms, have high prediction accuracy [85] and so have become increasingly used in identifying areas of long-term drought [63], mapping landslide hazards [81,90], assessing the susceptibility of gullies to erosion [102,103], and mapping land susceptible to subsidence [78].
Arabameri et al. [62] combined three meta classifiers (Real AdaBoost, Random Subspace, and MultiBoosting) into a hybrid ensemble framework to predict the likelihood of flooding in the Gorganroud River Basin, Iran. Band et al. [65] used multiple ensemble algorithms (GBM, RF, parallel random forest (PRF), regularized random forest (RRF), and ERT) to quantify the likelihood of flash flooding in Markazi Province, Iran and found that ERT was the most accurate model, with an area under curve (AUC) value of 0.82. Chapi et al. [68] combined bagging with logistic model trees (LMT) and developed a bagging-LMT ensemble model to map flood susceptibility. The results showed that in terms of accurately mapping flood susceptibility, the bagging-LMT model performed better than LMT, logistic regression, Bayesian logistic regression, and RF. Hakim et al. [78] compared logistic regression, MLP, and two meta ensemble machine learning algorithms (AdaBoost and LogitBoost) in predicting likely subsidence based on a land subsidence inventory map generated from Sentinel-1 synthetic aperture radar (SAR) data. The results showed that AdaBoost gave the greatest prediction accuracy (81.1%), followed by MLP (80%), logistic regression (79.4%), and LogitBoost (79.1%). Kalantar et al. [81] investigated the suitability of flexible discriminant analysis (FDA), generalized logistic models (GLM), GBM, and RF for mapping landslide susceptibility. The test results showed that FDA was similar in prediction accuracy to GLM but was less accurate than GBM, which was in turn less accurate than RF. Rahman et al. [87] compared a Bayesian regularization back propagation neural network (BRBP), CART, an evidence belief function (EBF), and their various combinations in ensemble models to predict flood likelihood in Bangladesh. They found that the ensemble model that combined BRBP, CART, and EBF using weighted averaging was more accurate (AUC > 90%) than other models. In another study, Rahman et al. [86] found that stacking locally weighted linear regression (LWLR) and RF models increased the prediction accuracy of flood susceptibility maps in Bangladesh, with R² = 0.967-0.999, MAE = 0.022-0.117, and RMSE = 0.029-0.148. Sachdeva et al. [90] used a majority voting ensemble technique to predict landslide susceptibility and found that the ensemble model that combined logistic regression, GBDT, and voting feature intervals produced predictions that were close in accuracy to the predictions of widely used machine learning algorithms such as decision trees, SVM, and RF.

Spatial Downscaling
Remote sensing data are often spatially downscaled to obtain fine resolution (FR) data from coarse resolution (CR) remote sensing data. The finer resolution data provide more spatial details and thus bridge the gap between what CR data provide and what regional applications require. Statistical downscaling methods have frequently been used in several domains to obtain FR parameters since they require less computation and running time, and are more accurate than other downscaling methods such as dynamic downscaling [104].
The statistical downscaling procedure for retrieving FR data from CR data is as follows: (1) develop models relating CR parameters and predictor variables or ancillary variables at a coarse resolution; (2) apply the CR models to the FR data, assuming that the relationships between target parameters and predictor variables remain unchanged at different spatial scales; and (3) obtain the target FR parameters from the models and FR predictor variables at a fine resolution. A variety of machine learning algorithms, especially the increasingly used ensemble learning algorithms, have been used as the relationships between target parameters and predictor variables are often nonlinear and complex.
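The three steps above can be sketched on a one-dimensional transect. This is a minimal illustration under stated assumptions: a single fine-resolution predictor (for example, elevation), block averaging to obtain the coarse-resolution predictor, and an ordinary least-squares line standing in for the ensemble learner (RF, GBRT, and similar) that the reviewed studies actually use. All function names and values are hypothetical.

```python
from statistics import mean


def block_average(fine, factor):
    """Aggregate a fine-resolution transect to coarse cells of `factor` pixels."""
    return [mean(fine[i:i + factor]) for i in range(0, len(fine), factor)]


def fit_line(xs, ys):
    """Least-squares intercept and slope (stand-in for an ensemble regressor)."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope


def downscale(coarse_target, fine_predictor, factor):
    # Step 1: relate the target to the predictor at coarse resolution.
    coarse_predictor = block_average(fine_predictor, factor)
    intercept, slope = fit_line(coarse_predictor, coarse_target)
    # Steps 2-3: assume the relationship is scale invariant and apply the
    # coarse-resolution model to the fine-resolution predictor.
    return [intercept + slope * x for x in fine_predictor]


# Invented example: 12 fine pixels, coarse cells of 4 pixels each.
fine_predictor = [float(i) for i in range(12)]
true_fine = [2.0 + 0.5 * x for x in fine_predictor]
coarse_target = block_average(true_fine, 4)
print(downscale(coarse_target, fine_predictor, 4))
```

In this toy case the relationship is exactly linear, so the fine-resolution field is recovered exactly; with real data the scale-invariance assumption in step (2) is the main source of error, and a residual correction is often added.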
RF has been widely used to downscale coarse-resolution precipitation, land surface temperature (LST), and soil moisture remote sensing data. For example, Shi et al. [91] established nonparametric relationships between precipitation and six indicators (EVI, altitude, slope, aspect, latitude, and longitude) using RF and spatially downscaled annual precipitation for 2001-2012 from 0.25° pixels to 1 km pixels over the Tibetan Plateau. Zhao et al. [97] used RF to downscale the Tropical Rainfall Measuring Mission (TRMM) monthly precipitation product from 25 km resolution to produce monthly precipitation data for China at 1 km resolution. Hutengs and Vohland [80] developed a model relating LST to digital elevation data, land cover type, and surface reflectance in the red and near-infrared bands using RF; they downscaled the spatial resolution of LST from 1 km to 250 m, with RMSEs from 1.41 to 1.92 K. When compared to the widely adopted TsHARP sharpening method, downscaling accuracy using RF improved by up to 19%. Karami et al. [83] created an RF-based regression tree to downscale the daily SMAP soil moisture product at 9 km resolution and created a 1 km soil moisture product using Sentinel-1 data, MODIS NDVI, land cover, and auxiliary topography and soil properties. Xu et al. [94] combined wavelet analysis with machine learning algorithms to create a wavelet support vector machine (WSVM) and a wavelet random forest (WRF) algorithm to downscale North American multimodel ensemble (NMME) precipitation forecasts. Their results showed improvement over quantile mapping, with an average decrease in RMSE of 18-40 mm (21-33%).
Several ensemble boosting algorithms have also been used for spatial downscaling. Wei et al. [92] created high-resolution soil moisture maps covering the entire Tibetan Plateau using GBRT with SMAP soil moisture data and related variables derived from MODIS and DEM. Using GBRT, Asadollah et al. [64] achieved a significant improvement in downscaling global climate model predictions compared to SVR, which had previously been found to be the most suitable method for climate downscaling in Iran. Xu et al. [95] developed a multifactor geographically weighted machine learning algorithm using Sentinel-2A data that combined the results from three base learners (XGBoost, MARS, and Bayesian ridge regression) and downscaled LST from 30 m to 10 m resolution.

Other Applications
Ensemble learning algorithms have also been used in other applications. In this section, we briefly mention some studies that recognized the importance of stacking.
Wu et al. [93] developed a two-layer stacking and blending ensemble method to predict daily reference evapotranspiration in which level 0 models included RF, SVR, multilayer perceptron neural network (MLP), and kNN. Both stacking and blending were significantly more accurate than the base models, and this approach is thus highly recommended for predicting reference evapotranspiration. Cho et al. [46] used a stacking model that consisted of multiple linear regression (MLR) and support vector regression (SVR), with RF optimized by SVR, to predict daily maximum air temperature. The stacking ensemble method produced more accurate predictions than cokriging, three distinct data-driven methods, and a simple average ensemble model. Divina et al. [71] showed that stacking was a suitable approach for short-term electricity consumption forecasting. Healey et al. [79] showed that stacking increased the accuracy of forest change detection. They investigated a stacking model that used both parametric and RF-based image fusion rules as the meta learner to combine several forest disturbance detection algorithms. Stacking with an RF meta learner reduced the rates of omission and commission errors by 50% in some instances compared with individual change detection methods, and by 25% compared with stacking with a logistic regression meta learner.

Combining Feature Selection with Ensemble Learning
Many studies have shown that using selected important features or variables instead of using all extracted variables can result in more accurate and robust predictions [105][106][107].
Ensemble algorithms are essentially black box models that carry the risk of overfitting, and the underlying physical mechanisms can be obscure [108]. It is therefore critical to identify important variables by selecting features before training a model; filter, wrapper, and embedded algorithms are just a few of the feature selection techniques that have been proposed [106,109].
Some studies have highlighted the importance of combining feature selection with ensemble learning algorithms in practical applications. For example, Luo et al. [43] found that recursive feature elimination (RFE) for feature selection in combination with a CatBoost model produced the most accurate prediction of forest biomass for all forests in Jilin province, China, with RMSE = 25.77 t/ha. Other studies have used the variance inflation factor (VIF) to quantify multicollinearity between independent variables and retained only variables with VIF values < 10 for modeling [62,110]. Of the three types of ensemble learning methods, tree-based bagging and boosting algorithms identify important variables mainly by permutation importance [111,112]. Some other indicators, such as mean decrease in accuracy and mean decrease in impurity, have also been used in association with tree-based algorithms (e.g., RF) [113]. In contrast, stacking ensemble algorithms appear to have difficulty in selecting important variables because they work with a set of models rather than an individual model, and the ensemble results can be difficult to interpret [88,114]. Feature selection should thus be implemented with care when stacking, but this aspect of the technique has been little explored in published studies.
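Permutation importance, mentioned above, can be illustrated with a short model-agnostic sketch: each feature column is shuffled in turn and the resulting increase in prediction error is recorded. The function names, the synthetic data, and the stand-in `predict` function are all hypothetical; in practice the model would be a fitted RF or boosting model and the scores would be computed on held-out data.

```python
import random


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in RMSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(rmse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical "fitted model": depends strongly on feature 0,
# weakly on feature 1, and not at all on feature 2.
def predict(row):
    return 3.0 * row[0] + 0.5 * row[1]


data_rng = random.Random(42)
X = [[data_rng.random() for _ in range(3)] for _ in range(200)]
y = [predict(row) for row in X]

importances = permutation_importance(predict, X, y)
```

In this toy setting the scores rank the features in the expected order, and the unused third feature receives an importance of zero because shuffling it leaves every prediction unchanged.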

Other Ensemble Learning Algorithms
In this study, we primarily reviewed bagging, boosting, and stacking ensemble algorithms. Other ensemble learning algorithms have been developed in addition to these well-known algorithms, such as dynamic ensemble learning [115] and Bayesian additive regression trees [116]. Unlike static ensemble algorithms, which combine a fixed set of base learners, the dynamic ensemble method selects the single best learner or combines a subset of learners from the pool at prediction time, using a just-in-time condition that depends on the particular input pattern for which a prediction is to be made [117][118][119].
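A minimal sketch of dynamic selection is given below. It assumes one common way of defining the just-in-time condition, namely the local accuracy of each learner on the validation samples nearest to the query input; this choice, the two base learners, and the data are illustrative assumptions rather than the specific methods of the cited studies.

```python
def dynamic_select_predict(x, learners, X_val, y_val, k=3):
    """Predict for x with the learner that has the lowest mean absolute
    error on the k validation samples nearest to x (region of competence)."""
    neighbours = sorted(range(len(X_val)), key=lambda i: abs(X_val[i] - x))[:k]

    def local_error(learner):
        return sum(abs(learner(X_val[i]) - y_val[i]) for i in neighbours) / len(neighbours)

    best = min(learners, key=local_error)
    return best(x), best


# Two hypothetical pre-trained base learners, each accurate
# in a different part of the input range.
def learner_low(x):
    return 2.0 * x   # matches the target for x < 5


def learner_high(x):
    return 10.0      # matches the target for x >= 5


# Hypothetical validation set drawn from the piecewise target.
X_val = [float(i) for i in range(10)]
y_val = [2.0 * x if x < 5 else 10.0 for x in X_val]

learners = [learner_low, learner_high]
pred_a, chosen_a = dynamic_select_predict(1.5, learners, X_val, y_val)
pred_b, chosen_b = dynamic_select_predict(8.2, learners, X_val, y_val)
```

A static ensemble would apply the same combination of both learners everywhere, whereas here the learner is chosen separately for each input: `learner_low` for the first query and `learner_high` for the second.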
Blending is another ensemble technique that is derived from stacking. Blending differs from stacking in that it does not use k-fold cross-validation to generate training data for the meta learner but instead uses a single holdout set. As a result, only a small portion of the training dataset is used to generate the predictions that serve as inputs to the meta model [93,120].
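The difference between the two schemes lies in how the meta learner's training inputs are produced, which the following sketch makes concrete. The two base learners (a mean predictor and a one-predictor least-squares line), the fold assignment, and the data are hypothetical and deliberately simple; in practice the holdout set would normally be drawn at random.

```python
from statistics import mean


def fit_mean(X, y):
    """Base learner 1: always predicts the training mean."""
    m = mean(y)
    return lambda x: m


def fit_line(X, y):
    """Base learner 2: one-predictor least-squares line."""
    x_bar, y_bar = mean(X), mean(y)
    sxx = sum((x - x_bar) ** 2 for x in X)
    slope = sum((x - x_bar) * (t - y_bar) for x, t in zip(X, y)) / sxx
    return lambda x: y_bar + slope * (x - x_bar)


BASE_FITTERS = [fit_mean, fit_line]


def stacking_meta_features(X, y, k=5):
    """Out-of-fold predictions: one meta-feature row for every training sample."""
    n = len(X)
    meta = [None] * n
    for fold in range(k):
        train_idx = [i for i in range(n) if i % k != fold]
        test_idx = [i for i in range(n) if i % k == fold]
        models = [fit([X[i] for i in train_idx], [y[i] for i in train_idx])
                  for fit in BASE_FITTERS]
        for i in test_idx:
            meta[i] = [m(X[i]) for m in models]
    return meta


def blending_meta_features(X, y, holdout_fraction=0.2):
    """Base learners are fit on the first part of the data; meta-feature
    rows exist only for the single holdout part at the end."""
    n_hold = max(1, int(len(X) * holdout_fraction))
    split = len(X) - n_hold
    models = [fit(X[:split], y[:split]) for fit in BASE_FITTERS]
    return [[m(x) for m in models] for x in X[split:]], y[split:]


X = [float(i) for i in range(20)]
y = [1.0 + 0.5 * x for x in X]

stack_meta = stacking_meta_features(X, y)            # 20 rows for the meta learner
blend_meta, blend_y = blending_meta_features(X, y)   # 4 rows for the meta learner
```

With 20 samples, stacking gives the meta learner 20 training rows, while blending with a 20% holdout gives it only 4. This is the trade-off described above: blending is cheaper but trains the meta model on much less data.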

Deep Learning Algorithms
Deep learning algorithms are used in many fields, including agriculture and remote sensing. The base learners in current ensemble models are mostly statistical and conventional ML methods, and the possibility of combining deep learning models in several ways is worthy of investigation. Deep learning algorithms have been used as base learners in some studies with the intention of increasing prediction accuracy. For example, de Oliveira e Lucas et al. [70] used three CNNs to predict reference evapotranspiration time series and developed ensemble models consisting of the three CNNs. The CNN ensembles produced predictions with high accuracy and low variance. Lv et al. [47] developed a heterogeneous ensemble learning approach that combined three deep learning models (a deep belief network (DBN), a CNN and a deep residual network (ResNet)) to map landslide susceptibility in the Three Gorges reservoir area in China.
Deep learning algorithms capture nonlinear relationships between target and sensor signals well, so an ensemble of various deep learning algorithms may be expected to produce more accurate predictions than a single algorithm. The three principal ensemble methods we described in this study, particularly stacking, provide a framework for leveraging different algorithms. Future studies should investigate optimal combinations of deep learning algorithms in various applications.

Conclusions
In this paper, we reviewed bagging, boosting, and stacking ensemble learning algorithms and their typical applications to remote sensing data. RF was the most frequently adopted algorithm in several fields that used remote sensing data. In contrast, the other ensemble algorithms were often not considered for specific applications. Despite recent progress in increasing the prediction accuracy of ensemble algorithms, there are still some gaps in our knowledge, such as how to effectively combine feature selection with ensemble algorithms and how to incorporate deep learning algorithms in an ensemble to increase prediction accuracy. A better understanding of ensemble algorithms deserves to be a main focus of future study and will enable more advanced and diverse algorithms to be incorporated in practical applications.

Conflicts of Interest:
The authors declare no conflict of interest.