A Dual-Variable Selection Framework for Enhancing Forest Aboveground Biomass Estimation via Multi-Source Remote Sensing

Chen, Dapeng; Luo, Hongbin; Liu, Zhi; Pan, Jie; Wu, Yong; Wang, Er; Lu, Chi; Wang, Lei; Wang, Weibin; Ou, Guanglong

doi:10.3390/rs17142493


by Dapeng Chen ^1,2, Hongbin Luo ^1,2, Zhi Liu ^1,2, Jie Pan ^1,2, Yong Wu ^1,2, Er Wang ^1,2, Chi Lu ^1,2, Lei Wang ^3, Weibin Wang ^1,2 and Guanglong Ou ^1,2,*

^1 Key Laboratory for Forest Resources Conservation and Utilization in the Southwest Mountains of China, Ministry of Education, Southwest Forestry University, Kunming 650233, China

^2 College of Forestry, Southwest Forestry University, Kunming 650224, China

^3 Yunnan Academy of Forestry and Grassland, Kunming 650201, China

^* Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(14), 2493; https://doi.org/10.3390/rs17142493

Submission received: 19 May 2025 / Revised: 7 July 2025 / Accepted: 16 July 2025 / Published: 17 July 2025

(This article belongs to the Section Forest Remote Sensing)


Abstract

Integrating multi-source remote sensing can improve the accuracy of forest aboveground biomass (AGB) estimation. However, the accuracy and stability of the forest AGB estimation results are affected by the large number of remote sensing feature variables as well as by the parameter tuning of machine learning algorithms. To this end, this study employed six types of remote sensing data: Landsat 8 OLI, Sentinel-2A, GEDI, ICESat-2, ALOS-2, and SAOCOM. A dual-variable selection strategy based on SHapley Additive exPlanations (SHAP) was developed, and a genetic algorithm (GA) was used to optimize the parameters of five machine learning models, namely elastic net (EN), least absolute shrinkage and selection operator (Lasso), support vector regression (SVR), Random Forest (RF), and Categorical Boosting (CatBoost), to estimate the AGB of Pinus kesiya var. langbianensis forest in Wuyi Village, Zhenyuan County. The dual-variable selection strategy integrates SHAP with the Pearson correlation coefficient (PC), RF, EN, and Lasso to enhance feature screening robustness and interpretability. The results showed that Lasso-SHAP dual-variable screening was more stable than SHAP single-variable screening. In particular, the Lasso-SHAP strategy improved the average R² from 0.59 (using SHAP alone) to above 0.70, an absolute gain of at least 0.11. Among GA-optimized parametric machine learning models, the linear GA-Lasso achieved the best performance, with an R² of 0.91 and an RMSE of 12.94 Mg/ha, followed by the GA-EN model (R² = 0.89, RMSE = 14.46 Mg/ha). For nonlinear models, GA-SVR performed the best (R² = 0.74, RMSE = 22.07 Mg/ha), surpassing the GA-CatBoost model (R² = 0.64, RMSE = 25.88 Mg/ha). In summary, the Lasso-SHAP dual-variable selection strategy effectively improves the estimation accuracy of AGB for Pinus kesiya var. langbianensis forests, while GA-optimized machine learning models demonstrate excellent performance, providing strong support for regional-scale forest resource monitoring and carbon stock assessment.

Keywords:

Pinus kesiya var. langbianensis forest aboveground biomass; interpretable machine learning; genetic algorithm; multi-source remote sensing data

1. Introduction

Forests are crucial components of the Earth’s terrestrial ecosystems and play an irreplaceable role in the global water, energy, and carbon cycles [1,2]. They help mitigate climate change and maintain ecological balance by regulating atmospheric carbon dioxide (CO₂) concentrations through carbon fixation, storage, and cycling [3,4]. Traditional field surveys of forest aboveground biomass (AGB) are time-consuming, demand considerable manpower and material resources, and cause ecological damage [5,6]. Thus, estimating forest AGB by means of remote sensing has become a research hotspot [7]. However, each type of remote sensing imagery has its own limitations [8].

Optical remote sensing (OPR) captures the horizontal structure of forests through visible and infrared light for forest AGB estimation; however, weather, cloud cover, and topography can reduce the accuracy of forest biomass estimation [9,10,11,12]. Lu et al.’s [13] review of remote sensing-based AGB estimation methods showed that OPR data are affected by weather, with particularly high cloud cover in the tropics. Sun et al. [14] analyzed the uncertainty of forest AGB estimates derived from OPR and LiDAR (GEDI) data; the results showed that OPR is affected by factors such as elevation, slope, and climate. Fremout et al. [15] mapped long-term tropical dry forest degradation and showed that combining OPR with microwave radar (MR) improves forest AGB estimation accuracy. Although MR synergizes with OPR to improve the accuracy of forest AGB estimation, MR has unavoidable limitations of its own. In their study of forest AGB estimation based on SAR backscatter and interferometric SAR observations, Santoro et al. [16] showed that MR is affected by topographic factors. Moreover, the loss of sensitivity of the SAR backscatter coefficient and coherence leads to low accuracy of forest AGB estimation. Sinha et al. [17] reviewed research advances in radar remote sensing for biomass estimation, finding that the synergistic use of multi-sensor optical data and MR has better potential than a single sensor, but MR has its own limitations and complexities. Therefore, Liu et al. [18] used airborne (TomoSAR) backscatter coefficients and spaceborne LiDAR (GEDI) data for forest height and AGB estimation, which showed that synergizing MR and GEDI data can improve the estimation accuracy of tropical forest height and AGB. In addition, forest branches can absorb radar signals, preventing sensors from accurately acquiring ground reflection values [19].
However, GEDI data are distributed as discrete footprints over the study area, so they do not provide continuous coverage of the entire study area [20]. Therefore, combining GEDI data with OPR as well as MR is necessary. May et al. [21] used GEDI together with a hierarchical model to map aboveground biomass in Indonesian lowland forests and showed that GEDI-measurable objects were not directly associated with AGB. Furthermore, GEDI observations are spatially incomplete. Benson et al. [22] estimated forest AGB and canopy height in homogeneous areas by combining SAR, GEDI, and OPR with physically based models, which showed that the fusion of multimodal remote sensing techniques with minimal ground-based information could improve the accuracy of estimating AGB and canopy height (with RMSEs of 1.6 kg/m² and 1.68 m). Therefore, the fusion of multi-source remote sensing data has become a mainstream trend to improve biomass estimation accuracy [23]. Nevertheless, synergizing multi-source remote sensing data generates a large number of feature variables that affect the accuracy of forest AGB estimation, so it is necessary to select key remote sensing factors for forest AGB estimation [24].

With a large amount of remote sensing feature information, effectively excluding redundant feature variables becomes especially critical for improving the accuracy of AGB estimation. Variable selection methods can screen key remote sensing factors for accurate estimation of forest AGB [25]. A rational choice of variables can not only reduce redundant information and simplify the model structure but also improve prediction accuracy and robustness [26]. SHapley Additive exPlanations (SHAP) is a game-theoretic model interpretation method with theoretical validity and wide model adaptability. Its core principle is based on cooperative game theory, following the axioms of efficiency, fairness, symmetry, and additivity: each feature is treated as a “player”, and the real impact of the variable on the model output is reflected by its marginal contribution (Shapley value) across all feature combinations. It provides a unified feature importance metric and a consistency guarantee. SHAP has mainly been used for visual analysis of model results, but in recent years it has increasingly been used for auxiliary variable selection and feature optimization [27,28]. The advantage of this method is that it provides a precise quantitative interpretation of each feature, which can effectively improve the transparency, interpretability, and credibility of the model. In terms of variable selection and modeling, SHAP has not only identified key influencing factors and improved prediction accuracy but also demonstrated good generalization ability and consistency of interpretation in several studies [29,30]. A number of studies have verified its explanatory ability and variable identification effect in complex models. Pezoa et al. [31] interpreted the results and analyzed the importance of individual features by calculating SHAP values, experimentally showing that the SHAP method has high potential for understanding complex machine learning models. Ekanayake et al. [32] used SHAP to capture the complex relationships between components in XGBoost-based prediction of concrete compressive strength. On the other hand, SHAP provides a harmonized measure of the importance of features and the impact of variables on predictions. Li et al. [33] proposed an interpretable AGB prediction framework combining SHAP and XGBoost to estimate AGB in subtropical bamboo forests. The results showed that the method can effectively predict AGB. Molisse et al. [34] used the SHAP software package for model interpretation when implementing an exploratory workflow for estimating aboveground biomass based on Sentinel-2, which provided more insight into the effects of features on model predictions. Huang et al. [35] used SHAP to analyze the inversion contribution of related characteristic factors to grassland AGB and realized the quantitative expression of variable interpretability. These studies further showed that SHAP has significant advantages in enhancing model transparency, identifying key features, and reducing redundant variables. Therefore, the interpretable machine learning method SHAP was introduced to improve model transparency and to quantify the contribution of each variable to the model’s prediction results [36,37]. However, traditional methods of variable selection should not be ignored. Traditional variable selection methods, such as the Pearson correlation coefficient (PC), are mainly suitable for data with linear relationships, while the least absolute shrinkage and selection operator (Lasso) effectively avoids model overfitting by introducing an L1 regularization term [38,39]. Nevertheless, the Lasso variable selection method may select highly collinear and redundant features in a small-sample, high-dimensional dataset, which can reduce the predictive power and stability of the subsequent model, thus affecting the model’s performance and interpretability. Elastic net regression (EN) combines the advantages of L1 and L2 regularization to effectively mitigate the collinearity problem [40]. Algamal et al. [41] used an adjusted adaptive elastic net for regularized logistic regression in a high-dimensional cancer classification study and found that EN was biased in selecting genes and did not perform well when pairwise correlations among variables were not high. RF, in turn, may overemphasize one variable at the expense of others when variables are highly correlated, which complicates the identification of the most appropriate features [42]. This indicates that these commonly used variable selection methods have limitations [43].

Therefore, this study combined SHAP with commonly used variable selection methods to construct a dual-variable selection method for screening key remote sensing factors for forest AGB estimation and compared it with SHAP-only variable selection. Although variable selection is important in forest AGB estimation, model construction is equally indispensable. SVR, RF, CatBoost, EN, and Lasso are machine learning algorithms that are widely used for forest AGB estimation. However, their parameters are complicated to set, and the models are sensitive to hyperparameters; improper tuning may lead to overfitting or underfitting [44]. The genetic algorithm (GA) efficiently searches for optimal solutions in large-scale parameter spaces by simulating the biological evolution process, which helps it avoid the local optima that can trap traditional optimization methods [45]. In addition, the GA can improve the model accuracy during parameter tuning for forest biomass estimation while enhancing its adaptability. Ji et al. [46] proposed a GA-optimized SVR algorithm that significantly improved the accuracy of forest AGB estimation using SAR data. Mabdeh et al. [47] demonstrated the superiority of a GA-optimized SVR model in a forest fire susceptibility assessment and mapping study based on support vector regression, an adaptive neuro-fuzzy inference system, and evolutionary algorithms. Therefore, the use of the GA can remedy the shortcomings of machine learning algorithms and increase the estimation accuracy of forest AGB.

To address the above challenges, this study took the Pinus kesiya var. langbianensis woodland in the Pu’er region of Yunnan as the research object and proposed an integrated framework that combined multi-source remote sensing, SHAP dual-variable selection, and GA optimization modeling to improve the accuracy and stability of forest AGB estimation. The specific objectives include (1) integrating six types of active and passive remote sensing data from Landsat 8, Sentinel-2A, GEDI, ICESat-2, ALOS-2, and SAOCOM; (2) combining Lasso/EN and other techniques to construct a SHAP dual-variable strategy to improve the interpretability and stability of variable selection; and (3) introducing a genetic algorithm to optimize five machine learning models, SVR, RF, Lasso, EN and CatBoost, to construct the best model for AGB estimation.

2. Methods

The research framework consists of six key stages: (1) obtaining field sample survey and multi-source remote sensing data; (2) storing and organizing the data and constructing the dataset; (3) calculating the AGB of the sample plots and pre-processing the different remote sensing data; (4) selecting the key variables through the single- and dual-variable selection methods; (5) optimizing the model parameters with the GA and applying 5-fold cross-validation to evaluate the models; and (6) producing the forest AGB inversion map with the optimal model. The technical workflow is shown in Figure 1.

2.1. Study Area

The study focuses on Zhenyuan County, northern Pu’er, Yunnan Province, spanning 100°21′–101°31′E and 23°34′–24°22′N, with a mean annual temperature of 20.9 °C [48]. The terrain is predominantly mountainous, with mountains covering 97.7% of the county’s area, and it hosts one of the densest forest distributions in the region. Pinus kesiya var. langbianensis forests dominate at elevations between 1200 m and 2000 m [49], forming a major component of the local forest ecosystem. Field data were collected in Wuyi Village, a representative site within the state-owned forest farm, where Pinus kesiya var. langbianensis is the most widespread species [50]. This forest type supports ecological stability, provides key habitats, and contributes to local economic value. The study area is shown in Figure 2.

2.2. Data Acquisition and Processing

2.2.1. Sample Plot Collection and Forest AGB Estimation

Field sampling was conducted in Wuyi Village, Pu’er, Yunnan Province, where 64 Pinus kesiya var. langbianensis forest plots (30 × 30 m) were established. Coordinates of individual trees and plots were recorded using RTK, and diameters at breast height (DBH) and tree height (H) were measured to support biomass estimation. Individual-tree AGB was computed using Equations (1) through (4), with R² values of 0.9788, 0.9861, 0.9664, and 0.880, respectively [51,52]. Plot-level AGB was then calculated using Equation (5) and converted to a per-hectare basis using Equation (6). Fieldwork was completed in March 2023.

Pinus kesiya var. langbianensis

W_{P} = \sum_{i = 1}^{n_{1}} 0.02997 {(D_{i}^{2} H_{i})}^{0.97817}

(1)

Keteleeria fortunei

W_{K} = \sum_{i = 1}^{n_{2}} 0.0729 {(D_{i}^{2} H_{i})}^{0.9334}

(2)

Quercus acutissima Carruth

W_{Q} = \sum_{i = 1}^{n_{3}} 0.1663 {(D_{i}^{2} H_{i})}^{0.7821}

(3)

Broadleaf species

W_{b} = \sum_{i = 1}^{n_{4}} 0.1793 {(- 0.619 + D_{i})}^{2}

(4)

Total forest AGB of the sample plots

W_{t} = W_{P} + W_{K} + W_{Q} + W_{b}

(5)

Forest AGB per hectare of the sample site

W_{h} = \frac{W_{t}}{0.09 \times 1000}

(6)

where n_{1}, n_{2}, n_{3}, and n_{4} represent the number of trees of Pinus kesiya var. langbianensis, Keteleeria fortunei, Quercus acutissima Carruth, and other broadleaf species, respectively; D_{i} and H_{i} are the diameter at breast height (DBH) and height (H) of the i-th tree, respectively; W_{P}, W_{K}, W_{Q}, and W_{b} indicate the AGB of each species group in the sample plot and W_{t} their total (unit: kg); and W_{h} indicates the forest AGB per hectare (unit: Mg/ha).

The forest AGB of sample plots in the study area was calculated based on the above formula, and the statistical results are shown in Table 1. The maximum AGB was 268.82 Mg/ha, the minimum was 75.36 Mg/ha, and the average was 147.68 Mg/ha.
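The plot-level calculation in Equations (1) to (6) can be sketched as follows. This is a minimal illustration rather than the authors' code: the species keys, the function names, and the units of DBH (cm) and height (m) are assumptions, Equation (1) is read as having the same (D²H)^b form as Equations (2) and (3), and Equation (6) is read as a conversion from kilograms per 0.09 ha plot to Mg/ha.

```python
# Coefficients transcribed from Equations (1)-(4); species keys are illustrative.
PLOT_AREA_HA = 0.09  # one 30 m x 30 m plot


def tree_agb_kg(species, dbh, height):
    """Single-tree AGB in kg (DBH assumed in cm, height assumed in m)."""
    d2h = dbh ** 2 * height
    if species == "pinus_kesiya":
        return 0.02997 * d2h ** 0.97817   # Equation (1), assumed (D^2 H)^b form
    if species == "keteleeria":
        return 0.0729 * d2h ** 0.9334     # Equation (2)
    if species == "quercus":
        return 0.1663 * d2h ** 0.7821     # Equation (3)
    return 0.1793 * (dbh - 0.619) ** 2    # Equation (4), other broadleaf species


def plot_agb_mg_per_ha(trees):
    """Sum tree AGB over a plot (Equation (5)) and convert kg/plot to Mg/ha."""
    total_kg = sum(tree_agb_kg(s, d, h) for s, d, h in trees)
    return total_kg / PLOT_AREA_HA / 1000.0


# Hypothetical plot with three measured trees: (species, DBH, height)
example_plot = [
    ("pinus_kesiya", 20.0, 15.0),
    ("quercus", 12.0, 9.0),
    ("broadleaf", 8.0, 6.0),
]
print(round(plot_agb_mg_per_ha(example_plot), 2), "Mg/ha")
```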

2.2.2. Multi-Source Geospatial and Remote Sensing Datasets

This study utilized satellite remote sensing data from Landsat 8 OLI, Sentinel-2A, GEDI, ICESat-2, ALOS-2, and SAOCOM, along with climate and land cover classification datasets (Table 2).

To ensure the quality and reliability of remote sensing data, preliminary processing was applied to each dataset type. For optical imagery, Landsat 8 and Sentinel-2A underwent atmospheric correction, radiometric calibration, geometric correction, and terrain correction, following standard pre-processing protocols [53]. All raster data were geometrically co-registered using high-resolution base maps to ensure spatial alignment across datasets. Spaceborne LiDAR data were filtered prior to analysis. For GEDI L2A/L2B, the following criteria were applied [54,55]: (1) quality_flag = 1, (2) degrade_flag = 0, (3) sensitivity > 0.95, and (4) canopy height between 3 m and 60 m. Similarly, ICESat-2 ATL08 data were filtered using the following rules [56]: (1) canopy height (h_canopy) must be between 3 m and 60 m, (2) when h_canopy < 20 m, the associated uncertainty (h_canopy_uncertainty) must be <20, (3) differentiation between strong and weak beams was retained, and (4) only valid photon types were preserved (classed_pc_flag = 0), including ground, canopy, and canopy-top photons. For SAR data, ALOS-2 PALSAR-2 [57,58] and SAOCOM-L1A [59,60] were subjected to radiometric calibration, multi-looking, speckle filtering, geometric correction, polarization decomposition, and terrain correction to extract backscatter coefficients (HH, HV, VH, VV), radar indices, and polarimetric decomposition features. Additional auxiliary data included WorldClim v2.1 climate layers—temperature, precipitation, and humidity—downscaled from ERA5-Land using the Delta method [61] and land cover maps from the AI Earth 10 m classification dataset for China (2022). To ensure spatial consistency and pixel-level comparability, all raster layers were resampled to 30 m × 30 m resolution, co-registered, and projected using the WGS 1984 coordinate system.
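The GEDI filtering rules listed above can be expressed as a simple predicate. The sketch below assumes each footprint is a dictionary; the key `canopy_height_m` is a hypothetical name for whichever height metric was used, and treating the 3 m and 60 m bounds as inclusive is also an assumption.

```python
def keep_gedi_shot(shot):
    """Return True if a GEDI footprint passes the four criteria in the text."""
    return (
        shot["quality_flag"] == 1
        and shot["degrade_flag"] == 0
        and shot["sensitivity"] > 0.95
        and 3.0 <= shot["canopy_height_m"] <= 60.0  # bounds assumed inclusive
    )


# Hypothetical footprints for illustration only
shots = [
    {"quality_flag": 1, "degrade_flag": 0, "sensitivity": 0.98, "canopy_height_m": 22.4},
    {"quality_flag": 1, "degrade_flag": 0, "sensitivity": 0.93, "canopy_height_m": 18.0},
    {"quality_flag": 0, "degrade_flag": 0, "sensitivity": 0.99, "canopy_height_m": 25.0},
    {"quality_flag": 1, "degrade_flag": 0, "sensitivity": 0.97, "canopy_height_m": 1.5},
]
filtered = [s for s in shots if keep_gedi_shot(s)]
```

The same pattern would apply to the ICESat-2 ATL08 rules, with the field names swapped for the ATL08 ones.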

2.2.3. Remote Sensing Variable Extraction

The extraction of informative features from remote sensing imagery constitutes the analytical foundation for AGB modeling [62]. In this study, a comprehensive suite of remote sensing variables was systematically derived from multiple sensor platforms, encompassing optical, radar, and LiDAR data. For Landsat 8 OLI, a total of 7 spectral bands were utilized alongside 18 vegetation indices, 7 enhancement factors, and 168 texture features generated using Gray Level Co-occurrence Matrix (GLCM) metrics across 3 × 3, 5 × 5, and 7 × 7 window sizes [20]. Similarly, ALOS-2 PALSAR-2 and SAOCOM-L1A contributed 8 backscatter coefficients (HH, HV, VV, VH), 96 polarization-based texture measures (derived from HV and VH polarizations at multiple spatial scales), 20 radar vegetation indices, 20 polarization ratios, and 24 polarimetric decomposition features. Sentinel-2A imagery yielded 12 spectral bands, 19 vegetation indices, 6 enhancement factors, and 264 texture features using the same window configurations. In addition, 50 structural and waveform-based metrics were extracted from GEDI L2A/L2B, while 45 complementary features were derived from ICESat-2 ATL08. To enhance model accuracy and generalizability, auxiliary variables related to climate and topography were also incorporated. In total, the feature set comprised 733 variables spanning spectral, structural, textural, and environmental dimensions [63,64] (Table 3).

2.3. Variable Selection Methods

This study employed nine variable selection methods to optimize forest biomass estimation models. These comprised the Pearson correlation coefficient (PC), RF importance ranking, EN, Lasso, and SHAP, used alone and, for the first four, in combination with SHAP.

Specifically, PC measures the degree of linear correlation between each predictor variable and AGB and is a widely used correlation analysis method. In this study, a significance level of p < 0.05 was set to exclude variables with low linear correlation with AGB or strong collinearity, reducing model redundancy [65].

RF importance ranking is based on an ensemble of decision trees: the importance of each variable is evaluated from its average contribution to error reduction at node splits, which measures its impact on model accuracy. The method is robust to nonlinear relationships and multivariate interference and can effectively reveal the explanatory power of the feature variables. To ensure the representativeness of the variable selection, only the top-ranked variables that together accounted for 90% of the cumulative importance were retained in this study [66].
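One way to read the 90% cumulative-importance rule is sketched below. The importance scores and feature names are hypothetical; in practice they would come from a fitted random forest. The rule for ties and for the exact cut-off (keep the smallest leading set whose cumulative share reaches the threshold) is an assumption, not something the text specifies.

```python
import numpy as np


def select_by_cumulative_importance(importances, names, threshold=0.90):
    """Keep the top-ranked features whose normalized importances sum to >= threshold."""
    importances = np.asarray(importances, dtype=float)
    share = importances / importances.sum()       # normalize to proportions
    order = np.argsort(share)[::-1]               # most important first
    cumulative = np.cumsum(share[order])
    # index of the first position where the running total reaches the threshold
    n_keep = int(np.searchsorted(cumulative, threshold)) + 1
    n_keep = min(n_keep, len(order))
    return [names[i] for i in order[:n_keep]]


# Hypothetical importance scores for four features
selected = select_by_cumulative_importance(
    [50, 30, 15, 5], ["ndvi", "hv_backscatter", "rh98", "slope"]
)
```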

Lasso regression, a sparse modeling method based on L1 regularization, shrinks the coefficients of uninformative variables to exactly zero and thereby performs automatic feature selection. The regularization strength parameter λ (usually denoted as alpha in software) controls the overall degree of regularization and was optimized through cross-validation to balance sparsity and model performance [67].
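Why L1 regularization produces exact zeros can be seen in a special case. Assuming an orthonormal design matrix (which rarely holds for real remote sensing features) and the objective (1/2)‖y − Xβ‖² + λ‖β‖₁, the Lasso solution is the soft-thresholded ordinary least squares estimate. The sketch below only illustrates that operator; the coefficient values are hypothetical.

```python
import numpy as np


def soft_threshold(z, lam):
    """Shrink each value toward zero by lam; values within [-lam, lam] become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


ols_coefficients = np.array([3.0, -0.5, 1.2, 0.05])   # hypothetical OLS estimates
lasso_coefficients = soft_threshold(ols_coefficients, lam=1.0)
kept = np.flatnonzero(lasso_coefficients)              # indices of surviving features
```

Increasing `lam` removes more features, which is the sparsity and performance trade-off that cross-validation is used to balance.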

EN regression combines the strengths of Lasso and Ridge regression to enhance the robustness of variable selection. By integrating L1 and L2 regularization, EN addresses both variable selection and multicollinearity, making it suitable for high-dimensional data. l1_ratio, a key hyperparameter in EN, controls the ratio of L1 regularization (Lasso) to L2 regularization (Ridge). In this study, l1_ratio is set to 0.5, and the regularization strength parameter (λ) is adjusted through cross-validation to achieve the optimal feature combination [68]. These complementary methods provided a multi-perspective screening of input variables, laying a foundation for robust AGB modeling.
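For reference, one common way to write the elastic net objective is given below. This follows the parameterization used by scikit-learn's `ElasticNet`, which the hyperparameter name l1_ratio suggests but the text does not confirm; here ρ stands for l1_ratio, λ for the regularization strength, and n for the number of samples.

```latex
\hat{\beta} = \arg\min_{\beta} \;
  \frac{1}{2n} \lVert y - X\beta \rVert_2^2
  + \lambda \rho \lVert \beta \rVert_1
  + \frac{\lambda (1 - \rho)}{2} \lVert \beta \rVert_2^2
```

Under this form, ρ = 1 recovers the Lasso penalty and ρ = 0 the Ridge penalty, so ρ = 0.5 weights the two penalties equally.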

To improve the interpretability of variable selection and to quantify the contribution of different variables to the AGB estimation model, the SHAP method was introduced in this study. SHAP is an interpretable machine learning framework grounded in cooperative game theory, designed to quantify the contribution of each feature to a model’s prediction. SHAP values are derived from the Shapley value concept, which allocates a marginal contribution to each feature such that the sum of all contributions equals the difference between the model’s prediction and a baseline value, typically the mean of the target variable [69]. Let X_{i} denote the i-th sample, X_{ij} the j-th feature of that sample, and \hat{y}_{i} = f(X_{i}) the model prediction. Let \hat{y}_{base} represent the model baseline (usually the mean of all target values). Then, the SHAP value explanation satisfies the following equation (Equation (7)):

\hat{y}_{i} = \hat{y}_{base} + \sum_{j = 1}^{k} \phi_{ij}

(7)

where k is the number of features and \phi_{ij} denotes the SHAP value of the j-th feature in the i-th sample, representing its individual contribution to the predicted outcome \hat{y}_{i}. A positive SHAP value (\phi_{ij} > 0) indicates that the feature increases the prediction above the baseline, while a negative value implies a suppressive or negative influence [37].

Model interpretation in SHAP begins by constructing an explainer, with support for various model types, including deep, gradient, kernel, tree, and sampling explainers [70]. In the case of tree-based models—such as XGBoost, LightGBM, and CatBoost—SHAP’s tree explainer provides efficient and accurate attribution. Global interpretability in SHAP assesses overall feature importance across the entire dataset, where SHAP values farther from zero indicate greater contribution to model output [71]. Each feature can exert both positive and negative influences, depending on its directional effect on the prediction [72]. In this study, SHAP was applied in conjunction with multiple machine learning models, including Lasso, EN, RF, SVR, and CatBoost. For each model, SHAP values were computed and ranked by absolute magnitude. The top 20 features were selected as key variables, aiming to improve both predictive performance and computational efficiency.
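The additivity property in Equation (7) and the ranking step can be illustrated without any SHAP library for the special case of a linear model whose features are treated as independent, where the SHAP value has the closed form φ_ij = β_j (x_ij − mean of feature j). Everything in the sketch is synthetic: the data, coefficients, and feature names are hypothetical, the baseline is taken as the mean model prediction, and ranking by the mean absolute SHAP value across samples is one assumed reading of "ranked by absolute magnitude".

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 64, 5
X = rng.normal(size=(n_samples, n_features))            # synthetic feature matrix
feature_names = [f"x{j}" for j in range(n_features)]    # hypothetical names
beta = np.array([4.0, -2.0, 0.0, 1.0, 0.5])             # hypothetical coefficients
intercept = 150.0


def predict(X):
    """Linear model standing in for a fitted Lasso or EN regressor."""
    return intercept + X @ beta


# Closed-form SHAP values for a linear model with independent features
phi = beta * (X - X.mean(axis=0))                        # shape (n_samples, n_features)
y_base = predict(X).mean()                               # baseline: mean prediction

# Equation (7): baseline plus the row sum of SHAP values reproduces each prediction
reconstructed = y_base + phi.sum(axis=1)


def top_k_by_mean_abs_shap(shap_values, names, k):
    """Rank features by mean |SHAP| over samples and keep the k largest."""
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1][:k]
    return [names[j] for j in order]


top3 = top_k_by_mean_abs_shap(phi, feature_names, k=3)
```

In a dual-variable setting, `X` would hold only the columns retained by the first-stage method (for example, the Lasso-selected variables), and `k` would be 20.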

2.4. AGB Model Parameter Optimization

In this study, the GA was employed to automatically optimize hyperparameters for SVR, RF, CatBoost, Lasso, and EN models. Compared to a traditional grid search, the GA substantially reduces computational cost while improving optimization efficiency. The process involves randomly initializing a parameter population, followed by iterative fitness evaluation, selection, crossover, and mutation, ultimately converging to optimal configurations. For SVR, the GA tuned C (penalty), gamma (RBF kernel parameter), and epsilon (margin tolerance) [73]. In RF, n_estimators, max_depth, and min_samples_split were optimized to balance model complexity and reduce overfitting [74]. CatBoost hyperparameters, including learning_rate, depth, iterations, and subsample, were adjusted to enhance convergence and regularization [75]. For Lasso and EN, the GA optimized λ (regularization strength) and α (L1/L2 ratio), enabling effective feature selection and stability in high-dimensional, collinear data [39,40]. The global search capability of the GA allows models to maintain strong predictive performance across both linear and nonlinear relationships [76]. All models were trained using 5-fold cross-validation to ensure robustness and generalizability [77]. The dataset was partitioned into five subsets; each subset was used once for validation, while the others were used for training. The process yielded average performance metrics across folds, reducing dependence on specific data splits and enabling reliable model evaluation.
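The GA loop described above (random initialization, fitness evaluation, selection, crossover, mutation) can be sketched in a few lines. This is a generic illustration, not the configuration used in the study: the population size, tournament selection, uniform crossover, Gaussian mutation, and elitism are all assumed choices, and `toy_fitness` is a hypothetical stand-in for a cross-validated RMSE over two hyperparameters.

```python
import random


def genetic_search(fitness, bounds, pop_size=20, generations=30,
                   mutation_rate=0.2, seed=0):
    """Minimize `fitness` over a box; `bounds` is a list of (low, high) pairs."""
    rng = random.Random(seed)

    def random_individual():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def tournament(scored):
        # pick two scored individuals at random and keep the fitter one
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] <= b[0] else b[1]

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(((fitness(ind), ind) for ind in population),
                        key=lambda pair: pair[0])
        next_population = [scored[0][1][:]]          # elitism: keep the best as-is
        while len(next_population) < pop_size:
            p1, p2 = tournament(scored), tournament(scored)
            # uniform crossover: each gene comes from either parent
            child = [g1 if rng.random() < 0.5 else g2 for g1, g2 in zip(p1, p2)]
            # Gaussian mutation, clipped to the search bounds
            for j, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    step = rng.gauss(0.0, 0.1 * (hi - lo))
                    child[j] = min(hi, max(lo, child[j] + step))
            next_population.append(child)
        population = next_population

    best = min(population, key=fitness)
    return best, fitness(best)


def toy_fitness(params):
    """Hypothetical error surface with its minimum at C = 3, gamma = 0.5."""
    c, gamma = params
    return (c - 3.0) ** 2 + (gamma - 0.5) ** 2


best_params, best_score = genetic_search(toy_fitness, bounds=[(0.0, 10.0), (0.0, 1.0)])
```

In an actual tuning run, the fitness function would train the model with the candidate hyperparameters and return the mean validation error across the five folds.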

2.5. Model Evaluation

Model accuracy was evaluated using the coefficient of determination (R^{2}) and the root mean square error (RMSE) [78,79,80]. R^{2} ranges from 0 to 1, with higher values indicating better model accuracy, while lower RMSE values indicate higher precision. The specific formulas are as follows:

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}

(8)

RMSE = \sqrt{\frac{\sum_{i = 1}^{n} {({\hat{y}}_{i} - y_{i})}^{2}}{n}}

(9)

In the equations, n represents the sample size, y_{i} denotes the true values, {\hat{y}}_{i} refers to the model’s predicted values, and \bar{y} is the mean of the true values.
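Equations (8) and (9), together with the 5-fold partition used for evaluation, can be written compactly as below. The function names, the shuffling seed, and the use of a random permutation before splitting are illustrative assumptions.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination, Equation (8)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(y_true, y_pred):
    """Root mean square error, Equation (9)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]


# With 64 plots, each fold validates on 12 or 13 plots
fold_sizes = [len(val) for _, val in k_fold_indices(64, k=5)]
```

Averaging `r2_score` and `rmse` over the five validation folds gives the fold-averaged metrics described in Section 2.4.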

3. Analysis of Results

3.1. Model Variable Selection

In this study, nine different variable selection methods were used to optimize the AGB model, as shown in Figure 1. In the variable selection process, Pearson correlation analysis identified 248 variables with a significance level (p) of less than 0.05. RF importance selection retained 196 variables, with a cumulative contribution of 0.9. Elastic net selected 88 variables, while Lasso (LAS) selected 47 variables. SHAP, in combination with Lasso, EN, RF, SVR, and CatBoost, selected the top 20 variables based on absolute SHAP value rankings for final modeling (Figure 3). The five methods retained the following proportions of the 733 total variables: PC: 33.83%, RF: 26.74%, EN: 12.01%, LAS: 6.41%, and SHAP: 2.73%. To further optimize variable selection and reduce dimensionality, a dual-variable selection approach was implemented. The first round of selection used conventional methods, followed by a second round of selection using interpretable machine learning methods. The results are shown in Figure 3. Notably, the selected feature variables varied across methods, and their contributions to AGB differed across models, suggesting that variables selected by different methods may have significant impacts on AGB variation.
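The retained proportions quoted above follow directly from the variable counts and the 733-variable total, as this short check shows (the dictionary keys are simply the abbreviations used in the text).

```python
TOTAL_VARIABLES = 733
retained = {"PC": 248, "RF": 196, "EN": 88, "LAS": 47, "SHAP": 20}

# Percentage of the full feature set kept by each method, to two decimals
shares = {name: round(100 * count / TOTAL_VARIABLES, 2)
          for name, count in retained.items()}
```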

3.2. Model Results Analysis

3.2.1. Comparison of AGB Estimation Accuracy Across Different Variable Selection Methods

This study employed the GA to automatically optimize hyperparameters for several machine learning models, including SVR, RF, CatBoost, Lasso, and EN (Figure 4). Among the five GA-optimized models, Lasso and EN generally performed better in the single-variable selection methods. In the GA-EN model, Lasso-selected variables resulted in the highest model fit, with R² = 0.87 and RMSE = 15.74 Mg/ha, while RF-selected variables yielded the lowest model fit, with R² = 0.40 and RMSE = 33.49 Mg/ha. In the GA-SVR model, Lasso-selected variables had the highest R² (0.74), while the variables selected by Pearson correlation (PC) showed the lowest fit accuracy (R² = 0.36). In the GA-RF and GA-CatBoost models, the best fit for Lasso’s single-variable selection was R² = 0.49, RMSE = 30.77 Mg/ha, and R² = 0.51, RMSE = 30.24 Mg/ha, respectively, both outperforming PC and RF selection methods. Overall, Lasso and EN showed the best performance among single-variable selection methods, while PC and RF variable selection models showed lower fitting accuracy.

Compared to single-variable selection, dual-variable selection methods significantly improved model fit. In the GA-EN model, the LAS-EN (SHAP) combination achieved the highest model fit (R² = 0.89, RMSE = 14.46 Mg/ha), outperforming both Lasso and EN alone. In the GA-RF model, the LAS-RF (SHAP) combination achieved the highest R² of 0.52, an improvement over the single Lasso selection (R² = 0.49), indicating SHAP’s significant impact on variable selection in RF models. In the GA-CatBoost model, the LAS-CAT (SHAP) combination yielded the highest fit (R² = 0.64, RMSE = 25.88 Mg/ha), followed by EN-CAT (SHAP), with R² = 0.55 and RMSE = 29.05 Mg/ha. However, some dual-variable selection methods did not effectively improve model performance. In the GA-SVR model, only the PC-SVR (SHAP) combination showed an improvement in fit accuracy (R² = 0.42) compared to the single PC-selected variables (R² = 0.36), while other dual-variable selection methods resulted in decreased fitting accuracy, suggesting that SHAP’s combination with SVR for variable selection is less effective. Among all variable selection methods, the Lasso-SHAP combination consistently yielded higher fitting accuracy across the five GA-optimized models, with R² values greater than 0.51. This suggests that SHAP’s variable selection capability improves the model’s generalization ability. Notably, the GA-Lasso model achieved the highest accuracy (R² = 0.91, RMSE = 12.94 Mg/ha).

By comparing nine variable selection methods, this study found that dual-variable selection methods outperformed single-variable methods. Overall, SHAP-based variable selection enhanced the interpretability and stability of the models used in this study while effectively optimizing predictive accuracy. As shown in Figure 5, the number of selected variables in each method was as follows: PC (248), RF (196), EN (88), LAS (47), PC-SHAP (20), and SHAP (20). Notably, using the SHAP dual-variable selection method effectively reduced the number of redundant variables, resulting in a significant reduction in model runtime from 400 s to approximately 8 s, while maintaining high accuracy. This demonstrates that SHAP’s interpretable machine learning algorithm not only enhances model interpretability but also improves the accuracy, efficiency, and stability of forest biomass estimation by effectively selecting a small number of multi-source remote sensing features.

3.2.2. Comparison of the Accuracy for the Five Models

Figure 6 illustrates the fitting results of the five models under the nine variable selection methods. The GA-Lasso model was the most stable, consistently outperforming the other models under combinations such as PC-SHAP, EN-SHAP, and LAS-SHAP, as well as under SHAP alone. Ranked by mean R² across the selection methods, the models ordered as follows: GA-Lasso (0.71) > GA-EN (0.62) > GA-SVR (0.51) > GA-CAT (0.48) > GA-RF (0.46). SHAP-enhanced variable selection methods significantly improved the accuracy of forest AGB estimation.
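
For reference, the two accuracy metrics reported throughout this section can be computed as in the minimal sketch below. It assumes the standard definitions of R² (one minus the ratio of the residual to the total sum of squares) and RMSE; the function names and the toy AGB values are illustrative assumptions and are not taken from the study's data.

```python
import math

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def rmse(observed, predicted):
    """Root mean square error, in the units of the inputs (e.g., Mg/ha)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Hypothetical plot-level AGB values (Mg/ha), for illustration only.
agb_observed = [100.0, 150.0, 200.0, 250.0]
agb_predicted = [110.0, 140.0, 190.0, 260.0]

print(round(r_squared(agb_observed, agb_predicted), 3))  # 0.968
print(round(rmse(agb_observed, agb_predicted), 2))       # 10.0
```

A mean R² per model, as used for the ranking above, would then simply be the average of `r_squared` over the nine variable selection schemes.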

3.2.3. Comparison of Variable Selection Differences Among Models

To further examine the sensitivity and consistency of different variable selection methods in remote sensing applications, this study evaluated their effects on feature selection under the optimal variable combinations. As summarized in Table 4, the selected remote sensing features varied across the five GA-optimized models. The corresponding model fitting accuracies were as follows. For GA-Lasso (LAS-SHAP), optical (16), microwave (3), and LiDAR (1) variables were selected, yielding the highest performance (R² = 0.91, RMSE = 12.94 Mg/ha). GA-EN (LAS-EN-SHAP) selected optical (17), microwave (2), and LiDAR (1) variables (R² = 0.89, RMSE = 15.15 Mg/ha). GA-SVR (Lasso) included a broader set of features, with optical (31), microwave (10), LiDAR (4), topographic (1), and climatic (1) variables (R² = 0.74, RMSE = 22.07 Mg/ha). GA-CatBoost (LAS-CAT-SHAP) selected optical (14), microwave (4), LiDAR (1), and topographic (1) variables (R² = 0.64, RMSE = 25.88 Mg/ha). GA-RF (LAS-RF-SHAP) selected optical (12), microwave (5), LiDAR (1), topographic (1), and climatic (1) variables (R² = 0.52, RMSE = 29.91 Mg/ha). These results demonstrate that optical remote sensing features are most strongly correlated with AGB and consistently dominate model performance. However, microwave and LiDAR features also contribute valuable vertical and spatial structural information, enhancing the reliability of AGB estimation.

Overall, the Lasso-SHAP method achieved consistently higher fitting accuracy across all models, with GA-Lasso performing the best and GA-RF showing the lowest accuracy. Despite differences in variable selection strategies, all models converged on the same five key remote sensing variables—GD2Brg_aN, S2_MTCI, S2_X3B2Cor, L8_X3B2Cor, and A2_BZ3—highlighting their critical importance in biomass estimation and indicating strong consistency in feature selection. The integration of SHAP with Lasso substantially reduced uncertainty in variable selection and produced the best performance in the GA-Lasso model (R² = 0.91). Dual-variable selection approaches combining SHAP with Lasso or EN further improved estimation accuracy. Compared with conventional methods, SHAP not only enhanced model interpretability but also optimized the feature selection strategy, resulting in more spatially uniform biomass predictions and improved stability and reliability.

3.3. Comparison of AGB Inversion Across Different Models

Using the optimal variable selection and prediction scheme for each of the five GA models, AGB was estimated and mapped for Pinus kesiya var. langbianensis forests. Additionally, the 2022 AIEC, resampled to a 30 m spatial resolution, was used to refine the predicted results by distinguishing forest from non-forest areas. As shown in Figure 7, the five model outputs at the 30 m pixel level produced 60,722 prediction values. While the predicted AGB values varied across models, their overall distributions were similar and closely aligned with field-measured AGB. Higher AGB values were concentrated in the western and southwestern regions, lower values appeared in the eastern and northeastern areas, and the central region exhibited a fragmented pattern. This spatial fragmentation is mainly attributed to the presence of rivers and roads, as well as greater human disturbance in the central zone. The western region, with higher elevation and denser vegetation cover, showed higher AGB values. The predicted forest AGB values for the different GA models were as follows: GA-Lasso ranged from 30.15 to 279.97 Mg/ha (Figure 7a), GA-EN ranged from 32.20 to 279.99 Mg/ha (Figure 7b), GA-SVR ranged from 38.69 to 271.26 Mg/ha (Figure 7c), GA-CAT ranged from 81.66 to 238.07 Mg/ha (Figure 7d), and GA-RF ranged from 99.58 to 211.50 Mg/ha (Figure 7e).

4. Discussion

4.1. Contribution of Dual-Variable Selection to Enhancing AGB Estimation Accuracy

Optical remote sensing variables accounted for the largest share of the final modeling variables in this study under both single- and dual-variable selection, especially those derived from Sentinel-2. They retained a large contribution even after SHAP-based screening, which underlines the importance of optical data in forest AGB estimation. However, optical remote sensing still has limitations in complex terrain and high-density forested areas. Integrating data from various sources can improve the comprehensiveness of information and thus the accuracy of AGB estimates [81]. Therefore, multi-source fusion combining microwave remote sensing (SAR) and spaceborne LiDAR data remains a key direction for future research. Research has shown that the effective integration of data from multiple sources is critical for improving estimation accuracy and efficiency, especially in forest resource monitoring [82]. This study proposes an interpretable approach that combines SHAP with model-based selection to systematically screen multi-source remote sensing features and improve the accuracy of AGB estimation. SHAP maintains computational efficiency while identifying the 20 optimal variables for modeling. Five individual variable selection methods (PC, RF, EN, Lasso, and SHAP) were evaluated, and the performance of SHAP combined with different variable selection strategies was compared. The results show that the model built with LAS (SHAP) dual-variable selection outperforms other combinations in terms of AGB estimation accuracy and computational efficiency.
This advantage stems from their complementary nature: SHAP screens key variables related to AGB from an interpretive perspective [71], while Lasso further removes redundant information from a statistical perspective [67], forming a two-tiered mechanism of “interpretive screening-statistical regularization” that provides a transferable framework for variable screening in different forest types. In the GA-Lasso model, the dual-variable selection method clearly outperforms the other selection methods in AGB estimation and shows more stable predictive performance. Among the other four GA-optimized models, dual-variable selection also outperforms single-variable selection in all but GA-SVR, primarily because of the model interpretability and transparency of variable contributions provided by SHAP [33]. With the increasing abundance of multi-source remote sensing data, integrating multiple variable selection methods can help minimize redundant data processing, reduce multicollinearity, and select optimal variables for accurate AGB estimation. As shown in Table 4, the best variable screening combination is LAS (SHAP), which illustrates the key role of combining SHAP with the Lasso model in AGB estimation; for most of the other GA models, the best-fitting variable sets likewise came from dual-variable screening that combined LAS with model-specific SHAP. The results show that the top 20 variables retained by the LAS (SHAP) method significantly reduce redundant information and improve computational efficiency, providing an efficient and reliable solution for large-scale AGB estimation. This dual-variable selection method for determining the optimal modeling variables not only shortens the processing time but also improves the overall computational efficiency [83].
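
The two-tiered screening idea can be illustrated with a minimal sketch. It assumes that a Lasso-style first stage has already shrunk some coefficients to zero and that the second stage ranks the surviving variables by mean absolute SHAP value; for a linear model with independent features, the SHAP value of feature j for one sample is the coefficient times the feature's deviation from its mean. The function names, feature names, and numbers are hypothetical, and the study's actual pipeline (which uses fitted models and a cutoff of 20 variables) may differ in ordering and detail.

```python
def mean_abs_linear_shap(rows, coefs):
    """Mean |SHAP| per feature for a linear model, assuming independent features.

    Under that assumption the SHAP value of feature j for sample i is
    coefs[j] * (rows[i][j] - mean of feature j).
    """
    n = len(rows)
    n_features = len(coefs)
    means = [sum(r[j] for r in rows) / n for j in range(n_features)]
    return [
        sum(abs(coefs[j] * (r[j] - means[j])) for r in rows) / n
        for j in range(n_features)
    ]

def dual_select(names, rows, coefs, top_k=20):
    """Stage 1: drop zero-coefficient features; stage 2: keep top_k by mean |SHAP|."""
    importance = mean_abs_linear_shap(rows, coefs)
    survivors = [j for j in range(len(names)) if coefs[j] != 0.0]
    ranked = sorted(survivors, key=lambda j: importance[j], reverse=True)
    return [names[j] for j in ranked[:top_k]]

# Hypothetical features and coefficients, for illustration only.
names = ["f_optical", "f_sar", "f_lidar", "f_noise"]
rows = [
    [0.2, 1.0, 10.0, 5.0],
    [0.4, 3.0, 14.0, 1.0],
    [0.6, 5.0, 18.0, 9.0],
]
coefs = [50.0, 2.0, 0.5, 0.0]  # f_noise was shrunk to zero by the L1 penalty

print(dual_select(names, rows, coefs, top_k=2))  # ['f_optical', 'f_sar']
```

In practice the SHAP values would come from an explainer fitted to the chosen model rather than from this closed-form linear case.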

The SHAP dual-variable selection method was used to model and predict the AGB of the Pinus kesiya var. langbianensis forest. Among the dual-variable selection methods, the LAS-SHAP combination gave the best overall results, with the highest fitting accuracy in the GA-Lasso model (R² = 0.91, RMSE = 12.94 Mg/ha). This demonstrates the practicability and reliability of the dual-variable selection method for inverting the AGB of the Pinus kesiya var. langbianensis forest. Li et al. [84], Ehlers et al. [85], and Su et al. [86] reported results similar to those of this study, confirming the important contribution of different sensors from multiple data sources to AGB inversion. Sa et al. [87] used the Lasso-SVR model to estimate AGB in a planted forest area with flat terrain, with a test set R² = 0.8792. In contrast, the cross-validated GA-Lasso model in this study reached R² = 0.91 under complex mountainous conditions, a slightly higher accuracy. The comparison suggests that this study, by integrating optical, microwave, and LiDAR data, effectively utilizes the complementary advantages of multi-source remote sensing data. Wang et al. [88] used the Lasso-SVR model to estimate forest aboveground biomass in the Deer Head Forest, where variable selection and model optimization improved estimation accuracy (R² = 0.62). Fu et al. [89] compared four machine learning algorithms combined with variable optimization methods for estimating AGB in rubber plantations based on Sentinel-2 remote sensing imagery, and the Boruta-RF model had the best accuracy, with an R² and RMSE of 0.86 and 15.77 Mg/ha, respectively. Against these benchmarks, the dual-variable combination approach used in this study compares favorably. Meanwhile, comparing the modeling results for different combinations of variables revealed that retaining more variables did not necessarily yield a better model fit.
In other words, the number of variables was not positively correlated with the accuracy of the model. Adame-Campos et al. [90] reported that dual-variable screening methods outperform single-variable selection methods in terms of model fitting accuracy, a finding confirmed in this study by the SHAP-based dual-variable screening methods. In addition, the GA-optimized models based on PC- and RF-screened variables had lower fitting accuracy than those based on LAS (SHAP) and the other dual-variable screening methods. The reason may be that PC lacks a theoretical basis for interpreting variable contributions and the feature importance of RF is strongly affected by the number and distribution of samples, whereas the theoretical framework of the Lasso-SHAP combination ensures the stability and interpretability of the screening results. SHAP can show both the local and global contributions of different remote sensing factors to the model results, and the symmetry and validity of the Shapley values ensure the objectivity of the variable importance ranking and enhance the interpretability of key variables [69]. Meanwhile, Li et al. [91] and Chen et al. [92] showed that the impact of climate change on forest resource estimation cannot be ignored. Factors such as tree growth status, leaf color, and vegetation indices can fluctuate significantly over the growth cycle in response to changes in climate. Therefore, meteorological factors were included in this study to improve the accuracy of AGB estimation in Pinus kesiya var. langbianensis forests.

4.2. Impact of Estimation Model Selection on AGB Estimation

The performance of different models on AGB estimation varies significantly, and model optimization and selection are equally crucial [75]. Traditional regression methods may overfit in small-sample settings because of insufficient data, while appropriate regularization methods and nonlinear modeling techniques can effectively improve the applicability and generalization of the model [93]. This study shows that GA-Lasso performs best among the methods tested; its L1 regularization mechanism reduces the risk of overfitting and achieves high-precision estimation of the AGB of Pinus kesiya var. langbianensis forests with small-sample data. In the model performance comparison, GA-Lasso and GA-SVR perform stably, while GA-EN, which combines the advantages of Lasso and Ridge regression, offers more flexibility but is less stable across different datasets because of its greater computational complexity. Furthermore, RF is conventionally expected to provide high prediction accuracy in regression and classification tasks [94,95]. However, in the small-sample setting of this study, GA-RF performed only moderately. The stochastic nature of the model and its sensitivity to data noise probably increased variance and the risk of overfitting. Compared with GA-RF, GA-SVR captures nonlinear features in small-sample data better through an appropriate choice of kernel function (e.g., the RBF kernel), and it is more robust to outliers and noise. Therefore, GA-SVR is more applicable when small-sample data contain complex nonlinear relationships. Overall, GA-RF and GA-CatBoost performed only moderately with the small sample available in this study.
The reason is that GA-RF is prone to overfitting noise, resulting in insufficient generalization ability, while GA-CatBoost struggles to achieve effective optimization in small-sample environments due to its heavy reliance on hyperparameter settings. Compared to GA-RF, GA-CatBoost showed some advantage in modeling structured data. In summary, variable selection alone is not sufficient in AGB estimation; model selection and the optimization strategy also determine the accuracy and robustness of the prediction. Moreover, the performance of different models with small-sample data is affected by factors such as model complexity, regularization method, ability to adapt to nonlinear relationships, and difficulty of parameter tuning. Therefore, when nonlinear relationships must be modeled, selecting a model with strong nonlinear fitting capability (e.g., SVR) is crucial for accurately capturing the complex relationship between AGB and the predictor variables. In addition, genetic algorithm (GA) optimization helps improve the model’s generalization ability and reduce its sensitivity to noise under small-sample conditions. Consequently, Lasso and SVR combined with GA parameter optimization show better stability and prediction ability in small-sample environments.
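
For context, the regularization discussed above can be written in a common textbook form. The notation below (n samples, response y_i, predictor vector x_i, intercept β₀, coefficients β, penalty weight λ, and mixing parameter α) is assumed here for illustration and may differ from the parameterization used by the study's software.

```latex
\hat{\beta}_{\mathrm{Lasso}}
  = \arg\min_{\beta_0,\,\beta}\;
    \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
    + \lambda\,\lVert\beta\rVert_1

\hat{\beta}_{\mathrm{EN}}
  = \arg\min_{\beta_0,\,\beta}\;
    \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
    + \lambda\Bigl(\alpha\,\lVert\beta\rVert_1
    + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Bigr)
```

With α = 1 the elastic net reduces to Lasso. The L1 term is what drives some coefficients exactly to zero, which is why Lasso doubles as a variable selector, and in this form λ (and α for EN) are the hyperparameters a GA search would tune.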

4.3. Variations in Optimal AGB Estimation Among Models

The predictions of the five GA-optimized models showed approximately the same spatial distribution pattern of AGB, characterized by higher values in the west and southwest, a fragmented zone in the center, and lower values in the east and northeast. Nonetheless, the range of predicted values and localized response patterns varied among the models (see Figure 7 and Figure 8). The results indicate that higher predicted AGB values are primarily concentrated in forested areas at higher elevations, while lower AGB values are distributed in non-forested regions (see Figure 8). This suggests that factors such as topography, transportation accessibility, and human disturbances significantly influence the spatial distribution of AGB [96,97,98]. The GA-Lasso and GA-EN models exhibited the widest prediction range (30.15–279.99 Mg/ha) and showed good agreement with field-measured values, indicating strong fitting capability and generalization performance under variable constraint and parameter optimization conditions. As in Roy et al. [99], GA-SVR ranked in the middle of all models but performed best among the nonlinear models. Its predicted range (38.69–271.26 Mg/ha) is close to, and falls within, the ranges of the GA-Lasso and GA-EN models, and it responds more sensitively to disturbed areas along the rivers and roads in the central zone, which may be attributed to its strength in nonlinear modeling of small samples. Zhang et al. [100] indicated that minor differences in model predictions are primarily attributed to the relatively small sample size. In this study, the GA-CatBoost and GA-RF models yielded similar prediction results, both showing high concentration and limited fluctuation in predicted values, with a tendency to underestimate high-value AGB regions (Figure 4).
Under limited sample conditions, CatBoost and RF, even with GA optimization, failed to fully exploit their nonlinear modeling advantages, resulting in relatively low estimation accuracy. In contrast, linear models more effectively captured the strong correlations between multi-source remote sensing variables and AGB, thus achieving higher prediction accuracy. Except for GA-SVR, which remained relatively stable, both CatBoost and RF appeared to rely on larger sample sizes to identify complex nonlinear relationships, leading to smoother AGB spatial estimates with reduced variability. This phenomenon is presumed to be related to the conservative treatment of outliers within their internal mechanisms, which results in insufficient responses to higher AGB values [101].

To explore why the models responded insufficiently to high AGB values, we further compared the variable importance rankings of the five GA-optimized models under different feature selection strategies. The dual-variable strategy not only improved model accuracy but also enhanced the consistency and stability of feature selection. Although the number and type of retained predictors varied among models, five key variables (GD2Brg_aN, S2_MTCI, S2_X3B2Cor, L8_X3B2Cor, and A2_BZ3) were consistently preserved across all optimal subsets. This highlights the pivotal role of LiDAR waveform energy, microwave polarization metrics, and optical texture and vegetation indices in AGB estimation [102]. This cross-model consistency supports the robustness of the SHAP dual-variable screening strategy and enhances the interpretability and generalizability of the variable ranking results across multiple models. In addition, the differences in how the models respond to the key remote sensing variables indicate that future model selection should be matched to regional topographic features and practical application requirements.

Therefore, on the basis of stable and interpretable variable sets, the GA-Lasso (R² = 0.91) and GA-SVR (R² = 0.74) models constructed in this study not only achieved high-precision estimation of forest AGB but also indirectly improved the ability to analyze how the variables shape the spatial differentiation of AGB (see Figure 8). GA-Lasso applies L1 regularization with the penalty λ tuned by GA global search, which compresses the feature dimensions while retaining the variables closely related to vegetation growth [87], while GA-SVR uses an RBF kernel to capture the nonlinear relationship between optical, microwave, and LiDAR features and AGB [73]. The GA-optimized penalty coefficients and kernel function parameters improve the adaptability of these two models to areas of complex topography and canopy heterogeneity and reduce the uncertainty in estimating the AGB of Pinus kesiya var. langbianensis forests. The two models provide an effective solution for regional-scale carbon stock assessment, forest spatial structure identification, and fine management of heterogeneous woodlands.
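
The idea of a GA global search over a penalty parameter can be sketched as follows. This is a minimal real-coded genetic algorithm (tournament selection, blend crossover, Gaussian mutation, elitism) that searches log10(λ); the population size, operators, bounds, and especially the toy objective `toy_validation_error` are assumptions made for illustration, since the GA configuration used in the study is not given in this section. In a real workflow the objective would be a cross-validated error of the model fitted with the candidate λ.

```python
import random

def genetic_search(fitness, low, high, pop_size=30, generations=60,
                   mutation_rate=0.3, mutation_sigma=0.3, seed=0):
    """Minimize `fitness` over [low, high] with a simple real-coded GA."""
    rng = random.Random(seed)
    population = [rng.uniform(low, high) for _ in range(pop_size)]

    def tournament():
        # Pick the fitter of two randomly drawn individuals.
        a, b = rng.choice(population), rng.choice(population)
        return a if fitness(a) < fitness(b) else b

    for _ in range(generations):
        elite = min(population, key=fitness)  # elitism: best survives unchanged
        children = [elite]
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            w = rng.random()
            child = w * p1 + (1.0 - w) * p2  # blend crossover
            if rng.random() < mutation_rate:
                child += rng.gauss(0.0, mutation_sigma)  # Gaussian mutation
            children.append(min(max(child, low), high))  # clip to bounds
        population = children
    return min(population, key=fitness)

def toy_validation_error(log10_lambda):
    """Hypothetical stand-in for cross-validated RMSE; minimum at lambda = 0.01."""
    return (log10_lambda + 2.0) ** 2 + 12.0

best_log10_lambda = genetic_search(toy_validation_error, low=-4.0, high=1.0)
best_lambda = 10.0 ** best_log10_lambda
print(best_lambda)  # expected to land close to 0.01 for this toy objective
```

Searching on a log scale is a design choice that suits penalty parameters, whose useful values often span several orders of magnitude; for SVR the same loop would run over a vector of parameters (for example the penalty coefficient and the kernel width) instead of a single number.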

4.4. Limitations and Future Perspectives

The accuracy of remote sensing estimation is constrained by spatial resolution, variable selection, data fusion, and modeling, with pronounced uncertainties in areas of complex topography and diverse vegetation [103]. Although this study improved prediction accuracy through the integration of multi-source data, variable screening, and ensemble modeling, its applicability at larger spatial scales and across diverse climatic and forest conditions remains restricted due to the limited geographic scope and sample representativeness [104]. Also, the limited sampling of broadleaf forests alongside coniferous forests hinders validation of the method under more complex vegetation conditions. Data processing capability and model training efficiency will also be limiting factors in large-scale applications.

Uncertainty in forest AGB estimation mainly comes from remote sensing data noise, model assumption bias, and the under-representation of ground samples. Regression-based models are prone to systematic errors if their assumptions do not match the actual forest characteristics. In addition, methodological differences in the pre-processing of remotely sensed images affect the reliability of the estimation results. In this study, the top 20 variables ranked by SHAP value were selected for modeling. However, the study has not yet assessed how well SHAP-selected variables transfer to other models, nor has it systematically explored the relationship between the number of variables and model stability. Future research could further analyze the sensitivity of the model to the number of variables in order to optimize the modeling strategy and improve generalization. To further reduce errors, priority should be given to high-resolution imagery (e.g., from unmanned aerial vehicles or next-generation satellites) to improve estimation accuracy at forest edges and in dense stands [105]. Building more complex model structures is another way to improve accuracy. Current models are often simplified, making it difficult to fully reflect the complexity of forest ecosystems. The potential of ensemble algorithms and deep learning methods (e.g., CNN, Transformer) in remote sensing data processing deserves further exploration.

In summary, topographic factors, climate change, model selection, and data collection accuracy are the key factors affecting the accuracy of AGB estimation. Although remote sensing technology offers broad spatial coverage and efficiency, it still needs continuous improvement in accuracy and uncertainty control. Multi-source data fusion combined with advanced algorithms, such as SHAP and GA-based parameter optimization, offers an effective approach to enhance estimation accuracy, model robustness, and large-scale applicability.

5. Conclusions

This study constructed a forest AGB estimation framework that combines SHAP interpretable machine learning with genetic algorithm (GA) optimization. By integrating seven multi-source remote sensing datasets and constructing five GA-optimized models, we systematically assessed the effect of nine variable selection methods on the accuracy of AGB estimation for the Pinus kesiya var. langbianensis forest in Wuyi Village. The results demonstrated that the SHAP-based dual-variable selection method, combined with the multi-model framework, can effectively screen out the key factors, significantly reducing redundant information and improving computational efficiency. GA-based parameter optimization improved the accuracy, stability, and generalization ability of AGB estimation: the highest R² (0.91) was obtained by the linear GA-Lasso model combined with the SHAP variable selection method, and the best nonlinear result (R² = 0.74) by the GA-SVR model combined with the Lasso variable selection method. Overall, the SHAP interpretable method can optimize variable selection and improve the accuracy and computational efficiency of AGB estimation, while GA parameter optimization automates the tuning of model hyperparameters, providing an efficient and reliable modeling scheme for AGB monitoring of a typical Pinus kesiya var. langbianensis forest. The results of this study provide a reference for improving AGB monitoring in different forest areas and are expected to expand the scope of application of these methods.

Author Contributions

D.C.: writing—review and editing, writing—original draft, visualization, validation, software, resources, methodology, investigation, formal analysis, data curation, and conceptualization. H.L.: supervision, software, investigation, data organization, and conceptualization. Z.L.: supervision, formal analysis, data processing, and conceptualization. C.L.: supervision, software, methodology, investigation, and data management. Y.W.: supervision, methodology, formal analysis, and conceptualization. E.W.: software, methodology, conceptualization, and data organization. J.P.: software, methodology, and data organization. L.W.: supervision, formal analysis, and data organization. W.W.: writing—review and editing, validation, supervision, software, project management, methodology, investigation, formal analysis, data organization, and conceptualization. G.O.: writing—review and editing, writing—original draft, visualization, validation, supervision, software, resources, project management, methodology, investigation, funding acquisition, formal analysis, and data organization. All authors have read and agreed to the published version of the manuscript.

Funding

This research was jointly supported by the Science and Technology Program of Yunnan Provincial Science and Technology Department, China (No. 202403AC100039), and the Education Talent of Xingdian Talent Support Program of Yunnan Province, China (No. XDYC-JYRC-2023-0083).

Data Availability Statement

The datasets generated or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Antonarakis, A. Linking carbon and water cycles with forests. Geography 2018, 103, 4–11. [Google Scholar] [CrossRef]
Fujii, H.; Sato, M.; Managi, S. Decomposition Analysis of Forest Ecosystem Services Values. Sustainability 2017, 9, 687. [Google Scholar] [CrossRef]
Zhang, Y.; Liang, S. Fusion of Multiple Gridded Biomass Datasets for Generating a Global Forest Aboveground Biomass Map. Remote Sens. 2020, 12, 2559. [Google Scholar] [CrossRef]
Yang, C.; Sun, W.; Zhu, J.; Ji, C.; Feng, Y.; Ma, S.; Shi, Y.; Guo, Z.; Fang, J. Updated estimation of forest biomass carbon pools in China, 1977–2018. Biogeosci. Discuss. 2022, 19, 2989–2999. [Google Scholar] [CrossRef]
Yang, H.; Qin, Z.; Shu, Q.; Xu, L.; Yu, J.; Luo, S.; Wu, Z.; Xia, C.; Yang, Z. Estimation of Above-ground Biomass for Dendrocalamus giganteus Utilizing Spaceborne LiDAR GEDI Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 5271–5286. [Google Scholar] [CrossRef]
Ge, J.; Hou, M.; Liang, T.; Feng, Q.; Meng, X.; Liu, J.; Bao, X.; Gao, H. Spatiotemporal dynamics of grassland aboveground biomass and its driving factors in North China over the past 20 years. Sci. Total Environ. 2022, 826, 154226. [Google Scholar] [CrossRef]
Zhang, R.; Zhou, X.; Ouyang, Z.; Avitabile, V.; Qi, J.; Chen, J.; Giannico, V. Estimating aboveground biomass in subtropical forests of China by integrating multisource remote sensing and ground data. Remote Sens. Environ. 2019, 232, 111341. [Google Scholar] [CrossRef]
Lu, D. The potential and challenge of remote sensing-based biomass estimation. Int. J. Remote Sens. 2006, 27, 1297–1328. [Google Scholar] [CrossRef]
Pertille, C.T.; Nicoletti, M.F.; Topanotti, L.R.; Stepka, T.F. Biomass quantification of Pinus taeda L. from remote optical sensor data. Adv. For. Sci. 2019, 6, 603–610. [Google Scholar] [CrossRef]
Madundo, S.D.; Mauya, E.W.; Kilawe, C.J. Comparison of multi-source remote sensing data for estimating and mapping above-ground biomass in the West Usambara tropical montane forests. Sci. Afr. 2023, 21, e01763. [Google Scholar] [CrossRef]
Quang, N.H.; Quinn, C.H.; Carrie, R.; Stringer, L.C.; Van Hue, L.T.; Hackney, C.R.; Van Tan, D. Comparisons of regression and machine learning methods for estimating mangrove above-ground biomass using multiple remote sensing data in the red River Estuaries of Vietnam. Remote Sens. Appl. Soc. Environ. 2022, 26, 100725. [Google Scholar] [CrossRef]
Kašpar, V.; Hederová, L.; Macek, M.; Müllerová, J.; Prošek, J.; Surový, P.; Wild, J.; Kopecký, M. Temperature buffering in temperate forests: Comparing microclimate models based on ground measurements with active and passive remote sensing. Remote Sens. Environ. 2021, 263, 112522. [Google Scholar] [CrossRef]
Lu, D.; Chen, Q.; Wang, G.; Liu, L.; Li, G.; Moran, E. A survey of remote sensing-based aboveground biomass estimation methods in forest ecosystems. Int. J. Digit. Earth 2016, 9, 63–105. [Google Scholar] [CrossRef]
Sun, X.; Li, G.; Wang, M.; Fan, Z. Analyzing the uncertainty of estimating forest aboveground biomass using optical imagery and spaceborne LiDAR. Remote Sens. 2019, 11, 722. [Google Scholar] [CrossRef]
Fremout, T.; Vinatea, J.C.-D.; Thomas, E.; Huaman-Zambrano, W.; Salazar-Villegas, M.; La Fuente, D.L.-D.; Bernardino, P.N.; Atkinson, R.; Csaplovics, E.; Muys, B. Site-specific scaling of remote sensing-based estimates of woody cover and aboveground biomass for mapping long-term tropical dry forest degradation status. Remote Sens. Environ. 2022, 276, 113040. [Google Scholar] [CrossRef]
Santoro, M.; Cartus, O. Research Pathways of Forest Above-Ground Biomass Estimation Based on SAR Backscatter and Interferometric SAR Observations. Remote Sens. 2018, 10, 608. [Google Scholar] [CrossRef]
Sinha, S.; Mohan, S.; Das, A.; Sharma, L.; Jeganathan, C.; Santra, A.; Santra Mitra, S.; Nathawat, M. Multi-sensor approach integrating optical and multi-frequency synthetic aperture radar for carbon stock estimation over a tropical deciduous forest in India. Carbon Manag. 2020, 11, 39–55. [Google Scholar] [CrossRef]
Liu, X.; Neigh, C.S.; Pardini, M.; Forkel, M. Estimating forest height and above-ground biomass in tropical forests using P-band TomoSAR and GEDI observations. Int. J. Remote Sens. 2024, 45, 3129–3148. [Google Scholar] [CrossRef]
Imhoff, M. A theoretical analysis of the effect of forest structure on synthetic aperture radar backscatter and the remote sensing of biomass. IEEE Trans. Geosci. Remote Sens. 1995, 33, 341–351. [Google Scholar] [CrossRef]
Zhang, B.; Zhang, L.; Yan, M.; Zuo, J.; Dong, Y.; Chen, B. High-resolution mapping of forest parameters in tropical rainforests through AutoML integration of GEDI with Sentinel-1/2, Landsat 8 and ALOS-2 data. Sci. Remote Sens. 2025, 18, 9084–9118. [Google Scholar] [CrossRef]
May, P.B.; Schlund, M.; Armston, J.; Kotowska, M.M.; Brambach, F.; Wenzel, A.; Erasmi, S. Mapping aboveground biomass in Indonesian lowland forests using GEDI and hierarchical models. Remote Sens. Environ. 2024, 313, 114384. [Google Scholar] [CrossRef]
Benson, M.L.; Pierce, L.; Bergen, K.; Sarabandi, K. Model-based estimation of forest canopy height and biomass in the Canadian Boreal forest using radar, LiDAR, and optical remote sensing. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4635–4653. [Google Scholar] [CrossRef]
Zhu, X.; Cai, F.; Tian, J.; Williams, T.K.-A. Spatiotemporal Fusion of Multisource Remote Sensing Data: Literature Survey, Taxonomy, Principles, Applications, and Future Directions. Remote Sens. 2018, 10, 527. [Google Scholar] [CrossRef]
Zhou, J.; Zan, M.; Zhai, L.; Yang, S.; Xue, C.; Li, R.; Wang, X. Remote sensing estimation of aboveground biomass of different forest types in Xinjiang based on machine learning. Sci. Rep. 2025, 15, 6187. [Google Scholar] [CrossRef]
Yan, X.; Li, J.; Smith, A.R.; Yang, D.; Ma, T.; Su, Y.; Shao, J. Evaluation of machine learning methods and multi-source remote sensing data combinations to construct forest above-ground biomass models. Int. J. Digit. Earth 2023, 16, 4471–4491. [Google Scholar] [CrossRef]
Wei, H.-L.; Billings, S.A.; Liu, J. Term and variable selection for non-linear system identification. Int. J. Control 2004, 77, 86–110. [Google Scholar] [CrossRef]
Van den Broeck, G.; Lykov, A.; Schleich, M.; Suciu, D. On the tractability of SHAP explanations. J. Artif. Intell. Res. 2022, 74, 851–886. [Google Scholar] [CrossRef]
Fumagalli, F.; Muschalik, M.; Kolpaczki, P.; Hüllermeier, E.; Hammer, B. SHAP-IQ: Unified approximation of any-order shapley interactions. NeurIPS 2023, 36, 11515–11551. [Google Scholar] [CrossRef]
Santos, M.R.; Guedes, A.; Sanchez-Gendriz, I. SHapley Additive exPlanations (SHAP) for Efficient Feature Selection in Rolling Bearing Fault Diagnosis. Mach. Learn. Knowl. Extr. 2024, 6, 316–341. [Google Scholar] [CrossRef]
Ashraf, I.; Bifarin, O.O. Interpretable machine learning with tree-based shapley additive explanations: Application to metabolomics datasets for binary classification. PLoS ONE 2023, 18, e0284315. [Google Scholar] [CrossRef]
Pezoa, R.; Salinas, L.; Torres, C. Explainability of High Energy Physics events classification using SHAP. J. Phys. Conf. Ser. 2023, 2438, 012082. [Google Scholar] [CrossRef]
Ekanayake, I.; Meddage, D.; Rathnayake, U. A novel approach to explain the black-box nature of machine learning in compressive strength predictions of concrete using Shapley additive explanations (SHAP). Case Stud. Constr. Mater. 2022, 16, e01059. [Google Scholar] [CrossRef]
Li, X.; Du, H.; Mao, F.; Xu, Y.; Huang, Z.; Xuan, J.; Zhou, Y.; Hu, M. Estimation aboveground biomass in subtropical bamboo forests based on an interpretable machine learning framework. Environ. Model. Softw. 2024, 178, 106071. [Google Scholar] [CrossRef]
Molisse, G.; Emin, D.; Costa, H. Implementation of a Sentinel-2 Based Exploratory Workflow for the Estimation of Above Ground Biomass. In Proceedings of the 2022 IEEE Mediterranean and Middle-East Geoscience and Remote Sensing Symposium (M2GARSS), Istanbul, Turkey, 7–9 March 2022; pp. 74–77. [Google Scholar] [CrossRef]
Huang, W.; Li, W.; Xu, J.; Ma, X.; Li, C.; Liu, C. Hyperspectral monitoring driven by machine learning methods for grassland above-ground biomass. Remote Sens. 2022, 14, 2086. [Google Scholar] [CrossRef]
Ma, S.; Tourani, R. Predictive and causal implications of using shapley value for model interpretation. In Proceedings of the 2020 KDD Workshop Causal Discovery, PMLR, San Diego, CA, USA, 24 August 2020; Volume 127, pp. 23–38. Available online: https://proceedings.mlr.press/v127/ma20a (accessed on 18 May 2025).
Aas, K.; Nagler, T.; Jullum, M.; Løland, A. Explaining predictive models using Shapley values and non-parametric vine copulas. Depend. Model. 2021, 9, 62–81. [Google Scholar] [CrossRef]
Sriram, N. Decomposing the Pearson Correlation. SSRN Electron. J. 2006, 2213946. [Google Scholar] [CrossRef]
Kim, J.; Kim, Y.; Kim, Y. A gradient-based optimization algorithm for lasso. J. Comput. Graph. Stat. 2008, 17, 994–1009. [Google Scholar] [CrossRef]
Al Jawarneh, A.S.; Ismail, M.T.; Awajan, A.M. Elastic net regression and empirical mode decomposition for enhancing the accuracy of the model selection. Int. J. Math. Eng. Manag. Sci. 2021, 6, 564. [Google Scholar] [CrossRef]
Algamal, Z.Y.; Lee, M.H. Penalized logistic regression with the adaptive LASSO for gene selection in high-dimensional cancer classification. Expert Syst. Appl. 2015, 42, 9326–9332. [Google Scholar] [CrossRef]
Freeman, E.; Moisen, G.; Coulston, J.; Wilson, B. Random forests and stochastic gradient boosting for predicting tree canopy cover: Comparing tuning processes and model performance. Can. J. For. Res. 2014, 16, 408. [Google Scholar] [CrossRef]
Wang, E.; Huang, T.; Liu, Z.; Bao, L.; Guo, B.; Yu, Z.; Feng, Z.; Luo, H.; Ou, G. Improving Forest Above-Ground Biomass Estimation Accuracy Using Multi-Source Remote Sensing and Optimized Least Absolute Shrinkage and Selection Operator Variable Selection Method. Remote Sens. 2024, 16, 4497. [Google Scholar] [CrossRef]
Schratz, P.; Muenchow, J.; Iturritxa, E.; Richter, J.; Brenning, A. Hyperparameter tuning and performance assessment of statistical and machine-learning algorithms using spatial data. Ecol. Model. 2019, 406, 109–120. [Google Scholar] [CrossRef]
Mirjalili, S. Genetic algorithm. Evol. Algorithms Neural Netw. Theory Appl. 2019, 780, 43–55. [Google Scholar] [CrossRef]
Ji, Y.; Xu, K.; Zeng, P.; Zhang, W. GA-SVR algorithm for improving forest above ground biomass estimation using SAR data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6585–6595. [Google Scholar] [CrossRef]
Mabdeh, A.N.; Al-Fugara, A.K.; Khedher, K.M.; Mabdeh, M.; Al-Shabeeb, A.R.; Al-Adamat, R. Forest fire susceptibility assessment and mapping using support vector regression and adaptive neuro-fuzzy inference system-based evolutionary algorithms. Sustainability 2022, 14, 9446. [Google Scholar] [CrossRef]
Liu, Z.; Zhang, X.; Wu, Y.; Xu, Y.; Cao, Z.; Yu, Z.; Feng, Z.; Luo, H.; Lu, C.; Wang, W.; et al. LiDAR-based individual tree AGB modeling of Pinus kesiya var. langbianensis by incorporating spatial structure. Ecol. Indic. 2024, 169, 112973. [Google Scholar] [CrossRef]
Ou, G.L.; Hui, X.; Wang, J.-F.; Xiao, Y.-F.; Ke Yi, C. Building mixed effect models of stand biomass for Simao pine (Pinus kesiya var. langbianensis) natural forest. J. Beijing For. Univ. 2015, 37, 101–110. [Google Scholar] [CrossRef]
Wang, D.; Yang, L.; Shi, C.; Li, S.; Tang, H.; He, C.; Cai, N.; Duan, A.; Gong, H. QTL mapping for growth-related traits by constructing the first genetic linkage map in Simao pine. BMC Plant Biol. 2022, 22, 48. [Google Scholar] [CrossRef]
Xu, H.; Zhang, Z.; Ou, G.; Shi, H. A Study on Estimation and Distribution for Forest Biomass and Carbon Storage in Yunnan Province; Yunnan Science and Technology Press: Kunming, China, 2019. [Google Scholar]
Lu, C. Multiscale Forest Biomass Sampling Estimates Integrating Sky-Ground Data. Doctoral Dissertation, Southwest Forestry University, Kunming, China, 2024. [Google Scholar]
Chen, Z.; Sun, Z.; Zhang, H.; Zhang, H.; Qiu, H. Aboveground Forest Biomass Estimation Using Tent Mapping Atom Search Optimized Backpropagation Neural Network with Landsat 8 and Sentinel-1A Data. Remote Sens. 2023, 15, 5653. [Google Scholar] [CrossRef]
Li, H.; Li, X.; Kato, T.; Hayashi, M.; Fu, J.; Hiroshima, T. Accuracy assessment of GEDI terrain elevation, canopy height, and aboveground biomass density estimates in Japanese artificial forests. Sci. Remote Sens. 2024, 10, 100144. [Google Scholar] [CrossRef]
Xu, L.; Shu, Q.; Fu, H.; Zhou, W.; Luo, S.; Gao, Y.; Yu, J.; Guo, C.; Yang, Z.; Xiao, J.; et al. Estimation of Quercus Biomass in Shangri-La Based on GEDI Spaceborne Lidar Data. Forests 2023, 14, 876. [Google Scholar] [CrossRef]
Liu, X.; Su, Y.; Hu, T.; Yang, Q.; Liu, B.; Deng, Y.; Tang, H.; Tang, Z.; Fang, J.; Guo, Q. Neural network guided interpolation for mapping canopy height of China’s forests by integrating GEDI and ICESat-2 data. Remote Sens. Environ. 2022, 269, 112844. [Google Scholar] [CrossRef]
Usami, S.; Ishimaru, S.; Tadono, T. Advantages of High-Temporal L-Band SAR Observations for Estimating Active Landslide Dynamics: A Case Study of the Kounai Landslide in Sobetsu Town, Hokkaido, Japan. Remote Sens. 2024, 16, 2687. [Google Scholar] [CrossRef]
Rula, S.; Yonghui, N.; Wenyi, F. Combining Multi-Dimensional SAR Parameters to Improve RVoG Model for Coniferous Forest Height Inversion Using ALOS-2 Data. Remote Sens. 2023, 15, 1272. [Google Scholar] [CrossRef]
Ariel, S.S.; Carlos, L.; Jacqueline, J.M. Assessment of L-Band SAOCOM InSAR Coherence and Its Comparison with C-Band: A Case Study over Managed Forests in Argentina. Remote Sens. 2022, 14, 5652. [Google Scholar] [CrossRef]
Brunelli, B.; Mancini, F. Comparative analysis of SAOCOM and Sentinel-1 data for surface soil moisture retrieval using a change detection method in a semiarid region (Douro River’s basin, Spain). Int. J. Appl. Earth Obs. Geoinf. 2024, 129, 103874. [Google Scholar] [CrossRef]
Peng, S.; Ding, Y.; Liu, W.; Li, Z. 1 km monthly temperature and precipitation dataset for China from 1901 to 2017. Earth Syst. Sci. Data 2019, 11, 1931–1946. [Google Scholar] [CrossRef]
Lillesand, T.; Kiefer, R.W.; Chipman, J. Remote Sensing and Image Interpretation; John Wiley & Sons: Hoboken, NJ, USA, 2015. [Google Scholar]
Huang, T.; Ou, G.; Wu, Y.; Zhang, X.; Liu, Z.; Xu, H.; Xu, X.; Wang, Z.; Xu, C. Estimating the Aboveground Biomass of Various Forest Types with High Heterogeneity at the Provincial Scale Based on Multi-Source Data. Remote Sens. 2023, 15, 3550. [Google Scholar] [CrossRef]
Huang, T.; Ou, G.; Xu, H.; Zhang, X.; Wu, Y.; Liu, Z.; Zou, F.; Zhang, C.; Xu, C. Comparing Algorithms for Estimation of Aboveground Biomass in Pinus yunnanensis. Forests 2023, 14, 1742. [Google Scholar] [CrossRef]
Rahadian, H.; Bandong, S.; Widyotriatmo, A.; Joelianto, E. Image encoding selection based on Pearson correlation coefficient for time series anomaly detection. Alex. Eng. J. 2023, 82, 304–322. [Google Scholar] [CrossRef]
Torre-Tojal, L.; Bastarrika, A.; Boyano, A.; Lopez Guede, J.M.; Grana, M. Above-ground biomass estimation from LiDAR data using random forest algorithms. J. Comput. Sci. 2022, 58, 101517. [Google Scholar] [CrossRef]
Ranstam, J.; Cook, J. LASSO regression. Br. J. Surg. 2018, 105, 1348. [Google Scholar] [CrossRef]
Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 2005, 67, 301–320. [Google Scholar] [CrossRef]
Li, Z. Extracting spatial effects from machine learning model using local interpretation method: An example of SHAP and XGBoost. Comput. Environ. Urban Syst. 2022, 96, 101845. [Google Scholar] [CrossRef]
Chen, H.; Lundberg, S.M.; Lee, S.-I. Explaining a series of models by propagating Shapley values. Nat. Commun. 2022, 13, 4512. [Google Scholar] [CrossRef]
Chen, H.; Covert, I.C.; Lundberg, S.M.; Lee, S.-I. Algorithms to estimate Shapley value feature attributions. Nat. Mach. Intell. 2023, 5, 590–601. [Google Scholar] [CrossRef]
Marcílio, W.E.; Eler, D.M. From explanations to feature selection: Assessing SHAP values as feature selection mechanism. In Proceedings of the 2020 33rd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Porto de Galinhas, Brazil, 7–10 November 2020; pp. 340–347. [Google Scholar] [CrossRef]
Huang, Q.; Mao, J.; Liu, Y. An improved grid search algorithm of SVR parameters optimization. In Proceedings of the 2012 IEEE 14th International Conference on Communication Technology, Chengdu, China, 9–11 November 2012; pp. 1022–1026. [Google Scholar] [CrossRef]
Ming, D.; Zhou, T.; Wang, M.; Tan, T. Land cover classification using random forest with genetic algorithm-based parameter optimization. J. Appl. Remote Sens. 2016, 10, 035021. [Google Scholar] [CrossRef]
Luo, M.; Wang, Y.; Xie, Y.; Zhou, L.; Qiao, J.; Qiu, S.; Sun, Y. Combination of feature selection and CatBoost for prediction: The first application to the estimation of aboveground biomass. Forests 2021, 12, 216. [Google Scholar] [CrossRef]
Fan, Z.; Bai, K.; Zheng, X. Hybrid GA and Improved CNN algorithm for power plant transformer condition monitoring model. IEEE Access 2023, 12, 60255–60263. [Google Scholar] [CrossRef]
Bergman, E.; Purucker, L.; Hutter, F. Don’t Waste Your Time: Early Stopping Cross-Validation; Cornell University: Ithaca, NY, USA, 2024. [Google Scholar] [CrossRef]
Miles, J. R-squared, adjusted R-squared. Encycl. Stat. Behav. Sci. 2005, 1, 421–423. [Google Scholar] [CrossRef]
Harwell, M. A strategy for using bias and RMSE as outcomes in Monte Carlo studies in statistics. J. Mod. Appl. Stat. Methods 2019, 17, 5. [Google Scholar] [CrossRef]
Shanmugavalli, M.; Ignatia, K.M.J. Comparative Study among MAPE, RMSE and R Square over the Treatment Techniques Undergone for PCOS Influenced Women. Recent Pat. Eng. 2025, 19, E041223224190. [Google Scholar] [CrossRef]
Wang, L.; Ju, Y.; Ji, Y.; Marino, A.; Zhang, W.; Jing, Q. Estimation of Forest Above-Ground Biomass in the Study Area of Greater Khingan Ecological Station with Integration of Airborne LiDAR, Landsat 8 OLI, and Hyperspectral Remote Sensing Data. Forests 2024, 15, 1861. [Google Scholar] [CrossRef]
Lucas, R.; Lee, A.; Armston, J.; Breyer, J. Advances in forest characterisation, mapping and monitoring through integration of LiDAR and other remote sensing datasets. In Proceedings of the SilviLaser 2008: 8th International Conference on LiDAR Applications for Assessing Forest Ecosystems, Edinburgh, UK, 17–19 September 2008; Available online: https://www.researchgate.net/publication/47379840 (accessed on 18 May 2025).
Gualdrón, O.; Llobet, E.; Brezmes, J.; Vilanova, X.; Correig, X. Coupling fast variable selection methods to neural network-based classifiers: Application to multisensor systems. Sens. Actuators B Chem. 2006, 114, 522–529. [Google Scholar] [CrossRef]
Li, Y.; Li, M.; Li, C.; Liu, Z. Forest aboveground biomass estimation using Landsat 8 and Sentinel-1A data with machine learning algorithms. Sci. Rep. 2020, 10, 9952. [Google Scholar] [CrossRef]
Ehlers, D.; Wang, C.; Coulston, J.; Zhang, Y.; Pavelsky, T.; Frankenberg, E.; Woodcock, C.; Song, C. Mapping forest aboveground biomass using multisource remotely sensed data. Remote Sens. 2022, 14, 1115. [Google Scholar] [CrossRef]
Su, Y.; Wu, Z.; Zheng, X.; Qiu, Y.; Ma, Z.; Ren, Y.; Bai, Y. Harmonizing remote sensing and ground data for forest aboveground biomass estimation. Ecol. Inform. 2025, 86, 103002. [Google Scholar] [CrossRef]
Sa, R.; Nie, Y.; Chumachenko, S.; Fan, W. Biomass estimation and saturation value determination based on multi-source remote sensing data. Remote Sens. 2024, 16, 2250. [Google Scholar] [CrossRef]
Wang, P.; Tan, S.; Zhang, G.; Wang, S.; Wu, X. Remote Sensing Estimation of Forest Aboveground Biomass Based on Lasso-SVR. Forests 2022, 13, 1597. [Google Scholar] [CrossRef]
Fu, Y.; Tan, H.; Kou, W.; Xu, W.; Wang, H.; Lu, N. Estimation of rubber plantation biomass based on variable optimization from Sentinel-2 remote sensing imagery. Forests 2024, 15, 900. [Google Scholar] [CrossRef]
Adame-Campos, R.L.; Ghilardi, A.; Gao, Y.; Paneque-Gálvez, J.; Mas, J.F. Variables selection for aboveground biomass estimations using satellite data: A comparison between relative importance approach and stepwise Akaike’s information criterion. ISPRS Int. J. Geo-Inf. 2019, 8, 245. [Google Scholar] [CrossRef]
Li, Y.; Li, M.; Wang, Y. Forest aboveground biomass estimation and response to climate change based on remote sensing data. Sustainability 2022, 14, 14222. [Google Scholar] [CrossRef]
Chen, H.Y.; Luo, Y.; Reich, P.B.; Searle, E.B.; Biswas, S.R. Climate change-associated trends in net biomass change are age dependent in western boreal forests of Canada. Ecol. Lett. 2016, 19, 1150–1158. [Google Scholar] [CrossRef] [PubMed]
Tateishi, S.; Matsui, H.; Konishi, S. Nonlinear regression modeling via the lasso-type regularization. J. Stat. Plan. Inference 2010, 140, 1125–1134. [Google Scholar] [CrossRef]
Maesano, M.; Santopuoli, G.; Moresi, F.V.; Matteucci, G.; Lasserre, B.; Mugnozza, G.S. Above ground biomass estimation from UAV high resolution RGB images and LiDAR data in a pine forest in Southern Italy. iForest-Biogeosci. For. 2022, 15, 451. [Google Scholar] [CrossRef]
Wu, Z.; Yao, F.; Zhang, J.; Liu, H. Estimating forest aboveground biomass using a combination of geographical random forest and empirical bayesian kriging models. Remote Sens. 2024, 16, 1859. [Google Scholar] [CrossRef]
Anees, S.A.; Mehmood, K.; Khan, W.R.; Sajjad, M.; Alahmadi, T.A.; Alharbi, S.A.; Luo, M. Integration of machine learning and remote sensing for above ground biomass estimation through Landsat-9 and field data in temperate forests of the Himalayan region. Ecol. Inform. 2024, 82, 102732. [Google Scholar] [CrossRef]
Miguel, A.S.M.; Skutsch, M.; Lovett, J.C. Predicting aboveground forest biomass with topographic variables in human-impacted tropical dry forest landscapes. Ecosphere 2018, 9, e02063. [Google Scholar] [CrossRef]
Ding, L.; Li, Z.; Shen, B.; Wang, X.; Xu, D.; Yan, R.; Yan, Y.; Xin, X.; Xiao, J.; Li, M. Spatial patterns and driving factors of aboveground and belowground biomass over the eastern Eurasian steppe. Sci. Total Environ. 2022, 803, 149700. [Google Scholar] [CrossRef]
Dutta Roy, A.; Debbarma, S. Comparing the allometric model to machine learning algorithms for aboveground biomass estimation in tropical forests. Ecol. Front. 2024, 44, 1069–1078. [Google Scholar] [CrossRef]
Zhang, X.; Shen, H.; Huang, T.; Wu, Y.; Guo, B.; Liu, Z.; Luo, H.; Tang, J.; Zhou, H.; Wang, L.; et al. Improved random forest algorithms for increasing the accuracy of forest aboveground biomass estimation using Sentinel-2 imagery. Ecol. Indic. 2024, 159, 111752. [Google Scholar] [CrossRef]
Tang, J.; Liu, Y.; Li, L.; Liu, Y.; Wu, Y.; Xu, H.; Ou, G. Enhancing aboveground biomass estimation for three Pinus forests in Yunnan, SW China, using Landsat 8. Remote Sens. 2022, 14, 4589. [Google Scholar] [CrossRef]
Luo, P.; Liao, J.; Shen, G. Combining Spectral and Texture Features for Estimating Leaf Area Index and Biomass of Maize Using Sentinel-1/2, and Landsat-8 Data. IEEE Access 2020, 8, 53614–53626. [Google Scholar] [CrossRef]
Li, X.; Zhang, M.; Long, J.; Lin, H. A novel method for estimating spatial distribution of forest above-ground biomass based on multispectral fusion data and ensemble learning algorithm. Remote Sens. 2021, 13, 3910. [Google Scholar] [CrossRef]
Nepal, S.; Kc, M.; Pudasaini, N.; Adhikari, H. Divergent Effects of Topography on Soil Properties and Above-Ground Biomass in Nepal’s Mid-Hill Forests. Resources 2023, 12, 136. [Google Scholar] [CrossRef]
González-Jaramillo, V.; Fries, A.; Bendix, J. AGB estimation in a tropical mountain forest (TMF) by means of RGB and multispectral images using an unmanned aerial vehicle (UAV). Remote Sens. 2019, 11, 1413. [Google Scholar] [CrossRef]

Figure 1. Technical roadmap (the model (SHAP) refers to the combination of EN, Lasso, SVR, RF, and CatBoost base models with SHAP).

Figure 2. The study area and sample plot distribution: (a) the location of Zhenyuan in Yunnan Province; (b) remote sensing image data of Wuyi Village; (c) eight types of remote sensing imagery.

Figure 3. Results of variable selection for multi-source remote sensing data (PC, RF, EN, LAS, SVR, CAT, and SVR (SHAP) refer to individual variable selection methods using Pearson correlation, Random Forest, EN, Lasso, support vector regression, CatBoost, and support vector regression combined with interpretable machine learning, respectively. PC-SVR (SHAP) denotes a dual-variable selection method combining Pearson correlation with support vector regression and interpretable machine learning).

Figure 4. Cross-validation fitting results of nine variable selection methods in five genetic algorithm optimization models (the scatter plots (a1–a10), (b1–b10), (c1–c10), (d1–d10), and (e1–e5) represent the cross-validation fitting results for the five models corresponding to the variable selection methods: PC and PC-SHAP, RF and RF-SHAP, EN and EN-SHAP, Lasso and Lasso-SHAP, and SHAP variable selection).

Figure 5. Number of variables selected by each of the nine variable selection methods, with bar charts of the running time and RMSE of the five corresponding models ((a,b) show the running time and RMSE values of the five models for the number of modeling factors retained by each screening method).

Figure 6. Histograms of the fitted R² for the nine variable selection methods across the five models (dashed lines indicate the average R² for each model).

Figure 7. Spatial distribution of Pinus kesiya var. langbianensis forest AGB estimation using the five machine learning models ((a–e) represent the AGB prediction results of the GA-Lasso, GA-EN, GA-SVR, GA-CAT, and GA-RF models, respectively).

Figure 8. Comparison of the spatial distribution of AGB predictions from the optimal linear and nonlinear models (A–D show the RGB image, the DEM image, and the AGB inversion maps of the GA-LAS and GA-SVR models, respectively; a and b are local magnifications of high-elevation and low-elevation regions, respectively).

Table 1. The statistical parameters of Pinus kesiya var. langbianensis sample plots.

Parameters	Minimum	Mean	Maximum	STD
H (m)	7.60	9.95	13.42	1.12
Dg (cm)	10.19	15.41	20.39	2.33
AGB (Mg/ha)	75.36	147.68	268.82	40.05

Table 2. Remote sensing image information.

Types	Image ID	Sources	Access Time
Landsat 8 OLI	LC08_L1TP_130044_20230407_20230420_02_T1	https://earthexplorer.usgs.gov/	11 May 2024
Sentinel-2A	S2A_MSIL1C_20230310T034551_N0509_R104_T47QPG_20230310T060314.SAFE	https://browser.dataspace.copernicus.eu/	5 June 2024
GEDI L2A	GEDI02_A_2021158031308_O14072_02_T06143_02_003_02_V002 GEDI02_A_2021162014016_O14133_02_T10412_02_003_02_V002 GEDI02_A_2021327165552_O16700_03_T03578_02_003_02_V002 GEDI02_A_2022011125142_O17457_02_T07566_02_003_02_V002 GEDI02_A_2022044083839_O17966_03_T10693_02_003_02_V002 GEDI02_A_2022094124648_O18744_03_T05001_02_003_02_V002 GEDI02_A_2022163093033_O19812_03_T06424_02_003_03_V002	https://search.earthdata.nasa.gov/search	15 April 2024
GEDI L2B	GEDI02_B_2021002014205_O11653_03_T10693_02_003_01_V002 GEDI02_B_2021033131410_O12141_03_T09270_02_003_01_V002 GEDI02_B_2021158031308_O14072_02_T06143_02_003_01_V002 GEDI02_B_2022011125142_O17457_02_T07566_02_003_01_V002 GEDI02_B_2022044083839_O17966_03_T10693_02_003_01_V002		18 April 2024
ICESat-2 ATL08	ATL08_20231112041901_08272107_006_01 ATL08_20231002180214_02102101_006_02 ATL08_20231002180214_02102101_006_01 ATL08_20230813083923_08272007_006_02 ATL08_20230703222251_02102001_006_02 ATL08_20230703222251_02102001_006_01 ATL08_20230404024334_02101901_006_02		18 April 2024
ALOS-2 PALSAR-2	0000519755_001001_ALOS2495483130-230727	https://www.eorc.jaxa.jp	27 April 2023
SAOCOM-L1A	S1A_OPER_SAR_EOSSP__CORE_L1A_OLF_20230828T124022	https://catalog.saocom.conae.gov.ar/catalog/#/	6 September 2023
SRTM DEM	ASTGTMV003_N24E101	https://gscloud.cn	9 July 2023
ERA5-Land	Gridded datasets of annual mean temperature, humidity, and precipitation at a 30 m resolution over China	https://www.ecmwf.int/
AIEC	DAMO_AIE_CHINA_LC_2022_N21E99-Map DAMO_AIE_CHINA_LC_2022_N24E99-Map	https://engine-aiearth.aliyun.com/	4 October 2024

Table 3. Variables for multi-source remote sensing information extraction.

Types	Variables
Landsat 8 OLI	B1, B2, B3, B4, B5, B6, B7, Con, Dis, Mea, Hom, Sm, Ent, Var, Cor, NDVI, ND43, ND67, ND563, DVI, SAVI, RVI, B, G, W, ARVI, MV17, MSAVI, VIS234, ALBEDO, SR, SAV12, MSR, KT1, PC1-A, PC1-B, PC1-P
Sentinel-2A	B2, B3, B4, B5, B6, B7, B8, B8A, B9, B10, B11, B12, Con, Dis, Mea, Hom, Sm, Ent, Var, Cor, RVI, DVI, WDVI, IPVI, PVI, NDVI, NDVI45, GNDVI, IRECI, SAVI, TSAVI, MSAVI, S2REP, REIP, ARVI, PSSRa, MTCI, MCARI
GEDI L2A	Lon, Lat, Elev, TanDEM-X, RH, Sens, Quality_Flag, Degrade_Flag
GEDI L2B	Lon, Lat, Sens, cover, cover_z, Pai, fhd_normal, rv-aN, rg-aN, rx-aN, rh100
ICESat-2 ATL08	Lon, Lat, h_te_best_fit, dem_h, h_canopy, canopy_h_metrics, h_canopy_uncertainty, terrain_slope, night_flag, snr, cloud_flag_atm, classed_pe_flag
ALOS-2 PALSAR-2, SAOCOM L1A	Con, Dis, Mea, Hom, Sm, Ent, Var, Cor, σHH, σVV, σHV, σVH, Backscattering Coefficient, Yamaguchi Decomposition, Sinclair Decomposition, Freeman–Durden Decomposition, Generalized Decomposition, Cloude Decomposition, BMI, CSI, RVI, RFDI, VSI, HHVVR, HHHVR, VVVHR, BZ1-10
SRTM DEM	Elevation, Slope, Aspect
ERA5-Land	Tmean, RH, PREC

Table 4. Optimal fitting accuracy of the nine variable selection methods in the five genetic algorithm optimization models.

Methods	Factors	Model Types	R²	RMSE (Mg/ha)
LAS (SHAP)	S2_X3B4Cor, GD2Brg_aN, SA_X5VHCon, S2_MTCI, S2_X3B9Cor, S2_X3B8V, S2_X3B2Cor, S2_X3B3E, S2_X7B8E, L8_X3B2Cor, A2_BZ3, SA_X3HVs, S2_X3B9S, S2_X7B3Cor, L8_X7B7Con, L8_X7B6M, S2_X7B6M, S2_X3B11V, S2_X5B5S, S2_X5B6Cor	GA-LAS	0.91	12.94
LAS-EN (SHAP)	S2_X3B4Cor, GD2Brg_aN, SA_X5VHcon, S2_X3B9Cor, S2_X3B3E, S2_X3B8V, S2_X7B8E, S2_X3B2Cor, S2_X5B5S, A2_BZ3, S2_MTCI, L8_X3B2Cor, S2_X3B9S, L8_X7B7Con, SA_X3HVS, S2_REIP, L8_X5B1Cor, S2_X7B3Cor, L8_X7B6M, L8_X7B5Con	GA-EN	0.89	15.15
LAS	Elevation, RH, A2_X3HVCor, A2_X3VHCon, A2_X3VHCor, A2_X7HVCor, A2_CdblR, A2_BZ3, A2_VSI, S2_X3B2Cor, S2_X3B3E, S2_X3B4Cor, S2_X3B8V, S2_X3B9Cor, S2_X3B9S, S2_X3B11E, S2_X3B11V, S2_X5B5S, S2_X5B6Cor, S2_X5B8E, S2_X5B7E, S2_X7B3Cor, S2_X7B6M, S2_X7B7E, S2_X7B8E, S2_X7B9Cor, S2_MTCI, S2_REIP, GD2A_Sensitivit, SA_X3HVS, SA_X5VHcon, SA_YamdblR, GD2Brg_aN, GD2Brv_a4, ICE2_RH98, L8_X3B2Cor, L8_X3B4V, L8_X3B6Cor, L8_X5B2S, L8_X5B1Cor, L8_X5B6M, L8_X7B6H, L8_X7B6M, L8_X7B5Con, L8_X7B5Cor, L8_X7B7Con, L8W	GA-SVR	0.74	22.07
LAS-CAT (SHAP)	Elevation, S2_X5B7E, S2A_MTCI, S2_REIP, S2_X3B8V, A2_BZ3, S2_X3B4Cor, SA_YamdblR, S2_X7B3Cor, GD2Brg_aN, S2_X3B2Cor, S2_X7B6M, L8_X3B6Cor, S2_X7B9Cor, A2_X3SVHcor, S2_X5B8E, SA_X3HVS, L8_X3B2Cor, S2_X7B7E, S2_X7B8E	GA-CAT	0.64	25.88
LAS-RF (SHAP)	S2_MTCI, Elevation, S2_X5B7E, S2_REIP, S2_X3B4Cor, S2_X7B6M, S2_X3B2Cor, S2_X7B3Cor, SA_YamdblR, L8_X3B2Cor, L8_X3B6Cor, RH, SA_X5VHCon, SA_X3HVS, A2_BZ3, A2_X3VHCor, S2_X3B8V, S2_X3B9Cor, GD2Brg_aN, L8_X7B5Con	GA-RF	0.52	29.91

Note: L8, S2, A2, SA, and GD2B refer to Landsat-8, Sentinel-2, ALOS-2, SAOCOM, and GEDI L2B, respectively. X3/X5/X7B2Cor denote correlation texture variables calculated using 3 × 3, 5 × 5, and 7 × 7 moving windows.
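
The naming convention in the note above can be illustrated with a small parser. This is only a sketch, not the authors' code: the function `parse_variable` and the `ParsedVariable` type are hypothetical names introduced here, and only the sensor prefixes and the X3/X5/X7 window codes stated in the note are decoded. Everything after those parts (band and texture abbreviations such as `B4Cor`) is returned unchanged, because the note does not define them.

```python
import re
from typing import NamedTuple, Optional

# Sensor prefixes exactly as listed in the note to Table 4.
SENSOR_PREFIXES = {
    "L8": "Landsat-8",
    "S2": "Sentinel-2",
    "A2": "ALOS-2",
    "SA": "SAOCOM",
    "GD2B": "GEDI L2B",
}


class ParsedVariable(NamedTuple):
    """Hypothetical container for the decoded parts of a variable name."""
    sensor: Optional[str]   # full sensor name, or None if no known prefix
    window: Optional[int]   # moving-window size (3, 5 or 7), or None
    remainder: str          # undecoded rest of the name


# Try the longest prefix first so "GD2B" is matched as a whole.
_PREFIX_RE = re.compile(
    r"^(" + "|".join(sorted(SENSOR_PREFIXES, key=len, reverse=True)) + r")_?(.*)$"
)
# X3 / X5 / X7 encode 3 x 3, 5 x 5 and 7 x 7 moving windows.
_WINDOW_RE = re.compile(r"^X([357])(.*)$")


def parse_variable(name: str) -> ParsedVariable:
    """Split a Table 4 variable name into sensor, window size and remainder."""
    sensor, rest = None, name
    prefix_match = _PREFIX_RE.match(name)
    if prefix_match:
        sensor = SENSOR_PREFIXES[prefix_match.group(1)]
        rest = prefix_match.group(2)
    window = None
    window_match = _WINDOW_RE.match(rest)
    if window_match:
        window = int(window_match.group(1))
        rest = window_match.group(2)
    return ParsedVariable(sensor, window, rest)


if __name__ == "__main__":
    for example in ("S2_X3B4Cor", "GD2Brg_aN", "SA_X5VHCon", "Elevation"):
        print(example, "->", parse_variable(example))
```

Names that carry no listed prefix, such as `Elevation` or `RH`, come back with `sensor=None`. Irregular spellings in the table (for example `S2A_MTCI`) are not normalized by this sketch.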

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

