1. Introduction
Forest aboveground biomass (AGB), as the energy foundation and material source for forest ecosystem functioning, is a key indicator for evaluating forest health and carbon storage capacity [1]. Accurate monitoring of forest AGB is crucial for carbon accounting, forest ecosystem management, and understanding global climate change. While long-term and continuous monitoring of AGB is important for evaluating temporal dynamics and supporting carbon neutrality policies, current large-scale monitoring remains challenging due to the limitations of traditional methods.
Traditional AGB estimation methods often rely on field surveys and allometric equations [2]. While these approaches offer high accuracy, they are limited by sparse plot distribution, high data acquisition costs, and poor spatial representativeness, making them insufficient for supporting continuous monitoring at the national scale. Remote sensing, with its advantages of broad spatial coverage, high temporal resolution, and automation, has become the mainstream tool for forest AGB estimation.
Remote sensing-based AGB estimation commonly uses three types of data sources: microwave radar, LiDAR, and optical imagery [3]. Microwave Synthetic Aperture Radar (SAR) can penetrate clouds and portions of vegetation, and its backscattered signals reflect surface geometry, roughness, and dielectric properties, making it suitable for monitoring forest structure [4]. However, SAR data are complex to interpret, more susceptible to speckle noise, and often require sophisticated pre-processing, which limits their use in nationwide AGB estimation. LiDAR can penetrate forest canopies and provide detailed vertical structural information (e.g., canopy height and volume), allowing highly accurate AGB estimation [5]. However, airborne LiDAR data are expensive, while spaceborne LiDAR typically lacks full spatial coverage. Optical remote sensing, highly sensitive to vegetation density [6], offers low acquisition and processing costs, globally available datasets, and long-term consistency. These characteristics make optical imagery particularly suitable for constructing large-scale AGB datasets over extended time periods.
While Vegetation Optical Depth (VOD) derived from passive microwave sensors has been successfully used for national and global AGB estimation, its coarse spatial resolution (typically 0.25°) limits its applicability for medium-to-high-resolution studies [7,8,9]. In contrast, optical imagery provides decades-long continuous observations at medium-to-high spatial resolutions (10–30 m), making it the only feasible data source for constructing high-resolution, long-term AGB datasets across China and similar regions. While SAR data offer continuous monitoring capabilities, their historical record is much shorter, limiting their suitability for building multi-decadal AGB time series.
A major limitation of optical imagery is spectral saturation in high-biomass regions. Spectral saturation occurs when the reflectance signal becomes insensitive to increases in AGB beyond a certain threshold, particularly in dense or structurally complex forests [10]. Although some studies have attempted to mitigate saturation effects by integrating optical data with radar or LiDAR [11,12], optical data remain the foundational input for large-scale, long-term AGB estimation due to their accessibility and historical continuity.
In addition, topographic variables, including elevation, slope, and aspect, have been shown to influence vegetation growth and optical reflectance patterns, affecting AGB estimation accuracy [13,14].
Therefore, under the constraint of relying primarily on optical data, it becomes essential to understand the saturation behavior and model errors in high AGB regions, in order to enhance the reliability and applicability of optical AGB estimation. Existing studies in China have mainly focused on regional scales. For instance, Sa et al. quantified saturation levels of combined variables in Saihanba Forest, Hebei Province, and analyzed how saturation limits AGB estimation [15]. Wu et al. examined optical saturation in 20 regions across Yunnan, exploring the individual, interaction, and combined effects of climate, soil, and topography on saturation [16]. Another study by Wu et al. estimated saturation levels in a county of Heilongjiang Province [17], while Zhao et al. used a spherical model to quantify AGB saturation for different vegetation types in parts of Zhejiang Province [18]. These studies, however, rely on small-scale field data, which limits their ability to generalize nationwide and precludes an assessment of national-level saturation thresholds in high-AGB regions across China.
Machine learning has become a widely adopted approach for forest AGB estimation [3]. In remote sensing-based modeling, field measurements are critical for linking signals to actual biomass, but such measurements are time-consuming and labor-intensive [3,19], which limits their applicability for large-scale AGB estimation and monitoring.
Data from the Global Ecosystem Dynamics Investigation (GEDI) mission have provided a breakthrough. Its Level 4A (L4A) product serves as a high-accuracy reference for global AGB modeling. By using GEDI-derived estimates to complement traditional field measurements, large-scale inversion and monitoring of AGB can be achieved [20,21]. Recently, Cai et al. used GEDI L4A as a reference for mapping forest AGB across China from 1985 to 2023, leveraging long-term optical imagery for national-scale time-series estimation [22]. While this study achieved large-scale inversion, it focused on mapping accuracy and temporal modeling, without detailed investigation of spectral saturation mechanisms and error structures in high-biomass regions.
Addressing this research gap, the present study employs GEDI L4A AGB estimates as modeling labels, combined with Sentinel-2 multispectral imagery and topographic variables, to construct a 200 m resolution forest AGB estimation model at the national scale. Sentinel-2 was selected due to its high spatial resolution (10–20 m), frequent revisit, and rich spectral bands, making it particularly suitable not only for national-scale AGB estimation but also for analyzing error structures and optical saturation mechanisms. This also provides a methodological basis for future studies integrating Landsat historical imagery or other optical datasets for long-term AGB mapping and mechanism analysis.
This study focuses on understanding prediction error mechanisms caused by spectral saturation in high-biomass areas. Unlike previous studies that primarily targeted mapping accuracy, this work presents a national-scale analysis of spectral saturation mechanisms using optical inputs, along with the compensatory role of topography.
The specific innovations of this study include the following:
(1) Quantifying optical saturation thresholds at the national scale, defining the performance boundaries of different spectral bands and indices in medium-to-high AGB regions;
(2) Systematically evaluating the compensatory role of topographic variables in high AGB areas, revealing how they maintain prediction accuracy under spectral information deficiency;
(3) Proposing a transferable mechanism analysis framework that quantitatively reveals changes in feature contributions through grouped error analysis, LOWESS response curves, and SHAP model interpretation, extendable to other countries or global scales;
(4) Producing a nationwide 200 m resolution AGB data product and conducting regional validation using forest volume data, providing support for ecological monitoring and carbon stock estimation.
3. Results
3.1. Overall Model Performance Evaluation
The model achieved R² values of 0.78 and 0.76 on the training and testing datasets, respectively, with comparable RMSE values. The RMSE on the testing set was 47.73 Mg·ha−1, indicating good generalization ability without obvious overfitting. The scatter density plot of predictions versus observations on the test set is shown in Figure 5a. Although the overall accuracy is satisfactory, relatively larger errors exist in high AGB ranges, warranting further stratified analysis. To provide a more intuitive view of the contribution of topographic factors, we also developed a model excluding topographic variables and generated the corresponding scatter density plot (Figure 5b) for direct comparison with the original model including topography. The trend line in Figure 5b shows a noticeably lower slope, indicating that topographic variables improve model accuracy and mitigate saturation effects.
3.2. Stratified Error Analysis by AGB
Following the stratified residual analysis method described in Section 2.6.1, the validation dataset was divided into five AGB intervals and the corresponding RMSE and mean bias were computed. The stratified error metrics are summarized in Table 7, and the RMSE and Bias trends are shown in Figure 6.
Both RMSE and Bias increase with AGB levels, especially in the >400 Mg·ha−1 range, where RMSE reaches 176.74 Mg·ha−1 and mean underestimation exceeds 152 Mg·ha−1. This indicates systematic underestimation in extremely high biomass areas, likely associated with spectral saturation effects. However, due to the relatively sparse sample size in these high biomass regions, the overall impact on model performance is limited. Consequently, the model still maintains high accuracy at the national scale, though future work requires more independent validation data from high biomass areas to further confirm saturation severity and model bias in this range.
In this study, the model’s estimated AGB saturation threshold is approximately 300 Mg·ha−1; beyond this value, predictions show significant bias and decreased reliability.
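The stratified error computation can be illustrated with a minimal sketch. The bin edges below are hypothetical (this section states only that five intervals were used and that the highest is >400 Mg·ha−1), and the sample values are invented toy data rather than results from this study. Bias is taken as the mean of prediction minus reference, so negative values denote underestimation.

```python
import math

# Hypothetical bin edges (Mg/ha); the study's actual five intervals may differ.
BIN_EDGES = [0, 100, 200, 300, 400, math.inf]


def stratified_errors(observed, predicted, edges=BIN_EDGES):
    """Return {(lo, hi): (n, rmse, bias)} with bias = mean(predicted - observed)."""
    results = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Samples are assigned to a bin by their reference (observed) AGB.
        residuals = [p - o for o, p in zip(observed, predicted) if lo <= o < hi]
        if not residuals:
            continue  # skip empty bins rather than divide by zero
        n = len(residuals)
        rmse = math.sqrt(sum(r * r for r in residuals) / n)
        bias = sum(residuals) / n
        results[(lo, hi)] = (n, rmse, bias)
    return results


# Toy data, not from the study: predictions flatten out at high AGB.
obs = [50, 80, 150, 250, 350, 450, 500]
pred = [55, 75, 140, 230, 300, 320, 330]
table = stratified_errors(obs, pred)
for (lo, hi), (n, rmse, bias) in table.items():
    print(f"{lo}-{hi}: n={n}, RMSE={rmse:.1f}, bias={bias:.1f}")
```

In this toy example the bias becomes increasingly negative toward the highest bin, which is the qualitative signature of saturation described above.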
3.3. SHAP Interpretation
To better understand the contribution of different features to AGB prediction, SHAP values were calculated for the trained LightGBM model. Figure 7 shows the top 20 features ranked by their mean absolute SHAP values along with the distribution of their SHAP values. Spatial and topographic variables such as lon_lowestmode, dem_std, and slope_avg exhibited the highest importance in the model, while certain spectral variables (e.g., band_B4_avg, band_B5_avg, NDSI_B2_B4_avg) also demonstrated relatively high overall importance.
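The ranking criterion, mean absolute SHAP value per feature, can be sketched as follows. The SHAP matrix here is an invented toy example (rows are samples, columns are features); in practice it would be produced by a tree explainer applied to the trained model. The helper name rank_by_mean_abs_shap is hypothetical.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names, top_k=20):
    """Rank features by mean absolute SHAP value, largest first."""
    n = len(shap_rows)
    scores = []
    for j, name in enumerate(feature_names):
        mean_abs = sum(abs(row[j]) for row in shap_rows) / n
        scores.append((name, mean_abs))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]


# Toy SHAP matrix; the values are invented for illustration only.
features = ["dem_std", "slope_avg", "band_B4_avg"]
shap_rows = [
    [12.0, -6.0, 3.0],
    [-10.0, 8.0, -1.0],
    [14.0, -4.0, 2.0],
]
ranking = rank_by_mean_abs_shap(shap_rows, features)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Because the absolute value is taken before averaging, a feature that pushes predictions strongly in both directions still ranks high, which is why the sign distribution is usually inspected alongside the ranking.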
3.4. Spectral Saturation Curves
To further investigate the response relationship between spectral features and AGB, the top two original bands, NDSI indices, and topographic variables ranked by SHAP values were selected for LOWESS curve fitting to analyze their saturation trends. These features include “band_B4_avg,” “band_B5_avg,” “NDSI_B2_B4_avg,” “NDSI_B11_B12_avg,” “dem_std,” and “slope_avg.” The resulting curves for these features are shown in Figure 8.
The results show that the two band reflectance features (Figure 8a,d) exhibit a monotonically decreasing trend with increasing AGB, reaching a plateau around 80 Mg·ha−1, reflecting a typical spectral saturation phenomenon. In contrast, the NDSI-type indices have higher saturation thresholds. The curve of NDSI_B2_B4_avg (Figure 8b) begins to plateau at 100–150 Mg·ha−1, while NDSI_B11_B12_avg (Figure 8e) reaches a peak within the 100–150 Mg·ha−1 range, then reverses slope and shows a stable negative trend from 150 to 200 Mg·ha−1, gradually plateauing after 200–250 Mg·ha−1 with a slight declining trend. This suggests that this index exerts a mild adverse effect on predictions in medium-to-high biomass areas.
Compared with spectral variables, the topographic features slope_avg (Figure 8f) and dem_std (Figure 8c) maintained a strong positive relationship with AGB when AGB < 300 Mg·ha−1, but exhibited a slight declining trend when AGB > 300 Mg·ha−1, indicating that they could not capture AGB variations beyond this threshold. This pattern is consistent with the observed decline in model prediction accuracy in high-biomass regions. These results suggest that topographic features play a significant supporting role in model prediction for medium- to high-biomass areas, serving as important complementary variables when spectral information becomes saturated. However, in extremely high-biomass regions (>300 Mg·ha−1), their response trends weaken or even decline, implying that their contribution to prediction accuracy is also limited under conditions of highly homogeneous canopy structure.
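For readers unfamiliar with LOWESS, the following is a minimal single-pass sketch (tricube-weighted local linear fits, without the robustness iterations of the full algorithm). It is an illustration under stated assumptions, not the implementation used in this study: the smoothing fraction and the toy reflectance curve are invented, and AGB is placed on the x-axis as in the curves described above.

```python
import math


def lowess(x, y, frac=0.5):
    """Single-pass LOWESS: tricube-weighted local linear fit at each x value."""
    n = len(x)
    k = max(2, math.ceil(frac * n))  # number of neighbours in each local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = max(dists[k - 1], 1e-12)  # bandwidth: distance to the k-th nearest point
        w = []
        for xi in x:
            u = min(1.0, abs(xi - x0) / h)
            w.append((1.0 - u ** 3) ** 3)  # tricube weight, zero beyond the bandwidth
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))  # evaluate the local line at x0
    return fitted


# Toy curve, not study data: reflectance decays with AGB and then flattens.
agb = [float(v) for v in range(0, 401, 25)]
reflectance = [0.03 + 0.07 * math.exp(-v / 40.0) for v in agb]
smooth = lowess(agb, reflectance, frac=0.4)
for a, s in zip(agb, smooth):
    print(f"AGB={a:5.0f}  smoothed reflectance={s:.4f}")
```

A plateau can then be read off as the AGB value beyond which the smoothed curve changes by less than a chosen tolerance. For real analyses, an established implementation such as the lowess function in statsmodels is preferable to this sketch.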
3.5. Spatial Distribution of Residuals
To identify the spatial patterns of model prediction errors, this study calculated the average residuals (Residual = Prediction − agbd_avg) for each provincial administrative region based on the test set samples. A province-level residual heatmap at the national scale was generated (Figure 9). The colors in the map indicate the magnitude and direction of the average residuals, where blue represents regions with overall underestimation and red indicates regions with overall overestimation.
The results indicate that the model demonstrates good spatial stability across most regions but still exhibits notable regional systematic biases. Specifically, systematic underestimation is observed in areas such as Chongqing, Tianjin, and Inner Mongolia, whereas slight overestimation occurs in eastern and southern provinces such as Jiangsu, Jiangxi, and Hunan.
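The province-level aggregation follows directly from the residual definition above (Residual = Prediction − agbd_avg). A minimal sketch, with invented sample records and a hypothetical helper name, is:

```python
from collections import defaultdict


def mean_residual_by_province(records):
    """records: iterable of (province, predicted, reference) tuples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for province, predicted, reference in records:
        sums[province] += predicted - reference  # negative => underestimation
        counts[province] += 1
    return {p: sums[p] / counts[p] for p in sums}


# Toy records (Mg/ha), invented for illustration only.
samples = [
    ("Chongqing", 110.0, 140.0),
    ("Chongqing", 90.0, 100.0),
    ("Jiangsu", 85.0, 80.0),
    ("Jiangsu", 72.0, 69.0),
]
province_bias = mean_residual_by_province(samples)
for province, bias in province_bias.items():
    print(f"{province}: mean residual = {bias:+.1f} Mg/ha")
```

With this sign convention, a negative provincial mean corresponds to the blue (underestimation) regions of the heatmap and a positive mean to the red (overestimation) regions.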
3.6. Forest AGB Distribution Map in China
Using the Sentinel-2 imagery and terrain data, the 200 m resolution mean and standard deviation images were computed. These were input into the trained LightGBM model, along with calculated NDSI indices, to predict the AGB for each province. The resulting forest AGB distribution map is shown in Figure 10.
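As a rough illustration of how such predictors might be derived, the sketch below computes a normalized difference index from two bands and aggregates it to a cell-level mean and standard deviation. Several points are assumptions rather than details confirmed here: the index is taken as (Bi − Bj)/(Bi + Bj), the index is computed per pixel before aggregation, and the population standard deviation is used. The reflectance values are invented.

```python
import math
import statistics


def ndsi(band_i, band_j):
    """Normalized difference of two reflectances; 0.0 when both are zero."""
    total = band_i + band_j
    return (band_i - band_j) / total if total != 0 else 0.0


def aggregate_cell(values):
    """Mean and population standard deviation of pixel values in one grid cell."""
    return statistics.fmean(values), statistics.pstdev(values)


# Toy reflectances for the pixels falling inside one 200 m cell.
b2 = [0.04, 0.05, 0.06]
b4 = [0.06, 0.05, 0.04]
pixel_ndsi = [ndsi(i, j) for i, j in zip(b2, b4)]
ndsi_avg, ndsi_std = aggregate_cell(pixel_ndsi)
print(f"NDSI mean={ndsi_avg:.3f}, std={ndsi_std:.3f}")
```

Computing the index from cell-averaged bands instead would generally give a slightly different value, so the order of operations should match whatever was used to build the training features.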
Figure 10 presents the predicted spatial distribution of China’s forest aboveground biomass (AGB) for 2022, with units in Mg·ha−1. The average forest AGB across China is 123.90 Mg·ha−1. Overall, the forest AGB exhibits clear spatial heterogeneity, with a pattern of higher values in the southeast and lower values in the northwest.
High-biomass areas (AGB > 200 Mg·ha−1) are primarily concentrated in the mountainous and tropical regions of southwestern China, such as southern Yunnan, southeastern Tibet, Hainan Island, and the western edge of the Sichuan Basin. These areas are dominated by evergreen broadleaf forests or tropical rainforests characterized by warm and humid climates and mature forests with substantial biomass accumulation.
Moderate biomass density regions (100–200 Mg·ha−1) are widely distributed across central and southern China as well as southern Northeast China, including provinces like Jiangxi, Hunan, Zhejiang, and southern Jilin. These regions mainly contain mixed coniferous and broadleaf forests and fast-growing plantations, with relatively intact forest structures and solid resource bases.
Low AGB regions (<75 Mg·ha−1) are mainly found in the arid northwest and the Qinghai–Tibet Plateau interior, including Xinjiang, Gansu, Qinghai, and northwestern Tibet. These areas have harsh ecological conditions, sparse forest vegetation, and low productivity levels.
The distribution map also reveals notable transitional zones with moderate to low biomass (75–100 Mg·ha−1) along the southern edge of the Northeast Plain, the Loess Plateau margins, and the Qinling–Huaihe ecological transition belt, reflecting typical ecological gradient changes.
The overall spatial pattern aligns well with China’s forest ecological zoning and climatic gradients and is consistent with large-scale forest AGB estimates from previous studies, indicating that the model captures ecologically plausible broad-scale patterns of forest biomass at the national level. Quantitative comparisons with previous studies are provided in Section 3.8.
3.7. Spatial Distribution of Prediction Uncertainty
The 95% prediction interval of forest AGB was calculated using the LightGBM-based quantile regression approach described in Section 2.5.4, in which two models were trained to predict the 2.5th and 97.5th percentiles of AGB. The uncertainty was quantified by the difference between the predicted upper and lower quantile values, based on which a national distribution map of forest AGB prediction uncertainty was generated (Figure 11).
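The uncertainty measure is simply the width of the interval between the two quantile predictions. The sketch below illustrates this, together with the pinball (quantile) loss that quantile regression models typically minimize and an empirical coverage check. The function names and all numbers are hypothetical; the quantile predictions would in practice come from the two trained models.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball (quantile) loss for quantile level alpha in (0, 1)."""
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        diff = yt - yp
        total += alpha * diff if diff >= 0 else (alpha - 1.0) * diff
    return total / len(y_true)


def interval_width(lower, upper):
    """Width of the prediction interval; crossed quantiles are clipped to 0."""
    return [max(0.0, u - l) for l, u in zip(lower, upper)]


def coverage(y_true, lower, upper):
    """Fraction of reference values falling inside [lower, upper]."""
    inside = sum(1 for y, l, u in zip(y_true, lower, upper) if l <= y <= u)
    return inside / len(y_true)


# Toy values (Mg/ha), invented for illustration only.
reference = [100.0, 150.0, 200.0]
q025 = [80.0, 120.0, 150.0]   # predicted 2.5th percentiles
q975 = [130.0, 170.0, 260.0]  # predicted 97.5th percentiles
widths = interval_width(q025, q975)
print("widths:", widths)
print("coverage:", coverage(reference, q025, q975))
print("pinball loss (2.5%):", pinball_loss(reference, q025, 0.025))
```

Checking empirical coverage on held-out samples is a simple way to verify that a nominal 95% interval is neither too narrow nor too wide.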
Overall, the uncertainty in most forested areas across China ranges between 3 and 6 Mg·ha−1, with only a few regions showing uncertainties below 3 Mg·ha−1, indicating high overall prediction stability. Spatially, areas with lower uncertainty are mainly concentrated in the eastern plains, central parts of the northeastern forest region, and the middle to lower reaches of the Yangtze River. These regions have dense training samples, high-quality remote sensing imagery, and stable feature-response relationships, resulting in smaller prediction errors.
In contrast, regions with significantly higher uncertainty (>9 Mg·ha−1) are mainly found in three types of areas: (1) Southwest mountainous canyon forests (e.g., western Sichuan, southwestern Yunnan), where complex terrain and severe land cover mixing cause large disturbances in spectral input features and reduce model responsiveness; (2) tropical seasonal rainforest regions (e.g., Hainan Island and Xishuangbanna), where local prediction uncertainty is elevated, likely due to extremely high biomass density and spectral saturation effects; (3) edges of the high-latitude northeastern forest zone, which exhibit clustered prediction uncertainties possibly related to strong forest heterogeneity and large variations in stand structure. Additionally, some areas without extreme terrain may have locally elevated uncertainties due to sparse training samples or fluctuations in remote sensing data quality.
In summary, this uncertainty map provides an intuitive spatial characterization of model confidence. It is recommended that structural remote sensing data sources (such as SAR and LiDAR) or regional sub-models be introduced preferentially in high-uncertainty areas to improve the reliability and accuracy of forest biomass inversion in complex terrain or high-biomass regions.
3.8. Accuracy Validation of Model Predictions
The predicted average forest AGB for China is 123.90 Mg·ha−1, which shows a relative error of 3.25% compared to the national average reported by Su et al. [42] (120 Mg·ha−1), and a relative error of 1.87% compared to the 2022 average AGB reported by Cai et al. [22] (121.62 Mg·ha−1). The spatial distribution patterns are also highly consistent, with high biomass areas (exceeding 300 Mg·ha−1) concentrated in southern Tibet, the Qinling Mountains, parts of Northeast China, and Taiwan, supporting the plausibility of the model predictions.
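The two relative errors quoted above can be reproduced with a short calculation, taking the previously reported national averages as the reference values:

```python
def relative_error_percent(estimate, reference):
    """Absolute relative difference of an estimate from a reference, in percent."""
    return abs(estimate - reference) / reference * 100.0


predicted_mean = 123.90  # Mg/ha, national mean predicted in this study
err_su = relative_error_percent(predicted_mean, 120.0)    # vs. Su et al. [42]
err_cai = relative_error_percent(predicted_mean, 121.62)  # vs. Cai et al. [22]
print(round(err_su, 2))   # 3.25
print(round(err_cai, 2))  # 1.87
```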
The Spearman rank correlation coefficient between the total provincial AGB predicted by the model and the forest stand volume reported in the China Statistical Yearbook is 0.88, indicating strong agreement in the ranking of resource stocks at the provincial level (Figure 12, which also shows the line chart comparing provincial forest AGB and forest stand volume). This close relationship further demonstrates the model’s reliable predictive capability at the national scale and supports the use of GEDI L4A products as a valid reference source for forest AGB inversion in China.
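For reference, the Spearman coefficient is the Pearson correlation of the ranks of the two variables, with tied values receiving their average rank. A minimal sketch with invented provincial totals (not the values behind Figure 12) is:

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Toy provincial totals in arbitrary units, invented for illustration only.
agb_total = [5.0, 3.2, 8.1, 1.0, 6.5]
stand_volume = [4.0, 4.5, 9.0, 0.8, 5.0]
rho = spearman(agb_total, stand_volume)
print(f"Spearman rho = {rho:.2f}")
```

Because only ranks enter the calculation, the coefficient measures agreement in provincial ordering and is insensitive to the differing units of AGB and stand volume.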
In summary, the three-tier validation results indicate that our predictions are accurate and reliable across multiple scales. The national-scale agreement test revealed deviations of at most 3.25% from independent studies, indicating little overall bias in the national mean; the regional-scale spatial pattern comparison supported the model’s reliability across diverse geographic regions; and the provincial-scale stock ranking correlation (ρ = 0.88) further verified the rationality of macro-scale resource distribution. Collectively, these results support the accuracy of our predictions, consistent with findings from existing continental-scale forest biomass studies.
4. Discussion
4.1. Model Performance and Comparison with Existing Methods
The LightGBM model developed in this study achieved a test set performance of R² = 0.76 and RMSE = 47.73 Mg·ha−1, significantly outperforming conventional methods such as linear regression and support vector machines, and showing comparable accuracy to the national-scale AGB estimation results based on random forest by Su et al. (2016). Furthermore, by using GEDI L4A data as training labels, this study mitigates the uneven spatial distribution of traditional forest inventory data, enabling better model generalization at the national scale.
Currently, most AGB inversion studies focus on accuracy improvement by integrating Landsat and Sentinel imagery, combining optical and SAR or optical time-series data, and applying deep learning methods to enhance model performance. While SAR and LiDAR data have proven effective in mitigating saturation effects of optical imagery, optical remote sensing remains indispensable in long-term forest AGB monitoring due to its extensive temporal coverage. Therefore, a thorough understanding of optical data saturation mechanisms and the compensatory role of topographic features is particularly crucial, which constitutes the core analytical focus of this study.
Although some progress has been made in improving accuracy, there is a lack of quantitative analysis and systematic exploration of the error structure related to optical saturation in high biomass regions. Previous studies, constrained by limited field measurements, have mostly focused on regional models [15,16,17,18] for inversion and saturation mechanism analysis, without conducting large-scale systematic assessments. Cai et al. [22] pioneered the use of GEDI data as reference to estimate nationwide forest AGB over long time series, but their study did not provide a detailed error mechanism analysis at the national scale, particularly a systematic quantification of optical saturation.
In contrast, this study constructs an optical-dominant forest AGB inversion model across China at 200 m resolution using GEDI L4A samples without incorporating SAR or LiDAR data. By integrating NDSI indices and topographic variables, it systematically quantifies the saturation mechanisms of optical imagery and the compensatory effect of terrain on optical saturation across the entire country. Through LOWESS response curves and stratified error analysis, multiple typical spectral features’ saturation thresholds were quantified, and key causes of model performance degradation in high AGB areas were identified. This analysis enriches the error interpretation dimension of optical inversion methods and provides theoretical support for subsequent models integrating optical and structural features.
Additionally, residuals were analyzed across biomass groups and provincial scales to examine the spatial distribution of prediction errors and potential overestimation. Although the overall R² is slightly lower than some regional studies, the model developed here demonstrates better generalization and interpretability across diverse terrains and forest types, making it more suitable for national-scale carbon stock remote sensing modeling tasks.
4.2. Spectral Index Saturation Response Mechanism
This section provides a detailed quantitative dissection of the optical saturation effect.
SHAP analysis (Figure 7) revealed that spectral variables, particularly band_B4_avg, band_B5_avg, and NDSI_B2_B4_avg, contributed substantially to the prediction of AGB, underscoring their relevance as key optical predictors. Previous studies have demonstrated that red-edge bands can effectively estimate carbon content in drought-affected forests [43], where carbon is predominantly stored in biomass, especially AGB. NDSIs derived from visible and red-edge bands have also been shown to reliably estimate crop yield and AGB [20,33].
The core of our saturation analysis lies in the stratified error analysis and LOWESS curve fitting. This study identified typical saturation response characteristics of optical indices in high AGB regions. The two primary spectral bands—band_B4_avg and band_B5_avg—entered a plateau around AGB ≈ 80 Mg·ha−1, while NDSI_B2_B4_avg maintained a strong response up to 100–150 Mg·ha−1. In contrast, NDSI_B11_B12_avg peaked at 100–150 Mg·ha−1 and subsequently exhibited a negative slope. These response patterns indicate that spectral signals progressively lose sensitivity in medium-to-high biomass areas, especially beyond 200 Mg·ha−1, where model RMSE increases and bias becomes significantly negative, reflecting that spectral saturation is a major driver of error escalation.
This phenomenon is attributable to saturation and loss of sensitivity of spectral indices under high Leaf Area Index (LAI) conditions. Previous studies have shown that when LAI exceeds 4–6, indices such as NDVI rapidly lose sensitivity, and spectral values no longer reflect true structural differences [44]. Additionally, in dense canopies, multiple scattering of sunlight within the canopy, particularly in the near-infrared and shortwave infrared bands, stabilizes reflectance, limiting the ability to resolve biomass variations [10]. Therefore, although NDSI-type composite indices can enhance discrimination in moderate AGB ranges, their effectiveness degrades severely at high AGB levels, exhibiting non-monotonic responses.
4.3. Compensation Mechanism of Topographic and Spatial Structure Variables
Following the quantification of saturation, we further investigated the compensatory role of topographic variables.
SHAP interpretation also highlighted the strong contributions of topographic variables such as slope_avg and dem_std, indicating their essential role in improving prediction performance by capturing spatial heterogeneity and terrain effects. Additionally, DEM and derived terrain parameters help elucidate the influence of topography on local growth conditions, thereby revealing spatial patterns of biomass distribution [13]. Among topographic variables, slope exerts a notable influence on prediction results, which may be linked to errors in GEDI algorithm estimates in steep terrain areas. In such regions, GEDI might misestimate terrain and tree heights [31], while slope can mitigate the indirect propagation of lidar errors into prediction accuracy.
Together, the SHAP analysis (Figure 7) and LOWESS curves (Figure 8c,f) indicate that topographic features (slope_avg and dem_std) maintain a relatively stable positive contribution when AGB is below 300 Mg·ha−1. They can continuously provide information related to forest structure and site productivity when spectral variables experience saturation, partially compensating for the increased uncertainty caused by the loss of optical information. However, in regions where AGB exceeds 300 Mg·ha−1, both features show declining response trends, suggesting that the compensatory effects of topographic variables also diminish in extremely high-biomass areas, which aligns with and explains the error patterns observed in Section 4.2.
Topographic factors indirectly reflect ecological attributes such as site conditions and habitat heterogeneity, thereby mitigating uncertainty caused by loss of optical information and indirectly influencing forest growth structure and carbon storage [45]. For example, elevation controls several key environmental variables [13]: (1) atmospheric pressure; (2) adiabatic temperature lapse rates; (3) clear-sky radiation; and (4) the proportion of ultraviolet radiation in solar irradiance. These factors influence forest growing season length, accumulated temperature, photosynthetic potential, and nutrient availability [46]. Elevation is a principal driver of temperature-related growth conditions [13], and topographically modulated variables such as potential incoming solar radiation improve biomass modeling by providing fine-scale energy input heterogeneity [47]. Areas with differing slope and aspect can have markedly different subsurface and surface temperatures and plant growth conditions [48].
Thus, topographic information can assist biomass modeling, especially in complex terrain where solar incidence angles and surface undulations increase noise in purely optical indices. Here, topographic variables act as stabilizing compensators. Additionally, GEDI data are known to have waveform distortion issues in areas with slope >25° [31]; incorporating topographic variables helps to mitigate indirect propagation of LiDAR errors in steep terrain during prediction.
Based on the results of this study, although the response of topographic variables weakens in areas with high aboveground biomass (>300 Mg·ha−1), these variables still play a crucial auxiliary role in the biomass inversion model by serving as proxies for environmental heterogeneity and compensating for limitations in optical data.
Nevertheless, in areas with extreme terrain or sparse samples, such as the mountainous region of Chongqing, the model still exhibits systematic underestimation. This indicates that although topographic features have compensatory capacity, they cannot fully substitute for canopy structure information. Future work could enhance structural sensitivity in high-biomass regions by integrating multi-source remote sensing data, such as P-band microwave or full-waveform LiDAR, thereby further improving model robustness against saturation under complex terrain conditions.
4.4. Spatial Residual Distribution and Identification of Uncertainty Regions
The prediction residuals in this study exhibit significant spatial heterogeneity. Systematic underestimation is observed in regions such as Chongqing and Inner Mongolia, whereas slight overestimation occurs in eastern provinces including Jiangsu and Jiangxi. The spatial variation in residuals can be partly explained by (1) complex mountainous terrain causing degradation in GEDI LiDAR echo quality, which leads to larger reference data errors, and (2) uneven distribution of training samples in these regions, weakening model generalization and reducing its ability to capture local feature variations effectively. Moreover, forest type heterogeneity induces spectral response shifts, especially in mixed and conifer–broadleaf mixed forests where canopy structure and leaf morphology differ, resulting in partial decoupling from principal spectral indices. In eastern and southern provinces, overestimation may also be related to differences in forest types, higher surface reflectance, or sample structure imbalance, particularly where spectral characteristics of some plantation forests differ from the dominant training samples. Future work should enhance sample coverage in regions with sparse data, adopt regional stratified modeling strategies, and incorporate multi-source remote sensing data to further improve prediction accuracy in complex terrains and ecologically heterogeneous areas.
4.5. Impact of GEDI L4A Data Errors
This study employs the GEDI L4A product as the reference AGB dataset for model training and validation, offering global coverage with relatively uniform spatial distribution of high-resolution AGB estimates [20,22]. However, it is critical to acknowledge that GEDI L4A is not an error-free “ground truth” and its accuracy varies regionally. Previous research has highlighted that GEDI prediction models exhibit reduced accuracy in Asia [49], particularly in complex terrain (e.g., steep slopes) and high-AGB forests. For example, Liu et al. [31] reported significantly increased errors in terrain and tree height retrievals in areas with slopes >25°, while Duncanson et al. [49] noted potential regional systematic biases due to limited training data representativeness in Asia.
In this study, significant systematic underestimation (negative bias) occurs in the high AGB range (>300 Mg·ha−1) (Table 6, Figure 6), partially attributable to the spectral saturation of Sentinel-2 optical imagery (as discussed in Section 4.2). At the same time, we cannot exclude the possibility that the GEDI L4A reference values themselves are systematically underestimated in the complex terrain and high-biomass forests of southwest China, thereby partially contributing to the observed negative model bias. The negative bias relative to GEDI therefore likely arises from the combined effects of model error (including optical saturation) and GEDI reference error.
To mitigate the impact of GEDI errors, strict quality control was applied during preprocessing: samples with excessive slope or relative error were filtered out, and grid-based aggregation was used to smooth random geolocation errors (approximately ±10 m) and the uncertainties of individual AGB estimates.
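The filtering and aggregation steps described above can be sketched as follows. This is a minimal illustration, not the preprocessing code used in this study: the cut-off values `MAX_SLOPE_DEG` and `MAX_REL_ERROR`, the cell size, the `slope` key, and the function name are all assumptions, since the text specifies only that samples with "excessive" slope and relative error were removed. The keys `agbd` and `agbd_se` follow the GEDI L4A naming for the footprint AGB density and its standard error.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical thresholds and grid size (not stated in the text).
MAX_SLOPE_DEG = 25.0   # assumed slope cut-off, in degrees
MAX_REL_ERROR = 0.5    # assumed cut-off for agbd_se / agbd
CELL_SIZE_M = 100.0    # assumed grid cell size, in metres


def filter_and_aggregate(footprints, cell_size=CELL_SIZE_M):
    """Drop low-quality footprints, then average AGB per grid cell.

    footprints: iterable of dicts with keys
        x, y     projected coordinates in metres
        agbd     footprint AGB density (Mg/ha)
        agbd_se  standard error of agbd (Mg/ha)
        slope    terrain slope in degrees (illustrative key)
    Returns {(col, row): mean AGB of the retained footprints in that cell}.
    """
    cells = defaultdict(list)
    for fp in footprints:
        if fp["agbd"] <= 0:
            continue  # relative error is undefined for non-positive AGB
        if fp["slope"] > MAX_SLOPE_DEG:
            continue  # steep terrain: less reliable waveform
        if fp["agbd_se"] / fp["agbd"] > MAX_REL_ERROR:
            continue  # relative error too large
        key = (int(fp["x"] // cell_size), int(fp["y"] // cell_size))
        cells[key].append(fp["agbd"])
    # Averaging within a cell damps random geolocation and footprint-level
    # noise; it does not remove any systematic bias in the reference data.
    return {key: mean(values) for key, values in cells.items()}
```

Averaging within cells reduces random error roughly in proportion to the square root of the number of footprints per cell, which is why aggregation helps with geolocation noise but cannot correct the regional systematic biases discussed in this section.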
Despite these measures, potential systematic biases in GEDI L4A may remain in certain regions. Nonetheless, the core mechanisms revealed here, namely the saturation response of optical spectral signals in high-AGB areas and the compensatory role of topographic features, are strongly supported by both mechanistic reasoning and data evidence.
First, the nonlinear relationships between input features and model responses are physically plausible: as shown in Figure 7, red and red-edge reflectances (B4, B5) plateau near 100 Mg·ha−1, and the responses of vegetation indices such as NDSI weaken or invert beyond 100–150 Mg·ha−1. This behavior is consistent with the extensive spectral saturation literature [10, 17, 18] and is unlikely to be driven solely by GEDI errors.
Second, topographic features such as slope_avg and dem_std show high importance in the SHAP analysis and maintain stable, positive predictive contributions in the medium- to high-AGB range (<300 Mg·ha−1) according to the LOWESS curves, without obvious plateauing or reversal. This suggests that their compensatory effect operates independently of GEDI accuracy within this range and more likely reflects their indirect explanatory power for forest site conditions. However, in extremely high AGB regions this trend weakens, indicating that the compensatory effect also has its limits.
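To illustrate the kind of smoothing behind such curves, the following is a minimal single-pass LOWESS sketch (local linear fits with tricube weights, without robustness iterations) in plain Python. It is not the implementation used in this study; the function name `lowess`, the `frac` span parameter, and the synthetic input values are assumptions for illustration only.

```python
import math


def lowess(x, y, frac=0.5):
    """Single-pass LOWESS: one weighted local linear fit per point.

    x, y : sequences of equal length (e.g. AGB and a feature's SHAP value)
    frac : fraction of the points used in each local fit
    Returns the smoothed y value at each x.
    """
    n = len(x)
    k = min(n, max(3, math.ceil(frac * n)))  # neighbours per local fit
    fitted = []
    for x0 in x:
        # Bandwidth: distance to the k-th nearest point (x0 itself included).
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12
        # Tricube weights; points at or beyond the bandwidth get weight 0.
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Synthetic example: a contribution that rises with AGB and then flattens.
agb = [20, 60, 100, 140, 180, 220, 260, 300, 340, 380]
shap_value = [-8.0, -5.5, -2.0, 0.5, 2.5, 4.0, 5.5, 6.0, 6.2, 6.1]
smoothed = lowess(agb, shap_value, frac=0.5)
```

A plateau or reversal shows up as a flattening or sign change in the slope of the smoothed curve. A production analysis would more likely rely on an established implementation with robustness iterations.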
Finally, stratified error statistics (Table 6, Figure 6) show a systematic negative bias in high-AGB regions affected by spectral saturation, rather than an overall increase in random errors, further supporting the saturation mechanism.
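The distinction between systematic bias and random error can be made explicit by decomposing the error within each AGB stratum, since RMSE² = bias² + SD², where SD is the spread of the residuals around their mean. The sketch below is a hedged illustration of that calculation; the bin edges (other than the 300 Mg·ha−1 threshold mentioned in the text), the function name, and the output layout are assumptions and do not reproduce Table 6.

```python
import math


def stratified_errors(reference, predicted,
                      edges=(0, 100, 200, 300, float("inf"))):
    """Bias, RMSE, and residual spread per reference-AGB stratum.

    reference, predicted : sequences of AGB values (Mg/ha)
    edges                : stratum boundaries (assumed, for illustration)
    Residuals are predicted minus reference, so a negative bias means
    underestimation.
    """
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        resid = [p - r for r, p in zip(reference, predicted) if lo <= r < hi]
        if not resid:
            continue  # skip empty strata
        n = len(resid)
        bias = sum(resid) / n
        rmse = math.sqrt(sum(e * e for e in resid) / n)
        # Random component: what is left of the RMSE once bias is removed.
        sd = math.sqrt(max(rmse ** 2 - bias ** 2, 0.0))
        rows.append({"range": (lo, hi), "n": n,
                     "bias": bias, "rmse": rmse, "sd": sd})
    return rows
```

If saturation dominates, the highest stratum should show a large negative `bias` while `sd` stays comparable to that of the lower strata; a general loss of precision would instead inflate `sd`.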
In summary, despite GEDI accuracy limitations in some areas, the findings on optical saturation thresholds, error distribution patterns, and topographic compensation mechanisms are robust and grounded in reproducible, physically meaningful relationships between model inputs and outputs. While GEDI errors may partially contribute to the negative bias, the spectral saturation features (Figure 8a,d) and the stable topographic compensation (Figure 8c,f) are independent of the reference data and aligned with physical theory, supporting spectral saturation as the primary source of high-AGB prediction errors.
4.6. Future Work
Integrate Higher-Accuracy Reference Data: Collect extensive airborne LiDAR data or high-precision field plot measurements in representative Chinese forest types, especially high-AGB areas, to directly validate and calibrate GEDI L4A accuracy and bias, providing more reliable reference data so that optical saturation effects can be analyzed in isolation from reference error.
Develop Regional GEDI Correction Models: Using newly acquired high-accuracy reference data, establish correction models tailored for China’s major forest ecological zones to reduce systematic GEDI L4A biases before large-scale modeling and analysis.
Explore Multi-Source Data Fusion: Investigate combining other spaceborne LiDAR (e.g., ICESat-2 ATL08) or SAR datasets (e.g., L-band ALOS-2/PALSAR-2, C-band Sentinel-1) to supply structural information that complements or substitutes for GEDI, thereby directly alleviating optical saturation issues, especially where GEDI coverage is limited or uncertain.
Regional and Forest-Type Specific Modeling: Conduct province-level or forest-type (coniferous, broadleaf, mixed) sub-model training and mechanism analyses to better capture regional GEDI error patterns and spectral saturation response heterogeneity.