Assessing the Utility of Satellite Embedding Features for Biomass Prediction in Subtropical Forests with Machine Learning

Jin, Chao; Jiang, Xiaodong; Wen, Lina; Wu, Chuping; Xu, Xia; Jiao, Jiejie

doi:10.3390/rs18030436


by Chao Jin ¹,², Xiaodong Jiang ³, Lina Wen ⁴, Chuping Wu ¹,⁵,⁶, Xia Xu ⁶,⁷ and Jiejie Jiao ¹,⁵,⁶,⁷,*

¹ Zhejiang Academy of Forestry, Hangzhou 310023, China

² Zhejiang Tiantong Forest Ecosystem National Observation and Research Station, School of Ecological and Environmental Sciences, East China Normal University, Shanghai 200241, China

³ Ecological and Environmental Science and Research Institute of Zhejiang Province, Hangzhou 310007, China

⁴ Yunhe County Ecological Forestry Development Center, Lishui 323000, China

⁵ Zhejiang Hangzhou Urban Ecosystem Research Station, Hangzhou 310023, China

⁶ Zhejiang Key Laboratory of Carbon Sequestration and Emission Reduction in Agriculture and Forestry, Zhejiang A&F University, Hangzhou 311300, China

⁷ School of Forestry and Biotechnology, Zhejiang A&F University, Hangzhou 311300, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(3), 436; https://doi.org/10.3390/rs18030436

Submission received: 4 January 2026 / Revised: 23 January 2026 / Accepted: 27 January 2026 / Published: 30 January 2026

(This article belongs to the Special Issue Remote Sensing Applications for Forest Ecosystem Monitoring and Spatial Modeling (2nd Edition))

Highlights

What are the main findings?

Machine learning models trained on embedding features derived from the Google AlphaEarth Foundations dataset demonstrated strong predictive capability for forest biomass, particularly in broad-leaved and coniferous forests, as validated through five-fold cross-validation. These models successfully captured large-scale spatial patterns of forest biomass distribution.
Spatially predicted biomass maps revealed clear landscape-scale gradients, with lower biomass values predominantly occurring in fragmented forest patches and forest edges near urbanized areas, while higher biomass levels were concentrated within continuous, intact forest regions that are relatively distant from human disturbance.

What are the implications of the main findings?

Embedding-based remote sensing models offer an effective and efficient framework for monitoring biomass dynamics in Yunhe Forestry Station, especially in regions where field surveys or the acquisition of multi-source remote sensing data are constrained by terrain accessibility or logistical limitations.
By leveraging embeddings that integrate information from diverse Earth observation sensors, this study demonstrates an alternative, scalable methodology for forest biomass estimation, highlighting the potential of representation learning to advance large-area forest carbon assessment and management.

Abstract

Regional-scale spatial predictions of forest biomass are critical for evaluating the effects of management practices across environmental gradients. Although multi-source remote sensing combined with machine learning has been widely applied to estimate forest biomass, these approaches often rely on complex data acquisition and processing workflows that limit their scalability for large-area assessments. To improve efficiency, this study evaluates the potential of annual multi-sensor satellite embeddings derived from the AlphaEarth Foundations model for forest biomass prediction. Using field inventory data from 89 forest plots at the Yunhe Forestry Station in Zhejiang Province, China, we assessed and compared the performance of four machine learning algorithms: Random Forest (RF), Support Vector Regression (SVR), Multi-Layer Perceptron Neural Networks (MLPNN), and Gaussian Process Regression (GPR). Model evaluation was conducted using repeated 5-fold cross-validation. The results show that SVR achieved the highest predictive accuracy in broad-leaved and mixed forests, whereas RF performed best in coniferous forests. When all forest types were modeled together, predictive performance was consistently limited across algorithms, indicating substantial heterogeneity (e.g., structure, environment, and topography) among forest types. Spatial prediction maps across Yunhe Forestry Station revealed ecologically coherent patterns, with higher biomass values concentrated in intact forests with less human disturbance and lower biomass primarily occurring in fragmented forests and near urban regions. Overall, this study highlights the potential of embedding-based remote sensing for regional forest biomass estimation and suggests its utility for large-scale forest monitoring and management.

Keywords:

Google Earth Engine; Google AlphaEarth Foundations; embeddings; machine learning; forest biomass

1. Introduction

Forests represent the largest carbon reservoir in terrestrial ecosystems, accounting for approximately 90% of global terrestrial biomass [1], and play an irreplaceable role in the terrestrial carbon cycle [2]. Consequently, accurate quantification and estimation of forest biomass are essential for understanding imbalances in the global carbon budget [3], as well as for informing forest management and conservation strategies [4,5]. Traditionally, forest biomass has been estimated using in situ allometric equation-based approaches or destructive field harvesting [6]. However, these methods are inherently labor-intensive, destructive, and impractical for biomass estimation across large spatial extents [7]. Remote sensing has become one of the most effective approaches for estimating forest biomass [8] because it enables the acquisition of high-resolution data to characterize forest structure and derive key inventory parameters [9]. Furthermore, airborne and spaceborne remote sensing have made it possible to estimate forest biomass across large and previously inaccessible regions [10]. Canopy structural attributes derived from LiDAR data are widely recognized as essential predictors of forest biomass because of LiDAR’s ability to capture three-dimensional vegetation structure and vertical canopy complexity, which are related to forest physiological and ecological processes [11]. Efficient, long-term monitoring of forest biomass dynamics is critical for guiding forest conservation strategies and enabling the early detection of disturbances, including deforestation and natural disturbance events [12].

The accuracy of forest biomass estimation strongly depends on the type and source of remote sensing data, such as optical imagery, synthetic aperture radar, and LiDAR data [13]. Optical remote sensing data have been widely used to estimate forest biomass by establishing empirical relationships between biomass and vegetation-related indices such as the normalized difference vegetation index (NDVI), enhanced vegetation index (EVI), leaf area index (LAI), photosynthetically active radiation (PAR), and absorbed photosynthetically active radiation (APAR). However, the applicability of optical imagery is constrained by signal saturation, especially in dense vegetation, and by sensitivity to atmospheric conditions [14,15]. In contrast, radar remote sensing overcomes some of these limitations owing to its unique imaging mechanism and its ability to acquire data independent of weather conditions; it also allows partial penetration of the forest canopy, which has made it an important data source for biomass estimation and mapping [16]. Nevertheless, forest biomass estimates derived from a single radar dataset remain subject to substantial uncertainty [17]. Although integrating multi-source remote sensing data (e.g., optical and radar observations) can improve predictive performance, it also increases data acquisition costs and is constrained by limited computing power [18].

Recently, the embedding dataset produced by AlphaEarth Foundations has integrated observations from multiple satellite platforms into a unified representation at a 10 m spatial resolution [19], which reduces the difficulty, time, and cost compared with acquiring and processing multi-source remote sensing data. Recent studies have begun to explore the application of AlphaEarth Foundations’ embeddings in remote sensing tasks, such as land cover characterization, environmental monitoring, and large-scale automatic mapping [20,21]. These studies highlight the strength of embedding representations in integrating heterogeneous spectral, textural, and temporal information into compact feature spaces. However, their application to forest biomass estimation remains limited, and the performance of embedding-based models has not been systematically evaluated. If embedding-based predictors can improve the goodness-of-fit of models for estimating forest biomass, such data could be more broadly adopted in future forest biomass modeling efforts.

Although several studies have investigated forest biomass estimation using multi-source remote sensing variables [22,23], no study has yet compared the predictive performance of models based on embeddings and evaluated their applicability in forests, especially in natural secondary forests, which account for more than 50% of the total forest cover in China [24]. In this study, we combine field-based measurements with embeddings and apply four machine learning algorithms to generate spatially explicit maps of forest biomass in Yunhe Forestry Station. This study aims to generate management-oriented recommendations for Yunhe Forestry Station operations. Specifically, our objectives are to (1) compare the predictive performance of Random Forest (RF), Support Vector Regression (SVR), Multi-Layer Perceptron Neural Networks (MLPNN), and Gaussian Process Regression (GPR) algorithms in forest biomass estimation; (2) build and evaluate the models using 5-fold cross-validation on training and testing data; and (3) examine forest biomass maps in Yunhe Forestry Station for future carbon storage assessment and management.

2. Materials and Methods

2.1. Study Area

The study was conducted at the Yunhe Forestry Station in Lishui City, located in southwestern Zhejiang Province, China (119°20′–119°44′E, 27°53′–28°19′N; Figure 1). This region has a subtropical monsoon climate, with mean annual precipitation ranging from 1465 to 1969 mm. The mean annual temperature is 17.6 °C, with monthly means reaching about 28.4 °C in the hottest month (July) and dropping to nearly 6.3 °C in the coldest (January). Forests cover 89.3% of the study area and are dominated by Schima superba, Pinus massoniana, Pinus hwangshanensis, and Cunninghamia lanceolata.

2.2. Data Preparation

2.2.1. Field Inventory and Calculation of Biomass of Sample Plots

We selected representative forests for field survey in 2024 based on species composition and stand density, including broad-leaved forests, mixed forests, and coniferous forests. A total of 89 sample plots were randomly established according to forest type and spatial distribution across the study region, and each plot measured 20 m × 20 m. All trees with diameter at breast height (DBH; 1.3 m aboveground) ≥ 5 cm were recorded and measured for DBH, height, height under branch, and spatial location. The sampled plots comprised 26 broad-leaved forest stands, 38 coniferous and broad-leaved mixed forest stands (hereafter referred to as mixed forests), and 25 coniferous forest stands.

Tree-level biomass was calculated by summing the biomass of the stem, foliage, and root components, as defined in Equation (1), using species-specific allometric growth equations (Table 1) [25]. These equations estimate biomass for different species groups, such as pine, fir, hard-wood and broad leaves 1, hard-wood and broad leaves 2, soft-wood and broad leaves, and moso bamboos. Plot-level biomass was subsequently obtained by summing the biomass values of all measured trees within each sample plot.

$$\mathrm{Biomass} = B_{\mathrm{stem}} + B_{\mathrm{foliage}} + B_{\mathrm{root}} \qquad (1)$$

where $\mathrm{Biomass}$ represents the total biomass including aboveground and belowground biomass; $B_{\mathrm{stem}}$ accounts for the biomass of the main stem (excluding bark); $B_{\mathrm{foliage}}$ accounts for the biomass of foliage; and $B_{\mathrm{root}}$ accounts for the biomass of roots.
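As a minimal sketch of how Equation (1) and the plot-level summation fit together, the Python snippet below adds the three component values for each tree and scales the plot total to a per-hectare density. The component values are made-up numbers rather than outputs of the allometric equations in Table 1, and the kilogram input unit and the function names are assumptions for illustration; only the 20 m × 20 m plot size is taken from the text.

```python
# Illustrative only: component biomass values are invented, not Table 1 outputs.
PLOT_AREA_HA = 20 * 20 / 10_000  # a 20 m x 20 m plot covers 0.04 ha


def tree_biomass(stem_kg, foliage_kg, root_kg):
    """Equation (1): total tree biomass is the sum of its three components."""
    return stem_kg + foliage_kg + root_kg


def plot_biomass_t_per_ha(trees, plot_area_ha=PLOT_AREA_HA):
    """Sum tree-level biomass (assumed kg) over a plot and convert to t/ha."""
    total_kg = sum(tree_biomass(*tree) for tree in trees)
    return total_kg / 1000 / plot_area_ha


# Hypothetical (stem, foliage, root) values in kg for three trees in one plot.
example_trees = [
    (120.0, 6.0, 30.0),
    (85.5, 4.5, 20.0),
    (200.0, 10.0, 50.0),
]
density = plot_biomass_t_per_ha(example_trees)
print(f"Plot biomass density: {density:.2f} t/ha")
```

Under these assumed inputs the three trees total 526 kg, which corresponds to 13.15 t/ha for a 0.04 ha plot.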

2.2.2. Collection of Embedding Dataset

We utilized an embedding dataset produced by AlphaEarth Foundations, a global multi-sensor representation model developed to generate compact yet information-dense geospatial feature vectors that capture spatial patterns of terrestrial surface conditions and climate-related processes [19]. These embeddings were derived from a transformer-based architecture trained on a wide spectrum of Earth observation data sources, including Sentinel-1 synthetic aperture radar backscatter, multispectral reflectance from Sentinel-2 and Landsat-8, MODIS-derived vegetation indices, GRACE-based gravity measurements, GEDI canopy height observations, as well as topographic variables, soil moisture products, and atmospheric indicators from the ERA5-Land reanalysis. By integrating these heterogeneous inputs, the embedding features provide rich contextual information related to vegetation structure, terrain, climate, and hydrological conditions, all of which are directly or indirectly associated with forest biomass [10]. In addition, the embedding bands are resilient to cloud interference and exhibit spatial consistency across regions, thereby minimizing data variability caused by cloud contamination.

Embedding datasets were accessed through the Google Earth Engine (GEE) cloud-computing environment (https://developers.google.com/earth-engine/datasets/ (accessed on 20 November 2025)) using the dataset collection GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL. For this analysis, embeddings were extracted for 89 forest inventory plots, comprising 26 broad-leaved, 38 mixed, and 25 coniferous stands. The embedding data have a spatial resolution of 10 m, whereas each plot measures 20 m × 20 m. To reconcile this scale mismatch, the values of each embedding band were averaged across the four 10 m pixels within each plot. Based on the geographic coordinates of each plot center, embedding features were retrieved from January 2023 to January 2024, overlapping with the period of field surveys. The embedding features were then aggregated as a median composite to reduce the potential influence of short-term phenological variability. Each pixel contained a 64-element vector representing integrated information from the original multi-sensor inputs. Accordingly, all 64 plot-averaged embedding bands (A00–A63) were used as predictor variables in the machine learning models. Model implementation, data processing, and visualization were performed in a JupyterLab environment, with large-scale raster operations and model deployment conducted through Google Earth Engine and the geemap package (version 0.35.3) in Python version 3.13.3. All datasets used in this study are summarized in Table S2.
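The pixel-to-plot aggregation described above can be sketched in plain Python. The snippet averages each of the 64 bands over the pixels that fall inside one plot and labels the result A00–A63. The pixel values are synthetic and the function name is hypothetical; in practice the averaging would more likely be done inside Earth Engine, which is not shown here.

```python
# Illustrative only: synthetic pixel values stand in for real embedding vectors.
N_BANDS = 64
BAND_NAMES = [f"A{i:02d}" for i in range(N_BANDS)]  # A00 ... A63


def plot_mean_embedding(pixels):
    """Average every embedding band over the pixels inside one plot."""
    if not pixels:
        raise ValueError("at least one pixel is required")
    if any(len(pixel) != N_BANDS for pixel in pixels):
        raise ValueError("every pixel must carry 64 embedding values")
    n = len(pixels)
    means = [sum(pixel[b] for pixel in pixels) / n for b in range(N_BANDS)]
    return dict(zip(BAND_NAMES, means))


# Four synthetic 10 m pixels covering one 20 m x 20 m plot.
pixels = [
    [0.01 * (k + 1) * ((b % 5) - 2) for b in range(N_BANDS)]
    for k in range(4)
]
features = plot_mean_embedding(pixels)
print(len(features), round(features["A00"], 4))
```

Each plot then contributes one row of 64 predictors, which keeps the predictor matrix at 89 × 64 for the full dataset.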

2.3. Machine Learning Algorithms

The overall workflow of this study consists of two primary components. First, given the relatively small sample size, all biomass prediction models were constructed with 5-fold cross-validation using multiple machine learning algorithms, with model training and validation conducted for all forest plots as well as for different forest types (Table 2). All embedding variables were used in the model because these embedding variables integrate both biotic and abiotic properties from multi-source satellite observations and each embedding contains information derived from source data [19]. Model performance was systematically assessed to evaluate predictive accuracy. Second, the optimized models were applied to generate spatially explicit maps of forest biomass using the full datasets across the Yunhe Forestry Station. Four machine learning approaches were implemented, including Random Forest (RF), Support Vector Regression (SVR), Multi-Layer Perceptron Neural Networks (MLPNN), and Gaussian Process Regression (GPR). These algorithms were chosen because of their broad applicability and demonstrated robustness in modeling forest biomass from complex, high-dimensional predictor variables.

2.3.1. Random Forest

RF is a flexible, non-parametric machine learning algorithm designed to capture nonlinear relationships through an ensemble of decision trees [26,27]. In this method, multiple trees are generated using bootstrap resampling of the training data, and each tree is constructed from randomly selected subsets of predictor variables and observations from the original dataset. Model predictions are obtained by aggregating the outputs of all individual trees; for a continuous response such as biomass, the final estimate is the average across the ensemble (majority voting applies only to classification). Model tuning and implementation were conducted using the randomForest function from the R package randomForest (version 4.7.1.2) [28]. Following the recommendation of Breiman (2001) [26], the number of trees was set to 500 to promote model stability and enhance predictive robustness.
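To make the bootstrap-and-aggregate idea concrete, the following toy Python sketch bags one-split regression trees ("stumps"), each fitted on a bootstrap sample with a random subset of candidate features, and averages their outputs. It is a deliberately simplified stand-in rather than the randomForest implementation used in the study: real forests grow deep trees and redraw the candidate features at every split. The data and function names are invented for illustration.

```python
import random


def fit_stump(X, y, feature_ids):
    """Fit a one-split regression tree using only the candidate features."""
    best = None
    for j in feature_ids:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= thr]
            right = [yi for row, yi in zip(X, y) if row[j] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, ml, mr)
    if best is None:  # candidate features are constant in this sample
        mean_y = sum(y) / len(y)
        return (0, float("inf"), mean_y, mean_y)
    return best[1:]


def predict_stump(stump, row):
    j, thr, ml, mr = stump
    return ml if row[j] <= thr else mr


def fit_forest(X, y, n_trees=500, n_candidates=1, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feats = rng.sample(range(p), n_candidates)   # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Regression ensembles average the tree outputs instead of voting."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)


# Invented data: the first feature drives the response, the second is noise.
X = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0], [5.0, 0.0], [6.0, 1.0]]
y = [10.0, 12.0, 30.0, 33.0, 50.0, 52.0]
forest = fit_forest(X, y, n_trees=500, n_candidates=1, seed=0)
low = predict_forest(forest, [1.0, 0.0])
high = predict_forest(forest, [6.0, 0.0])
print(low < high)
```

Because every tree returns a mean of observed responses, the averaged prediction can never fall outside the observed range, which is one reason tree ensembles tend to compress predictions toward the mean.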

2.3.2. Support Vector Regression

SVR was applied as a kernel-based, non-parametric learning approach to capture nonlinear associations between remote-sensing predictors and forest biomass, and it is particularly effective when field observations are limited. The predictive capability of SVR depends strongly on the choice of kernel function. In this study, the Radial Basis Function (RBF) kernel was adopted because of its widespread application and proven effectiveness in forest biomass estimation reported in earlier research [16,29]. Model calibration involved optimizing two hyperparameters: the regularization coefficient (C), which controls model complexity, and the kernel scale parameter (γ), which governs the flexibility of the RBF kernel. Parameter tuning and model implementation were conducted using the R package kernlab (version 0.9.33) [30].
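The role of the kernel scale parameter can be illustrated with a short Python sketch of the RBF kernel, written here in the common form k(x, z) = exp(−γ‖x − z‖²). The function name and the example vectors are assumptions for illustration, and the parameterization in a given library may be labeled differently.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF similarity between two feature vectors: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# Two hypothetical predictor vectors with squared distance 2.
x, z = [0.0, 0.0], [1.0, 1.0]
smooth = rbf_kernel(x, z, gamma=0.05)  # small gamma: distant points stay similar
sharp = rbf_kernel(x, z, gamma=5.0)    # large gamma: similarity decays quickly
print(round(smooth, 4), round(sharp, 6))
```

A larger γ makes the kernel more local, so the fitted function can bend more sharply around individual plots; with only a few dozen samples per forest type, that flexibility has to be balanced against C to avoid overfitting.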

2.3.3. Multi-Layer Perceptron Neural Network

MLPNNs were implemented using a sequential architecture and are widely applied in forest biomass estimation and spatial mapping, particularly in situations involving strong nonlinear relationships [31,32]. An MLPNN typically consists of three main components: an input layer containing remote-sensing predictors, one or more intermediate hidden layers, and a single output node used to estimate biomass or its components. During model training, inter-layer connection weights are optimized through the backpropagation algorithm, which iteratively reduces discrepancies between predicted values and field-based biomass measurements. Training proceeds until a predefined convergence criterion is met or a maximum number of iterations is reached. Model performance was evaluated using the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE), with optimal models exhibiting higher R² values alongside lower RMSE and MAE. All MLPNN models were developed using the R package RSNNS (version 0.4.17) [33].

2.3.4. Gaussian Process Regression

GPR is a flexible, probabilistic modeling framework that has been widely applied in the estimation of forest biomass and related variables, particularly in studies where explicit uncertainty characterization is required [34]. As a Bayesian non-parametric approach, GPR generates predictions by defining a covariance structure through kernel functions, which quantify similarity among input samples from both training and prediction datasets.

The choice of kernel function is a critical determinant of GPR performance because it governs assumptions about the smoothness, variability, and spatial correlation of the underlying response surface. In this study, we adopted a radial basis function kernel, also referred to as a scaled squared exponential covariance function [30], which has demonstrated strong performance in previous biomass estimation applications using combined optical–SAR indices [35]. All GPR models were implemented using the R package kernlab (version 0.9.33) [30].

2.4. K-Fold Cross-Validation and Model Evaluation

To evaluate the predictive performance of each machine learning algorithm, a grouped 5-fold cross-validation strategy was adopted. This approach was chosen to mitigate potential inflation of accuracy estimates caused by spatial autocorrelation between training and validation samples [36,37]. By assigning all observations from the same spatial group exclusively to either the training subset or the validation subset within each fold, statistical independence between model calibration and evaluation data was maintained. Under this scheme, approximately 80% of the data were used for model training and the remaining 20% for performance assessment in each fold. The cross-validation procedure was repeated ten times to account for algorithmic randomness, and all implementations were conducted using the train function from the R package caret (version 7.0.1) [38].
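A minimal Python sketch of repeated, grouped k-fold splitting is given below. It assigns whole groups to folds so that no group contributes to both the training and validation subsets, and it reshuffles the assignment for each repeat. The grouping labels are hypothetical, since the text does not state how spatial groups were defined, and the study itself relied on the caret package in R rather than this code.

```python
import random


def grouped_kfold(groups, k=5, seed=0):
    """Yield (train, test) index lists; each group falls entirely in one fold."""
    rng = random.Random(seed)
    unique = sorted(set(groups))
    rng.shuffle(unique)
    fold_of = {g: i % k for i, g in enumerate(unique)}
    for fold in range(k):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test


def repeated_grouped_kfold(groups, k=5, repeats=10, seed=0):
    """Repeat the grouped split with a different shuffle for each repeat."""
    for r in range(repeats):
        yield from grouped_kfold(groups, k=k, seed=seed + r)


# Hypothetical spatial groups: 89 plots assigned to 30 invented clusters.
groups = [i % 30 for i in range(89)]
splits = list(repeated_grouped_kfold(groups, k=5, repeats=10, seed=0))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

With 30 groups and five folds, each validation subset holds six groups, or roughly 20% of the 89 plots, which matches the 80/20 proportion described above.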

Model accuracy in predicting forest biomass was quantified using three commonly applied regression metrics: the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE). Higher R² values indicate stronger correlation between observed and predicted biomass, reflecting improved explanatory power. Both RMSE and MAE measure prediction error magnitude, with smaller values representing greater model accuracy. Differences between RMSE and MAE further provide insight into the dispersion of prediction errors, where closer values suggest reduced error variance. Among these indicators, RMSE was emphasized as the primary evaluation criterion because it retains the original biomass units and is particularly relevant for spatially explicit mapping. The formulas are as follows:

$$R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}} \qquad (2)$$

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_{i} - y_{i})^{2}}{n}} \qquad (3)$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_{i} - y_{i}| \qquad (4)$$

where n is the number of samples; $y_{i}$ and $\hat{y}_{i}$ are the measured and predicted biomass, respectively; and $\bar{y}$ is the average measured biomass. The final model selection prioritized lower RMSE in conjunction with higher R² across validation folds, ensuring an optimal balance between predictive precision and generalization capability.
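Equations (2)–(4) translate directly into a few lines of Python. The sketch below is a plain-Python rendering with invented observed and predicted values in t/ha; the function names are chosen for illustration and are not taken from the study's code.

```python
import math


def r2(y, y_hat):
    """Equation (2): one minus residual over total sum of squares."""
    mean_y = sum(y) / len(y)
    ss_res = sum((obs - pred) ** 2 for obs, pred in zip(y, y_hat))
    ss_tot = sum((obs - mean_y) ** 2 for obs in y)
    return 1 - ss_res / ss_tot


def rmse(y, y_hat):
    """Equation (3): root mean square error, in the units of y."""
    return math.sqrt(sum((pred - obs) ** 2 for obs, pred in zip(y, y_hat)) / len(y))


def mae(y, y_hat):
    """Equation (4): mean absolute error, in the units of y."""
    return sum(abs(pred - obs) for obs, pred in zip(y, y_hat)) / len(y)


# Invented example values (t/ha).
observed = [50.0, 80.0, 110.0, 140.0]
predicted = [60.0, 75.0, 100.0, 150.0]
print(round(r2(observed, predicted), 3),
      round(rmse(observed, predicted), 2),
      round(mae(observed, predicted), 2))
```

Note that R² as defined in Equation (2) becomes negative when a model predicts worse than the sample mean, and that RMSE is never smaller than MAE; the gap between the two widens as a few large errors dominate.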

2.5. Predictor Importance and Spatial Applicability Analysis

Because leave-one-out validation strategies may yield misleading interpretations when strong multicollinearity exists among predictors, Shapley Additive Explanations (SHAP) were employed to assess variable importance [39]. SHAP values were calculated to quantify the relative contribution of each embedding feature to biomass estimates and to determine which embedding dimensions were most influential. This framework enabled a clearer interpretation of the associations between forest biomass and predictors (such as canopy structure, climate, and land-surface characteristics), while also providing guidance for potential feature selection or dimensionality reduction in subsequent analyses.
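The quantity that SHAP approximates can be shown exactly for a very small model. The Python sketch below enumerates all feature subsets to compute Shapley values for a three-feature toy function, replacing absent features with a baseline value. The toy model, the baseline, and the function names are assumptions for illustration; exhaustive enumeration grows exponentially with the number of features, so a 64-band model would require the approximations used by SHAP implementations.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values; features outside a subset take baseline values."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                total += weight * (value(set(s) | {i}) - value(set(s)))
        phi.append(total)
    return phi


def toy_model(z):
    """Hypothetical biomass model with one additive and one interaction term."""
    return 40.0 + 30.0 * z[0] + 10.0 * z[1] * z[2]


x = [1.0, 2.0, 0.5]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print([round(v, 6) for v in phi])
```

In this toy case the additive feature receives its full effect of 30, the two interacting features share their joint effect of 10 equally, and the contributions sum to the difference between the prediction and the baseline prediction.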

For the machine learning approach that demonstrated the highest predictive performance, the final model was retrained using the complete dataset and subsequently applied to spatial prediction across the study area. Biomass maps were generated using the full embedding raster layers for the years 2017 and 2024 at a spatial resolution of 20 m, allowing for detailed representation of biomass distribution patterns. All statistical computations and analyses described in this section were conducted using R version 4.4.3 [40].

3. Results

3.1. Model Training and Validation

The cross-validation accuracy results of the RF, SVR, MLPNN, and GPR algorithms across different forest types are summarized in Table 3, which is used as the main benchmark for comparing different machine learning models. Model performance was evaluated using the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE) under a five-fold cross-validation framework. When all forest plots were analyzed collectively, predictive performance was uniformly weak across models. Although GPR marginally outperformed the other methods, overall performance remained low (R² = 0.06; Table 3). Predictions showed pronounced regression toward the mean, with limited responsiveness to variation in observed biomass values, as illustrated in Figure 2A. The high error metrics (RMSE = 41.9 t/ha; MAE = 34.6 t/ha) further indicate inadequate model transferability when all forest plots were analyzed together.

Partitioning the dataset by forest type improved model accuracy, yet challenges persisted in representing within-type forest characteristics. For broad-leaved forests, the SVR algorithm yielded moderate predictive accuracy (R² = 0.33), accompanied by lower RMSE (20.9 t/ha) and MAE (16.6 t/ha). Predicted biomass values generally followed the increasing trend of observed biomass values, particularly across intermediate biomass ranges (Figure 2B). In contrast, predictive performance for mixed forests remained weak, with the SVR algorithm explaining only a small proportion of variance (R² = 0.13) and exhibiting considerable uncertainty (RMSE = 47.4 t/ha; MAE = 40.6 t/ha). Predicted biomass values displayed limited correspondence with observed biomass and were concentrated within a narrow prediction interval (Figure 2C). The best results were obtained for coniferous forests, where the RF algorithm achieved the highest accuracy (R² = 0.48; RMSE = 34.3 t/ha; MAE = 29.8 t/ha). In this forest type, RF-derived estimates aligned more closely with the 1:1 reference line than in other categories, particularly at low to moderate biomass levels (Figure 2D).

SHAP analysis was conducted for the best-performing models, and only the top ten most influential predictors ranked by mean absolute SHAP values with relatively stable contributions across cross-validation folds are shown for visualization clarity. The resulting importance patterns revealed that the identity of these dominant predictors differed among forest types. Error bars represent inter-fold variability, highlighting sensitivity to training data composition and predictor correlations. Notably, the comparatively wide uncertainty ranges observed in panels A, C, and D indicated substantial variation in SHAP contributions across folds, suggesting less stable feature influence in these cases (Figure 3A,C,D).

3.2. Total Biomass Prediction in Yunhe Forestry Station

Using the cross-validated optimal models, final model fitting was conducted with the complete dataset and subsequently used to produce spatially explicit estimates of biomass across the study region (Figure 4). When considering all forest plots together, predicted biomass values showed strong correspondence with observed biomass (R² = 0.78), though systematic underestimation was evident at higher biomass levels (Figure 4A). Among forest types, predictions for coniferous forests demonstrated the highest agreement with measured values, yielding an R² of 0.95 alongside relatively low error magnitudes (RMSE = 16 t/ha; MAE = 13.6 t/ha). Conversely, biomass estimation in mixed forests exhibited weaker performance (R² = 0.34), accompanied by elevated uncertainty as reflected by larger RMSE (42.2 t/ha) and MAE (34.4 t/ha).

The optimized model was used to produce spatially explicit maps of biomass for broad-leaved, mixed, and coniferous forest stands within the Yunhe Forestry Station for the years 2017 and 2024 (Figure 5). At the landscape scale, biomass estimates displayed broadly similar spatial configurations across both time periods. Areas with relatively high biomass levels (>87 t/ha) were predominantly located within intact, contiguous forest patches (sites 1 and 3 in Figure 5), particularly in regions distant from urban areas. In contrast, lower biomass values (≤54 t/ha) were mainly associated with fragmented forest patches and regions adjacent to urban areas (site 2 in Figure 5).

Although the overall spatial distribution of biomass remained largely stable over time, local-scale variations were still evident between 2017 and 2024. Forests located in mountainous terrain exhibited comparatively steady biomass conditions (sites 1 and 3 in Figure 5). In several locations, predicted biomass even shifted from intermediate classes (70–87 and 87–103 t/ha) toward higher categories, indicating biomass accumulation. Conversely, declines in predicted biomass were primarily observed along forest edges and within ecotonal or transitional areas (site 2 in Figure 5).

4. Discussion

The combination of field-measured data from the Yunhe Forestry Station and the embedding-based modeling framework demonstrates the potential of a fully remote-sensing-driven approach for large-scale forest biomass estimation and management applications.

4.1. Capability of Embedding Dataset for Predicting Forest Biomass

Models developed using the AlphaEarth Foundations embedding dataset demonstrated predictive performance comparable to that reported in previous studies based on conventional LiDAR or multi-source optical–radar data. For instance, biomass prediction models using LiDAR data achieved coefficients of determination around 0.73 across broad-leaved, mixed, and pine forests in Wuyishan National Park [41], which is similar to the overall model fit obtained in this study (R² = 0.78 in Figure 4A). Likewise, previous research combining Sentinel-2A optical imagery with ALOS-2 PALSAR-2 SAR observations reported biomass prediction accuracies ranging from moderate to high (R² between 0.44 and 0.73) across various forest vegetation types [16].

However, compared with some previous studies, the relatively low coefficient of determination obtained in the validation dataset (Table 3) is due to the use of a more stringent cross-validation strategy. Repeated 5-fold cross-validation was employed to reduce optimistic bias and provide a realistic estimate of model generalization. Similarly low coefficients of determination have also been reported elsewhere. For example, in temperate grasslands in Germany, the best biomass prediction model achieved R² = 0.45 when trained with repeated cross-validation [42]. Although repeated cross-validation with machine learning algorithms was employed to mitigate overfitting, the relatively high feature-to-sample ratio may still affect model generalization.

When all forest plots were combined in a single modeling framework, none of the four machine learning algorithms performed strongly under cross-validation (Table 3). Two factors may explain this result. First, forest types within the study area differ in environmental conditions and land management, and such heterogeneity may obscure the consistent relationships between embedding features and biomass. Forest plots located closer to urban areas are more strongly influenced by anthropogenic disturbance, which affects tree growth, forest structure, and ecosystem functioning. For example, forests around urban or agricultural landscapes often show lower tree diversity and basal area than continuous and intact forests [43]. Second, the complex structural composition of mixed forests, such as high variability in canopy height, stand density, and vertical stratification [44], further obscures the relationships between embedding features and biomass. For example, identical biomass values may correspond to different spectral responses, and similar spectral signatures may arise from contrasting biomass conditions [45]. These challenges likely explain the relatively low predictive accuracy achieved for mixed forests in the final model (R² = 0.34; RMSE = 42.2 t/ha; MAE = 34.4 t/ha). Conversely, broad-leaved and coniferous forests exhibit relatively more consistent relationships between embedding features and biomass, because they have more uniform structural and optical properties [8,46].

SHAP-based interpretation further enhanced understanding of model performance. Feature importance patterns varied substantially among the three forest types, indicating that the embedding features flexibly encode biomass-relevant information across different ecological contexts, including canopy structure, vegetation spectral characteristics, and environmental conditions. However, these embedding variables represent latent features learned from multi-source satellite observations and do not correspond to explicit ecological or structural attributes. Therefore, the observed importance should be interpreted as model-driven statistical relevance rather than direct physical meaning. Notably, the SHAP values of several variables varied considerably across cross-validation folds, particularly in mixed and coniferous forests, likely due to internal structural heterogeneity and the abstract nature of embedding features.
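
The fold-to-fold variability described above can be summarized per embedding band as a mean, a standard deviation, and a coefficient of variation. The sketch below is illustrative only: the per-fold importance values are invented, the band names are picked arbitrarily from A00–A63, and the 0.5 stability threshold is an arbitrary assumption rather than a criterion used in the study.

```python
from statistics import mean, stdev

# Hypothetical per-fold mean |SHAP| values for three embedding bands
# (illustrative numbers only, not results from the study).
fold_importance = {
    "A05": [4.1, 3.8, 4.4, 4.0, 3.9],
    "A17": [2.0, 5.5, 0.4, 3.9, 1.2],
    "A42": [1.1, 1.0, 1.3, 0.9, 1.2],
}


def summarize(values):
    """Return mean, sample SD, and coefficient of variation across folds."""
    m = mean(values)
    sd = stdev(values)  # sample SD (n - 1 denominator)
    return {"mean": m, "sd": sd, "cv": sd / m}


summary = {band: summarize(vals) for band, vals in fold_importance.items()}

# Rank bands by mean importance, then flag those whose spread across folds
# is large relative to the mean (threshold chosen only for illustration).
ranked = sorted(summary, key=lambda band: summary[band]["mean"], reverse=True)
unstable = [band for band in ranked if summary[band]["cv"] > 0.5]

print("ranking:", ranked)
print("unstable across folds:", unstable)
```

A band such as the hypothetical A17 ranks second by mean importance yet would be flagged as unstable, which is the kind of variable that calls for cautious interpretation.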

4.2. Managing Forests Using Spatial Biomass Predictions

This study demonstrates a novel embedding-based modeling framework for estimating forest biomass, which departs methodologically from traditional approaches that rely on optical imagery, synthetic aperture radar, LiDAR data, or their combinations [17,22]. In the resulting biomass map, areas characterized by persistently high biomass are primarily associated with intact, continuous forest cover and minimal anthropogenic disturbance, whereas reduced biomass values are largely observed in fragmented forests and forests at the urban edge. This pattern is consistent with the influence of human activities on forest development. For example, global analyses have shown that edge forests contain, on average, approximately 16% less biomass than forest interiors [47].

The resulting biomass maps across Yunhe Forestry Station further reveal localized increases in biomass in intact and continuous forest regions between 2017 and 2024, alongside declines concentrated in forests adjacent to urban margins and in areas with greater accessibility (Figure 5). Such spatially explicit information provides a valuable basis for tracking forest conditions over time and identifying areas vulnerable to biomass loss or suitable for recovery. From a management perspective, conservation strategies in intact forests with high biomass should prioritize maintaining existing protection and minimizing disturbance. In contrast, forests located near urban areas require enhanced management interventions, including prohibiting illegal logging and unauthorized changes in land type, to mitigate further biomass degradation.
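
A change map of the kind described above can be derived by differencing two prediction grids and labeling each pixel. The following is a minimal sketch under stated assumptions: the 3 × 3 grids, the function name `classify_change`, and the 5 t/ha threshold are all hypothetical and do not come from the study.

```python
# Hypothetical 3 x 3 biomass grids (t/ha); not values from the study's maps.
biomass_2017 = [
    [120.0, 118.0, 60.0],
    [115.0, 110.0, 55.0],
    [90.0, 70.0, 40.0],
]
biomass_2024 = [
    [131.0, 125.0, 48.0],
    [121.0, 112.0, 41.0],
    [92.0, 61.0, 30.0],
]

THRESHOLD = 5.0  # t/ha; arbitrary cut-off for calling a change meaningful


def classify_change(before, after, threshold=THRESHOLD):
    """Label each pixel 'gain', 'loss', or 'stable' from the biomass difference."""
    labels = []
    for row_before, row_after in zip(before, after):
        row = []
        for b, a in zip(row_before, row_after):
            delta = a - b
            if delta > threshold:
                row.append("gain")
            elif delta < -threshold:
                row.append("loss")
            else:
                row.append("stable")
        labels.append(row)
    return labels


change = classify_change(biomass_2017, biomass_2024)
counts = {
    label: sum(row.count(label) for row in change)
    for label in ("gain", "loss", "stable")
}
print(counts)
```

In practice the threshold would need to reflect prediction uncertainty. The cross-validated RMSE values in Table 3 (roughly 21 to 51 t/ha) suggest that small per-pixel differences should be read with caution.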

4.3. Limitations

Although this study demonstrates the potential of embedding-based models for estimating forest biomass and derives management-relevant suggestions from spatial prediction maps, several limitations should be acknowledged. First, while embedding-based approaches allow rapid and scalable biomass estimation and support map production, the ecological interpretation of individual embedding dimensions remains challenging. Because embedding features represent abstract combinations of multi-sensor information, it is difficult to link specific predictors directly to the ecological processes or structural attributes that control biomass variation. This limitation underscores the potential benefit of integrating embedding-based predictors with more interpretable remote sensing variables to improve both predictive accuracy and ecological understanding, and of explicitly evaluating model performance when embedding variables are excluded. Second, the temporal scope of the available embedding dataset constrained the detection of biomass dynamics. Over the relatively short interval between 2017 and 2024, biomass changed little across most forested areas, which restricts the ability to assess long-term trends or disturbance–recovery processes. In addition, some temporal mismatch between field data and embeddings may still introduce uncertainty, especially in planted forests. Third, the spatial distribution of predicted biomass reveals differences between intact and fragmented forests, as well as between urban edge areas and forest interiors (Figure 5). These patterns suggest that a single global model may not fully capture fine-scale spatial heterogeneity in biomass distribution. Future studies could benefit from incorporating spatially adaptive modeling approaches (e.g., Geographical Random Forest or Geographically Weighted Regression), as well as explicitly accounting for edge effects and forest integrity, to better represent localized biomass variability.
Finally, the analysis focused on three dominant forest categories, whereas a broader range of forest types and plantation forests were not included in our models. To expand model applications and improve the generalizability of the findings, future research should therefore incorporate larger and more diverse field datasets and integrate multi-source remote sensing observations to enhance model stability, predictive performance, and applicability across heterogeneous forest landscapes.

5. Conclusions

This study presents an application of satellite-derived embedding datasets to forest biomass modeling. In addition, four machine learning algorithms (RF, SVR, MLPNN, and GPR) were systematically evaluated and compared. Based on the findings, the following conclusions are drawn:

Embedding datasets derived from Google AlphaEarth Foundations can effectively predict forest biomass, demonstrating performance comparable to that reported in studies using conventional optical, SAR, or LiDAR data.
The spatially explicit biomass maps generated in this study provide valuable information for forest monitoring and management, supporting large-scale decision-making, particularly for identifying biomass patterns in near-urban forests and conserving high-biomass, continuous forests.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/rs18030436/s1.

Author Contributions

Conceptualization, C.J., C.W. and J.J.; Methodology, C.J.; Validation, C.J. and J.J.; Formal Analysis, C.J. and X.J.; Investigation, X.J. and L.W.; Resource, L.W.; Data Curation, X.X.; Writing—Original Draft Preparation, C.J. and J.J.; Writing—Review and Editing, C.J.; Visualization, J.J.; Supervision, X.X. and C.W.; Funding Acquisition, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Research Institute Support Project of Zhejiang Province (2026F1065-2-2).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

We thank the research team for their assistance with data acquisition.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Pan, Y.; Birdsey, R.A.; Fang, J.; Houghton, R.; Kauppi, P.E.; Kurz, W.A.; Phillips, O.L.; Shvidenko, A.; Lewis, S.L.; Canadell, J.G.; et al. A Large and Persistent Carbon Sink in the World’s Forests. Science 2011, 333, 988–993.
2. Houghton, R.A. Aboveground Forest Biomass and the Global Carbon Balance. Glob. Change Biol. 2005, 11, 945–958.
3. Vashum, K.T.; Jayakumar, S. Methods to estimate above-ground biomass and carbon stock in natural forests—A review. J. Ecosyst. Ecography 2012, 2, 1–7.
4. Pan, Y.; Birdsey, R.A.; Phillips, O.L.; Jackson, R.B. The Structure, Distribution, and Biomass of the World’s Forests. Annu. Rev. Ecol. Evol. Syst. 2013, 44, 593–622.
5. Zhao, F.; Guo, Q.; Kelly, M. Allometric equation choice impacts lidar-based forest biomass estimates: A case study from the Sierra National Forest, CA. Agric. For. Meteorol. 2012, 165, 64–72.
6. Abdul-Hamid, H.; Mohamad-Ismail, F.-N.; Mohamed, J.; Samdin, Z.; Abiri, R.; Tuan-Ibrahim, T.-M.; Mohammad, L.-S.; Jalil, A.-M.; Naji, H.-R. Allometric Equation for Aboveground Biomass Estimation of Mixed Mature Mangrove Forest. Forests 2022, 13, 325.
7. Viana, H.; Aranha, J.; Lopes, D.; Cohen, W.B. Estimation of crown biomass of Pinus pinaster stands and shrubland above-ground biomass using forest inventory data, remotely sensed imagery and spatial prediction models. Ecol. Model. 2012, 226, 22–35.
8. Yan, X.; Li, J.; Smith, A.R.; Yang, D.; Ma, T.; Su, Y.; Shao, J. Evaluation of machine learning methods and multi-source remote sensing data combinations to construct forest above-ground biomass models. Int. J. Digit. Earth 2023, 16, 4471–4491.
9. Su, Y.; Guo, Q.; Jin, S.; Guan, H.; Sun, X.; Ma, Q.; Hu, T.; Wang, R.; Li, Y. The Development and Evaluation of a Backpack LiDAR System for Accurate and Efficient Forest Inventory. IEEE Geosci. Remote Sens. Lett. 2021, 18, 1660–1664.
10. Brovkina, O.; Novotny, J.; Cienciala, E.; Zemek, F.; Russ, R. Mapping forest aboveground biomass using airborne hyperspectral and LiDAR data in the mountainous conditions of Central Europe. Ecol. Eng. 2017, 100, 219–230.
11. Fahey, C.; Choi, D.; Wang, J.; Domke, G.M.; Edwards, J.D.; Fei, S.; Kivlin, S.N.; LaRue, E.A.; McCormick, M.K.; McShea, W.J.; et al. Canopy complexity drives positive effects of tree diversity on productivity in two tree diversity experiments. Ecology 2025, 106, e4500.
12. Pelletier, F.; Cardille, J.A.; Wulder, M.A.; White, J.C.; Hermosilla, T. Inter- and intra-year forest change detection and monitoring of aboveground biomass dynamics using Sentinel-2 and Landsat. Remote Sens. Environ. 2024, 301, 113931.
13. Hyde, P.; Nelson, R.; Kimes, D.; Levine, E. Exploring LiDAR–RaDAR synergy—Predicting aboveground biomass in a southwestern ponderosa pine forest using LiDAR, SAR and InSAR. Remote Sens. Environ. 2007, 106, 28–38.
14. Foody, G.M.; Boyd, D.S.; Cutler, M.E.J. Predictive relations of tropical forest biomass from Landsat TM data and their transferability between regions. Remote Sens. Environ. 2003, 85, 463–474.
15. Ghasemi, N.; Sahebi, M.R.; Mohammadzadeh, A. Biomass Estimation of a Temperate Deciduous Forest Using Wavelet Analysis. IEEE Trans. Geosci. Remote Sens. 2013, 51, 765–776.
16. Vafaei, S.; Soosani, J.; Adeli, K.; Fadaei, H.; Naghavi, H.; Pham, T.D.; Tien Bui, D. Improving Accuracy Estimation of Forest Aboveground Biomass Based on Incorporation of ALOS-2 PALSAR-2 and Sentinel-2A Imagery and Machine Learning: A Case Study of the Hyrcanian Forest Area (Iran). Remote Sens. 2018, 10, 172.
17. Duncanson, L.; Neuenschwander, A.; Hancock, S.; Thomas, N.; Fatoyinbo, T.; Simard, M.; Silva, C.A.; Armston, J.; Luthcke, S.B.; Hofton, M.; et al. Biomass estimation from simulated GEDI, ICESat-2 and NISAR across environmental gradients in Sonoma County, California. Remote Sens. Environ. 2020, 242, 111779.
18. Qin, S.; Wang, H.; Rogers, C.; Bermúdez, J.; Lourenço, R.B.; Zhang, J.; Li, X.; Chau, J.; Tompalski, P.; Gonsamo, A. Aboveground biomass mapping of Canada with SAR and optical satellite observations aided by active learning. ISPRS J. Photogramm. Remote Sens. 2025, 226, 204–220.
19. Brown, C.; Kazmierski, M.; Pasquarella, V.; Rucklidge, W.; Samsikova, M.; Zhang, C.; Shelhamer, E.; Lahera, E.; Wiles, O.; Ilyushchenko, S. AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data. arXiv 2025, arXiv:2507.22291.
20. Seydi, S.T. Deep Learning-Based Burned Area Mapping Using Bi-Temporal Siamese Networks and AlphaEarth Foundation Datasets. arXiv 2025, arXiv:2509.07852.
21. Alvarez, C.I.; Ulloa Vaca, C.A.; Echeverria Llumipanta, N.A. Machine Learning for Urban Air Quality Prediction Using Google AlphaEarth Foundations Satellite Embeddings: A Case Study of Quito, Ecuador. Remote Sens. 2025, 17, 3472.
22. Ehlers, D.; Wang, C.; Coulston, J.; Zhang, Y.; Pavelsky, T.; Frankenberg, E.; Woodcock, C.; Song, C. Mapping Forest Aboveground Biomass Using Multisource Remotely Sensed Data. Remote Sens. 2022, 14, 1115.
23. Huang, H.; Liu, C.; Wang, X.; Zhou, X.; Gong, P. Integration of multi-resource remotely sensed data and allometric models for forest aboveground biomass estimation in China. Remote Sens. Environ. 2019, 221, 225–234.
24. Liu, J.; Coomes, D.A.; Gibson, L.; Hu, G.; Liu, J.; Luo, Y.; Wu, C.; Yu, M. Forest fragmentation in China and its effect on biodiversity. Biol. Rev. 2019, 94, 1636–1657.
25. Yuan, W.G.; Jiang, B.; Ge, Y.J.; Zhu, J.R.; Shen, A.H. Study on Biomass Model of Key Ecological Forest in Zhejiang Province. J. Zhejiang For. Sci. Technol. 2009, 29, 1–5.
26. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
27. Cutler, A.; Cutler, D.R.; Stevens, J.R. Random Forests. In Ensemble Machine Learning: Methods and Applications; Zhang, C., Ma, Y., Eds.; Springer: New York, NY, USA, 2012; pp. 157–175.
28. Liaw, A.; Wiener, M. Classification and Regression by randomForest. R News 2002, 2, 18–22.
29. Chaofan, W.; Huanhuan, S.; Aihua, S.; Jinsong, D.; Muye, G.; Jinxia, Z.; Hongwei, X.; Ke, W. Comparison of machine-learning methods for above-ground biomass estimation based on Landsat imagery. J. Appl. Remote Sens. 2016, 10, 035010.
30. Karatzoglou, A.; Smola, A.; Hornik, K.; Zeileis, A. kernlab—An S4 Package for Kernel Methods in R. J. Stat. Softw. 2004, 11, 1–20.
31. Sharifi, A.; Amini, J.; Tateishi, R. Estimation of Forest Biomass Using Multivariate Relevance Vector Regression. Photogramm. Eng. Remote Sens. 2016, 82, 41–49.
32. Zhang, L.; Zhang, X.; Shao, Z.; Jiang, W.; Gao, H. Integrating Sentinel-1 and 2 with LiDAR data to estimate aboveground biomass of subtropical forests in northeast Guangdong, China. Int. J. Digit. Earth 2023, 16, 158–182.
33. Bergmeir, C.; Benítez, J.M. Neural Networks in R Using the Stuttgart Neural Network Simulator: RSNNS. J. Stat. Softw. 2012, 46, 1–26.
34. Xie, R.; Darvishzadeh, R.; Skidmore, A.K.; Heurich, M.; Holzwarth, S.; Gara, T.W.; Reusen, I. Mapping leaf area index in a mixed temperate forest using Fenix airborne hyperspectral data and Gaussian processes regression. Int. J. Appl. Earth Obs. Geoinf. 2021, 95, 102242.
35. Abebe, G.; Tadesse, T.; Gessesse, B. Estimating Leaf Area Index and biomass of sugarcane based on Gaussian process regression using Landsat 8 and Sentinel 1A observations. Int. J. Image Data Fusion 2023, 14, 58–88.
36. Ploton, P.; Mortier, F.; Réjou-Méchain, M.; Barbier, N.; Picard, N.; Rossi, V.; Dormann, C.; Cornu, G.; Viennois, G.; Bayol, N.; et al. Spatial validation reveals poor predictive performance of large-scale ecological mapping models. Nat. Commun. 2020, 11, 4540.
37. Meyer, H.; Reudenbach, C.; Wöllauer, S.; Nauss, T. Importance of spatial predictor variable selection in machine learning applications—Moving from data reproduction to spatial prediction. Ecol. Model. 2019, 411, 108815.
38. Kuhn, M. Building Predictive Models in R Using the caret Package. J. Stat. Softw. 2008, 28, 1–26.
39. Lundberg, S.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
40. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2025.
41. Jian, K.; Lu, D.; Li, G. Modeling Forest Carbon Stock Based on Sample Plots and UAV Lidar Data from Multiple Sites and Examining Its Vertical Characteristics in Wuyishan National Park. Remote Sens. 2025, 17, 377.
42. Muro, J.; Linstädter, A.; Magdon, P.; Wöllauer, S.; Männer, F.A.; Schwarz, L.-M.; Ghazaryan, G.; Schultz, J.; Malenovský, Z.; Dubovyk, O. Predicting plant biomass and species richness in temperate grasslands across regions, time, and land management with remote sensing and deep learning. Remote Sens. Environ. 2022, 282, 113262.
43. Schmit, J.P.; Johnson, L.R.; Baker, M.; Darling, L.; Fahey, R.; Locke, D.H.; Morzillo, A.T.; Sonti, N.F.; Trammell, T.L.E.; Aronson, M.F.J.; et al. The influence of urban and agricultural landscape contexts on forest diversity and structure across ecoregions. Ecosphere 2025, 16, e70188.
44. Pretzsch, H. Canopy space filling and tree crown morphology in mixed-species stands compared with monocultures. For. Ecol. Manag. 2014, 327, 251–264.
45. Li, C.; Li, M.; Iizuka, K.; Liu, J.; Chen, K.; Li, Y. Effects of Forest Canopy Structure on Forest Aboveground Biomass Estimation Using Landsat Imagery. IEEE Access 2021, 9, 5285–5295.
46. Yang, Q.; Su, Y.; Hu, T.; Jin, S.; Liu, X.; Niu, C.; Liu, Z.; Kelly, M.; Wei, J.; Guo, Q. Allometry-based estimation of forest aboveground biomass combining LiDAR canopy height attributes and optical spectral indexes. For. Ecosyst. 2022, 9, 100059.
47. Yang, G.; Crowther, T.W.; Lauber, T.; Zohner, C.M.; Smith, G.R. A globally consistent negative effect of edge on aboveground forest biomass. Nat. Ecol. Evol. 2025, 9, 2036–2045.

Figure 1. Location of Yunhe Forestry Station in Lishui City, Zhejiang Province, China. Yunhe Forestry Station comprises three regions (site 1, site 2, and site 3), which are shown in the right panel.

Figure 2. One-to-one density plots of predicted vs. in situ measured values for the best-performing biomass models with (A) GPR in all forests, (B) SVR in broad-leaved forests, (C) SVR in mixed forests, and (D) RF in coniferous forests. The dashed line indicates the 1:1 reference line. Scatter plots show the pooled results of the 5 folds, with warmer colors indicating higher point density. GPR indicates Gaussian Process Regression; SVR indicates Support Vector Regression; RF indicates Random Forest.

Figure 3. SHAP values for the best-performing models based on the embedding dataset in (A) all forests, (B) broad-leaved forests, (C) mixed forests, and (D) coniferous forests. The horizontal axis shows the mean SHAP value for each embedding band (A00–A63). Error bars represent the standard deviation of SHAP values across the repeated 5-fold cross-validation folds, reflecting the variability of feature contributions. A detailed description of all embedding variables is provided in Table S1. For clarity, only the ten most influential variables with low variation across cross-validation folds are shown. The complete ranking of all embedding variables is provided in Figures S1–S4.

Figure 4. Training-set fit between predicted and in situ measured biomass for the best-performing models: (A) all forests, (B) broad-leaved forests, (C) mixed forests, and (D) coniferous forests. The dashed 1:1 line is shown for reference and indicates perfect agreement between predicted and in situ measured biomass. R², RMSE, and MAE are shown in the upper-left corner. These are goodness-of-fit metrics calculated on the full training data and do not represent independent predictive performance.

Figure 5. Spatial distribution maps of predicted biomass at 20 m resolution for Yunhe Forestry Station in 2017 (left column: (A,C,E)) and 2024 (right column: (B,D,F)), using the best-performing model.

Table 1. Allometric growth equations for different tree species groups.

Species Group	Stem Biomass Equation	Foliage Biomass Equation	Root Biomass Equation	Reference
Pine	$B_{stem} = 0.0600H^{0.7934}D^{1.8005}$	$B_{foliage} = 0.1377D^{1.4872}L^{0.4052}$	$B_{root} = 0.0417H^{-0.0780}D^{2.2618}$	[25]
Fir	$B_{stem} = 0.0647H^{0.8959}D^{1.4880}$	$B_{foliage} = 0.0971D^{1.7814}L^{0.0346}$	$B_{root} = 0.0617H^{-0.10374}D^{2.115}$	[25]
Hard-wood and broad-leaves 1	$B_{stem} = 0.0560H^{0.8099}D^{1.8140}$	$B_{foliage} = 0.0980D^{1.6481}L^{0.4610}$	$B_{root} = 0.0549H^{0.1068}D^{2.0953}$	[25]
Hard-wood and broad-leaves 2	$B_{stem} = 0.0803H^{0.7815}D^{1.8056}$	$B_{foliage} = 0.2860D^{1.0968}L^{0.945}$	$B_{root} = 0.2470H^{0.1745}D^{1.7954}$	[25]
Soft-wood and broad-leaves	$B_{stem} = 0.0444H^{0.7197}D^{1.7095}$	$B_{foliage} = 0.0856D^{1.22657}L^{0.397}$	$B_{root} = 0.0459H^{0.1067}D^{2.0247}$	[25]
Moso bamboos	$B_{stem} = 0.0398H^{0.5778}D^{1.8540}$	$B_{foliage} = 0.280D^{0.8357}L^{0.2740}$	$B_{root} = 0.371H^{0.1357}D^{0.9817}$	[25]

D indicates tree diameter at breast height (cm); H indicates tree height; L indicates height under branch; $B_{stem}$ accounts for the biomass of the main stem (excluding bark); $B_{foliage}$ accounts for the biomass of foliage; $B_{root}$ accounts for the biomass of roots. All biomass values are in kg.
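
As a worked illustration of how the Table 1 equations might be applied, the sketch below evaluates the pine equations for a few trees and scales the result to t/ha. Several elements are assumptions made for illustration: H and L are taken to be in metres, total tree biomass is taken as the sum of the three tabulated components, and the tree measurements and the 400 m² plot area are hypothetical.

```python
def pine_biomass_kg(d_cm, height, branch_height):
    """Component biomass (kg) of one pine tree from the Table 1 pine equations."""
    return {
        "stem": 0.0600 * height ** 0.7934 * d_cm ** 1.8005,
        "foliage": 0.1377 * d_cm ** 1.4872 * branch_height ** 0.4052,
        "root": 0.0417 * height ** (-0.0780) * d_cm ** 2.2618,
    }


def plot_biomass_t_per_ha(trees, plot_area_m2):
    """Sum tree biomass over a plot and convert kg per plot to t/ha."""
    total_kg = sum(sum(pine_biomass_kg(*tree).values()) for tree in trees)
    return (total_kg / 1000.0) / (plot_area_m2 / 10000.0)


# Hypothetical measurements: (D in cm, H in m, L in m) for three pines.
trees = [(18.0, 12.0, 6.0), (22.5, 14.5, 7.0), (30.0, 17.0, 9.5)]
plot_total = plot_biomass_t_per_ha(trees, plot_area_m2=400.0)
print(f"Plot biomass: {plot_total:.1f} t/ha")
```

The same pattern extends to the other species groups by swapping in the corresponding coefficients from Table 1.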

Table 2. Statistics of biomass of different forest types.

Forest Type	Number of Samples	Range (t/ha)	Mean (t/ha)	Standard Deviation (t/ha)
Broad-leaved forest	26	40–125.8	84.4	23.03
Mixed forest	38	44.8–189.8	106.7	47.52
Coniferous forest	25	17.2–163.9	76.7	41.38

Table 3. Accuracy metrics of machine learning models based on embedding datasets in all forests, broad-leaved forests, mixed forests, and coniferous forests: cross-validated root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R²). Each model was run 10 times, and the results are expressed as the mean and standard deviation of each accuracy metric. The best performance for each machine learning model is highlighted in bold. GPR indicates Gaussian Process Regression; SVR indicates Support Vector Regression; RF indicates Random Forest; MLPNN indicates Multi-Layer Perceptron Neural Networks.

Forest Type	Machine Learning Method	RMSE (t/ha) (mean ± SD)	MAE (t/ha) (mean ± SD)	R² (mean ± SD)
All forests	GPR	41.90 ± 4.01	34.59 ± 2.94	0.06 ± 0.07
	SVR	42.48 ± 4.19	34.34 ± 2.95	0.06 ± 0.06
	RF	42.53 ± 4.15	34.87 ± 3.20	0.06 ± 0.09
	MLPNN	45.02 ± 9.42	36.49 ± 8.50	0.06 ± 0.08
Broad-leaved forest	SVR	20.87 ± 5.45	16.61 ± 4.27	0.33 ± 0.27
	GPR	21.46 ± 5.63	16.93 ± 4.60	0.32 ± 0.27
	RF	22.41 ± 5.24	17.85 ± 4.36	0.24 ± 0.26
	MLPNN	24.76 ± 6.53	20.38 ± 5.96	0.26 ± 0.24
Mixed forest	SVR	47.40 ± 5.37	40.58 ± 5.41	0.13 ± 0.16
	RF	48.75 ± 5.58	42.01 ± 5.88	0.13 ± 0.16
	GPR	49.05 ± 5.84	42.50 ± 6.28	0.13 ± 0.18
	MLPNN	51.00 ± 11.86	43.85 ± 11.23	0.14 ± 0.15
Coniferous forest	RF	34.27 ± 7.34	29.79 ± 6.73	0.48 ± 0.25
	SVR	35.53 ± 9.67	29.93 ± 8.06	0.37 ± 0.26
	GPR	35.73 ± 8.38	30.82 ± 7.38	0.42 ± 0.29
	MLPNN	45.61 ± 9.02	37.97 ± 7.55	0.46 ± 0.22
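
For readers who want to reproduce the three accuracy metrics reported in Table 3, a minimal sketch follows. The observed and predicted values are hypothetical. Two R² definitions are shown because toolkits differ in which one they report, and this section does not state which one the study used.

```python
from math import sqrt


def rmse(obs, pred):
    """Root mean square error."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def r2_traditional(obs, pred):
    """Coefficient of determination as 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def r2_squared_correlation(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    mean_obs = sum(obs) / len(obs)
    mean_pred = sum(pred) / len(pred)
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_pred = sum((p - mean_pred) ** 2 for p in pred)
    return (cov / sqrt(var_obs * var_pred)) ** 2


# Hypothetical plot biomass (t/ha): predictions compressed toward the mean.
observed = [40.0, 60.0, 80.0, 100.0, 120.0]
predicted = [70.0, 75.0, 80.0, 85.0, 90.0]

print(f"RMSE = {rmse(observed, predicted):.2f} t/ha")
print(f"MAE  = {mae(observed, predicted):.2f} t/ha")
print(f"R2 (1 - SS_res/SS_tot)   = {r2_traditional(observed, predicted):.4f}")
print(f"R2 (squared correlation) = {r2_squared_correlation(observed, predicted):.4f}")
```

In this example the predictions are perfectly correlated with the observations but compressed toward the mean, so the squared correlation equals 1 while 1 − SS_res/SS_tot is much lower. The two definitions should therefore not be compared across studies without checking which one was reported.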
