Estimating the Vertical Distribution of Biomass in Subtropical Tree Species Using an Integrated Random Forest and Least Squares Machine Learning Model

Abstract: Accurate quantification of forest biomass (FB) is key to assessing the carbon budget of terrestrial ecosystems. Using remote sensing inversion techniques to estimate FB has recently become a research trend. However, the limitations of vertical-scale analysis methods and the nonlinear distribution of forest biomass stratification have led to significant uncertainties in FB estimation. In this study, the biomass characteristics of forest vertical stratification were considered, and the FB prediction potential was improved by integrating random forest and least squares (RF-LS) models. The results indicated that, compared with traditional biomass estimation methods, the overall R² of FB retrieval increased by 12.01%, and the root mean square error (RMSE) decreased by 7.50 Mg·hm⁻². The RF-LS model we established exhibited better performance in FB inversion and simulation assessments. Forest canopy height, soil organic matter content, and the red-edge chlorophyll vegetation index had greater impacts on FB estimation. These indexes could be the focus of consideration in FB estimation using the integrated RF-LS model. Overall, this study provides an optimization method to map and evaluate FB by fine stratification of above-ground forest and reveals important indicators for FB inversion and the applicability of the RF-LS model. The results could be used as a reference for the accurate inversion of subtropical forest biomass parameters and estimation of carbon storage.


Introduction
Estimations of forest biomass (FB) have become crucial indicators for revealing the productivity of forest ecosystems, vegetation nutrient allocation, and terrestrial carbon storage potential [1]. Improving the estimation accuracy of FB is a critical problem in quantifying the carbon cycle of terrestrial biomes [2,3]. In recent years, many regression models have been developed to estimate regional FB [4,5]. Although these models predict biomass accurately from the horizontal perspective, they are limited when considering the spatial pattern of FB at the vertical scale. FB comprises multiple parts, such as trunks (WT), leaves (WL), branches (WB), and roots (WR), but models in previous studies tended to ignore this vertical-scale division of trees. Thus, dividing biomass into different parts according to the vertical scale of trees can improve the scale and accuracy of FB estimation. Although the use of remote sensing technology has significantly improved the accuracy of FB estimation, there are still some uncertainties at the local scale [6]. This deficiency is mainly due to the diversity of forest types at large spatial scales (e.g., tree species and site conditions). Models trained with limited ground observation data can easily overfit and fail to capture local features [7,8]. Thus, the scientific integration of machine learning algorithms and high-density forest survey databases is the key to biomass inversion.
Forests 2024, 15, 992
Subtropical forest areas are among the regions with high uncertainty in global FB estimates [9]. Recently, regional FB distribution characteristics have been assessed at different spatial scales and resolutions using remote sensing techniques, field observation data, and various empirical modeling methods [10]. However, the methods used in these studies still have some deficiencies in FB assessment. First, the principle of FB assessment by ecological models such as the InVEST Carbon module is to assign the empirical values of carbon pools based on forest types. The model provides a comprehensive evaluation of terrestrial ecosystems, but this type of model cannot capture the features of local tree species [11]. Second, the average FB of subtropical regions in China varied among the studies, hindering our understanding of the role of subtropical forests in the global carbon cycle [12]. Finally, the number of tree species included in the regression was insufficient, and the modeling process lacked objective screening of variables for model construction.
Previous studies have shown that the RF model has the same objective index screening and classification performance as the SVM model [13], and the RF machine learning algorithm has excellent regression accuracy and prediction stability [14]. However, because few studies have integrated the RF model with other algorithms for biomass data training and regression, the feasibility of an integrated RF model for biomass prediction still needs to be further explored. In addition, recent studies have focused on national or global FB evaluation from a horizontal perspective and have, to a certain extent, neglected accurate estimation at the vertical scale [15]. Moreover, fitting the coefficients of the allometric growth equation based on field surveys remains challenging. Strengthening the links with remote sensing inversion and machine learning methods to reduce the workload has become essential in recent studies [16]. Therefore, we propose to answer the following questions: (1) Can integrated remote sensing technology and machine learning algorithms accurately retrieve vertical-scale FB data? (2) Is it feasible to integrate the RF-LS model to predict vertical-scale FB and optimize the coefficients for the allometric growth equation?
In this study, based on multisource remote sensing data and forest field survey data, we developed an RF-LS machine learning model to improve FB estimation scales and optimize the a and b regression coefficients of allometric equations for subtropical forests in China. The objectives of this study were as follows: (1) to collect and calculate multivariate remote sensing data and ground observation data as indicators of the FB inversion process; (2) to use the RF model to screen the importance of FB inversion indexes at four vertical scales (trunk, branch, leaf, and root), conduct multiple regression models, and verify the model's accuracy to construct four vertical-scale estimation models to comprehensively evaluate the FB of the whole region; and (3) to introduce least squares (LS) to integrate the RF-LS algorithm. Based on the field survey data for diameter at breast height (D) and tree height (H), the a and b coefficients of the allometric growth equation were fitted to the trunk, branches, leaves, and roots for the dominant tree species (DTS) through 1000 optimization iterations. This study can provide theoretical guidance and a scientific reference for identifying important indicators and accurately evaluating forest biomass.

Study Area
Our study area (Taojiang County) is located in the north-central region of Hunan Province in China (Figure 1). The region is rich in subtropical forest resources, has a leading position in the county in terms of ecological economics, and ranks in the top ten in ecological forestry in Hunan. Its forest ecological benefits have received widespread attention and have been researched over the past ten years (China's "12th Five-Year Plan" and "13th Five-Year Plan", 2011-2021).
Figure 1. See Table 1 for the names of the DTS corresponding to the codes.

Table 1. Dominant tree species (DTS) and their features in the study area.

ID: MP. DTS: Masson pine. Features: The trunk of MP is straight; the branches spread flat or obliquely; the crown is a broad tower or umbrella; the bark is dark brown and flaky, containing resin and water, and is humidity resistant. It is the leading timber tree species in South China, with high economic value. D = 2~28 cm; n = 1096; a = 1834.6 hm².

ID: CF. DTS: Chinese fir. Features: CF is a kind of evergreen tree with a straight trunk. The tree crown is conical, and the bark is greyish brown. The branches are flat and spreading. It mainly grows in South and East China. It is a unique tree species in China and a national first-class protected plant.

ID: EP. DTS: Euramerican poplar. Features: EP is an evergreen, deciduous, fast-growing tree with high-quality wood. It has a tall tree body and the trunk is straight. The crown is narrow, and the branch angle is slight with delicate collateral branching. The leaves are small, dense, and full-crested.

ID: MQ. DTS: Metasequoia. Features: MQ is a deciduous tree with a straight and tall trunk. The branches are drooping, brown, or brownish-grey. The surface of the branches is smooth, and the crown is steeple-shaped. It is mainly distributed in parts of South China, East China, and North China.

ID: SF. DTS: Shrubs ferns. Features: SF refers to a short, densely clustered tree, not more than 6 m tall, without an obvious trunk, generally broad-leaved, but some conifers are shrubs. D = 1~10 cm; n = 13; a = 25.5 hm².

Note: DTS, dominant tree species; ID, abbreviations of tree species; D, tree diameter (cm); n, number of sample points; a, area of sample points.
The total area of the study area is 2068 km² (28 and the region is characterized by a typical humid subtropical monsoon climate [17]. The topography is complex and diverse, with mostly hills in the northwest and east and plains and valleys in the middle (altitude from −108 m to 1594 m), and is mainly composed of red soil (76.34%) and yellow soil (1.18%). The forest cover rate in the study area was 62.98%, with 14 main tree species. During the "12th Five-Year Plan" period, the total amount of bamboo in the region ranked third in China and first in Hunan Province (with an area of 77,095.5 hm² and a total of 219 million trees).

Field Data Collection
Forest field survey data (FFSD) were obtained from the Second Class of Forest Resources Survey Project in the "12th Five-Year Plan" of China from 2011 to 2015. This project is a field survey carried out by the Hunan Provincial government according to the technical regulations of forest resource surveys (GB/T26424-2010, Survey and Planning Institute of State Forestry Administration, Beijing, China, 2011). The work was completed in December 2013, and a 1:100,000-1:10,000 forest cover map of the province was established. These forest surveys and the database obtained from the survey were based on the county (city, district) as the object, and the systematic sampling method was used to establish fixed plots. In strict accordance with the technical regulations, the forest resource indexes were inventoried. All survey data were released at the end of 2014 in the forest resources monitoring system via the county-level website (http://www.taojiang.gov.cn/, accessed on 6 March 2023).
The process for establishing the plots conformed to the following criteria: (1) The forest map spots were drawn based on satellite images. Circular sample spots within the patches were established, and the size was set to 30 × 30 m (0.09 hm²). A subplot of 5 × 5 m (0.0025 hm²) was arranged at the four corners of the plot for densification. Forest age (FA), diameter at breast height (D), tree height (H), and other indexes were measured in the field.
(2) Identification of representative tree species. In the sample plots, the species (group) with the most significant proportion of stock volume was the DTS (group). For young forests that had not reached the starting diameter of D and plots with unformed forest, the tree species (group) with the largest proportion of trees was the DTS (group) in the plot.
(3) Arboreal forests with D ≥ 5 cm and H ≥ 1.3 m were included in the measurement range. The specific dominant tree species and features in the study area are shown in Table 1. A total of 50,482 plots were collected in the study area, and more than 1000 square kilometers of forest are involved. Masson pine (MP), Chinese fir (CF), Euramerican poplar (EP), and Metasequoia (MQ) cover more than two-thirds of the forest area.

Biomass Estimation
Based on the allometric growth equation developed for the national-scale forest ecosystem published by the "13th Five-Year Plan" National Fund project [18-20], we selected biomass equations with high fitting accuracy to calculate the FB of each part of the DTS in the study area. The biomass equation coefficients of the DTS were taken from the relevant references [12,18,19,21-27], and we used this field estimate value as the measured biomass. The field observation plots were divided into two parts, one for training the model (70% of the samples) and the other for validation and evaluation of prediction accuracy (30% of the samples). The measured FB was used as the dependent variable, and the independent variables were obtained from Landsat-8 bands, forest field survey data, DEMs, and meteorological station data. The specific processing procedure is shown in Figure 2.
A total of 50,482 plots were collected in the study area, and 20,538 plots of DTS participated in RF model training and validation, including 1096 Masson pine (MP), 19,176 Chinese fir (CF), 241 Euramerican poplar (EP), and 25 Metasequoia (MQ) plots. The FB by tree species can be calculated by summing the trunk, branch, leaf, and root biomass of all trees in the plot, where aboveground biomass (AGB) equals the sum of the trunk, branch, and leaf parts and belowground biomass (BGB) refers to root biomass. Stratified random sampling was then used to select the 12,017 of 20,538 plots that met the modeling requirements (plots with D < 5 cm and H < 1 m were removed, and outliers of each regression index were eliminated).
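The 70%/30% division of plots into training and validation sets can be illustrated with a minimal sketch. It assumes a stratified random split by dominant tree species; the function name `stratified_split`, the record layout, and the random seed are hypothetical and are not taken from the study.

```python
import random
from collections import defaultdict


def stratified_split(plots, key, train_frac=0.7, seed=42):
    """Split plot records into train/test lists, sampling within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for plot in plots:
        strata[plot[key]].append(plot)

    train, test = [], []
    for members in strata.values():
        members = members[:]          # copy so the caller's data is not reordered
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Hypothetical plot records: 50 Masson pine (MP) and 150 Chinese fir (CF) plots.
plots = [{"id": i, "species": "CF" if i % 4 else "MP"} for i in range(200)]
train, test = stratified_split(plots, key="species")
print(len(train), len(test))
```

Splitting within each species keeps rare species represented in both sets, which matters when plot counts differ by orders of magnitude between species.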
A total of 8411 samples were used for RF model training and equation construction (trunk, branch, leaf, and root), and the remaining 3606 samples were used for the independent validation of model accuracy. The formula for the allometric growth equation is as follows:

$$W_s = a\,(D^{2}H)^{b} \qquad (1)$$

where $W_s$ refers to the biomass per tree (kg·a⁻¹) in the sample plot; $D$ is the average diameter at breast height (cm) of the tree species in the plot; $H$ is the plot's average tree height (m); and $a$ and $b$ are the coefficients of the allometric growth equation.

$$W_t = a\,(D^{2}H)^{b} \times Q \times A \qquad (2)$$

where $W_t$ refers to the total forest biomass (kg) in the sample plot; $D$ is the average diameter at breast height (cm) of the tree species in the plot; $H$ is the plot's average tree height (m); $a$ and $b$ are the coefficients of the allometric growth equation; $Q$ refers to the number of plants per hectare (a·hm⁻²); and $A$ is the plot area (hm²).
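As a worked illustration of the allometric form W = a(D²H)^b and of the aggregation AGB = WT + WB + WL and BGB = WR, the following sketch may help. The coefficient values are hypothetical placeholders, not the published coefficients, and the plot-level scaling by stem density and plot area is an assumed reading of the variable definitions above.

```python
def tree_biomass(a, b, d_cm, h_m):
    """Per-tree biomass W = a * (D^2 * H)^b."""
    return a * (d_cm ** 2 * h_m) ** b


def plot_biomass(a, b, d_cm, h_m, stems_per_ha, area_ha):
    """Plot total, assumed here to be per-tree biomass x stem density x plot area."""
    return tree_biomass(a, b, d_cm, h_m) * stems_per_ha * area_ha


# Hypothetical (a, b) pairs for each tree component; not values from the study.
coeffs = {
    "WT": (0.050, 0.90),  # trunk
    "WB": (0.010, 0.85),  # branches
    "WL": (0.020, 0.70),  # leaves
    "WR": (0.015, 0.88),  # roots
}

d, h = 12.0, 9.0  # plot-average D (cm) and H (m), illustrative only
parts = {name: tree_biomass(a, b, d, h) for name, (a, b) in coeffs.items()}

agb = parts["WT"] + parts["WB"] + parts["WL"]  # aboveground biomass
bgb = parts["WR"]                              # belowground biomass
print(parts, agb, bgb)
```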

Multispectral Indexes Based on Remote Sensing Data
This study used Landsat-8 OLI_TIRS image (30 m resolution) data from the United States Geological Survey (USGS). The images were collected on 11 October 2013, covering the entire boundary of the study area (https://www.usgs.gov/, accessed on 6 March 2023). We used the ENVI 5.3 platform to calculate the spectral vegetation index after radiometric
In addition, we retrieved the remote sensing ecological index (RSEI) based on the Landsat-8 TIRS band 10 (TIRS-1) and atmospheric correction parameter calculator provided on the NASA website (https://atmcorr.gsfc.nasa.gov/, accessed on 6 March 2023). The RSEI is an index based entirely on remote sensing technology that rapidly monitors and evaluates the ecological status of terrestrial ecosystems based on natural factors [28]. The index integrates four evaluation indicators using principal component analysis (PCA), including Vegetation Coverage Index (VCI), Land Surface Temperature (LST), Humidity Index (HI), and Normalized Dryness Building Soil Index (NDBSI).
where PCone refers to the ecological condition index obtained by the weighted combination of PCA eigenvalues after standardizing the four indexes (VCI, LST, HI, and NDBSI). The PCA eigenvalues of the indexes are 0.1642, 0.0820, 0.0342, and 0.0001, respectively.
Another critical dataset is the forest canopy height (FCH) map, which is widely used for FB mapping. We used China's 2019 canopy height data developed by Liu et al. (2022) [29]. The FCH error in the study area was corrected by referring to the global height map estimated by Simard et al. (2011) [30]. Moreover, we collected forest age (FA), living wood growing stock (LWGS), and forest canopy density (FCD) data from field survey data. In addition, digital elevation model (DEM) data with a spatial resolution of 30 m provided by the USGS were used to create three variables (elevation, slope ratio (SlopeR), and slope aspect (SlopeD)). These were combined with field survey data (geomorphic type (GT) and slope position (SlopeP)) for a total of five topographic and geomorphic indicators. All collected data were rechecked for topological errors and converted to the same coordinate system (WGS 84/UTM 49 N).
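The PCA-based integration of the four RSEI indicators described above can be sketched as follows. This is only an illustrative outline: the min-max standardization, the synthetic stand-in arrays, and the function name `first_principal_component` are assumptions, and the exact weighting used in the study is not reproduced here.

```python
import numpy as np


def first_principal_component(indicators):
    """Return min-max scaled PC1 scores and the explained-variance ratios."""
    x = np.asarray(indicators, dtype=float)
    # Standardize every indicator to [0, 1] so that units do not dominate the PCA.
    x = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
    centered = x - x.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    pc1 = centered @ eigvecs[:, 0]
    # The sign of a principal component is arbitrary; check its direction
    # against a known indicator (e.g., vegetation cover) before interpreting it.
    pc1 = (pc1 - pc1.min()) / (pc1.max() - pc1.min())
    return pc1, eigvals / eigvals.sum()


# Synthetic stand-ins for per-pixel VCI, HI, LST, and NDBSI values.
rng = np.random.default_rng(0)
base = rng.random(500)
stack = np.column_stack([
    base + 0.05 * rng.standard_normal(500),        # VCI-like
    base + 0.05 * rng.standard_normal(500),        # HI-like
    1.0 - base + 0.05 * rng.standard_normal(500),  # LST-like
    1.0 - base + 0.05 * rng.standard_normal(500),  # NDBSI-like
])
pc1, ratios = first_principal_component(stack)
print(ratios)
```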

Random Forest Regression and Least Squares Fitting (RF-LS Model)
Random forest (RF) is a more advanced classification and regression tree (CART) method [4], and regression models have been shown to have good predictive performance in importance identification and index clustering [31,32]. Compared with machine learning methods such as stepwise regression and support vector machines (SVMs), the RF model has outstanding advantages in terms of prediction accuracy and stability [33]. Therefore, RF models are suitable for classification and regression problems. During regression, RF generates any number of simple trees used to vote and average their responses to obtain estimates of the importance of the dependent variables. Variable data are randomly sampled via iterative bagging bootstrap sampling to generate a forest of regression trees. The basic principles of the RF model are as follows [14]: where n refers to the node of each classification tree.
We used the "increase in mean squared error" (IncMSE), which is obtained by randomly replacing the values of each predictor. If the predictor is essential, the model prediction error will increase when its value is randomly replaced. Therefore, the larger the IncMSE, the more important the indicator. We used "increase node purity" (IncNP) to measure the residual sum of squares to represent the effect of each indicator on the heterogeneity of the observed data at different nodes in the classification tree. The larger the value is, the more important the corresponding variable is. Of "IncMSE" and "IncNP", one was selected as the ranking index for assessing the importance of different indicators, and the other was used as the accuracy verification index. In addition, we performed five tenfold cross-validations (CVs) and selected metrics based on the CV curve.
The CV method uses different index combinations to verify the accuracy of regression models in multiple groups, which addresses the problem of one-sided test results and insufficient data. In addition, we used the Pearson correlation and significance test to further reveal the relationship between the optimal regression index and the biomass of each tree part. The RF model was run in R-Studio (version 4.1.4) [34], and the Pearson correlation and significance tests were performed using SPSS 25. Table 2 shows the details of the RF modeling groups.
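The importance screening and repeated cross-validation described above can be sketched in Python. The study itself ran the RF model in R, so this is an analogous illustration rather than the authors' code: scikit-learn's permutation importance stands in for IncMSE, impurity-based `feature_importances_` stands in for IncNP, and the synthetic data and parameter values are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import RepeatedKFold, cross_val_score, train_test_split

# Synthetic predictors: column 0 is strongly informative, column 1 weakly,
# columns 2 and 3 are pure noise.
rng = np.random.default_rng(0)
X = rng.random((400, 4))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Permutation importance: how much the error grows when one predictor is shuffled.
perm = permutation_importance(
    rf, X_test, y_test,
    scoring="neg_mean_squared_error", n_repeats=10, random_state=0,
)
ranking = np.argsort(perm.importances_mean)[::-1]

# Impurity-based importance, a rough counterpart of node-purity measures.
impurity = rf.feature_importances_

# Five repetitions of tenfold cross-validation.
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)
cv_r2 = cross_val_score(
    RandomForestRegressor(n_estimators=30, random_state=0), X, y, cv=cv, scoring="r2"
)
print(ranking, impurity.round(3), cv_r2.mean().round(3))
```

Ranking by one importance measure and checking it against the other mirrors the paper's use of one index for ranking and the other for verification.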

Table 2 (excerpt). Indicator groups used in RF modeling.

Ecological indexes (10); data format: Tiff. Ecological indexes reflect the advantages and disadvantages of the regional ecological environment and are calculated based on the image bands and vegetation indexes preprocessed from Landsat-8 images using the principal component analysis and weighted overlay tool of ENVI 5.3 software. We selected a total of 10 indicators, including RSEI, SRRI, BRII, GSTI, HI, SI, IBI, NDBSI, PCone, and VCI. See Formulas (3)-(6).

Soil indexes (2). Soil properties data were obtained from the China Soil Dataset (V1.2) in the World Soil Database, and we selected two soil indexes: soil depth (SoilDEP, cm) and soil organic matter content (SoilOMC, mg/100 g) (https://iiasa.ac.at/models-and-data/harmonized-world-soil-database, accessed on 6 March 2023).

Climate indicators (30); data format: Tiff. Climate indicators were derived from the daily dataset (TXT format) of the China Meteorological Administration. The data from 24 meteorological stations in the study area and its surrounding areas were selected, processed, and interpolated using the inverse distance weighting (IDW) method in R-Studio (4.3) and ArcGIS (10.1) and integrated into three kinds of data: annual average (AA), monthly average (MA), and daily average (DA), including the precipitation (PCP, mm), evaporation (EVP, mm), average temperature (TEM, °C), temperature change (TC, °C), solar radiation (RAD, MJ), solar duration hours (SDH, h), surface temperature (ST, °C), atmospheric pressure (AP, Pa), relative humidity (RH, %), and wind speed (WS, m/s), for a total of 10 indexes (https://www.data.cma.cn/, accessed on 6 March 2023).

Shape/Tiff

Least squares (LS) is a mathematical optimization technique for finding the optimal solution by minimizing the sum of squared errors. The least squares method can also be used for curve fitting, that is, to optimize the coefficients of an existing formula. After completing the RF modeling, we performed fitting optimization of the a and b coefficients of the allometric growth equation $W = a\,(D^{2}H)^{b}$ according to the prediction results. That is, given multiple sets of data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ and the formula $y = a\,x^{b}$, the following optimization problem is solved:

$$\min_{a,\,b} \sum_{i=1}^{n} \left( y_i - a\,x_i^{b} \right)^{2}$$

The LS model was implemented in R-Studio (version 4.1.4) and GraphPad Prism 8. The dependent variable is the prediction result of RF modeling, and the independent variable is the product of the squared tree diameter at breast height ($D^{2}$) and tree height ($H$).
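The least squares fitting of a and b in y = a·x^b, with x = D²H, can be sketched with SciPy. The study used R and GraphPad Prism, so this Python version is an assumed equivalent; the synthetic data, the starting-value strategy, and reading "1000 optimization iterations" as a cap of 1000 function evaluations (`maxfev`) are illustrative choices only.

```python
import numpy as np
from scipy.optimize import curve_fit


def allometric(x, a, b):
    """Allometric power law y = a * x**b, with x = D^2 * H."""
    return a * np.power(x, b)


def fit_allometric(d2h, biomass, max_evals=1000):
    """Nonlinear least squares estimate of (a, b); inputs must be positive."""
    x = np.asarray(d2h, dtype=float)
    y = np.asarray(biomass, dtype=float)
    # A log-log linear fit provides reasonable starting values for the optimizer.
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    p0 = (np.exp(intercept), slope)
    (a, b), _ = curve_fit(allometric, x, y, p0=p0, maxfev=max_evals)
    return a, b


# Synthetic example: biomass generated with a = 0.04, b = 0.9 plus 2% noise.
rng = np.random.default_rng(1)
d2h = rng.uniform(50, 4000, 200)
w = allometric(d2h, 0.04, 0.9) * (1 + 0.02 * rng.standard_normal(200))
a_hat, b_hat = fit_allometric(d2h, w)
print(a_hat, b_hat)
```

Fitting in the original (untransformed) scale minimizes the squared errors of the biomass itself, which matches the objective stated above; a pure log-log regression would instead minimize errors in log biomass.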
We used the coefficient of determination (R²) and the root mean square error (RMSE) to quantify the performance of the model. Moreover, we used eCognition software (version 9.0) to perform multiscale supervised classification on the Landsat-8 images to identify forest land in the study area, reducing the prediction error caused by non-forest land in early regression modeling and subsequent statistical analysis.
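For reference, the two performance measures named above, R² and RMSE, can be computed as in this small sketch. The function names and the sample values are illustrative and not drawn from the study.

```python
import math


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def rmse(measured, predicted):
    """Root mean square error, in the same units as the biomass values."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)


# Illustrative measured and predicted values.
meas = [1.0, 2.0, 3.0]
pred = [1.0, 2.0, 4.0]
print(r_squared(meas, pred), rmse(meas, pred))
```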

Vertical-Scale Forest Biomass Modelling Using the Random Forest (RF) Model
We used feature importance from the Random Forest model to identify the primary indicators capable of predicting biomass.Then, we utilized the training set to assess the accuracy of these models, with the objective of minimizing errors in the prediction models (Figure 3).
As shown in Figure 3A, FCH made the greatest contribution to trunk biomass (WT), which indicated that FCH was the most important index of the WT. SoilOMC also strongly explained the WT among the soil properties. The RSEI was the remote sensing indicator with the most vital interpretation of the WT. The AEVP and MEVP were more explanatory of the WT than were the other climate factors. The importance ranking of these indicators based on "IncNP" was consistent with the ranking based on "IncMSE", in which RECI and WDRVI showed a more substantial interpretation of WT compared to other spectral vegetation indexes. As shown in the mosaic in Figure 3A, the optimal model could be obtained by selecting the top 28-30 indicators. The overall accuracy (R²) reached 87.45%, and the mean square error (MSE) was 15.82 kg·a⁻¹.
Thus, we performed a correlation analysis with the top 28 indicators selected by the RF model (Figure 3B) and found that FCH, RECI, SoilOMC, VARI, GLI, WDRVI, LWGS, RSEI, and NDVI exhibited a strong positive correlation with WT, while RG showed a strong negative correlation. Then, we used the training set (70% of the samples) to assess the prediction accuracy of the model (Figure 3C). The accuracy validation results showed that R² = 0.88 and RMSE = 3.53, which were lower than the average values. The predicted values exhibited a significant correlation with the field estimate values. Therefore, the WT model exhibited good prediction accuracy according to the preliminary assessments.
Similarly, we analyzed the model performance for WB, WL, and WR, and the first 19, 20, and 25 indicators were selected by the RF model, respectively. All the models showed good prediction accuracy.

FB Prediction Model Validation and Its Equation Construction
After optimal regression modeling using the RF model, we obtained four vertical-scale quantity estimation equations for WT, WB, WL, and WR through two fitting methods, linear and nonlinear. Then, we used the test set to validate the accuracy of the prediction models. The test set fitting and ROC curve aim to assess the predictive stability of the biomass estimation models (Figure 4).
Figure 4 shows the accuracy verification and prediction stability evaluation of the forest biomass models built with the random forest method. We aggregated the WT, WB, WL, and WR to evaluate the model prediction accuracy in terms of total biomass. The AUCs of the four models were all above 0.75, and the overall prediction accuracy was 0.81, indicating that the variable prediction accuracies of the WT, WB, WL, and WR models were credible.
Figure 4. Accuracy verification and prediction stability evaluation of forest biomass (FB) models using random forest method and equations for the trunk (WT), branches (WB), leaf (WL), and root (WR) biomasses. We used 30% of the original dataset as the test set (3606 samples) to reevaluate the RF modeling accuracy. We used the ROC curve to evaluate the stability of the model prediction results and selected the sample plots according to different age classes (yr) of different DTS (<20, 20-40, 40-60, >60). The AUC represents the area under the ROC curve and the coordinate axis. Its value is 0.5~1. The closer the AUC is to 1.0, the higher the authenticity of the detection method is. SD refers to the root mean square error. Meas, measured biomass; Pred, predicted biomass.

Optimizing Coefficients Using the Least Squares (LS) Algorithm
After building the estimation models for the vertical scales of FB using the random forest (RF) model, we introduced the least squares (LS) algorithm combined with the diameter at breast height (D) and tree height (H) indexes from the field survey and fitted coefficients a and b of the allometric growth equation (W = a(D²H)^b) of DTS by 1000 optimization iterations. Since the construction of the prediction models was based on DTS with known parameters a and b, and our survey data include 10 DTS with unknown a and b (Figure 5), we applied LS again to optimize the fitting of biomass estimation parameters a and b for these DTS. This aims to achieve the rapid and accurate estimation of trunk, branch, leaf, and root biomass.
As shown in Figure 5, the WT of the broad-leaved trees (SBLT, MBLT, and FBLT) exhibited a significant positive correlation with D²H and its biomass increase was weakly affected by stand age (the greater the age was, the greater the D²H), while the WB, WL, and WR of the broad-leaved tree species remained stable at D²H = 4000. As the tree grew, the biomass increased in the following order: WT > WR ≈ WB > WL. The WT growth of the bamboo group (BG) was similar to that of the broad-leaved tree species, with the order of FB increase as follows: WT > WR > WL ≈ WB. Camellia oleifera Abel (COA) differed from BG and broad-leaved tree species. The growth rate of the WT of COA exhibited a downward trend at D²H = 50, while that of its WB, WL, and WR remained unchanged after D²H = 50, which may be due to fruit growth, which affects biomass distribution at the later stage. Notably, the increase in WB in the fruit tree group (FTG) was inversely proportional to D²H at the earlier stage, possibly because of the influence of fruit growth. In addition, the increase in the WT, WB, WL, and WR of the medicinal tree group (MTG) was inversely proportional to D²H, and the biomass decline rate of each part of the tree in the early stage of growth was relatively significant and became stable after D²H = 20. The FB of the MTG declined significantly in the pre-growth stage, and the reasons for this decline need to be further studied.
Figure 5. Fitting and optimization results of the allometric growth equation coefficients a and b for 10 dominant tree species (excluding the 4 dominant tree species involved in RF modeling), divided into the vertical scales of trunk, branch, leaf, and root. The independent variable was the product of the diameter at breast height (D, cm) squared and the tree height (H, m), and the dependent variable was the biomass of the dominant tree species per plant (kg·a⁻¹). See Table 1 for the names of the dominant trees and the detailed regression parameters.
In conclusion, the integrated RF-LS model can reasonably predict the FB at different vertical scales and fit the coefficients a, b for the allometric growth equations (Table 3).In addition, we found that the greater the DTS is in the plot, the better the FB inversion and fitting in the study area.Therefore, expanding the study area and collecting more groundmeasured data on tree species can help to improve the fitting accuracy of the coefficients for tree species with too few sample points, which can also enhance the applicability and objectivity of the coefficients.

Accuracy of the RF-LS Machine Learning Model for Forest Biomass Evaluation
The results indicated that, compared with traditional biomass estimation methods, the RF-LS model we established performed better in FB inversion and simulation assessment. For example, compared with the stepwise (R² = 0.67~0.82) [35], Leaps-BMA (R² = 0.60~0.62) [5], and Cubist (R² = 0.75) [36] models used in previous studies, the RF-LS model (R² = 0.76~0.93) exhibited better forest biomass (FB) prediction ability. The random forest (RF) model is a more advanced classification and regression tree (CART) method [14] and has been shown to have good predictive performance in identifying important metrics and clusters [13]. Methods such as stepwise regression (stepwise) and support vector machines (SVMs) fit well with small sample sizes when analyzing multiple variables, but their fitting performance decreases as the sample size increases. In this study, the RF model exhibited clear advantages in prediction accuracy and stability [33,37]. Thus, the RF model is more suitable for inversion and regression problems [38]. In addition, ranking the importance of indicators through iterative bagging (bootstrap random sampling and voting) improves the objectivity and scientific rationale of predictor selection.
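The R² and RMSE values used in the comparison above can be computed as in the following sketch. It assumes R² is the coefficient of determination (1 minus the ratio of residual to total sum of squares); the source does not state which R² definition was used, and the measured and predicted values here are hypothetical.

```python
import math

def rmse(measured, predicted):
    """Root mean square error between measured and predicted biomass."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)

def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_m = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

# Hypothetical measured and predicted biomass values.
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(f"R2 = {r_squared(measured, predicted):.3f}")
print(f"RMSE = {rmse(measured, predicted):.3f}")
```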
Compared with previous studies on FB estimation in subtropical forests [36,39,40], we strengthened the correlation between FB and remote sensing factors. We also improved the diversity and scientific selection of indexes affecting FB inversion by integrating machine learning and remote sensing technology. Moreover, our proposed RF-LS algorithm, which uses linear and nonlinear methods, exhibited higher prediction accuracy and stability, improving the accuracy of FB inversion and regression (Table 4). We compared the biomass evaluation performance of the RF-LS model constructed in this study with that of models from previous studies on the same tree species (aboveground biomass, AGB). The results show that the models we constructed are significantly more accurate. Additionally, we noted that the conversion coefficient between FB and carbon (BCTC) of the InVEST model is 0.43~0.51 [41], while the average BCTC of DTS in our study area is 0.375 (lower than the empirical value of 0.43~0.5) using the model evaluation parameters of this study, which corresponds to the conclusion that general ecological models overestimate carbon storage [42]. Thus, our study was able to improve the estimation accuracy of forest carbon storage in subtropical regions.

[40]; Avitabile (2016) refers to [39]; and Zhang (2019) refers to [36]. Comparison according to aboveground biomass (AGB = WT + WB + WL).
In addition, we subdivided the aboveground biomass (AGB) into three layers (trunks, branches, and leaves) based on the vertical scale. Compared with the four-layer primary carbon pool (aboveground, belowground, soil, and humus) in the general ecological model, the more comprehensive and detailed AGB scales improve the accuracy of FB estimation and reduce the uncertainty of assigning empirical carbon pool values based on different land use types. Moreover, as root biomass is a part of belowground biomass, accurate modeling dramatically reduces the error of traditional root-to-shoot ratios in estimating belowground biomass [43–45]. However, this study focused only on the vertical-scale FBs of subtropical forest ecosystems, and the total carbon pool needs to be further investigated [46].
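The effect of the conversion coefficient on carbon storage can be shown with a short worked example. It assumes carbon stock is the simple product of biomass and BCTC, takes the 0.375 and 0.43~0.51 values from the text, and uses a hypothetical biomass density of 100 Mg·hm⁻².

```python
def carbon_stock(biomass, bctc):
    """Carbon stock as biomass multiplied by the biomass-to-carbon coefficient."""
    return biomass * bctc

biomass = 100.0        # hypothetical biomass density, Mg per hm^2
local_bctc = 0.375     # average BCTC of DTS reported for the study area
invest_low = 0.43      # lower bound of the InVEST range quoted in the text
invest_high = 0.51     # upper bound of the InVEST range quoted in the text

local_c = carbon_stock(biomass, local_bctc)
over_low = (carbon_stock(biomass, invest_low) - local_c) / local_c
over_high = (carbon_stock(biomass, invest_high) - local_c) / local_c

print(f"Local estimate: {local_c:.1f} Mg C per hm^2")
print(f"Relative overestimate with default range: {over_low:.1%} to {over_high:.1%}")
```

Because the relation is linear, the relative overestimate depends only on the ratio of the coefficients and not on the assumed biomass value.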

Applicability of the RF-LS Machine Learning Model
This study used simple linear and nonlinear methods to construct biomass estimation equations. From the perspective of mathematical algorithms, it is worth exploring whether multilevel mathematical formulas can improve the fitting accuracy of biomass indicators [26,47]. In addition, considering net primary productivity (NPP) and landscape pattern indexes [48,49] might improve the interpretability of FB indicators. Four types of dominant tree species (DTS) were well represented in the study area (CF, BG, SBLT, and MBLT), whereas the number of sample plots for the other DTS was relatively small, so the a, b coefficients of their allometric growth equations could be further optimized. Comparative studies on stratified FBs using different types of machine learning methods are also lacking. Currently, both the LS-SVM [18,19,21,22,50] and the integrated RF-LS model derived in this study can accurately estimate stratified FBs. In addition, the RF model in this study was trained on more samples than the models in previous research. This may also be a reason for the improved accuracy, so we recommend collecting a sufficient number of sample points. In any case, whether integrating multiple machine learning methods can further improve FB estimation accuracy warrants further study.
FBs include not only trunk, branch, leaf, and root biomass but also the biomass of the humus and litter layers [3,9,47,51]. Studies have noted that current models vastly overestimate regional carbon stocks [42]. The reason is that ecological models usually introduce only the four primary carbon pools, consider only the AGB of the forest as a whole, ignore the vertical structure of the forest, and use the empirical value of the root-to-shoot ratio to convert the belowground biomass [52–54]. Studies have shown that underground biomass modeling can improve the accuracy of FB evaluation [55–57]. Therefore, our study considered the vertical structure of the forest (trunk, branches, leaves, and roots), especially the root model, which helps to improve the accuracy of belowground biomass estimation. Although the overall accuracy reaches R² = 0.87, the modeling accuracy of the branch biomass (R² = 0.77) needs to be improved.
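The difference between converting belowground biomass with an empirical root-to-shoot ratio and modeling root biomass directly can be sketched as follows. All per-tree values and the ratio of 0.25 are hypothetical and serve only to show the bookkeeping; AGB = WT + WB + WL follows the definition used in this study.

```python
def aboveground_biomass(wt, wb, wl):
    """AGB as the sum of trunk, branch, and leaf biomass."""
    return wt + wb + wl

def belowground_from_ratio(agb, root_shoot_ratio):
    """Belowground biomass converted from AGB with an empirical ratio."""
    return agb * root_shoot_ratio

# Hypothetical per-tree biomass components (kg).
wt, wb, wl = 60.0, 15.0, 5.0
wr_modelled = 18.0           # hypothetical output of a dedicated root model
root_shoot_ratio = 0.25      # hypothetical empirical ratio

agb = aboveground_biomass(wt, wb, wl)
wr_ratio = belowground_from_ratio(agb, root_shoot_ratio)

print(f"AGB = {agb:.1f} kg")
print(f"Root biomass from ratio = {wr_ratio:.1f} kg, from model = {wr_modelled:.1f} kg")
print(f"Total biomass with modelled roots = {agb + wr_modelled:.1f} kg")
```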

Limitations and Suggestions for Optimizing Subtropical Forest Biomass Estimation
Increasing the diversity of candidate variables before they are selected by machine learning can, to a certain extent, avoid collinearity among indicators and overfitting of models, which helps improve modeling accuracy and credibility [31,58]. While a fixed combination of regression models may be suitable for different forest types [59], specific parameters might vary according to the local features of tree species [60]. Previous studies have shown that hierarchical models, such as Bayesian model averaging (BMA), can also comprehensively consider the unique situation of each population, thus significantly improving the predictive ability of multiple regression [61]. Thus, combining different machine learning methods and selecting the optimal regression model according to a hierarchical model can help to improve the applicability and accuracy of estimating the vertical-scale FBs (trunks, branches, leaves, and roots) of forests.
The biomass of bark, litter, and humus was not considered in this study, which leaves room for optimization in FB estimation. Moreover, current research has focused more on the aboveground biomass of forests [36,62–64], and uncertainties in the estimation of belowground biomass and soil biomass still exist [42,53]. Although the RF-LS model can obtain a root biomass inversion model and fitting coefficients with high accuracy, it is insufficient for comprehensive carbon pool estimation [65]. In addition, deforestation and species expansion are key factors affecting regional biomass and carbon sequestration potential [66,67], and regional FB evaluation can be refined by considering spatiotemporal changes and scenario analysis [68–71].
More accurate estimates can be obtained by selecting high-resolution raw data [10]. Our study used 30 m resolution primary data, and 1 m resolution remote sensing data from GF satellites might improve the inversion accuracy. However, remote sensing data for identifying and mapping high-precision forest types were lacking for the study period (2011–2013). Nevertheless, we note that integrating advanced remote sensing techniques and machine learning algorithms into accurate inversion methods may significantly improve FB estimates in the future [72–74]. In addition, the diversity of regressors and the combination of machine learning methods can also improve the accuracy of FB inversion [8,75–78]. This study lacks direct measurements of biomass because it relies on results estimated through equations, which may lead to a certain degree of verification bias. This is also a drawback of estimating biomass in large-scale studies. Because challenges remain in obtaining and calculating high-precision data, the feasibility of this approach for large-scale FB evaluations needs to be further explored.

Conclusions
The main contribution of this study was to develop an RF-LS machine learning algorithm based on remote sensing and field survey data to enable the prediction of vertical-scale FBs and the optimal fitting of allometric growth equation coefficients. This method improved the accuracy of vertical-scale hierarchical FB estimation for subtropical forests in China. The results showed that the RF-LS model explained 87.48%, 76.54%, 91.94%, and 92.84% of the variance in the trunk, branch, leaf, and root biomass portions of FB, respectively, with an overall R² = 0.89 and RMSE of 5.43 kg·a⁻¹, and an average R² = 0.81 and average RMSE = 1.05 kg·a⁻¹ for the a, b coefficients of the allometric growth equation. Moreover, to better understand the reliability of our predicted FBs, we validated our models using independent test data. Compared with previous studies, the integrated RF-LS model exhibited practical application potential in FB inversion and coefficient optimization. The overall R² of the AGB of various DTS increased by 12.01%, and the RMSE decreased by 7.50 Mg·hm⁻²; the fitting RMSE and R² of a, b tended to fluctuate with the DTS, and the R² of the BG significantly increased by 74%. Forest canopy height (FCH), soil organic matter content (SoilOMC), and the red-edge chlorophyll vegetation index (RECI) were the most critical indicators for stratified FBs. These indicators can therefore serve as a reference for guiding the management and operation of forest carbon reservoirs. Overall, combining remote sensing technology and machine learning algorithms can accurately retrieve vertical-scale FBs. The integrated RF-LS model could predict the vertical-scale FB well and optimize the coefficients of the allometric growth equation. In the future, we will analyze the generalizability of our results to small- to large-scale estimates and test their applicability in different geographic regions.

Figure 1. Overview of the study area. The figure shows the study area located in Yiyang City, Hunan Province, China. Landsat-8 (30 m resolution) remote sensing image of the study area, with a combination of bands 4, 3, 2. Spatial distribution of the dominant tree species (DTS) and land use types in the study area. See Table 1 for the names of the DTS corresponding to the codes.

Figure 2. Research technology framework. (A) Calculation of biomass based on the existing allometric growth equation; (B) Inversion of biomass based on remote sensing and geoscience data using the RF model, of which 70% of the forest field survey sample plots were used for modeling and 30% for model validation; (C) Use of the inversion biomass model to estimate the overall biomass and classification statistics according to different tree species; (D) Fitting and optimization of the coefficients a and b of the allometric growth equation ($W = a(D^{2}H)^{b}$) for different dominant tree species (DTS) based on diameter at breast height (D) and tree height (H) measured in the field.

Figure 3. Inversion and modeling of forest vertical biomass (WT, WB, WL, and WR) based on the RF model. (A) Importance of WT, WB, WL, and WR indexes and optimal regression variables selected by the RF model; the mosaic graph refers to the results of five repetitions of tenfold cross-validation. "IncMSE" refers to the increased mean square error, and "IncNP" refers to the increased node purity of the decision tree. (B) Pearson correlation between the selected optimal regressors of WT, WB, WL, and WR. (C) Prediction accuracy verification of the forest vertical biomass model (training set, a total of 8411 samples). WT, WB, WL, and WR refer to the trunk, branch, leaf, and root biomass.

Figure 4. Accuracy verification and prediction stability evaluation of forest biomass (FB) models using the random forest method and equations for the trunk (WT), branch (WB), leaf (WL), and root (WR) biomasses. We used 30% of the original dataset as the test set (3606 samples) to reevaluate the RF modeling accuracy. We used the ROC curve to evaluate the stability of the model prediction results and selected the sample plots according to different age classes (yr) of different DTS (<20,

Figure 5. Fitting and optimization of coefficients for allometric growth equations. The figure shows the fitting and optimization results of the allometric growth equations' coefficients a, b for 10 dominant tree species (except for the 4 types of dominant tree species involved in RF modeling), divided into the vertical scales of trunk, branch, leaf, and root. The independent variable was the product of the diameter at breast height (D, cm) squared and the tree height (H, m), and the dependent variable was the biomass of the dominant tree species per plant (kg·a⁻¹). See Table 1 for the names of the dominant trees and the detailed regression parameters.

Author Contributions: Conceptualization, Methodology, Software, Investigation, Writing-Original draft preparation, Writing-Reviewing and Editing, G.L.; Conceptualization, Methodology, Software, Investigation, C.L.; Conceptualization, Methodology, Software, Investigation, G.J.; Conceptualization, Methodology, Software, Investigation, Z.H.; Writing-Original draft preparation, Writing-Reviewing and Editing, Funding, Y.H.; Investigation, Data curation, Formal analysis, Supervision, Project administration, Funding acquisition, Writing-Reviewing and Editing, W.H. The authors G.L. and C.L. have made equal contributions. All authors have read and agreed to the published version of the manuscript.

Funding: This research was funded by the Natural Science Foundation of Hunan Province [Grant number: 2022JJ40862], the Key Project of Hunan Education Department [Grant number: 21A0513], the Scientific Research Project of Hunan Education Department [Grant number: 21B0235], and the Natural Science Foundation of Hunan Province [Grant number: 2020JJ4942]. This work was supported by the Key Discipline of the State Forestry Administration [Grant number: 2016-21] and the "Double First-Class" Cultivating Subject of Hunan Province [Grant number: 2018-469].

Table 1. Information and characteristics of DTS in the study area.

Table 2. Biomass regression indexes based on random forest model.

Table 3. Optimization of coefficients of allometric growth equations for dominant tree species.

Table 4. Comparison of aboveground biomass prediction models of subtropical forests in different studies.