UAV Remote Sensing-Based Random Forest Modeling of Expressway Vegetation Biomass and Sample Library Construction

Yang, Ying; Gao, Yulu; Zhang, Jiapen; Liang, Shiqi; Zhao, Ben; Guo, Hantian; Cai, Yinfei; Hu, Haifeng; Lian, Xugang

doi:10.3390/land15030401

by Ying Yang ¹, Yulu Gao ², Jiapen Zhang ¹, Shiqi Liang ², Ben Zhao ¹, Hantian Guo ¹, Yinfei Cai ², Haifeng Hu ² and Xugang Lian ²,*

¹ Shanxi Intelligent Transportation Laboratory Co., Ltd., Taiyuan 030032, China

² College of Geological and Surveying Engineering, Taiyuan University of Technology, Taiyuan 030024, China

* Author to whom correspondence should be addressed.

Land 2026, 15(3), 401; https://doi.org/10.3390/land15030401

Submission received: 12 February 2026 / Accepted: 26 February 2026 / Published: 28 February 2026

Abstract

To support carbon stock assessment and ecological restoration under the “Carbon Neutrality” objective, this paper developed a high-precision vegetation biomass model for expressway corridors in Shanxi Province, China, by integrating Unmanned Aerial Vehicle (UAV) technology and the random forest algorithm. Based on climatic zoning and DEM data, 70 sample plots representing diverse vegetation and topography were selected. LiDAR point clouds and multispectral data were spatially connected using the BallTree algorithm, achieving an average matching rate of 73.98–82.01%. A joint biomass model incorporating tree height and crown width was constructed with spatial cross-validation. The results indicate that the model substantially outperformed single-factor models, with R² values ranging from 0.839 to 0.934 (highest in the Hengshan–Wutaishan forest area). Accuracy was higher in forest-dominated zones but lower in areas with significant human disturbance. A representative sample library was established for model optimization. This paper provides a robust technical framework for biomass monitoring across comparable Northern Hemisphere latitudes, thereby supporting sustainable green transport development.

Keywords:

expressway; biomass; spatial connection; random forest algorithm; spatial cross-validation

1. Introduction

As the arterial network of modern transportation, expressways play a vital role in economic and social development. By 2025, China’s expressway mileage is projected to reach 190,000 km, solidifying its position as the world leader. As of the end of 2023, Shanxi Province’s total expressway mileage stood at 6187.6 km, with an expected increase to 6670 km by 2025. Roadside vegetation ecosystems play a crucial role in soil stabilization, water conservation, and reducing rainwater erosion of roadbeds. They maintain road stability and ecological balance, fostering a positive interaction between expressway operations and ecological preservation. Under the “carbon neutrality” goal, these ecosystems serve as vital carbon sinks, making biomass monitoring highly significant. Precise monitoring supports carbon stock assessments and ecological restoration, aligning with the synergistic demands of smart transportation and ecological conservation and contributing to green and sustainable development in the transportation sector.

Traditional biomass monitoring relies on manual field surveys, such as random plot sampling, in situ clipping of samples, and laboratory drying and weighing procedures to quantify leaf, stem, and aboveground biomass. Concurrently, morphological parameters like shrub height, crown width, and basal diameter are recorded [1,2,3]. Although these methods offer high precision, they are time-consuming, labor-intensive, have limited sampling coverage, and cause ecological disturbance. Extracting ground data through remote sensing technology and constructing traditional biomass regression models effectively addresses the shortcomings of conventional manual monitoring [4]. However, their accuracy is constrained by spatial resolution and revisit cycles, and data quality is susceptible to cloud cover. Furthermore, when using optical remote sensing information for biomass inversion, only the canopy leaf area is obtained, not the actual biomass. This approach overlooks the biomass accumulated through tree growth, making it difficult to meet the dynamic monitoring requirements of expressways for high precision, high frequency, and extensive coverage.

The rapid advancement of unmanned aerial vehicle (UAV) technology has opened new avenues for vegetation monitoring. UAVs equipped with LiDAR and multispectral sensors can simultaneously capture three-dimensional structural parameters and spectral information, overcoming the limitations of traditional methods in terms of temporal and spatial resolution. LiDAR emits laser pulses to precisely capture vertical vegetation structures and penetrate dense tree canopies to obtain terrain data. Multispectral sensors collect reflectance data across visible and near-infrared bands, enabling vegetation characteristics to be derived through vegetation indices. The integration of these two data types provides multidimensional geospatial support for biomass estimation. Lian et al. employed UAVs equipped with multispectral and LiDAR sensors, supplemented by handheld LiDAR devices, to collect tree parameters in mining areas. Pearson correlation analysis was used to screen model variables, and multiple linear stepwise regression combined with random forest algorithms was applied at the individual tree scale to construct biomass models [5]. Zhang et al. enhanced the individual tree detection rate of UAV LiDAR (UAV-LS) data using a fusion dataset of terrestrial laser scanning (TLS) and UAV-LS, applying optimized biomass models that incorporated height parameters to calculate individual tree biomass for each tree species [6]. However, single-source data modeling remains limited: LiDAR struggles to capture precise vegetation height data in densely vegetated areas, while multispectral data struggles to distinguish vertical structures within overlapping canopies. These challenges have driven research toward integrating multi-source geospatial data with machine learning. The core of this approach lies in integrating multi-source remote sensing data with ground-truth biomass measurements. Machine learning models establish nonlinear relationships from multidimensional features to biomass. Various machine learning algorithms have been employed for biomass modeling, including support vector machines (SVMs), artificial neural networks (ANNs), random forests (RFs), and gradient boosting regression trees (GBRTs).

In the field of machine learning algorithms, the random forest (RF) model proposed by Breiman, based on decision tree ensembles, offers advantages such as random feature selection, strong nonlinear fitting capabilities, high robustness, and predictive accuracy [7,8]. This model has been widely applied in biomass estimation. Li et al. developed a grassland aboveground biomass inversion model using multiple machine learning techniques [9]. Mu et al. compared machine learning methods with traditional binary biomass models [10]; Sun et al. constructed shrub biomass models using least squares regression, support vector machines (SVMs), and RF regression based on field samples [11]; Ding et al. established multiple linear regression, random forest, and support vector regression models using multi-source data [12]; Zhang et al. combined measured data from the Penn State Experimental Forest to develop forest aboveground biomass estimation models using random forest and support vector machines [13]. These studies employed root mean square error (RMSE) and coefficient of determination (R²) as evaluation metrics, confirming that random forest models exhibit higher fitting accuracy and generalization capabilities. However, when handling multidimensional environmental factors in complex ecosystems, random forest still faces challenges such as parameter sensitivity and the risk of local overfitting.

To enhance model adaptability, researchers incorporated multi-source environmental variables. Moni et al. combined Sentinel-2 imagery with field survey data using a random forest model to map the spatial distribution of aboveground biomass [14]; Jingyuan et al. integrated topography and biodiversity variables into a multi-scale aboveground biomass productivity model based on RF [15]; Gao et al. extracted four distinct feature types to construct an RF-based grassland biomass model [16]; Salma et al. classified SPOT 7 satellite imagery using RF and integrated field data [17]; Saurabh et al. employed recursive feature elimination to select optimal variables from Landsat 8 and Sentinel-1A data and then applied RF for aboveground biomass mapping [18]; Zarei et al. employed RF to assess the impact of climatic parameters on plant biomass [19]; Dung et al. integrated spectral, topographic, and textural variables to identify optimal predictors for aboveground biomass using RF [20]. These studies highlight the technical advantages of feature extraction and multi-parameter integration in geospatial analysis.

Further improvements have focused on algorithmic enhancements. Guo et al. optimized RF hyperparameters and developed the πFlow platform for full-process modeling [21]; Hou et al. combined RF with an optical algal cloud index to construct a hybrid remote sensing retrieval model [22]; Zhang et al. compared regularized random forest and quantile random forest with the standard RF model [23]; Xiong et al. applied hyperparameter-optimized machine learning to improve the accuracy of forest AGB estimation [24]; and Leyre et al. tuned RF hyperparameters via double cross-validation, achieving a high-accuracy model with R² exceeding 0.7 [25]. Current research predominantly relies on remote sensing data to optimize random forest algorithm hyperparameters, enhancing model performance by incorporating multiple environmental factors, spectral variables, and texture parameters. From a practical application perspective, these methods prove highly effective.

Shanxi Province, located in the mid-latitude region of the Northern Hemisphere, exhibits typical regional characteristics in terms of climate, solar radiation, and precipitation. Its roadside vegetation ecosystems along expressways are highly representative of similar geographic zones. This study focused on the vegetation along expressways in Shanxi Province. Using UAV-based surveying, we collected LiDAR data and high-resolution vegetation imagery. An allometric equation-based biomass model combined with the Random Forest algorithm was developed. A standardized biomass sample database construction workflow was designed. This sample database will not only support subsequent model optimization with precise geospatial data but also provide a replicable framework for biomass monitoring of roadside vegetation in other mid-latitude regions of the Northern Hemisphere. It is expected to promote the deeper application and innovative development of geomatics technology in the field of transportation ecology.

2. Materials and Methods

2.1. Study Area and Data Acquisition

2.1.1. Study Area Overview

Located in Northern China (110°14′–114°33′ E, 34°34′–40°34′ N), Shanxi Province forms a parallelogram extending from northeast to southwest. The region exhibits diverse topography, with elevations ranging from 180 to 3058 m and most areas exceeding 1000 m. Characterized by a temperate continental monsoon climate, the province experiences distinct seasonal variations with synchronous rainfall and heat periods. A climatic transition occurs from semi-arid conditions in the north to semi-humid conditions in the south. Influenced by these hydrothermal conditions, the main vegetation types comprise eight categories: coniferous forests, broadleaf forests, mixed coniferous–broadleaf forests, shrublands, scrub–grasslands, steppes, meadows, and cultivated vegetation. An overview of the study area and the main vegetation types are shown in Figure 1 and Figure 2.

2.1.2. Research Sample Selection

To ensure the representativeness and scientific validity of the samples, this study considered climatic zones, vegetation types, and tree species categories in sample selection. Climatic zones determine the fundamental distribution patterns of vegetation and the environmental conditions for tree survival, while vegetation types reflect the actual community composition under different climates and topographies. The vegetation along Shanxi Province’s expressways is artificially planned and planted. Tree species selection comprehensively considers the climate conditions and topographical characteristics along the routes to ensure vegetation survival and ecological functionality [26].

Referencing existing vegetation zoning results and based on Shanxi Province’s natural conditions, this study combined climate zoning with digital elevation models. By integrating vegetation distribution maps and ecological type classifications, spatial overlay analysis eliminated redundant data to extract key spatial correlation information [27]. This process enabled the completion of vegetation zoning for Shanxi Province’s expressways, as illustrated in Figure 3. The naming convention followed specific patterns: zones adopted the format of “macro-geographical location + climate zone type + vegetation landscape” to reflect macroscopic vegetation–environment relationships, while subzones were named using “geographical location + major landform type + dominant plant formation” to represent micro-level community differences.

Sample plot selection considered different zoning units and various landforms, including plains and mountainous areas, prioritizing locations with stable vegetation growth and minimal human disturbance. This approach ensured even distribution of samples across various vegetation types and ecological environments. Based on the aforementioned vegetation zoning results of Shanxi Province’s expressways, 70 sample plots were selected for this paper. These comprise 15 plots from the north Shanxi mid-temperate steppe vegetation region (zone I), 39 plots from the central and southeastern Shanxi warm-temperate forest vegetation region (zone II), and 16 plots from the south Shanxi warm-temperate forest vegetation region (zone III).

2.2. Data Acquisition

2.2.1. UAV Data Acquisition

This study used the Feima D2000S UAV (Shenzhen Feima Robotics Co., Ltd., Nanshan District, Shenzhen City, China) for data collection, equipped with two types of payloads for multi-source data acquisition. The D-LiDAR2000 payload (Shenzhen Feima Robotics Co., Ltd., Nanshan District, Shenzhen City, China) was used to obtain point cloud data containing vegetation 3D structural information, enabling precise extraction of key morphological parameters such as plant height, crown width, and DBH. The D-MSPC2000 (Shenzhen Feima Robotics Co., Ltd., Nanshan District, Shenzhen City, China) multispectral payload was used to collect vegetation spectral data, providing a basis for subsequent analysis of vegetation growth status and extraction of spectral features of vegetation types.

Data collection was conducted from 23 October to 9 November 2024 and from 28 July to 2 September 2025. During this period, vegetation along the expressways in Shanxi Province was in its peak growing season. Operations were carried out during sunny, low-wind periods to avoid interference from rainy or windy weather on UAV flight stability and data acquisition accuracy. During LiDAR data acquisition, to ensure point cloud quality in complex terrain, we employed a terrain-following flight at a relative altitude of 100 m. Multispectral data collection was fixed at a flight altitude of 111 m. The drone flight sensor parameter settings are shown in Table 1, and examples of the data acquisition results are shown in Figure 4.

After data acquisition, standardized preprocessing was performed on the LiDAR and multispectral data to provide high-quality foundational data for subsequent single-tree parameter extraction and model construction. LiDAR data underwent coordinate transformation, flight path acquisition, swath stitching, and point cloud processing within UAV Manager, generating a high-precision 3D point cloud. Raw multispectral files underwent radiometric calibration, geometric correction, and image mosaicking within Pix4D (V4.0), yielding complete multispectral reflectance imagery.

2.2.2. Ground Measurement Data Acquisition

Manual measurements were conducted to obtain ground truth DBH data using steel tape and soft tape. The measurement point was determined at 1.3 m above ground level on the trunk. Adjustments were made if there were swellings, forks, or tilting near the tree base. The soft tape was wrapped around the trunk, ensuring it was level and snug against the trunk without slack or overtightening. If the trunk surface had knots, depressions, or other irregularities, a relatively smooth circumference was selected. For the same tree, measurements were repeated 2–3 times, with each measurement position deviating by no more than 2 cm. The average value was calculated and converted to DBH as the final DBH data. Relevant information was recorded simultaneously to ensure data traceability, guaranteeing the accuracy and reliability of the ground measurement data and providing foundational data support for forestry resource surveys and tree growth monitoring. Field operations are shown in Figure 5.

2.3. Methodology

2.3.1. Extraction of Individual Tree LiDAR 3D Structural Features

Raw LiDAR data was stored in .las format using the CGCS2000 coordinate system and underwent preprocessing via LiDAR360 software (Beijing Digital Green Soil Technology Co., Ltd., Beijing, China). Point cloud denoising eliminated invalid noise caused by equipment errors and atmospheric interference, while algorithmic classification separated ground points from non-ground points. Spatial interpolation of ground points generated a Digital Elevation Model (DEM) reflecting actual terrain features. A Digital Surface Model (DSM) was constructed by integrating elevation information from non-ground features based on the DEM. The elevation difference between the DSM and DEM was calculated to derive a canopy height model (CHM) representing the height distribution of the vegetation canopy. An example of CHM model generation is shown in Figure 6.
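The CHM derivation described above is a cell-by-cell difference between two co-registered grids. The following is a minimal sketch of that step in plain Python, assuming both grids are already aligned and given as nested lists; the function name `canopy_height_model`, the `nodata` handling, and the clamping of negative differences to zero are illustrative assumptions, not part of the LiDAR360 workflow described in this paper.

```python
def canopy_height_model(dsm, dem, nodata=None):
    """Return CHM = DSM - DEM cell by cell for two equally shaped grids."""
    if len(dsm) != len(dem) or any(len(a) != len(b) for a, b in zip(dsm, dem)):
        raise ValueError("DSM and DEM must share the same grid shape")
    chm = []
    for dsm_row, dem_row in zip(dsm, dem):
        row = []
        for surface, ground in zip(dsm_row, dem_row):
            if nodata is not None and (surface == nodata or ground == nodata):
                # Propagate missing cells instead of inventing a height.
                row.append(nodata)
            else:
                # Assumption: small negative differences are interpolation
                # noise and are clamped to zero canopy height.
                row.append(max(surface - ground, 0.0))
        chm.append(row)
    return chm


# Hypothetical elevations in metres, for illustration only.
dsm = [[812.4, 815.0], [810.1, 809.8]]
dem = [[804.4, 805.0], [810.1, 810.0]]
chm = canopy_height_model(dsm, dem)
```

In this toy grid the first row yields canopy heights of about 8 m and 10 m, while the second row is bare ground and evaluates to zero.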

After the above basic processing, the original 3D point cloud underwent secondary filtering to optimize data quality. Individual tree crowns were then accurately delineated using a tree crown segmentation algorithm, and 3D crown models were extracted. Parameters were calculated at the stand level, ultimately yielding key forest parameters corresponding to the area associated with each point cloud data segment, providing foundational data support for subsequent vegetation biomass inversion and ecological assessment.

The primary objective of this study was to assess the total biomass of large-scale mixed forests. Given the extensive study area and diverse tree species involved, constructing independent, highly parameterized models for each species presented significant challenges in data availability and computational feasibility. Therefore, a generalized mixed model based on allometric growth theory was adopted. To incorporate variability within the generalized framework as comprehensively as possible, the model did not employ a single parameter set. Instead, allometric growth parameters were established for different groups based on primary plant functional types. These parameters were derived from the integration and classification of numerous published species-specific equations. While mathematically unified, the model reflected key inter-group differences in biomass accumulation through its parameterization. Based on the mixed species (group) biomass equations provided in the “Carbon Storage in Forest Ecosystems of China—Biomass Equations” [28], the aboveground biomass of trees in each sample plot was calculated as the measured value. The allometric growth equations are shown in Table 2. The stand parameters extracted for each plot are shown in Table 3.

Multispectral imagery and LiDAR data were collected from the same UAV platform, sharing a consistent geographic coordinate framework that enables direct correspondence to individual trees. By comparing UAV imagery with field operation photographs and utilizing permanent landmarks in the photos, the geographic boundaries of the measured sample plots were precisely defined. The highway corridor vegetation consisted of uniform, artificially planted forests with trees arranged in regular patterns. By combining the spacing between trees and rows with the relative positions of individual trees recorded in the field reference, corresponding individual tree profiles were identified within the canopy height model generated from the LiDAR data. This ensured that the ground-measured DBH data could be accurately matched and analyzed against the DBH parameters extracted from the same tree within the LiDAR point cloud.

The primary statistical comparison parameters were absolute error and mean relative error. Absolute error directly reflects the actual physical gap between predicted and true values, while relative error indicates the proportion of this error relative to the true value. The results show that the absolute error ranged from 0.7 to 1.5 cm, with a mean relative error of 3%. Ground measurement data and UAV-collected data exhibited significant correlation. Poplar trees exhibited the smallest error due to their straight trunks, while weeping willows showed slightly higher error but remained within acceptable limits. The coefficients of determination (R²) of all linear fits exceeded 0.95, confirming that the UAV data accurately reflected the DBH values and could replace ground measurements.
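The two comparison statistics named above can be written out explicitly. The sketch below is illustrative only: the function name `dbh_errors` and the sample DBH values are hypothetical and are not the measurements reported in this paper.

```python
def dbh_errors(measured, estimated):
    """Per-tree absolute errors, mean absolute error, mean relative error.

    measured  : ground-truth DBH values (cm), all strictly positive
    estimated : LiDAR-derived DBH values (cm), same order as measured
    """
    if not measured or len(measured) != len(estimated):
        raise ValueError("inputs must be non-empty and of equal length")
    abs_errors = [abs(e - m) for m, e in zip(measured, estimated)]
    # Relative error expresses each absolute error as a share of the true value.
    rel_errors = [a / m for a, m in zip(abs_errors, measured)]
    mean_abs = sum(abs_errors) / len(abs_errors)
    mean_rel = sum(rel_errors) / len(rel_errors)
    return abs_errors, mean_abs, mean_rel


# Hypothetical values for three trees, not data from the study.
measured = [20.0, 25.0, 40.0]
estimated = [21.0, 24.0, 41.0]
abs_errors, mean_abs, mean_rel = dbh_errors(measured, estimated)
```

With these made-up inputs every tree is off by 1 cm, but the relative error shrinks as the trunk gets thicker, which is why both statistics are reported together.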

2.3.2. Individual Tree Extraction Using Multi-Scale Segmentation Algorithm

Remote sensing imagery could precisely identify individual tree crown boundaries. Multi-scale segmentation algorithms effectively extracted tree coordinates and canopy parameters, overcoming challenges posed by tree size variations. These algorithms maintained segmentation integrity and accuracy by integrating features across different resolutions, primarily employing two methods: the region-growing method merged spectrally similar bright canopy center pixels within a set threshold; watershed transformation treated an image as terrain, distinguishing canopies from boundaries while preventing oversegmentation.

(1) Environment Configuration and Data Reading

Individual tree information extraction and spectral feature analysis were implemented in Python within PyCharm (2024.1). The workflow started with environment configuration and then progressively handled data reading, multi-scale segmentation, parameter extraction, and feature analysis. Libraries were installed using the pip tool: rasterio for reading remote sensing imagery (supporting TIFF, ENVI, and other formats), numpy and pandas for data matrix operations and feature storage, scikit-image for image segmentation and morphological operations, and matplotlib for visualizing results. After installation, the rasterio library was invoked to read multispectral imagery. The src.read() function retrieved band data, and the affine transformation parameters and coordinate reference system (CRS) information were extracted alongside it. These parameters facilitated the conversion of pixel coordinates into geographic coordinates.
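The pixel-to-geographic conversion mentioned above is an affine mapping. The sketch below reproduces it in plain Python so that it runs without any raster file; it assumes the six coefficients (a, b, c, d, e, f) that rasterio exposes as attributes of a dataset's `transform`. The function name `pixel_to_geo` and the example origin and resolution are hypothetical.

```python
def pixel_to_geo(transform, row, col, center=True):
    """Map a pixel (row, col) to map coordinates (x, y).

    transform : six affine coefficients (a, b, c, d, e, f) such that
                x = a * col + b * row + c and y = d * col + e * row + f
    center    : if True, return the pixel centre instead of its upper-left corner
    """
    a, b, c, d, e, f = transform
    offset = 0.5 if center else 0.0
    px, py = col + offset, row + offset
    x = a * px + b * py + c
    y = d * px + e * py + f
    return x, y


# Hypothetical north-up raster: 0.1 m pixels, upper-left corner at
# (500000, 4200000) in a projected CRS. Not the study's actual georeferencing.
transform = (0.1, 0.0, 500000.0, 0.0, -0.1, 4200000.0)
x, y = pixel_to_geo(transform, row=10, col=20)
```

For a north-up image the coefficient e is negative, so increasing row numbers move southward in map coordinates.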

(2) Multi-Scale Segmentation

Multi-scale segmentation was performed on the remote sensing image. Multi-scale imagery was generated using Gaussian blur, regions were merged based on spectral similarity, and the results from each scale were weighted and fused, followed by accuracy optimization. The Gaussian blur kernel is a smoothing filter based on the Gaussian distribution; it is convolved with the original remote sensing image to provide the basis for subsequent hierarchical segmentation. The Gaussian kernel is expressed as follows:

K_{σ}(x, y) = \frac{1}{2 π σ^{2}} e^{- \frac{x^{2} + y^{2}}{2 σ^{2}}}

(1)

where (x, y) represents the coordinates of pixels relative to the kernel center within the kernel function; σ denotes the standard deviation, which adjusts the blur intensity; and \frac{1}{2 π σ^{2}} is the normalization coefficient, ensuring that the kernel sums to 1 so that convolution does not change the overall brightness of the image.
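Equation (1) can be sampled on a pixel grid as follows. This is a minimal sketch, not the authors' code: the function name `gaussian_kernel` and the default truncation radius of three standard deviations are assumptions. Because a sampled, truncated kernel does not sum exactly to 1, the sketch renormalizes the weights after evaluating the formula.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Sample the 2D Gaussian of Equation (1) on a (2r+1) x (2r+1) grid."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if radius is None:
        # Assumption: truncate the kernel at about three standard deviations.
        radius = max(1, math.ceil(3 * sigma))
    coeff = 1.0 / (2.0 * math.pi * sigma ** 2)
    kernel = [
        [coeff * math.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
         for x in range(-radius, radius + 1)]
        for y in range(-radius, radius + 1)
    ]
    # Renormalize so the discrete weights sum to exactly 1 (up to rounding),
    # which keeps overall image brightness unchanged after convolution.
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]


kernel = gaussian_kernel(sigma=1.0)
```

Larger values of σ spread the weights over more pixels, which is what produces the coarser scales used in the subsequent segmentation.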

Regional merging was performed by expanding outward from the center point, using spectral similarity as the criterion for determining whether adjacent pixels or regions should be merged. After completing independent segmentation at each scale, multi-scale fusion was applied to overcome the limitations of a single scale, assigning fusion weights to the segmentation results at each scale. The weighting formula was as follows:

ω_{i} = \frac{σ_{\max} - σ_{i}}{\sum_{j = 1}^{m} (σ_{\max} - σ_{j})}

(2)

where σ_{\max} is the maximum standard deviation among all scales, σ_{i} is the standard deviation of the i-th scale, and m is the total number of scales. Since small-scale segmentation results are finer, their weight is proportional to σ_{\max} - σ_{i}, resulting in relatively larger weights. Large-scale segmentation focuses more on overall integrity, with relatively smaller weights. Weighted superposition of results from all scales yielded the final optimal segmentation contours.
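The weighting in Equation (2) can be checked numerically. The sketch below is illustrative; the function name `scale_weights` and the three example σ values are hypothetical and are not the scales used in the study.

```python
def scale_weights(sigmas):
    """Fusion weights of Equation (2): w_i = (s_max - s_i) / sum_j (s_max - s_j)."""
    sigma_max = max(sigmas)
    diffs = [sigma_max - s for s in sigmas]
    total = sum(diffs)
    if total == 0:
        # All scales identical: the denominator of Equation (2) vanishes.
        raise ValueError("at least two distinct scales are required")
    return [d / total for d in diffs]


# Hypothetical scales, for illustration only.
weights = scale_weights([1.0, 2.0, 4.0])
```

For these example scales the weights are 0.6, 0.4, and 0.0. Note that, as written, the formula assigns zero weight to the coarsest scale, so that scale contributes nothing to the weighted superposition.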

In areas with significant terrain variations, elevation constraints were incorporated to enhance segmentation accuracy. When the elevation difference between adjacent segments was below a specified threshold, they were classified as belonging to the same tree crown. This approach preserved structural continuity during merging, effectively preventing crown distortion and segmentation fractures caused by sloping terrain, thereby improving accuracy in complex landscapes.

(3) Individual Tree Parameter Extraction

After completing the segmentation, we extracted individual tree parameters. We calculated the centroid of the segmented contour polygon to obtain the tree location coordinates. For crown area calculation, the pixel counting method was employed, with the formula as follows:

\begin{array}{l} (x_{c}, y_{c}) = \left( \frac{1}{N} \sum_{i = 1}^{N} x_{i}, \frac{1}{N} \sum_{i = 1}^{N} y_{i} \right) \\ \mathrm{Area} = N \times \mathrm{res}^{2} \end{array}

(3)

where (x_{i}, y_{i}) are the pixel coordinates of the contour, N is the total number of pixels, and res is the spatial resolution (in meters/pixel).
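Equation (3) reduces to two averages and one multiplication. The sketch below assumes each crown is available as a list of pixel coordinates; the function name `crown_centroid_and_area` and the 0.5 m resolution are hypothetical.

```python
def crown_centroid_and_area(pixels, res):
    """Centroid and area of one segmented crown, following Equation (3).

    pixels : list of (x, y) pixel coordinates belonging to the crown
    res    : spatial resolution in metres per pixel
    """
    n = len(pixels)
    if n == 0:
        raise ValueError("a crown must contain at least one pixel")
    x_c = sum(x for x, _ in pixels) / n
    y_c = sum(y for _, y in pixels) / n
    area = n * res ** 2  # pixel count times the ground area of one pixel
    return (x_c, y_c), area


# Hypothetical 2 x 2 pixel crown at 0.5 m resolution.
centroid, area = crown_centroid_and_area(
    [(10, 20), (11, 20), (10, 21), (11, 21)], res=0.5
)
```

Here the centroid is (10.5, 20.5) in pixel units and the crown area is 1.0 m². The centroid would still need the affine transform of the image to become a geographic coordinate.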

Within the PyCharm environment, the entire workflow from remote sensing data reading to individual tree information extraction was automated through programming. From image reading and preprocessing to multi-scale segmentation and individual tree parameter and spectral feature extraction to data integration and visualization, all steps were automatically linked and executed by the program. Without manual step-by-step operation, structured individual tree information and map outputs were generated from the original remote sensing data, compressing processing time and effectively reducing the cost of manual intervention and operational errors. An example of multi-scale segmentation is shown in Figure 7. The segmentation results for each vegetation zone are presented in Table 4.

2.3.3. Spatial Connection

Spatial connection was used to precisely match the individual tree locations extracted from LiDAR and multispectral data, associating and integrating their corresponding tree height and crown width data to achieve the fusion of multi-source individual tree information. Spatial connection is a core operation in Geographic Information Systems and spatial data analysis, essentially associating attribute information from two datasets based on the spatial relationships between features. Unlike attribute joins in traditional databases, the basis for association in spatial connection is the geometric positional relationship of geographic features, applicable to the integrated analysis of spatial objects like points, lines, and polygons. The key to spatial connection algorithms lies in efficiently determining positional relationships between a large number of spatial objects. When the data volume is large, comparing all possible feature pairs one by one is computationally inefficient. Therefore, practical applications require combining spatial indexing and optimization algorithms to connect LiDAR and multispectral spatial data based on location, completing information, and reducing computational complexity. The principle of spatial connection is shown in Figure 8.

This study did not simply overlay LiDAR points with multispectral segmentation objects by coordinates. Instead, it created a radius buffer centered on each individual tree location detected by LiDAR. Only multispectral segmentation objects falling entirely within this buffer were listed as candidate matching objects. The geometric characteristics of each candidate object were evaluated against the expected canopy size estimated from the LiDAR point cloud. Combined with the reasonableness of their spectral features, the correct matching objects were selected. For LiDAR points lacking candidate objects, manual verification using multispectral imagery was conducted to distinguish and address anomalies caused by shadows, occlusions, or spectral confusion, thereby ensuring the overall quality and reliability of the fused data.

The specific matching algorithm employed was based on the BallTree nearest neighbor search algorithm, combined with dynamic distance threshold filtering, achieving spatial correlation between the point cloud and the multispectral centroids. BallTree is an efficient high-dimensional spatial index structure that can quickly find the nearest neighbors in a point set. It recursively partitions points in space into multiple “hyperspheres,” with each containing a set of points and recording the sphere’s center and radius, and then builds a hierarchical tree structure. The root node contains all points, child nodes are sub-spheres of the parent node’s sphere, and leaf nodes correspond to a single or a small number of points. During querying, by calculating distances between spheres, subtrees that cannot contain the nearest neighbor are quickly excluded, and precise calculations are performed only on possible spheres, significantly reducing the number of comparisons.

To avoid matching points that were too far apart, a dynamic threshold screening mechanism was introduced in the code. If a fixed threshold was not specified, a certain proportion of the crown width corresponding to the multispectral centroid was used as the threshold, retaining only matching pairs with a distance less than or equal to this threshold.
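The nearest-neighbour matching with a dynamic threshold can be sketched as follows. To stay dependency-free, this example uses a brute-force search rather than a BallTree; in practice an index such as scikit-learn's `BallTree` would replace the inner loop without changing the filtering logic. The function name `match_trees`, the `crown_fraction` default of 0.5, and the definition of the matching rate as matched LiDAR trees divided by all LiDAR trees are assumptions, since the text does not state them.

```python
import math


def match_trees(lidar_points, ms_crowns, fixed_threshold=None, crown_fraction=0.5):
    """Pair each LiDAR tree with its nearest multispectral crown centroid.

    lidar_points : list of (x, y) tree locations from LiDAR
    ms_crowns    : list of (x, y, crown_width) from multispectral segmentation
    Returns (matches, rate); matches holds (lidar_idx, crown_idx, distance).
    """
    matches = []
    for i, (lx, ly) in enumerate(lidar_points):
        best_j, best_d = None, math.inf
        # Brute-force nearest neighbour; stands in for a BallTree query.
        for j, (mx, my, _) in enumerate(ms_crowns):
            d = math.hypot(lx - mx, ly - my)
            if d < best_d:
                best_j, best_d = j, d
        if best_j is None:
            continue
        if fixed_threshold is not None:
            threshold = fixed_threshold
        else:
            # Dynamic threshold: a share of the candidate's crown width.
            threshold = crown_fraction * ms_crowns[best_j][2]
        if best_d <= threshold:
            matches.append((i, best_j, best_d))
    rate = len(matches) / len(lidar_points) if lidar_points else 0.0
    return matches, rate


# Hypothetical coordinates in metres, for illustration only.
lidar = [(0.0, 0.0), (10.0, 0.0), (30.0, 30.0)]
crowns = [(0.5, 0.0, 4.0), (10.0, 1.0, 3.0), (20.0, 20.0, 4.0)]
matches, rate = match_trees(lidar, crowns)
```

In this toy case the first two trees are matched and the third is rejected because its nearest crown lies about 14 m away, far beyond half of that crown's width. The sketch does not enforce one-to-one pairing, which a production workflow would likely need.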

2.3.4. Biomass Model Construction

Random forest (RF), as a typical ensemble learning algorithm, effectively avoids the overfitting problems of single decision trees through its mechanism of collaborative decision-making by multiple trees. It also possesses unique advantages such as resistance to noise interference, adaptability to nonlinear relationships, and output of feature importance, making it a mainstream technique in current biomass modeling. The principle of the random forest algorithm is shown in Figure 9.

The essence of random forest is an ensemble learning model based on decision trees, introducing dual randomness in both data and features to enhance model generalization ability. It integrates the prediction results of multiple trees through regression averaging to ultimately achieve accurate biomass prediction. The construction process relies on the training of regression decision trees. Since biomass is a continuous numerical value, the feature space is recursively partitioned based on the squared error minimization criterion to minimize prediction bias. For a node containing sample set S, with input features being tree height H and crown width C and the target variable being biomass y, the formula for calculating the squared error of a node is as follows:

MSE(S) = \frac{1}{n} \sum_{i \in S} {(y_{i} - {\bar{y}}_{S})}^{2}

(4)

where n is the number of samples in node S, and

{\bar{y}}_{S} = \frac{1}{n} \sum_{i \in S} y_{i}

is the mean biomass of the node, which is also the predicted value for the samples in that node.
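As a small illustration of the squared-error criterion in Equation (4), the sketch below computes the node error and scans candidate thresholds on a single feature, choosing the one that minimizes the sample-weighted error of the two child nodes. The function names `node_mse` and `best_split` are hypothetical, and a full regression tree would repeat this search over every feature and recurse into the children.

```python
import numpy as np


def node_mse(y):
    """Mean squared deviation of the node's targets from the node mean."""
    y = np.asarray(y, dtype=float)
    return float(np.mean((y - y.mean()) ** 2))


def best_split(x, y):
    """Scan thresholds on one feature; return (threshold, weighted child MSE).

    Returns (None, parent MSE) when no split lowers the error.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, node_mse(y))
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold separates identical feature values
        left, right = y[:i], y[i:]
        score = (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(y)
        if score < best[1]:
            best = ((x[i] + x[i - 1]) / 2.0, score)
    return best
```

For example, calling `best_split` with tree heights as `x` and biomass as `y` returns the height threshold that best separates low-biomass from high-biomass samples in that node.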

The algorithm was adapted to the spatial characteristics and ecological patterns of biomass data, with key enhancements in three areas: data preprocessing, spatial partitioning, and cross-validation. Biomass data are susceptible to measurement errors and interference from anomalous samples. An interquartile range (IQR)-style filter, computed here from the 2nd and 98th percentiles rather than the quartiles, was used to remove outliers, with the following formula:

\begin{array}{l} Q_{1} = \mathrm{quantile}(y, 0.02), \quad Q_{3} = \mathrm{quantile}(y, 0.98) \\ IQR = Q_{3} - Q_{1} \end{array}

(5)

where $Q_{1}$ and $Q_{3}$ are the 2nd and 98th percentiles of biomass, respectively. The effective biomass range was set to $[Q_{1} - 2 \times IQR, Q_{3} + 2 \times IQR]$. The IQR multiplier was expanded to 2 to retain more large-DBH trees and to avoid distorting the data through excessive filtering. Based on forest ecology theory, tree height and crown width were selected as the core input features. Tree height reflects vertical growth and crown width reflects the horizontal photosynthetic area; both were significantly positively correlated with biomass. Restricting the inputs to these two features avoided redundant features that would increase model complexity.
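The outlier filter of Equation (5) can be sketched in a few lines, using the 2nd and 98th percentiles and the multiplier of 2 stated in the text. The function name `percentile_range_filter` and its argument names are hypothetical, and NumPy's default linear interpolation for quantiles is an assumption, since the text does not say how the percentiles were computed.

```python
import numpy as np


def percentile_range_filter(y, lower_q=0.02, upper_q=0.98, k=2.0):
    """Return a boolean mask of values inside [Q1 - k*IQR, Q3 + k*IQR].

    Q1 and Q3 are the lower_q and upper_q quantiles of y, and IQR = Q3 - Q1.
    """
    y = np.asarray(y, dtype=float)
    q1, q3 = np.quantile(y, [lower_q, upper_q])
    spread = q3 - q1
    lo, hi = q1 - k * spread, q3 + k * spread
    mask = (y >= lo) & (y <= hi)
    return mask, (float(lo), float(hi))
```

Applying `mask` to the biomass vector and to the matching rows of the feature table keeps the samples aligned after filtering.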

Tree growth exhibits significant spatial autocorrelation. If the training and test sets are partitioned randomly, test samples can overlap spatially with training samples, which inflates the model evaluation results. Therefore, a spatial 7:3 split was adopted to ensure the spatial independence of the test set, making the evaluation results more reliable. To further verify the model’s stability within the training set, 5-fold spatial cross-validation was employed. The training dataset $D_{train}$ was sorted in ascending order by x-coordinate and evenly divided into 5 consecutive subsets, $D_{cv,1}$ to $D_{cv,5}$. In the k-th validation, $D_{cv,k}$ was used as the validation set and the remaining subsets were used as the training set. The model was trained, and the R² for the validation set was calculated. The final cross-validation accuracy was the mean R² of the 5 folds. The expression for spatial cross-validation is as follows:

\begin{array}{l} R_{cv}^{2} = \frac{1}{5} \sum_{k = 1}^{5} R_{cv, k}^{2} \\ R_{cv, k}^{2} = 1 - \frac{\sum_{(x, y) \in D_{cv, k}} {(y - {\hat{y}}_{k} (x))}^{2}}{\sum_{(x, y) \in D_{cv, k}} {(y - {\bar{y}}_{cv, k})}^{2}} \end{array}

(6)

where ${\bar{y}}_{cv, k}$ is the mean biomass of the k-th fold validation set and ${\hat{y}}_{k} (x)$ is the value predicted by the model trained in the k-th iteration. A difference of less than 0.1 between $R_{cv}^{2}$ and the test-set R² was taken to indicate no significant overfitting and was used to judge parameter quality.
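The spatial partitioning and cross-validation procedure can be sketched as below, assuming scikit-learn's `RandomForestRegressor` and `r2_score`. The folds follow the text (samples sorted by x-coordinate, then cut into consecutive blocks). Applying the 7:3 split along the same x-axis is an assumption, because the text does not state the split direction. The function names and the `rf_params` pass-through are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score


def spatial_split(x_coord, train_frac=0.7):
    """Contiguous train/test index split along the x-coordinate (assumed axis)."""
    order = np.argsort(x_coord)
    cut = int(round(train_frac * len(order)))
    return order[:cut], order[cut:]


def spatial_cv_r2(X, y, x_coord, n_folds=5, **rf_params):
    """Mean R^2 over consecutive x-sorted folds, as in Equation (6)."""
    params = {"random_state": 0, **rf_params}
    folds = np.array_split(np.argsort(x_coord), n_folds)
    scores = []
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = RandomForestRegressor(**params)
        model.fit(X[train], y[train])
        scores.append(float(r2_score(y[val], model.predict(X[val]))))
    return float(np.mean(scores)), scores
```

A typical use is to call `spatial_split` once, run `spatial_cv_r2` on the training portion only, fit a final model on that portion, and then compare the cross-validated R² with the test-set R² against the 0.1 tolerance described above.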

The entire process of building the biomass model with random forest addressed single-tree overfitting through dual randomness, handled data spatial autocorrelation through spatial partitioning and cross-validation, and balanced accuracy and visualization effects through residual correction and noise addition. Finally, biomass prediction tables were generated and scatter plots of measured vs. predicted values and spatial distribution maps of biomass were drawn, forming a complete closed loop from data input to result output. The parameter settings are shown in Table 5.

3. Results

3.1. Spatial Connection Results Analysis

3.1.1. Spatial Matching Rate Analysis

After spatial registration of the acquired point cloud data and multispectral data, their spatial connection results were visualized in a planar Cartesian coordinate system. The distribution characteristics showed significant spatial overlap between the two data point types, indicating successful spatial connection for most points after registration. Isolated points failed to connect due to data acquisition errors and the precision limits of the registration algorithm. The matching rate is a core metric for evaluating the effectiveness of multi-source spatial data registration. It provides a feasible spatial reference for subsequent multi-source data fusion analysis. The spatial connectivity results for the Youyu Service Area sample plots are shown in Figure 10 and the average matching rates for each vegetation zone are presented in Table 6.

3.1.2. Matching Error Analysis

An analysis of the absolute matching distance deviations across all regions is presented in Figure 11. The chart shows that the spatial matching errors follow an approximately half-normal distribution, with over half of the matches achieving high precision. However, the average error was 3.25 m, and the errors ranged from 0 to 60 m. This indicates a pronounced right-skewed long tail in the error distribution, in which a small number of unmatched points elevated the average error level. The scale parameter of the fitted half-normal curve was 10.54 m, describing the dispersion of errors. This indicates that while random errors dominated the matching process, extreme error cases still warrant close attention. In summary, this matching method demonstrates strong reliability. Nevertheless, root cause analysis of the large-error cases is necessary to enhance overall robustness and expand its applicability.
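For readers who want to reproduce this kind of fit, the sketch below estimates the scale of a half-normal (semi-normal) distribution from absolute matching errors. It assumes a maximum-likelihood fit with the location fixed at zero; the text does not state which fitting procedure was used, and the function names are hypothetical.

```python
import math

import numpy as np


def half_normal_scale(errors):
    """Maximum-likelihood scale (sigma) of a half-normal with location 0."""
    e = np.abs(np.asarray(errors, dtype=float))
    return float(np.sqrt(np.mean(e ** 2)))


def half_normal_mean(sigma):
    """Mean implied by a half-normal distribution with scale sigma."""
    return sigma * math.sqrt(2.0 / math.pi)
```

Comparing `half_normal_mean(sigma)` with the empirical mean error is one quick check of how strongly a long tail drives the fitted scale, because the squared terms in the estimator weight large errors heavily.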

3.2. Biomass Model Construction Using Random Forest

3.2.1. Biomass Modeling

The random forest-based joint model utilizing both tree height and crown width demonstrated superior performance, achieving R² values of 0.839–0.934. This significantly outperformed single-factor models, with tree height and crown width models achieving R² values of 0.156–0.314 and 0.632–0.823, respectively.

The combined model’s predictions closely aligned with measured values along the reference line, while single-factor models showed greater scatter and deviation. Crown width exhibited stronger predictive capability than tree height alone, reflecting its direct relationship with photosynthetic capacity and growing space. Tree height measurements proved less reliable as a standalone parameter due to influences from species variation and competition.

Although minor variations occurred across vegetation zones, the joint model maintained consistent high accuracy (R² > 0.8), demonstrating both stability and adaptability. The random forest algorithm effectively handled nonlinear biomass–environment relationships through its ensemble structure and randomization features. These results establish the tree height–crown width joint modeling approach as the optimal strategy for forest biomass prediction, with crown width serving as the more influential single parameter, while the combination delivers maximum accuracy. The measured and predicted biomass values for each vegetation subzone were compared, as shown in Figure 12.

3.2.2. Statistics and Analysis of Model Residuals

Residuals represent the difference between observed and predicted values and help to visualize where the model’s predictions deviate most. In PyCharm, the biomass model residuals for each vegetation subzone were calculated and analyzed, as shown in Figure 13. Overall, the model’s performance across Shanxi Province exhibited significant spatial variation, with differences in prediction accuracy and systematic bias across ecological subzones. This indicates the complex influence of vegetation structure, topography, climatic conditions, and potential human disturbance on the stability of biomass estimation models.

In terms of model unbiasedness, the mean residuals of the vegetation zones were generally close to zero. Zone IIF exhibited a relatively larger mean residual, with a magnitude of 1.397, while the remaining zones all had mean residual magnitudes below 0.27. This indicates that the biomass model is largely unbiased overall, meaning that its systematic error is minimal.

However, the model’s predictive accuracy varied across regions, as reflected primarily in the standard deviation (Std) and mean absolute error (MAE) of the residuals. Zone IIA performed best, exhibiting the lowest Std and MAE values of all subzones. For this region, the dispersion between model predictions and actual values was minimal, making its predictions the most accurate. This was because the area has fewer vegetation types, being a forest–grassland transition zone with relatively flat terrain, which resulted in a more stable relationship between tree height, crown width, and biomass and enhanced the model’s descriptive capability. In contrast, Zones IIF and IIE exhibited the highest model uncertainty. Zone IIF had the highest Std and MAE values among all zones, along with the largest absolute mean residual, indicating a systematic overestimation trend. Zone IIE showed similar characteristics. Both zones are mountainous areas with significant topographic variation and complex vegetation types, which reduced the model’s predictive capability. Although Zone IIG is a basin, its mixed coniferous and broadleaf forest resulted in relatively high Std and MAE values, further illustrating the impact of complex vegetation types on model accuracy. Zones IB, IID, and IIIB exhibited moderate prediction accuracy. These areas represent typical loess hilly landscapes or transitional zones between basin agricultural regions and forests, characterized by frequent human activity. This activity influences the spatial distribution of biomass, resulting in residual dispersion higher than in Zone IIA but lower than in the most complex mountainous mixed forest zones. Despite belonging to different primary zones, Zones IA and IIB exhibited very similar Std and MAE values, indicating comparable error levels under similar hilly topography and vegetation conditions.
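The per-zone statistics discussed above (mean residual, Std, and MAE) can be computed with a short helper. This sketch assumes that residuals are defined as observed minus predicted, so that a negative mean corresponds to overestimation, and that Std is the sample standard deviation. Both conventions are assumptions, and the function name `residual_summary` is hypothetical.

```python
import numpy as np


def residual_summary(observed, predicted, zones):
    """Per-zone mean residual, sample standard deviation, and MAE.

    Residual = observed - predicted (assumed sign convention).
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    zones = np.asarray(zones)
    residuals = observed - predicted
    summary = {}
    for zone in np.unique(zones):
        r = residuals[zones == zone]
        summary[str(zone)] = {
            "mean": float(r.mean()),
            "std": float(r.std(ddof=1)) if r.size > 1 else 0.0,
            "mae": float(np.abs(r).mean()),
        }
    return summary
```

A mean near zero combined with a large Std or MAE points to an unbiased but imprecise model for that zone, which is the pattern the text describes for the mountainous mixed-forest zones.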

Additionally, the mean and standard deviation of the residuals for all vegetation zone biomass models were calculated, as shown in Figure 14. The chart indicates that the model exhibited minimal overall bias and performed well in prediction, with a standard deviation of 0.448. The residual distribution showed slight negative skewness: the negative extreme value of −1.397 was far larger in magnitude than the positive extreme value of 0.224, suggesting that the model tended to overestimate values in Zone IIF. The combination of residual points and confidence regions in the figure indicates that the model reliably predicted results for the vast majority of plots. The figure also pinpoints plots with residual anomalies, providing clear diagnostic evidence for in-depth analysis of model limitations caused by specific vegetation or site conditions and enabling targeted optimization.

In summary, the model exhibited overall unbiased performance in Shanxi Province, with significant regional variations in accuracy. The model’s robustness is optimal in areas with simple vegetation structures and uniform topography; however, prediction uncertainty increases substantially in mountainous mixed forests with complex vegetation types. When using this model for regional biomass estimation, spatial variations must be fully accounted for. For high-error regions, further model refinement or the incorporation of variables such as topography and tree species composition is necessary to enhance estimation accuracy.

3.3. Biomass Statistical Results for Sample Areas in Shanxi Province

Combining spatial cross-validation, the biomass model constructed based on the random forest algorithm yielded predicted biomass for each vegetation subzone.

Subzones IIA, IIC, and IIF achieved R² values above 0.93, indicating strong model explanatory power. Subzones IB, IID, and IIIB had R² values between 0.90 and 0.92, indicating a good fit. In contrast, IA and IIG had relatively lower fit, while IIE and IIIA had significantly lower R² values, indicating limited model explanatory power.

Subzone IIA had the lowest MAE and RMSE among all regions, indicating the highest prediction accuracy. Most regions, such as IB, IIB, and IID, had an MAE in the range of 200–300 g/m² and an RMSE in the range of 500–700 g/m², indicating controllable prediction bias and good accuracy. Subzones IIE and IIIA had significantly higher MAE and RMSE values, indicating lower prediction accuracy in these two regions.

In terms of the maximum, minimum, and mean values of measured and predicted biomass, regions dominated by forest ecosystems, such as IIA, IIB, and IIC, generally had maximum measured biomass exceeding 20,000 g/m², with means mostly between 900 and 1800 g/m², reflecting the strong biomass accumulation capacity of forest ecosystems. Regions dominated by grassland and scrub ecosystems, such as IA and IB, had lower maximum and mean measured biomass than forest regions, consistent with their simpler ecosystem structure. Regarding predicted biomass, the maximum predicted values in forest regions were mostly between 8000 and 10,000 g/m², while predicted values in grassland regions were higher than their measured values. This reflects the model’s ability to differentiate biomass potential across ecosystems, although the bias still requires optimization. Specific statistical results are shown in Table 7. Biomass distribution maps for selected study areas are shown in Figure 15.

4. Discussion

4.1. Factors Influencing Differences in Biomass Model Performance

Numerical analysis of biomass model estimation results, including statistical evaluation of MAE, MBE, residual distribution histograms, and their standard deviations, reveals that ecosystem type significantly influenced model performance. Forest ecosystems demonstrated superior fitting capabilities compared to grassland and shrub ecosystems. This discrepancy stemmed from the complex structural characteristics of forest vegetation. Its pronounced vertical stratification and strong tree height–canopy spread correlation could be effectively captured by UAV LiDAR and multispectral data. Conversely, grassland and shrub vegetation posed challenges for biomass estimation based on structural features due to low canopy overlap, leading to systematic overestimation that requires further optimization.

Topography and human activities represented another influencing factor. Mountainous and hilly areas exhibited high accuracy due to the clear identification of vegetation-terrain relationships in point cloud data. Basin regions showed mixed results: the Central Basin maintained good model fitting due to minimal disturbance, while the Linfen–Yuncheng Basin experienced reduced accuracy owing to vegetation fragmentation caused by agricultural and urban expansion. Notably, MAE and RMSE significantly increased in the Northern Taihangshan, where expressway corridors disrupted natural vegetation structures, weakening the correlation between features and biomass.

Additionally, this study found that in vegetation zones dominated by grasslands and shrubs, biomass models tended to overestimate low biomass values. During data collection, in areas with low canopy height and sparse vegetation, such as grasslands and shrublands, LiDAR point clouds were susceptible to strong interference from ground reflections. This led to a slight but systematic overestimation of extracted vegetation height parameters, which, in turn, affected the estimation results of biomass models. Beyond data source influences, model-level factors contributed to this issue. The generalized mixed-species equation employed failed to adequately distinguish between the fundamental differences in biomass accumulation mechanisms and allocation ratios between herbaceous plants and woody shrubs. This resulted in a systematic bias toward herbaceous-dominated ecosystems within the low biomass range.

4.2. Advantages of Constructing Biomass Models Based on Vegetation Zoning

Based on Shanxi’s climatic zoning, this study integrated digital elevation models and vegetation distribution maps to conduct spatial overlay analysis, delineating 11 vegetation subzones. This systematic approach ensured scientific rigor and representativeness in sample selection. The 70 selected plots covered the three major climate types and the typical roadside vegetation across the province. The resulting sample library effectively captured vegetation growth characteristics and biomass variations under different climatic conditions, thus reducing sampling bias. Climatic zoning provided an ecological foundation for model optimization by minimizing interference from biomass fluctuations caused by climatic variations, thereby enhancing both model adaptability and generalization capability.

Compared to traditional models, the biomass model in this study, through spatial 7:3 partitioning and 5-fold spatial cross-validation, avoided the spatial autocorrelation error inherent in random sample splitting, making the model evaluation results more reliable. Furthermore, the joint tree height–crown width modeling leveraged the synergistic relationship between vertical growth and photosynthetic area, achieving 30–50% accuracy improvement over single-feature models and overcoming the limitations of individual parameter explanatory power.

4.3. Research Significance, Limitations, and Prospects

Roadside vegetation serves as a vital carbon sink within expressway ecosystems. The biomass model developed in this study provides data support for quantifying carbon storage, laying a scientific foundation for ecological restoration and low-carbon transportation planning in Shanxi Province. The sample repository established based on climatic zoning encompasses representative mid-latitude roadside vegetation types in the Northern Hemisphere, with its methodological framework offering a reference for biomass monitoring in similar geographical regions.

This model has several limitations: during multi-source data integration, the long-tail distribution of spatial matching errors is concentrated in specific terrain areas, such as high-density building zones. This is the primary reason for the model’s reduced performance in these regions. Future research will explore more precise matching methods to optimize matching results in these specific areas. Biomass growth patterns vary across tree species, and existing models may exhibit prediction biases for single-species scenarios, failing to adequately account for species specificity. Using generalized conifer–broadleaf mixed forest models to estimate biomass in areas with complex species composition yields insufficient assessment accuracy for specific ecosystems. Future research should focus on establishing models that encompass diverse ecosystems, including herbaceous, shrub, and tree communities, while integrating hyperspectral remote sensing data with complementary field survey data to optimize the tree species classification system. This approach will enable calibration or tiered application of generalized models. In areas of intense human disturbance, vegetation fragmentation alters natural growth patterns, degrading model performance. To enhance model adaptability, non-vegetation parameters like land-use classification and human disturbance intensity should be incorporated. Furthermore, due to computational constraints, variables such as spectral characteristics, topographic factors, annual precipitation, and mean annual temperature were not directly included in model inputs. Subsequent research must expand the spatial and temporal dimensions of the sample database to enable dynamic monitoring and develop spatiotemporal integrated models for dynamic prediction and trend analysis.

5. Conclusions

This study focused on the roadside vegetation of expressways in Shanxi Province, integrating UAV multi-source data, the random forest algorithm, and climatic zoning technology to construct a biomass model and establish a sample library, achieving high-precision estimation of roadside vegetation biomass.

(1) Based on the climatic gradient of temperate semi-arid, warm temperate semi-arid, and warm temperate semi-humid zones in Shanxi Province, combined with DEM and vegetation distribution maps, 11 vegetation subzones were delineated and 70 sample plots were selected. The sample data encompass UAV LiDAR 3D parameters and multispectral spectral features, providing reliable data support for model training and serving as a reference for regions at similar latitudes in the Northern Hemisphere.

(2) In the random forest algorithm optimized with spatial cross-validation, the tree height–crown width joint model performed best, demonstrating strong generalization capability across different climatic vegetation zones. Accuracy and fit were optimal in forest ecological zones, with R² exceeding 0.93. Grassland and scrub areas also maintained high accuracy, with R² of approximately 0.9, confirming the technical feasibility.

(3) Providing data and technical support for expressway ecological protection and low-carbon development, the model and sample library constructed in this research can be directly applied to the dynamic monitoring of roadside vegetation biomass on expressways under different climatic conditions in Shanxi Province. They offer precise data for carbon stock assessment and ecological restoration, promoting the synergistic development of smart transportation and ecological protection, and hold significant practical value for the green and sustainable development of the transportation sector.

Author Contributions

Conceptualization, Y.Y. and Y.G.; methodology, Y.Y.; software, Y.G. and S.L.; validation, Y.G. and S.L.; formal analysis, B.Z.; investigation, Y.G., J.Z., S.L., B.Z., H.G., Y.C., and H.H.; resources, Y.Y.; data curation, Y.G.; writing—original draft preparation, Y.Y.; writing—review and editing, Y.G.; visualization, Y.G.; supervision, X.L.; project administration, X.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Shanxi Natural Science Foundation of China, grant number 202303011221001.

Data Availability Statement

The data presented in this article are publicly available in Mendeley Data, accessed on 7 January 2026. The address is [https://data.mendeley.com/datasets/n7b2hrf4ct/1].

Conflicts of Interest

Authors Ying Yang, Jiapen Zhang, Hantian Guo and Ben Zhao were employed by the company Shanxi Intelligent Transportation Laboratory Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. (a) Shanxi province in China, (b) expressway, and (c) vegetation coverage.

Figure 2. Main vegetation types: (a) coniferous forest, (b) broadleaf forest, (c) mixed conifer–deciduous forest, (d) shrubland, (e) shrubbery, (f) grassland, (g) meadow, and (h) cultivated vegetation.

Figure 3. (a) Research sample plot distribution, (b) Pianguan service area vegetation, (c) Lingqiu service area vegetation, (d) Wanrong service area vegetation, and (e) Jincheng service area vegetation.

Figure 4. Data examples: (a) LiDAR data and (b) multispectral data.

Figure 5. Manual field measurement operations.

Figure 6. CHM model generation example.

Figure 7. Multi-scale segmentation results.

Figure 8. Spatial connection principle.

Figure 9. Random forest algorithm principle.

Figure 10. (a) Spatial connection results for Youyu service area, (b) unmatched points, and (c) matched points.

Figure 11. Distribution of spatial matching errors.

Figure 12. Comparison of measured and predicted values across vegetation zones: (a) IA, (b) IB, (c) IIA, (d) IIB, (e) IIC, (f) IID, (g) IIE, (h) IIF, (i) IIG, (j) IIIA, and (k) IIIB.

Figure 13. Model residual histograms for different vegetation zones: (a) IA, (b) IB, (c) IIA, (d) IIB, (e) IIC, (f) IID, (g) IIE, (h) IIF, (i) IIG, (j) IIIA, and (k) IIIB.

Figure 14. Vegetation zone residual mean statistics chart.

Figure 15. Example biomass map for part of the study area: (a) plot distribution, (b) Youyu Service Area, (c) Shenchi Service Area, (d) Lishixi Service Area, (e) Yongji Service Area, (f) Yanggao Service Area, (g) Yuxian Service Area, (h) Pingyao Service Area, (i) Huguan Service Area.

Table 1. Sensor parameter settings.

Module	Parameters
D-LiDAR2000	Accuracy	5 cm@50 m	Point Frequency	240 kpts/s
	Range	190 m@10% Reflectivity @100 klx	Number of Returns	3 Returns
	Laser Class	Class 1	Return Intensity	8 bits
	Wavelength	905 nm	Ranging Accuracy	±2 cm
	Horizontal FOV	70.4°	Vertical FOV	4.5°/77.2°
	Roll/Pitch Accuracy	0.006°	Heading Accuracy	0.03°
D-MSPC2000	Resolution	1280 × 960	Effective Pixels	1.2 Megapixels
	Focal Length	5.2 mm	Sensor Size	4.8 mm × 3.6 mm
	Aperture	F/2.2	Field of View	HFOV: 49.6°, VFOV: 38°
	Capture Speed	1 frame/s	Quantization Bits	12 bit
	Ground resolution	GSD: 8 cm/pix	Typical Width	110 m × 83 m@AGL = 120 m

Table 2. Allometric growth equations for main stand types in the study area, of the form $W = a {(D^{2} H)}^{b}$.

Mixed Species (Group)	Organ	a	b	$r^{2}$
Coniferous Forest	Stem	0.0354	0.9163	0.99
	Branch	0.0141	0.8421	0.93
	Leaf	0.0178	0.7669	0.96
	Root	0.0581	0.7169	0.99
Broadleaf Forest	Stem	0.0570	0.8642	0.95
	Branch	0.0134	0.9332	0.85
	Leaf	0.0023	0.2399	0.88
	Root	0.0128	1.0502	0.61
Mixed Coniferous–Broadleaf Forest	Stem	0.0768	0.8563	0.88
	Branch	0.0085	0.8701	0.81
	Leaf	0.0219	0.6526	0.84
	Root	0.0276	0.8047	0.80
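
The coefficients in Table 2 can be applied organ by organ and summed to give a whole-tree estimate. The following is a minimal Python sketch, assuming D is diameter at breast height and H is tree height (the table itself does not define the symbols or state their units). The function names and the example tree are hypothetical and are not taken from the paper.

```python
# Coefficients (a, b) copied from Table 2 for W = a * (D^2 * H)^b.
COEFFS = {
    "coniferous": {
        "stem": (0.0354, 0.9163), "branch": (0.0141, 0.8421),
        "leaf": (0.0178, 0.7669), "root": (0.0581, 0.7169),
    },
    "broadleaf": {
        "stem": (0.0570, 0.8642), "branch": (0.0134, 0.9332),
        "leaf": (0.0023, 0.2399), "root": (0.0128, 1.0502),
    },
    "mixed": {
        "stem": (0.0768, 0.8563), "branch": (0.0085, 0.8701),
        "leaf": (0.0219, 0.6526), "root": (0.0276, 0.8047),
    },
}


def organ_biomass(a, b, d, h):
    """Evaluate W = a * (D^2 * H)^b for one organ."""
    return a * (d * d * h) ** b


def tree_biomass(stand_type, d, h):
    """Sum stem, branch, leaf and root biomass for one tree."""
    return sum(organ_biomass(a, b, d, h) for a, b in COEFFS[stand_type].values())


# Hypothetical tree: D = 12, H = 8 (units as assumed by the source equations).
example_total = tree_biomass("coniferous", d=12.0, h=8.0)
print(f"whole-tree biomass: {example_total:.3f}")
```

Keeping the coefficients in a lookup keyed by stand type makes it straightforward to switch equations per plot once the stand type is known.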

Table 3. Stand parameters in the study area.

Vegetation Subzone	Plots	Statistic	Elevation (m)	Mean Stand Height (m)	Mean Crown Width (m)	Coverage	Biomass (g/m²)
IA	10	max	1610.982	11.448	3.050	0.413	21,160.411
		min	963.190	3.525	2.364	0.136	10.962
		mean	1200.364	6.716	2.823	0.251	1110.548
IB	5	max	1557.540	5.639	3.114	0.342	22,320.375
		min	1152.380	4.301	2.683	0.126	10.786
		mean	1368.386	5.053	2.983	0.213	1060.713
IIA	7	max	1327.790	10.315	3.795	0.337	22,690.273
		min	920.794	3.357	2.599	0.125	20.016
		mean	1065.283	5.107	2.860	0.214	870.635
IIB	3	max	1390.040	6.179	3.107	0.371	22,670.801
		min	958.746	4.239	2.630	0.244	10.997
		mean	1148.304	5.377	2.947	0.312	1150.744
IIC	2	max	1669.640	10.339	3.663	0.390	22,990.878
		min	1363.130	4.214	2.699	0.213	20.266
		mean	1499.903	7.277	3.181	0.302	1870.519
IID	10	max	1019.020	8.401	3.737	0.329	24,650.505
		min	469.043	4.111	2.421	0.164	20.336
		mean	816.278	6.535	2.775	0.257	1030.616
IIE	8	max	1252.390	9.355	3.979	0.335	24,020.554
		min	638.006	3.668	2.634	0.205	20.019
		mean	1039.313	5.033	2.701	0.241	1250.020
IIF	2	max	1231.450	10.237	3.414	0.375	24,210.542
		min	674.275	2.032	2.385	0.254	20.075
		mean	952.863	6.144	3.043	0.295	1530.112
IIG	7	max	1299.450	14.277	3.405	0.483	24,660.583
		min	874.275	6.097	2.789	0.248	20.035
		mean	1001.507	9.159	3.043	0.329	1780.609
IIIA	9	max	953.683	12.105	2.994	0.332	24,560.321
		min	319.622	4.248	2.422	0.164	20.365
		mean	529.451	7.398	2.684	0.241	1210.958
IIIB	7	max	1129.900	9.863	3.041	0.416	24,870.739
		min	306.506	5.401	2.807	0.253	20.024
		mean	658.669	7.232	2.911	0.326	1380.556

Note: IA is the northern Shanxi mountain–hill–basin shrub–grassland subzone; IB is the northwestern Shanxi loess hill scrub–grassland subzone; IIA is the Hengshan–Wutaishan forest steppe deciduous broadleaf forest subzone; IIB is the western Shanxi loess hill deciduous broadleaf forest scrub subzone; IIC is the Lvliangshan deciduous coniferous and deciduous broadleaf forest subzone; IID is the central basin deciduous broadleaf forest subzone; IIE is the northern–central Taihangshan deciduous broadleaf forest scrub–grass vegetation subzone; IIF is the southern Lvliangshan coniferous–broadleaf forest scrub subzone; IIG is the Qinhe basin southeastern Shanxi coniferous–broadleaf forest scrub subzone; IIIA is the Linfen–Yuncheng basin deciduous broadleaf forest scrub subzone; IIIB is the southern Shanxi Zhongtiaoshan deciduous broadleaf forest scrub subzone.

Table 4. Individual tree parameters extracted by multi-scale segmentation.

		Crown Width (m)			Crown Area (m²)
Vegetation Subzone	Tree Count	Max	Min	Mean	Max	Min	Mean
IA	10,756	3.645	2.567	3.106	10.435	5.175	7.805
IB	62,317	3.327	2.996	3.162	8.694	7.050	7.872
IIA	150,345	3.334	3.002	3.168	8.730	7.078	7.904
IIB	71,053	3.681	2.546	3.114	10.642	5.091	7.867
IIC	31,250	4.003	2.331	3.167	12.585	4.268	8.426
IID	176,301	4.017	2.044	3.031	12.673	3.281	7.977
IIE	209,531	4.255	2.641	3.448	14.220	5.478	9.849
IIF	94,201	3.997	2.214	3.106	12.548	3.850	8.199
IIG	132,613	4.128	2.864	3.496	13.383	6.442	9.913
IIIA	152,201	4.202	2.516	3.359	13.868	4.972	9.420
IIIB	119,584	4.316	3.001	3.659	14.630	7.073	10.852

Table 5. Random forest model parameter settings.

Parameter	Effect
n_estimators (Number of Decision Trees)	More trees improve stability but increase computational cost
max_depth (Maximum Depth of a Single Tree)	Limits tree depth to control overfitting (often needs restricting when there are many features)
min_samples_split (Minimum Samples to Split a Node)	Larger values reduce the risk of overfitting
min_samples_leaf (Minimum Samples at a Leaf Node)	Larger values help avoid overfitting
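
The parameter names in Table 5 coincide with the constructor arguments of scikit-learn's `RandomForestRegressor`; whether the authors used that library is an assumption. The sketch below shows how the four parameters might be set for a model with tree height and crown width as predictors. Table 5 lists no values, so every value here is a placeholder, and the training data are synthetic rather than the paper's measurements.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: tree height (m) and crown width (m) as predictors.
rng = np.random.default_rng(0)
height = rng.uniform(2.0, 14.0, size=200)
crown = rng.uniform(2.0, 4.3, size=200)
X = np.column_stack([height, crown])

# Hypothetical response, used only so the example can be fitted.
y = 50.0 * height * crown + rng.normal(0.0, 20.0, size=200)

# Placeholder hyperparameter values (not reported in Table 5).
model = RandomForestRegressor(
    n_estimators=200,      # number of decision trees
    max_depth=10,          # maximum depth of a single tree
    min_samples_split=4,   # minimum samples needed to split a node
    min_samples_leaf=2,    # minimum samples required at a leaf
    random_state=0,        # fixed seed for reproducibility
)
model.fit(X, y)

pred = model.predict(X[:5])
print(pred)
```

In practice these values would be tuned per vegetation subzone, for example with a grid search scored under the spatial cross-validation described in the abstract.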

Table 6. Average matching rate by vegetation subzone.

Vegetation Subzone	Plots	Avg. Matching Rate (%)	Vegetation Subzone	Plots	Avg. Matching Rate (%)
IA	10	75.61	IIE	6	82.01
IB	6	74.10	IIF	2	79.97
IIA	7	76.13	IIG	9	78.56
IIB	3	73.98	IIIA	10	77.97
IIC	2	79.52	IIIB	6	76.43
IID	12	80.59
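
To make the matching rate in Table 6 concrete, the sketch below counts a LiDAR-derived tree as matched when its nearest multispectral crown centre lies within a distance threshold, then reports the matched share as a percentage. This definition, the threshold, and the coordinates are assumptions for illustration. The search is a brute-force nearest-neighbour loop using only the standard library; a BallTree query would return the same nearest neighbours, only faster on large point sets.

```python
import math


def matching_rate(lidar_xy, ms_xy, max_dist):
    """Percentage of LiDAR trees whose nearest multispectral crown
    centre lies within max_dist (same units as the coordinates)."""
    if not lidar_xy:
        raise ValueError("lidar_xy must not be empty")
    matched = 0
    for lx, ly in lidar_xy:
        # Distance to the closest multispectral crown centre.
        nearest = min(
            (math.hypot(lx - mx, ly - my) for mx, my in ms_xy),
            default=math.inf,
        )
        if nearest <= max_dist:
            matched += 1
    return 100.0 * matched / len(lidar_xy)


# Hypothetical coordinates in metres.
lidar_trees = [(0.0, 0.0), (10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
ms_crowns = [(0.5, 0.0), (10.0, 10.2), (50.0, 50.0)]
rate = matching_rate(lidar_trees, ms_crowns, max_dist=1.0)
print(f"matching rate: {rate:.2f}%")
```

This one-directional search lets several LiDAR trees match the same crown; a one-to-one assignment would need an extra deduplication step.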

Table 7. Biomass statistics for vegetation subzones in Shanxi Province.

Vegetation Subzone	Statistic	Measured Biomass (g/m²)	Predicted Biomass (g/m²)	$R^{2}$	MAE (g/m²)	RMSE (g/m²)
IA	max	21,160.411	8757.023	0.897	293.421	667.521
	min	10.962	285.129
	mean	1156.413	2632.376
IB	max	22,320.375	5592.362	0.909	224.384	502.134
	min	17.860	76.460
	mean	1060.713	3737.991
IIA	max	22,690.273	9102.223	0.934	151.621	381.462
	min	20.016	54.416
	mean	926.665	2004.021
IIB	max	22,670.801	9373.281	0.927	277.420	626.151
	min	10.997	176.818
	mean	1184.539	1764.838
IIC	max	22,990.878	13,286.475	0.933	264.047	609.103
	min	20.266	81.309
	mean	2144.668	2454.374
IID	max	24,650.505	9233.645	0.914	237.940	512.064
	min	20.336	68.355
	mean	1030.616	930.645
IIE	max	24,020.554	9841.525	0.878	470.805	940.259
	min	20.019	221.519
	mean	1250.020	1570.762
IIF	max	24,210.542	8522.526	0.934	236.966	623.176
	min	20.075	86.641
	mean	1530.112	1835.935
IIG	max	24,660.583	8840.968	0.908	422.936	831.107
	min	20.035	90.752
	mean	1814.072	1378.156
IIIA	max	24,560.321	9373.827	0.845	494.448	918.550
	min	20.365	66.072
	mean	1210.958	2105.931
IIIB	max	24,870.739	9373.029	0.915	341.199	694.824
	min	20.024	130.687
	mean	1380.556	3275.485
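
The three accuracy measures reported in Table 7 follow standard definitions, which the short Python sketch below computes from paired measured and predicted values. It assumes the conventional formulas, with R² defined as one minus the ratio of residual to total sum of squares; the function name and sample values are hypothetical and are not drawn from the paper's data.

```python
import math


def regression_metrics(measured, predicted):
    """Return (R^2, MAE, RMSE) for paired measured/predicted values."""
    n = len(measured)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(measured) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in measured)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all measured values are equal")
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(y - p) for y, p in zip(measured, predicted)) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse


# Hypothetical biomass values in g/m².
measured = [900.0, 1200.0, 1500.0, 2100.0]
predicted = [950.0, 1100.0, 1600.0, 2000.0]
r2, mae, rmse = regression_metrics(measured, predicted)
print(f"R2={r2:.3f}  MAE={mae:.1f}  RMSE={rmse:.1f}")
```

Because RMSE squares each residual before averaging, it is never smaller than MAE and grows faster when a few plots have large errors, which is consistent with the RMSE column exceeding the MAE column for every subzone in Table 7.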

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, Y.; Gao, Y.; Zhang, J.; Liang, S.; Zhao, B.; Guo, H.; Cai, Y.; Hu, H.; Lian, X. UAV Remote Sensing-Based Random Forest Modeling of Expressway Vegetation Biomass and Sample Library Construction. Land 2026, 15, 401. https://doi.org/10.3390/land15030401

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
