1. Introduction
Global ecological changes and carbon cycle issues have become major concerns for researchers and policymakers. As vital components of terrestrial ecosystems, forests play key roles in carbon regulation, climate moderation, biodiversity conservation, and ecosystem services [
1,
2]. Forest structural parameters are crucial for assessing ecosystem functions, with forest height being a key indicator widely used in biomass estimation, carbon storage modeling, and global carbon cycle simulations [
3,
4]. Defined as the “mean height of trees within a stand/plot” [
5], forest height reflects vertical heterogeneity and supports forest resource assessments and climate change studies. However, due to forest heterogeneity, topographic variability, and external disturbances, accurate forest height estimation remains a global challenge.
Current methods for forest height estimation rely on both ground-based and remote sensing approaches. Terrestrial surveys provide accurate measurements but are labor-intensive, spatially limited, and in the case of destructive ground sampling, disruptive to forest ecosystems. In contrast, remote sensing enables efficient mean monitoring over large areas. Airborne and UAV-mounted LiDAR systems can capture high-resolution canopy height data [
6,
7,
8], but their use is often constrained by cost and limited coverage. For instance, airborne LiDAR systems and Unmanned Aerial Vehicle (UAV)-mounted laser scanners [
9] can capture high-density discrete point clouds, but their operational scalability faces constraints in large-area forest height estimation. Meanwhile, spaceborne LiDAR missions such as GEDI provide valuable global forest structure data, though spatial gaps remain [
10]. Optical imagery, rich in spectral and texture information, is also widely used for forest height retrieval [
11,
12]. Techniques such as tree shadow analysis leverage geometric models using sun angle and image geometry [
13,
14], but are limited by illumination sensitivity and canopy occlusion in dense forests [
15]. Because optical data lack direct height information, forest height must be inferred via statistical models [
16]. Synthetic Aperture Radar (SAR) offers the advantage of penetrating vegetation and operating under all-weather conditions. Polarimetric SAR (PolSAR) captures scattering and texture information for structure retrieval [
17,
18], while advanced X-band PolInSAR has demonstrated improved height estimates [
19]. However, SAR-based approaches demand complex calibration and face scalability issues. Interferometric SAR (InSAR) and PolInSAR, although capable of high accuracy, are often constrained by temporal decorrelation and high data requirements, restricting their use to localized studies [
20].
Additionally, with advances in machine learning, algorithms like Random Forest and Convolutional Neural Networks (CNNs) are increasingly used in remote sensing image analysis [
21,
22], enabling efficient multi-source data integration and model optimization. For instance, Zhang et al. combined satellite data with feature selection and machine learning methods, finding that Boruta and XGBoost achieved the best forest height estimation in Baoding, China (R
2 = 0.67, RMSE = 2.2 m) [
23]. Similarly, Schwartz et al. integrated GEDI LiDAR, Sentinel-1 SAR, and Sentinel-2 optical data with deep learning to produce high-resolution canopy height maps in France’s Landes Forest [
24].
Although forest height estimation has made considerable progress, major methodological hurdles remain in remote sensing-based approaches. Various remote sensing studies analyze each data source in isolation and often overlook the complexities of fusing heterogeneous datasets with differing spatial resolutions, revisit frequencies, and observation geometries [
25,
26]. Moreover, existing fusion frameworks lack generalizability and robustness, particularly in heterogeneous environments or for short-stature forests where signal sensitivity differs [
27,
28]. Among a few recent studies that have explored the integration of multi-source data through machine learning, efforts have primarily concentrated on optical and LiDAR integration. In contrast, the application of InSAR data for forest height estimation, particularly coherence and interferometric phase, remains underexplored, especially at regional levels. Additionally, the interaction mechanisms between InSAR-derived features and forest structural parameters are not yet well understood, resulting in uncertainty in model performance. Addressing this critical gap requires the development of a robust and scalable framework that systematically integrates multi-source remote sensing datasets—optical, SAR, and LiDAR—to improve the accuracy of forest height estimation, especially in regions with fragmented terrain and limited ground measurements, such as Jiangsu Province.
Building upon these identified gaps, this study proposes a scalable multi-modal data integration framework for forest height estimation in Jiangsu Province. The province’s subtropical monsoon climate and complex landscape present both challenges and opportunities for remote sensing applications. Our framework combines Sentinel-1 SAR data, GEDI L2A footprint data, and Sentinel-2 multispectral imagery, with particular emphasis on the contribution of InSAR features, which capture valuable information on forest vertical structure. Unlike traditional forest height inversion methods that primarily rely on PolSAR, PolInSAR, or TomoSAR, our approach incorporates InSAR coherence and phase features directly into the modeling process. All multi-modal features are subsequently used to train machine learning models for forest height estimation.
Our specific objectives are to: (1) assess the contribution of InSAR coherence and phase features in forest height estimation; (2) evaluate the effectiveness of multi-modal integration for forest height inversion; (3) provide a practical solution for large-scale forest height estimation under complex environmental conditions, especially in regions lacking ground-based measurements.
3. Methods
This study proposed a sparse forest height estimation model leveraging multi-modal remote sensing datasets, structured into three core modules as illustrated in
Figure 2—data preprocessing and analysis, multi-source feature extraction, and model development with performance evaluation.
The initial module undertook comprehensive preprocessing of GEDI (a scientific project led by NASA) LiDAR data to derive a Digital Elevation Model (DEM) from ground returns and a Digital Surface Model (DSM) from canopy returns, enabling the calculation of the Canopy Height Model (CHM) via their difference. Simultaneously, Sentinel-2 multispectral imagery undergone atmospheric and radiometric corrections to ensure consistency for vegetation parameter derivation. Sentinel-1 SAR data were subjected to polarization calibration and precise co-registration, with InSAR techniques applied to extract interferometric phase and coherence metrics that provide complementary vertical and structural forest information.
Building on these processed datasets, the second module focused on extracting a comprehensive suite of forest structural features. Vegetation indices and biophysical parameters derived from Sentinel-2 characterized canopy vigor and density, while dual SAR data contributed polarization-sensitive backscatter coefficients and incidence angles. InSAR-derived interferometric phase and coherence enhanced the feature set by capturing vertical canopy structure, reflecting the elevation of effective scattering centers within the canopy and the decorrelation effects associated with vegetation height. These multi-source variables collectively formed a multidimensional representation of forest architecture, sensitive to canopy height and biomass.
The final module involved constructing a regression model using Random Forest to predict canopy height from the integrated feature set. Various feature combinations were systematically evaluated, with hyperparameter tuning conducted via cross-validation to optimize model robustness and generalizability. Model accuracy was assessed through metrics such as Root Mean Square Error (RMSE) and coefficient of determination (R2). Furthermore, the model’s adaptability was scrutinized against variations in forest types and terrain complexity to assess its resilience in heterogeneous ecological settings. This integrative framework advanced the precision of forest height estimation in sparse forest environments and underscores the efficiency of multi-modal remote sensing data fusion for ecological monitoring.
3.1. Data Processing
To develop forest height estimation model, this study combined three types of remote sensing data: GEDI LiDAR data, Sentinel-2A multispectral imagery, and Sentinel-1 SAR imagery.
3.1.1. GEDI Data Processing
The GEDI sensor operated in the near-infrared band (1064 nm) and was highly sensitive to the reflectance differences between green vegetation and the underlying surface. Its ability to penetrate sparse canopies allowed for precise ground elevation measurements beneath vegetation, making it highly valuable in forest ecosystem studies and carbon stock estimation. The spatial footprint of GEDI pulses was approximately 25 m in diameter, providing detailed vertical structure information.
This study primarily employed the GEDI Level 2A ground and vegetation height products, with the RH98 metric adopted as the principal indicator of canopy height. RH98 represented the relative height corresponding to 98% of the waveform energy, and was widely recognized for its balance between sensitivity to canopy top detection and robustness against waveform noise, making it particularly suitable for height estimation in sparse forest environments. To enhance data reliability, low-confidence observations flagged in the GEDI product were excluded, thereby minimizing the influence of degraded waveforms and outliers. Given the inherent discreteness of GEDI waveform sampling, spatial gaps might persist in certain regions. To mitigate these effects and improve representativeness, regional averaging was applied, although this procedure may attenuate fine-scale structural heterogeneity. The resulting averaged RH98 values provided a high-precision reference for forest height across the study area, supporting subsequent regression modeling and enabling the generation of representative canopy height estimates [
30].
Beyond canopy top measurements, GEDI data included surface elevation information, which facilitates the derivation of forest height through elevation differencing. Specifically, the DSM and DEM extracted from GEDI were differenced to compute the CHM, providing per-pixel estimates of canopy height. This approach effectively captured the spatial variability of vertical forest structure. However, its accuracy might be influenced by terrain complexity and errors in ground return measurements, which should be considered when interpreting the derived CHM.
3.1.2. Multispectral Data Processing
Sentinel-2A offered multispectral imagery with spatial resolutions ranging from 10 m (visible and near-infrared bands) to 20 m (red-edge and shortwave infrared bands) and up to 60 m (coastal aerosol and cirrus bands). The inclusion of red-edge and near-infrared bands made it suitable for calculating vegetation indices and biophysical parameters.
Sentinel-2A multispectral imagery was employed primarily to complement GEDI observations by providing spectral indicators of canopy vigor and density, thereby enriching the characterization of forest structural attributes. Unlike LiDAR, Sentinel-2 did not directly capture vertical canopy profiles. However, its high spectral resolution and red-edge bands enabled robust derivation of vegetation indices and biophysical parameters that are indirectly related to canopy height. In our study, the Sentinel-2 image collection was filtered to include data from 1 May to 31 December 2023, corresponding to the leaf-on season in the study area. This period was intentionally selected to ensure a strong and stable vegetation signal, which is essential for accurate vegetation analysis, particularly in regions dominated by deciduous species such as poplar. To ensure data reliability, only scenes with a cloud fraction below 10% were retained, reducing contamination effects and improving consistency in feature extraction.
Vegetation indices such as the Ratio Vegetation Index (RVI), Normalized Difference Vegetation Index (NDVI), and Enhanced Vegetation Index (EVI) were calculated to represent canopy greenness and density [
31,
32,
33,
34]. To ensure the indices were applied specifically to forested regions, a vegetation mask was generated using the ESA WorldCover v200 dataset [
35]. NDVI time-series validation using the threshold method [
36] confirmed that this mask was consistent with forest cover, supporting its use for isolating target regions. While this masking strategy effectively excluded non-forest areas, potential misclassifications in transitional landscapes remained a source of uncertainty.
In addition to vegetation indices, biophysical variables including Leaf Area Index (LAI), chlorophyll content (Cab), and Fractional Vegetation Cover (FVC) were estimated [
37]. LAI and FVC provided key descriptors of canopy density and coverage, whereas Cab reflected vegetation physiological status. The inclusion of EVI, which integrated the blue band to mitigate soil background and saturation effects, is particularly valuable in dense vegetation settings, where NDVI often exhibits reduced sensitivity [
38]. These parameters were integrated into the feature set to strengthen the link between spectral signals and canopy structural properties. Nevertheless, LAI retrieval from optical indices could be affected by saturation at high canopy densities, even when using EVI, which should be acknowledged as a limitation of multispectral-based predictors.
3.1.3. SAR Polarization and Interferometric Processing
Sentinel-1 SAR data were acquired using C-band (5.405 GHz) dual-polarized SAR sensors, which were capable of collecting high-resolution imagery under all-weather and all-day conditions. The spatial resolution of Sentinel-1 Interferometric Wide (IW) mode data typically ranged from 5 to 20 m, depending on processing level and look configuration.
This study employed Sentinel-1 SAR data, including Ground Range Detected (GRD) and Single Look Complex (SLC) products, to extract polarization- and interferometry-based structural features relevant to forest height estimation. All data were acquired in IW swath mode, which offered a balance between spatial resolution and coverage, making it well-suited for forest monitoring applications [
39].
The GRD dataset provided amplitude information in ground range projection and was used to derive polarization-sensitive backscatter coefficients. VV and VH channels were selected due to their complementary sensitivities: VV primarily reflected canopy vertical structure, while VH enhanced sensitivity to multiple scattering and surface roughness. To improve data fidelity, preprocessing steps such as thermal noise removal, radiometric calibration, and terrain correction were applied. Adaptive filtering [
40] was also used to suppress residual noise while preserving object boundaries and texture details. The calibrated backscatter coefficients (σ
0) were expressed in decibel (dB) scale to facilitate interpretation and comparison. While such preprocessing is standard, its relevance lay in ensuring radiometric accuracy and geolocation consistency. The acquisition window (1–10 January 2023) was chosen to minimize temporal variability, but residual influences from soil moisture and surface roughness might still confound canopy-related backscatter signals, representing a limitation of PolSAR-based predictors.
SLC data preserved both amplitude and phase components, thereby enabling interferometric analyses such as high-resolution DEM and ground deformation monitoring. Unlike GRD products, the phase coherence in SLC imagery allowed retrieval of elevation and displacement information with sub-wavelength sensitivity. In this study, Sentinel-1B SLC datasets were employed, with acquisitions between 1–10 January 2023 serving as master images and those from 13–20 January 2023 as slave images. The relatively short temporal and perpendicular baselines of these acquisitions were chosen deliberately to maximize interferometric coherence and minimize decorrelation. Both VV and VH polarization channels were utilized to assess their relative contributions, as VV tends to emphasize surface scattering while VH is more responsive to vegetation and volumetric effects.
Data preprocessing was aimed at ensuring radiometric fidelity and geometric accuracy. Radiometric calibration converted raw pixel intensities into normalized backscatter (σ0), and geometric corrections aligned imagery with geospatial coordinates to enable inter-scene comparability. Noise suppression, including temporal and spatial filtering, further stabilized interferometric measurements. These steps were crucial for maintaining robust phase information, particularly in heterogeneous land-cover areas where coherence is easily degraded.
The interferometric workflow comprised image co-registration, interferogram formation, flat-earth phase removal, phase filtering, coherence estimation, phase unwrapping, and terrain correction. Image co-registration was particularly critical, as sub-pixel alignment ensures that the differential phase truly represents surface displacements rather than geometric misalignments. A third-order polynomial orbit interpolation model was employed, coupled with bilinear interpolation during image resampling, reflecting a balance between computational efficiency and geometric precision. Debursting was applied to eliminate discontinuities inherent in TOPS-mode acquisitions, yielding seamless phase information suitable for interferometric analysis.
To enhance phase quality, multilooking was applied with three looks in the range and one in the azimuth direction, resulting in an average ground range square pixel size of 13.26 m2. This configuration was selected to reduce speckle while maintaining sufficient resolution for deformation analysis. Interferogram filtering employed the Goldstein phase filter, which was widely recognized for improving fringe visibility in low-coherence areas by attenuating high-frequency noise while retaining geophysical signal. Flat-earth phase contributions, which arose from orbital geometry and Earth curvature, were removed using a fifth-degree polynomial model based on 501 estimation points, thereby isolating deformation-related phase components.
Interferometric coherence was estimated with a range window of 10 pixels and an azimuth window of 3 pixels. The coherence coefficient γ was defined as follows:
where
n1 and
n2 represented the complex pixel values of the master and slave images, respectively. E denoted the statistical expectation, and the superscript * denoted the complex conjugate operation. The coherence coefficient ranged from 0 to 1, with higher values indicating better interferometric phase stability.
To remove the topographic phase component, a DEM from the Shuttle Radar Topography Mission (SRTM) with 3-arcsecond resolution was used. DEM resampling was performed using bicubic interpolation to ensure geometric fidelity and accurate phase modeling during terrain correction.
This comprehensive InSAR processing strategy, with detailed parameter settings, ensured the generation of high-quality interferometric products and provided a robust foundation for subsequent deformation analysis.
3.2. Forest Height Regression Models
In this study, multiple data combination-based modeling strategies were developed to address the task of forest height inversion, leveraging GEDI LiDAR observations as reference and auxiliary information. Modeling analyses were conducted by combining optical features, SAR features, and their combined characteristics with GEDI data. Specifically, typical vegetation indices and biophysical parameters were extracted from Sentinel-2 multispectral imagery to represent the horizontal structural attributes of vegetation. These optical features were then paired with GEDI-observed canopy height data to construct regression models, thereby achieving the integration of optical features and GEDI data. This approach emphasized the use of GEDI as a ground-truth reference for height calibration, enhancing the predictive capability of optical features in forest height estimation.
Furthermore, for the SAR feature integration modeling, PolSAR and InSAR data were utilized to extract backscattering coefficients, coherence, and interferometric phase information, characterizing the vertical structural variations of vegetation. By combining these SAR features with GEDI canopy height observations, SAR and GEDI data integration models were established. Referencing GEDI measurements enabled the effective extraction and quantification of vegetation-related information from SAR data, thus enhancing the precision of forest height estimation derived from SAR data.
Moreover, to fully exploit the complementary strengths of different data sources in representing horizontal and vertical vegetation structures, a joint modeling strategy combining optical features, SAR features, and GEDI data were also developed. In this approach, GEDI observations served as the reference, while multispectral vegetation indices, biophysical parameters, and SAR scattering and coherence features were jointly incorporated to form a high-dimensional composite feature space. Machine learning methods were employed to construct the inversion models. This multi-modal data combination framework effectively captured both the spectral responses and structural characteristics of vegetation, demonstrating superior model fitting capability and predictive performance.
Before being input into the model, all features were resampled to a spatial resolution of 110 m using the bilinear method in order to avoid computational load exceeding the limitations of Google Earth Engine (GEE). Sample selection and partitioning were aimed at constructing effective training and testing datasets to support subsequent model training and evaluation. All pixels were randomly permuted and partitioned based on a predefined ratio, with 70% assigned to the training set and the remaining 30% allocated for model evaluation. A random sequence was generated to ensure the randomness of the partitioning process and to avoid data bias. The training set was used to build the models, while the testing set was employed for model validation and accuracy assessment. Notably, the use of a single 70/30 train-validation split might not provide a stable estimate of model performance across different data partitions. Therefore, we additionally employed five-fold cross-validation to improve the robustness and generalizability of the model and to obtain a more reliable evaluation of its predictive performance.
3.2.1. Random Forest Regression Model
The Random Forest (RF) regression model was a nonlinear method based on ensemble learning. It constructed multiple regression trees and aggregates their outputs to enhance prediction robustness and accuracy. Compared to a single decision tree, which tends to overfit, RF improved generalization by combining independently trained trees. This advantage made it suitable for complex nonlinear problems like surface biomass inversion [
41].
In this study, the RF model employed Bootstrap Aggregating (Bagging) and random feature selection to increase diversity and robustness. Given a training dataset with input feature vectors and corresponding target variables, bootstrap sampling was applied to generate subsets by sampling with replacement. This randomness ensured each tree is trained on different data, reducing overfitting and model variance.
During tree construction, feature selection at each split was limited to a random subset of features, from which the best split is chosen. This mechanism further reduced overfitting and prevented highly correlated features from dominating decisions, thereby enhancing generalization.
The final prediction was the average of all regression tree outputs:
where
denoted the total number of regression trees, and
was the prediction from the
t-th tree.
This ensemble approach reduced individual model bias while improving overall stability and accuracy. For regression, RF used Mean Squared Error (
MSE) as the splitting criterion, choosing the split that minimizes target value variance within child nodes. For a given node, the optimal split minimized:
where
was the mean of the target values within the node.
3.2.2. Gradient Tree Boosting Regression Model
Gradient Tree Boosting (GTB) was a nonlinear regression technique under the Boosting framework. It sequentially built multiple weak learners, such as regression trees, with each learner correcting the errors of its predecessor, thereby enhancing both fitting accuracy and generalization capability. Unlike single decision trees that were sensitive to noise, GTB achieved robust performance by minimizing a loss function iteratively.
In this study, GTB used an additive model combined with a forward stagewise algorithm. The initial model prediction was set to the mean of the target values. In each iteration, a new base learner was trained on the residuals of the current prediction and added to the model using a learning rate to minimize the overall loss.
The construction of the
m-th base learner in each iteration was represented as follows:
where
was the prediction model at iteration m,
was the model from the previous iteration,
was the learning rate,
was the base learner fitted to the current residuals. Careful tuning of the learning rate enabled control over the convergence speed and generalization ability of the model, helping to prevent overfitting.
Each base learner fitted the negative gradient of the loss function to approximate the residuals:
where
L denoted the loss function, commonly defined as Mean Squared Error (
MSE) in regression tasks to maximize the model’s fitting accuracy.
The final prediction was the weighted sum of all base learners:
This iterative strategy captured complex feature relationships and improved prediction accuracy by gradually reducing residuals. To prevent overfitting, a small learning rate and early stopping (based on validation performance) were applied to determine the optimal number of iterations.
3.2.3. Evaluation Metrics
To assess the predictive capability of the model, several metrics were calculated, including Root Mean Square Error (
RMSE) and the coefficient of determination (
R2). A lower
RMSE indicated smaller discrepancies between predicted and actual values; a lower relative
RMSE implied better relative prediction accuracy. A higher
R2 value signified better model fit and stronger agreement between predicted and observed trends. The formulas are as follows:
where
n was the number of data points,
was the actual value,
was the predicted value, and
was the mean of the observed values.
4. Results and Analysis
4.1. Feature Extraction Results
Based on multispectral remote sensing data from Sentinel-2, representative spectral vegetation indices were derived, including RVI, NDVI, and EVI. In addition, a series of surface biophysical parameters, LAI, Cab, and FVC, were retrieved through spectral inversion techniques. The results are presented in
Figure 3, which illustrates the spatial distribution patterns of these vegetation indices and biophysical parameters.
With Sentinel-1A GRD data, backscattering coefficients under two polarization modes (VV and VH), as well as coherence and interferometric phase maps for VV-VV and VH-VH polarization modes, were extracted. The combined results are shown in
Figure 4, which provides a comparative view of backscattering intensity, coherence stability, and phase information across different polarization modes. During InSAR processing, although striping and data gaps reduced coherence in some regions, the coherence maps still demonstrate good spatial consistency in topographically stable areas, providing a solid foundation for forest structure estimation.
4.2. Optical Features and GEDI Data Integration
Multispectral remote sensing data are primarily used to capture the horizontal structural information of vegetation. Consequently, regression models that rely solely on such data for tree height inversion often exhibit poor performance. In this study, three typical vegetation indices (NDVI, EVI, RVI) and three biophysical parameters (FVC, LAI, Cab) were selected to construct a six-dimensional feature vector. The model’s performance was subsequently evaluated using the validation set. The inversion results are presented in
Table 1.
An accuracy assessment based on the validation set revealed that the model exhibited suboptimal performance when relying solely on optical features, yielding an RMSE of 5.253 m and an R2 of 0.162, indicative of a weak relationship between the predicted and observed forest heights. Similarly, the model’s predictive capability remained limited when using only CHM, with an RMSE of 3.746 m and R2 of 0.334.
In contrast, the integration of CHM derived from GEDI LiDAR data markedly improved model performance. When CHM was combined with six optical variables, the RMSE decreased to 3.237 m, and R
2 increased substantially to 0.572, demonstrating dramatical enhancement in both predictive accuracy and model robustness. The performance metrics of the random forest regression model for both the training and validation sets are illustrated in
Figure 5.
A similar trend was observed with the GTB model. When only optical features were used, the model yielded an RMSE of 5.506 m and an R
2 of 0.100, confirming weak predictive ability. However, upon the addition of CHM, the model performance improved markedly, achieving an RMSE of 3.541 m and an R
2 of 0.484. Although slightly lower than the RF model in R
2, the GTB model showed a notable decrease in RMSE, indicating that it is more responsive to the structural information provided by CHM. The accuracy analysis results for the training and validation sets of the GTB model are presented in
Figure 6.
Overall, both models exhibited enhanced performance when CHM was introduced, affirming the key role of GEDI-derived structural metrics in forest height estimation. Nonetheless, the limited contribution of optical features alone suggests that they may not adequately capture vertical forest heterogeneity. These findings emphasize the critical importance of integrating 3D structural data to improve forest height inversion accuracy and model robustness.
4.3. SAR and GEDI Data Combination
This study leverages the all-weather, all-time observation advantages of SAR data, combined with its strong penetration capability in vegetation-dense areas, to explore the applicability of RF regression and GTB models in forest height inversion. The study selected backscatter coefficients from VV and VH polarization, interferometric phase, and coherence as model input features from the main and secondary images, and performed model accuracy evaluation on 3000 sample data sets from the validation set. The results are shown in
Table 2.
The results demonstrate that different radar feature combinations significantly affect the model’s prediction performance. When VV polarization was used alongside CHM, the model achieved an RMSE of 3.162 m and an R2 of 0.488. Notably, replacing VV with VH polarization led to an improved RMSE of 2.999 m and a higher R2 of 0.507, suggesting that VH polarization is more sensitive in characterizing vegetation vertical structure.
Compared to the use of CHM alone, the integration of dual-polarized radar features with CHM further improved model performance, resulting in an RMSE of 3.059 m and an R2 of 0.551. Notably, the combination of InSAR-derived metrics with CHM yielded the highest predictive accuracy, with the random forest model achieving an RMSE of 2.265 m and an R2 of 0.732. This marked improvement highlights the critical contribution of InSAR parameters, which is likely attributable to their enhanced sensitivity to vertical forest structure and canopy closure, enabling the model to capture more detailed biophysical characteristics of forested areas.
However, when combining dual PolSAR, InSAR, and CHM features, model performance slightly declined (RMSE = 2.855 m, R
2 = 0.601). This phenomenon may be attributed to the distinct types of structural information provided by the two data sources: InSAR primarily captures vertical forest structure, while PolSAR mainly reflects horizontal structural characteristics. In some cases, the addition of horizontal structural information contributes little to a model that already relies on vertical structure metrics, which may result in a slight reduction in overall predictive performance. Furthermore, the increased feature set also raises model complexity, which can potentially affect generalization on the validation dataset, particularly when training samples are limited. The accuracy analyses for training and validation sets are shown in
Figure 7.
Similarly, the GTB model exhibited strong predictive capability with varying radar feature combinations. The integration of InSAR features with CHM resulted in the best model performance, with an RMSE of 2.619 m and an R2 of 0.643. This corroborates the pattern observed in the RF model, affirming the substantial contribution of InSAR in forest height estimation. However, the results obtained using SAR data alone were not satisfactory. This may due to the fact that SAR data provide only indirect information on forest vertical structure through interferometric phase and coherence features, whereas the CHM data derived from GEDI observations directly represent vertical height information.
When combining dual PolSAR, InSAR, and CHM features, the GTB model achieved an RMSE of 3.051 m and an R
2 of 0.535, indicating its robust capacity to integrate heterogeneous remote sensing inputs. The model accuracy evaluation is illustrated in
Figure 8.
Overall, using CHM as the sole input resulted in limited model performance in both RF and GTB frameworks, suggesting that CHM alone lacks comprehensive structural information. In contrast, the integration of InSAR features markedly improved model accuracy, particularly within the RF model, where R2 reached 0.732, significantly outperforming the GTB model, reflecting RF’s superior capability in structural modeling.
Moreover, for both regression models, the inclusion of InSAR-derived parameters consistently produced the lowest RMSE and highest R2 values, confirming the unique contribution of InSAR in capturing vertical vegetation structures. These results highlight the importance of advanced radar features, especially VH polarization and InSAR, for enhancing forest height modeling within multi-source data fusion frameworks.
In summary, the findings confirm that the SAR-based RF regression model exhibits strong potential in retrieving forest height, especially with the use of InSAR data, which provides more vertical information and improved accuracy.
4.4. Optical, SAR Features, and GEDI Integration
This study employs a multi-modal data combination framework to leverage the significant advantages of multi-modal data for surface parameter estimation. Optical features, with their high spatial resolution and rich spectral information, effectively capture the structural and biochemical characteristics of surface vegetation, providing critical surface information support for forest height inversion. SAR data, with its all-weather and all-time observation capability, particularly excels in providing stable backscatter information under complex weather conditions, significantly compensating for the observational deficiencies of optical data under cloudy or low-light environments. GEDI data offer high-precision vertical structural detection capabilities, ensuring crucial support for extracting three-dimensional information such as surface height and canopy density.
Three typical vegetation indices (NDVI, EVI, RVI) and three biophysical parameters (FVC, LAI, Cab) were selected as multispectral parameters in our study. SAR data, including the backscattering coefficients of VV polarization and VH polarization, along with the interferometric phase and coherence coefficient of the main and secondary images were used. GEDI data, processed to extract the CHM and RH98 were incorporated to construct multidimensional feature vectors. A total of 10,000 random samples were collected, and model accuracy was evaluated using 3000 samples from the validation set, as shown in
Table 3.
The RF model achieves a lower training RMSE of 1.946 and a higher R2 of 0.864, indicating a strong fit to the training data. Nevertheless, its performance on the validation set declines, with the RMSE increasing to 2.564 and the R2 dropping to 0.708. This discrepancy suggests a degree of overfitting, whereby the model captures noise or non-generalizable patterns in the training data, thus reducing its predictive accuracy on unseen samples.
In contrast, the GTB model yields a higher training RMSE of 2.574 m and a lower R2 of 0.716, indicating a comparatively less complex fit to the training data. However, its performance on the validation set remains relatively stable, with an RMSE of 2.697 m and an R2 of 0.672. Although slightly lower than RF in absolute validation accuracy, the GTB model exhibits a smaller performance gap between training and validation stages, implying better generalization and reduced overfitting.
Additionally, the Maximum Direct Information (MDI) was used to quantify the importance of each feature vector, as shown in
Figure 9. The results visually present the relative contribution of each input feature to forest height inversion in the model, providing an essential reference for further optimizing feature selection and improving inversion accuracy. From the figure, it is clear that the backscattering coefficient of VH polarization, along with several InSAR-derived features, stands out in the feature importance analysis, with significantly higher MDI values compared to other features. This indicates that SAR-derived features, particularly from InSAR, demonstrate strong explanatory and predictive capabilities for forest height estimation. The MDI values of CHM features also take the lead, demonstrating its important role in forest height estimation. In contrast, the MDI values for most of the multispectral features are relatively low, suggesting that these features contribute less to the model. Overall, in terms of MDI values, InSAR-derived features, and GEDI-derived features exhibit the most pronounced contribution to forest height estimation, whereas the influence of optical remote sensing imagery appears comparatively limited.
Figure 10 presents a comparative analysis of the prediction accuracy of the RF and GTB models on both the training and validation datasets. This figure offers an intuitive representation of the models’ fitting capabilities and generalization performance, serving as a visual foundation for evaluating their robustness and predictive stability.
These results suggest that while the RF model can effectively capture complex nonlinear relationships within the training set, it may be more susceptible to overfitting under the current feature and data conditions. On the other hand, the GTB model, though less accurate on training data, maintains more consistent performance on the validation set, reflecting its potential advantage in practical applications requiring robust prediction across variable input distributions.
To visualize the prediction results of the regression models for forest height in Jiangsu Province, the predicted data were exported and mapped. However, since GEDI data is point-based, an interpolation method is required to convert point data into surface data for inversion over the entire study area. Therefore, the final predicted results were generated using the Inverse Distance Weighting (IDW) interpolation method, incorporating SAR data, multispectral data, and the CHM. The predicted results are shown in
Figure 11.
Forest height estimates across Jiangsu Province exhibit a clear spatial gradient, with elevations tending to be greater in the southern regions compared to the north. This overall distribution is significantly influenced by topography and the degree of urbanization. In southern Jiangsu, particularly in hilly and mountainous regions such as Nanjing, southern Wuxi and Zhenjiang, forest coverage is relatively dense, with forest heights commonly exceeding 20 m in certain localized areas. In contrast, the central and northern regions, dominated by flat plains, feature more fragmented forest distribution, primarily consisting of low-stature vegetation, with most forest heights falling below 10 m. Notable exceptions with higher vegetation occur only in limited mountainous areas, such as the Yuntai Mountain region in Lianyungang. Overall, forest resources in Jiangsu Province seemed to be predominantly concentrated in the southern mountainous zones, while the central and northern areas exhibit sparse and scattered forest cover.
5. Discussion
Among the various remote sensing inputs, the interferometric coherence and phase derived from Sentinel-1 SAR data proved especially influential. Compared with methods relying only on dual-polarized SAR backscatter or multispectral data, these InSAR features better capture canopy structural complexity, supporting the integration of heterogeneous remote sensing modalities. Their advantages lie in providing vertical structural information which is not available from optical or passive data. Interferometric phase and coherence are highly sensitive to forest vertical structure. They complement spectral indices, which describe canopy greenness and horizontal heterogeneity, and PolSAR backscatter coefficients, which capture scattering from canopy components. Together, these features support more accurate indirect estimation of canopy height. In addition, InSAR’s sensitivity to structural variability helps mitigate errors from illumination and shadow effects in complex terrain. Operating in the microwave domain, InSAR also acquires data regardless of weather or lighting, offering consistent observations particularly valuable in Jiangsu’s cloudy and rainy climate. By combining spectral, structural, and scattering information, the fusion models reduce reliance on a single data source, achieving higher predictive accuracy and stronger ecological interpretability.
In terms of model performance, the RF regression model demonstrated high accuracy on the training dataset (R2 = 0.864) but a significant drop on the validation dataset (R2 = 0.708), indicating a notable overfitting issue. This behavior may be attributed to several factors. Firstly, the model complexity, specifically the configuration of a large number of trees (500) and a small minimum leaf size (5), can cause the RF model to capture noise or specific patterns in the training data that do not generalize well to unseen data. Besides, the input features include highly correlated variables, such as multiple vegetation indices and radar metrics. This can cause the model to overemphasize minor variations and increase overfitting. Further feature selection, dimensionality reduction, or regularization could improve generalization. In contrast, the GTB model showed more balanced performance, with a smaller gap between training (R2 = 0.716) and validation (R2 = 0.672) datasets. Its sequential learning and built-in regularization likely helped handle feature redundancy and reduce overfitting.
The results show that SAR data can effectively offset the limitations of optical imagery, especially in Jiangsu’s cloudy and fragmented forest areas, where optical features alone are insufficient. By combining spectral, structural, and scattering data, multi-source fusion models achieve more stable and accurate canopy height predictions.
Despite these promising outcomes, several limitations persist. The availability and spatial continuity of GEDI LiDAR and InSAR data are restricted by acquisition conditions, leading to striping effects and incomplete coverage in some areas. Moreover, the SAR data, which are derived from C-band dual-polarization and interferometric techniques, have limited canopy penetration, which may cause underestimation in densely forested areas. Future work should incorporate SAR data from longer wavelengths and higher spatial resolutions, such as P-band from the Biomass satellite launched on 29 April 2025, or L-band from missions like Lutan-1, to address this limitation.
This paper focuses on enhancing the accuracy and efficiency of regional forest average height estimation in Jiangsu Province through the combination of multi-modal remote sensing data. The core innovation lies in the first systematic evaluation of the joint use of InSAR interferometric phase and coherence within two mainstream machine learning models: the Random Forest regression model and the Gradient Tree Boosting regression model.
First, single-source remote sensing data inevitably carry both advantages and limitations. Optical imagery provides spectral and textural information related to canopy greenness and horizontal heterogeneity, PolSAR backscatter coefficients capture scattering from canopy components, while GEDI LiDAR provides direct vertical structural references but suffers from sparse and discontinuous sampling. In contrast, interferometric coherence and phase are highly sensitive to vertical canopy structure, providing information not captured by optical or PolSAR data. Therefore, the integration of these complementary modalities is essential to obtain a more comprehensive representation of forest structure. In our study, the combination of spectral, scattering, and vertical structural information was particularly effective for Jiangsu Province, where the terrain is dominated by plains and low hills with relatively gentle undulations. In such landscapes, the fusion of multi-modal data can overcome the limitations of single-source datasets and yield more stable and accurate canopy height predictions.
Second, from the perspective of model adaptability, our framework relies on both direct and indirect vertical information (GEDI canopy height metrics and InSAR coherence/phase), spectral features (optical indices), and physical scattering metrics (dual PolSAR). Ablation experiments revealed that although GEDI and InSAR features made the strongest contributions to model performance, neither dataset alone achieved satisfactory accuracy. This highlights the mutual complementarity among data sources, which underpins the generalizability of the approach. While Random Forest tended to overfit, Gradient Tree Boosting showed stronger generalization ability across different datasets. Furthermore, given that Jiangsu Province is characterized by fragmented forests and modest terrain variability, the proposed framework demonstrates potential to be extended to other regions with similar environmental conditions. Nevertheless, the deeper mechanisms of cross-modal complementarity still need to be further explored by employing more advanced models.
Finally, the interferometric coherence and phase derived from Sentinel-1 SAR data proved especially influential among the various remote sensing inputs. Compared with methods relying only on dual-polarized SAR backscatter or multispectral data, these InSAR features better capture canopy structural complexity, supporting the integration of heterogeneous remote sensing modalities. Their advantages lie in providing vertical structural information which is not available from optical or passive data. In addition, InSAR’s sensitivity to structural variability helps mitigate errors from illumination and shadow effects in complex terrain, and its weather- and illumination-independent acquisition provides consistent observations particularly valuable in Jiangsu’s cloudy and rainy climate. However, despite these strengths, establishing robust quantitative relationships between InSAR observables and canopy height remains challenging. While multi-frequency InSAR approaches (such as using P- and X-band differences) have been successfully applied in airborne or repeat-track scenarios, the potential of spaceborne InSAR for forest height estimation is still limited by temporal baselines and requires further investigation.
In terms of model performance, the RF regression model demonstrated high accuracy on the training dataset (R2 = 0.864) but a significant drop on the validation dataset (R2 = 0.708), indicating a notable overfitting issue. This behavior may be attributed to model complexity, highly correlated input variables, and limited feature diversity. In contrast, the GTB model showed more balanced performance, with a smaller gap between training (R2 = 0.716) and validation (R2 = 0.672) datasets. Its sequential learning and built-in regularization likely helped handle feature redundancy and reduce overfitting.
Nevertheless, limitations persist due to the restricted availability and spatial continuity of GEDI and InSAR data. For instance, this study employed GEDI RH98 as the reference for canopy height, but its spatial distribution is inherently sparse and uneven. Even after interpolation, uncertainties remain, which may affect the accuracy of large-scale modeling. Future research could benefit from incorporating airborne or UAV-based LiDAR point clouds to derive CHM as denser ground truth, complemented by field plot measurements. Such improvements in reference data quality would further optimize the performance of large-scale forest height estimation models.
Future work should incorporate SAR data from longer wavelengths and higher spatial resolutions, such as P-band from the Biomass satellite (launched 29 April 2025) or L-band from missions like Lutan-1. Additionally, alternative ensemble models like XGBoost and LightGBM may offer further improvements in scalability [
42,
43], tuning efficiency, and robustness. To enhance spatial detail and overall estimation precision, high-resolution optical datasets (such as GaoFen-2, WorldView) may also be incorporated [
44,
45]. To address the issue of sparse GEDI point coverage, spatial interpolation methods or deep learning-based data augmentation approaches could be employed [
46,
47]. Lastly, incorporating time-series remote sensing data [
48] could enable dynamic tracking of forest canopy changes and enhance the temporal generalization of the models [
49,
50].