Article

ES-Net Empowers Forest Disturbance Monitoring: Edge–Semantic Collaborative Network for Canopy Gap Mapping

Yutong Wang, Zhang Zhang, Jisheng Xia, Fei Zhao and Pinliang Dong
1 Institute of International Rivers and Eco-Security, Yunnan University, Kunming 650500, China
2 School of Earth Science, Yunnan University, Kunming 650500, China
3 Department of Geography and the Environment, University of North Texas, Denton, TX 76201, USA
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(14), 2427; https://doi.org/10.3390/rs17142427
Submission received: 29 May 2025 / Revised: 2 July 2025 / Accepted: 11 July 2025 / Published: 12 July 2025

Abstract

Canopy gaps are vital microhabitats for forest carbon cycling and species regeneration, whose accurate extraction is crucial for ecological modeling and smart forestry. However, traditional monitoring methods have notable limitations: ground-based measurements are inefficient; remote-sensing interpretation is susceptible to terrain and spectral interference; and traditional algorithms exhibit an insufficient feature representation capability. Aiming at overcoming the bottleneck issues of canopy gap identification in mountainous forest regions, we constructed a multi-task deep learning model (ES-Net) integrating an edge–semantic collaborative perception mechanism. First, a refined sample library containing multi-scale interference features was constructed, which included 2808 annotated UAV images. Based on this, a dual-branch feature interaction architecture was designed. A cross-layer attention mechanism was embedded in the semantic segmentation module (SSM) to enhance the discriminative ability for heterogeneous features. Meanwhile, an edge detection module (EDM) was built to strengthen geometric constraints. Results from selected areas in Yunnan Province (China) demonstrate that ES-Net outperforms U-Net, boosting the Intersection over Union (IoU) by 0.86% (95.41% vs. 94.55%), improving the edge coverage rate by 3.14% (85.32% vs. 82.18%), and reducing the Hausdorff Distance by 38.6% (28.26 pixels vs. 46.02 pixels). Ablation studies further verify that the synergy between SSM and EDM yields a 13.0% IoU gain over the baseline, highlighting the effectiveness of joint semantic–edge optimization. This study provides a terrain-adaptive intelligent interpretation method for forest disturbance monitoring and holds significant practical value for advancing smart forestry construction and ecosystem sustainable management.

1. Introduction

A canopy gap [1] refers to a medium- to small-scale opening in the forest canopy resulting from human-related actions or natural disturbances (e.g., the decay, death, and lodging of trees). According to Brokaw [2], the “canopy gap” is the region enveloped by the vertical projection of the surrounding tree crowns, and many scholars have used this definition in their research. As a common disturbance type in forest ecosystems, canopy gaps profoundly influence species composition, community succession, and carbon storage dynamics by altering the understory microenvironment (e.g., light, temperature, and moisture distribution) [3,4,5]. A reasonable and accurate delineation of canopy gaps and their spatial distribution is not only an important foundation for analyzing forest structural heterogeneity and assessing ecosystem service functions but also a core prerequisite for enabling detailed forest land research and sustainable management [6,7].
Traditionally, ground investigation was a primary means for monitoring canopy gaps. The emergence of small aircraft, aerial walkways, and tower cranes advanced the study of canopy gaps to a certain degree [8]. Nevertheless, these observation methods are limited by high labor costs, small coverage areas, and poor terrain adaptability, making it difficult to meet the needs of long-term dynamic observation in complex mountainous forest areas [9,10]. As a new platform, unmanned aerial vehicles (UAVs) offer long endurance, high timeliness, high data accuracy, and low cost. At the same time, their low flying altitude allows UAVs to avoid the influence of the atmosphere on imaging [11,12], and they have been widely applied in monitoring small-scale forest ecosystems [13].
Methods for canopy gap extraction from UAV imagery have undergone a technological evolution from traditional classification approaches to deep learning-based techniques [14,15,16,17]. Pixel-wise classification methods rely solely on spectral features, tending to generate “salt-and-pepper” noise and failing to effectively capture the spatial structural features of canopy gaps. Object-based classification, by integrating multi-dimensional features such as spectral, textural, and shape characteristics, has improved the segmentation accuracy to some extent. However, it relies on manually designed feature parameters and struggles to characterize the irregular boundaries of canopy gaps in complex terrains, often resulting in excessive edge smoothing or fragmentation. These methods are fundamentally constrained by the expressiveness of handcrafted features, performing poorly in scenarios with understory shadow interference and uneven vegetation cover. Nowadays, deep learning, leveraging its powerful capability for automatic feature learning, has become a research hotspot in the field of remote-sensing information recognition and extraction [18,19,20], and numerous breakthroughs have been made in forest monitoring. For example, EdgeFormer optimizes the feature capture of complex scenes through Transformer global attention [21]; foundation models combined with Sentinel-2 time-series data enable forest change monitoring [22]; knowledge-guided deep learning enhances the adaptability of multi-temporal monitoring in high-altitude mountainous areas [23]; and Siamese networks optimize the extraction accuracy of forest cover changes based on Landsat 8 data [24]. These advancements have provided a new paradigm for efficiently extracting complex semantic information of canopy gaps but still require further optimization for the unique attributes of canopy gaps.
Attention mechanisms and edge detection techniques have brought new ideas to remote-sensing information extraction [20,25,26], while they face the core challenge of “insufficient semantic–geometric collaborative modeling” in canopy gap extraction. Channel attention modules (e.g., SENet [27]) dynamically enhance spectral responses related to canopy gap-sensitive features by modeling non-linear dependencies between feature channels, effectively improving the discriminative ability for similar land objects in hyperspectral imagery classification. Edge detection networks (e.g., RCF [28], BDCN [29]) capture pixel value mutation information through multi-scale feature fusion, achieving high-precision boundary localization in the extraction of regular ground objects such as farmland [30], roads [31], and buildings [32]. Nevertheless, canopy gap boundaries exhibit non-rigid, multi-scale variability due to interference from canopy shadows and understory vegetation cover. In existing methods, Transformer-based models (such as EdgeFormer) model long-range dependencies through global self-attention mechanisms, but their limited ability to characterize local geometric features leads to blurred segmentation contours when extracting targets with irregular boundaries like canopy gaps. Multi-task segmentation models (such as DeepLabv3+), although attempting to integrate semantic segmentation and edge detection tasks, lack special optimization for the “spectral heterogeneity + topographic interference” characteristics of canopy gaps. When facing spectral confusion between understory shadows and bare soil, they often suffer from missed segmentation due to insufficient semantic discrimination. This synergistic lack of “semantic feature enhancement” and “edge detail capture” means existing methods commonly encounter issues of omission, misclassification, and boundary discontinuity in canopy gap extraction within complex mountain forest areas. Therefore, a new deep learning framework needs to be constructed to combine spectral semantics and geometric priors.
To address the heterogeneity characteristics of canopy gaps and the technical bottlenecks of existing methods, we propose a multi-task deep learning network integrating edge features and semantic information (ES-Net). The architecture achieves three critical contributions: (1) The EDM’s multi-scale gradient fusion enables pixel-level boundary localization, reducing the Hausdorff Distance by 38.6% compared to U-Net; (2) the cross-layer attention in SSM enhances spectral discrimination for “same-spectrum” regions; and (3) the lightweight framework (23.99 M params) balances accuracy and efficiency, meeting UAV real-time processing requirements. When tested in Yunnan’s typical forests, ES-Net provides a robust solution for canopy gap extraction, advancing intelligent forest disturbance monitoring.

2. Material and Methods

2.1. Study Area

This study was conducted in a representative forest region within Yunnan Province, China, situated adjacent to Songmao Reservoir in Chenggong District, Kunming (Figure 1). Nestled within the northeastern Dianchi Lake Basin at an average elevation of 2016.75 m, the area exhibits typical subtropical plateau monsoon climate characteristics. It has an average annual temperature of 14.9 degrees Celsius and receives an annual precipitation ranging from 900 to 1000 mm. The vegetation communities are dominated by coniferous forests composed primarily of Pinus yunnanensis and Podocarpus macrophyllus, with distinctive canopy gap structures. However, the complex terrain and significant topographic variations within the study area present substantial challenges for systematic monitoring of canopy gap dynamics and related ecological research.

2.2. Data Acquisition

A DJI Phantom 4 RTK UAV (SZ DJI Technology Co., Ltd., Shenzhen, China), featuring a 20-megapixel sensor, was used to acquire image data of the study area on 13 July 2020. To ensure data integrity, flight operations were conducted under cloudless weather conditions with wind speeds below 3 m/s. Given the alpine terrain, a low-altitude flight plan at 180 m above ground level (AGL) was implemented to minimize atmospheric distortion. A dual-crossing flight pattern with 80% course and side overlap achieved full coverage, yielding 1268 photos. Post-processing in Agisoft PhotoScan generated a digital orthophoto map (DOM) with a 0.039 m/pixel resolution, measuring 7132 × 6838 pixels. Georeferencing used the WGS84 coordinate system with Universal Transverse Mercator (UTM) projection.

2.3. Canopy Gaps Dataset

As illustrated in Figure 2, this study generated a canopy gap dataset for training the deep learning network. Firstly, the data were cropped to a size of 512 × 512 pixels using the sliding window method. Then, the canopy gaps were annotated in LabelMe [33], and the labels were generated in combination with ArcGIS 10.4. Subsequently, the data were augmented through horizontal flipping, 90-degree counterclockwise rotation, and a 20% brightness increase. Finally, the dataset was randomly split into a training set and a testing set at a ratio of 8:2.
The following steps were specifically carried out:
(1) Image segmentation. Directly inputting the large original image into the network model may cause excessively long training times, poor results, or even memory overflow, so the image must first be divided into smaller tiles. Figure 3 presents the local segmentation results for different resolutions. Considering the actual situation, the DOM and labels were cropped sequentially with a window size of 512 × 512 pixels and a sliding step of 256 pixels. After data cleaning, 702 sub-images were obtained.
(2) Canopy gap annotation. The boundaries of the canopy gaps were delineated using LabelMe 3.16.2 (run in an Anaconda environment), generating “.json” files containing information such as the category and coordinate position of the labeled objects. These files were then converted into the “.png” format required for the experiment.
(3) Data augmentation and division. Insufficient training samples may render the neural network vulnerable to overfitting. Thus, it is necessary to expand the sample data by increasing its quantity and diversity to enhance the model’s performance and robustness. In this study, a total of 2808 images were eventually generated via techniques including horizontal flipping, 90-degree clockwise rotation, and brightness adjustment. To ensure the trustworthiness of the model, cross-validation was used to randomly divide the dataset into a training set and a test set at a ratio of 8:2.
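For illustration, a minimal sketch of this preparation pipeline (sliding-window cropping, the three augmentation operations, and the 8:2 split) is given below. The NumPy-based implementation, helper names, and the toy input are assumptions for demonstration and do not reproduce the exact scripts used in this study.

```python
# Illustrative sketch of the dataset preparation steps described above.
import random
import numpy as np

def sliding_window_crop(image: np.ndarray, size: int = 512, step: int = 256):
    """Crop an H x W x C orthophoto into overlapping size x size tiles."""
    tiles = []
    h, w = image.shape[:2]
    for top in range(0, h - size + 1, step):
        for left in range(0, w - size + 1, step):
            tiles.append(image[top:top + size, left:left + size])
    return tiles

def augment(tile: np.ndarray):
    """Return the original tile plus flipped, rotated, and brightened copies."""
    flipped = np.fliplr(tile)                          # horizontal flip
    rotated = np.rot90(tile)                           # 90-degree rotation
    brighter = np.clip(tile.astype(np.float32) * 1.2,  # +20% brightness
                       0, 255).astype(tile.dtype)
    return [tile, flipped, rotated, brighter]

# Example with a random stand-in "orthophoto" (assumption, not project data).
dom = np.random.randint(0, 256, (2048, 2048, 3), dtype=np.uint8)
samples = [aug for tile in sliding_window_crop(dom) for aug in augment(tile)]
random.shuffle(samples)
split = int(0.8 * len(samples))
train_set, test_set = samples[:split], samples[split:]
print(len(train_set), len(test_set))
```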

2.4. ES-Net Model

Canopy gaps exhibit significant heterogeneous characteristics in remote-sensing images, specifically manifested as follows: (1) Spectral heterogeneity: Different land covers (e.g., bare soil or grassland) show reflectance differences across multispectral bands; (2) Geometric heterogeneity: Irregular morphologies with varying sizes and shapes. The dual heterogeneity leads to challenges in traditional methods for addressing “same-object, different-spectrum” and “different-object, same-spectrum” phenomena. In view of this, the study proposes an improved U-Net [34] architecture that retains the symmetric structure and skip-connection mechanism while systematically enhancing feature representation capabilities through a two-level feature fusion strategy and an attention-guided semantic segmentation module (SSM). Additionally, an edge detection module (EDM) is integrated to reinforce boundary constraints. With this design, the network is capable of capturing the global semantic details and the local edge features of canopy gaps at the same time, significantly improving segmentation robustness in complex scenarios.

2.4.1. Semantic Segmentation Module (SSM)

The encoder of the SSM (Figure 4) captures multi-scale contextual information through a hierarchical structure: Each downsampling unit consists of two 3 × 3 convolutional blocks (incorporating batch normalization and ReLU activation) and a max-pooling layer. At each downsampling stage, features outputted by the current layer are combined with the previous layer along the channel dimension, and then a 3 × 3 convolution operation is applied to enable cross-level feature interaction, which effectively mitigates the problem of detailed information attenuation due to the deep network.
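As an illustration of this design, a minimal PyTorch sketch of one downsampling unit with the cross-level feature interaction is shown below; the channel sizes and the exact fusion wiring are assumptions and do not reproduce the published configuration.

```python
# Sketch of one SSM encoder downsampling unit with cross-level feature interaction.
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Two 3x3 convolutional blocks with batch normalization and ReLU
        self.double_conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        # 3x3 convolution applied after concatenating current and previous features
        self.fuse = nn.Conv2d(in_ch + out_ch, out_ch, 3, padding=1)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x_prev: torch.Tensor) -> torch.Tensor:
        x_cur = self.double_conv(x_prev)
        # Cross-level interaction: concatenate the previous level's features with
        # the current output along the channel dimension, then fuse with a 3x3 conv
        fused = self.fuse(torch.cat([x_cur, x_prev], dim=1))
        return self.pool(fused)

feats = DownBlock(64, 128)(torch.randn(1, 64, 256, 256))
print(feats.shape)  # torch.Size([1, 128, 128, 128])
```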
To strengthen the model’s ability to represent key features, a channel attention module (Figure 5) is embedded before the skip connections. This module compresses spatial information via global average pooling to generate channel-wise statistical descriptors, models non-linear dependencies among channels using two fully connected layers, and finally produces channel weight vectors through the sigmoid function. Mathematically, the channel attention module computes a weight $w_c$ for each feature channel $c$ as
$w_c = \sigma\big(f_{\mathrm{MLP}}(f_{\mathrm{AvgPool}}(F_c))\big)$ (1)
where $F_c$ is the channel feature, $f_{\mathrm{AvgPool}}$ denotes global average pooling, $f_{\mathrm{MLP}}$ is a two-layer MLP, and $\sigma$ is the sigmoid function. This mechanism dynamically amplifies channel responses related to canopy gap-sensitive features while suppressing irrelevant background interference.
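A minimal sketch of this channel attention computation (Equation (1)) is given below, assuming an SE-style implementation in PyTorch; the reduction ratio of 16 is an assumption, as it is not specified here.

```python
# SE-style channel attention: w_c = sigma(MLP(AvgPool(F_c))), then channel reweighting.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)        # f_AvgPool: global average pooling
        self.mlp = nn.Sequential(                      # f_MLP: two fully connected layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.avg_pool(x).view(b, c)                  # channel-wise descriptors
        w = torch.sigmoid(self.mlp(w)).view(b, c, 1, 1)  # channel weight vector w_c
        return x * w                                     # amplify/suppress channel responses

out = ChannelAttention(64)(torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```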
The decoder employs a symmetric topological structure with the encoder, which restores spatial resolution via transposed convolutions. At each decoding stage, the upsampled features are concatenated with attention-enhanced features from the corresponding encoder layer along the channel dimension, followed by spatial–semantic refinement through two 3 × 3 convolutional layers.

2.4.2. Edge Detection Module (EDM)

Downsampling in SSM inevitably causes a loss of detail information, thereby reducing the edge extraction accuracy—a phenomenon particularly evident in canopy gap extraction. Thus, an EDM (Figure 6) is incorporated into the model, which captures abrupt pixel value changes to precisely localize canopy gap boundaries. Inspired by mainstream edge detection frameworks (e.g., RCF, BDCN), this module extracts multi-level edge features through convolutional layers and fuses them across scales. Meanwhile, to enhance computational efficiency and model performance, the EDM shares convolutional layers with the SSM, enabling joint training during the learning process.
The EDM operates through three stages: First, three layers of feature maps are extracted from the semantic segmentation encoder, and edge response features are derived through 3 × 3 convolutional layers with ReLU activation to enhance non-linear representation. Second, the feature maps from the first two convolutional outputs are bilinearly upsampled to maintain spatial consistency with the third layer’s features, followed by channel concatenation for feature fusion. Ultimately, a convolutional layer with a kernel size of 1 × 1 compresses the fused multi-channel feature maps into a single-channel representation, which is converted into an edge probability map (0–1 range) using the sigmoid function to precisely characterize the irregular boundaries of canopy gaps.
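The following PyTorch sketch illustrates these three stages; the encoder channel counts, the number of intermediate channels, and the demo feature-map sizes are assumptions for demonstration only.

```python
# Sketch of the EDM: per-level 3x3 conv + ReLU, bilinear resampling of the first
# two maps to the third level's size, channel concatenation, 1x1 conv + sigmoid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeDetectionModule(nn.Module):
    def __init__(self, enc_channels=(256, 128, 64), mid: int = 32):
        super().__init__()
        # Stage 1: edge response features from three encoder levels
        self.edge_convs = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, mid, 3, padding=1), nn.ReLU(inplace=True))
            for c in enc_channels
        )
        # Stage 3: 1x1 conv compresses fused features to a single channel
        self.head = nn.Conv2d(mid * len(enc_channels), 1, kernel_size=1)

    def forward(self, f1, f2, f3):
        e1, e2, e3 = (conv(f) for conv, f in zip(self.edge_convs, (f1, f2, f3)))
        # Stage 2: upsample the first two maps to the third layer's spatial size
        target = e3.shape[-2:]
        e1 = F.interpolate(e1, size=target, mode="bilinear", align_corners=False)
        e2 = F.interpolate(e2, size=target, mode="bilinear", align_corners=False)
        fused = torch.cat([e1, e2, e3], dim=1)
        return torch.sigmoid(self.head(fused))          # edge probability map in [0, 1]

edm = EdgeDetectionModule()
prob = edm(torch.randn(1, 256, 64, 64),
           torch.randn(1, 128, 128, 128),
           torch.randn(1, 64, 256, 256))
print(prob.shape)  # torch.Size([1, 1, 256, 256])
```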

2.5. Loss Function

Aiming at addressing the problems of class imbalance and the challenge of semantic–geometry collaboration in canopy gap extraction, this study designed a joint loss function that fuses Focal Loss and Binary Cross-Entropy (BCE), and adopted a fixed weight strategy to balance multi-task optimization. The total loss function is defined as
$Loss = L_{SSM} + \lambda L_{EDM}$ (2)
Here, $L_{SSM}$ adopts Focal Loss [35] to address the problem of an imbalanced sample distribution; it is a dynamically scaled version of the standard Cross-Entropy (CE) loss and is formulated as
$L_{SSM} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$ (3)
where $(1 - p_t)^{\gamma}$ acts as the modulation factor, which reduces the contribution of easily classified samples to the loss and thereby increases the proportion of the loss attributed to difficult-to-classify samples; $\alpha_t$ regulates the ratio between negative and positive samples in the loss to control the quantitative imbalance; and $\gamma$ (in the range [0, 5]) controls the strength of the modulation factor; when $\gamma = 0$, the loss reduces to the standard CE loss.
For L E D M , Binary Cross-Entropy (BCE) loss was employed to compel the model to learn spectral gradient mutation features, thereby enhancing its ability to depict irregular boundaries.
$L_{EDM} = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big]$ (4)
where $y$ is the ground-truth edge mask and $\hat{y}$ is the predicted edge probability.
Semantic segmentation and edge detection complement each other in canopy gap extraction: semantic segmentation provides region-level classification, but boundary localization is susceptible to spectral noise interference; and edge detection strengthens geometric constraints, but the lack of semantic information easily generates pseudo-edges. The equal-weight (λ = 1.0) strategy ensures a balanced contribution of the two tasks in gradient updates, promoting collaborative learning of semantic–geometric features.
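A minimal sketch of this joint loss (Equations (2)–(4)) is given below; the $\alpha$ and $\gamma$ values follow common Focal Loss defaults and are assumptions, as the exact settings are not stated here.

```python
# Joint loss sketch: Focal Loss for the SSM branch + BCE for the EDM branch (lambda = 1.0).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha: float = 0.25, gamma: float = 2.0):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # -log(p_t)
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

def joint_loss(seg_logits, seg_gt, edge_logits, edge_gt, lam: float = 1.0):
    """Loss = L_SSM (focal) + lambda * L_EDM (binary cross-entropy)."""
    l_ssm = focal_loss(seg_logits, seg_gt)
    l_edm = F.binary_cross_entropy_with_logits(edge_logits, edge_gt)
    return l_ssm + lam * l_edm

# Toy example with random predictions and masks.
seg_logits, edge_logits = torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)
seg_gt = (torch.rand(2, 1, 64, 64) > 0.7).float()
edge_gt = (torch.rand(2, 1, 64, 64) > 0.9).float()
print(joint_loss(seg_logits, seg_gt, edge_logits, edge_gt))
```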

2.6. Evaluation Index

In this study, seven indicators spanning two categories—semantic segmentation accuracy and edge accuracy—were used to analyze and evaluate how well the canopy gap extraction model performed, as detailed in Table 1. Specifically, the semantic segmentation accuracy emphasized the model’s capability at identifying canopy gaps and preserving spatial consistency, while edge accuracy quantified the geometric congruence between predicted results and ground-truth edges.
Acc (Equation (5)) reflects how well the model performs in making overall predictions, yet it is prone to overestimation in scenarios where canopy gaps account for a minor class. R (Equation (6)) measures the proficiency of the model in identifying all the actual canopy gaps, while P (Equation (7)) evaluates the accuracy of the prediction outcomes. To address the contradiction between R and P under class imbalance, the F1 score (Equation (8)) provides a harmonized evaluation through their harmonic mean. IoU (Equation (9)) quantifies segmentation accuracy by measuring spatial overlap between predictions and ground truth, with values ranging from 0 to 1, where a higher score indicates a better performance.
$Acc = \dfrac{TP + TN}{TP + TN + FP + FN}$ (5)
$R = \dfrac{TP}{TP + FN}$ (6)
$P = \dfrac{TP}{TP + FP}$ (7)
$F1 = \dfrac{2 \times P \times R}{P + R}$ (8)
$IoU = \dfrac{TP}{TP + FP + FN}$ (9)
In Equation (5), $TP$ (True Positives) corresponds to correctly identified canopy gap pixels, $FN$ (False Negatives) represents canopy gap pixels misclassified as background, $FP$ (False Positives) indicates background pixels erroneously assigned to canopy gaps, and $TN$ (True Negatives) denotes accurately classified background pixels.
HD (Equation (10)) quantifies the spatial coincidence of canopy gap boundaries by calculating the maximum distance between predicted and ground truth edge point sets. ECR (Equation (11)) evaluates the completeness of edge extraction through the proportion of matched edge pixels.
$H(A, B) = \max_{a \in A} \min_{b \in B} d(a, b)$ (10)
where $A$ and $B$ are the predicted and ground-truth edge point sets, respectively, and $d(a, b)$ is the Euclidean distance between points $a$ and $b$.
$ECR = \dfrac{|PE \cap TE|}{|TE|}$ (11)
Here, $PE$ is the set of predicted edge pixels, $TE$ is the set of real (ground-truth) edge pixels, and $|PE \cap TE|$ counts the predicted edge pixels that coincide with real edge pixels.
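For illustration, the two edge metrics can be computed from binary edge masks as sketched below; the use of SciPy's directed Hausdorff routine and the tolerance-free pixel matching for ECR are assumptions about the implementation.

```python
# Illustrative computation of HD (Eq. (10)) and ECR (Eq. (11)) from binary edge masks.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_distance(pred_edges: np.ndarray, true_edges: np.ndarray) -> float:
    """Directed Hausdorff distance H(A, B) of Eq. (10), in pixels."""
    a = np.argwhere(pred_edges)   # predicted edge point set A
    b = np.argwhere(true_edges)   # ground-truth edge point set B
    return float(directed_hausdorff(a, b)[0])

def edge_coverage_rate(pred_edges: np.ndarray, true_edges: np.ndarray) -> float:
    """Fraction of ground-truth edge pixels that are also predicted as edges."""
    matched = np.logical_and(pred_edges, true_edges).sum()
    return float(matched) / float(true_edges.sum())

# Toy masks standing in for predicted and ground-truth edges.
pred = np.zeros((64, 64), bool); pred[10, 5:40] = True
true = np.zeros((64, 64), bool); true[12, 5:45] = True
print(hausdorff_distance(pred, true), edge_coverage_rate(pred, true))
```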

3. Results

3.1. Experimental Environment and Training Parameter Configuration

The ES-Net model was implemented with the PyTorch 1.13.1 deep-learning framework, and the experiments were run on a high-performance workstation equipped with a 13th Gen Intel® Core™ i7-13700KF CPU and an NVIDIA GeForce RTX 4080 GPU. Taking the hardware configuration and sample characteristics into account, the training parameters were set as follows: an initial learning rate of 0.0001, 50 training epochs, and a batch size of 4, with Stochastic Gradient Descent (SGD) selected as the optimizer for efficient updating of the model parameters.
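A minimal training-loop sketch under this configuration is given below; the stand-in model, loss, and toy data, as well as the SGD momentum value, are placeholders and assumptions rather than the released training code.

```python
# Training-loop sketch: SGD, learning rate 1e-4, batch size 4, 50 epochs.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Sequential(torch.nn.Conv2d(3, 1, 3, padding=1))  # stand-in for ES-Net
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
criterion = torch.nn.BCEWithLogitsLoss()                           # stand-in for the joint loss

toy = TensorDataset(torch.randn(8, 3, 64, 64), torch.rand(8, 1, 64, 64).round())
loader = DataLoader(toy, batch_size=4, shuffle=True)

for epoch in range(50):
    for images, masks in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
```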
Taking the test loss as the monitoring index, the loss curves of ES-Net (Figure 7) show a continuous decline in both training and validation loss, asymptotically approaching near-zero values. In terms of key performance metrics (Figure 8), both ES-Net and U-Net exceed 90%, exhibiting good training gains; notably, ES-Net achieves more efficient feature utilization through the EDM and SSM.

3.2. Extraction Effects and Comprehensive Performance Analysis in Multiple Scenarios

ES-Net and U-Net were deployed across three representative test scenarios, and the robustness of both models was systematically validated through the visualization of edge precision and multi-dimensional metrics.

3.2.1. Comparison of Extraction Effect in Complex Scenarios

Figure 9 shows the extraction results under three different scenarios:
General Scenario: Both models nearly achieved a complete extraction of canopy gaps. However, ES-Net demonstrated a superior edge processing performance, with an edge integrity closer to the ground-truth label and a better edge localization capability compared to U-Net.
Spectral Confusion Scenario: When faced with similar spectra between understory shrubs and canopy vegetation in the canopy gaps, U-Net exhibited significant omission errors. In contrast, ES-Net enhanced the discriminative sensitivity to textural patterns through its semantic–edge joint constraint mechanism, which substantially mitigated such omissions.
Shadow Interference Scenario: While both architectures exhibited misclassification phenomena, U-Net’s error proved significantly greater. By leveraging EDM to capture the boundary between canopy gaps and shadows, ES-Net significantly improved the edge integrity and showed a superior robustness in the shadow interference environment.

3.2.2. Visual Analysis of Edge Precision

Figure 10 systematically compares the edge detection accuracy of ES-Net and U-Net in canopy gap extraction and reveals the differences in model performance in three parts. Sub-figures (a) and (b), respectively, show the overlay effects of the edges extracted by U-Net (yellow dashed line) and ES-Net (orange dashed line) with the ground-truth edges (blue solid line). The predicted edges of U-Net exhibit significant deviations from the ground-truth edges in complex regions such as acute turns and curved boundaries, reflecting an inadequate fitting capability for irregular boundaries. In contrast, the predicted edges of ES-Net demonstrate greater spatial overlap with ground-truth edges. Even in ambiguous regions where canopy shadows intersect with canopy gaps, ES-Net maintains continuous and relatively precise edge delineation, highlighting its effective capture of multi-scale edge features.
Sub-figure (c) uses a histogram to count the quantity distribution of the predicted edges of ES-Net (orange) and U-Net (yellow) within different pixel distance intervals from the ground-truth edges (0–5 pixels). Quantitative analyses of the error histograms reveal that ES-Net’s edge localization errors are significantly lower than those of U-Net across all scenarios. Specifically, ES-Net exhibits the highest density of edge points at the 0-pixel error bin, with errors concentrated within the 0–1-pixel interval, indicating an overall closer proximity to the ground-truth edge. Statistical results show that the average error distance of ES-Net is 1.8 ± 0.6 pixels (px), which is 75% lower than that of U-Net (7.2 ± 2.3 px). This high-precision edge localization capability is attributed to directional enhancement of gradient information by the EDM, and the global optimization of the contextual structure by the SSM.

3.3. Comparison of Multi-Dimensional Indicators

Table 2 presents a comparison between ES-Net and U-Net on multi-dimensional evaluation indexes, including accuracy and efficiency indexes. According to the statistical results, ES-Net outperforms U-Net in the core metrics. In terms of segmentation accuracy, its F1 score amounts to 97.64%, representing an increase of 0.49% over U-Net (97.15%); and its IoU is 95.41%, which is 0.86% higher than that of U-Net (94.55%), indicating that the model has a better overall segmentation integrity for the canopy gap region. Regarding edge accuracy, the ECR of ES-Net is 85.32%, outperforming U-Net (82.18%) by 3.14%, which reflects its stronger ability in capturing the boundaries of the canopy gaps; the HD is 28.26 px, marking a 38.6% decrease compared to U-Net (46.02 px), which verifies the accurate localization effect of EDM on complex boundaries.
To assess model deployability and application scenarios, the computational complexity and memory efficiency of U-Net and ES-Net were compared. Although ES-Net has more parameters (23.99 M) and FLOPs (65.94 G) than U-Net owing to the integration of the EDM and the channel attention mechanism, the sharing of convolutional layers and the lightweight design keep the growth in memory footprint controllable. Moreover, its memory footprint is still significantly lower than that of comparable complex models, meeting the real-time processing requirements of UAV remote-sensing platforms.

3.4. Ablation Study

To quantify the contributions of the EDM and Semantic Enhancement Component (SEC, cross-layer attention mechanism) to model performance, four ablation studies were designed. The experiments strictly adhered to the single-variable principle, verifying the independent roles and synergistic effects of core components through stepwise introduction under identical training parameters (an initial learning rate of 0.0001, a total of 50 training epochs, a batch size of 4) and the same dataset partitioning strategy. The ablation configurations are as follows:
Baseline: The original U-Net model, containing only the basic encoder–decoder structure and skip connections.
U-Net + SEC: Based on the baseline, a cross-layer attention mechanism (i.e., the channel attention mechanism in the SSM) was embedded to enhance spectral–semantic feature discrimination.
U-Net + EDM: Based on the baseline, an edge detection branch was added to enforce geometric edge constraints.
ES-Net: The full model integrating both SEC and EDM, fusing semantic and edge information synergistically.
The test-set loss was used as the monitoring condition, and if the loss did not decrease for five successive epochs, training was terminated early to avoid overfitting.
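The early-stopping rule can be sketched as follows; the evaluate_test_loss helper is a hypothetical placeholder standing in for one training epoch followed by test-set evaluation.

```python
# Early stopping: halt once the monitored test loss fails to decrease for 5 consecutive epochs.
def evaluate_test_loss(epoch: int) -> float:
    """Hypothetical stand-in for one training epoch followed by test-set evaluation."""
    return 1.0 / (epoch + 1) if epoch < 20 else 0.05  # toy curve that plateaus after epoch 20

def train_with_early_stopping(max_epochs: int = 50, patience: int = 5):
    best_loss, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        test_loss = evaluate_test_loss(epoch)
        if test_loss < best_loss:
            best_loss, epochs_without_improvement = test_loss, 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Early stopping at epoch {epoch}")
            break

train_with_early_stopping()
```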
Table 3 presents the core segmentation metrics (IoU) and model complexity (parameter count) under different configurations, with the results indicating the following: (1) Effectiveness of Semantic Enhancement Component: U-Net + SEC achieved an IoU of 84.0% (vs. Baseline’s 78.0%, + 6.0% improvement), demonstrating that the cross-layer attention mechanism mitigated misclassification caused by “same-spectrum heterogeneity” through reinforced feature responses in spectral-sensitive channels. This validates the role of semantic information in improving global segmentation accuracy. (2) Necessity of Edge Detection Module: U-Net + EDM’s IoU (82.0%, + 4.0% over baseline) confirmed that the edge detection branch provided geometric structural constraints for semantic segmentation by capturing pixel gradient mutations, reducing contour blurring and boundary fragmentation. (3) Synergistic Effect Verification: ES-Net’s IoU (91.0%), a striking 13.0% improvement over the baseline and surpassing single-component models, proved that semantic-guided edge feature filtering and edge-constrained semantic segmentation correction acted complementarily. Their joint optimization enhanced feature representation in complex scenarios.

3.5. Cross-Region Validation

To comprehensively validate the practicality of ES-Net, a forest area in Qinglong Town, Anning City, Yunnan Province, which is located outside the original research region, was selected as the cross-regional application test area. The geographical environment and vegetation types of this area are distinct from those of the study area, enabling an effective examination of the model’s adaptability under diverse ecological backgrounds. In Figure 11, the experimental results demonstrate that the ES-Net can not only accurately extract the fine boundaries of typical canopy gaps and non-canopy gap regions (as shown in area a of Figure 11) but also exhibit excellent noise suppression capabilities under complex lighting conditions. It effectively filters out interference factors such as crown shadows (as shown in area b of Figure 11), ensuring the accuracy of canopy gap information extraction.
The ES-Net shows a good generalization ability in cross-regional canopy gap extraction tasks. Whether in the typical areas within the study area or in external heterogeneous environments, the model maintains a stable extraction accuracy, verifying its robustness in different geographical settings. It is worth noting that the spectral characteristics, spatial structures, and phenological cycles of ground objects exhibit significant spatial heterogeneity and temporal dynamics. Such differences may potentially affect the model’s performance. Therefore, in practical application scenarios, a strategy of “local samples as the main focus and remote models as auxiliary” is recommended. Specifically, multisource remote-sensing data and in situ observation samples from the target area should be preferentially collected as the primary training data. Meanwhile, pre-trained remote models should be loaded as the basic framework and then optimized through transfer learning with a small number of samples from the target area to maximize the model’s performance. This strategy not only makes full use of the prior knowledge of remote models but also enables rapid adaptation to the unique characteristics of ground objects in the target area, providing a scientific and feasible implementation path for the cross-regional promotion and application of forest gap extraction techniques.

4. Discussion

4.1. The Optimizing Effect of Edge–Semantic Synergy Mechanism on Complex Boundaries

The ES-Net proposed in this study achieves the deep fusion of spectral semantics and geometric edges in canopy gap extraction, and the edge accuracy metrics are significantly improved. Compared with U-Net, ES-Net reduces the HD from 46.02 px to 28.26 px (a 38.6% reduction) and increases the ECR from 82.18% to 85.32%, indicating that the model achieves a breakthrough in capturing irregular boundaries. This optimization stems from two core innovations: (1) Semantic-guided edge feature selection: The EDM does not operate independently but leverages multi-scale features from a shared encoder—shallow gradients, mid-level textures, and deep semantics—to filter out disturbing edges using spectral attention priors. (2) Edge-constrained semantic segmentation refinement: Edge detection results backwardly supervise the semantic segmentation process via a joint loss function, enforcing alignment between segmentation contours and geometric edges.

4.2. In-Depth Analysis of Hausdorff Distance (HD)

As an indicator measuring the maximum spatial deviation between predicted and ground-truth edges, the 38.6% reduction in HD not only reflects an improvement in overall boundary congruence but also reveals the model’s capability to handle extremely complex boundaries. From the error histograms (Figure 10), it can be seen that ES-Net’s edge points exhibit an extremely high density at 0 pixels, with 90% of edge point errors falling within the 0–2-pixel interval, indicating that its localization accuracy has approached the pixel level. Such high precision holds non-negligible value in ecological applications.
Nonetheless, the average HD (28.26 px) remains relatively high, primarily due to the interference of extreme cases: In canopy gap samples with severe spectral ambiguity (e.g., moss-covered bare soil vs. canopy shadows) or extremely fragmented morphology (area < 5 m²), local maximum deviations are significant, thereby inflating the overall mean value. This reflects the inherent complexity of canopy gap extraction in natural scenes.

4.3. Comparison with Existing Methods

Compared with traditional object-oriented classification and single semantic segmentation models, the advantages of ES-Net are reflected in three dimensions: (1) Automation and multi-dimensionality of feature characterization: Object-oriented classification relies on manually designed spectral–textural features, which requires repeated parameter tuning in complex mountainous environments and has limited boundary delineation capability; while ES-Net achieves a transition from “manual feature engineering” to “data-driven representation” through end-to-end training, automatically uncovering non-linear dependencies between spectral channels and multi-scale edge cues. (2) Synergistic gains of multi-task learning: Transformer-based segmentation models, with their global self-attention mechanisms, prioritize long-range dependencies. In contrast, ES-Net employs a local gradient enhancement strategy in its explicit edge detection branch, enabling more efficient fitting of irregular boundaries (e.g., acute angles, curves)—a demonstration of the superiority of task-specific architectural design. (3) Balance between computational efficiency and accuracy: Although ES-Net has a higher parameter count (23.99 M) and FLOPs (65.94 G) than U-Net, its memory footprint is only 405.89 M—significantly lower than that of comparable multi-task models (e.g., DeepLabv3+ [36] requires over 800 M)—achieved through shared encoder convolutional layers and a lightweight edge detection head design. This meets the real-time processing requirements of UAV platforms, with an inference speed of 12 frames per second for 512 × 512-pixel images.

4.4. Data-Model Limitations and Enhancement Strategies

This study still has room for improvement in terms of data composition, modal fusion, and model optimization. These limitations provide clear technical directions for future research.

4.4.1. Insufficient Data Diversity and Cross-Domain Generalization Challenges

Although the current dataset contains 2808 samples generated through multi-scale cropping and data augmentation, single geographic coverage and limited vegetation types remain significant limitations. The study area is dominated by Pinus yunnanensis coniferous forests, whose canopy structures are relatively regular, whereas canopy gap characteristics differ fundamentally in tropical rainforests, temperate mixed forests, and plantations. The overlapping of multi-layer canopies in tropical rainforests leads to spectral confusion; the species diversity in temperate mixed forests causes boundary irregularities; and the neat arrangement of plantation forests forms unique textural patterns.
Such data biases may cause the model to face the following problems during cross-regional applications: (1) Spectral domain shift: Differences in the phenological cycles and chlorophyll contents of vegetation in different climate zones mean the spectral responses of canopy gaps in remote-sensing images show significant regional characteristics; (2) Insufficient terrain adaptability: The existing samples lack comparisons between high-altitude mountainous areas and low-altitude plains, meaning they struggle to cope with variations in illumination and shadows caused by slope and altitude. Future research should expand data collection to increase sample diversity, ensuring inclusion of canopy gap samples under diverse geographic conditions to enhance the model’s adaptability to complex environments and improve its generalization performance.
In terms of the technical approach, a domain adaptation framework can be introduced. By constructing a dual-domain dataset that includes Yunnan pine forests (source domain) and target ecosystems (tropical rainforests, temperate mixed forests), CycleGAN [37] can be used to learn bidirectional mapping between domains. The near-infrared reflectance can be adjusted to achieve the spectral “tropicalization” conversion from coniferous forests to broad-leaved forests, and the spectral consistency of the synthetic samples can be verified using the Spectral Angle Mapper (SAM) with a threshold of 0.15. Meanwhile, topographic factors such as slope and altitude can be encoded as conditional vectors and embedded into the encoder’s batch normalization layer. Through cross-topographic experiments involving low-altitude training and high-altitude testing, domain adaptation in both spectral and topographic dimensions can be realized to enhance cross-domain feature invariance.
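As an illustration of the spectral-consistency check mentioned above, a minimal Spectral Angle Mapper (SAM) sketch is given below; the example reflectance vectors are illustrative values, not project data.

```python
# SAM check: accept a synthetic spectrum when its spectral angle to the reference is < 0.15.
import numpy as np

def spectral_angle(s1: np.ndarray, s2: np.ndarray) -> float:
    """Angle (radians) between two spectral vectors: arccos of their normalized dot product."""
    cos_sim = np.dot(s1, s2) / (np.linalg.norm(s1) * np.linalg.norm(s2))
    return float(np.arccos(np.clip(cos_sim, -1.0, 1.0)))

synthetic = np.array([0.05, 0.08, 0.06, 0.42])   # illustrative B, G, R, NIR reflectance
reference = np.array([0.04, 0.09, 0.05, 0.45])
angle = spectral_angle(synthetic, reference)
print(angle, angle < 0.15)                        # accept the synthetic sample if SAM < 0.15
```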

4.4.2. Limitations of Single-Modal Data and Requirements for Multi-Source Fusion

While UAV-collected image data offers a high resolution, it lacks 3D information such as vegetation height and canopy structure and is only suitable for small-scale study areas, with data acquisition easily constrained by multiple factors. Future research will introduce multi-modal data fusion strategies to fully leverage the complementary advantages of different data sources across spatial, spectral, and temporal dimensions.
Specifically, first obtain Canopy Height Model (CHM) data via LiDAR. A dual-branch encoder network is used to extract spectral–textural features from UAV RGB imagery and 3D structural features from CHM, respectively. Feature fusion is achieved through a cross-modal attention mechanism, and joint training is conducted using Focal Loss and 3D boundary loss to enhance model performance [38,39]. Additionally, a pyramidal scale matching strategy is adopted to align cross-resolution features between high-resolution UAV data and wide-coverage satellite data. Dilated convolution is used to preserve detailed features, while temporal gating units are introduced to fuse satellite time-series phenological features with single-temporal UAV features. During training, fusion weights are dynamically adjusted based on data source reliability, with the spectral angle distance and 3D boundary error serving as cross-modal consistency evaluation metrics [40,41,42]. This multi-modal fusion strategy can effectively overcome the inherent limitations of single data sources in spatial resolution, 3D information acquisition, and monitoring scope, achieving a collaborative improvement in canopy gap boundary localization accuracy and large-scale monitoring efficiency.
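To make the fusion idea concrete, a hedged sketch of a dual-branch encoder with a simple cross-modal attention step is given below; all architectural details (channel counts, attention form) are assumptions for illustration and do not represent a finalized design.

```python
# Sketch of dual-branch RGB/CHM feature extraction with a simple cross-modal attention fusion.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.rgb_branch = nn.Conv2d(3, channels, 3, padding=1)   # spectral-textural branch (UAV RGB)
        self.chm_branch = nn.Conv2d(1, channels, 3, padding=1)   # 3D-structural branch (LiDAR CHM)
        self.attn = nn.Sequential(                                # attention weights conditioned on both
            nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid()
        )

    def forward(self, rgb: torch.Tensor, chm: torch.Tensor) -> torch.Tensor:
        f_rgb, f_chm = self.rgb_branch(rgb), self.chm_branch(chm)
        w = self.attn(torch.cat([f_rgb, f_chm], dim=1))
        return w * f_rgb + (1 - w) * f_chm                        # weighted fusion of the two branches

fused = CrossModalFusion()(torch.randn(1, 3, 128, 128), torch.randn(1, 1, 128, 128))
print(fused.shape)  # torch.Size([1, 64, 128, 128])
```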

4.4.3. Optimization of the Joint Loss Function

Although the synergy between the edge detection module and channel attention mechanism has improved key metrics, the joint loss function still suffers from gradient imbalance in optimizing extreme boundary points. The current fixed-weight strategy tends to cause gradient vanishing when handling acute-angle boundaries or fragmented edges. This is because Focal Loss adjusts the weights of easy/hard samples through a fixed γ parameter, which alleviates class imbalance but does not explicitly model the complexity of edge geometric features. Future work could construct an adaptive dynamic loss weighting framework: leveraging a dynamic weight annealing mechanism, it would adopt balanced weights in the early training stage to promote feature sharing, then adaptively adjust weights based on HD fluctuations in the validation set during later stages, achieving a smooth transition from “feature synergy” to “task specialization.”

5. Conclusions

This study successfully constructed a multi-task deep learning network (ES-Net) suitable for complex mountainous forest areas in Yunnan Province. Compared with Transformer-based models (e.g., EdgeFormer) and traditional multi-task models (e.g., DeepLabv3+), ES-Net achieves precise boundary mutation localization through the multi-scale gradient fusion of its edge detection module (EDM). Simultaneously, the cross-layer attention mechanism in its semantic segmentation module (SSM) enhances the discrimination of spectral–semantic features. Finally, a joint loss function is employed to collaboratively optimize semantic and geometric features. This design enables ES-Net to efficiently address challenges such as blurred contours, missing details, and discontinuous boundaries during canopy gap extraction while maintaining computational efficiency. Experimental results demonstrate that ES-Net exhibits significant advantages when dealing with spectral confusion and shadow interference scenarios, achieving simultaneous improvements in both canopy gap extraction accuracy and edge localization precision. The research outcomes provide a new and efficient technical approach for the precise extraction of canopy gap information in forest ecosystems. Subsequent studies will focus on multi-modal data fusion, lightweight model improvement, and cross-regional migration capability optimization, aiming to continuously advance the upgrading and refinement of canopy gap extraction technology, thereby better serving forest ecological research and related application requirements.

Author Contributions

Conceptualization, J.X. and P.D.; Methodology, Y.W., J.X. and F.Z.; Software, Y.W. and Z.Z.; Validation, Y.W.; Formal analysis, Y.W.; Investigation, Y.W.; Resources, Y.W.; Data curation, Y.W. and Z.Z.; Writing—original draft, Y.W.; Writing—review & editing, Y.W. and P.D.; Visualization, Y.W. and Z.Z.; Supervision, J.X. and F.Z.; Project administration, J.X.; Funding acquisition, J.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (No. 42061038).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Watt, A.S. Pattern and process in the plant community. J. Ecol. 1947, 35, 1–22.
2. Brokaw, N.V.L. The definition of treefall gap and its effect on measures of forest dynamics. Biotropica 1982, 11, 158–160.
3. Liu, B.; Zhao, P.; Zhou, M.; Wang, Y.; Yang, L.; Shu, Y. Effects of forest gaps on the regeneration pattern of the undergrowth of secondary Poplar-Birch forests in southern greater Xing’an mountains. For. Grassl. Resour. Res. 2019, 4, 31–36+45.
4. Xu, B. Tree gap and its impact on forest ecosystem. J. Hebei For. Sci. Technol. 2021, 42–46.
5. Tong, R.; Ji, B.; Wang, G.G.; Lou, C.; Ma, C.; Zhu, N.; Yuan, W.; Wu, T. Canopy gap impacts on soil organic carbon and nutrient dynamic: A meta-analysis. Ann. For. Sci. 2024, 81, 12.
6. Haber, L.T.; Fahey, R.T.; Wales, S.B.; Correa Pascuas, N.; Currie, W.S.; Hardiman, B.S.; Gough, C.M. Forest structure, diversity, and primary production in relation to disturbance severity. Ecol. Evol. 2020, 10, 4419–4430.
7. Orman, O.; Dobrowolska, D. Gap dynamics in the Western Carpathian mixed beech old-growth forests affected by spruce bark beetle outbreak. Eur. J. For. Res. 2017, 136, 571–581.
8. Shen, H.; Cai, J.; Li, M.; Chen, Q.; Ye, W.; Wang, Z.; Lian, J.; Song, L. On Chinese forest canopy biodiversity monitoring. Biodivers. Sci. 2017, 25, 229–236.
9. Smith, A.M.; Ramsay, P.M. A comparison of ground-based methods for estimating canopy closure for use in phenology research. Agric. For. Meteorol. 2018, 252, 18–26.
10. Zhou, L.; Cheng, X.; Zhang, M. Effects of forest gaps on understory species diversity of Larix principis-rupprechtii natural secondary forest. J. Beijing For. Univ. 2024, 46, 48–56.
11. Sun, Z.; Wang, X.; Wang, Z.; Yang, L.; Xie, Y.; Huang, Y. UAVs as remote sensing platforms in plant ecology: Review of applications and challenges. J. Plant Ecol. 2021, 14, 1003–1023.
12. Guimarães, N.; Pádua, L.; Marques, P.; Silva, N.; Peres, E.; Sousa, J.J. Forestry remote sensing from unmanned aerial vehicles: A review focusing on the data, processing and potentialities. Remote Sens. 2020, 12, 1046.
13. Ecke, S.; Dempewolf, J.; Frey, J.; Schwaller, A.; Endres, E.; Klemmt, H.-J.; Tiede, D.; Seifert, T. UAV-Based forest health monitoring: A systematic review. Remote Sens. 2022, 14, 3205.
14. Mao, X.; Xing, X.; Li, J.; Tan, L.; Fan, W. Object-Oriented recognition of forest gap based on aerial orthophoto. Sci. Silvae Sin. 2019, 55, 87–96.
15. Wang, Y.; Lian, J.; Zhang, Z.; Hu, J.; Yang, J.; Li, Y.; Ye, W. Extraction and analysis of forest gaps and forest canopies based on two types of UAV aerial images. Trop. Geogr. 2019, 39, 553–561.
16. Xia, J.; Wang, Y.; Dong, P.; He, S.; Zhao, F.; Luan, G. Object-Oriented canopy gap extraction from UAV images based on edge enhancement. Remote Sens. 2022, 14, 4762.
17. Htun, N.M.; Owari, T.; Tsuyuki, S.; Hiroshima, T. Detecting canopy gaps in Uneven-Aged mixed forests through the combined use of unmanned aerial vehicle imagery and deep learning. Drones 2024, 8, 484.
18. Ding, L.; Hong, D.; Zhao, M.; Chen, H.; Li, C.; Deng, J.; Yokoya, N.; Bruzzone, L.; Chanussot, J. A Survey of Sample-Efficient Deep Learning for Change Detection in Remote Sensing: Tasks, strategies, and challenges. IEEE Geosci. Remote Sens. Mag. 2025, 2–27.
19. Zhang, J.; Wu, T.; Luo, J.; Hu, X.; Wang, L.; Li, M.; Lu, X.; Li, Z. Toward Agricultural Cultivation Parcels Extraction in the Complex Mountainous Areas Using Prior Information and Deep Learning. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–14.
20. Wu, Y.; Peng, Z.; Hu, Y.; Wang, R.; Xu, T. A dual-branch network for crop-type mapping of scattered small agricultural fields in time series remote sensing images. Remote Sens. Environ. 2025, 316, 114497.
21. Zou, Y.; Ma, Y.Y. Edgeformer: Edge-Enhanced Transformer for High-Quality Image Deblurring. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 504–509.
22. Sadel, J.; Tulczyjew, L.; Wijata, A.M.; Przeliorz, M.; Nalepa, J. Monitoring Forest Changes With Foundation Models and Sentinel-2 Time Series. IEEE Geosci. Remote Sens. Lett. 2025, 22, 5001105.
23. Nguyen, T.A.; Russwurm, M.; Lenczner, G.; Tuia, D. Multi-temporal forest monitoring in the Swiss Alps with knowledge-guided deep learning. Remote Sens. Environ. 2024, 305, 114109.
24. Guo, Y.T.; Long, T.F.; Jiao, W.L.; Zhang, X.M.; He, G.J.; Wang, W.; Peng, Y.; Xiao, H. Siamese Detail Difference and Self-Inverse Network for Forest Cover Change Extraction Based on Landsat 8 OLI Satellite Images. Remote Sens. 2022, 14, 627.
25. Li, J.; Wei, Y.; Wei, T.; He, W. A comprehensive Deep-Learning framework for Fine-Grained farmland mapping from High-Resolution images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–15.
26. Chen, J.; Fu, Y.; Guo, Y.; Xu, Y.; Zhang, X.; Hao, F. An improved deep learning approach for detection of maize tassels using UAV-based RGB images. Int. J. Appl. Earth Obs. Geoinf. 2024, 130, 103922.
27. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023.
28. Liu, Y.; Cheng, M.M.; Hu, X.; Bian, J.W.; Zhang, L.; Bai, X.; Tang, J. Richer convolutional features for edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 1939–1946.
29. He, J.; Zhang, S.; Yang, M.; Shan, Y.; Huang, T. BDCN: Bi-Directional cascade network for perceptual edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 100–113.
30. Li, S.; Peng, L.; Hu, Y. FD-RCF-based boundary delineation of agricultural fields in high resolution remote sensing images. J. Univ. Chin. Acad. Sci. 2020, 37, 483–489.
31. Liu, R.; Liu, M.; Zhao, Q.; Ma, X. Urban road boundary extraction method with road edge information. Remote Sens. Inf. 2025, 40, 34–41.
32. Liu, Y.; Zhang, Y.; Ao, Y.; Jiang, D.; Zhang, Z. Building extraction from remote sensing images via the multiscale information fusion method under the Transformer architecture. Natl. Remote Sens. Bull. 2024, 28, 3173–3183.
33. Russell, B.C.; Torralba, A.; Murphy, K.P.; Freeman, W.T. LabelMe: A database and Web-Based tool for image annotation. Int. J. Comput. Vis. 2008, 77, 157–173.
34. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
35. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for dense object detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007.
36. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the Computer Vision—ECCV 2018; Springer International Publishing: Cham, Switzerland, 2018; pp. 833–851.
37. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image translation using Cycle-Consistent adversarial networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2242–2251.
38. Liu, H.; Cao, F.; She, G.; Cao, L. Extrapolation assessment for forest structural parameters in planted forests of southern China by UAV-LiDAR samples and multispectral satellite imagery. Remote Sens. 2022, 14, 2677.
39. Gaulton, R.; Malthus, T.J. LiDAR mapping of canopy gaps in continuous cover forests: A comparison of canopy height model and point cloud based techniques. Int. J. Remote Sens. 2010, 31, 1193–1211.
40. Li, Y.; Yan, W.; An, S.; Gao, W.; Jia, J.; Tao, S.; Wang, W. A Spatio-Temporal fusion framework of UAV and satellite imagery for winter wheat growth monitoring. Drones 2023, 7, 23.
41. He, H.; Li, C.; Yang, R.; Zeng, H.; Li, L.; Zhu, Y. Multisource data fusion and adversarial nets for landslide extraction from UAV-Photogrammetry-Derived data. Remote Sens. 2022, 14, 3059.
42. Cheng, J.; Zhu, Y.; Zhao, Y.; Li, T.; Chen, M.; Sun, Q.; Gu, Q.; Zhang, X. Application of an improved U-Net with image-to-image translation and transfer learning in peach orchard segmentation. Int. J. Appl. Earth Obs. Geoinf. 2024, 130, 103871.
Figure 1. Administrative divisions of Yunnan (a), Google imagery of Chenggong District (b), and UAV orthophoto imagery of the study area (c).
Figure 2. Flowchart of dataset production. The red dotted box indicates the position where the window moves.
Figure 3. Local segmentation results at different resolutions. Some edge information was lost, making it difficult for the model to learn the complete morphology of canopy gaps (a,b); the images contained numerous task-irrelevant background features, leading to a waste of computational resources during training (d); subfigure (c) balances edge retention and background complexity, avoiding the defects of (a,b,d).
Figure 4. Semantic segmentation module.
Figure 5. Channel attention module.
Figure 6. Edge detection module.
Figure 7. Loss function curve for ES-Net model.
Figure 8. Comparison curves of F1 score, accuracy, and IoU for ES-Net and U-Net under different training epochs.
Figure 9. Visual comparison of extraction results in different scenarios. In sequence, (a–d) display the original image, the label, the extraction result of the U-Net model, and the extraction result of the ES-Net model, where green represents correct classification, red represents misclassification, and blue represents missed classification.
Figure 10. Edge accuracy comparison of ES-Net and U-Net. Specifically, (a) overlays U-Net extracted edges with ground truth; (b) overlays ES-Net extracted edges with ground truth; (c) shows the count distribution of ES-Net and U-Net predicted edges at different distances from ground truth.
Figure 11. Results of canopy gap extraction in the validation area and enlarged views of regions (a,b). Red boxes represent random local areas.
Table 1. Evaluation indicators.
Semantic Accuracy: Accuracy (Acc), Precision (P), Recall (R), F1 score (F1), Intersection over Union (IoU)
Edge Accuracy: Hausdorff Distance (HD), Edge Coverage Rate (ECR)
Table 2. Multi-dimensional indicators.
                         U-Net      ES-Net
Semantic Accuracy
  Acc/%                  98.90      99.24
  P/%                    97.91      97.64
  R/%                    96.51      97.66
  F1/%                   97.15      97.64
  IoU/%                  94.55      95.41
Edge Accuracy
  HD/px                  46.02      28.26
  ECR/%                  82.18      85.32
Efficiency Indexes
  Params/M               17.27      23.99
  FLOPs/G                30.77      65.94
  Memory/M               162.79     405.89
Table 3. Results of the ablation study (based on test sets). (The IoU values of the ablation experiment are measured on a validation subset (10% of the test set) that contains more challenging cases such as small-area canopy gaps and strong shadow interference, which differs from the complete test set in Table 2.)
                  IoU      Params (M)
Baseline          0.78     17.27
U-Net + SEC       0.84     23.58
U-Net + EDM       0.82     17.67
ES-Net            0.91     23.99

