1. Introduction
Farmland shelterbelts are key components of farmland ecosystems, playing a crucial role in safeguarding global food security, ecological balance, and sustainable agricultural development [1]. These shelterbelts achieve this by reducing wind speeds, preventing soil and water erosion, minimizing transpiration, blocking dust storms, and regulating local microclimates [2,3], thereby helping to stabilize crop yields and mitigate the impacts of extreme weather events [4,5]. In arid and semi-arid regions, these shelterbelts additionally fulfill vital functions in ecological stabilization, land restoration, and combating desertification [6,7,8,9]. However, under the backdrop of global climate warming and intensifying anthropogenic disturbances, farmland shelterbelts are confronting challenges of degradation and fragmentation [10], which weaken their protective functions and threaten farmland ecological stability and agricultural sustainability [11]. Therefore, promptly acquiring information on the distribution, structure, health status, and change trends of farmland shelterbelts has become a pressing need in ecological conservation and agricultural management [12,13]. Moreover, conducting large-scale, high-precision extraction of these shelterbelts represents a critical step in addressing this demand.
Farmland shelterbelts typically exhibit narrow, linear distributions embedded within highly heterogeneous farmland landscapes that undergo pronounced seasonal variations, posing significant challenges to their automatic identification and precise extraction [14,15]. In complex farmland environments characterized by mixed vegetation, existing methods often suffer from misclassifications due to spectral confusion, thereby limiting extraction accuracy and completeness [16]. Although field surveys can yield reliable data, their low efficiency renders them unsuitable for large-scale monitoring tasks. With technological advancements, multi-source remote sensing methods offer advantages in efficiency [17]; however, while visual interpretation of remote sensing imagery achieves higher accuracy [18], it is similarly constrained by labor costs, hindering rapid large-area monitoring. Current mainstream approaches integrate high-resolution imagery with traditional machine learning algorithms (e.g., random forest, support vector machines, and classification and regression trees (CART)) [19,20], performing well during sparse vegetation periods (e.g., April–May) [21]; however, accuracy notably declines in the lush growth season, and their reliance on handcrafted feature engineering further limits applicability across diverse regional scenarios.
Deep learning, particularly convolutional neural network (CNN)-based semantic segmentation, has significantly advanced the extraction of complex features from remote sensing imagery [22]. Unlike classification-based approaches that assign labels at the image or object level, semantic segmentation enables direct pixel-level prediction, allowing precise delineation of target boundaries and preservation of fine structural details. This characteristic is especially critical for farmland shelterbelts, which are typically narrow, linear features embedded in heterogeneous agricultural landscapes. CNN-based models inherently encode strong inductive biases, such as locality, translation invariance, and hierarchical receptive field expansion, that are well suited for capturing fine-scale linear structures and local boundary cues. Accordingly, a series of improved CNN architectures, including U-Net, Attention U-Net (AttU_Net), ResU-Net, and U2-Net, have been widely applied to remote sensing tasks such as building extraction, road detection, land cover mapping, and vegetation identification, demonstrating strong performance [23,24,25,26].
In recent years, Transformer architectures have also been introduced into remote sensing semantic segmentation [27], with derivative models like SwinUNet and TransUNet excelling in high-resolution land cover classification and feature segmentation [28,29]. However, farmland shelterbelts represent a distinct category of targets characterized by narrow widths, elongated geometries, and strong local boundary cues, with target widths often spanning only a few pixels in high-resolution imagery. Accurate extraction of such features therefore relies heavily on precise local feature modeling and boundary preservation. Under sample-limited conditions, the global context modeling emphasized by Transformer architectures and their reliance on patch-based tokenization may dilute critical local structural information, potentially leading to fragmented or missed detections of narrow linear features.
Although both CNN-based and Transformer-based models have made notable progress in general remote sensing segmentation tasks, challenges remain in high-resolution agricultural remote sensing applications due to scarce labeled samples and complex background interference [30]. For the extraction of farmland shelterbelts in high-interference environments, existing studies lack systematic comparisons of the performance, stability, and applicability of different models. Therefore, this study selects six representative models (U-Net, AttU_Net, ResU-Net, U2-Net, SwinUNet, and TransUNet) for systematic evaluation in sample-limited, complex farmland scenarios. This study aims to provide empirical evidence for high-precision extraction of farmland shelterbelts, thereby supporting practical applications in precision agriculture and ecological monitoring.
3. Methods
3.1. Remote Sensing Data Preprocessing
This study implemented a systematic preprocessing workflow, with all operations performed in ENVI (Environment for Visualizing Images, v5.6; L3Harris Geospatial, Boulder, CO, USA). The workflow sequentially comprised radiometric calibration, atmospheric correction, orthorectification, and image fusion to obtain high-precision surface reflectance products, as illustrated in Figure 2.
The Radiometric Calibration tool in ENVI was used to convert raw digital number (DN) values to at-sensor radiance, based on the absolute calibration coefficients provided by the China Centre for Resources Satellite Data and Application, ensuring accuracy in subsequent inversions. Atmospheric correction was performed in ENVI using the FLAASH module. The atmospheric model was set to Sub-Arctic Summer and the aerosol model was set to Rural. Aerosol retrieval was disabled (None), water vapor retrieval was not performed, and the initial visibility was set to 40 km. Building on the radiometric and atmospheric corrections, orthorectification was conducted using the RPC Orthorectification tool in ENVI, supported by the Global Multi-resolution Terrain Elevation Data 2010 (GMTED2010) digital elevation model at a spatial resolution of 30 arc-seconds. Finally, image fusion was carried out using the Gram–Schmidt pan-sharpening method. The panchromatic image (after radiometric calibration and orthorectification) and the multispectral image (after radiometric calibration, atmospheric correction, and orthorectification) were fused, integrating the spatial detail of the 2 m panchromatic data with the spectral information of the 8 m multispectral data to generate high-quality multispectral imagery at a 2 m spatial resolution. All fusion parameters were kept at their default settings, including the use of bilinear interpolation during resampling.
3.2. Dataset Construction
Dataset construction was designed to address the challenge of extracting farmland shelterbelts under high vegetation interference during the peak growing season. To this end, three high-resolution Gaofen-6 images acquired in June, July, and August were selected, corresponding to the lush growth period of northern crops, when spectral similarities among crops, orchards, and shelterbelts are most pronounced. This setting represents the most challenging discrimination conditions for shelterbelt extraction and thus provides a rigorous testbed for model robustness.
From each image scene, 20 image blocks of 512 × 512 pixels were strategically cropped, yielding a total of 60 blocks. These blocks were treated as high-quality source imagery rather than direct training samples. Instead of purely random sampling, block locations were manually guided to ensure representativeness and diversity of challenging scenarios. This sampling strategy intentionally prioritizes typical yet difficult cases over sheer sample quantity, which is particularly important for evaluating model performance under sample-limited conditions. Specifically, the selected blocks were required to contain representative farmland shelterbelt structures, including complete belts, belt edges, and structurally complex or fragmented segments. In addition, emphasis was placed on areas with strong background interferences, such as adjacent croplands, orchards, irrigation channels, and natural vegetation, to enhance the models’ ability to distinguish shelterbelts from spectrally and morphologically similar objects.
Each 512 × 512 block was further subdivided using a sliding-window strategy with a window size of 256 × 256 pixels and a stride of 256 pixels, generating 240 non-overlapping patch-level samples that constituted the actual inputs for model training and validation. All image patches were subjected to pixel-level annotation following a unified definition of farmland shelterbelts. Annotations were performed independently by two trained annotators.
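The patch count above can be checked with a minimal sketch of the sliding-window subdivision. The function name `tile_offsets` is hypothetical and introduced only for illustration; the block size, window size, stride, and scene counts are those stated in the text.

```python
def tile_offsets(block_size=512, window=256, stride=256):
    """Return the (row, col) top-left corners of all sliding windows in a square block."""
    starts = range(0, block_size - window + 1, stride)
    return [(r, c) for r in starts for c in starts]


offsets = tile_offsets()

# 3 scenes x 20 blocks per scene, each block cut into len(offsets) patches
n_scenes, blocks_per_scene = 3, 20
total_patches = n_scenes * blocks_per_scene * len(offsets)

print(offsets)        # top-left corners of the patches within one block
print(total_patches)  # total number of patch-level samples
```

Because the stride equals the window size, consecutive windows share no pixels, which is the property the cross-validation design in Section 3.6 relies on.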
To quantify annotation consistency, inter-rater agreement was evaluated at the pixel level. Specifically, Cohen’s κ coefficient was computed for the binary classification task (shelterbelt vs. non-shelterbelt). Across all 240 patches, the two annotators achieved a Cohen’s κ of 0.9114 (95% CI: [0.9062, 0.9166]), indicating high agreement. In addition, the mean Dice coefficient between the two annotation masks was 0.9118, reflecting strong spatial overlap. Pixel-level disagreements accounted for approximately 0.95% of all annotated pixels. All annotation discrepancies were subsequently adjudicated by a senior researcher, and the adjudicated annotations were used as the final ground-truth labels for model training and validation.
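As an illustration of the two agreement measures, the following sketch computes Cohen's κ and the Dice coefficient for a pair of flattened binary masks. The function names and the toy masks are hypothetical; the study reports the mean Dice over patches, whereas this sketch handles a single mask pair.

```python
def cohen_kappa_binary(mask_a, mask_b):
    """Cohen's kappa for two equally long sequences of 0/1 labels."""
    n = len(mask_a)
    agree = sum(1 for a, b in zip(mask_a, mask_b) if a == b)
    p_observed = agree / n
    # Chance agreement from each annotator's marginal foreground rate
    rate_a = sum(mask_a) / n
    rate_b = sum(mask_b) / n
    p_expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_expected == 1.0:  # both annotators used a single identical class
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


def dice_coefficient(mask_a, mask_b):
    """Dice overlap between two binary masks (1 = shelterbelt)."""
    intersection = sum(a & b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    return 1.0 if total == 0 else 2 * intersection / total


# Toy flattened masks from two hypothetical annotators
annotator_1 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
kappa = cohen_kappa_binary(annotator_1, annotator_2)
dice = dice_coefficient(annotator_1, annotator_2)
```

κ corrects the raw agreement rate for agreement expected by chance, which matters here because background pixels dominate and two annotators would agree on most pixels even with poor shelterbelt delineation.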
To further improve model generalization and mitigate overfitting, online data augmentation was applied during training, including random horizontal or vertical flipping (p = 0.5) and random rotation within ±45°, simulating diverse shelterbelt orientations. All augmentation operations were performed in memory with a fixed global random seed (seed = 42) to ensure experimental reproducibility. Through data augmentation, the effective training sample diversity was substantially increased, theoretically expanding the dataset to approximately 1200 samples. Given the relatively consistent structural characteristics of linear shelterbelt targets, this patch-based learning strategy combined with data augmentation provides sufficient training support for robust model learning under the given data scale.
3.3. Experimental Environment and Parameter Settings
To ensure experimental reproducibility and fair comparisons, training and evaluation were conducted under a unified computing environment and hyperparameter configuration. All experiments were performed on a computer equipped with an Intel Core i5-13600KF CPU and an NVIDIA GeForce RTX 4070 Super GPU (12 GB VRAM), using a software environment of Python (v3.9.23), PyTorch (v2.6.0), and CUDA (v12.4; NVIDIA, Santa Clara, CA, USA). On this basis, this study adopted a unified hyperparameter configuration (Table 1), guided by the principle of avoiding model-specific optimizations to ensure that performance differences are attributable to the architectures themselves rather than tuning strategies.
The input image size was fixed at 256 × 256 pixels, and the batch size was set to 16, representing a trade-off between GPU memory constraints and training stability. All models were trained for 200 epochs to ensure sufficient convergence under the unified training protocol. The AdamW optimizer was employed for all experiments, with an initial learning rate of 0.001 and a weight decay of 0.01. This setting follows standard practice in semantic segmentation, where decoupled weight decay has been shown to provide stable optimization and improved generalization performance [42,43]. Learning rate scheduling employs a cosine annealing strategy, synchronized with the total training epochs to ensure smooth decay to a minimum of 10⁻⁵ [44]. The loss function combines binary cross-entropy and Dice loss to synergistically optimize pixel-level classification accuracy while alleviating foreground–background class imbalance issues.
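The learning-rate trajectory implied by these settings can be sketched with the standard closed-form cosine annealing rule. This is an illustrative assumption about the exact schedule: the function name `cosine_lr` is hypothetical, and the sketch assumes one schedule step per epoch with no warm-up or restarts.

```python
import math


def cosine_lr(epoch, total_epochs=200, lr_max=1e-3, lr_min=1e-5):
    """Closed-form cosine annealing from lr_max (epoch 0) to lr_min (final epoch)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cosine)


# Learning rate at the start, midpoint, and end of training
schedule = {epoch: cosine_lr(epoch) for epoch in (0, 100, 200)}
```

The rate starts at the initial value of 0.001, passes through the midpoint of the two bounds halfway through training, and reaches the stated minimum at epoch 200.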
3.4. Loss Function
The semantic segmentation task for farmland shelterbelts requires distinguishing shelterbelt pixels from background regions at the pixel level; however, in practical applications, it faces data distribution challenges, where background pixels (e.g., crops, soil) dominate the image, while shelterbelt pixels are scarce and exhibit narrow, elongated distributions. Such data characteristics readily cause models to overemphasize background classes during training, resulting in insufficient recognition of shelterbelts and ultimately compromising segmentation performance. To address this issue, this study adopts a composite loss function that combines binary cross-entropy (BCE) loss (Equation (2)) and Dice loss (Equation (3)), with equal weights of 1 for both [45,46].
Here, w_b and w_d denote the weights of the BCE loss and the Dice loss, respectively; N denotes the total number of pixels; y_i ∈ {0, 1} is the ground-truth label for pixel i (1 for shelterbelt, 0 for background); p_i ∈ [0, 1] is the model’s predicted probability for shelterbelts; and ϵ is a small smoothing constant (typically 1 × 10⁻⁵) to prevent division by zero and ensure numerical stability of the loss function during training. In all experiments, we set w_b = w_d = 1 and fixed ϵ = 1 × 10⁻⁵.
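A minimal sketch of a weighted BCE plus Dice loss, written with the symbols defined above, may help make the combination concrete. It is an assumption about the exact form rather than the authors' implementation: the smoothing constant is placed in both the numerator and denominator of the Dice term, probabilities are clamped before the logarithm, and plain Python lists stand in for the tensors that a PyTorch training loop would use. The function name `bce_dice_loss` is hypothetical.

```python
import math


def bce_dice_loss(probs, labels, w_b=1.0, w_d=1.0, eps=1e-5):
    """Weighted sum of binary cross-entropy and Dice loss over N pixels.

    probs  -- predicted shelterbelt probabilities p_i in [0, 1]
    labels -- ground-truth labels y_i in {0, 1}
    """
    n = len(probs)
    # Clamp probabilities so that log() never receives 0 (assumed safeguard)
    clipped = [min(max(p, 1e-7), 1 - 1e-7) for p in probs]
    bce = -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for p, y in zip(clipped, labels)
    ) / n
    # Soft Dice: overlap of predicted probabilities with the label mask
    intersection = sum(p * y for p, y in zip(probs, labels))
    dice = 1 - (2 * intersection + eps) / (sum(probs) + sum(labels) + eps)
    return w_b * bce + w_d * dice


labels = [1, 0, 0, 1]
loss_uncertain = bce_dice_loss([0.5, 0.5, 0.5, 0.5], labels)
loss_perfect = bce_dice_loss([1.0, 0.0, 0.0, 1.0], labels)
```

The BCE term penalizes every pixel equally, while the Dice term depends only on foreground overlap, so the sum keeps a useful gradient even when shelterbelt pixels are a small fraction of the patch.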
3.5. Evaluation Metrics
Model performance is evaluated using the Dice Similarity Coefficient (DSC), Intersection over Union (IoU), Precision, and Sensitivity (Recall). These metrics are derived from the four fundamental elements of the confusion matrix: true positives (TP: correctly identified shelterbelt pixels), false positives (FP: background pixels misclassified as shelterbelts), true negatives (TN: correctly identified background pixels), and false negatives (FN: missed shelterbelt pixels).
Given the research emphasis on accurately identifying farmland shelterbelts as the foreground target, overall pixel accuracy, defined as (TP + TN)/(TP + TN + FP + FN), is not adopted as the primary evaluation metric. In farmland images, background pixels vastly outnumber shelterbelt pixels; thus, even if a model’s shelterbelt recognition is weak, high background classification accuracy can yield inflated overall accuracy scores, masking true target recognition performance and failing to reflect segmentation quality [47,48]. DSC, by contrast, excludes true negatives and measures only the overlap between predicted and annotated shelterbelt pixels, so it remains informative when foreground targets (e.g., shelterbelts) constitute a small proportion of the data. Therefore, this study designates DSC as the core evaluation metric to assess the overall consistency between predictions and ground-truth annotations.
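The four metrics and the pixel-accuracy caveat can be illustrated with a short sketch using the standard confusion-matrix definitions. The function name and the pixel counts below are hypothetical and chosen only to show the effect of class imbalance on a single 256 × 256 patch.

```python
def segmentation_metrics(tp, fp, tn, fn):
    """Standard binary segmentation metrics from confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "dsc": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "pixel_accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }


# Hypothetical patch: 3000 shelterbelt pixels out of 65,536,
# of which the model finds only half and adds 500 false positives.
metrics = segmentation_metrics(tp=1500, fp=500, tn=62036, fn=1500)
```

In this constructed case the pixel accuracy stays near 97% although half of the shelterbelt pixels are missed, whereas DSC drops to 0.60, which is the behavior that motivates using DSC as the core metric.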
3.6. Data Partitioning and Validation Strategy
To systematically evaluate model architectures and ensure unbiased performance estimation, this study adopts a rigorous three-stage validation strategy with strict data isolation and zero information leakage (Figure 3).
First, an independent test set was created by applying stratified random sampling by acquisition month (June, July, and August), so that the test set retained a proportional representation of different phenological conditions and vegetation interference levels. The remaining samples formed the training/validation pool. Stage one: Four-fold cross-validation was conducted exclusively within the training/validation pool to quantify each architecture’s sensitivity to data partitioning variability and to assess robustness, rather than to tune hyperparameters. Cross-validation was performed at the patch level using the 256 × 256 samples. Importantly, because the dataset was generated with a stride equal to the window size, the resulting patches were non-overlapping, and each patch was assigned to a single fold. Therefore, no pixel- or patch-level overlap existed between folds, preventing information leakage during cross-validation. Stage two: Based on the cross-validation results, final models were retrained for each architecture using the full training/validation pool, which was randomly split into 80% for training and 20% for internal validation. Training was run for 200 epochs, and the model weights corresponding to the epoch with the highest validation DSC were selected as the optimal weights for each architecture. Stage three: The resulting final models were evaluated in a one-shot, blind manner on the previously isolated independent test set, reporting their generalization performance. The independent test set was not used in any stage of training, cross-validation, validation, or model selection. This workflow enforces a rigid sequence of isolation-first, training-second, and testing-last to block information leakage and selection bias, thereby maximizing the fairness of model comparisons and the reliability of conclusions.
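The isolation-first partitioning can be sketched as follows. Several details are assumptions made for illustration: the 20% test fraction is inferred from the test-set size of 48 out of 240 patches, each month is assumed to contribute 80 patches (20 blocks of 4 patches), and the seed, function name, and fold-assignment rule are hypothetical.

```python
import random


def stratified_split(sample_months, test_fraction=0.2, n_folds=4, seed=42):
    """Hold out a month-stratified test set, then assign the rest to folds."""
    rng = random.Random(seed)
    by_month = {}
    for index, month in enumerate(sample_months):
        by_month.setdefault(month, []).append(index)

    test, pool = [], []
    for month in sorted(by_month):
        indices = by_month[month][:]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])   # isolated before any training
        pool.extend(indices[n_test:])   # training/validation pool

    rng.shuffle(pool)
    # Each pooled patch lands in exactly one fold
    folds = [pool[k::n_folds] for k in range(n_folds)]
    return test, pool, folds


months = ["June"] * 80 + ["July"] * 80 + ["August"] * 80
test, pool, folds = stratified_split(months)
```

Under these assumptions the split yields 48 test patches and a pool of 192 patches divided into four folds of 48, with the test indices fixed before cross-validation begins.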
4. Results
4.1. Cross-Validation Assessment of Model Generalization and Stability
To systematically evaluate the generalization capability and stability of deep learning models in the farmland shelterbelt extraction task, four-fold cross-validation was performed on the six models. This approach maximizes the use of limited data and assesses stability by examining performance fluctuations across different data subsets. Quantitative results from cross-validation are presented in Table 2. All models exhibited strong competitiveness, with average DSC values exceeding 87%, confirming the substantial potential of deep learning in this domain. However, clear performance differences emerged among the models.
CNN-based models (U-Net, AttU_Net, ResU-Net, U2-Net) demonstrated overall superior performance compared to Transformer-based architectures (SwinUNet, TransUNet). The top four models in average DSC rankings were all CNN-based, with scores between 90.16% and 91.03%. In contrast, Transformer-based models performed relatively weaker, with SwinUNet achieving the lowest average DSC (87.34%). This preliminarily indicates that, for the current task and data scale, the inductive biases of CNNs (e.g., translation invariance, locality) may offer greater advantages than the global attention mechanisms of Transformers.
Among the high-performing CNN models, U2-Net achieved the highest average DSC (91.03%) and Sensitivity (91.51%), indicating strong capability in identifying shelterbelt pixels. U-Net’s average DSC (90.93%) was close to that of U2-Net, while also featuring the highest average Precision (92.06%), meaning that the smallest share of its predicted shelterbelt pixels were false positives. AttU_Net, incorporating attention mechanisms, and ResU-Net, with residual connections, also yielded competitive results, with average DSC values of 90.89% and 90.16%, respectively.
U2-Net and U-Net formed the performance ceiling of this test, with their average DSC values being extremely close (91.03% vs. 90.93%). However, while maintaining top-tier accuracy, U2-Net sacrificed stability, exhibiting the highest standard deviations in DSC and IoU (±1.50%, ±2.59%) among all models.
For a scientific stability comparison, models must be evaluated at equivalent performance levels, with Figure 4 providing intuitive evidence. In the high-performance region (DSC > 89.1%), U-Net’s data points were more concentrated, indicating that it delivers more reliable outputs while sustaining top-tier performance. In contrast, U2-Net’s data points spanned a greater vertical range in this region, corroborating its higher variability.
ResU-Net possessed the lowest DSC standard deviation (±1.26%), reflecting high consistency in its outputs. However, this superior stability came at the cost of sacrificing nearly 1 percentage point in average DSC (90.16%). In other words, ResU-Net performs “stably well,” whereas U-Net achieves “stably excellent” performance. For accuracy-critical applications, among the top performance tier (U-Net, U2-Net, AttU_Net), U-Net achieves the optimal balance of stability and accuracy.
4.2. Independent Test Set Validation of Final Model Performance
Cross-validation unveiled the intrinsic characteristics of the models, while a fully independent test set provided the final, unbiased performance estimates. This section first confirms healthy convergence of all models in the final training phase, then reports their performance on the independent test set, and rigorously delineates differences via statistical tests.
Figure 5 illustrates the loss variation curves for each model during training.
Although some models (e.g., TransUNet, SwinUNet) exhibited relatively higher validation losses, their training losses still trended downward, indicating learning capability. Since this study focuses on the potential performance of model architectures rather than extreme optimization, models without severe overfitting or training failures are deemed valid, with their loss curves serving as a basis for subsequent performance comparisons. Under the unified experimental framework, the convergence and stability of the loss curves suffice to support quantitative comparisons and architectural potential analyses, without requiring individual model fine-tuning.
Based on confirmed model convergence, performance on the independent test set is shown in Table 3.
The results show that U-Net achieved the best overall performance, with the highest DSC (91.45%) and IoU (84.68%). U2-Net’s slight average advantage in cross-validation did not translate to leadership on the independent test set. This phenomenon aligns with the stability analysis in Section 4.1, where U-Net’s robust learning strategy enables superior performance on novel, unseen data compared to the “high-potential but highly variable” U2-Net.
Table 4 summarizes the statistical significance of pairwise performance differences among the six models on the final test set (n = 48). Depending on the normality of paired differences assessed by the Shapiro–Wilk test, either a paired-sample t-test or a Wilcoxon signed-rank test was applied.
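The test-selection rule described above can be sketched with SciPy. The normality threshold of 0.05, the function name, and the synthetic per-image DSC values are assumptions for illustration only; they are not the study's data or its exact decision rule.

```python
import random

from scipy import stats


def paired_comparison(scores_a, scores_b, alpha=0.05):
    """Choose a paired test based on normality of the paired differences."""
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    p_normality = stats.shapiro(differences).pvalue
    if p_normality > alpha:
        # Differences look normal: paired-sample t-test
        name = "paired t-test"
        p_value = stats.ttest_rel(scores_a, scores_b).pvalue
    else:
        # Normality rejected: non-parametric Wilcoxon signed-rank test
        name = "Wilcoxon signed-rank test"
        p_value = stats.wilcoxon(scores_a, scores_b).pvalue
    return name, float(p_value)


# Synthetic per-image DSC values for two hypothetical models (n = 48)
rng = random.Random(0)
model_a = [0.90 + rng.uniform(-0.03, 0.03) for _ in range(48)]
model_b = [a - 0.02 + rng.uniform(-0.01, 0.01) for a in model_a]
test_name, p_value = paired_comparison(model_a, model_b)
```

Both tests operate on the same 48 test images evaluated by two models, so the comparison is paired per image rather than between independent samples.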
The performance difference between U-Net and U2-Net was not statistically significant (Wilcoxon signed-rank test, p = 0.566), indicating that these two models belong to the same performance tier from a statistical perspective. Similarly, the comparison between SwinUNet and TransUNet did not show a significant difference (Wilcoxon signed-rank test, p = 0.915).
In contrast, performance differences between CNN-based models and Transformer-based models were consistently and highly significant (p < 0.001 in all corresponding comparisons), revealing a clear performance gap between the two model categories. These results demonstrate that CNN-based architectures outperform Transformer-based models for the farmland shelterbelt extraction task under the evaluated conditions.
Overall, CNN-based models demonstrated superior performance in the farmland shelterbelt extraction task, with U-Net offering the most balanced trade-off between accuracy and stability, making it the most competitive model for this application.
4.3. Analysis of Computational Efficiency and Practical Training Costs
For models addressing practical problems, computational efficiency is as crucial as accuracy. This section evaluates, from a practical deployment perspective, each model’s parameter count, optimal training epochs, training time to peak accuracy, and single-image inference speed (Table 5). Here, the “Best epoch” refers to the epoch at which the trained model achieved the highest DSC on the internal validation set, consistent with the model selection procedure in Section 3.6. The model parameters corresponding to this epoch were used for reporting computational efficiency metrics.
The data reveal the direct computational costs of architectural choices, with U-Net demonstrating clear efficiency advantages, as its parameter count (4.32 M) is the smallest among all models. This structural lightweighting directly translates to the shortest total training time (14.19 min) and fastest inference speed (6.8 ms/image) among CNN-based models, indicating that U-Net’s simplicity is not a limitation but a significant advantage, enabling rapid experimentation and high-responsiveness deployment. In contrast, U2-Net incurs extremely high computational costs. Its large parameter count (44.01 M) results in the longest total training time (50.77 min); more critically, its single-image inference time (43.4 ms) is 6.4 times that of U-Net. When processing large-scale areas, this efficiency gap will be dramatically amplified, posing severe application bottlenecks.
SwinUNet’s performance is noteworthy: despite its high parameter count (41.34 M), it reached its optimal epoch (epoch 106) the fastest, with the shortest total training time (9.23 min) among all models. This suggests that the Transformer-based SwinUNet, leveraging its window-based self-attention mechanism, may follow an efficient optimization path, allowing it to find good solutions faster than the CNNs. However, this training advantage is offset by its lower accuracy (cross-validation average DSC: 87.34%) and slower inference speed compared to U-Net.
To comprehensively assess the trade-off between model accuracy and runtime speed, a comparison of DSC scores against single-image inference times was plotted (Figure 6).
U-Net occupies the optimal region, being the only model to achieve both top-tier accuracy and ultra-low latency simultaneously. SwinUNet and TransUNet offer relatively high efficiency but insufficient accuracy, whereas U2-Net excels in accuracy but suffers from extremely low efficiency. This efficiency analysis, combined with prior findings on accuracy and stability, positions U-Net as the model that combines top-tier accuracy and stability with low computational cost, making it the preferred choice for academic research and large-scale operational mapping of farmland shelterbelts.
4.4. Qualitative Visualization and Typical Case Comparisons
Qualitative analysis intuitively reveals model behaviors in challenging scenarios, uncovering strengths and weaknesses that quantitative metrics may obscure. Accordingly, Figure 7 presents six representative test cases selected from the independent test set, including three high-DSC samples (a, b, c) and three low-DSC samples (d, e, f), to visually compare the segmentation results of all models against the ground-truth annotations.
These test cases correspond to six typical and challenging scenarios commonly encountered in farmland shelterbelt extraction: (a) interspersed fruit tree rows and shelterbelts, (b) fragmented and discontinuous shelterbelts, (c) sparse vegetation disturbance at desert margins, (d) mixed orchard–crop plantings, (e) agricultural channels adjacent to narrow shelterbelts, and (f) riverbanks densely covered by reeds. Together, these scenarios span a wide range of background complexities and interference conditions, providing a comprehensive basis for qualitative comparison of different model behaviors.
In scenarios (a) and (b), all models performed well, successfully identifying and precisely segmenting the main body of the shelterbelts. Compared to other models, U-Net and U2-Net generated smoother, more precise boundary contours and effectively captured complex structural features and fine details. Transformer-based models also exhibited acceptable performance in these relatively simple scenarios, demonstrating the strong capability of deep learning models in handling spectral interferences from orchards and crops, as well as fragmented shelterbelt distributions.
In scenario (c), all models showed mild over-segmentation, primarily manifesting as misidentifying dense understory shrubs as shelterbelts, resulting in extracted shelterbelt widths slightly broader than ground-truth labels. This error likely stems from spectral similarities in vegetation canopies and insufficient model comprehension of spatial contextual relationships. AttU_Net produced noticeable salt-and-pepper noise in this scenario, reflecting higher sensitivity of its attention mechanism to fragmented vegetation at desert edges.
In the challenging scenarios (d, e, f) representing the highest extraction difficulty, the limitations of each model were amplified. These “hard samples” typically feature blurred ground object boundaries, strong background interferences, or extreme target morphologies. In scenario (d), AttU_Net, ResU-Net, and TransUNet exhibited land cover confusion, misidentifying orchard-to-farmland transition zones as shelterbelts, revealing limitations in finely distinguishing different arbor vegetation types. Conversely, U-Net and U2-Net, leveraging their superior local feature extraction capabilities, maintained the highest segmentation accuracy in this scenario. Extraction results in scenarios (e) (agricultural ditches and narrow shelterbelts) and (f) (reed-infested riverbank areas) further highlighted model deficiencies in morphological discrimination. All models misclassified linear hydrological features (ditches, river channels) as shelterbelts, stemming from high spectral and morphological similarities between vegetation-covered linear objects and true shelterbelts. Notably, in scenario (f), U-Net, ResU-Net, and U2-Net failed to detect narrow shelterbelts in the upper-left region, exposing the need for improved sensitivity to weak signals and small target features. Deeper analysis indicates that such errors are not only constrained by model architecture but also closely tied to insufficient extreme cases in training samples, suggesting the need for enhanced coverage and annotation of such challenging samples in future data strategies.
4.5. Large-Scale Regional Application and Belt-Level Assessment
4.5.1. Large-Scale Application and Belt-Level Assessment in the Study Area
To evaluate the practical applicability of the best-performing model, U-Net was applied to large-scale farmland shelterbelt mapping in the primary study area, Alar City, Xinjiang. The input data consisted of Gaofen-6 satellite imagery mosaics covering more than 6000 km². The imagery was cropped into multiple 256 × 256 pixel tiles and processed using the final trained U-Net model. Urban roadside trees were excluded through an urban mask, and the predicted tiles were seamlessly mosaicked to generate a regional-scale farmland shelterbelt distribution map (Figure 8).
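The tile-and-mosaic step can be sketched as below. The text does not state how image edges were handled, so shifting the last tile inward to keep every tile full-size is an assumption, as are the function names and the stand-in `predict_tile` callable that takes the place of the trained U-Net. Nested lists are used instead of raster arrays to keep the sketch dependency-free.

```python
def tile_starts(extent, tile):
    """Start offsets of tiles covering one image dimension."""
    if extent <= tile:
        return [0]  # smaller than one tile; a real model would need padding
    starts = list(range(0, extent - tile + 1, tile))
    if starts[-1] + tile < extent:
        starts.append(extent - tile)  # shift the last tile inward (assumed)
    return starts


def predict_mosaic(image, predict_tile, tile=256):
    """Run a per-tile predictor and write each mask back into a full-size map."""
    height, width = len(image), len(image[0])
    mosaic = [[0] * width for _ in range(height)]
    for r in tile_starts(height, tile):
        for c in tile_starts(width, tile):
            patch = [row[c:c + tile] for row in image[r:r + tile]]
            mask = predict_tile(patch)
            for dr, mask_row in enumerate(mask):
                mosaic[r + dr][c:c + len(mask_row)] = mask_row
    return mosaic


# Toy example: a 6 x 5 "image" and a thresholding stand-in for the model
image = [[(r + c) % 2 for c in range(5)] for r in range(6)]
threshold = lambda patch: [[1 if v > 0 else 0 for v in row] for row in patch]
mosaic = predict_mosaic(image, threshold, tile=4)
```

With a per-pixel stand-in predictor the mosaic reproduces the input exactly, which confirms that every pixel is covered and written back at the correct position; with a real network, overlapping edge tiles would simply overwrite earlier predictions in the shared strip.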
The resulting map effectively captured the spatial structure, continuity, and connectivity of the farmland shelterbelt network. Visual inspection indicates a high degree of consistency between the extracted shelterbelts and the corresponding satellite imagery. Enlarged local views further demonstrate the model’s capability to delineate individual shelterbelts and their intersections at the regional scale.
At this large-application scale, validation was conducted using an application-oriented, belt-level visual interpretation approach, rather than pixel-wise quantitative segmentation metrics. Specifically, 500 random points were generated across the study area using ArcGIS Pro. For each point, the nearest shelterbelt was selected as a validation sample, yielding 385 effective shelterbelt samples. Manual interpretation was performed using high-resolution Google Earth imagery as reference data. The spatial distribution of the validation samples is shown in Figure 9.
Based on this belt-level assessment, the U-Net model achieved an overall shelterbelt extraction accuracy of approximately 95.58% within Alar City. The observed commission and omission error patterns were generally consistent with those identified in
Section 3.4. It should be emphasized that this accuracy reflects object-level correctness of shelterbelt detection and delineation, serving to evaluate the feasibility and reliability of large-scale automated mapping, rather than strict pixel-wise segmentation accuracy.
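As a worked check of the belt-level figure, the short sketch below treats accuracy as the share of validated belts judged correctly extracted and searches for the integer counts that round to the reported percentage. The function name is hypothetical, and the resulting count of correct belts is inferred from the reported numbers rather than stated in the text.

```python
def belt_accuracy(correct: int, total: int) -> float:
    """Share of validated belts judged correctly extracted, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


TOTAL_BELTS = 385   # effective validation samples reported in the text
REPORTED = 95.58    # reported belt-level accuracy in percent

# Integer counts of correct belts that round to the reported percentage.
consistent = [k for k in range(TOTAL_BELTS + 1)
              if round(belt_accuracy(k, TOTAL_BELTS), 2) == REPORTED]
```

Only one count satisfies the condition, which would correspond to 17 of the 385 sampled belts containing commission or omission errors under this reading.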
4.5.2. Qualitative Cross-Regional Case Demonstration
To further examine the potential transferability of the trained model beyond the primary study area, qualitative cross-regional demonstrations were conducted in Aksu City and Shaya County (
Figure 10). These regions are located in the oasis–desert transition zone along the northern edge of the Tarim Basin and share broadly similar climatic and agro-ecological conditions with Alar City, while exhibiting differences in farmland configuration and shelterbelt morphology.
It should be noted that no quantitative accuracy assessment or sample-based validation was conducted in these two regions. The results are presented solely as qualitative demonstrations to illustrate typical model behavior under cross-regional application scenarios.
The results show that the U-Net model successfully extracted the major structural patterns of farmland shelterbelts in both regions, effectively delineating the macroscopic shelterbelt networks. However, several limitations associated with out-of-distribution data were observed. In Aksu City, urban green belts were occasionally misclassified as farmland shelterbelts, primarily due to the absence of urban green space samples as negative classes in the training data. In Shaya County, extremely narrow shelterbelts (approximately 2 m in width, corresponding to about one pixel in the imagery) were frequently omitted, reflecting insufficient small-target perception caused by the scarcity of such ultra-narrow shelterbelt samples in the training dataset.
These observations highlight that the generalization performance of deep learning-based models is strongly dependent on the representativeness and completeness of training data. Overall, the large-scale application in Alar City, combined with qualitative cross-regional demonstrations, indicates that the proposed method is a feasible and effective solution for regional-scale farmland shelterbelt mapping, while also clarifying the current boundaries of its applicability and providing directions for future work involving more rigorous cross-regional quantitative validation.
5. Discussion
5.1. Comparison with Related Studies
5.1.1. Comparison with Global-Scale Land Use/Land Cover (LULC) Products
Global land cover products such as GlobeLand30 [
49], the FROM-GLC series [
50], GLC_FCS30/GLC_FCS30D, and the recently released ESA WorldCover provide long-term, multi-temporal land cover information at spatial resolutions ranging from 30 m to 10 m. These datasets play an essential role in large-scale land cover monitoring and ecological assessment, where forest or tree cover is typically represented as a patch-based class. However, they are not specifically designed to capture farmland shelterbelts, which are small-scale ecological elements characterized by narrow widths, elongated geometries, and close interspersion with croplands.
From a spatial resolution perspective, pixels of 30 m—and even 10 m—often exceed the width of farmland shelterbelts, leading to inevitable mixing of shelterbelts with surrounding farmland within single pixels. This results in smoothing effects, blocky representations and systematic boundary displacement, particularly for narrow and strip-like features. Consequently, while global LULC products are effective for macro-scale forest cover mapping, they remain insufficient for boundary-level, high-precision representation and structural analysis of farmland shelterbelt networks.
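The mixed-pixel argument can be made concrete with a small calculation. Assuming a straight belt aligned with the pixel grid, the largest share of a square pixel that the belt can occupy is the ratio of belt width to pixel size, capped at one. The belt widths below are illustrative values, not measurements from the study, and the function name is hypothetical.

```python
def max_belt_fraction(belt_width_m: float, pixel_size_m: float) -> float:
    """Upper bound on the pixel area covered by a grid-aligned belt."""
    if belt_width_m <= 0 or pixel_size_m <= 0:
        raise ValueError("widths must be positive")
    return min(1.0, belt_width_m / pixel_size_m)


# Illustrative belt widths (m) against the pixel sizes discussed in the text.
for width in (2.0, 6.0, 12.0):
    row = ", ".join(
        f"{pixel:>4.0f} m pixel: {max_belt_fraction(width, pixel):.2f}"
        for pixel in (30.0, 10.0, 2.0)
    )
    print(f"belt {width:>4.0f} m -> {row}")
```

Under these assumptions, a 6 m belt fills at most a fifth of a 30 m pixel, so the pixel spectrum is dominated by the surrounding cropland, whereas at 2 m resolution the same belt spans several pure pixels.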
In contrast, this study employs 2 m resolution Gaofen-6 fused imagery combined with pixel-level semantic segmentation, enabling discrimination of farmland shelterbelts from surrounding crops, orchards, and natural vegetation at the single-pixel scale. This fine-grained mapping approach allows accurate restoration of shelterbelt orientation, width, and connectivity, providing a structured representation tailored to farmland ecological infrastructure. In application contexts emphasizing ecosystem services, agricultural landscape connectivity, and green infrastructure management, such detailed shelterbelt extraction offers clear advantages over global-scale LULC products.
5.1.2. Comparison with Linear Farmland Shelterbelt Extraction Methods
Earlier approaches to farmland shelterbelt extraction predominantly relied on handcrafted features and rule-based strategies. For example, Xing et al. (2016) combined vegetation indices, mathematical morphology, and object-based analysis to extract shelterbelt skeletons [
51], but remained sensitive to structural variations and intersection complexity. Li et al. (2024) integrated spectral, textural, and vegetation index features with random forest classifiers, improving extraction accuracy while retaining a strong dependence on feature engineering [
21]. Zhang et al. (2024) enhanced vegetation indices using phenological information and validated performance across multiple resolutions, yet the method remained sensitive to threshold settings and prior knowledge [
52]. Deng et al. (2023) focused on repairing fragmented shelterbelts through belt-oriented post-processing, improving continuity but still facing difficulties under complex spectral mixing conditions [
18].
In contrast, the deep learning framework adopted in this study enables end-to-end learning of multi-level spectral–spatial representations, reducing reliance on manually designed features. CNN-based semantic segmentation models demonstrate stronger semantic consistency and structural integrity when extracting shelterbelts in complex environments such as orchards, dense crop fields, and desert margins. Compared with traditional approaches, these models offer improved automation, robustness, and scalability, providing a more reliable pathway for high-precision and large-area farmland shelterbelt mapping.
5.2. Structural Characteristics and Task Adaptability of Deep Learning Models
The experimental results consistently indicate that CNN-based models outperform Transformer-based architectures in farmland shelterbelt extraction. This outcome directly supports the task-oriented considerations articulated in the Introduction, where farmland shelterbelts were characterized as narrow, elongated features with strong local boundary cues that place high demands on precise local feature preservation.
Across both cross-validation and independent testing, CNN-based models exhibit higher performance stability under sample-limited and high-interference conditions. In contrast, Transformer-based models show reduced sensitivity to fine-scale linear structures, which is reflected in lower segmentation consistency and increased variability. These findings suggest that, for narrow-width linear targets embedded in heterogeneous agricultural landscapes, architectural inductive biases favoring locality-aware feature modeling play a more critical role than global context aggregation.
Similar observations have been reported in recent reviews of remote sensing semantic segmentation, which note that pure Transformer architectures often require larger training datasets and stronger regularization strategies to achieve performance comparable to or exceeding that of CNN-based models in high-resolution tasks with strict spatial structural constraints [
49]. The present results provide task-specific empirical evidence for this conclusion in the context of farmland shelterbelt extraction.
Among the evaluated CNN architectures, U-Net achieves the most balanced performance in terms of accuracy, stability, and computational efficiency. Its encoder–decoder structure with skip connections enables effective integration of deep semantic information and shallow boundary details, allowing continuous and coherent delineation of shelterbelts under complex background conditions. Enhanced CNN variants, such as U2-Net, AttU_Net, and ResU-Net, further improve feature representation and connectivity in specific scenarios; however, these gains are often accompanied by increased parameter complexity and computational cost, without consistently outperforming the standard U-Net under medium-scale data conditions.
5.3. Current Limitations and Future Research Directions
Although this study systematically evaluated deep learning models for precise farmland shelterbelt extraction, several limitations remain. First, semantic discrimination in environments with mixed features remains inadequate, particularly for linear features that closely resemble shelterbelts in both spectrum and morphology (e.g., shrub-covered ditches, riverbank reed belts, and roadside trees). These misclassifications arise because optical imagery cannot fully represent differences in canopy structure, and local convolutional features are prone to confusion under similar textures. Second, small shelterbelt targets (ultra-narrow, fractured, or sparsely canopied) face a high omission risk under the dual constraints of image resolution and network architecture: their widths often span only 1–2 pixels, so their signal is easily weakened during downsampling, convolutional smoothing, or skip-connection fusion. Third, although this study employs a combined BCE–Dice loss to mitigate class imbalance, small targets may still be systematically overlooked when the natural class distribution is extremely skewed. Finally, while the cross-regional demonstrations in Aksu City and Shaya County suggest some generalization potential, the pronounced variation in geomorphology, vegetation structure, water resources, and management practices across Xinjiang’s oasis–desert transition zones means that model robustness across larger spatial scales, ecological zones, crop belts, and phenological periods still requires more systematic validation.
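To illustrate why a combined BCE–Dice loss helps with class imbalance, the following is a minimal plain-Python sketch of such a loss over a flattened tile. The equal weighting (`bce_weight=0.5`), the smoothing constant, and the function name are assumptions for illustration; the weighting used in the study may differ.

```python
import math
from typing import Sequence


def bce_dice_loss(probs: Sequence[float], targets: Sequence[int],
                  bce_weight: float = 0.5, eps: float = 1e-7) -> float:
    """Weighted sum of binary cross-entropy and soft Dice loss."""
    n = len(probs)
    # Binary cross-entropy averaged over pixels (probabilities are clamped).
    bce = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        bce -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    bce /= n
    # Soft Dice loss: 1 - 2|P.T| / (|P| + |T|), computed on probabilities.
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)
    return bce_weight * bce + (1.0 - bce_weight) * dice


# One shelterbelt pixel among 100: a heavily imbalanced toy tile.
targets = [1] + [0] * 99
missed = [0.01] * 100            # belt pixel predicted as background
found = [0.99] + [0.01] * 99     # belt pixel predicted correctly

loss_missed = bce_dice_loss(missed, targets)
loss_found = bce_dice_loss(found, targets)
bce_only_missed = bce_dice_loss(missed, targets, bce_weight=1.0)
```

In this toy case the BCE term alone stays small when the single belt pixel is missed, because the 99 background pixels dominate the average, while the Dice term penalizes the miss heavily. The combined loss therefore separates the two predictions more clearly than BCE alone.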
To address these challenges, future research should advance in four coordinated directions: expanding the data foundation, deepening the use of multi-source information, enhancing cross-regional generalization, and moving toward ecological applications. Progress on these fronts is essential for taking intelligent shelterbelt recognition from case-level studies to nationally scalable applications. In parallel, integrating multimodal image fusion, structure-aware networks, time-series modeling, and self-supervised or domain-adaptive learning offers promising pathways for models to discern three-dimensional structural differences and maintain stable cross-regional performance. Furthermore, shelterbelt remote sensing research should not stop at geometric extraction but should be coupled with ecological function models, feeding attributes such as belt width, continuity, and tree height structure into assessments of wind erosion protection, carbon sinks, landscape connectivity, and farmland climate regulation. This would move the field from spatial identification toward ecological process quantification and agricultural management decision support. Through integrated advances in data, models, and applications, deep learning-driven extraction of farmland shelterbelts can demonstrate its scientific value and application potential at larger scales.
6. Conclusions
This study systematically evaluated the comprehensive performance of six mainstream deep learning models in the precise extraction task for farmland shelterbelts. Through four-fold cross-validation, independent test set evaluation, computational efficiency analysis, and multi-scenario qualitative comparisons, the following key conclusions were drawn:
Deep learning models perform strongly in farmland shelterbelt extraction. All evaluated models achieved average DSC values exceeding 87% in cross-validation, with the best DSC reaching 91.45% on the independent test set. These results indicate that deep learning can achieve high-precision shelterbelt identification and segmentation from remote sensing imagery, providing a solid technical foundation for automated, large-scale shelterbelt monitoring.
Model architecture exerts a decisive influence on performance. In the current task, CNN-based models (e.g., U-Net, U2-Net) outperform Transformer-based models (e.g., SwinUNet, TransUNet) in extraction accuracy and result stability, and the differences are statistically significant. This indicates that, for medium-scale datasets and shelterbelt targets with strong local spatial features, CNN inductive biases (including local connectivity and translation invariance) offer greater advantages than the global attention mechanisms of Transformers.
The U-Net model achieves the best balance among accuracy, stability, and computational efficiency, making it the most practical solution for this task. In terms of accuracy and stability, U-Net delivered top-tier, stable performance in both cross-validation and on the independent test set, surpassing the more variable U2-Net in stability. In computational efficiency, U-Net attained an inference speed of 6.8 ms/image, with training and deployment costs substantially lower than those of other high-performance models. In qualitative performance, U-Net generated the smoothest and most precise boundary segmentation results in complex scenarios, adapting well to fragmented shelterbelts, orchard interference, and other challenges.
In summary, by establishing a comprehensive evaluation framework, this study empirically validates the effectiveness of deep learning techniques, particularly CNN architectures represented by U-Net, for high-precision farmland shelterbelt extraction. The findings provide technical support and empirical evidence for precision agriculture planning, ecological environment monitoring, and forest resource management.