1. Introduction
Forests represent one of the most vital terrestrial ecosystems on Earth, covering approximately 31% of the planet’s land surface. These ecosystems provide a range of services indispensable for human well-being, including carbon sequestration, oxygen release, water conservation, biodiversity maintenance, regional climate regulation, and erosion prevention [1,2,3]. In the arid and semi-arid regions of northwest China, particularly Xinjiang, forest lands play a pivotal role in stabilising fragile ecosystems. They have been shown to reduce wind and sand activity (e.g., reducing sandy weather days by approximately 28% and sandstorm days by up to 69% in restored areas), impede desert expansion, protect the agricultural and pastoral production base of oases, and provide critical habitats for rare wildlife [4,5]. However, the forest lands of Xinjiang have long been subject to climatic and anthropogenic pressures, including climate warming, changing precipitation patterns, overgrazing, and unsustainable development, resulting in shrinking area and functional degradation [6,7]. Accurate extraction and dynamic monitoring of forest land extent and health status have thus become critical measures for forest conservation in Xinjiang. In recent years, remote sensing has emerged as a core tool for extracting forest land information, monitoring degradation, and assessing conservation effectiveness. Its strengths in multispectral coverage, high spatio-temporal resolution, and long-term records provide essential data support for formulating forest ecological restoration strategies and ensuring sustainable forest land development [8].
Conventional forest land extraction predominantly relies on supervised classification techniques such as maximum likelihood, support vector machines, and random forests, on object-oriented classification approaches, and on vegetation index thresholding (e.g., NDVI and EVI threshold segmentation) [9,10]. These methods depend heavily on spectral characteristics and vegetation index differences, and they perform best in environments with uniform vegetation cover and minimal seasonal variation. However, in the complex terrain of Xinjiang, where sparse desert forests are interspersed with shrubs, seasonal snow cover, cloud shadow interference, and oasis farmlands whose spectral signatures resemble those of forested areas, these methods frequently produce severe misclassification and omission, compromising the accuracy of the results [11,12]. In recent years, with the rapid adoption of deep learning in remote sensing, deep learning-based forest extraction has become a research hotspot. Compared with conventional methods, deep learning models can automatically learn complex nonlinear vegetation features at multiple levels, including texture, shape, and context, which gives them a clear advantage in highly heterogeneous landscapes [13,14]. Current mainstream approaches are based on convolutional neural networks (CNNs), such as U-Net and the DeepLab series, which have achieved high segmentation accuracy (e.g., mIoU of 78%–91%, OA of 92%–96%, and F1-scores of 87%–95% in studies from 2023 to 2025) in forest segmentation of high-resolution optical imagery [15,16]. However, most existing deep learning models train and run inference using only the three RGB channels, primarily because mainstream backbone networks such as ResNet and Swin Transformer have been extensively pre-trained on natural RGB image datasets, so direct transfer learning substantially improves training accuracy and efficiency. Conversely, incorporating multispectral band data substantially increases computational demands, complicates data preprocessing, and exacerbates the scarcity of annotated samples. These factors impose significant constraints on the comprehensive use of multispectral information in deep learning-based forest land extraction [17,18]. Consequently, evaluating forest extraction performance across diverse spectral band combinations, together with constructing customised deep learning datasets tailored to these combinations, is imperative for advancing forest extraction from contemporary remote sensing imagery [19].
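As an illustration of the band-combination strategy discussed above, the following sketch shows how three Sentinel-2 bands can be stacked into a three-channel composite that an RGB-pretrained backbone can consume. This is not taken from the study's pipeline; the function name and the 2–98 percentile stretch are our assumptions.

```python
import numpy as np

def make_triband_composite(bands: dict, combo: tuple) -> np.ndarray:
    """Stack three Sentinel-2 bands (e.g. ("B4", "B8", "B12")) into an
    H x W x 3 array, rescaling each band to 0-255 with a 2-98 percentile
    stretch so it can feed an RGB-pretrained backbone.
    Illustrative sketch only; the stretch choice is an assumption."""
    channels = []
    for name in combo:
        band = bands[name].astype(np.float64)
        lo, hi = np.percentile(band, (2, 98))
        scaled = np.clip((band - lo) / max(hi - lo, 1e-9), 0.0, 1.0)
        channels.append((scaled * 255).astype(np.uint8))
    return np.stack(channels, axis=-1)

# Synthetic 4x4 reflectance tiles standing in for real Sentinel-2 bands.
rng = np.random.default_rng(0)
bands = {name: rng.integers(0, 10000, size=(4, 4)) for name in ("B4", "B8", "B12")}
composite = make_triband_composite(bands, ("B4", "B8", "B12"))
print(composite.shape)  # (4, 4, 3)
```

Any of the four combinations studied here (B2 + B3 + B4, B3 + B4 + B8, B7 + B8 + B11, B4 + B8 + B12) can be passed as `combo` without changing the rest of a training pipeline.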
In recent years, significant progress has been made in forest land extraction using Sentinel-2 imagery combined with machine learning or deep learning. In relatively homogeneous vegetation areas such as European temperate forests and Asian mountainous regions, Huang et al. achieved an overall accuracy (OA) of 90.4% when mapping plantation tree species using multi-temporal Sentinel-2 data [20]. Nasiri et al. employed multi-temporal Sentinel-2 imagery and machine learning for mountain forest species classification, achieving 85%–92% accuracy in complex landscapes [21]; Vorovencii et al. achieved a Kappa coefficient of 0.85 in mountain forest species classification using Sentinel-1/2 multi-temporal imagery combined with machine learning [20,22]. These studies predominantly rely on multi-spectral bands to augment random forests, CNNs, or deep learning models, demonstrating good performance in areas with uniform vegetation cover. However, these methods still face challenges in the arid and semi-arid regions of Xinjiang [20,23]: weak spectral signals from sparsely distributed vegetation, amplified spectral confusion from terrain and cloud shadows, and accuracy declines to 70%–85% across seasons under extreme climatic conditions. To address these challenges, a growing number of scholars are exploring the integration of different spectral band combinations with various deep learning models for Xinjiang's mountain-oasis-desert composite ecosystem. By enhancing shortwave infrared information and suppressing interference, this line of work aims to improve accuracy metrics, strengthen generalization in arid regions, and lay the groundwork for further investigation of band combination methods [4].
Previous studies have demonstrated that different tri-band combinations affect the accuracy of forest land extraction. Among these, the natural true-color combination of blue, green, and red (B2 + B3 + B4) has been widely adopted as a baseline for visualization and preliminary classification. For instance, Immitzer et al. [24] and Grabska et al. [25] prioritized this combination when classifying Central European tree species with Sentinel-2 data and identifying forest types across multiple time periods, respectively, to enhance the visual distinction between vegetation and non-vegetation. Persson et al. [26] also selected this combination for boreal forest species mapping, demonstrating its effectiveness under dense canopy cover. Beyond the blue-green-red combination, other band combinations exhibit distinct characteristics and advantages. The green-red-near-infrared (B3 + B4 + B8) false-color combination is widely applied for vegetation health monitoring and forest land extraction. Bolyn et al. [27], classifying European temperate forest types with a random forest model on Sentinel-2 imagery, compared multiple combinations (blue, green, red; red, near-infrared, shortwave infrared 1; and red-edge enhanced combinations) and found that the green, red, near-infrared combination incorporating near-infrared and red-edge information performed best at separating forest land from cropland. In forest land detection research, Fassnacht et al. [28] compared multiple vegetation indices and band combinations, concluding that NDVI combined with near-infrared (B8) was superior. Axelsson et al. [29] employed two combinations, red edge, near-infrared, and shortwave infrared (B7 + B8 + B11) and red, near-infrared, and shortwave infrared 2 (B4 + B8 + B12), for manual forest land extraction from multispectral imagery, validating the applicability of shortwave infrared for distinguishing forest moisture content and structural variations. To enhance the accuracy of forest land extraction, Grabska et al. [30] proposed a methodology integrating multi-temporal red-edge and shortwave infrared combinations; their experiments showed that this dynamic selection approach outperformed a fixed strategy using the blue, green, and red combination. Concurrently, Wessel et al. [31] and Persson et al. [26] explored the application of red, near-infrared, shortwave infrared, and red-edge bands in forest biomass estimation and canopy cover assessment, respectively.
Literature-based sensitivity analyses indicate that results are highly sensitive to band selection. According to Qarallah et al., replacing richer band combinations with purely visible-light combinations (e.g., B2 + B3 + B4) during deep learning model training may cause accuracy declines of 4%–8% [32]. Numbisi et al. found that introducing alternative combinations such as B2 + B4 + B8 may amplify spectral confusion, increasing misclassification rates by approximately 5%–7% in arid regions [33]. Shafri et al. [34] explored improving training accuracy by increasing the number of input bands; however, expanding beyond four bands introduced more spectral information but increased computational overhead by 15%–25%, with only limited accuracy gains (mIoU improved by just 0.8–1.5 percentage points), and channel redundancy tended to induce overfitting and impair generalization [5]. Overall, the optimal combination identified in this study demonstrates low sensitivity and high stability in arid regions, and its scalability can be further validated through multimodal fusion in future research. In summary, the impact of different band combinations on forest land extraction performance, and the determination of an optimal three-band combination, warrant further investigation.
Given the limitations of traditional methods (such as threshold indices or classical machine learning) in forest land extraction, specifically their poor adaptability to complex topography and heterogeneous forests, and the limited research on how different Sentinel-2 band combinations affect deep learning approaches, this study investigates how various Sentinel-2 band combinations influence the performance of different deep learning models in forest land extraction. The specific objectives are: (1) to construct a multi-spectral dataset tailored to mountainous forested areas in Xinjiang that fully reflects the region’s forest type diversity, seasonal variations, and terrain shadow interference; (2) to compare extraction performance across different band combinations (true color B2 + B3 + B4, false color B3 + B4 + B8, red-edge enhanced B7 + B8 + B11, and shortwave infrared enhanced B4 + B8 + B12) and identify the optimal three-band combination that maximizes model accuracy and robustness in separating forest from grassland, cropland, and bare soil. Through these objectives, this study provides a data foundation and methodological support for high-precision forest land information extraction. By integrating band combination optimization strategies with diverse deep learning models, it not only offers a new pathway to address spectral confusion between forest land and similar vegetation but also establishes theoretical and practical foundations for further studies in related fields.
4. Results
4.1. Visualization of Sample Band Combinations
To visually compare the segmentation performance of different deep learning models in forest land extraction tasks, this study selected a representative sample, a 512 × 512-pixel remote sensing image, from the test dataset. High-precision forest land reference labels were created through manual interpretation. Figure 7 and Figure 8 display the raw imagery of this sample under the four typical multispectral band combinations, (B2, B3, B4), (B3, B4, B8), (B7, B8, B11), and (B4, B8, B12), alongside segmentation results from three mainstream semantic segmentation models: DeepLabV3+, PSPNet, and FCN. Red areas indicate pixels classified as forest by the models, while black areas represent non-forest background. The different spectral band combinations clearly affect segmentation accuracy, and the three models exhibit distinct differences in detail preservation, boundary integrity, and misclassification suppression. Visual comparison of the models’ outputs against the original images and the labels shows that FCN outperforms DeepLabV3+ and PSPNet in forest extraction.
4.2. Overall Model Accuracy Assessment
Figure 9 shows the convergence curves of training loss versus iteration steps for the DeepLabV3+, PSPNet, and FCN models across the four band combinations. Overall, the losses of all three models decrease steadily during training and eventually converge, indicating a stable and effective training process. Among the models, FCN performs best, achieving the lowest final loss values (0.3–0.4). DeepLabV3+ and PSPNet show relatively similar behavior, with final loss values exceeding 0.6, reflecting comparatively weaker feature learning and optimization capabilities. At the band level, the (B4, B8, B12) combination achieved the fastest convergence and the lowest loss values across all three models, indicating that the combination of red, near-infrared, and shortwave infrared information is most conducive to the effective expression of forest land features and class separation.
Forest extraction networks based on DeepLabV3+, PSPNet, and FCN all demonstrated relatively strong performance on the validation dataset.
Table 2 presents the evaluation results for models trained on five training datasets. Overall, the network achieved high performance metrics including overall pixel accuracy (aAcc), mean intersection over union (mIoU), mean category recall (mRecall), mean category F1 score (mFscore), and mean category precision (mPrecision), demonstrating its effectiveness for forest land extraction tasks across diverse regions of Xinjiang.
Based on the accuracy metrics across different band combinations and models, the three deep learning models demonstrated comparable overall performance in forest land extraction within the Xinjiang study area. Their aAcc (overall accuracy) consistently exceeded 97.5%, indicating that the combination of visible and near/shortwave infrared multispectral bands is sufficient for achieving high-precision forest land segmentation. However, significant differences exist among models in their recognition capabilities for the specific forest land category: The FCN model leads in most metrics, particularly achieving the highest values across the three key indicators—mIoU, mFscore, and mPrecision. This indicates that FCN provides more precise forest land localization with fewer omissions and misclassifications, delivering the most stable and optimal overall extraction performance. DeepLabV3+ and PSPNet achieved the second-highest overall accuracy, with mIoU generally around 87%, indicating slightly weaker segmentation capabilities for forest edges and small patches compared to FCN.
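The per-class metrics reported above follow standard definitions. The sketch below is illustrative code, not the study's evaluation pipeline; note that the m-prefixed values in Table 2 additionally average the forest and background classes, whereas this sketch computes the forest-class values only.

```python
import numpy as np

def binary_seg_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Standard pixel metrics for a binary forest mask (1 = forest).
    aAcc is overall pixel accuracy; IoU, precision, recall, and F-score
    are computed for the forest class."""
    tp = np.sum((pred == 1) & (gt == 1))  # forest correctly detected
    fp = np.sum((pred == 1) & (gt == 0))  # non-forest labelled forest
    fn = np.sum((pred == 0) & (gt == 1))  # forest missed
    tn = np.sum((pred == 0) & (gt == 0))  # background correctly rejected
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "aAcc": (tp + tn) / (tp + fp + fn + tn),
        "IoU": tp / (tp + fp + fn),
        "Precision": precision,
        "Recall": recall,
        "Fscore": 2 * precision * recall / (precision + recall),
    }

pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
m = binary_seg_metrics(pred, gt)
print(round(m["aAcc"], 2), round(m["IoU"], 2))  # 0.75 0.5
```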
Regarding optimal band combinations, those incorporating shortwave infrared bands (B11 or B12) outperformed those using only the three visible bands (B2, B3, B4). All models achieved their highest or second-highest scores on the classic “agriculture/vegetation” false-color combination of B4, B8, and B12; notably, FCN attained the overall best values of 89.45% mIoU and 94.23% mFscore. The (B7, B8, B11) combination followed closely, achieving the highest mPrecision (94.97%) with FCN and also performing strongly with the other models. These two combinations fully exploit the high reflectance of healthy vegetation in the near-infrared (B8) and its strong absorption in the shortwave infrared (B11/B12), enabling better differentiation between forested areas and surrounding bare soil, grasslands, and farmlands.
4.3. Analysis of Model Test Results
To investigate the applicability and limitations of the FCN model in forest land extraction tasks within complex alpine canyon regions, this study specifically selected the (B4, B8, B12) band combination for targeted analysis of large contiguous forest areas and small patchy forest areas. This spectral combination—encompassing red light (B4), near-infrared (B8), and shortwave infrared (B12)—effectively highlights spectral differences in moisture content and cellular structure between vegetation and non-vegetation. It has been validated through prior quantitative assessments and loss convergence analyses as the optimal spectral configuration for all three models. Selecting this optimal band combination as input helps eliminate interference from insufficient spectral information on model performance. This approach more accurately reveals the inherent capability differences within the FCN’s own network structure for large-scale and small-scale forest identification tasks, providing reliable basis for subsequent model improvements and band-model synergistic optimization strategies.
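The moisture sensitivity that motivates including a SWIR band can be illustrated with a normalized-difference moisture index contrasting NIR and SWIR reflectance. This index is not computed in the study itself, and the reflectance values below are invented for illustration only.

```python
import numpy as np

def ndmi(nir: np.ndarray, swir: np.ndarray) -> np.ndarray:
    """Normalized-difference moisture index (NIR vs. SWIR), a common
    proxy for canopy water content. Shown only to illustrate why the
    B8/B12 pairing separates moist canopies from dry ground."""
    nir = nir.astype(np.float64)
    swir = swir.astype(np.float64)
    return (nir - swir) / np.maximum(nir + swir, 1e-9)

# Invented reflectances: moist canopy has high NIR and low SWIR;
# dry bare soil keeps SWIR reflectance high.
forest = ndmi(np.array([4000.0]), np.array([1500.0]))
bare = ndmi(np.array([2500.0]), np.array([3000.0]))
print(float(forest[0]) > float(bare[0]))  # True
```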
4.3.1. FCN Large Forest Area Extraction Analysis
To validate the model’s effectiveness and generalization capability in extracting complex forest areas in arid regions, this study specifically selected representative imagery of typical coniferous forests in the mid-to-high mountain zone of the northern Tianshan Mountains in Xinjiang, along with representative imagery from the forest belt of the Altai Mountains as validation samples. These areas encompass highly heterogeneous mountain landscapes, terrain shadow interference, and challenging scenarios such as forest-grass transition zones. By integrating multispectral remote sensing imagery from Sentinel-2 satellite’s optimal band combination (B4, B8, and B12), the forestland extraction performance of the FCN model was evaluated. Visual interpretation served as the benchmark for comparative analysis, assessing the model’s segmentation accuracy in small-scale fragmented forestlands relative to large contiguous forest areas.
Figure 10 displays the segmentation results of large contiguous forest areas by the FCN model using the spectral band combination (B4, B8, B12). It can be observed that the FCN generally captures the main contours and spatial distribution patterns of large-scale forest areas effectively. The red regions largely align with the actual forest boundaries, indicating the model’s strong recognition capability in large, homogeneous forest areas with distinct multispectral features. However, close-up details within the white-boxed areas in the third row reveal that FCN exhibits noticeable misclassification and underclassification in forest interior regions with relatively complex textures or those affected by shading and sparse canopy cover. Specifically, (1) patchy black voids appear within forest areas; (2) jagged fractures and excessive smoothing occur along some forest edges, resulting in incomplete boundaries; (3) minor non-forest areas are erroneously included within forest boundaries. These misclassifications primarily stem from FCN’s shallow feature extraction architecture and lack of effective multi-scale contextual fusion mechanisms. This results in insufficient perception of local spectral heterogeneity and fine-grained edge information. Consequently, while FCN maintains overall morphological consistency in large-scale forest extraction, it exhibits deficiencies in detail fidelity and noise robustness.
4.3.2. Extraction and Analysis of FCN Small Forest Areas
Small forest patches often exhibit fragmented, small-scale distribution patterns, making them susceptible to interference from terrain shadows, sparse canopy cover, and surrounding heterogeneous backgrounds. Compared to large contiguous forest areas, their smaller spatial scale and highly dispersed distribution result in inadequate feature representation in remote sensing imagery. When extracting small forest patches, models may exhibit oversegmentation, shape distortion, and misclassification errors due to limited receptive fields in neural network structures and insufficient multiscale context fusion. Therefore, the following discussion will focus on evaluating the model’s performance in extracting small forest patches.
Figure 11 illustrates the performance of the FCN model in extracting small, patchy forest areas under the band combination (B4, B8, B12). Compared to large contiguous forest areas, the FCN model shows reduced performance in identifying small-scale, fragmented forest patches, with segmentation results exhibiting noticeable fragmentation and incompleteness. The local details highlighted by the white box in the third row clearly reveal the following errors: (1) Complete segmentation omissions occur, manifesting as large black backgrounds where actual forest areas should be (segmentation omission errors); (2) detected forest patches exhibit shape distortion and area underestimation, with rough boundaries and internal fractures or voids; (3) non-forest areas (e.g., alpine meadows or shadow zones) are misclassified as forest, forming isolated red spots. These phenomena primarily stem from FCN’s shallow network architecture and simplistic upsampling strategy. This results in limited receptive fields and insufficient fusion of multi-scale contextual information. When confronted with small forest patches exhibiting weak spectral responses, blurred boundaries, and highly dispersed spatial distribution, its feature representation capability rapidly degrades, making it difficult to effectively capture fine-grained semantic details and precise edge information.
4.4. Comparison Analysis of Forest Extraction Results with ESA and Dynamic World Datasets
To further evaluate the forest extraction performance of FCN, selected areas were used to compare FCN’s extraction results with forest classification data from the ESA and Dynamic World datasets.
Figure 12 displays the comparison of forest land extraction results based on the fully convolutional network (FCN) with two mainstream global land cover products within the study area (81°40′ E–82°00′ E, 43°20′ N–43°30′ N).
In terms of spatial distribution patterns, the FCN model and ESA WorldCover exhibit consistent performance in delineating forest boundaries. Both effectively capture the intricate texture of intermingled coniferous forests and mountain meadows densely distributed across shaded mountain slopes and river valleys, featuring smooth edges with minimal patch fragmentation. They successfully suppress misclassifications of bare rock, shaded areas, and alpine grasslands. In contrast, Dynamic World exhibits a significant overestimation of forest areas in the same region. This is particularly evident in high-altitude exposed ridges and seasonally snow-covered zones, where large contiguous areas are erroneously classified as red, resulting in an overly continuous and expanded forest distribution pattern. This does not align with the actual sparse vegetation landscape features observed in true-color imagery.
Table 3 quantitatively compares the intersection-over-union (IoU) and the proportion of intersecting areas among the forest land extraction results from FCN, ESA WorldCover, and Dynamic World across the entire study-area test set, further validating the conclusions of the visual interpretation. FCN and ESA WorldCover exhibited the highest IoU (70.87%), with the intersection area accounting for 87.40% of ESA’s extraction results and 78.94% of FCN’s, indicating superior spatial consistency compared with the other pairings. The IoU between FCN and Dynamic World was 62.98%, while that between ESA and Dynamic World was only 57.94%. This reflects a systematic overestimation by Dynamic World in this high-altitude mountainous environment, leading to a marked reduction in overlap with both the physically based and the deep learning-based products. The jointly identified forested area (intersection area: 1327.47 km²) accounted for only 55.41% of Dynamic World’s extracted area, yet represented 76.12% and 84.27% of FCN’s and ESA’s respective areas, further quantifying Dynamic World’s misclassification in bare-ground and shaded regions.
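The quantities in Table 3, the IoU and the share of each product's mapped area covered by the intersection, can be reproduced for any pair of binary masks with a sketch like the following. This is illustrative code; the default pixel area (10 m Sentinel-2 pixels) is our assumption.

```python
import numpy as np

def mask_agreement(a: np.ndarray, b: np.ndarray, pixel_area_km2: float = 0.0001) -> dict:
    """IoU between two binary forest masks, plus the share of each
    product's mapped area covered by their intersection (the layout of
    Table 3). Default pixel area assumes 10 m Sentinel-2 pixels."""
    inter = np.sum((a == 1) & (b == 1))
    union = np.sum((a == 1) | (b == 1))
    return {
        "IoU": inter / union,
        "inter_share_a": inter / np.sum(a == 1),   # e.g. share of FCN's area
        "inter_share_b": inter / np.sum(b == 1),   # e.g. share of ESA's area
        "inter_area_km2": inter * pixel_area_km2,
    }

# Toy 2x3 masks standing in for two forest products.
fcn = np.array([[1, 1, 0], [1, 0, 0]])
esa = np.array([[1, 1, 1], [0, 0, 0]])
stats = mask_agreement(fcn, esa)
print(round(stats["IoU"], 2))  # 0.5
```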
Through comprehensive qualitative and quantitative evaluation, the FCN model proposed in this paper demonstrates the highest spatial consistency with the authoritative ESA WorldCover product in complex mountainous environments, outperforming the near-real-time Dynamic World product. It exhibits distinct advantages, particularly in suppressing forest overestimation and preserving edge details. These results demonstrate that for diverse complex terrains, deep learning-driven regional optimization models deliver higher-precision forest land information compared to globally generalized products, providing more reliable spatial baseline data for subsequent ecological monitoring and carbon sink accounting.
4.5. Comparison of Small Patch Discretized Forest Extraction Results with ESA and Dynamic World Datasets
This study investigates the effectiveness of deep learning models in extracting forest areas in arid regions of Xinjiang using Sentinel-2 satellite imagery, examining different spectral band combinations. To evaluate the model’s capability in extracting small-patch fragmented woodlands, an additional experiment was conducted in a typical fragmented woodland area in Xinjiang. The FCN model was applied to extract woodlands using B4, B8, and B12 band combinations. The spatial distribution results were compared with woodland classification products from ESA and Dynamic World to assess the model’s adaptability and accuracy in complex terrain and sparse woodland scenarios. Located in the arid-semiarid zone of Xinjiang, this area features surface elements including rivers, farmlands, and scattered forest patches, making it susceptible to spectral confusion.
In Figure 13, the true-color image in panel (a) depicts the actual surface landscape of the study area, including green vegetation belts, rivers, and surrounding desert. Panel (b) shows the forest binary mask extracted by the FCN model, where red pixels represent forest. These are sparsely distributed, primarily concentrated in densely vegetated non-agricultural areas, but underestimation occurs: the omission of small, fragmented forest patches leads to an overall undercount of forest area. In contrast, panel (c) presents the forest extraction results from the ESA product; the numerous scattered red dots indicate over-classification, with non-forest vegetation (such as grasslands or farmlands) misclassified as forest. Panel (d) shows the extraction results from Dynamic World; the red areas exhibit linear distributions, capturing forests along riverbanks, but the product shows weak recognition of small, scattered patches, also resulting in some omission. Overall, the FCN model tends to underestimate forest land under this spectral combination, indicating that it requires further optimization for small-patch forest monitoring.
5. Discussion
5.1. Exploring the Mechanisms Driving Performance Differences in Deep Learning-Based Forest Extraction Across Different Band Combinations
This study investigates the impact of different Sentinel-2 band combinations on the performance of deep learning models such as DeepLabV3+, PSPNet, and FCN in extracting forest areas in arid regions of Xinjiang. Results indicate that combinations incorporating shortwave infrared bands (e.g., B4 + B8 + B12 and B7 + B8 + B11) outperform the true-color combination (B2 + B3 + B4), which uses only visible bands, achieving higher aAcc, mIoU, and mFscore. This improvement stems primarily from the sensitivity of shortwave infrared bands to vegetation moisture content and structural variations, which effectively mitigates spectral confusion between forested areas and grasslands, farmlands, or bare soil and boosts model robustness, particularly in complex mountain-oasis landscapes. Among the models, FCN demonstrated outstanding performance across all band combinations, achieving a maximum mIoU of 89.45%. This advantage stems from its skip-connection architecture, which excels at restoring boundary details, whereas DeepLabV3+ and PSPNet showed slight deficiencies in capturing multi-scale context. These findings provide empirical evidence for optimizing multispectral remote sensing data inputs, confirming the critical role of band selection in enhancing the accuracy of deep learning-based forest land segmentation.
Furthermore, the incorporation of red-edge bands into the spectral combination enhances the model’s ability to identify canopy health status. This is because its response to chlorophyll absorption peaks helps capture differences in vegetation photosynthesis, thereby reducing edge blurring errors in the transition zone between alpine sparse forests and mountain shrublands. In contrast, while the B3 + B4 + B8 combination enhances near-infrared reflectance to highlight vegetation coverage, it exhibits lower sensitivity to degraded forest areas under water stress, making it susceptible to seasonal drought interference during poplar forest extraction. Experiments revealed that deep learning networks automatically extract complementary features between these bands through convolutional layers. For instance, under B4 + B8 + B12, the model’s attention mechanism prioritizes multi-scale spectral gradient fusion, thereby optimizing suppression of terrain shadows and cloud shadows. This overcomes the generalization bottleneck of traditional RGB inputs under complex illumination conditions. This mechanism difference stems not only from intrinsic correlations in spectral physical properties but also reflects the indirect influence of sample diversity in dataset construction on network parameter optimization, pointing the way forward for subsequent multimodal remote sensing fusion research.
5.2. Adaptability and Limitations of Deep Learning Models in Complex Forest Landscapes of Arid Regions
In highly heterogeneous forest environments of arid regions, deep learning models demonstrate certain structural advantages that enable them to maintain relatively stable recognition capabilities in scenarios with subtle spectral differences and significant background interference. Particularly in areas where coniferous forests, desert riparian forests, and alpine shrublands intermingle, the model can capture subtle variations in forest texture, canopy morphology, and surface structure through multi-layer convolutional or spatial pooling structures. This compensates for the limitations of traditional methods relying solely on spectral information in complex backgrounds. By integrating multispectral inputs, deep learning models extract more discriminative spatial features when addressing typical disturbances such as terrain shadows, seasonal snow cover, and forest-agricultural mosaics. This enables consistent characterization of large contiguous forest areas. These findings indicate that deep learning architectures demonstrate good adaptability to forest types with stable spatial organization, making them highly practical for ecological monitoring and forest resource surveys in arid regions.
At the same time, the limitations of deep learning in forest extraction within arid regions cannot be overlooked, particularly its constrained ability to handle extremely complex backgrounds and fragmented forest patches. When vegetation coverage is low, canopy density is sparse, or terrain obstructs visibility, deep learning models often fail to adequately capture the weak spectral responses of forest areas, leading to missed detections of small-scale patches. Furthermore, features such as high-albedo bare rock, shaded areas, and alpine meadows may exhibit spectral and textural similarities to forested areas at local scales. This similarity can lead to misclassification when models lack sufficient contextual information. In summary, while deep learning shows high potential for forest extraction in arid regions, comprehensively enhancing its reliability in complex landscapes requires ongoing exploration in multi-scale feature modeling, training data diversification, and model structure optimization.
5.3. Region-Specificity and Generalizable Components of Deep Learning Model Frameworks
Within this research framework, dataset construction is region-specific. The study primarily utilizes 10 Sentinel-2 scenes covering the Xinjiang region, encompassing typical vegetation types such as mid-elevation coniferous forests on the northern slopes of the Tianshan Mountains, Altai Mountain forests, alpine shrublands in the Kunlun Mountains, and river-valley poplar forests. These samples intentionally incorporate challenges specific to arid regions, such as regional snow cover, terrain shadow interference, and spectral confusion between forest, grassland, and agricultural land. Labeling was performed with ArcGIS Pro, targeting a binary classification task for Xinjiang’s mountain-oasis-desert composite ecosystems. This design addresses Xinjiang’s ecological fragility and forest degradation and cannot be applied directly to humid or tropical regions; new samples must instead be collected based on local vegetation spectra and interference factors. Furthermore, while the model training parameters are general, optimization was constrained to scenes with cloud cover below 6% and to specific seasons. Model extraction results were validated against ESA WorldCover and Dynamic World benchmarks specifically in Xinjiang’s alpine mountain regions, highlighting the suppression of bare-rock and shadow misclassification within the area.
Conversely, the core methodology and technical workflow of the framework are expected to be generalizable, encompassing optimized strategies for four tri-band combinations (B2 + B3 + B4 true color, B3 + B4 + B8 false color, B7 + B8 + B11 red edge enhancement, B4 + B8 + B12 shortwave infrared enhancement). This strategy, grounded in spectral physics, can be extended to other Sentinel-2 datasets with only resolution resampling adjustments required. Deep learning model architectures such as FCN based on HRNet-18s, DeepLabV3+’s ASPP module, PSPNet’s PPM, and related training procedures—including 512 × 512 cropping, 8:2 training-validation segmentation, loss convergence monitoring, and evaluation metrics—are domain standards. This framework can be directly transferred to vegetation extraction tasks in similar arid or semi-arid environments (e.g., Central Asian Gobi, Middle Eastern deserts, or Africa’s Sahel region). Through transfer learning, this framework rapidly adapts to new regions, with expected generalization accuracy exceeding 85%.
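The preprocessing steps named above (512 × 512 cropping and the 8:2 training-validation split) can be sketched as follows. The function is illustrative, not the study's actual code, and the random seed is our assumption.

```python
import numpy as np

def tile_and_split(image: np.ndarray, tile: int = 512,
                   train_frac: float = 0.8, seed: int = 42):
    """Crop an H x W x C scene into non-overlapping tile x tile patches
    and randomly split them 8:2 into training and validation sets.
    Tile size and split ratio follow the paper's settings; the function
    itself is an illustrative sketch."""
    h, w = image.shape[:2]
    patches = [
        image[r:r + tile, c:c + tile]
        for r in range(0, h - tile + 1, tile)
        for c in range(0, w - tile + 1, tile)
    ]
    idx = np.random.default_rng(seed).permutation(len(patches))
    cut = int(len(patches) * train_frac)
    train = [patches[i] for i in idx[:cut]]
    val = [patches[i] for i in idx[cut:]]
    return train, val

# Stand-in for a Sentinel-2 subset: 2048 x 2560 pixels, 3 channels.
scene = np.zeros((2048, 2560, 3), dtype=np.uint8)
train, val = tile_and_split(scene)
print(len(train), len(val))  # 16 4
```

With a 2048 × 2560 scene this yields 20 patches, of which 16 go to training and 4 to validation, matching the 8:2 ratio described above.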