Article

Performance Evaluation of Deep Learning Models for Forest Extraction in Xinjiang Using Different Band Combinations of Sentinel-2 Imagery

1 College of Geography and Remote Sensing Science, Xinjiang University, Urumqi 830017, China
2 College of Surveying and Geo-Informatics, Tongji University, Shanghai 200092, China
3 College of Geography and Environmental Sciences, Zhejiang Normal University, Jinhua 321004, China
* Author to whom correspondence should be addressed.
Forests 2026, 17(1), 88; https://doi.org/10.3390/f17010088
Submission received: 29 November 2025 / Revised: 27 December 2025 / Accepted: 8 January 2026 / Published: 9 January 2026
(This article belongs to the Section Forest Inventory, Modeling and Remote Sensing)

Abstract

Remote sensing provides an efficient approach for monitoring ecosystem dynamics in the arid and semi-arid regions of Xinjiang, yet traditional forest-land extraction methods (e.g., spectral indices, threshold segmentation) show limited adaptability in complex environments affected by terrain shadows, cloud contamination, and spectral confusion with grassland or cropland. To overcome these limitations, this study used three convolutional neural network-based models (FCN, DeepLabV3+, and PSPNet) for accurate forest-land extraction. Four tri-band training datasets were constructed from Sentinel-2 imagery using combinations of visible, red-edge, near-infrared, and shortwave infrared bands. Results show that the FCN model trained with B4–B8–B12 achieves the best performance, with an mIoU of 89.45% and an mFscore of 94.23%. To further assess generalisation in arid landscapes, ESA WorldCover and Dynamic World products were introduced as benchmarks. Comparative analyses of spatial patterns and quantitative metrics demonstrate that the FCN model exhibits robustness and scalability across large areas, confirming its effectiveness for forest-land extraction in arid regions. This study innovatively combines band combination optimization strategies with multiple deep learning models, offering a novel approach to resolving spectral confusion between forest areas and similar vegetation types in heterogeneous arid ecosystems. Its practical significance lies in providing a robust data foundation and methodological support for forest monitoring, ecological restoration, and sustainable land management in Xinjiang and similar regions.

1. Introduction

Forests represent one of the most vital terrestrial ecosystems on Earth, covering approximately 31% of the planet’s land surface. These ecosystems provide a range of services that are indispensable for human well-being, including carbon sequestration, oxygen release, water conservation, biodiversity maintenance, regional climate regulation, and erosion prevention [1,2,3]. In the arid and semi-arid regions of northwest China, particularly Xinjiang, forest lands play a pivotal role in stabilising fragile ecosystems. They have been shown to reduce wind and sand activity (e.g., reducing sandy weather days by approximately 28% and sandstorm days by up to 69% in restored areas), impede desert expansion, protect the agricultural and pastoral production base of oases, and provide critical habitats for rare wildlife [4,5]. However, the forest lands of Xinjiang have long been subject to climatic and anthropogenic pressures, including climate warming, changing precipitation patterns, overgrazing, and unsustainable development, resulting in a reduction in area and functional degradation [6,7]. Accurate extraction and dynamic monitoring of forest land extent and health status have thus become critical measures for forest conservation in Xinjiang. In recent years, remote sensing technology has emerged as a core tool for extracting forest land information, monitoring degradation, and assessing conservation effectiveness. Drawing on its strengths in multispectral data, high spatio-temporal resolution, and long-term temporal coverage, it provides essential data support for the scientific formulation of forest land ecological restoration strategies and the assurance of sustainable forest land development [8].
Conventional forest land extraction predominantly relies on supervised classification techniques, namely maximum likelihood, support vector machines, and random forests, along with object-oriented classification approaches. Additionally, vegetation index thresholding techniques, including NDVI and EVI threshold segmentation, are employed [9,10]. These methods rely heavily on the spectral characteristics and index differences of vegetation, and they perform best in environments characterised by uniform vegetation cover and minimal seasonal variation. However, in the complex terrain of Xinjiang, characterised by sparse desert forests interspersed with shrubs, seasonal snow cover, cloud shadow interference, and oasis farmlands with spectral signatures similar to forested areas, these methods frequently produce severe commission and omission errors, compromising the accuracy of the results [11,12]. In recent years, with the rapid application of deep learning technologies in remote sensing, deep learning-based forest extraction methods have emerged as a research hotspot. Compared with conventional methods, deep learning models can automatically learn complex nonlinear features of vegetation across multiple levels, including texture, shape and context, which gives them significant advantages when processing highly heterogeneous landscapes [13,14]. Current mainstream approaches are based on convolutional neural networks (CNNs), such as U-Net and the DeepLab series, which have achieved high segmentation accuracy (e.g., mIoU of 78%–91%, OA of 92%–96%, and F1-scores of 87%–95% in studies from 2023 to 2025) in forest segmentation of high-resolution optical imagery [15,16]. However, most existing deep learning models are trained and perform inference using only the three RGB channels.
This is primarily because mainstream backbone networks such as ResNet and Swin Transformer have undergone extensive pre-training on natural-image RGB datasets, so direct transfer learning yields substantial gains in training accuracy and efficiency. Conversely, incorporating multispectral band data substantially increases computational resource demands, complicates data preprocessing, and exacerbates the scarcity of annotated samples. These factors impose significant constraints on the comprehensive utilisation of multispectral information in deep learning-based forest land extraction [17,18]. Consequently, evaluating forest extraction performance across diverse spectral band combinations, together with constructing customised deep learning datasets tailored to these combinations, is imperative for advancing forest extraction from contemporary remote sensing imagery [19].
In recent years, significant progress has been made in forest land extraction using Sentinel-2 imagery combined with machine learning or deep learning. In relatively homogeneous vegetation areas such as European temperate forests and Asian mountainous regions, Huang et al. achieved an overall accuracy (OA) of 90.4% when mapping plantation tree species using multi-temporal Sentinel-2 data [20]. Nasiri et al. employed multi-temporal Sentinel-2 imagery and machine learning for mountain forest species classification, achieving 85%–92% accuracy in complex landscapes [21]; Vorovencii et al. achieved a Kappa coefficient of 0.85 in mountain forest species classification using Sentinel-1/2 multi-temporal imagery combined with machine learning [20,22]. These studies predominantly rely on multi-spectral bands to train random forests, CNNs, or other deep learning models, demonstrating good performance in areas with uniform vegetation cover. However, these methods still face challenges in the arid and semi-arid regions of Xinjiang [20,23]: weak spectral signals due to sparse vegetation, amplified spectral confusion from terrain and cloud shadows, and accuracy declines to 70%–85% across seasonal variations under extreme climatic conditions. To address these challenges, an increasing number of scholars are exploring the integration of different spectral band combinations with various deep learning models for the mountain-oasis-desert composite ecosystem of Xinjiang. By enhancing shortwave infrared information and suppressing interference, this approach aims to improve accuracy metrics, enhance generalization in arid regions, and lay the groundwork for further investigations into spectral band combination methods [4].
Previous studies have demonstrated that different tri-band combinations affect the accuracy of forest land extraction. Among these, the natural true-color combination of blue, green, and red light (B2 + B3 + B4) has been widely adopted in multiple studies as a foundational visualization and preliminary classification method. For instance, Immitzer et al. [24] and Grabska et al. [25] prioritized this combination when classifying Central European tree species using Sentinel-2 data and identifying forest types across multiple time periods, respectively, to enhance visual distinctions between vegetation and non-vegetation. Persson et al. [26] also selected this combination for boreal forest species mapping, demonstrating its effectiveness under dense canopy cover. Beyond the blue-green-red combination, different band combinations exhibit distinct characteristics and advantages. The green-red-near-infrared (B3 + B4 + B8) false-color combination is widely applied for vegetation health monitoring and forestland extraction. Bolyn et al. [27] compared multiple band combinations and found that substituting near-infrared for shortwave infrared yielded optimal performance in separating forestland from cropland. When classifying European temperate forest types using a random forest model on Sentinel-2 imagery, Bolyn et al. investigated multiple combinations (blue, green, red; red, near-infrared, shortwave infrared 1; and red-edge enhancement combinations) and found that the green, red, near-infrared combination incorporating near-infrared and red edge performed best. In forest land detection research, Fassnacht et al. [28] compared multiple vegetation indices and band combinations, concluding that NDVI combined with near-infrared (B8) was superior. Axelsson et al. [29] employed two combinations, red edge + near-infrared + shortwave infrared 1 (B7 + B8 + B11) and red + near-infrared + shortwave infrared 2 (B4 + B8 + B12), for manual forest land extraction from multispectral imagery, validating the applicability of shortwave infrared in distinguishing forest land moisture content and structural variations. To enhance the accuracy of forest land extraction, Grabska et al. [30] proposed a methodology integrating multi-temporal red-edge and shortwave infrared combinations; the experimental results demonstrated that this dynamic selection approach outperformed the fixed blue, green, and red combination strategy. Concurrently, Wessel et al. [31] and Persson et al. [26] explored the application of red, near-infrared, shortwave infrared and red-edge bands in forest biomass estimation and canopy cover assessment, respectively.
Sensitivity analyses reported in the literature indicate that results are highly sensitive to band selection. According to Qarallah et al., replacing band combinations with pure visible-light combinations (e.g., B2 + B3 + B4) during deep learning model training may cause accuracy declines of 4%–8% [32]. Numbisi et al. found that introducing alternative combinations such as B2 + B4 + B8 may amplify spectral confusion, increasing misclassification rates by approximately 5%–7% in arid regions [33]. Shafri et al. [34] explored enhancing training accuracy by increasing the number of bands; however, their study showed that expanding beyond four bands introduced more spectral information but increased computational overhead by 15%–25%, with only limited accuracy gains (mIoU improved by just 0.8–1.5 percentage points). Channel redundancy also tended to induce overfitting and impair generalization performance [5]. Overall, the optimal combination identified in this study demonstrates low sensitivity and high stability in arid regions; its scalability can be further validated through multimodal fusion in future research. In summary, the impact of different band combinations on forest land extraction performance, and the determination of optimal three-band combinations, warrant further investigation.
Due to the limitations of traditional methods (such as threshold indices or classical machine learning) in forest land extraction tasks—specifically their poor adaptability to complex topography and heterogeneous forests—and the insufficient research on the impact of different Sentinel-2 band combinations on deep learning approaches, this study aims to investigate how various Sentinel-2 band combinations influence the performance of different deep learning models in forest land extraction tasks. Specific objectives include: (1) Constructing a multi-spectral dataset tailored to mountainous forested areas in Xinjiang to fully reflect the region’s forest type diversity, seasonal variations, and terrain shadow interference; (2) Comparing extraction performance across different band combinations (e.g., true color B2 + B3 + B4, false color B3 + B4 + B8, red-edge enhanced B7 + B8 + B11, and shortwave infrared enhanced B4 + B8 + B12) to identify the optimal three-band combination that maximizes model accuracy and robustness in separating forest from grassland, cropland and bare soil. Through these objectives, this study provides a data foundation and methodological support for high-precision forest land information extraction. By innovatively integrating band combination optimization strategies with diverse deep learning models, this research not only offers new pathways to address spectral confusion between forest land and similar vegetation but also establishes theoretical and practical foundations for further studies in related fields.

2. Study Area and Data

2.1. Overview of the Study Area

This study focuses on the Xinjiang Uygur Autonomous Region (73°40′–96°23′ E, 34°25′–49°10′ N) in China’s arid and semi-arid northwest. Situated in the heart of the Eurasian continent, this region is surrounded by mountains on three sides, with the Tarim Basin and Junggar Basin nestled between them. This configuration forms a typical mountain-oasis-desert composite ecosystem. Annual precipitation is less than 200 mm, while evaporation exceeds 2500 mm, characterizing an extremely arid temperate continental climate [35]. Xinjiang’s natural forest areas primarily comprise mountain coniferous forests, river valley secondary forests, plain desert riparian forests, and alpine meadow transitional woodlands, collectively covering approximately 4.5%–5.5% of the region’s total land area [36,37]. Among these, the mid-to-high mountain zone on the northern slope of the Tianshan Mountains (elevation 1500–2800 m) concentrates over 80% of Xinjiang’s natural coniferous forests, serving as a vital ecological security barrier and water conservation forest area for China’s northwest region. Xinjiang’s unique topography and extreme arid climate conditions have shaped highly heterogeneous spectral characteristics and spatial distribution patterns of forested areas. Forests often exhibit severe spectral confusion with grasslands, shrublands, bare soil, and seasonal snow cover, posing significant challenges for precise remote sensing extraction [38,39]. Therefore, accurate, rapid, and large-scale monitoring of forest land distribution and its dynamic changes in Xinjiang holds significant importance for regional ecological restoration and forest carbon sink accounting. This study, covering the entire Xinjiang region, focuses on the applicability of Sentinel-2 multi-band combinations for extracting forest land information in complex arid areas, demonstrating substantial theoretical value and practical significance.
Figure 1 displays the 10 training areas and 3 testing areas selected within the study region.

2.2. Data Sources

The primary data sources for this study include Sentinel-2 satellite imagery, the ESA WorldCover 10 m dataset [40] released by the European Space Agency (ESA), and the Dynamic World dataset [41].
Sentinel-2 is a core component of the European Space Agency’s Copernicus program. Equipped with a MultiSpectral Instrument (MSI), the satellite provides 13 spectral bands with spatial resolutions of 10 m (blue, green, red, and near-infrared bands), 20 m (red edge, narrowband near-infrared, shortwave infrared, and other bands), and 60 m (three bands for atmospheric correction), and a swath width of 290 km [42]. Table 1 lists the Sentinel-2 band information used in this study, providing technical support for band combination design. This study primarily selected Sentinel-2 L2A-level imagery from 10 different regions in Xinjiang between May and July 2022. Seven bands (B2, B3, B4, B7, B8, B11, B12) were extracted and combined to create the training dataset. When selecting imagery, a cloud-cover threshold of <6% was applied and the data were downloaded from Google Earth Engine (GEE). During preprocessing, SCL masks were used to remove cloud pixels, ensuring that valid pixels exceeded 94%. To comprehensively evaluate the performance of different models in forest land extraction, training was conducted using forest areas of varying sizes. Figure 2 below illustrates the four distinct dataset types formed by combining seven different Sentinel-2 bands.
ESA 10 m land classification data refers to the global 10 m resolution land cover product released by the European Space Agency through the WorldCover project. It is generated using machine learning algorithms and cloud-computing platforms from full-year Sentinel-2 L2A imagery for 2020 and 2021 [40]. This dataset classifies the global land surface into 11 primary categories: tree cover, shrubland, grassland, cropland, built-up, bare/sparse vegetation, snow and ice, permanent water bodies, herbaceous wetland, mangroves, and moss and lichen. With an overall accuracy exceeding 85%, it stands as one of the highest-resolution globally available land cover products currently in the public domain. Dynamic World is a near-real-time global land cover dataset (2015–present) jointly developed by Google, the World Resources Institute, and the National Geographic Society. Based on Sentinel-2 10 m imagery, it employs deep learning models to generate pixel-by-pixel probability-based classifications across nine land cover categories: water, trees, grass, flooded vegetation, crops, shrub & scrub, built, bare, and snow & ice. It is suitable for regional statistics and time-series analysis [41].
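For illustration, deriving a binary forest benchmark from a WorldCover tile reduces to selecting the "Tree cover" class code (code 10 in the WorldCover legend; the toy tile below is an assumption for demonstration, not actual product data):

```python
import numpy as np

# WorldCover encodes land cover as integer class codes; "Tree cover" is
# class 10 in the published legend (other codes below: 30 grassland, 40 cropland).
WORLDCOVER_TREE = 10

def forest_mask(classes: np.ndarray) -> np.ndarray:
    """Binary forest mask (1 = forest, 0 = other) from a WorldCover class raster."""
    return (classes == WORLDCOVER_TREE).astype(np.uint8)

# Toy 3x3 tile mixing tree cover, grassland, and cropland codes.
tile = np.array([[10, 30, 10],
                 [40, 10, 30],
                 [10, 10, 40]])
mask = forest_mask(tile)
print(mask.sum())  # number of forest pixels in the tile
```

The same thresholding applies to Dynamic World by selecting its "trees" class, which is how both products can serve as comparable forest benchmarks.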

3. Materials and Methods

3.1. Training Dataset and Validation Dataset

To ensure robust and generalizable models, this study systematically selected 10-scene Sentinel-2 imagery across the entire Xinjiang region as foundational training data. These images encompass diverse typical forest types, including mid-mountain coniferous forests on the northern slopes of the Tianshan Mountains, alpine sparse forests on the southern slopes, coniferous forests in the Altai Mountains, and alpine shrub transition forests in the Kunlun Mountains. They also include highly challenging scenarios such as seasonal deciduous trees, canopy shadows, terrain occlusions, cloud shadow interference, and forest-agricultural mixed areas. The sample areas encompass diverse altitudinal zones (800–3500 m), slopes (0–45°), aspect orientations, and arid-semi-humid climate gradients, ensuring spectral diversity, textural complexity, and varied background environments. By constructing training samples within such heterogeneous and representative arid forest environments, this study provides a robust data foundation for models to learn stable forest surface representations. This enhances extraction accuracy and generalization performance within complex mountain-oasis-desert composite ecosystems.
This study utilizes Sentinel-2 satellite imagery as training data: the B2 (blue), B3 (green), B4 (red), and B8 (NIR) bands have a spatial resolution of 10 m, while the B11 (SWIR 1), B12 (SWIR 2), and B7 (vegetation red edge, VRE) bands have a spatial resolution of 20 m. To unify the resolution for deep learning training, the study employed nearest-neighbour resampling to bring the three 20 m bands to 10 m. Figure 3 below illustrates the processing workflow for reorganizing the remote sensing image bands. Training labels were annotated using ArcGIS Pro 3.5.4. Its built-in image segmentation and annotation tools support object-based segmentation and pixel-level annotation, enabling rapid generation of polygon samples or pixel-level labels for remote sensing imagery and enhancing the efficiency of deep learning training data preparation [43,44]. The training data were divided into forested and non-forested areas. After annotation, all images were uniformly cropped to 512 × 512 pixels; from these, 3410 images were selected for training and validation and split into a training dataset and a validation dataset at an 8:2 ratio. This study implemented an independent validation strategy. Specifically, 10 scenes were selected from the 13 Sentinel-2 images covering the entire Xinjiang region (May–July 2022) to construct the training/validation dataset. The remaining 3 scenes (blue-boxed areas in Figure 1) served as a completely independent test set, excluded from both training and validation. These test images encompass diverse scenarios such as alpine sparse forests and river valley forests, enabling evaluation on unseen data.
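The nearest-neighbour resampling step above can be sketched in a few lines: for an integer scale factor of 2, each 20 m pixel is simply duplicated into a 2 × 2 block of 10 m pixels (a minimal NumPy sketch with toy reflectance values, not the study's actual preprocessing code):

```python
import numpy as np

def resample_nearest_2x(band20m: np.ndarray) -> np.ndarray:
    """Nearest-neighbour upsampling of a 20 m band onto the 10 m grid.

    Duplicating each pixel into a 2x2 block is exactly what
    nearest-neighbour resampling does for an integer factor of 2.
    """
    return np.repeat(np.repeat(band20m, 2, axis=0), 2, axis=1)

b11 = np.array([[0.21, 0.34],
                [0.18, 0.29]])        # 2x2 pixels at 20 m (toy values)
b11_10m = resample_nearest_2x(b11)   # 4x4 pixels at 10 m
print(b11_10m.shape)  # (4, 4)
```

Nearest-neighbour (rather than bilinear) resampling preserves the original reflectance values, which avoids introducing mixed-pixel values along forest boundaries.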
This study leverages the spectral characteristics of Sentinel-2 imagery and the practical requirements for forest land extraction in arid regions of Xinjiang. This study identifies the most discriminative spectral bands by considering multiple dimensions, including color fidelity, vegetation near-infrared reflectance, sensitivity to vegetation moisture and dead wood, and the response of the red edge to canopy health. This approach maximizes the feature expression capability and classification accuracy of deep learning models for forest land extraction in Xinjiang’s arid zones [45,46]. The study ultimately selected the following four most representative and complementary tri-band combinations to construct training and validation datasets: (1) B2 + B3 + B4 (blue + green + red, natural true color combination), which most closely approximates human visual perception, effectively preserves forest texture and tonal information, and is suitable for highlighting canopy morphology and shadow details; (2) B3 + B4 + B8 (green light + red light + near-infrared, standard color combination), fully leveraging vegetation’s strong reflectance in near-infrared; this is the most stable and widely applied combination for traditional NDVI and preliminary forest land identification; (3) B4 + B8 + B12 (red light + near-infrared + shortwave infrared 2), a classic combination for agricultural and forest monitoring. Shortwave infrared 2 is highly sensitive to vegetation moisture and dead wood, effectively suppressing interference from grasslands, farmlands, and bare soil, and enhancing the ability to distinguish poplar forests in arid regions from degraded forest areas; (4) B7 + B8 + B11 (red edge 3 + near infrared + shortwave infrared 1), a red-edge enhanced combination. The red-edge band exhibits extreme sensitivity to chlorophyll content and canopy health status. 
When combined with Shortwave Infrared 1, it improves identification accuracy for mountain coniferous forests and alpine sparse forests while mitigating spectral confusion in forest-grass transition zones. All four datasets employed identical pixel-level labels, sample distributions, and preprocessing workflows, differing only in input bands. This ensures experimental variations are solely attributable to the spectral combinations themselves. Through systematic comparative experiments on the four combinations, this study aims to reveal the influence mechanism of different spectral information on the feature extraction capability of deep learning models. It identifies the optimal three-band input scheme with the best comprehensive performance in Xinjiang’s complex mountain-oasis-desert ecosystems, providing a basis for optimal band configuration in high-precision, robust forest land remote sensing extraction.
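The four tri-band inputs described in this section can be assembled from per-band arrays as follows (an illustrative NumPy sketch; the array shapes, constant fill values, and the `stack` helper are toy assumptions, not the study's actual data pipeline):

```python
import numpy as np

# Toy single-scene bands, already co-registered on the 10 m grid (H x W);
# each band is filled with a distinct constant purely for demonstration.
H, W = 4, 4
bands = {name: np.full((H, W), i, dtype=np.float32)
         for i, name in enumerate(["B2", "B3", "B4", "B7", "B8", "B11", "B12"])}

# The four tri-band combinations compared in this study.
COMBINATIONS = {
    "true_color":  ("B2", "B3", "B4"),   # natural true color
    "false_color": ("B3", "B4", "B8"),   # standard false color
    "red_edge":    ("B7", "B8", "B11"),  # red-edge enhanced
    "swir":        ("B4", "B8", "B12"),  # shortwave-infrared enhanced
}

def stack(combo: str) -> np.ndarray:
    """Stack three bands channel-first (3, H, W), the usual CNN input layout."""
    return np.stack([bands[b] for b in COMBINATIONS[combo]], axis=0)

x = stack("swir")
print(x.shape)  # (3, 4, 4)
```

Because the four datasets share identical labels and preprocessing, swapping the combination key is the only change between experiments, which isolates the spectral input as the sole experimental variable.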

3.2. Forest Extraction Networks Based on Different Deep Learning Models

DeepLabV3+ is a classic semantic segmentation model proposed by the Google team, whose greatest innovations lie in the Atrous Spatial Pyramid Pooling (ASPP) module and its efficient encoder–decoder architecture [47]. Building upon DeepLabV3, this model further incorporates a lightweight decoder. By concatenating and refining the multi-scale contextual features from ASPP with shallow high-resolution features, it effectively restores boundary details such as forest edges and fine branches, demonstrating effective object localization capabilities in complex mountainous canopy extraction [47,48]. In this study, DeepLabV3+ employs ResNet-50 as its dilated convolutional backbone network, representing a traditional CNN-based model. It serves to compare the sensitivity and performance upper limits of convolutional-dominant architectures across different spectral band combinations for forestland extraction in arid regions of Xinjiang. Figure 4 illustrates the schematic diagram of the DeepLabV3+ network architecture.
Pyramid Scene Parsing Network (PSPNet), proposed by Zhao et al., is renowned for its robust global scene parsing capabilities [49]. Its core lies in the Pyramid Pooling Module (PPM), which concurrently captures multi-scale contextual information through pooling operations at four grid sizes (1 × 1, 2 × 2, 3 × 3, 6 × 6). Subsequent upsampling and fusion effectively mitigate the loss of local features caused by extensive forest-grass transition zones and terrain shadows in arid regions [50]. This model employs a pre-trained ResNet-50 as the feature extraction backbone, supplemented by auxiliary loss supervision during training. It demonstrates significant advantages in comprehending complex forest landscapes against intricate backgrounds [51], making it the chosen baseline model for evaluating the impact of band combinations on multi-scale context awareness in this study. Figure 5 illustrates the PSPNet network architecture.
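The pyramid pooling idea can be illustrated with plain NumPy: each scale averages the feature map over a coarser grid before the results are upsampled and fused (a conceptual sketch of the 1 × 1, 2 × 2, 3 × 3, 6 × 6 grids only, not PSPNet's actual implementation):

```python
import numpy as np

def grid_avg_pool(fmap: np.ndarray, grid: int) -> np.ndarray:
    """Average-pool a 2-D feature map onto a grid x grid output,
    mimicking the adaptive average pooling in PSPNet's pyramid module."""
    h, w = fmap.shape
    out = np.empty((grid, grid))
    for i in range(grid):
        for j in range(grid):
            r0, r1 = i * h // grid, (i + 1) * h // grid
            c0, c1 = j * w // grid, (j + 1) * w // grid
            out[i, j] = fmap[r0:r1, c0:c1].mean()
    return out

fmap = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 feature map
# The four PPM scales: global context (1x1) down to near-local detail (6x6).
pyramid = [grid_avg_pool(fmap, g) for g in (1, 2, 3, 6)]
print([p.shape for p in pyramid])  # [(1, 1), (2, 2), (3, 3), (6, 6)]
```

The coarse 1 × 1 branch summarises the whole scene (useful for separating forest from desert background), while the finer grids retain local context for transition zones.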
The Fully Convolutional Network (FCN), pioneered by Long et al., marked a groundbreaking achievement in semantic segmentation by first enabling end-to-end pixel-level dense predictions and completely eliminating fully connected layers [44]. This model replaces the fully connected layers of a classification network with 1 × 1 convolutions and employs skip connections to progressively fuse shallow high-resolution features with deep semantic features. Multiple variants emerged, including FCN-32s, FCN-16s, and FCN-8s, with FCN-8s demonstrating the strongest boundary recovery capability [44]. Despite its relatively simple structure, FCN established the fundamental paradigm for modern semantic segmentation. Its low computational overhead and ease of implementation have made it a widely adopted performance baseline. This study adopts HRNet-18s as the backbone network for feature extraction to clearly illustrate the performance evolution and gaps between early purely convolutional networks and current architectures in the complex forestland extraction task in Xinjiang. Figure 6 presents a schematic diagram of the FCN model architecture.

3.3. Model Training Configuration

This study trained the models using the MMSegmentation 0.21.1 framework (with PyTorch 1.10.0+cu111) in a Python 3.7.6 environment. Specific hyperparameter configurations are as follows: the initial learning rate was set to 0.01 and dynamically adjusted using a polynomial decay strategy (power = 0.9); AdamW was selected as the optimizer with a weight decay of 0.01 and default beta values (0.9, 0.999); the batch size was set to 8 to accommodate GPU memory constraints; total iterations were capped at 20,000. To ensure model convergence and prevent overfitting, an early-stopping mechanism (patience = 10) was implemented: training ceased when the validation loss failed to decrease for 10 consecutive epochs. Concurrently, training and validation loss curves were monitored in real time and the generalization gap was calculated, requiring the difference between training and validation accuracy to remain below 10%. Furthermore, the cross-entropy loss function was employed during training, and data augmentation techniques including random flipping, scaling, and color jittering were applied to enhance the models’ robustness against spectral variations in Xinjiang’s arid regions. These configurations validated the models’ stability and generalization capabilities on an independent test set, ensuring forest extraction accuracy exceeding 89% across complex terrains.
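The polynomial decay schedule and early-stopping rule described above can be sketched as follows (a minimal framework-free illustration of the stated hyperparameters, not the MMSegmentation internals):

```python
def poly_lr(base_lr: float, it: int, max_iter: int, power: float = 0.9) -> float:
    """Polynomial learning-rate decay with this study's settings
    (base_lr = 0.01, power = 0.9, max_iter = 20,000)."""
    return base_lr * (1.0 - it / max_iter) ** power

class EarlyStopping:
    """Stop once validation loss has failed to improve `patience` checks in a row."""
    def __init__(self, patience: int = 10):
        self.patience, self.best, self.bad = patience, float("inf"), 0

    def step(self, val_loss: float) -> bool:
        if val_loss < self.best:
            self.best, self.bad = val_loss, 0
        else:
            self.bad += 1
        return self.bad >= self.patience  # True -> stop training

print(poly_lr(0.01, 0, 20000))      # 0.01 at iteration 0
print(poly_lr(0.01, 20000, 20000))  # 0.0 at the final iteration
```

The schedule decays smoothly to zero, which is the conventional choice for semantic segmentation training where a long low-learning-rate tail helps refine class boundaries.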

3.4. Evaluation Indicators

To comprehensively evaluate the performance of deep learning models in forest land extraction tasks across different band combinations in arid regions of Xinjiang, this study employs mainstream pixel-level evaluation metrics from the semantic segmentation domain. These include Overall Accuracy (aAcc), mean Intersection over Union (mIoU), mean F-score, mean Precision (mPrecision), and mean Recall (mRecall) [52,53]. Within a binary classification framework (forest areas as positive class, non-forest areas as negative class), these metrics were computed based on the four fundamental elements of the confusion matrix: TP (True Positive, forest pixels correctly predicted as forest), TN (True Negative, non-forest pixels correctly predicted as non-forest), FP (False Positive, non-forest pixels incorrectly predicted as forest), and FN (False Negative, forest pixels incorrectly predicted as non-forest) [54].
(1) Overall Accuracy (aAcc) is the most intuitive global evaluation metric for semantic segmentation tasks, representing the proportion of pixels correctly classified by the model out of all pixels. Its calculation formula is:
$$\mathrm{aAcc} = \frac{TP + TN}{TP + TN + FP + FN}$$
(2) Mean Intersection over Union (mIoU) is the current gold standard metric in semantic segmentation. It is most robust to class imbalance and directly measures the overlap between predicted forest regions and actual forest regions. It is regarded as the core metric that best reflects a model’s true performance. Its calculation formula is:
$$\mathrm{mIoU} = \frac{TP}{TP + FP + FN}$$
(3) Mean Recall (mRecall) represents the average recall rate across all categories, emphasizing the model’s ability to comprehensively detect true forest areas. Its calculation formula is:
$$\mathrm{mRecall} = \frac{TP}{TP + FN}$$
(4) The mean F1-score (mFscore) is the harmonic mean of precision and recall, balancing both the model’s accuracy and comprehensiveness. It is highly correlated with mIoU and equivalent to the Dice coefficient, making it one of the fairest metrics for evaluating overall performance on minority classes. Its formula is:
$$\mathrm{mFscore} = \frac{2TP}{2TP + FP + FN}$$
(5) Mean Precision (mPrecision) represents the proportion of pixels predicted as forest by the model that are genuinely forest, emphasizing the “purity” of the results. Its calculation formula is:
$$\mathrm{mPrecision} = \frac{TP}{TP + FP}$$
The higher this indicator, the lower the proportion of background errors—such as grasslands, bare soil, and farmland—misclassified as forest land. In arid regions of Xinjiang, it is commonly used to diagnose whether models exhibit significant over-extraction, particularly in areas where poplar forests and desert grasslands share similar spectral signatures.
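Under the binary setting above (forest as the positive class), all five metrics reduce to simple ratios of the confusion-matrix counts. The following helper is our own illustrative sketch, mirroring the formulas in this section rather than any specific library implementation:

```python
def forest_metrics(tp, tn, fp, fn):
    """Pixel-level metrics for the forest (positive) class, following
    the confusion-matrix formulas given in Section 3.4."""
    return {
        "aAcc":      (tp + tn) / (tp + tn + fp + fn),
        "IoU":       tp / (tp + fp + fn),
        "Recall":    tp / (tp + fn),
        "Precision": tp / (tp + fp),
        "Fscore":    2 * tp / (2 * tp + fp + fn),  # equals the Dice coefficient
    }


# Hypothetical counts for a 1000-pixel tile with 90 true forest pixels.
m = forest_metrics(tp=80, tn=900, fp=10, fn=10)
print({k: round(v, 4) for k, v in m.items()})
```

Note that IoU is always less than or equal to the F-score for the same counts, which is why the two move together in Table 2 while IoU remains the stricter of the pair.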

4. Results

4.1. Visualization of Sample Band Combinations

To visually compare the segmentation performance of the different deep learning models in the forest land extraction task, this study selected a representative sample from the test dataset: an image tile of 512 × 512 pixels. High-precision forest reference labels were created by manual interpretation. Figure 7 and Figure 8 display the raw imagery of this sample under the four multispectral band combinations—(B2, B3, B4), (B3, B4, B8), (B7, B8, B11), and (B4, B8, B12)—alongside segmentation results from the three semantic segmentation models: DeepLabV3+, PSPNet, and FCN. Red areas indicate pixels classified as forest, while black areas represent non-forest background. Different band combinations clearly affect segmentation accuracy, and the three models differ markedly in detail preservation, boundary integrity, and misclassification suppression. Visual comparison of the model outputs against the original images and the labels shows that FCN outperforms DeepLabV3+ and PSPNet in forest extraction.

4.2. Overall Model Accuracy Assessment

Figure 9 shows the training-loss convergence curves versus iteration steps for the DeepLabV3+, PSPNet, and FCN models across the four band combinations. Overall, the losses of all three models decrease steadily during training and eventually converge, indicating a stable and effective training process. Among the models, FCN performs best, achieving the lowest final loss values (0.3–0.4). DeepLabV3+ and PSPNet behave similarly, with final losses exceeding 0.6, reflecting comparatively weaker feature learning and optimization. At the band level, the (B4, B8, B12) combination achieved the fastest convergence and lowest losses across all three models, indicating that the combination of red, near-infrared, and shortwave infrared information is most conducive to the effective expression of forest-land features and class separation.
Forest extraction networks based on DeepLabV3+, PSPNet, and FCN all performed strongly on the validation dataset. Table 2 presents the evaluation results for models trained on the four training datasets. Overall, the networks achieved high scores across all metrics—overall pixel accuracy (aAcc), mean intersection over union (mIoU), mean recall (mRecall), mean F1 score (mFscore), and mean precision (mPrecision)—demonstrating their effectiveness for forest land extraction across diverse regions of Xinjiang.
Across band combinations and models, the three deep learning models show comparable overall performance in forest land extraction within the Xinjiang study area. Their aAcc consistently exceeded 97.5%, indicating that combinations of visible with near- or shortwave-infrared bands suffice for high-precision forest segmentation. However, the models differ clearly in their ability to recognize the forest class itself: FCN leads on most metrics, achieving the highest values for the three key indicators—mIoU, mFscore, and mPrecision. This indicates that FCN localizes forest land more precisely, with fewer omissions and misclassifications, delivering the most stable and best overall extraction performance. DeepLabV3+ and PSPNet rank second in overall accuracy, with mIoU generally around 87%, indicating slightly weaker segmentation of forest edges and small patches compared with FCN.
Regarding optimal band combinations, those incorporating shortwave infrared bands (B11 or B12) outperformed the visible-only combination (B2, B3, B4). All models achieved their highest or second-highest scores on the classic "agriculture/vegetation" false-color combination of B4, B8, and B12. Notably, FCN attained the overall best values, with an mIoU of 89.45% and an mFscore of 94.23%. The (B7, B8, B11) combination followed closely, achieving the highest mPrecision (94.97%) with FCN and also performing strongly with the other models. These two combinations exploit the high reflectance of healthy vegetation in the near-infrared (B8) and its strong absorption in the shortwave infrared (B11/B12), enabling better separation of forest from surrounding bare soil, grassland, and farmland.

4.3. Analysis of Model Test Results

To investigate the applicability and limitations of the FCN model for forest land extraction in complex alpine canyon regions, this study selected the (B4, B8, B12) band combination for targeted analysis of both large contiguous forest areas and small patchy forest areas. This spectral combination—red (B4), near-infrared (B8), and shortwave infrared (B12)—effectively highlights differences in moisture content and cellular structure between vegetation and non-vegetation, and was validated by the preceding quantitative assessments and loss convergence analyses as the optimal spectral configuration for all three models. Using this optimal combination as input rules out insufficient spectral information as a confounding factor, so that observed differences more accurately reflect the inherent capabilities of the FCN network structure for large-scale and small-scale forest identification, providing a reliable basis for subsequent model improvements and band-model synergistic optimization strategies.
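For illustration, assembling a (B4, B8, B12) tri-band input chip can be sketched as below. This is a schematic on flat Python lists of reflectances; a real pipeline would operate on rasterio/NumPy arrays, and all function names here are ours:

```python
def stretch(band, low=0.02, high=0.98):
    """Linear percentile stretch of one band's reflectances to [0, 1]."""
    vals = sorted(band)
    lo = vals[int(low * (len(vals) - 1))]
    hi = vals[int(high * (len(vals) - 1))]
    span = max(hi - lo, 1e-9)  # guard against a flat band
    return [min(max((v - lo) / span, 0.0), 1.0) for v in band]


def make_composite(b4, b8, b12):
    """Interleave red (B4), NIR (B8), and SWIR (B12) into per-pixel
    (r, g, b) triples forming a tri-band training chip."""
    return list(zip(stretch(b4), stretch(b8), stretch(b12)))


# Tiny hypothetical 2 x 2 chip, flattened to four pixels per band.
b4 = [0.10, 0.20, 0.30, 0.40]
b8 = [0.50, 0.60, 0.70, 0.80]
b12 = [0.05, 0.10, 0.15, 0.20]
chip = make_composite(b4, b8, b12)  # four (r, g, b) triples in [0, 1]
```

The per-band stretch matters in arid scenes: without it, bright bare soil compresses the dynamic range of the SWIR channel and washes out the forest/background contrast the combination is meant to provide.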

4.3.1. FCN Large Forest Area Extraction Analysis

To validate the model’s effectiveness and generalization capability in extracting complex forest areas in arid regions, this study specifically selected representative imagery of typical coniferous forests in the mid-to-high mountain zone of the northern Tianshan Mountains in Xinjiang, along with representative imagery from the forest belt of the Altai Mountains as validation samples. These areas encompass highly heterogeneous mountain landscapes, terrain shadow interference, and challenging scenarios such as forest-grass transition zones. By integrating multispectral remote sensing imagery from Sentinel-2 satellite’s optimal band combination (B4, B8, and B12), the forestland extraction performance of the FCN model was evaluated. Visual interpretation served as the benchmark for comparative analysis, assessing the model’s segmentation accuracy in small-scale fragmented forestlands relative to large contiguous forest areas.
Figure 10 displays the segmentation results of large contiguous forest areas by the FCN model using the spectral band combination (B4, B8, B12). It can be observed that the FCN generally captures the main contours and spatial distribution patterns of large-scale forest areas effectively. The red regions largely align with the actual forest boundaries, indicating the model’s strong recognition capability in large, homogeneous forest areas with distinct multispectral features. However, close-up details within the white-boxed areas in the third row reveal that FCN exhibits noticeable misclassification and underclassification in forest interior regions with relatively complex textures or those affected by shading and sparse canopy cover. Specifically, (1) patchy black voids appear within forest areas; (2) jagged fractures and excessive smoothing occur along some forest edges, resulting in incomplete boundaries; (3) minor non-forest areas are erroneously included within forest boundaries. These misclassifications primarily stem from FCN’s shallow feature extraction architecture and lack of effective multi-scale contextual fusion mechanisms. This results in insufficient perception of local spectral heterogeneity and fine-grained edge information. Consequently, while FCN maintains overall morphological consistency in large-scale forest extraction, it exhibits deficiencies in detail fidelity and noise robustness.

4.3.2. Extraction and Analysis of FCN Small Forest Areas

Small forest patches often exhibit fragmented, small-scale distribution patterns, making them susceptible to interference from terrain shadows, sparse canopy cover, and surrounding heterogeneous backgrounds. Compared to large contiguous forest areas, their smaller spatial scale and highly dispersed distribution result in inadequate feature representation in remote sensing imagery. When extracting small forest patches, models may exhibit oversegmentation, shape distortion, and misclassification errors due to limited receptive fields in neural network structures and insufficient multiscale context fusion. Therefore, the following discussion will focus on evaluating the model’s performance in extracting small forest patches.
Figure 11 illustrates the performance of the FCN model in extracting small, patchy forest areas under the band combination (B4, B8, B12). Compared to large contiguous forest areas, the FCN model shows reduced performance in identifying small-scale, fragmented forest patches, with segmentation results exhibiting noticeable fragmentation and incompleteness. The local details highlighted by the white box in the third row clearly reveal the following errors: (1) Complete segmentation omissions occur, manifesting as large black backgrounds where actual forest areas should be (segmentation omission errors); (2) detected forest patches exhibit shape distortion and area underestimation, with rough boundaries and internal fractures or voids; (3) non-forest areas (e.g., alpine meadows or shadow zones) are misclassified as forest, forming isolated red spots. These phenomena primarily stem from FCN’s shallow network architecture and simplistic upsampling strategy. This results in limited receptive fields and insufficient fusion of multi-scale contextual information. When confronted with small forest patches exhibiting weak spectral responses, blurred boundaries, and highly dispersed spatial distribution, its feature representation capability rapidly degrades, making it difficult to effectively capture fine-grained semantic details and precise edge information.

4.4. Comparison Analysis of Forest Extraction Results with ESA and Dynamic World Datasets

To further evaluate the forest extraction performance of FCN, selected areas were used to compare FCN’s extraction results with forest classification data from the ESA and Dynamic World datasets. Figure 12 displays the comparison of forest land extraction results based on the fully convolutional network (FCN) with two mainstream global land cover products within the study area (81°40′ E–82°00′ E, 43°20′ N–43°30′ N).
In terms of spatial distribution patterns, the FCN model and ESA WorldCover exhibit consistent performance in delineating forest boundaries. Both effectively capture the intricate texture of intermingled coniferous forests and mountain meadows densely distributed across shaded mountain slopes and river valleys, featuring smooth edges with minimal patch fragmentation. They successfully suppress misclassifications of bare rock, shaded areas, and alpine grasslands. In contrast, Dynamic World exhibits a significant overestimation of forest areas in the same region. This is particularly evident in high-altitude exposed ridges and seasonally snow-covered zones, where large contiguous areas are erroneously classified as red, resulting in an overly continuous and expanded forest distribution pattern. This does not align with the actual sparse vegetation landscape features observed in true-color imagery.
Table 3 quantitatively compares the intersection-over-union (IoU) and proportion of intersecting areas among the forest land extraction results from FCN, ESA WorldCover, and Dynamic World across the entire study area test set, further validating the conclusions of visual interpretation. Among these, FCN and ESA WorldCover exhibited the highest IoU (70.87%), with the intersection area accounting for 87.40% of ESA’s extraction results and 78.94% of FCN’s, indicating superior spatial consistency compared to other combinations. The IoU between FCN and Dynamic World was 62.98%, while that between ESA and Dynamic World was only 57.94%. This reflects a systematic overestimation by Dynamic World in this high-altitude mountainous environment, leading to a marked reduction in overlap with both physically based and deep learning-based products. The jointly identified forested area (intersection area: 1327.47 km2) accounted for only 55.41% of Dynamic World’s extracted area, yet represented 76.12% and 84.27% of FCN and ESA’s respective areas. This further quantifies Dynamic World’s misclassification rate in bare ground and shaded regions.
Through comprehensive qualitative and quantitative evaluation, the FCN model proposed in this paper demonstrates the highest spatial consistency with the authoritative ESA WorldCover product in complex mountainous environments, outperforming the near-real-time Dynamic World product. It exhibits distinct advantages, particularly in suppressing forest overestimation and preserving edge details. These results demonstrate that for diverse complex terrains, deep learning-driven regional optimization models deliver higher-precision forest land information compared to globally generalized products, providing more reliable spatial baseline data for subsequent ecological monitoring and carbon sink accounting.
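The agreement statistics reported in Table 3 (IoU and each product's share of the intersection) can be computed with a small helper like the following. This is our own illustrative function operating on flattened binary masks, not the evaluation code used in the study:

```python
def mask_agreement(a, b):
    """IoU and mutual intersection shares for two flat binary masks (0/1),
    e.g. FCN output vs. a reference land-cover product."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return {
        "IoU": inter / union if union else 0.0,
        "share_of_a": inter / sum(a) if sum(a) else 0.0,  # intersection / area(a)
        "share_of_b": inter / sum(b) if sum(b) else 0.0,  # intersection / area(b)
    }


# Toy 6-pixel example: the two masks agree on 3 forest pixels.
fcn = [1, 1, 1, 1, 0, 0]
esa = [1, 1, 1, 0, 1, 0]
stats = mask_agreement(fcn, esa)  # IoU = 0.6; each product's share = 0.75
```

The asymmetric shares are the useful diagnostic: a product with a low share of its own area in the intersection (as Dynamic World shows at 55.41%) is over-extracting relative to the consensus.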

4.5. Comparison of Small Patch Discretized Forest Extraction Results with ESA and Dynamic World Datasets

This study investigates the effectiveness of deep learning models in extracting forest areas in arid regions of Xinjiang using Sentinel-2 satellite imagery, examining different spectral band combinations. To evaluate the model’s capability in extracting small-patch fragmented woodlands, an additional experiment was conducted in a typical fragmented woodland area in Xinjiang. The FCN model was applied to extract woodlands using B4, B8, and B12 band combinations. The spatial distribution results were compared with woodland classification products from ESA and Dynamic World to assess the model’s adaptability and accuracy in complex terrain and sparse woodland scenarios. Located in the arid-semiarid zone of Xinjiang, this area features surface elements including rivers, farmlands, and scattered forest patches, making it susceptible to spectral confusion.
In Figure 13, panel (a) shows the true-color image of the study area, depicting green vegetation belts, rivers, and the surrounding desert. Panel (b) shows the forest binary mask extracted by the FCN model, where red pixels represent forest. These are sparsely distributed, primarily concentrated in densely populated non-agricultural areas, but underestimation occurs: the omission of small, fragmented forest patches leads to an overall undercount of forest area. Panel (c) presents the ESA forest extraction results; the numerous scattered red dots indicate overclassification, with non-forest vegetation (such as grassland or farmland) misclassified as forest. Panel (d) shows the Dynamic World results; the red areas follow linear distributions, capturing forests along riverbanks, but the product recognizes small, scattered patches poorly and thus also underclassifies. Overall, the FCN model tends to underestimate forest land under this band combination, indicating that it requires further optimization for small-patch forest monitoring to improve completeness.

5. Discussion

5.1. Exploring the Mechanisms Driving Performance Differences in Deep Learning-Based Forest Extraction Across Different Band Combinations

This study investigates the impact of different Sentinel-2 band combinations on the performance of deep learning models such as DeepLabV3+, PSPNet, and FCN in extracting forest areas in arid regions of Xinjiang. Results indicate that combinations incorporating shortwave infrared bands (e.g., B4 + B8 + B12 and B7 + B8 + B11) outperform the true-color combination (B2 + B3 + B4), which uses only visible bands, achieving higher aAcc, mIoU, and mFscore. This improvement stems primarily from the sensitivity of shortwave infrared bands to vegetation moisture content and structural variations, effectively mitigating spectral confusion between forested areas and grasslands, farmlands, or bare soil. This enhancement boosts model robustness, particularly in complex mountain-oasis landscapes. Among these, the FCN model demonstrated outstanding performance across all band combinations, achieving a maximum mIoU of 89.45%. This advantage stems from its skip-connection architecture, which excels in restoring boundary details, whereas DeepLabV3+ and PSPNet showed slight deficiencies in capturing multi-scale context. These findings provide empirical evidence for optimizing multispectral remote sensing data inputs, confirming the critical role of band selection in enhancing the accuracy of deep learning-based forest land segmentation.
Furthermore, the incorporation of red-edge bands into the spectral combination enhances the model’s ability to identify canopy health status. This is because its response to chlorophyll absorption peaks helps capture differences in vegetation photosynthesis, thereby reducing edge blurring errors in the transition zone between alpine sparse forests and mountain shrublands. In contrast, while the B3 + B4 + B8 combination enhances near-infrared reflectance to highlight vegetation coverage, it exhibits lower sensitivity to degraded forest areas under water stress, making it susceptible to seasonal drought interference during poplar forest extraction. Experiments revealed that deep learning networks automatically extract complementary features between these bands through convolutional layers. For instance, under B4 + B8 + B12, the model’s attention mechanism prioritizes multi-scale spectral gradient fusion, thereby optimizing suppression of terrain shadows and cloud shadows. This overcomes the generalization bottleneck of traditional RGB inputs under complex illumination conditions. This mechanism difference stems not only from intrinsic correlations in spectral physical properties but also reflects the indirect influence of sample diversity in dataset construction on network parameter optimization, pointing the way forward for subsequent multimodal remote sensing fusion research.

5.2. Adaptability and Limitations of Deep Learning Models in Complex Forest Landscapes of Arid Regions

In highly heterogeneous forest environments of arid regions, deep learning models demonstrate certain structural advantages that enable them to maintain relatively stable recognition capabilities in scenarios with subtle spectral differences and significant background interference. Particularly in areas where coniferous forests, desert riparian forests, and alpine shrublands intermingle, the model can capture subtle variations in forest texture, canopy morphology, and surface structure through multi-layer convolutional or spatial pooling structures. This compensates for the limitations of traditional methods relying solely on spectral information in complex backgrounds. By integrating multispectral inputs, deep learning models extract more discriminative spatial features when addressing typical disturbances such as terrain shadows, seasonal snow cover, and forest-agricultural mosaics. This enables consistent characterization of large contiguous forest areas. These findings indicate that deep learning architectures demonstrate good adaptability to forest types with stable spatial organization, making them highly practical for ecological monitoring and forest resource surveys in arid regions.
At the same time, the limitations of deep learning in forest extraction within arid regions cannot be overlooked, particularly its constrained ability to handle extremely complex backgrounds and fragmented forest patches. When vegetation coverage is low, canopy density is sparse, or terrain obstructs visibility, deep learning models often fail to adequately capture the weak spectral responses of forest areas, leading to missed detections of small-scale patches. Furthermore, features such as high-albedo bare rock, shaded areas, and alpine meadows may exhibit spectral and textural similarities to forested areas at local scales. This similarity can lead to misclassification when models lack sufficient contextual information. In summary, while deep learning shows high potential for forest extraction in arid regions, comprehensively enhancing its reliability in complex landscapes requires ongoing exploration in multi-scale feature modeling, training data diversification, and model structure optimization.

5.3. Region-Specificity and Generalizable Components of Deep Learning Model Frameworks

Within this research framework, dataset construction is region-specific. The study primarily uses 10 Sentinel-2 scenes covering the entire Xinjiang region, encompassing typical vegetation types such as mid-elevation coniferous forests on the northern slopes of the Tianshan Mountains, Altai forests, alpine shrublands in the Kunlun Mountains, and river-valley poplar forests. These samples intentionally incorporate challenges unique to arid regions, such as regional snow cover, terrain shadow interference, and spectral confusion among forest, grassland, and agricultural land. Labeling was performed with ArcGIS Pro, targeting a binary classification task for Xinjiang's mountain-oasis-desert composite ecosystems. This design addresses Xinjiang's ecological fragility and forest degradation and cannot be transferred directly to humid or tropical regions; new samples must instead be collected according to local vegetation spectra and interference factors. Furthermore, while the training parameters themselves are generic, optimization was constrained by the requirement of cloud cover below 6% and by seasonal image selection. Model extraction results were validated against ESA WorldCover and Dynamic World benchmarks specifically in Xinjiang's alpine mountain regions, highlighting the suppression of bare-rock and shadow misclassification within the area.
Conversely, the core methodology and technical workflow of the framework are expected to be generalizable, encompassing optimized strategies for four tri-band combinations (B2 + B3 + B4 true color, B3 + B4 + B8 false color, B7 + B8 + B11 red-edge enhancement, B4 + B8 + B12 shortwave infrared enhancement). This strategy, grounded in spectral physics, can be extended to other Sentinel-2 datasets with only resolution resampling adjustments. Deep learning model architectures such as FCN based on HRNet-18s, DeepLabV3+'s ASPP module, and PSPNet's PPM, together with the associated training procedures—including 512 × 512 cropping, an 8:2 training-validation split, loss convergence monitoring, and the evaluation metrics—are domain standards. This framework can be directly transferred to vegetation extraction tasks in similar arid or semi-arid environments (e.g., the Central Asian Gobi, Middle Eastern deserts, or Africa's Sahel region). Through transfer learning, the framework can rapidly adapt to new regions, with expected generalization accuracy exceeding 85%.
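Two of the transferable preprocessing steps named above, 512 × 512 cropping and the 8:2 training-validation split, can be sketched as follows. This is a simplified illustration with our own function names, not the study's actual data pipeline:

```python
import random


def tile_corners(width, height, tile=512):
    """Top-left corners of non-overlapping tile x tile crops of a scene."""
    return [(x, y)
            for y in range(0, height - tile + 1, tile)
            for x in range(0, width - tile + 1, tile)]


def train_val_split(items, ratio=0.8, seed=42):
    """Deterministically shuffle and split items at `ratio` (8:2 by default)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * ratio)
    return items[:cut], items[cut:]


# A hypothetical 2048 x 1024 scene yields 4 x 2 = 8 tiles.
corners = tile_corners(2048, 1024)
train, val = train_val_split(corners, ratio=0.8)  # 6 train, 2 val
```

Fixing the shuffle seed keeps the split reproducible across runs, which matters when the same tiles must feed all three model architectures for a fair comparison.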

6. Conclusions

This study demonstrates the superior performance and robust capability of deep learning models in extracting forest areas from Sentinel-2 remote sensing imagery within the highly heterogeneous mountain-oasis-desert landscapes of Xinjiang's complex arid regions. The FCN model using the B4, B8, and B12 band combination achieved the best results, with a mean Intersection over Union (mIoU) of 89.45% and a mean F-score (mFscore) of 94.23%, outperforming the benchmark models DeepLabV3+ and PSPNet and delivering higher extraction accuracy for both large contiguous forest areas and small fragmented forest patches. This performance largely stems from the self-built high-quality training dataset and a rational band combination strategy: through diversified sample collection and systematic band selection, four distinct band-combination datasets were constructed, including B4 + B8 + B12. Comparative experiments further demonstrate that the B4 + B8 + B12 combination enhances overall accuracy, while the other combinations (B2 + B3 + B4, B3 + B4 + B8, and B7 + B8 + B11) provide effective supplementary improvements under specific terrain backgrounds or spectral reflectance characteristics. These findings not only substantially enhance the model's practical applicability in complex arid environments but also provide crucial reference and empirical evidence for band selection and dataset construction in subsequent remote sensing forest land extraction studies.

Author Contributions

Conceptualization, H.Z. and X.M.; methodology, H.Z. and X.M.; software, H.Z.; validation, H.Z., K.L. and L.D.; formal analysis, K.L.; investigation, L.D.; resources, X.M.; data curation, H.Z.; writing—original draft preparation, H.Z.; writing—review and editing, H.Z. and X.M.; visualization, H.Z.; supervision, F.Z. and X.M.; project administration, H.Z. and X.M.; funding acquisition, X.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (42461052), the Tianshan Elite Program (Third Batch) of the Xinjiang Uygur Autonomous Region—Young Elite Talent in Science and Technology Innovation (2024TSYCCX0025), the Department of Science and Technology of the Xinjiang Uygur Autonomous Region, and the Autonomous Region Outstanding Youth Fund (No.: TBD).

Data Availability Statement

Some data is provided in the text. For the full dataset, please visit the website: http://www.ma-lab.cn/. This website is maintained long-term and can be accessed at any time.

Acknowledgments

The authors are particularly grateful to all researchers and institutions that provided data support for this study. The authors would also like to thank the editors and reviewers for their valuable comments, which significantly helped us improve the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Nesha, K.; Herold, M.; De Sy, V.; E Duchelle, A.; Martius, C.; Branthomme, A.; Garzuglia, M.; Jonsson, O.; Pekkarinen, A. An assessment of data sources, data quality and changes in national forest monitoring capacities in the Global Forest Resources Assessment 2005–2020. Environ. Res. Lett. 2021, 16, 054029. [Google Scholar] [CrossRef]
  2. MacDicken, K.G. Global forest resources assessment 2015: What, why and how? For. Ecol. Manage. 2015, 352, 3–8. [Google Scholar] [CrossRef]
  3. Bonan, G.B. Forests and climate change: Forcings, feedbacks, and the climate benefits of forests. Science 2008, 320, 1444–1449. [Google Scholar] [CrossRef]
  4. He, T.; Zhou, H.; Xu, C.; Hu, J.; Xue, X.; Xu, L.; Lou, X.; Zeng, K.; Wang, Q. Deep learning in forest tree species classification using sentinel-2 on google earth engine: A case study of Qingyuan County. Sustainability 2023, 15, 2741. [Google Scholar] [CrossRef]
  5. Duan, R.; Huang, C.; Dou, P.; Hou, J.; Zhang, Y.; Gu, J. Fine-scale forest classification with multi-temporal sentinel-1/2 imagery using a temporal convolutional neural network. Int. J. Digit. Earth 2025, 18, 2457953. [Google Scholar] [CrossRef]
  6. Rui, H.; Luo, B.; Wang, Y.; Zhu, L.; Zhu, Q. Quantitative impacts of climate change and human activities on grassland growth in Xinjiang, China. Front. Plant Sci. 2025, 15, 1497248. [Google Scholar] [CrossRef]
  7. Yang, L.; Feng, Q.; Adamowski, J.F.; Alizadeh, M.R.; Yin, Z.; Wen, X.; Zhu, M. The role of climate change and vegetation greening on the variation of terrestrial evapotranspiration in northwest China’s Qilian Mountains. Sci. Total Environ. 2021, 759, 143532. [Google Scholar] [CrossRef]
  8. Zhou, J.; Zan, M.; Zhai, L.; Yang, S.; Xue, C.; Li, R.; Wang, X. Remote sensing estimation of aboveground biomass of different forest types in Xinjiang based on machine learning. Sci. Rep. 2025, 15, 6187. [Google Scholar] [CrossRef]
  9. Belgiu, M.; Drăguţ, L. Random forest in remote sensing: A review of applications and future directions. ISPRS J. Photogramm. Remote Sens. 2016, 114, 24–31. [Google Scholar] [CrossRef]
  10. Mountrakis, G.; Im, J.; Ogole, C. Support vector machines in remote sensing: A review. ISPRS J. Photogramm. Remote Sens. 2011, 66, 247–259. [Google Scholar] [CrossRef]
  11. Yan, G.; Mas, J.F.; Maathuis, B.; Xiangmin, Z.; Van Dijk, P. Comparison of pixel-based and object-oriented image classification approaches—A case study in a coal fire area, Wuda, Inner Mongolia, China. Int. J. Remote Sens. 2006, 27, 4039–4055. [Google Scholar] [CrossRef]
  12. Tian, S.; Zhang, X.; Tian, J.; Sun, Q. Random forest classification of wetland landcovers from multi-sensor data in the arid region of Xinjiang, China. Remote Sens. 2016, 8, 954. [Google Scholar] [CrossRef]
  13. Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177. [Google Scholar] [CrossRef]
  14. Lv, J.; Shen, Q.; Lv, M.; Li, Y.; Shi, L.; Zhang, P. Deep learning-based semantic segmentation of remote sensing images: A review. Front. Ecol. Evol. 2023, 11, 1201125. [Google Scholar] [CrossRef]
  15. Dimitrovski, I.; Spasev, V.; Loshkovska, S.; Kitanovski, I. U-net ensemble for enhanced semantic segmentation in remote sensing imagery. Remote Sens. 2024, 16, 2077. [Google Scholar] [CrossRef]
  16. Kalinaki, K.; Malik, O.A.; Lai, D.T.C. FCD-AttResU-Net: An improved forest change detection in Sentinel-2 satellite images using attention residual U-Net. Int. J. Appl. Earth Obs. Geoinf. 2023, 122, 103453. [Google Scholar] [CrossRef]
  17. Chen, X.; Xie, S.; He, K. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9640–9649.
  18. Cheng, G.; Xie, X.; Han, J.; Guo, L.; Xia, G.-S. Remote sensing image scene classification meets deep learning: Challenges, methods, benchmarks, and opportunities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3735–3756.
  19. Petersson, H.; Gustafsson, D.; Bergstrom, D. Hyperspectral image analysis using deep learning—A review. In Proceedings of the 2016 Sixth International Conference on Image Processing Theory, Tools and Applications (IPTA), Oulu, Finland, 12–15 December 2016; pp. 1–6.
  20. Huang, Z.; Zhong, L.; Zhao, F.; Wu, J.; Tang, H.; Lv, Z.; Xu, B.; Zhou, L.; Sun, R.; Meng, R. A spectral-temporal constrained deep learning method for tree species mapping of plantation forests using time series Sentinel-2 imagery. ISPRS J. Photogramm. Remote Sens. 2023, 204, 397–420.
  21. Nasiri, V.; Beloiu, M.; Darvishsefat, A.A.; Griess, V.C.; Maftei, C.; Waser, L.T. Mapping tree species composition in a Caspian temperate mixed forest based on spectral-temporal metrics and machine learning. Int. J. Appl. Earth Obs. Geoinf. 2023, 116, 103154.
  22. Vorovencii, I.; Dincă, L.; Crișan, V.; Postolache, R.-G.; Codrean, C.-L.; Cătălin, C.; Greșiță, C.I.; Chima, S.; Gavrilescu, I. Local-scale mapping of tree species in a lower mountain area using Sentinel-1 and -2 multitemporal images, vegetation indices, and topographic information. Front. For. Global Change 2023, 6, 1220253.
  23. Tan, J.; Li, J.; Ma, T.; Yan, X.; Huo, Z. Leveraging Sentinel-1/2 time series and deep learning for accurate forest tree species mapping. Front. For. Global Change 2025, 8, 1599510.
  24. Immitzer, M.; Neuwirth, M.; Böck, S.; Brenner, H.; Vuolo, F.; Atzberger, C. Optimal input features for tree species classification in Central Europe based on multi-temporal Sentinel-2 data. Remote Sens. 2019, 11, 2599.
  25. Grabska, E.; Hostert, P.; Pflugmacher, D.; Ostapowicz, K. Forest stand species mapping using the Sentinel-2 time series. Remote Sens. 2019, 11, 1197.
  26. Persson, M.; Lindberg, E.; Reese, H. Tree species classification with multi-temporal Sentinel-2 data. Remote Sens. 2018, 10, 1794.
  27. Bolyn, C.; Michez, A.; Gaucher, P.; Lejeune, P.; Bonnet, S. Forest mapping and species composition using supervised per pixel classification of Sentinel-2 imagery. Biotechnol. Agron. Société Environ. 2018, 22, 16.
  28. Fassnacht, F.E.; Latifi, H.; Stereńczak, K.; Modzelewska, A.; Lefsky, M.; Waser, L.T.; Straub, C.; Ghosh, A. Review of studies on tree species classification from remotely sensed data. Remote Sens. Environ. 2016, 186, 64–87.
  29. Axelsson, A.; Lindberg, E.; Reese, H.; Olsson, H. Tree species classification using Sentinel-2 imagery and Bayesian inference. Int. J. Appl. Earth Obs. Geoinf. 2021, 100, 102318.
  30. Grabska, E.; Frantz, D.; Ostapowicz, K. Evaluation of machine learning algorithms for forest stand species mapping using Sentinel-2 imagery and environmental data in the Polish Carpathians. Remote Sens. Environ. 2020, 251, 112103.
  31. Wessel, M.; Brandmeier, M.; Tiede, D. Evaluation of different machine learning algorithms for scalable classification of tree types and tree species based on Sentinel-2 data. Remote Sens. 2018, 10, 1419.
  32. Qarallah, B.; Othman, Y.A.; Al-Ajlouni, M.; Alheyari, H.A.; Qoqazeh, B.A. Assessment of small-extent forest fires in semi-arid environment in Jordan using Sentinel-2 and Landsat sensors data. Forests 2022, 14, 41.
  33. Numbisi, F.N. Minding spatial allocation entropy: Sentinel-2 dense time series spectral features outperform vegetation indices to map desert plant assemblages. Remote Sens. 2025, 17, 2553.
  34. Shafri, H.Z.M.; Taherzadeh, E.; Mansor, S.; Ashurov, R. Hyperspectral remote sensing of urban areas: An overview of techniques and applications. Res. J. Appl. Sci. Eng. Technol. 2012, 4, 1557–1565.
  35. Wang, J.; Zhang, F.; Jim, C.-Y.; Chan, N.W.; Johnson, V.C.; Liu, C.; Duan, P.; Bahtebay, J. Spatio-temporal variations and drivers of ecological carrying capacity in a typical mountain-oasis-desert area, Xinjiang, China. Ecol. Eng. 2022, 180, 106672.
  36. Guo, X.; Zhu, L.; Yang, Z.; Yang, C.; Li, Z. Spatial–temporal changes in the distribution of Populus euphratica Oliv. forests in the Tarim Basin and analysis of influencing factors from 1990 to 2020. Forests 2024, 15, 1384.
  37. Li, G.; Liang, J.; Wang, S.; Zhou, M.; Sun, Y.; Wang, J.; Fan, J. Characteristics and drivers of vegetation change in Xinjiang, 2000–2020. Forests 2024, 15, 231.
  38. Zhang, Y.; JiMei, L.; Chang, S.; Xiang, L.; JianJiang, L. Spatial distribution pattern of Picea schrenkiana population in the Middle Tianshan Mountains and the relationship with topographic attributes. J. Arid. Land 2012, 4, 457–468.
  39. Mirzabaev, A.; Ahmed, M.; Werner, J.; Pender, J.; Louhaichi, M. Rangelands of Central Asia: Challenges and opportunities. J. Arid. Land 2016, 8, 93–108.
  40. Zanaga, D.; Van De Kerchove, R.; Daems, D.; De Keersmaecker, W.; Brockmann, C.; Kirches, G.; Wevers, J.; Cartus, O.; Santoro, M.; Fritz, S. ESA WorldCover 10 m 2021 v200; IIASA PURE: Laxenburg, Austria, 2022.
  41. Brown, C.F.; Brumby, S.P.; Guzder-Williams, B.; Birch, T.; Hyde, S.B.; Mazzariello, J.; Czerwinski, W.; Pasquarella, V.J.; Haertel, R.; Ilyushchenko, S.; et al. Dynamic World, near real-time global 10 m land use land cover mapping. Sci. Data 2022, 9, 251.
  42. Martimort, P.; Fernandez, V.; Kirschner, V.; Isola, C.; Meygret, A. Sentinel-2 MultiSpectral Imager (MSI) and calibration/validation. In Proceedings of the 2012 IEEE International Geoscience and Remote Sensing Symposium, Munich, Germany, 22–27 July 2012; pp. 6999–7002.
  43. Anita, N.; Sukojo, B.M.; Meisajiwa, S.H.; Romadhon, M.A. Oil pattern identification analysis using semantic deep learning method from Pleiades-1B satellite imagery with ArcGIS Pro software (Case Study: Village "A"). IOP Conf. Ser. Earth Environ. Sci. 2021, 936, 012021.
  44. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
  45. da Silveira, H.L.F.; Galvão, L.S.; Sanches, I.D.; de Sá, I.B.; Taura, T.A. Use of MSI/Sentinel-2 and airborne LiDAR data for mapping vegetation and studying the relationships with soil attributes in the Brazilian semi-arid region. Int. J. Appl. Earth Obs. Geoinf. 2018, 73, 179–190.
  46. Qiu, S.; He, B.; Yin, C.; Liao, Z. Assessments of Sentinel-2 vegetation red-edge spectral bands for improving land cover classification. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2017, 42, 871–874.
  47. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
  48. Li, X.; Li, Y.; Ai, J.; Shu, Z.; Xia, J.; Xia, Y. Semantic segmentation of UAV remote sensing images based on edge feature fusing and multi-level upsampling integrated with Deeplabv3+. PLoS ONE 2023, 18, e0279097.
  49. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
  50. Su, Y.; Cheng, J.; Bai, H.; Liu, H.; He, C. Semantic segmentation of very-high-resolution remote sensing images via deep multi-feature learning. Remote Sens. 2022, 14, 533.
  51. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  52. Wang, X.; Hu, Z.; Shi, S.; Hou, M.; Xu, L.; Zhang, X. A deep learning method for optimizing semantic segmentation accuracy of remote sensing images based on improved UNet. Sci. Rep. 2023, 13, 7600.
  53. Wang, Y.; Yang, L.; Liu, X.; Yan, P. An improved semantic segmentation algorithm for high-resolution remote sensing images based on DeepLabv3+. Sci. Rep. 2024, 14, 9716.
  54. Wang, J.; Chen, T.; Zheng, L.; Tie, J.; Zhang, Y.; Chen, P.; Luo, Z.; Song, Q. A multi-scale remote sensing semantic segmentation model with boundary enhancement based on UNetFormer. Sci. Rep. 2025, 15, 14737.
Figure 1. Overview of the study area: red boxes denote the training and validation datasets; blue boxes denote the test dataset; yellow and red circles indicate the numbers of the different sample types selected within each area.
Figure 2. Schematic diagram of the study area under different band combinations ((a): composite image of B4, B3, B2 bands; (b): composite image of B3, B4, B8 bands; (c): composite image of B7, B8, B11 bands; (d): composite image of B4, B8, B12 bands).
Figure 3. Schematic diagram of Sentinel-2 band extraction combination (Vre denotes vegetation red edge).
Figure 4. DeepLabV3+ network architecture diagram.
Figure 5. PSPNet network framework diagram.
Figure 6. FCN architecture diagram.
Figure 7. Schematic diagram of forest extraction performance for each model under different band combinations (The red represents forest land, and black represents non-forest land).
Figure 8. Schematic diagram of forest extraction performance for each model under different band combinations (The red represents forest land, and black represents non-forest land).
Figure 9. Loss functions under different band combinations for each model ((a) loss function curves of DeepLabV3+ under different band combinations; (b) loss function curves of PSPNet under different band combinations; (c) loss function curves of FCN under different band combinations).
Figure 10. Schematic diagram of FCN extraction results for large forest areas in the B4, B8, and B12 bands (The red represents forest land, and black represents non-forest land. The white boxes indicate areas where the model made incorrect identifications).
Figure 11. Schematic diagram of FCN extraction results for small forest patches under B4, B8, and B12 bands (The red represents forest land, and black represents non-forest land. The white boxes indicate areas where the model made incorrect identifications).
Figure 12. Extraction Performance of FCN in the (B4, B8, B12) band combination compared with different land classification products ((a): True-color composite image of B4, B3, B2 bands; (b): binary mask of forest land extraction by FCN model using B4, B8, B12 band combination; (c): forest land category extraction results from ESA; (d): forest land category extraction results from Google Dynamic World).
Figure 13. Extraction performance of FCN in the (B4, B8, B12) band combination compared with different land classification products ((a): true-color composite image of B4, B3, B2 bands; (b): binary mask of forest land extraction by FCN model using B4, B8, B12 band combination; (c): forest land category extraction results from ESA; (d): forest land category extraction results from Google Dynamic World).
Table 1. Spectral characteristics and resolution of Sentinel-2 bands used in the study.

| Band | Band Name | Center Wavelength (nm) | Bandwidth (nm) | Spatial Resolution (m) |
|------|-----------|-----------------------|----------------|------------------------|
| B2   | Blue      | 490  | 65  | 10 |
| B3   | Green     | 560  | 35  | 10 |
| B4   | Red       | 665  | 30  | 10 |
| B7   | Vegetation red edge | 783 | 20 | 20 |
| B8   | NIR       | 842  | 115 | 10 |
| B11  | SWIR 1    | 1610 | 90  | 20 |
| B12  | SWIR 2    | 2190 | 180 | 20 |
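Because the study's tri-band composites mix 10 m bands (B4, B8) with 20 m bands (B7, B11, B12), the 20 m bands must be resampled to the 10 m grid before stacking. The paper's preprocessing code is not reproduced here; the following is a minimal sketch, using made-up toy reflectance arrays and simple nearest-neighbour upsampling, of how such a composite could be assembled:

```python
import numpy as np

def upsample_nearest(band20m, factor=2):
    """Nearest-neighbour upsampling of a 20 m band onto the 10 m grid
    (each 20 m pixel becomes a 2x2 block of 10 m pixels)."""
    return np.repeat(np.repeat(band20m, factor, axis=0), factor, axis=1)

def stack_composite(b4_10m, b8_10m, b12_20m):
    """Stack a (3, H, W) B4-B8-B12 composite on the 10 m grid."""
    b12_10m = upsample_nearest(b12_20m)
    return np.stack([b4_10m, b8_10m, b12_10m], axis=0)

# Toy arrays standing in for real Sentinel-2 surface reflectance tiles.
b4 = np.full((4, 4), 0.10)   # red, 10 m
b8 = np.full((4, 4), 0.40)   # NIR, 10 m
b12 = np.full((2, 2), 0.20)  # SWIR 2, 20 m: half the pixels per axis
composite = stack_composite(b4, b8, b12)
print(composite.shape)  # (3, 4, 4)
```

On real imagery the resampling would normally be applied to full rasters with a geospatial library (e.g., GDAL or rasterio) rather than to toy arrays, and bilinear or cubic resampling may be preferred over nearest-neighbour.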
Table 2. Accuracy metric values for different band combinations across various models.

| Model Name | Band Combination | aAcc | mIoU | mRecall | mFscore | mPrecision |
|------------|------------------|------|------|---------|---------|------------|
| DeepLabV3+ | B4, B3, B2  | 97.85 | 88.08 | 92.98 | 93.38 | 93.8  |
| DeepLabV3+ | B3, B4, B8  | 98.11 | 87.3  | 91.1  | 92.87 | 94.83 |
| DeepLabV3+ | B7, B8, B11 | 98.08 | 87.84 | 92.09 | 93.21 | 94.41 |
| DeepLabV3+ | B4, B8, B12 | 97.81 | 87.86 | 93.24 | 93.25 | 93.24 |
| PSPNet     | B2, B3, B4  | 97.62 | 86.38 | 93.04 | 92.31 | 91.61 |
| PSPNet     | B3, B4, B8  | 97.58 | 86.83 | 92.25 | 92.61 | 92.98 |
| PSPNet     | B7, B8, B11 | 97.73 | 87.96 | 93.57 | 93.32 | 93.07 |
| PSPNet     | B4, B8, B12 | 97.95 | 87.86 | 93.12 | 93.24 | 93.36 |
| FCN        | B2, B3, B4  | 97.93 | 87.72 | 93.57 | 93.15 | 92.74 |
| FCN        | B3, B4, B8  | 97.91 | 88.89 | 93.81 | 93.89 | 93.96 |
| FCN        | B7, B8, B11 | 98.15 | 88.88 | 92.82 | 93.86 | 94.97 |
| FCN        | B4, B8, B12 | 97.98 | 89.45 | 93.9  | 94.23 | 94.56 |
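The metrics in Table 2 (aAcc, mIoU, mRecall, mFscore, mPrecision) follow the standard per-class definitions derived from a confusion matrix. The study's evaluation code is not shown here; the sketch below, using a made-up two-class (non-forest/forest) confusion matrix, illustrates how these quantities are conventionally computed:

```python
import numpy as np

def segmentation_metrics(conf):
    """Standard semantic-segmentation metrics from conf[true, pred]."""
    tp = np.diag(conf).astype(float)        # per-class true positives
    fp = conf.sum(axis=0) - tp              # predicted as class but wrong
    fn = conf.sum(axis=1) - tp              # class pixels missed
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fscore = 2 * precision * recall / (precision + recall)
    return {
        "aAcc": tp.sum() / conf.sum(),      # overall pixel accuracy
        "mIoU": iou.mean(),
        "mRecall": recall.mean(),
        "mFscore": fscore.mean(),
        "mPrecision": precision.mean(),
    }

# Toy confusion matrix: rows = ground truth, columns = prediction.
conf = np.array([[90,  5],   # non-forest
                 [10, 95]])  # forest
metrics = segmentation_metrics(conf)
```

The "m" prefix denotes a mean over classes (here, forest and non-forest), which is why mIoU sits well below aAcc in Table 2: class-averaged IoU is not inflated by the abundant non-forest background.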
Table 3. Comparison of FCN model extraction performance with intersection-over-union and coverage relationships on the ESA and Dynamic World datasets.

| Dataset Comparison        | Intersection Area (km²) | Union Area (km²) | IoU (%) | ESA Area (%) | FCN Area (%) | Dynamic World Area (%) |
|---------------------------|------------------------|------------------|---------|--------------|--------------|------------------------|
| FCN versus ESA            | 1376.71 | 1942.46 | 70.87 | 87.40 | 78.94 | 87.39 |
| FCN versus Dynamic World  | 1599.74 | 2539.98 | 62.98 | 97.86 | 91.73 | 66.77 |
| ESA versus Dynamic World  | 1456.72 | 2514.28 | 57.94 | 92.48 | 83.24 | 60.80 |
| Common region             | 1327.47 | 2609.25 | 50.88 | 84.27 | 76.12 | 55.41 |
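The intersection, union, and IoU values in Table 3 are areal comparisons of binary forest masks. As a hedged illustration (the paper's comparison code is not available here, and the masks below are tiny made-up examples), such agreement statistics can be computed from two co-registered 10 m rasters as follows:

```python
import numpy as np

PIXEL_AREA_KM2 = (10 * 10) / 1e6  # one 10 m x 10 m pixel expressed in km^2

def mask_agreement(mask_a, mask_b):
    """Intersection area, union area (km^2) and IoU (%) of two
    co-registered binary forest masks."""
    inter = np.logical_and(mask_a, mask_b).sum() * PIXEL_AREA_KM2
    union = np.logical_or(mask_a, mask_b).sum() * PIXEL_AREA_KM2
    return inter, union, 100.0 * inter / union

# Toy masks standing in for the FCN, ESA, or Dynamic World forest layers.
a = np.array([[1, 1, 0],
              [0, 1, 0]], dtype=bool)
b = np.array([[1, 0, 0],
              [0, 1, 1]], dtype=bool)
inter, union, iou = mask_agreement(a, b)
print(round(iou, 2))  # 50.0
```

The per-product coverage percentages in Table 3 would follow the same pattern, dividing the intersection area by each product's own forest area instead of by the union.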

Share and Cite

Zhou, H.; Luo, K.; Dang, L.; Zhang, F.; Ma, X. Performance Evaluation of Deep Learning Models for Forest Extraction in Xinjiang Using Different Band Combinations of Sentinel-2 Imagery. Forests 2026, 17, 88. https://doi.org/10.3390/f17010088
