1. Introduction
With global environmental change and the intensification of human activities, forest land, one of the most important ecosystems on Earth, faces increasingly serious threats and damage. Remote sensing technology has become an important tool for forest land resource management and ecological environment monitoring by virtue of its wide coverage and high spatial and temporal resolution. By applying deep learning to forest land change detection in remote sensing images, the distribution and change trends of forest ecosystems can be understood more accurately and efficiently, which is of far-reaching significance and wide application value for the protection and management of forest land resources.
As the foundation of forest ecosystems, woodlands play a key role in maintaining ecological balance, regulating climate, and protecting biodiversity, and are a basic condition for implementing the Scientific Outlook on Development and guaranteeing the survival and development of human beings [1]. From 2014 to 2018, China carried out its ninth national forest resources inventory, which showed that the national forest area reached 220,446,200 hectares, accounting for 5.51% of the global forest area, with an overall favorable trend of increasing quantity, improving quality, and enhancing ecological functions [2]. However, China remains a country with scarce forests and fragile ecosystems, and the problems of an insufficient total amount of forest resources, low quality, and uneven distribution are prominent. At present, the average forest area per capita in China is only 0.16 hectares, less than one-third of the world average, and the per capita forest stock is 12.35 cubic meters, only one-sixth of the world average, so forestry development still faces serious challenges [3,4,5]. Therefore, how to quickly, efficiently, and economically obtain forest land resource information, establish a dynamic monitoring system, and realize scientific management and efficient utilization has become an important topic that urgently needs to be addressed at the national level, as well as a research focus of forest land management departments [6,7].
The traditional forest land survey relies on manual field collection and sampling-based statistics, which is inefficient and costly and suffers from strong subjectivity, information lag, and poor spatial continuity [8]. The emergence of remote sensing technology provides an effective solution for forest land information acquisition [9]. Remote sensing receives electromagnetic waves reflected or emitted by targets through satellite or airborne sensors to acquire ground surface information at long range, offering a wide imaging range, a short acquisition cycle, and few restrictions from ground conditions [10]. With the development of multispectral, hyperspectral, and unmanned aerial vehicle (UAV) remote sensing, forest land information acquisition is becoming increasingly efficient and intelligent [11,12].
Remote sensing image change detection is the main way to study changes in surface cover or feature types. It compares and analyzes two or more remote sensing images of the same area acquired at different times to identify and extract the areas in which changes have occurred [13,14,15]. This approach enables the effective monitoring of resources such as forest land according to specific types of changes and is an important tool in geoscientific research, urban planning, environmental assessment, and disaster warning, providing a decision-making basis for sustainable development and environmental protection [16,17]. However, in practical forest land change detection, interfering factors such as inconsistencies between sensors, changing climatic conditions, and atmospheric effects often introduce imaging differences between remote sensing images acquired at different times. These differences seriously degrade image quality and interpretability and significantly increase the difficulty and complexity of detection. Coping with them through appropriate preprocessing of the image data and the development of more robust algorithms has therefore become a core challenge in forest land change detection from remote sensing images.
Remote sensing images acquired in different periods differ in imaging environment and viewing angle. To ensure the accuracy and effectiveness of subsequent forest land change analysis, image preprocessing is usually required beforehand [18]. Commonly used preprocessing methods include geometric correction, radiometric correction, and image registration [19]. Geometric correction removes image distortion errors caused by sensor distortion, Earth curvature, atmospheric refraction, and terrain undulation to ensure the precise geospatial alignment of remote sensing images from different time points. Radiometric correction consists of two steps, radiometric calibration and atmospheric correction, which remove variations in image brightness caused by differences in sensor sensitivity, solar elevation angle, and atmospheric conditions so that images from different time points are comparable and can be analyzed on the same radiometric scale. Image registration matches and superimposes images acquired at different times, by different sensors, or under different conditions to correct minor misalignments caused by factors such as the Earth's rotation, sensor motion, and surface changes. Other preprocessing steps, including the removal of noise, clouds, and shadows, are equally important: they effectively improve the data quality of remote sensing images, provide a more reliable data basis for subsequent change detection algorithms, and ensure the accuracy and reliability of the change detection results.
In the field of remote sensing image change detection, extensive research has been conducted by scholars worldwide, leading to the establishment of a relatively comprehensive system of theories and methods [19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35]. From the perspective of technical evolution, early approaches mainly relied on simple algebraic operations such as image differencing and ratioing, combined with thresholding, to achieve pixel-level change extraction. These were later extended to feature-level change detection methods that fuse multiple types of features, as well as object-based approaches that take image objects or patches, rather than individual pixels, as the basic analysis units [22,23,24,25,30,31,32,33,34,35]. These traditional methods have, to some extent, improved the accuracy of change detection and provided important technical support for land use/land cover change and forest/woodland change monitoring. However, when dealing with high-resolution, multi-source, and multi-temporal remote sensing images, they still suffer from limited feature representation capability and insufficient robustness in complex scenes [12,13,14,15,16,17,18].
Currently, remote sensing image change detection methods based on deep learning twin (Siamese) networks can be subdivided into two types: binary change detection and semantic change detection. Binary change detection aims to recognize which pixels in a remote sensing image correspond to a change, without distinguishing the specific type of change. It is usually applied to monitor significant changes caused by natural disasters such as landslides and forest fires. Binary change detection can also be specialized for specific elements, directly detecting changes in features such as forest land, buildings, and roads. Fang et al. [36] proposed a densely connected twin network for change detection, which maintains a high-resolution representation through skip connections between the encoder and decoder to reduce the loss of deep localization information. Chen et al. [37] proposed a new twin-network-based spatio-temporal attention neural network, designing a self-attention mechanism to model spatio-temporal relations for building change detection. Semantic change detection is more complex in that it must identify which pixels have changed and also the type of change that has occurred at each location. Earlier semantic change detection was mainly based on post-classification comparison. However, such methods rely excessively on classification accuracy when localizing the change region, leading to error accumulation and inaccurate boundaries. Daudt et al. [38] proposed integrating semantic segmentation and binary change detection into a multi-task learning architecture in which the semantic change detection task can be performed efficiently by a single network model, whose internal structure also considers the interconnection between the two subtasks. Yang et al. [39] proposed an asymmetric twin network motivated by the differing proportions of land cover change in multi-temporal images; semantic changes were localized and identified by features obtained from modules with different structures and numbers of parameters, accounting for differences in land cover distribution, including forest land, across time periods.
Building on the above studies, a series of deep learning–based change detection networks for high-resolution remote sensing images have been proposed and have achieved promising results on public datasets. Daudt et al. proposed FC-Siam-conc and related Siamese fully convolutional networks, which perform end-to-end binary change detection by concatenating or differencing features extracted from bi-temporal image pairs and represent one of the early representative Siamese architectures in the change detection field [40]. Fang et al. proposed SNUNet-CD, which enhances multi-scale feature fusion through dense connections and a U-Net encoder–decoder structure, showing advantages in recognizing fine-scale changes in detailed regions [36]. Chen et al. proposed STANet, which introduces a spatial–temporal attention mechanism to highlight relevant change regions between bi-temporal images, making it suitable for complex backgrounds and long time intervals [37]. On this basis, Chen et al. further proposed the Transformer-based BIT method to better model long-range dependencies, demonstrating strong capability in representing changes under complex land-cover distributions and over large spatial extents [41]. Overall, these methods provide important technical support for high-resolution remote sensing change detection, but there is still room for improvement in boundary delineation for fragmented objects such as woodland, suppression of pseudo-changes, and model efficiency.
2. Methods
The uneven distribution of forest land, the diversity of tree species, and the complexity of imaging conditions make it challenging to automatically and accurately recognize forest land in remote sensing images. In view of the incomplete recognition of woodland boundary regions and the low segmentation accuracy for small woodland patches exhibited by ordinary convolutional neural networks, we conducted an in-depth study of woodland extraction algorithms for high-resolution remote sensing images and propose a High-Resolution Net with Attention Mechanism (CAM-HRNet). The method is based on the HRNet [42] backbone and exploits its parallel multi-resolution coding structure, which improves forest land segmentation accuracy while preserving detail information. By introducing a convolutional attention mechanism, the model pays more attention to the important regions of the image. Meanwhile, a stage-by-stage upsampling mechanism helps the model fuse multi-scale features more smoothly, further improving the accuracy of woodland area and boundary recognition. By combining CAM-HRNet with a twin network structure, we constructed a binary change detection model and a semantic change detection model for forest land change detection in remote sensing images.
The overall framework of the proposed woodland change detection method is shown in Figure 1 and consists of three main components: (1) a single-temporal woodland segmentation network based on CAM-HRNet; (2) a Siamese CAM-HRNet architecture for binary woodland change detection; and (3) an extended architecture for semantic change detection. In CAM-HRNet, the convolutional attention structure is a key component and is composed of the CBAM, the improved channel attention module, and the spatial attention module, whose detailed structures are shown in Figure 2, Figure 3 and Figure 4. The multi-scale feature fusion is implemented by a progressive upsampling scheme, as illustrated in Figure 5.
2.1. CAM-HRNet
This paper designed the CAM-HRNet network model to address the complexity and diversity of forest land features in remote sensing images and to extract forest land information more accurately. The model fully exploits the significant advantages of HRNet in processing high-resolution images: its multi-scale parallel structure enables the network to capture forest land features at different scales simultaneously, ensuring efficient extraction of forest land information from complex remote sensing images. The HRNet family includes several variants, such as HRNet-W18, HRNet-W40, and HRNet-W64, which differ in network depth and width to suit different tasks and scene requirements. HRNet-W40 was chosen as the backbone network in this paper to balance model accuracy and computational efficiency.
To further improve performance, this paper introduced a convolutional attention mechanism that strengthens the model's ability to capture and express key features, enabling the network to focus on recognizable woodland feature regions and segment woodland more accurately. This paper also constructed a stage-by-stage upsampling mechanism to help the model fuse multi-scale features more smoothly and reduce the loss of image quality incurred during feature fusion. The structure of the CAM-HRNet model is shown in Figure 1. Specifically, CAM-HRNet integrates an improved convolutional attention module into each resolution branch of HRNet, where the module is derived from the original CBAM by replacing its MLP-based channel attention with the proposed 1D-convolution-based variant while retaining the spatial attention design.
The main process of this network model is as follows. (1) First, the input image is processed by the HRNet backbone, which generates four feature layers of different resolutions covering information from fine details to global context, providing a rich feature representation for subsequent processing. (2) Each feature layer is fed into a convolutional attention module consisting of channel attention (CA) and spatial attention (SA) to suppress irrelevant features and reinforce key features. (3) The enhanced multi-scale features are fused by stage-by-stage upsampling to make the feature fusion more comprehensive and adequate. (4) Finally, the FCN decoder decodes the fused features and restores them to the original image size by interpolation-based upsampling to obtain the final result. The whole process combines the multi-scale feature extraction capability of HRNet with the focusing characteristics of the convolutional attention mechanism to realize accurate extraction and segmentation of forest land information.
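As a rough illustration of the data flow, the four steps above can be sketched as a composition of stages. All stage implementations below (`backbone`, `attention`, `fuse`, `decode`) are toy placeholders standing in for the components described in this section, not the authors' code.

```python
import numpy as np

def cam_hrnet_forward(image, backbone, attention, fuse, decode):
    # (1) multi-resolution feature extraction, (2) per-branch attention,
    # (3) stepwise upsampling fusion, (4) decoding back to input resolution
    feats = backbone(image)
    feats = [attention(f) for f in feats]
    fused = fuse(feats)
    return decode(fused, image.shape[1:])

# toy stand-ins, just to show how the stages connect
backbone = lambda im: [np.ones((8, im.shape[1] // s, im.shape[2] // s))
                       for s in (4, 8, 16, 32)]      # four HRNet-style branches
attention = lambda f: f * 0.5                        # placeholder reweighting
fuse = lambda feats: feats[0]                        # placeholder fusion
decode = lambda f, hw: np.zeros(hw)                  # placeholder FCN head + resize

mask = cam_hrnet_forward(np.ones((3, 64, 64)), backbone, attention, fuse, decode)
```

The point of the sketch is only that each stage consumes the previous stage's output and the decoder restores the input spatial size.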
2.1.1. Convolutional Attention Mechanisms
The extraction of woodland features has always been a challenging task in remote sensing image processing. Because of the complexity of woodland and its surroundings, its distinctive features are often hidden in a large amount of background information, making it difficult for traditional image processing techniques to effectively separate and highlight them. Introducing a convolutional attention mechanism enhances the model's attention to woodland features and improves its ability to distinguish woodland from non-woodland in complex scenes.
The convolutional attention mechanism has shown strong application potential in computer vision owing to its simple, flexible, plug-and-play characteristics [43]. Its core idea lies in dynamically adjusting and optimizing the model's weight allocation, enabling the model to focus on the key information in the input data and enhancing its ability to recognize and process important features. Therefore, to further enhance performance in the forest land extraction task, this paper integrated a convolutional attention mechanism into the HRNet feature extraction stage. CBAM (Convolutional Block Attention Module), a simple and effective feed-forward convolutional neural network attention module, improves model performance by strengthening attention to the important information in the input features; its structure is shown in Figure 2.
CBAM [44] mainly consists of the Channel Attention Module (CAM) and the Spatial Attention Module (SAM), which infer attention maps for the input features in the channel and spatial dimensions, respectively. The two modules work together, multiplying the attention maps with the input feature maps to adaptively adjust the channel and spatial pixel weights and thus refine the features more efficiently.
The original CBAM channel attention module is shown in Figure 3a. It uses a weight-sharing Multilayer Perceptron (MLP) with a hidden layer to integrate the average-pooled and max-pooled features of an image. By analyzing the inter-channel relationships of the feature map, the module generates for each channel a weight that reflects its importance in the overall feature representation. This weight is multiplied with the corresponding channel to adaptively adjust the features of different channels, strengthening important channel features and suppressing secondary ones.
However, Wang et al. [45] pointed out that this operation of dimensionality reduction followed by dimensionality restoration destroys the direct correspondence between channel attention weights and channels, which easily leads to information loss. In addition, although the fully connected layers of the MLP integrate information by connecting the neurons of adjacent layers, each output node is connected to all nodes of the previous layer; this dense connectivity introduces a large number of parameters and causes parameter redundancy [46]. Therefore, to reduce the number of parameters while maintaining dimensional stability, this paper introduced a one-dimensional convolution to improve the channel attention module of CBAM, as shown in Figure 3b. One-dimensional convolution performs local modeling along the channel dimension, maintaining consistency in that dimension without dimensionality reduction and effectively preserving discriminative information. Meanwhile, its sparse connectivity significantly reduces the number of parameters, enhancing the accuracy and generalization of the attention weights while improving model efficiency. Because convolution inherently shares parameters, the model size is further compressed and training stability is enhanced. Additionally, the local receptive field allows the model to focus on correlations between adjacent channels, which better matches the actual feature distribution and yields attention responses that are more physically meaningful and semantically consistent. Figure 3a shows the original channel attention structure of CBAM for comparison, whereas the proposed CAM-HRNet actually adopts the improved channel attention module shown in Figure 3b.
The specific steps of the improved channel attention module are as follows. (1) First, the input feature map is compressed on the spatial scale using average pooling and maximum pooling, respectively; the number of channels remains unchanged, and the spatial size becomes 1 × 1. (2) Each pooled feature then passes through a one-dimensional convolution with an adaptive kernel whose size depends on the number of channels, generating a compact feature representation that captures the interrelationships between channels. (3) The convolution results of the two pooling paths are summed element-wise to merge the feature representations captured by average and maximum pooling. (4) Finally, the merged features are processed through an activation function to introduce nonlinearity and generate the final channel attention map. In the improved channel attention module, the relationship between the channel attention map F_C and the input feature map F can be expressed as

F_C = σ(Conv1D_k(AvgPool(F)) + Conv1D_k(MaxPool(F))),

where σ denotes the Sigmoid activation function, Conv1D_k denotes a one-dimensional convolution with kernel size k, and AvgPool and MaxPool denote the spatial average pooling and maximum pooling operations, respectively. The relationship between the convolution kernel size k and the number of channels C is

k = ODD(log2(C)/a + b/a),

where ODD denotes upward rounding to the nearest odd number, and a and b, which describe the functional relationship between k and C, take the values 2 and 1, respectively.
After multiplying the channel attention map F_C with the input feature map F, the intermediate feature map F′ produced by the channel attention mechanism is obtained. It is then passed to the spatial attention module to further enhance the model's perception in the spatial dimension. The structure of the spatial attention module is shown in Figure 4, and the specific steps are as follows. (1) First, average pooling and maximum pooling are performed on the intermediate feature map along the channel dimension; the spatial size remains the same after pooling, and the number of channels becomes 1. (2) The two pooled feature maps are concatenated along the channel dimension to form a composite feature representation. (3) The concatenated feature map is fed into a 2D convolutional layer to learn the spatial correlations between features and generate a new feature map. (4) Finally, the convolution result is processed through an activation function to introduce nonlinearity and generate the final spatial attention map.
In the spatial attention module, the relationship between the spatial attention map F_S and the intermediate feature map F′ can be expressed as

F_S = σ(f7×7([AvgPool(F′); MaxPool(F′)])),

where σ denotes the Sigmoid activation function, f7×7 denotes a two-dimensional convolution with kernel size 7, and AvgPool and MaxPool denote the average pooling and maximum pooling operations along the channel dimension, respectively.
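A minimal NumPy sketch of this spatial attention step is given below; the uniform 7 × 7 kernel is a hypothetical placeholder for learned weights, and a naive loop implements the "same"-padded 2D convolution.

```python
import numpy as np

def spatial_attention(Fp, w=None):
    # Fp: (C, H, W) intermediate feature map F' after channel attention
    C, H, W = Fp.shape
    desc = np.stack([Fp.mean(axis=0), Fp.max(axis=0)])   # channel pools -> (2, H, W)
    if w is None:
        w = np.full((2, 7, 7), 1.0 / 98.0)               # hypothetical 7x7 kernel
    dp = np.pad(desc, ((0, 0), (3, 3), (3, 3)))          # 'same' padding for 7x7
    s = np.zeros((H, W))
    for i in range(7):                                   # naive 2D convolution
        for j in range(7):
            s += (w[:, i, j, None, None] * dp[:, i:i + H, j:j + W]).sum(axis=0)
    att = 1.0 / (1.0 + np.exp(-s))                       # Sigmoid -> (H, W) map
    return Fp * att[None]                                # reweight each position
```

The resulting (H, W) attention map multiplies every channel of F′, which is exactly the broadcasting in the last line.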
Finally, the spatial attention map F_S is multiplied with the intermediate feature map F′ to obtain the new feature F″, optimized by both the channel and spatial attention mechanisms. To further enhance the ability of HRNet to extract woodland features and ensure that the network effectively captures target information at different scales, this paper added convolutional attention modules to all four feature maps of different resolutions output by HRNet. These modules make the network focus more on the target features, providing more accurate and effective feature representations at both detailed and global scales.
2.1.2. Stage-by-Stage Upsampling Mechanisms
After HRNet outputs the four feature layers of different resolutions, an effective feature fusion operation is required before the subsequent woodland segmentation and extraction. Feature fusion organically combines feature layers of different resolutions, integrating information across scales; this enriches the feature representation, improves the learnability of the network, and improves the accuracy and stability of forest land extraction.
In the HRNetV2 [47] network, feature fusion is performed as shown in Figure 5a. Bilinear interpolation is used to upsample all lower-resolution feature layers to the size of the highest-resolution layer while keeping the number of channels unchanged, and the adjusted feature layers are then concatenated along the channel dimension. Although this fusion method is simple and direct, it has drawbacks: it does not optimize the computational allocation between high- and low-resolution branches, so the low-resolution branches, which carry the strongest semantic representations, are underexploited when output directly and would need to pass through more layers to merge effectively with the high-resolution features. To integrate feature information at each scale more comprehensively, this paper constructed a feature fusion method with a stage-by-stage upsampling mechanism, as shown in Figure 5b. The proposed mechanism starts from the lowest resolution, upsamples step by step, and fuses with the adjacent higher-resolution features, so that strong semantic information is transmitted layer by layer and fully interacts with local details. Bicubic interpolation is adopted to improve the upsampling quality and enhance boundary expression, thereby achieving more accurate and coherent multi-scale feature fusion. The method processes the feature layers in order of resolution from low to high, upsampling each low-resolution layer by 2× bicubic interpolation followed by a 1 × 1 convolution, and then concatenating and fusing it with the neighboring higher-resolution layer in turn.
Bilinear interpolation is computed as a linear weighting of only the four nearest neighboring pixels around the target pixel. It is simple and fast but may introduce some blurring or jagged effects. In contrast, bicubic interpolation takes into account a larger 4 × 4 neighborhood of 16 surrounding pixels and interpolates with cubic polynomials, so it better preserves image details and textures and produces smoother, more natural results when images are scaled. Bicubic interpolation usually performs better in image processing tasks, especially in accuracy-critical applications such as image semantic segmentation and super-resolution reconstruction: it reduces information loss during interpolation and enhances the edge and texture information of the image, improving overall quality. In addition, combining it with a 1 × 1 convolution adds nonlinear features alongside the interpolation, which helps capture complex structures and details and further enhances the information representation of the model. This smoother multi-scale feature fusion therefore improves the quality of the feature maps and produces clearer feature map edges, which benefits the accuracy of the subsequent segmentation.
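The stepwise fusion order (lowest resolution first, 2× upsampling plus a 1 × 1 convolution, then concatenation with the adjacent branch) can be sketched as follows. Nearest-neighbour upsampling stands in for the paper's bicubic interpolation, and the channel counts and random weights are illustrative, not the HRNet-W40 values.

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample2x(x):
    # nearest-neighbour stand-in for 2x bicubic interpolation
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, w):
    # pointwise (1 x 1) convolution: w has shape (C_out, C_in)
    return np.tensordot(w, x, axes=([1], [0]))

def stepwise_fuse(feats, ws):
    # feats: branch outputs ordered lowest -> highest resolution
    # ws: one channel-mixing matrix per fusion step
    x = feats[0]
    for f, w in zip(feats[1:], ws):
        x = conv1x1(upsample2x(x), w)          # 2x upsample + 1x1 conv
        x = np.concatenate([x, f], axis=0)     # concat with the next branch
    return x

# four branches with illustrative channel counts and spatial sizes
feats = [rng.standard_normal((c, s, s))
         for c, s in [(32, 8), (16, 16), (8, 32), (4, 64)]]
ws = [rng.standard_normal((16, 32)),
      rng.standard_normal((8, 32)),
      rng.standard_normal((4, 16))]
fused = stepwise_fuse(feats, ws)
```

After three fusion steps the output sits at the highest branch resolution, with semantic information from the lowest branch propagated through every intermediate scale.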
2.2. Binary Change Detection Model
Building on the excellent feature extraction performance of CAM-HRNet, remote sensing images were deeply analyzed and critical, useful feature information was extracted. To further improve the accuracy and practicality of change detection, this paper combined CAM-HRNet with the unique advantages of the twin network structure in matching and comparison, constructing a binary change detection model specialized for forest land change detection in remote sensing imagery; its structure is shown in Figure 6. The model is designed to accurately capture forest land changes between images from different time points while ignoring changes in other, irrelevant feature types, improving its specificity and practicability.
Figure 6. Structure of binary change detection model. The internal structure of the “difference discrimination network” module is shown in Figure 7.
Figure 7. Structure of difference discrimination network. Corresponding to the “difference discrimination network” module in Figure 6.
The main process of the network model is as follows. (1) First, in the feature extraction stage, the model exploits the powerful feature extraction capability of CAM-HRNet: the earlier and later remote sensing images are fed into the two branches of the CAM-HRNet twin network, which share the same structure and weights, and each branch extracts the key forest land features of its image in parallel. (2) The model adopts difference feature extraction: the features of the two images from different time points are subtracted and the absolute value is taken, yielding a difference feature map that contains the forest land change information and effectively highlights the differences between the images. (3) The difference feature map is then sent to the difference discrimination network for pixel-by-pixel classification. This network is a specially designed classifier that determines, for each pixel of the difference feature map, whether it belongs to a woodland change area. (4) Finally, the model presents the classification results in binarized form, distinguishing forest land change areas from unchanged areas and other changes.
The structure of the difference discrimination network (DDN) is shown in Figure 7. It mainly consists of two 3 × 3 convolutional layers, a batch normalization (BN) layer, and a ReLU activation function; in the figure, C, H, and W denote channel, height, and width, respectively. The first convolutional layer further extracts the difference features of the bi-temporal remote sensing images, while the second abstracts and integrates these features for the final classification. Batch normalization helps stabilize the training process and reduces the model's sensitivity to weight initialization and the learning rate, improving generalization. Finally, the ReLU activation function introduces nonlinearity, allowing the network to learn and fit complex patterns and data distributions and thereby improving classification accuracy.
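A compact NumPy sketch of this conv → BN → ReLU → conv → per-pixel classification pipeline is shown below; all weights are randomly initialized hypothetical stand-ins for trained parameters.

```python
import numpy as np

def conv3x3(x, w, b):
    # x: (C_in, H, W); w: (C_out, C_in, 3, 3); b: (C_out,); 'same' padding
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W)) + b[:, None, None]
    for i in range(3):
        for j in range(3):
            out += np.tensordot(w[:, :, i, j], xp[:, i:i + H, j:j + W],
                                axes=([1], [0]))
    return out

def batch_norm(x, eps=1e-5):
    # per-channel normalization over the spatial dimensions
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def difference_discrimination(f1, f2, w1, b1, w2, b2):
    diff = np.abs(f1 - f2)                                   # difference feature map
    h = np.maximum(batch_norm(conv3x3(diff, w1, b1)), 0.0)   # conv -> BN -> ReLU
    logits = conv3x3(h, w2, b2)                              # 2-class logits
    return logits.argmax(axis=0)                             # (H, W) binary map
```

The argmax over the two output channels corresponds to the pixel-by-pixel change/no-change decision described above.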
2.3. Semantic Change Detection Model
On the basis of the binary change detection model, this paper further extends the network to meet the broader demands and greater depth of the change detection task. By adding semantic segmentation branches to the two sub-networks of the binary change detection model, a new semantic change detection model was constructed; its structure is shown in
Figure 8.
Unlike the binary change detection model, the semantic change detection model adds an FCNhead decoder with shared weights to both CAM-HRNet twin network branches for the semantic segmentation of the dual-temporal images, finely categorizing the pixels of each image into their corresponding feature classes. In the semantic change detection model, the change information extracted by the difference discriminant network is more comprehensive, covering forest land and all other feature types that have changed. After extracting the change information, the model uses it as a mask to filter the semantic segmentation results of the dual-temporal images, and finally outputs semantic information maps of the changed areas in the before and after images.
In summary, the semantic change detection model inherits the advantages of the binary change detection model and further expands the functions and application scope. Through the addition of semantic segmentation branches, the model is able to detect all feature type changes including woodland, and conduct in-depth semantic analysis of the nature of the changes. For example, it recognizes newly added woodlands, disappearing woodlands, and transformations between woodlands and other feature types, which gives the model a wider application value.
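The mask-filtering step described above can be illustrated with a minimal NumPy sketch; the label maps, class indices, and change mask below are hypothetical:

```python
import numpy as np

# Hypothetical per-pixel class maps from the two FCNhead branches (T1 and T2).
seg_t1 = np.array([[1, 1, 2],
                   [3, 3, 2],
                   [0, 0, 2]])
seg_t2 = np.array([[1, 4, 4],
                   [3, 4, 2],
                   [0, 0, 2]])
# Hypothetical binary change mask from the difference discriminant network.
change_mask = np.array([[0, 1, 1],
                        [0, 1, 0],
                        [0, 0, 0]], dtype=bool)

# Keep semantic labels only inside changed areas; 0 marks "unchanged / no class".
sem_change_t1 = np.where(change_mask, seg_t1, 0)
sem_change_t2 = np.where(change_mask, seg_t2, 0)
```

Comparing `sem_change_t1` and `sem_change_t2` at the same pixel then reveals the nature of a change (for example, the class-3 pixel becoming class 4 would mark a transformation between those two feature types).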
4. Results and Discussion
4.1. Analysis of Comparative Experimental Results of Binary Change Detection Models
The experiment was based on a BCD dataset specifically for woodland, and the model constructed in this paper was compared with several commonly used binary change detection methods, including FC-Siam-conc [
32], SNUNet, STANet, and BIT [
Among them, FC-Siam-conc is a change detection model based on a fully convolutional twin network; it achieves feature fusion and complementarity through skip connections that concatenate the feature maps of the two encoder branches with the corresponding decoder layers. SNUNet reduces the loss of position information during deep network training by densely connecting the encoder and the decoder; it also introduces an ensemble channel attention mechanism for deep supervision, refining representative features at different semantic levels and using them for the final classification. STANet uses a spatiotemporal attention mechanism to compute attention weights between any two pixels at different times and positions, and analyzes the features of image sub-regions obtained by multi-scale partitioning, which enhances the model's detection performance for objects of different sizes and shapes. BIT is a lightweight, efficient model built on the strong generalization ability of transformers, offering high accuracy and large-area object detection at low computational cost. All network models were trained and tested in the same experimental environment. The qualitative comparison of the test results is shown in
Figure 10.
The figure contains five groups of contrast images, in which the woodland change area gradually expands from left to right. From the detection results of the first group, it can be seen that the FC-Siam-conc and STANet methods produced some scattered false detections, while the results of the SNUNet and BIT methods did not adhere closely to the edges. In the second group, the woodland change area in the upper left corner was relatively slender; the detection results of the FC-Siam-conc and BIT methods were fragmented and failed to form a coherent change area. Similarly, for the relatively slender change area in the upper half of the third group, the other methods clearly suffered from serious missed detections and failed to capture these subtle changes. In the fourth group, after the lower right corner changed from woodland to open space, the darker color of the area led to incomplete extraction results from the FC-Siam-conc, SNUNet, and STANet methods due to missed detections, while the BIT result was also incomplete because of a hole in the upper right corner. Finally, the fifth group mainly showed the change from woodland to buildings; because the woodland in this area was originally sparse, the change areas extracted by the other methods all contained holes to varying degrees. In contrast, our method performed better in all groups, and its extraction results were more complete, reflecting the woodland changes more accurately.
In addition, given that the change detection task is similar to the semantic segmentation task in terms of results and labels, in order to analyze and compare the effects of different methods in more depth, this paper used the same evaluation indicators as the semantic segmentation task for evaluation. The quantitative comparison results of different methods are shown in
Table 1. A detailed comparison of these evaluation indicators shows that the proposed method achieved the best performance in woodland change detection on remote sensing images. Compared with other commonly used binary change detection methods, the overall accuracy improved by 0.24 to 2.27 percentage points, the IoU by 0.83 to 7.85 percentage points, and the F1 score by 0.55 to 5.36 percentage points. In addition, our method achieved good results in both the values and the balance of precision and recall, demonstrating its significant advantages in the remote sensing image woodland change detection task.
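Since the paper evaluates with semantic-segmentation-style indicators, a minimal sketch of how OA, precision, recall, F1, and IoU follow from the binary confusion matrix may help (the function and variable names are ours, not the paper's):

```python
import numpy as np

def binary_cd_metrics(pred: np.ndarray, label: np.ndarray) -> dict:
    """Compute OA, precision, recall, F1, and IoU for a binary change map,
    where 1 = changed and 0 = unchanged."""
    tp = np.sum((pred == 1) & (label == 1))  # changed pixels found
    fp = np.sum((pred == 1) & (label == 0))  # false alarms
    fn = np.sum((pred == 0) & (label == 1))  # missed changes
    tn = np.sum((pred == 0) & (label == 0))  # correct non-change
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "OA": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
        "IoU": tp / (tp + fp + fn),
    }

# Tiny worked example (flattened pixel vectors).
pred  = np.array([1, 1, 0, 0, 1, 0])
label = np.array([1, 0, 0, 0, 1, 1])
m = binary_cd_metrics(pred, label)
# tp=2, fp=1, fn=1, tn=2 -> precision = recall = F1 = 2/3, IoU = 0.5, OA = 2/3
```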
These improvements are consistent with previous Siamese and attention-based change detection networks, such as FC-Siam-conc, SNUNet-CD, STANet and BIT [
36,
37,
40,
41], where multi-scale feature fusion and attention mechanisms have also been shown to effectively enhance change-detection accuracy on high-resolution remote sensing images.
4.2. Analysis of Anti-Interference Ability of Binary Change Detection Model
In remote sensing images acquired at different time points, diverse changes of ground object types and pseudo-change phenomena caused by various interference factors are common. These changes are not only numerous but also often intertwined, which increases the difficulty of extracting the change information of a specific ground object from remote sensing images. Since the BCD method constructed in this paper aims to filter out irrelevant changes from these complex environments and extract only the change information related to woodland, this paper conducted a detailed analysis of the woodland change detection results on dual-temporal images containing different interfering ground object changes, as shown in
Figure 11.
The figure contains five groups of dual-temporal images, each showing different land use changes. The first two groups mainly show the transformation between woodland and cultivated land: the first group is the change from woodland to cultivated land, and the second is the change from cultivated land to woodland. The third and fourth groups show the transformation between woodland and vegetation: the third group is the change from vegetation to woodland, and the fourth is the change from woodland to vegetation. In the fifth group, the upper part is mainly the change from woodland to bare land, and the lower part is mainly the change from vegetation to bare land. The detection results show that, after processing by the method in this paper, the woodland change areas were identified more accurately, and the changes of other non-target objects were effectively excluded and suppressed. This fully demonstrates the excellent performance of our method in extracting woodland change information. Even when faced with interference from cultivated land, vegetation, and other objects with characteristics similar to woodland, our method maintains high accuracy and reliability, further proving its practical application value.
4.3. Analysis of Experimental Results of Semantic Change Detection Model
As an extension of the BCD model, the SCD model's detection results are visualized in
Figure 12. As can be seen from the figure, for the first and second groups of images with relatively simple labels, the model detected both the change areas and the semantic information well; for the third and fourth groups with more complex labels, the model still achieved good results overall, but some detailed semantic information was missed. It is worth noting that in the right half of the fifth group of T2 images, the model mistakenly classified some withered low vegetation as bare land. The reasons are multi-faceted. On the one hand, the spectral and texture features of vegetation-related ground objects can change significantly under different seasonal or climatic conditions. For instance, in drought or winter, low-growing vegetation may appear withered, sparse, or even nearly exposed at the surface, so its remote sensing image features become highly similar to those of bare land, causing semantic ambiguity. On the other hand, samples of such transitional states or edge scenes may be relatively scarce in the training data, so the model fails to fully learn the subtle discriminative features between withered vegetation and true bare land, leading to confusion at the inference stage. Under the combined effect of these factors, the model tends to assign regions with strong spectral ambiguity and vague labeling boundaries to more "typical" categories (such as bare land), reflecting the limitations of current methods in their sensitivity to environmental changes and their dependence on the data distribution. This also suggests that future work should enhance sample diversity at the data level (for example, by introducing multi-temporal and multi-seasonal samples) and incorporate more robust ground object discrimination priors into the model design.
Similar observations have also been reported in recent semantic change detection studies, where jointly estimating change masks and pre-/post-change land-cover categories under class imbalance and complex backgrounds is considered substantially more challenging than binary change detection [
38,
39].
In order to more accurately quantify the performance of the semantic change detection model, the evaluation indicators for the detection results of various types of objects in the change area are detailed in
Table 2. In addition, in order to verify the effectiveness of the model in detecting overall changes, the experiment also calculated the evaluation indicators for the detection results of the change area. The specific results are shown in
Table 3.
From the index data in
Table 3, it can be seen that when the model was trained for both change detection and semantic segmentation, it still achieved a good overall change detection performance, with slightly higher IoU, F1 score, and OA compared to the woodland change detection results in
Table 1. However, according to the data in
Table 2, the detection of semantic information within the changed areas still needs further improvement. The reason may be that the changed areas with specific semantic categories account for only a small part of each group of images, while the unchanged areas without specific semantic categories account for the majority, resulting in data imbalance. There were significant differences in the IoU values of the different feature categories in the semantic change detection results (for example, 73.86% for buildings vs. 47.18% for low vegetation), reflecting that the model's recognition ability is not balanced across ground features. Artificial or structured features such as buildings (73.86%) and water bodies (65.43%) performed well, while the detection accuracy of low vegetation (47.18%) and bare land (56.63%) was significantly lower. This phenomenon mainly stems from the complexity of natural features themselves: when low-growing vegetation withers, becomes sparse, or is affected by light and shadow, its spectral and texture characteristics are easily confused with those of bare land. Meanwhile, such features usually lack clear boundaries and are spatially diffuse, making pixel-level classification in edge areas highly sensitive. Furthermore, samples of such transitional states may be insufficient in the training data, and manual annotation is itself subjective and ambiguous in areas with low vegetation coverage, further exacerbating the difficulty of model learning. In contrast, artificial features such as buildings have regular geometric shapes, high contrast, and stable spectral responses, making them easier for the network to capture accurately.
4.4. Analysis of Loss Function Results
It can be seen from the loss results in
Table 4 that the focal loss of the binary change detection model converged stably to a relatively low level (below 0.2) on both the training set and the validation set, and the validation loss was only slightly higher than the training loss, indicating that the model did not overfit and had good generalization ability. Meanwhile, the focal loss design effectively alleviated the severe class imbalance between the forest land change areas (positive samples) and the unchanged areas (negative samples). For the semantic change detection model, all of its loss components, namely the change detection focal loss (
LF), the pre-image semantic cross-entropy loss (
LMCE1), and the post-image semantic cross-entropy loss (
LMCE2), showed a stable downward trend. Among them, the convergence value of
LF was significantly lower than those of
LMCE1 and
LMCE2. This is consistent with the design that assigns a higher weight (×4) to
LF in the total loss, ensuring that the model prioritizes optimizing the detection accuracy of the changed regions during multi-task joint training. Furthermore, the total validation loss of this model was only slightly higher than the training loss, further indicating that the training process was stable and that no severe overfitting occurred. This shows that the proposed loss function design can effectively drive network learning and achieve robust, accurate forest land change detection in complex high-resolution remote sensing scenes.
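Taken together, the weighted total loss can be sketched as follows. This uses a common focal-loss formulation with an assumed γ = 2; the exact form and hyperparameters used in the paper may differ:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, target: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Common focal-loss formulation for the change branch (gamma is assumed)."""
    ce = F.cross_entropy(logits, target, reduction="none")  # per-pixel cross-entropy
    pt = torch.exp(-ce)                                     # probability of the true class
    return ((1 - pt) ** gamma * ce).mean()                  # down-weight easy pixels

def total_loss(cd_logits, cd_target, seg1_logits, seg1_target, seg2_logits, seg2_target):
    """L_total = 4 * L_F + L_MCE1 + L_MCE2, mirroring the x4 weight on L_F."""
    l_f = focal_loss(cd_logits, cd_target)              # change detection focal loss
    l_mce1 = F.cross_entropy(seg1_logits, seg1_target)  # pre-image semantic CE loss
    l_mce2 = F.cross_entropy(seg2_logits, seg2_target)  # post-image semantic CE loss
    return 4 * l_f + l_mce1 + l_mce2

# Toy tensors: 2 change classes, 6 hypothetical semantic classes, 8x8 patches.
torch.manual_seed(0)
cd_logits = torch.randn(2, 2, 8, 8)
cd_target = torch.randint(0, 2, (2, 8, 8))
seg_logits = torch.randn(2, 6, 8, 8)
seg_target = torch.randint(0, 6, (2, 8, 8))
loss = total_loss(cd_logits, cd_target, seg_logits, seg_target, seg_logits, seg_target)
```

The ×4 factor makes the change branch dominate the gradient early in training, which matches the observation above that LF converges to a lower value than the two semantic terms.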
4.5. Analysis of Ablation Experiment Results
Based on the HRNet backbone network, this paper added a convolutional attention module and constructed a step-by-step upsampling mechanism to optimize it. In order to verify the effectiveness of these improved modules, this paper set up four groups of ablation comparison experiments in the same experimental environment:
G1: Uses only the original HRNetV2 backbone, without the progressive upsampling structure or any attention modules, serving as the baseline model;
G2: Based on G1, replaces only the multi-scale feature fusion component with the progressive upsampling structure while keeping the rest of the network architecture unchanged to evaluate the contribution of the progressive upsampling mechanism;
G3: Based on G2, incorporates the original CBAM attention module into each scale branch to evaluate the performance difference between the standard CBAM and the configuration utilizing only progressive upsampling;
G4: Based on G2, incorporates the improved 1D convolutional channel attention and spatial attention module proposed in this paper into each scale branch (i.e., the complete CAM-HRNet) to evaluate the advantages of the improved convolutional attention module over the original CBAM.
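The 1D convolutional channel attention used in G4 can be sketched in the spirit of efficient channel attention; this shows only the channel branch, and the kernel size and pooling choice below are assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class Conv1DChannelAttention(nn.Module):
    """Sketch of a 1D-convolutional channel attention module: a global
    descriptor per channel, a 1D convolution across channels instead of
    a fully connected bottleneck, and a sigmoid reweighting."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) -> per-channel descriptor via global average pooling
        y = x.mean(dim=(2, 3))                           # (N, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)         # 1D conv across channels
        w = self.sigmoid(y).unsqueeze(-1).unsqueeze(-1)  # (N, C, 1, 1) weights
        return x * w                                     # reweight channels

att = Conv1DChannelAttention()
x = torch.randn(2, 16, 8, 8)
out = att(x)  # same shape as x, channels rescaled by learned weights
```

Compared with CBAM's fully connected channel bottleneck, the 1D convolution captures local cross-channel interaction with far fewer parameters, which is the kind of saving the G4 configuration is designed to test.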
The experimental results are shown in
Table 5.
From the data in the table, it can be seen that the fourth group of experiments (the complete CAM-HRNet) improved the overall accuracy by 0.21 to 0.62 percentage points, the IoU by 0.89 to 2.50 percentage points, and the F1 score by 0.56 to 1.55 percentage points compared with the other experimental groups. These gains in the quantitative indicators reflect the positive impact of the improved modules of the proposed CAM-HRNet method on the final results of remote sensing image woodland extraction and verify their effectiveness.