1. Introduction
With global environmental change and the intensification of human activities, forest land, one of the most important ecosystems on Earth, faces increasingly serious threats and damage. Remote sensing technology has become an important tool for forest land resource management and ecological environment monitoring by virtue of its wide coverage and high spatial and temporal resolution. By applying deep learning to forest land change detection in remote sensing images, the distribution and change trends of forest ecosystems can be understood more accurately and efficiently, which is of far-reaching significance and wide application value for the protection and management of forest land resources.
As the foundation of forest ecosystems, woodlands play a key role in maintaining ecological balance, regulating climate, and protecting biodiversity, and are a basic condition for implementing the Scientific Outlook on Development and guaranteeing the survival and development of human beings [1]. From 2014 to 2018, China carried out its ninth national forest resources inventory, which showed that the national forest area reached 220,446,200 hectares, accounting for 5.51% of the global forest area, with an overall favorable trend of increasing quantity, improving quality, and enhancing ecological functions [2]. However, China remains a country with scarce forests and fragile ecosystems, and the problems of an insufficient total amount of forest resources, low quality, and uneven distribution are prominent. At present, the average forest area per capita in China is only 0.16 hectares, less than one-third of the world average, and the per capita forest stock is 12.35 cubic meters, only one-sixth of the world average, so forestry development still faces serious challenges [3,4,5]. Therefore, how to quickly, efficiently, and economically obtain forest land resource information, establish a dynamic monitoring system, and realize scientific management and efficient utilization has become an important topic that urgently needs to be addressed at the national level, as well as a research focus of forest land management departments [6,7].
The traditional forest land survey relies on manual field collection and sampling-based statistics, which is inefficient and costly and suffers from strong subjectivity, information lag, and poor spatial continuity [8]. The emergence of remote sensing technology provides an effective solution for forest land information acquisition [9]. Remote sensing receives electromagnetic waves reflected or emitted by targets through satellite or airborne sensors to acquire ground surface information at long range, offering a wide imaging range, a short acquisition cycle, and few restrictions from ground conditions [10]. With the development of multispectral, hyperspectral, and unmanned aerial vehicle (UAV) remote sensing, forest land information acquisition is becoming increasingly efficient and intelligent [11,12].
Remote sensing image change detection is the main way to study changes in surface cover or feature types. It compares and analyzes two or more remote sensing images of the same area acquired at different times to identify and extract the areas in which changes have occurred [13,14,15]. This approach enables the effective monitoring of resources such as forest land according to specific types of changes and is an important tool in geoscientific research, urban planning, environmental assessment, and disaster warning, providing a decision-making basis for sustainable development and environmental protection [16,17]. However, in practical forest land change detection, interfering factors such as inconsistencies between sensors, changing climatic conditions, and atmospheric effects often introduce imaging differences between remote sensing images acquired at different times. These differences seriously degrade image quality and interpretability and significantly increase the difficulty and complexity of detection. Coping with them through appropriate preprocessing of the image data and the development of more robust algorithms has therefore become a core challenge in forest land change detection from remote sensing images.
Remote sensing images acquired in different periods differ in imaging environment and viewing angle. To ensure the accuracy and effectiveness of subsequent forest land change analysis, image preprocessing is usually required beforehand [18]. Commonly used preprocessing methods include geometric correction, radiometric correction, and image registration [19]. Geometric correction removes image distortion errors caused by sensor distortion, Earth curvature, atmospheric refraction, and terrain undulation to ensure the precise geospatial alignment of remote sensing images from different time points. Radiometric correction consists of two steps, radiometric calibration and atmospheric correction, which remove variations in image brightness caused by differences in sensor sensitivity, solar elevation angle, and atmospheric conditions so that images from different time points are comparable and can be analyzed on the same radiometric scale. Image registration matches and superimposes images acquired at different times, by different sensors, or under different conditions to correct minor misalignments caused by factors such as the Earth's rotation, sensor motion, and surface changes. Other preprocessing steps, including the removal of noise, clouds, and shadows, are equally important: they effectively improve the data quality of remote sensing images, provide a more reliable data basis for subsequent change detection algorithms, and ensure the accuracy and reliability of the change detection results.
In the field of remote sensing image change detection, extensive research has been conducted by scholars worldwide, leading to the establishment of a relatively comprehensive system of theories and methods [19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35]. From the perspective of technical evolution, early approaches mainly relied on simple algebraic operations such as image differencing and ratioing, combined with thresholding, to achieve pixel-level change extraction. These were later extended to feature-level change detection methods that fuse multiple types of features, as well as object-based approaches that take image objects or patches, rather than individual pixels, as the basic analysis units [22,23,24,25,30,31,32,33,34,35]. These traditional methods have, to some extent, improved the accuracy of change detection and provided important technical support for land use/land cover change and forest/woodland change monitoring. However, when dealing with high-resolution, multi-source, and multi-temporal remote sensing images, they still suffer from limited feature representation capability and insufficient robustness in complex scenes [12,13,14,15,16,17,18].
Currently, remote sensing image change detection methods based on deep learning twin (Siamese) networks can be subdivided into two types: binary change detection and semantic change detection. Binary change detection aims to recognize which pixels in a remote sensing image correspond to a change, without distinguishing the specific type of change. It is usually applied to monitor significant changes caused by natural disasters such as landslides and forest fires. Binary change detection can also be specialized for specific elements, directly detecting changes in features such as forest land, buildings, and roads. Fang et al. [36] proposed a densely connected twin network for change detection, which maintains a high-resolution representation through skip connections between the encoder and decoder to reduce the loss of deep localization information. Chen et al. [37] proposed a new twin-network-based spatio-temporal attention neural network, designing a self-attention mechanism to model spatio-temporal relations for building change detection. Semantic change detection is more complex in that it must identify which pixels have changed and also the type of change that has occurred at each location. Earlier semantic change detection was mainly based on post-classification comparison. However, such methods rely excessively on classification accuracy when localizing the change region, leading to error accumulation and inaccurate boundaries. Daudt et al. [38] proposed integrating semantic segmentation and binary change detection into a multi-task learning architecture in which the semantic change detection task can be performed efficiently by a single network model, whose internal structure also considers the interconnection between the two subtasks. Yang et al. [39] proposed an asymmetric twin network motivated by the differing proportions of land cover change in multi-temporal images; semantic changes were localized and identified by features obtained from modules with different structures and numbers of parameters, accounting for differences in land cover distribution, including forest land, across time periods.
Building on the above studies, a series of deep learning–based change detection networks for high-resolution remote sensing images have been proposed and have achieved promising results on public datasets. Daudt et al. proposed FC-Siam-conc and related Siamese fully convolutional networks, which perform end-to-end binary change detection by concatenating or differencing features extracted from bi-temporal image pairs and represent one of the early representative Siamese architectures in the change detection field [40]. Fang et al. proposed SNUNet-CD, which enhances multi-scale feature fusion through dense connections and a U-Net encoder–decoder structure, showing advantages in recognizing fine-scale changes in detailed regions [36]. Chen et al. proposed STANet, which introduces a spatial–temporal attention mechanism to highlight relevant change regions between bi-temporal images, making it suitable for complex backgrounds and long time intervals [37]. On this basis, Chen et al. further proposed the Transformer-based BIT method to better model long-range dependencies, demonstrating strong capability in representing changes under complex land-cover distributions and over large spatial extents [41]. Overall, these methods provide important technical support for high-resolution remote sensing change detection, but there is still room for improvement in boundary delineation for fragmented objects such as woodland, suppression of pseudo-changes, and model efficiency.
2. Methods
The uneven distribution of forest land, the diversity of tree species, and the complexity of imaging conditions make it challenging to automatically and accurately recognize forest land in remote sensing images. In view of the incomplete recognition of woodland boundary regions and the low segmentation accuracy for small woodland patches exhibited by ordinary convolutional neural networks, we conducted an in-depth study of woodland extraction algorithms for high-resolution remote sensing images and propose a High-Resolution Net with Attention Mechanism (CAM-HRNet). The method is based on the HRNet [42] backbone and exploits its parallel multi-resolution coding structure, which improves forest land segmentation accuracy while preserving detail information. By introducing a convolutional attention mechanism, the model pays more attention to the important regions of the image. Meanwhile, a stage-by-stage upsampling mechanism helps the model fuse multi-scale features more smoothly, further improving the accuracy of woodland area and boundary recognition. By combining CAM-HRNet with a twin network structure, we constructed a binary change detection model and a semantic change detection model for forest land change detection in remote sensing images.
The overall framework of the proposed woodland change detection method is shown in Figure 1 and consists of three main components: (1) a single-temporal woodland segmentation network based on CAM-HRNet; (2) a Siamese CAM-HRNet architecture for binary woodland change detection; and (3) an extended architecture for semantic change detection. In CAM-HRNet, the convolutional attention structure is a key component and is composed of the CBAM, the improved channel attention module, and the spatial attention module, whose detailed structures are shown in Figure 2, Figure 3 and Figure 4. The multi-scale feature fusion is implemented by a progressive upsampling scheme, as illustrated in Figure 5.
2.1. CAM-HRNet
This paper designed the CAM-HRNet network model to address the complexity and diversity of forest land features in remote sensing images and to extract forest land information more accurately. The model fully exploits the significant advantages of HRNet in processing high-resolution images: its multi-scale parallel structure enables the network to capture forest land features at different scales simultaneously, ensuring efficient extraction of forest land information from complex remote sensing images. The HRNet family includes several variants, such as HRNet-W18, HRNet-W40, and HRNet-W64, which differ in network depth and width to suit different tasks and scene requirements. HRNet-W40 was chosen as the backbone network in this paper to balance model accuracy and computational efficiency.
To further improve performance, this paper introduced a convolutional attention mechanism that strengthens the model's ability to capture and express key features, enabling the network to focus on recognizable woodland feature regions and segment woodland more accurately. This paper also constructed a stage-by-stage upsampling mechanism to help the model fuse multi-scale features more smoothly and reduce the loss of image quality incurred during feature fusion. The structure of the CAM-HRNet model is shown in Figure 1. Specifically, CAM-HRNet integrates an improved convolutional attention module into each resolution branch of HRNet, where the module is derived from the original CBAM by replacing its MLP-based channel attention with the proposed 1D-convolution-based variant while retaining the spatial attention design.
The main process of this network model is as follows. (1) First, the input image is processed by the HRNet backbone, which generates four feature layers of different resolutions covering information from fine details to global context, providing a rich feature representation for subsequent processing. (2) Each feature layer is fed into a convolutional attention module consisting of channel attention (CA) and spatial attention (SA) to suppress irrelevant features and reinforce key features. (3) The enhanced multi-scale features are fused by stage-by-stage upsampling to make the feature fusion more comprehensive and adequate. (4) Finally, the FCN decoder decodes the fused features and restores them to the original image size by interpolation-based upsampling to obtain the final result. The whole process combines the multi-scale feature extraction capability of HRNet with the focusing characteristics of the convolutional attention mechanism to realize accurate extraction and segmentation of forest land information.
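As a rough illustration of the data flow, the four steps above can be sketched as a composition of stages. All stage implementations below (`backbone`, `attention`, `fuse`, `decode`) are toy placeholders standing in for the components described in this section, not the authors' code.

```python
import numpy as np

def cam_hrnet_forward(image, backbone, attention, fuse, decode):
    # (1) multi-resolution feature extraction, (2) per-branch attention,
    # (3) stepwise upsampling fusion, (4) decoding back to input resolution
    feats = backbone(image)
    feats = [attention(f) for f in feats]
    fused = fuse(feats)
    return decode(fused, image.shape[1:])

# toy stand-ins, just to show how the stages connect
backbone = lambda im: [np.ones((8, im.shape[1] // s, im.shape[2] // s))
                       for s in (4, 8, 16, 32)]      # four HRNet-style branches
attention = lambda f: f * 0.5                        # placeholder reweighting
fuse = lambda feats: feats[0]                        # placeholder fusion
decode = lambda f, hw: np.zeros(hw)                  # placeholder FCN head + resize

mask = cam_hrnet_forward(np.ones((3, 64, 64)), backbone, attention, fuse, decode)
```

The point of the sketch is only that each stage consumes the previous stage's output and the decoder restores the input spatial size.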
2.1.1. Convolutional Attention Mechanisms
The extraction of woodland features has always been a challenging task in remote sensing image processing. Because of the complexity of woodland and its surroundings, its distinctive features are often hidden in a large amount of background information, making it difficult for traditional image processing techniques to effectively separate and highlight them. Introducing a convolutional attention mechanism enhances the model's attention to woodland features and improves its ability to distinguish woodland from non-woodland in complex scenes.
The convolutional attention mechanism has shown strong application potential in computer vision owing to its simple, flexible, plug-and-play characteristics [43]. Its core idea lies in dynamically adjusting and optimizing the model's weight allocation, enabling the model to focus on the key information in the input data and enhancing its ability to recognize and process important features. Therefore, to further enhance performance in the forest land extraction task, this paper integrated a convolutional attention mechanism into the HRNet feature extraction stage. CBAM (Convolutional Block Attention Module), a simple and effective feed-forward convolutional neural network attention module, improves model performance by strengthening attention to the important information in the input features; its structure is shown in Figure 2.
CBAM [44] mainly consists of the Channel Attention Module (CAM) and the Spatial Attention Module (SAM), which infer attention maps for the input features in the channel and spatial dimensions, respectively. The two modules work together, multiplying the attention maps with the input feature maps to adaptively adjust the channel and spatial pixel weights and thus refine the features more efficiently.
The original CBAM channel attention module is shown in Figure 3a. It uses a weight-sharing Multilayer Perceptron (MLP) with a hidden layer to integrate the average-pooled and max-pooled features of an image. By analyzing the inter-channel relationships of the feature map, the module generates for each channel a weight that reflects its importance in the overall feature representation. This weight is multiplied with the corresponding channel to adaptively adjust the features of different channels, strengthening important channel features and suppressing secondary ones.
However, Wang et al. [45] pointed out that this operation of dimensionality reduction followed by dimensionality restoration destroys the direct correspondence between channel attention weights and channels, which easily leads to information loss. In addition, although the fully connected layers of the MLP integrate information by connecting the neurons of adjacent layers, each output node is connected to all nodes of the previous layer; this dense connectivity introduces a large number of parameters and causes parameter redundancy [46]. Therefore, to reduce the number of parameters while maintaining dimensional stability, this paper introduced a one-dimensional convolution to improve the channel attention module of CBAM, as shown in Figure 3b. One-dimensional convolution performs local modeling along the channel dimension, maintaining consistency in that dimension without dimensionality reduction and effectively preserving discriminative information. Meanwhile, its sparse connectivity significantly reduces the number of parameters, enhancing the accuracy and generalization of the attention weights while improving model efficiency. Because convolution inherently shares parameters, the model size is further compressed and training stability is enhanced. Additionally, the local receptive field allows the model to focus on correlations between adjacent channels, which better matches the actual feature distribution and yields attention responses that are more physically meaningful and semantically consistent. Figure 3a shows the original channel attention structure of CBAM for comparison, whereas the proposed CAM-HRNet actually adopts the improved channel attention module shown in Figure 3b.
The specific steps of the improved channel attention module are as follows. (1) First, the input feature map is compressed on the spatial scale using average pooling and maximum pooling, respectively; the number of channels remains unchanged, and the spatial size becomes 1 × 1. (2) Each pooled feature then passes through a one-dimensional convolution with an adaptive kernel whose size depends on the number of channels, generating a compact feature representation that captures the interrelationships between channels. (3) The convolution results of the two pooling paths are summed element-wise to merge the feature representations captured by average and maximum pooling. (4) Finally, the merged features are processed through an activation function to introduce nonlinearity and generate the final channel attention map. In the improved channel attention module, the relationship between the channel attention map F_C and the input feature map F can be expressed as

F_C = σ(Conv1D_k(AvgPool(F)) + Conv1D_k(MaxPool(F))),

where σ denotes the Sigmoid activation function, Conv1D_k denotes a one-dimensional convolution with kernel size k, and AvgPool and MaxPool denote the spatial average pooling and maximum pooling operations, respectively. The relationship between the convolution kernel size k and the number of channels C is

k = ODD(log2(C)/a + b/a),

where ODD denotes upward rounding to the nearest odd number, and a and b, which describe the functional relationship between k and C, take the values 2 and 1, respectively.
After multiplying the channel attention map F_C with the input feature map F, the intermediate feature map F′ produced by the channel attention mechanism is obtained. It is then passed to the spatial attention module to further enhance the model's perception in the spatial dimension. The structure of the spatial attention module is shown in Figure 4, and the specific steps are as follows. (1) First, average pooling and maximum pooling are performed on the intermediate feature map along the channel dimension; the spatial size remains the same after pooling, and the number of channels becomes 1. (2) The two pooled feature maps are concatenated along the channel dimension to form a composite feature representation. (3) The concatenated feature map is fed into a 2D convolutional layer to learn the spatial correlations between features and generate a new feature map. (4) Finally, the convolution result is processed through an activation function to introduce nonlinearity and generate the final spatial attention map.
In the spatial attention module, the relationship between the spatial attention map F_S and the intermediate feature map F′ can be expressed as

F_S = σ(f7×7([AvgPool(F′); MaxPool(F′)])),

where σ denotes the Sigmoid activation function, f7×7 denotes a two-dimensional convolution with kernel size 7, and AvgPool and MaxPool denote the average pooling and maximum pooling operations along the channel dimension, respectively.
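A minimal NumPy sketch of this spatial attention step is given below; the uniform 7 × 7 kernel is a hypothetical placeholder for learned weights, and a naive loop implements the "same"-padded 2D convolution.

```python
import numpy as np

def spatial_attention(Fp, w=None):
    # Fp: (C, H, W) intermediate feature map F' after channel attention
    C, H, W = Fp.shape
    desc = np.stack([Fp.mean(axis=0), Fp.max(axis=0)])   # channel pools -> (2, H, W)
    if w is None:
        w = np.full((2, 7, 7), 1.0 / 98.0)               # hypothetical 7x7 kernel
    dp = np.pad(desc, ((0, 0), (3, 3), (3, 3)))          # 'same' padding for 7x7
    s = np.zeros((H, W))
    for i in range(7):                                   # naive 2D convolution
        for j in range(7):
            s += (w[:, i, j, None, None] * dp[:, i:i + H, j:j + W]).sum(axis=0)
    att = 1.0 / (1.0 + np.exp(-s))                       # Sigmoid -> (H, W) map
    return Fp * att[None]                                # reweight each position
```

The resulting (H, W) attention map multiplies every channel of F′, which is exactly the broadcasting in the last line.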
Finally, the spatial attention map F_S is multiplied with the intermediate feature map F′ to obtain the new feature F″, optimized by both the channel and spatial attention mechanisms. To further enhance the ability of HRNet to extract woodland features and ensure that the network effectively captures target information at different scales, this paper added convolutional attention modules to all four feature maps of different resolutions output by HRNet. These modules make the network focus more on the target features, providing more accurate and effective feature representations at both detailed and global scales.
2.1.2. Stage-by-Stage Upsampling Mechanisms
After HRNet outputs the four feature layers of different resolutions, an effective feature fusion operation is required before the subsequent woodland segmentation and extraction. Feature fusion organically combines feature layers of different resolutions, integrating information across scales; this enriches the feature representation, improves the learnability of the network, and improves the accuracy and stability of forest land extraction.
In the HRNetV2 [47] network, feature fusion is performed as shown in Figure 5a. Bilinear interpolation is used to upsample all lower-resolution feature layers to the size of the highest-resolution layer while keeping the number of channels unchanged, and the adjusted feature layers are then concatenated along the channel dimension. Although this fusion method is simple and direct, it has drawbacks: it does not optimize the computational allocation between high- and low-resolution branches, so the low-resolution branches, which carry the strongest semantic representations, are underexploited when output directly and would need to pass through more layers to merge effectively with the high-resolution features. To integrate feature information at each scale more comprehensively, this paper constructed a feature fusion method with a stage-by-stage upsampling mechanism, as shown in Figure 5b. The proposed mechanism starts from the lowest resolution, upsamples step by step, and fuses with the adjacent higher-resolution features, so that strong semantic information is transmitted layer by layer and fully interacts with local details. Bicubic interpolation is adopted to improve the upsampling quality and enhance boundary expression, thereby achieving more accurate and coherent multi-scale feature fusion. The method processes the feature layers in order of resolution from low to high, upsampling each low-resolution layer by 2× bicubic interpolation followed by a 1 × 1 convolution, and then concatenating and fusing it with the neighboring higher-resolution layer in turn.
Bilinear interpolation is computed as a linear weighting of only the four nearest neighboring pixels around the target pixel. It is simple and fast but may introduce some blurring or jagged effects. In contrast, bicubic interpolation takes into account a larger 4 × 4 neighborhood of 16 surrounding pixels and interpolates with cubic polynomials, so it better preserves image details and textures and produces smoother, more natural results when images are scaled. Bicubic interpolation usually performs better in image processing tasks, especially in accuracy-critical applications such as image semantic segmentation and super-resolution reconstruction: it reduces information loss during interpolation and enhances the edge and texture information of the image, improving overall quality. In addition, combining it with a 1 × 1 convolution adds nonlinear features alongside the interpolation, which helps capture complex structures and details and further enhances the information representation of the model. This smoother multi-scale feature fusion therefore improves the quality of the feature maps and produces clearer feature map edges, which benefits the accuracy of the subsequent segmentation.
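The stepwise fusion order (lowest resolution first, 2× upsampling plus a 1 × 1 convolution, then concatenation with the adjacent branch) can be sketched as follows. Nearest-neighbour upsampling stands in for the paper's bicubic interpolation, and the channel counts and random weights are illustrative, not the HRNet-W40 values.

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample2x(x):
    # nearest-neighbour stand-in for 2x bicubic interpolation
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, w):
    # pointwise (1 x 1) convolution: w has shape (C_out, C_in)
    return np.tensordot(w, x, axes=([1], [0]))

def stepwise_fuse(feats, ws):
    # feats: branch outputs ordered lowest -> highest resolution
    # ws: one channel-mixing matrix per fusion step
    x = feats[0]
    for f, w in zip(feats[1:], ws):
        x = conv1x1(upsample2x(x), w)          # 2x upsample + 1x1 conv
        x = np.concatenate([x, f], axis=0)     # concat with the next branch
    return x

# four branches with illustrative channel counts and spatial sizes
feats = [rng.standard_normal((c, s, s))
         for c, s in [(32, 8), (16, 16), (8, 32), (4, 64)]]
ws = [rng.standard_normal((16, 32)),
      rng.standard_normal((8, 32)),
      rng.standard_normal((4, 16))]
fused = stepwise_fuse(feats, ws)
```

After three fusion steps the output sits at the highest branch resolution, with semantic information from the lowest branch propagated through every intermediate scale.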
2.2. Binary Change Detection Model
Building on the excellent feature extraction performance of CAM-HRNet, remote sensing images were deeply analyzed and critical, useful feature information was extracted. To further improve the accuracy and practicality of change detection, this paper combined CAM-HRNet with the unique advantages of the twin network structure in matching and comparison, constructing a binary change detection model specialized for forest land change detection in remote sensing imagery; its structure is shown in Figure 6. The model is designed to accurately capture forest land changes between images from different time points while ignoring changes in other, irrelevant feature types, improving its specificity and practicability.
Figure 6. Structure of binary change detection model. The internal structure of the “difference discrimination network” module is shown in Figure 7.
Figure 7. Structure of difference discrimination network. Corresponding to the “difference discrimination network” module in Figure 6.
The main process of the network model is as follows. (1) First, in the feature extraction stage, the model exploits the powerful feature extraction capability of CAM-HRNet: the earlier and later remote sensing images are fed into the two branches of the CAM-HRNet twin network, which share the same structure and weights, and each branch extracts the key forest land features of its image in parallel. (2) The model adopts difference feature extraction: the features of the two images from different time points are subtracted and the absolute value is taken, yielding a difference feature map that contains the forest land change information and effectively highlights the differences between the images. (3) The difference feature map is then sent to the difference discrimination network for pixel-by-pixel classification. This network is a specially designed classifier that determines, for each pixel of the difference feature map, whether it belongs to a woodland change area. (4) Finally, the model presents the classification results in binarized form, distinguishing forest land change areas from unchanged areas and other changes.
The structure of the difference discrimination network (DDN) is shown in Figure 7. It mainly consists of two 3 × 3 convolutional layers, a batch normalization (BN) layer, and a ReLU activation function; in the figure, C, H, and W denote channel, height, and width, respectively. The first convolutional layer further extracts the difference features of the bi-temporal remote sensing images, while the second abstracts and integrates these features for the final classification. Batch normalization helps stabilize the training process and reduces the model's sensitivity to weight initialization and the learning rate, improving generalization. Finally, the ReLU activation function introduces nonlinearity, allowing the network to learn and fit complex patterns and data distributions and thereby improving classification accuracy.
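A compact NumPy sketch of this conv → BN → ReLU → conv → per-pixel classification pipeline is shown below; all weights are randomly initialized hypothetical stand-ins for trained parameters.

```python
import numpy as np

def conv3x3(x, w, b):
    # x: (C_in, H, W); w: (C_out, C_in, 3, 3); b: (C_out,); 'same' padding
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W)) + b[:, None, None]
    for i in range(3):
        for j in range(3):
            out += np.tensordot(w[:, :, i, j], xp[:, i:i + H, j:j + W],
                                axes=([1], [0]))
    return out

def batch_norm(x, eps=1e-5):
    # per-channel normalization over the spatial dimensions
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def difference_discrimination(f1, f2, w1, b1, w2, b2):
    diff = np.abs(f1 - f2)                                   # difference feature map
    h = np.maximum(batch_norm(conv3x3(diff, w1, b1)), 0.0)   # conv -> BN -> ReLU
    logits = conv3x3(h, w2, b2)                              # 2-class logits
    return logits.argmax(axis=0)                             # (H, W) binary map
```

The argmax over the two output channels corresponds to the pixel-by-pixel change/no-change decision described above.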
2.3. Semantic Change Detection Model
On the basis of the binary change detection model, this paper further extends the network to meet the broader demands and greater depth of the change detection task. By adding semantic segmentation branches to the two sub-networks of the binary change detection model, a new semantic change detection model was constructed; its structure is shown in
Figure 8.
Unlike the binary change detection model, the semantic change detection model adds an FCNhead decoder with shared weights to both CAM-HRNet twin network branches for the semantic segmentation of the dual-temporal images, finely categorizing the pixels of each image into their corresponding feature classes. In the semantic change detection model, the change information extracted by the difference discriminant network is more comprehensive, covering forest land and all other feature types that have changed. After extracting the change information, the model uses it as a mask to filter the semantic segmentation results of the dual-temporal images, and finally outputs semantic information maps of the changed areas in the before and after images.
In summary, the semantic change detection model inherits the advantages of the binary change detection model and further expands the functions and application scope. Through the addition of semantic segmentation branches, the model is able to detect all feature type changes including woodland, and conduct in-depth semantic analysis of the nature of the changes. For example, it recognizes newly added woodlands, disappearing woodlands, and transformations between woodlands and other feature types, which gives the model a wider application value.
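The mask-filtering step described above can be illustrated with a minimal NumPy sketch; the label maps, class indices, and change mask below are hypothetical:

```python
import numpy as np

# Hypothetical per-pixel class maps from the two FCNhead branches (T1 and T2).
seg_t1 = np.array([[1, 1, 2],
                   [3, 3, 2],
                   [0, 0, 2]])
seg_t2 = np.array([[1, 4, 4],
                   [3, 4, 2],
                   [0, 0, 2]])
# Hypothetical binary change mask from the difference discriminant network.
change_mask = np.array([[0, 1, 1],
                        [0, 1, 0],
                        [0, 0, 0]], dtype=bool)

# Keep semantic labels only inside changed areas; 0 marks "unchanged / no class".
sem_change_t1 = np.where(change_mask, seg_t1, 0)
sem_change_t2 = np.where(change_mask, seg_t2, 0)
```

Comparing `sem_change_t1` and `sem_change_t2` at the same pixel then reveals the nature of a change (for example, the class-3 pixel becoming class 4 would mark a transformation between those two feature types).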
4. Results and Discussion
4.1. Analysis of Comparative Experimental Results of Binary Change Detection Models
The experiment was based on a BCD dataset specifically for woodland, and the model constructed in this paper was compared with several commonly used binary change detection methods, including FC-Siam-conc [
32], SNUNet, STANet, and BIT [
Among them, FC-Siam-conc is a change detection model based on a fully convolutional twin network; it achieves feature fusion and complementarity through skip connections that concatenate the feature maps of the two encoder branches with the corresponding decoder layers. SNUNet reduces the loss of position information during deep network training by densely connecting the encoder and the decoder; it also introduces an ensemble channel attention mechanism for deep supervision, refining representative features at different semantic levels and using them for the final classification. STANet uses a spatiotemporal attention mechanism to compute attention weights between any two pixels at different times and positions, and analyzes the features of image sub-regions obtained by multi-scale partitioning, which enhances the model's detection performance for objects of different sizes and shapes. BIT is a lightweight, efficient model built on the strong generalization ability of transformers, offering high accuracy and large-area object detection at low computational cost. All network models were trained and tested in the same experimental environment. The qualitative comparison of the test results is shown in
Figure 10.
The figure contains five groups of contrast images, in which the woodland change area gradually expands from left to right. From the detection results of the first group, it can be seen that the FC-Siam-conc and STANet methods produced some scattered false detections, while the results of the SNUNet and BIT methods did not adhere closely to the edges. In the second group, the woodland change area in the upper left corner was relatively slender; the detection results of the FC-Siam-conc and BIT methods were fragmented and failed to form a coherent change area. Similarly, for the relatively slender change area in the upper half of the third group, the other methods clearly suffered from serious missed detections and failed to capture these subtle changes. In the fourth group, after the lower right corner changed from woodland to open space, the darker color of the area led to incomplete extraction results from the FC-Siam-conc, SNUNet, and STANet methods due to missed detections, while the BIT result was also incomplete because of a hole in the upper right corner. Finally, the fifth group mainly showed the change from woodland to buildings; because the woodland in this area was originally sparse, the change areas extracted by the other methods all contained holes to varying degrees. In contrast, our method performed better in all groups, and its extraction results were more complete, reflecting the woodland changes more accurately.
In addition, given that the change detection task is similar to the semantic segmentation task in terms of results and labels, in order to analyze and compare the effects of different methods in more depth, this paper used the same evaluation indicators as the semantic segmentation task for evaluation. The quantitative comparison results of different methods are shown in
Table 1. A detailed comparison of these evaluation indicators shows that the proposed method achieved the best performance in woodland change detection on remote sensing images. Compared with other commonly used binary change detection methods, the overall accuracy improved by 0.24 to 2.27 percentage points, the IoU by 0.83 to 7.85 percentage points, and the F1 score by 0.55 to 5.36 percentage points. In addition, our method achieved good results in both the values and the balance of precision and recall, demonstrating its significant advantages in the remote sensing image woodland change detection task.
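Since the paper evaluates with semantic-segmentation-style indicators, a minimal sketch of how OA, precision, recall, F1, and IoU follow from the binary confusion matrix may help (the function and variable names are ours, not the paper's):

```python
import numpy as np

def binary_cd_metrics(pred: np.ndarray, label: np.ndarray) -> dict:
    """Compute OA, precision, recall, F1, and IoU for a binary change map,
    where 1 = changed and 0 = unchanged."""
    tp = np.sum((pred == 1) & (label == 1))  # changed pixels found
    fp = np.sum((pred == 1) & (label == 0))  # false alarms
    fn = np.sum((pred == 0) & (label == 1))  # missed changes
    tn = np.sum((pred == 0) & (label == 0))  # correct non-change
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "OA": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
        "IoU": tp / (tp + fp + fn),
    }

# Tiny worked example (flattened pixel vectors).
pred  = np.array([1, 1, 0, 0, 1, 0])
label = np.array([1, 0, 0, 0, 1, 1])
m = binary_cd_metrics(pred, label)
# tp=2, fp=1, fn=1, tn=2 -> precision = recall = F1 = 2/3, IoU = 0.5, OA = 2/3
```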
These improvements are consistent with previous Siamese and attention-based change detection networks, such as FC-Siam-conc, SNUNet-CD, STANet and BIT [
36,
37,
40,
41], where multi-scale feature fusion and attention mechanisms have also been shown to effectively enhance change-detection accuracy on high-resolution remote sensing images.
4.2. Analysis of Anti-Interference Ability of Binary Change Detection Model
In remote sensing images acquired at different time points, diverse changes of ground object types and pseudo-change phenomena caused by various interference factors are common. These changes are not only numerous but also often intertwined, which increases the difficulty of extracting the change information of a specific ground object from remote sensing images. Since the BCD method constructed in this paper aims to filter out irrelevant changes from these complex environments and extract only the change information related to woodland, this paper conducted a detailed analysis of the woodland change detection results on dual-temporal images containing different interfering ground object changes, as shown in
Figure 11.
The figure contains five groups of dual-temporal images, each showing different land use changes. The first two groups mainly show the transformation between woodland and cultivated land: the first group is the change from woodland to cultivated land, and the second is the change from cultivated land to woodland. The third and fourth groups show the transformation between woodland and vegetation: the third group is the change from vegetation to woodland, and the fourth is the change from woodland to vegetation. In the fifth group, the upper part is mainly the change from woodland to bare land, and the lower part is mainly the change from vegetation to bare land. The detection results show that, after processing by the method in this paper, the woodland change areas were identified more accurately, and the changes of other non-target objects were effectively excluded and suppressed. This fully demonstrates the excellent performance of our method in extracting woodland change information. Even when faced with interference from cultivated land, vegetation, and other objects with characteristics similar to woodland, our method maintains high accuracy and reliability, further proving its practical application value.
4.3. Analysis of Experimental Results of Semantic Change Detection Model
As an extension of the BCD model, the SCD model's detection results are visualized in
Figure 12. As can be seen from the figure, for the first and second groups of images with relatively simple labels, the model detected both the change areas and the semantic information well; for the third and fourth groups with more complex labels, the model still achieved good results overall, but some detailed semantic information was missed. It is worth noting that in the right half of the fifth group of T2 images, the model mistakenly classified some withered low vegetation as bare land. The reasons are multi-faceted. On the one hand, the spectral and texture features of vegetation-related ground objects can change significantly under different seasonal or climatic conditions. For instance, in drought or winter, low-growing vegetation may appear withered, sparse, or even nearly exposed at the surface, so its remote sensing image features become highly similar to those of bare land, causing semantic ambiguity. On the other hand, samples of such transitional states or edge scenes may be relatively scarce in the training data, so the model fails to fully learn the subtle discriminative features between withered vegetation and true bare land, leading to confusion at the inference stage. Under the combined effect of these factors, the model tends to assign regions with strong spectral ambiguity and vague labeling boundaries to more "typical" categories (such as bare land), reflecting the limitations of current methods in their sensitivity to environmental changes and their dependence on the data distribution. This also suggests that future work should enhance sample diversity at the data level (for example, by introducing multi-temporal and multi-seasonal samples) and incorporate more robust ground object discrimination priors into the model design.
Similar observations have also been reported in recent semantic change detection studies, where jointly estimating change masks and pre-/post-change land-cover categories under class imbalance and complex backgrounds is considered substantially more challenging than binary change detection [
38,
39].
In order to more accurately quantify the performance of the semantic change detection model, the evaluation indicators for the detection results of various types of objects in the change area are detailed in
Table 2. In addition, in order to verify the effectiveness of the model in detecting overall changes, the experiment also calculated the evaluation indicators for the detection results of the change area. The specific results are shown in
Table 3.
From the index data in
Table 3, it can be seen that when the model was trained for both change detection and semantic segmentation, it still achieved a good overall change detection performance, with slightly higher IoU, F1 score, and OA compared to the woodland change detection results in
Table 1. However, according to the data in
Table 2, the detection of semantic information within the changed areas still needs further improvement. The reason may be that the changed areas with specific semantic categories account for only a small part of each group of images, while the unchanged areas without specific semantic categories account for the majority, resulting in data imbalance. There were significant differences in the IoU values of the different feature categories in the semantic change detection results (for example, 73.86% for buildings vs. 47.18% for low vegetation), reflecting that the model's recognition ability is not balanced across ground features. Artificial or structured features such as buildings (73.86%) and water bodies (65.43%) performed well, while the detection accuracy of low vegetation (47.18%) and bare land (56.63%) was significantly lower. This phenomenon mainly stems from the complexity of natural features themselves: when low-growing vegetation withers, becomes sparse, or is affected by light and shadow, its spectral and texture characteristics are easily confused with those of bare land. Meanwhile, such features usually lack clear boundaries and are spatially diffuse, making pixel-level classification in edge areas highly sensitive. Furthermore, samples of such transitional states may be insufficient in the training data, and manual annotation is itself subjective and ambiguous in areas with low vegetation coverage, further exacerbating the difficulty of model learning. In contrast, artificial features such as buildings have regular geometric shapes, high contrast, and stable spectral responses, making them easier for the network to capture accurately.
4.4. Analysis of Loss Function Results
It can be seen from the loss results in
Table 4 that the focal loss of the binary change detection model converged stably to a relatively low level (below 0.2) on both the training set and the validation set, and the validation loss was only slightly higher than the training loss, indicating that the model did not overfit and had good generalization ability. Meanwhile, the focal loss design effectively alleviated the severe class imbalance between the forest land change areas (positive samples) and the unchanged areas (negative samples). For the semantic change detection model, all of its loss components, namely the change detection focal loss (
LF), the pre-image semantic cross-entropy loss (
LMCE1), and the post-image semantic cross-entropy loss (
LMCE2), showed a stable downward trend. Among them, the convergence value of
LF was significantly lower than those of
LMCE1 and
LMCE2. This is consistent with the design that assigns a higher weight (×4) to
LF in the total loss, ensuring that the model prioritizes optimizing the detection accuracy of the changed regions during multi-task joint training. Furthermore, the total validation loss of this model was only slightly higher than the training loss, further indicating that the training process was stable and that no severe overfitting occurred. This shows that the proposed loss function design can effectively drive network learning and achieve robust, accurate forest land change detection in complex high-resolution remote sensing scenes.
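Taken together, the weighted total loss can be sketched as follows. This uses a common focal-loss formulation with an assumed γ = 2; the exact form and hyperparameters used in the paper may differ:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, target: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Common focal-loss formulation for the change branch (gamma is assumed)."""
    ce = F.cross_entropy(logits, target, reduction="none")  # per-pixel cross-entropy
    pt = torch.exp(-ce)                                     # probability of the true class
    return ((1 - pt) ** gamma * ce).mean()                  # down-weight easy pixels

def total_loss(cd_logits, cd_target, seg1_logits, seg1_target, seg2_logits, seg2_target):
    """L_total = 4 * L_F + L_MCE1 + L_MCE2, mirroring the x4 weight on L_F."""
    l_f = focal_loss(cd_logits, cd_target)              # change detection focal loss
    l_mce1 = F.cross_entropy(seg1_logits, seg1_target)  # pre-image semantic CE loss
    l_mce2 = F.cross_entropy(seg2_logits, seg2_target)  # post-image semantic CE loss
    return 4 * l_f + l_mce1 + l_mce2

# Toy tensors: 2 change classes, 6 hypothetical semantic classes, 8x8 patches.
torch.manual_seed(0)
cd_logits = torch.randn(2, 2, 8, 8)
cd_target = torch.randint(0, 2, (2, 8, 8))
seg_logits = torch.randn(2, 6, 8, 8)
seg_target = torch.randint(0, 6, (2, 8, 8))
loss = total_loss(cd_logits, cd_target, seg_logits, seg_target, seg_logits, seg_target)
```

The ×4 factor makes the change branch dominate the gradient early in training, which matches the observation above that LF converges to a lower value than the two semantic terms.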
4.5. Analysis of Ablation Experiment Results
Based on the HRNet backbone network, this paper added a convolutional attention module and constructed a step-by-step upsampling mechanism to optimize it. In order to verify the effectiveness of these improved modules, this paper set up four groups of ablation comparison experiments in the same experimental environment:
G1: Uses only the original HRNetV2 backbone, without the progressive upsampling structure or any attention modules, serving as the baseline model;
G2: Based on G1, replaces only the multi-scale feature fusion component with the progressive upsampling structure while keeping the rest of the network architecture unchanged to evaluate the contribution of the progressive upsampling mechanism;
G3: Based on G2, incorporates the original CBAM attention module into each scale branch to evaluate the performance difference between the standard CBAM and the configuration utilizing only progressive upsampling;
G4: Based on G2, incorporates the improved 1D convolutional channel attention and spatial attention module proposed in this paper into each scale branch (i.e., the complete CAM-HRNet) to evaluate the advantages of the improved convolutional attention module over the original CBAM.
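The 1D convolutional channel attention used in G4 can be sketched in the spirit of efficient channel attention; this shows only the channel branch, and the kernel size and pooling choice below are assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class Conv1DChannelAttention(nn.Module):
    """Sketch of a 1D-convolutional channel attention module: a global
    descriptor per channel, a 1D convolution across channels instead of
    a fully connected bottleneck, and a sigmoid reweighting."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) -> per-channel descriptor via global average pooling
        y = x.mean(dim=(2, 3))                           # (N, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)         # 1D conv across channels
        w = self.sigmoid(y).unsqueeze(-1).unsqueeze(-1)  # (N, C, 1, 1) weights
        return x * w                                     # reweight channels

att = Conv1DChannelAttention()
x = torch.randn(2, 16, 8, 8)
out = att(x)  # same shape as x, channels rescaled by learned weights
```

Compared with CBAM's fully connected channel bottleneck, the 1D convolution captures local cross-channel interaction with far fewer parameters, which is the kind of saving the G4 configuration is designed to test.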
The experimental results are shown in
Table 5.
From the data in the table, it can be seen that the fourth group of experiments (the complete CAM-HRNet) improved the overall accuracy by 0.21 to 0.62 percentage points, the IoU by 0.89 to 2.50 percentage points, and the F1 score by 0.56 to 1.55 percentage points compared with the other experimental groups. These gains in the quantitative indicators reflect the positive impact of the improved modules of the proposed CAM-HRNet method on the final results of remote sensing image woodland extraction and verify their effectiveness.