In the field of remote sensing, change detection refers to the process of obtaining surface change information by analyzing two images of the same area acquired at different times [1]. Remote sensing is an important means of dynamically monitoring surface cover in applications such as land use, resource exploration and disaster monitoring [2]. Two properties of changed objects make the task difficult: class imbalance and variable object size. Class imbalance means that a small number of categories of changed objects account for most of the samples [3], while other categories have very few samples; this biases the model towards learning the features of the dominant categories during training, resulting in underfitting and serious misclassification of minority categories. Meanwhile, variable object sizes prevent the model from capturing consistent semantic and geometric features across scales [4]: small changed objects are easily missed because of insufficient feature extraction, while large changed objects are prone to incomplete or false detection because of imbalanced feature responses. In recent years, with the rapid development of technologies such as artificial intelligence, neural networks and large language models, deep learning has been widely applied to tasks such as target recognition and change detection in remote sensing [3,5,6]. Owing to its strong learning ability and its capacity to mine complex features, deep learning can capture change information in remote sensing images more accurately than other change detection methods, improving interpretation accuracy and reducing the time needed for feature extraction [4,7,8].
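To make the class imbalance problem concrete, the following minimal sketch computes normalized inverse-frequency class weights from a toy label map. It is an illustration only and not a component of the method proposed here: the function name `inverse_frequency_weights`, the toy label counts and the choice of inverse-frequency weighting are all assumptions introduced for this example.

```python
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Return one weight per class, proportional to 1 / pixel frequency.

    `labels` is a flat iterable of integer class ids (hypothetical toy input).
    Classes that never occur receive weight 0.0.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    # Rare classes get a large raw weight, frequent classes a small one.
    raw = [total / counts[c] if counts[c] else 0.0 for c in range(num_classes)]
    norm = sum(raw)
    # Normalize so that the weights sum to 1.
    return [w / norm if norm else 0.0 for w in raw]


# Toy label map (assumed): class 0 dominates, class 2 is rare.
toy_labels = [0] * 90 + [1] * 8 + [2] * 2
weights = inverse_frequency_weights(toy_labels, num_classes=3)
print(weights)
```

In this toy case the rare class 2 receives the largest weight and the dominant class 0 the smallest, which is one common way of counteracting the bias towards dominant categories described above.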
The Transformer is an important innovation in deep learning; its self-attention mechanism allows sequences to be processed in parallel [9,10]. A classic application of the Transformer to change detection is ChangeFormer, proposed by Bandara et al. [11], which uses a hierarchical Transformer encoder and a lightweight MLP decoder to process bi-temporal images in a Siamese architecture. Although it reduces the computational overhead, its fixed-window attention mechanism restricts cross-window feature interaction, which limits its ability to detect irregular changes. Zhang et al. [12] combined Swin Transformer with UNet to propose SwinSUNet; it breaks free from the locality constraint of convolution but suffers from an excessively large number of parameters and insufficient utilization of shallow detailed features. Teng et al. [13] proposed SFCD, which uses Swin Transformer instead of a traditional CNN as the encoder in the feature extraction stage, exploiting the advantages of Swin Transformer in small-target and local-area change detection. However, this method relies on ImageNet pre-trained weights, leading to limited generalization performance on small-sample remote sensing datasets. Guo et al. [14] proposed an iterative difference enhancement method (IDET), which enhances differential features iteratively to improve the change detection accuracy, but its multi-scale iterative refinement introduces extra computational overhead, and its inference efficiency needs to be improved. Yang et al. [15] proposed a Siamese encoder–decoder network based on graph context attention (GCA-SEDN). It fuses graph context attention to capture the spatial topological relationships of ground objects and eliminates the annotation dependence, making it suitable for unlabeled scenarios. However, it is designed specifically for polarimetric SAR data and adapts poorly to optical remote sensing images.
In recent years, multi-task learning and multi-scale fusion have become research hotspots in the field of semantic change detection (SCD). Traditional CNN models and their extended networks, such as LeNet-5 [16], AlexNet [17], VGG [18], ResNet [19] and DenseNet [20], as applied to binary change detection, increasingly fail to meet the requirements of SCD. Chen et al. [21] combined the Siamese network with the UNet model and introduced Atrous Spatial Pyramid Pooling (ASPP) to enhance the multi-scale feature detection capabilities. Pang et al. [22] proposed SCA-CDNet, a robust Siamese correlation and attention change detection network; however, this method still relies on a CNN as its main backbone, which limits its global modeling capability. Cui et al. [23] proposed MTSCD-Net, which adopts a Swin Transformer-based Siamese semantic perception encoder to extract bi-temporal multi-scale features, but it suffers from insufficient task interaction, weak semantic consistency constraints and inadequate suppression of pseudo-changes and seasonal disturbances. Wang et al. [24] proposed a cross-difference semantic consistency network, which improves SCD performance by enhancing the collaboration between the binary change detection and semantic segmentation subtasks and by using modeled difference features to overcome the limited consistency of the bi-temporal feature space; however, it struggles to balance global semantics and local details. Some studies focus on improving SCD performance through semantic enhancement and change consistency strategies. For example, SAM2-based methods extract global features to address insufficient semantic extraction and inconsistent change features [25]. To better capture surface cover features in complex scenarios, Liu et al. [26] proposed an SCD model based on spatiotemporal attention perception and multi-scale fusion to address spatial detail loss and insufficient global feature modeling. Although such methods have achieved good results, effectively exploiting the correlation between tasks to improve the overall performance of the model remains a challenge.
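The interplay between the binary change detection and semantic segmentation subtasks discussed above can be illustrated with a small sketch. It shows one common way of combining the outputs of the two branches and of counting semantically inconsistent pixels. It does not reproduce any of the cited networks: the function names, the flat pixel lists and the convention that class id 0 denotes "no change" are assumptions made for this example.

```python
def semantic_change_maps(sem_t1, sem_t2, change_mask, no_change=0):
    """Keep per-date semantic labels only where the binary branch reports change.

    All inputs are flat lists of equal length (hypothetical toy layout).
    `no_change` is the assumed class id written to unchanged pixels.
    """
    out_t1, out_t2 = [], []
    for c1, c2, changed in zip(sem_t1, sem_t2, change_mask):
        out_t1.append(c1 if changed else no_change)
        out_t2.append(c2 if changed else no_change)
    return out_t1, out_t2


def inconsistent_pixels(sem_t1, sem_t2, change_mask):
    """Count pixels flagged as changed whose two semantic labels are identical."""
    return sum(
        1
        for c1, c2, changed in zip(sem_t1, sem_t2, change_mask)
        if changed and c1 == c2
    )


# Toy predictions (assumed): four pixels, classes 1..3, binary change mask.
sem_t1 = [1, 1, 2, 3]
sem_t2 = [1, 2, 2, 1]
mask = [0, 1, 1, 1]

map_t1, map_t2 = semantic_change_maps(sem_t1, sem_t2, mask)
conflicts = inconsistent_pixels(sem_t1, sem_t2, mask)
print(map_t1, map_t2, conflicts)
```

In the toy input, the third pixel is flagged as changed although both dates predict the same class. Conflicts of this kind between the two branches are what semantic consistency constraints aim to reduce.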
To overcome the insufficient multi-task collaboration described above and achieve high-precision semantic change detection of bi-temporal remote sensing images, three issues must be resolved: (1) structured semantic representation in multi-scale feature fusion; (2) expressive multi-scale difference modeling that bridges the change detection and semantic segmentation branches; and (3) noise suppression in temporal cross-attention to safeguard the per-temporal semantic accuracy. Building on BGSNet, we propose three collaborative improvement modules for the different subtasks and construct a semantic change detection network, SRDFNet (Semantic Refinement and Differential Features), oriented to multi-task joint optimization. Our research contributions are summarized as follows: