MsFNet: Multi-Scale Fusion Network Based on Dynamic Spectral Features for Multi-Temporal Hyperspectral Image Change Detection

Feng, Yining; Ni, Weihan; Song, Liyang; Wang, Xianghai

doi:10.3390/rs16163037

Open AccessArticle

MsFNet: Multi-Scale Fusion Network Based on Dynamic Spectral Features for Multi-Temporal Hyperspectral Image Change Detection

¹

School of Geography, Liaoning Normal University, Dalian 116029, China

²

School of Computer Science and Artificial Intelligence, Liaoning Normal University, Dalian 116029, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2024, 16(16), 3037; https://doi.org/10.3390/rs16163037

Submission received: 6 July 2024 / Revised: 16 August 2024 / Accepted: 17 August 2024 / Published: 18 August 2024

(This article belongs to the Special Issue Recent Advances in the Processing of Hyperspectral Images)

Download

Browse Figures

Versions Notes

Abstract

With the development of satellite technology, the importance of multi-temporal remote sensing (RS) image change detection (CD) in urban planning, environmental monitoring, and other fields is increasingly prominent. Deep learning techniques enable a profound exploration of the intrinsic features within hyperspectral (HS) data, leading to substantial enhancements in CD accuracy while addressing several challenges posed by traditional methodologies. However, existing convolutional neural network (CNN)-based CD approaches frequently encounter issues during the feature extraction process, such as the loss of detailed information due to downsampling, which hampers a model’s ability to accurately capture complex spectral features. Additionally, these methods often neglect the integration of multi-scale information, resulting in suboptimal local feature extraction and, consequently, diminished model performance. To address these limitations, we propose a multi-scale fusion network (MsFNet) which leverages dynamic spectral features for effective multi-temporal HS-CD. Our approach incorporates a dynamic convolution module with spectral attention, which adaptively modulates the receptive field size according to the spectral characteristics of different bands. This flexibility enhances the model’s capacity to focus on critical bands, thereby improving its ability to identify and differentiate changes across spectral dimensions. Furthermore, we develop a multi-scale feature fusion module which extracts and integrates features from deep feature maps, enriching local information and augmenting the model’s sensitivity to local variations. Experimental evaluations conducted on three real-world HS-CD datasets demonstrate that the proposed MsFNet significantly outperforms contemporary advanced CD methods in terms of both efficacy and performance.

Keywords:

change detection; deep learning; hyperspectral image; multi-scale feature; multi-temporal analysis

1. Introduction

Hyperspectral (HS) imaging boasts superior spectral resolution, thereby offering richer spectral information and conferring a unique advantage in surface environmental change detection (CD) [1,2]. HS image CD aims to obtain dynamic information on land surface changes by comparing two HS images collected at different times before and after covering the same scene. It generally includes three steps: preprocessing of HS images, application of CD algorithms and evaluation of the results [3]. Of these, selecting an appropriate CD methodology profoundly influences detection accuracy. Feature extraction and change region identification constitute pivotal stages within CD methodologies. Feature extraction entails extracting pertinent attributes such as the color, texture, and spatial relationships of ground objects, laying a robust groundwork for subsequent analysis. Change region identification focuses on employing algorithms to conduct intricate analysis of extracted features, aiming for precise delineation of changing regions.

Early researchers focused on studying the differences in features. Change vector analysis (CVA) [4] is a typical algebraic operation CD method, which treats each spectral curve as a vector and achieves CD by calculating the difference between the vectors. Deng et al. [5] enhanced change information based on principal component analysis (PCA) and combined supervised and unsupervised methods to accurately identify change areas. Celik et al. [6] used PCA to process data and create feature vector spaces for subsequent CD tasks. Wu et al. [7] used adaptive cosine estimation to detect the target change regions in stacked images. Erturk et al. [8] used an HS image CD technique based on spectral unmixing. They compared the abundance differences of the same region in images obtained at different times by using differential operations to identify the regions where changes occurred. Hou et al. [9] proposed a new method for HS image CD using multi-modal profiles. Traditional methods for HS-CD are characterized by their straightforward implementation and comprehensibility. However, these approaches place stringent requirements on the accuracy of multi-temporal HS image registration. Additionally, they demonstrated the limitations in effectively utilizing both spectral and geometric features, which present challenges in attaining reliable detection accuracy.

Recently, with the continuous progress of deep learning (DL) technology, its application in HS-CD has become a hot research topic, and various DL-based methods have emerged one after another [10]. Compared with traditional mathematical models limited to extracting superficial features, DL techniques garner considerable interest, owing to their pronounced efficacy in addressing high-dimensional challenges and performing intricate feature extraction. DL approaches inherently discern and assimilate profound semantic insights from image datasets via intricate network structures. In the field of HS-CD, effective implementation not only enhances processing efficiency but also offers robust solutions for the intricate analysis of multi-temporal hyperspectral images [11]. Wang et al. [12] proposed a 2D convolutional neural network (CNN) framework (GETNET) which trains deep networks in an unsupervised manner. However, the limitation of this method is that it relies on the pseudo training set generated by other CD methods, which not only increases the complexity of the algorithm but may also affect model accuracy due to the presence of noise. Zhan et al. [13] proposed a 3D spectral-spatial convolutional network (TDSSC) which extracts image features from three dimensions using three-dimensional convolution. Mostafa et al. [14] proposed a new CD method which utilizes U-Net neural networks to effectively learn and restore the details of the input images and utilizes attention to highlight key information, thereby effectively determining the region of change. Zhan et al. [15] proposed a spectral-spatial convolution neural network with a Siamese architecture (SSCNN-S) for HS-CD. Song et al. [16] integrated iterative active learning and affinity graph learning into a unified framework. They built an HS-CD network based on active learning. In recent years, owing to the widespread adoption of attention mechanisms within computer vision, researchers have naturally extended their application to HS-CD tasks [17,18,19]. Wang et al. [20] proposed a dual-branch framework based on spatiotemporal joint graph attention and complementary strategies (CSDBF). Song et al. [21] proposed a new cross-temporal interactive symmetric attention network (CSANet) which can effectively extract and integrate the spatial-spectral temporal joint features of HS images while enhancing the ability to distinguish the features of changes. In addition to CNNs, other network frameworks have also been applied to CD tasks and have achieved certain results, such as the GNN [22], GRU [23], and Transformer [24].

However, extant CD methodologies encounter challenges during the feature extraction phase. Owing to the susceptibility to detail loss during network downsampling, prevailing models struggle to accurately identify intricate spectral characteristics. Moreover, these approaches frequently overlook the incorporation of multi-scale information during feature extraction, thereby impeding satisfactory extraction of local features. HS image has tens to hundreds of bands, and each band contains rich spectral information. Traditional methods exhibit a lack of emphasis on these issues, thereby exerting a certain degree of impact on model performance. To effectively capture the complex spectral features in HS images, it is necessary to combine multi-scale features, comprehensively consider the importance of spectral features, and extract more representative features for CD and analysis.

To this end, we present a multi-scale fusion network based on dynamic spectral features for multi-temporal HS-CD (MsFNet). We proposed a dual-branch network architecture tailored for feature extraction across distinct multi-temporal phases. To accurately capture complex spectral features and consider them at multiple scales, a dynamic convolution module based on a spectral attention mechanism is used to adaptively adjust the features of different bands. Utilizing a multi-scale feature fusion module enhances the model’s capacity for local information perception by amalgamating multi-scale feature extraction and fusion mechanisms. The primary innovations presented in this article can be summarized as follows:

(1) A multi-scale fusion network based on dynamic spectral features for multi-temporal HS-CD (MsFNet) is proposed, which uses a dual-branch network to extract features from multi-temporal phases and fuse them to capture complex spectral features and perceive local information, improving the accuracy of CD.

(2) A spectral dynamic feature extraction module is proposed, which utilizes the dynamic convolution module to dynamically select features. At the same time, the attention module in the original dynamic convolution is improved to achieve adaptive adjustment of features in different bands, making the network more flexible when focusing on important bands.

(3) A multi-scale feature extraction and fusion module is proposed, which effectively solves the problem of losing detailed information in the downsampling process of convolutional networks by considering features at multiple scales and enhances the model’s perception ability of local information.

(4) Experiments are performed on three multi-temporal HS image CD datasets, yielding outcomes which outstrip certain state-of-the-art (SOTA) methods. Findings indicate that the proposed approach demonstrates notable superiority in performance when juxtaposed with contemporary SOTA CD methods.

The structure of the remaining sections is as follows. Section 2 presents a literature review on multi-temporal remote sensing (RS) image CD based on feature fusion and multi-scale learning. Section 3 describes the proposed framework and novel modules. Section 4 delineates the experimental configuration, analysis of results, and ablation experiment, while Section 5 provides our conclusions.

2. Related Work

2.1. RS-CD Based on Feature Fusion

In RS-CD, multi-temporal features are carriers which directly reflect the characteristics of changing targets between different temporal images. They can consider temporal correlation and avoid errors introduced by single temporal features. Effective extraction and application of these features are the keys to determining whether changing targets are accurately detected. Integrating multi-dimensional and multi-temporal features is beneficial for extracting more accurate features and significantly improving CD results.

The integration of multi-temporal features has been extensively investigated by numerous researchers. Le Hégarat-Mascle et al. [25] used the Dempster–Shafer (D-S) evidence theory to perform decision fusion of different change indicators and detector results in order to improve detection performance and stability. Nemmour et al. [26] used fuzzy integration to integrate decision fusion of CD results from a single support vector machine (SVM) output to improve detection accuracy. Wang et al. [27] used cross-sharpening technology to detect changes in multi-temporal RS images after fusion, and the results showed that it can effectively reduce the false detection changes caused by different acquisition angles of multi-temporal images. Du et al. [28] integrated the results using three decision-level fusion strategies—the voting method, evidence theory, and fuzzy integration—which can gradually improve the accuracy of CD. In recent years, some DL-based fusion methods have also been applied to RS-CD tasks. Lyu et al. [29] proposed a CD method based on recurrent neural networks (RNNs), which improved the RNN framework with the long short-term memory (LSTM) model. Song et al. [30] proposed a recurrent 3D fully CNN (Re3FCN) which extracts spatial-spectral features from the subnets of a a 2D-CNN and feeds them back into the RNN subnets to extract temporal variation features. Wang et al. [31] proposed a new attention-guided feature fusion network (AgF2Net), which employs the combined attention module to facilitate the fusion of spatial and spectral information derived from the backbone network, thereby harnessing the complementary advantages inherent in multidimensional features.

However, currently, most methods based on feature fusion use simple weight sharing or feature superposition or products, lacking deep fusion methods for features. This simple fusion method mostly stays at the level of “feature accumulation”, which is insufficient for HS images with rich spectral information. How to effectively integrate features is still a quite noteworthy issue.

2.2. Multi-Scale Learning

Multi-scale features denote the characteristic features extracted at varying scales, whcih are adept at capturing both the structural and semantic insights of input data across diverse levels of granularity [32,33]. Usually, different features can be observed at different scales, with more details visible at smaller scales and overall trends visible at larger scales. By effectively utilizing multiple scales, more comprehensive extraction of information can be achieved, encompassing both the global context and intricate local details. Multi-scale features have the following advantages: improving the generalization ability of the model, enhancing its expressive power, helping to process complex data, enhancing the robustness of the model, and facilitating end-to-end feature learning.

In computer vision, multi-scale features find frequent application across diverse tasks like image classification and target detection. Within CNNs, neurons within each layer are typically connected solely to a localized region of the preceding layer, known as the receptive field (RF). During the sequential extraction of target features, an excessively small RF restricts neurons to perceiving solely local input data information, while an overly large RF may induce information blurring or loss.

In recent years, multi-scale learning has also been applied to RS tasks and has achieved good results [34]. Chen et al. [35] proposed a novel noise suppression method based on encoder-decoder architecture, replacing the skip connection module between the encoder and decoder in U-Net with multi-channel feature fusion to capture more complex channel dependencies. Guo et al. [36] proposed a high-resolution image segmentation method which combines multi-scale segmentation and fusion. The strategy of multi-scale segmentation involves segmenting the same image iteratively at various scales, followed by feature extraction of the objects. Chen et al. [37] designed two new deep Siamese convolutional networks for unsupervised and supervised CD in multi-temporal very high-resolution (VHR) images. Yu et al. [38] proposed a nested network based on CNNs (NestNet) to improve the accuracy of automatic CD tasks by using RS time series images. Lu et al. [39] proposed a new multi-scale feature progressive fusion network (MFPF-Net). Li et al. [40] proposed a novel CNN-Transformer network (ConvTransNet) with a multi-scale framework to better utilize global local information in optical RS images. Xu et al. [41] proposed an attention-guided multi-scale context aggregation network (AMCA) for RS-CD.

3. Methodology

3.1. Overall Architecture

This paper presents a multi-scale fusion network based on dynamic spectral features for multi-temporal HS image CD (MsFNet). Figure 1 illustrates the complete framework. A spectral dynamic feature extraction module is employed to dynamically adapt the receptive field based on the characteristics of distinct spectral bands. Compared with the attention mechanism employed in the original dynamic convolution, MsFNet prioritizes inter-band relationships through spectral attention. This approach enhances the network’s flexibility in emphasizing critical bands, thereby aiding the model in discerning and distinguishing band-specific changes. Consequently, this refinement bolsters the model’s acuity in perceiving and capturing intricate spectral features within an HS image. Additionally, the multi-scale feature fusion module extracts varied scale features from deep feature maps and integrates them to enrich local information within the feature map, thereby augmenting the model’s capability to perceive local details.

3.2. Spectral Dynamic Feature Extraction Module (SDFEM)

As a novel CNN operator design, the dynamic convolution has a stronger nonlinear representation ability than traditional static convolution [42]. The latter usually uses a single convolution kernel for each convolution layer, which results in a static computed graph which cannot flexibly adapt to changes in the input data. On the basis of dynamic convolution, a parallel convolution kernel is introduced, and the network realizes flexible adjustment of the convolution parameters through a dynamic aggregation convolution kernel so as to better adapt to different inputs.

HS images encompass multiple bands, with each capturing distinct spectral features with spatially varying characteristics across the dataset. Learning and effectively capturing these complex spectral features are crucial for HS-CD tasks. To address these features, a dynamic convolution module based on spectral attention is proposed, which is integrated into the early stages of the network without increasing its depth. Unlike the attention mechanism in traditional dynamic convolution, this module prioritizes inter-band relationships through spectral attention, thereby enhancing the model’s capacity to perceive and capture intricate spectral nuances [43]. Figure 2 illustrates the architecture of the proposed spectral attention module within the framework of the dynamic convolution.

For the input feature map x, first, maxpooling and avgpooling operations are performed to obtain the maximum and average responses to spectral features. A multi-layer perceptron (MLP) structure is used which includes two convolutional layers to process the results, obtaining

m a x_o u t

and

a v g_o u t

. Then,

m a x_o u t

and

a v g_o u t

are summed to obtain

M_{c}

. Subsequently, the preliminary integration of features is completed through a fully connected layer. Subsequently, the attention weight

ω_{c}

is obtained through the ReLU layer, fully connected layer, and SoftMax layer. The specific calculation method is

\{\begin{matrix} m a x_o u t = MLP (M a x P o o l 2 d (x)) \\ a v g_o u t = MLP (A v g P o o l 2 d (x)) \\ M_{c} = m a_o u t + a v g_o u t \\ F_{c} = ReLU (f c_{1} (M_{c})) \\ ω_{c} = Softmax (f c_{2} (F_{c})) \end{matrix}

(1)

where

f c_{1} (\cdot)

and

f c_{2} (\cdot)

are are two fully connected layers with different weight parameters.

The spectral dynamic convolution layer is obtained through aggregation of the obtained attention weight

ω_{c}

and the convolution kernel as shown in Figure 3:

y = g ({\hat{W}}^{T} (x) x + \hat{b} (x))

(2)

where

\begin{matrix} \hat{W} (x) = \sum_{k = 1}^{K} π_{k} (x) {\hat{W}}_{k}, \hat{b} (x) = \sum_{k = 1}^{K} π_{k} (x) {\hat{b}}_{k} \\ s . t . 0 \leq π_{k} (x) \leq 1, \sum_{k = 1}^{K} π_{k} (x) = 1 \end{matrix}

(3)

Here,

π_{k}

represents the kth linear function

{\hat{W}}_{k}^{T} x + {\hat{b}}_{k}

. The attention weight of x varies depending on the input x. Therefore, given the input, the dynamic perceptron represents the optimal linear function combination for that input. Because the model is nonlinear, the dynamic perceptron has a stronger representation ability. For the time branch

T_{n}, n \in \{1, 2\}

, an HS image is passed through the spectral dynamic convolution module, which captures more local features.

The dynamic convolution based on spectral attention can not only adjust the receptive field size adaptively according to the features of different bands but also capture more local micro features. The addition of spectral attention can make the network more flexible in focusing on more important bands and better capture and distinguish the changes between different bands. Therefore, the adaptability and expression ability of the CD algorithm for an HS image can be improved.

3.3. Multi-Temporal Deep Feature Encoder

Multi-temporal HS images have rich spatial and spectral information. Employing a dual-branch input mechanism, as opposed to traditional differential images, enables a more comprehensive depiction of the multidimensional features present within HS data, thereby effectively mitigating information loss and enhancing CD accuracy. For this purpose, we designed two time branches

T_{1}

and

T_{2}

to extract features from HS images.

For a time branch

T_{n} (n \in \{1, 2\})

, chunk processing is performed on HS images

H_{1}

and

H_{2}

. The corresponding patch block of an HS image is first input into the dynamic convolution block module to obtain a feature map

F_{d y, n} (n \in \{1, 2\})

, which captures more local features. Afterward, features are extracted from the multi-temporal HS images using the following formula:

F_{n}^{(e)} = \{\begin{matrix} f (B N_{γ, β} (ω_{h}^{(e) T} F_{d y, n} + b_{h}^{(e)})), & e = 1 \\ f (B N_{γ, β} (ω_{h}^{(e) T} F_{n}^{(e - 1)} + b_{h}^{(e)})), & e = 2, 3 \\ ω_{h}^{(e) T} F_{n}^{(e - 1)} + b_{h}^{(e)}, & e = 4 \end{matrix}

(4)

where

F_{i}

represents the eth features extracted by the encoder block of an HS image,

\{ω_{h}^{(e)}, b_{h}^{(e)}\}

is the weights and biases for the encoder parts of two source datasets,

f (\cdot)

is the ReLU nonlinear activation function, and

B N_{γ, β} (x_{i}) = γ {\hat{x}}_{i} + β

represents the batch normalization (BN) layer. Its function is to accelerate parameter learning and avoid gradient vanishing during the training process [44].

3.4. Multi-Scale Feature Fusion Module (MsFFM)

Currently, many DL networks primarily utilize single-scale features in CD tasks. However, for some fine features, pixel information is relatively small and is easily lost during the downsampling process of CNNs. In current CNNs, the latter stages are typically configured to maintain the spatial dimensions of the input, aiming to preserve the scale of the output feature map. However, this approach often leads to insufficient capturing of detailed local information within deep feature maps. To mitigate this issue, MsFNet introduces a multi-scale feature fusion module which operates on the deep features of the network. The deep features in the MsFFM refer to the multi-scale feature extraction method through the deep learning network, the use of shallow-to-deep convolution levels to gradually extract abstract and high-level information, and the integration of features of different scales. These levels can capture more global and semantic features of the ground objects, such as the overall shape of the ground objects, spatial relations, and context information. This module extracts and integrates multi-scale features from the deep feature map, enriching the local information content of the feature map and thereby enhancing the model’s capability to perceive detailed local features. The architectural configuration of the multi-scale feature fusion module is illustrated in Figure 4. By processing the deep feature map, this module effectively amalgamates information across different scales, facilitating more comprehensive comprehension and processing of input features by the model.

For the input feature graph

F

, by using convolution kernels of different sizes, a feature

C_{i} (i = 2, 3, 4)

with different sizes and numbers of channels multiplied is obtained. The main purpose of this step is to construct a feature representation with multi-scale and richer channel information while extracting the semantic information of

F

.

Firstly, 1×1 convolution is performed on

C_{4}

to reduce the number of channels and obtain

M_{4}

. Then, 2× nearest neighbor upsampling is performed on

M_{4}

. The 2× nearest neighbor upsampling method is simple to implement, has a small amount of computation, is fast, and for some scenes with obvious edges, it can better maintain edge information. The utilization of 2× nearest neighbor upsampling was chosen to ensure the resultant feature map encompassed both robust spatial information and substantial semantic content. The 2× nearest neighbor upsampling process is illustrated in Figure 5.

Next, the obtained feature map is combined with

M_{3}

, which also passes through a 1 × 1 convolution kernel to reduce the number of channels. The combined method uses pixel-to-pixel addition operations. The design of this step aims to integrate information from different channels and then repeat the above process on the newly obtained feature map to obtain

M_{2}

:

\{\begin{matrix} M_{3} = u p s a m p l e_2 x (M_{4}) + f_{1 \times 1} (F_{3}) \\ M_{2} = u p s a m p l e_2 x (M_{3}) + f_{1 \times 1} (F_{2}) \end{matrix}

(5)

Finally, convolution is performed on

M_{2}

to reduce feature confusion caused by the upsampling and stacking processes to obtain the multi-scale feature

P_{2}

:

P_{2} = f_{2 \times 2} (M_{2})

(6)

3.5. CD Module

In order to obtain the results of CD, the deep feature

F_{d y, n} (n \in \{1, 2\})

is obtained through convolutional blocks, and the dual temporal feature maps obtained by distinguishing

F_{d y, 1}

and

F_{d y, 2}

within the group are used to obtain the feature

F_{D Y}

:

F_{D Y} = F_{d y, 1} - F_{d y, 2}

(7)

Multi-scale feature fusion is used to transfer feature maps and obtain the final feature

F_{r e s u l t}

. Throughout the process, features of different scales are fully utilized. Compared with directly using convolution for downsampling, not only can deep features be obtained through convolution, but spatial information in shallow specialties is also preserved, making it more likely to obtain rich feature representations at different scales. This is of great significance for feature extraction. Then,

F_{r e s u l t}

is entered into the fully connected layer to obtain

F_{E}

:

F_{E} = f (ω \cdot F_{r e s u l t} + b)

(8)

The feature

F_{E}

is then input into the SoftMax layer to obtain the final predicted probability distribution

p r e d

. The formula for this is as follows:

p r e d = \frac{exp (θ_{k} |F_{E})}{\sum_{K = 1}^{2} exp (θ_{k} |F_{E})}

(9)

where k is the number of category categories and there are only two types of CD tasks—changed and unchanged—where

k = 2

, with k being the class sequence number. Meanwhile,

θ_{k}

represents the value of the weight of the kth column in the prediction layer, and

p r e d

is the 1D matrix of the probability of changed pixels and unchanged pixels.

3.6. Network Training

The proposed MsFNet adopts an end-to-end approach during network training, directly learning feature representations and classification boundaries from raw data. Train uses training set samples, which are input in the form of

{\{(X_{T 1}^{i}, X_{T 2}^{i}), Y_{i}\}}_{i = 1}^{N}

. Among them,

X_{T 1}^{i}, X_{T 2}^{i}

is the HS image block of the ith training sample input at different times, and

Y_{i}

represents the true label which matches the corresponding image block. N represents the number of training set samples used during the training process.

In HS image CD tasks, class imbalance usually occurs because changes in different regions or scenes may be rare, resulting in class imbalance. The weighted cross-entropy loss function can deal with this situation well and improve the performance of rare CD. In this chapter, weighted cross-entropy is chosen as the loss function:

L_{W - C} = - \frac{1}{U} \sum_{n = 1}^{U} \sum_{j = 1}^{K} ω_{j} log P r (Y^{n} = j | (X_{H}^{n}, X_{L}^{n}); θ)

(10)

where

ω_{j}

is the jth weight and

P r (Y^{n} = j | (X_{H}^{n}, X_{L}^{n}); θ)

is the probability that the pixel belongs to the jth class. This approach allowed us to assign different weights to the loss contributions from individual categories based on their respective frequencies in the training data, thereby addressing the imbalance and ensuring that the model effectively learned from all classes, regardless of their representation in the dataset. For the weights, the definition is as follows:

ω_{j} = \frac{n_{j}}{U}

(11)

where

n_{j}

is the number of class j in the ground truth training sample. The overall process of MsFNet is shown in Algorithm 1.

Algorithm 1: Actions performed by the algorithm.

Require:: multi-temporal HS image $H_{1}$ and $H_{2}$ ; training label Y; training epochs $e p o c h s$ ;
Ensure:: CD map, P;
1:: Initialize all weights;
2:: The input block related to H and L is derived based on the value of r;
3:: while $e p o c h < e p o c h s$ do:
         step 1: Calculate spectral attention weights $ω_{c}$ with Equation (1);
         step 2: Extract multitemporal HS image features $F_{d y, n}, n \in \{1, 2\}$ with dynamic convolution module;
         step 3: Extract multitemporal HS image deep features $F_{D Y}$ with Equation (4);
         step 4: Extract multi-scale deep features $P_{2}$ with Equations (5) and (6);
         step 5: Train the network to obtain the highest weight result;
4:: Use classification function $S o f t M a x$ according to Equation (9) to obtain CD map.

4. Experiments and Analysis

4.1. Dataset Description

To validate the efficacy of the proposed framework, various datasets were employed as experimental benchmarks. Their detailed introduction follows.

4.1.1. Irrigated Agricultural Area (IA)

This dataset covers farmland scenes in Umatilla County in the United States (Available from http://earthexplorer.usgs.gov/ 16 August 2024). HS images were taken for two time periods: 1 May 2004 and 8 May 2007. This dataset is widely used in research on agricultural land use change, land cover change, and other fields. Its original image size is 307 × 241 pixels. The HS image dataset used in the experiment had 154 bands. We used random sampling to construct a neural network training set, with a training set proportion of 9.77%. The specific visualization images are shown in Figure 6a–d [45].

4.1.2. Wetland Agricultural Area (WA)

This dataset covers the wetland agricultural area in Jiangsu Province in China and mainly describes the changes in arable land, having a variety of surface cover types (Available from http://earthexplorer.usgs.gov/ 16 August 2024). HS images were taken for two time periods: 3 May 2006 and 3 May 2007. Its original image size is 450 × 140 pixels. The experiment used 155 bands. This article used random sampling to construct a neural network training set, with a training set proportion of 20.95%. The specific visualization images are shown in Figure 6e–h [45].

4.1.3. River Area (River)

This dataset covers images of river land types located in Jiangsu Province in China (Available from http://crabwq.github.io/ 16 August 2024). HS images were taken for two time periods: 3 May 2013 and 31 December 2013. Its original image size is 463 × 241 pixels. The experiment used 198 bands. The image space size was 463 × 241 pixels. This article used random sampling to construct a neural network training set, with a training set proportion of 3.36%. The specific visualization images are shown in Figure 6i–l [12].

4.2. Parameter Tuning

The programs in this article were all written in Python 3.6, and the CNN part was built using PyTorch. The performance of a network is influenced by its structural design. The number of epochs for network training was 200. The batch size was 32, and Adam served as the optimizer. This section evaluates the model’s performance by adjusting the hyperparameters, including the number of multi-scale features, input patch size, and learning rate. We used the overall accuracy (OA) and Kappa coefficient as evaluation metrics, with higher values indicating superior CD performance.

4.2.1. Multi-Scale Feature Number Comparison

The multi-scale feature fusion module aggregates features of different scales derived from deep representations, amalgamating them to enrich local information and thereby augment the model’s ability to discern intricate local details. Varied quantities of multi-scale features yield differing levels of feature enhancement, where a minimal number may render the enhancement effect insufficient, potentially overlooking critical local information. Conversely, an excessive quantity may introduce information redundancy. To further examine the impact of the multi-scale feature quantity on performance, we conducted experiments with varying quantities, comparing the effectiveness of inputting HS image blocks of different sizes in CD tasks. Figure 7 depicts the algorithm’s performance under different multi-scale feature numbers, highlighting an optimal number of four.

4.2.2. Input Patch Size Comparison

Selection of the patch size is pivotal in HS image analysis, as it determines the model’s spatial perception range and influences its attention to contextual information and local details. Larger patch sizes are advantageous for capturing global features, providing better comprehension of the overall background and context. Conversely, smaller patch sizes are more effective for focusing on local details and achieving finer granularity. To further investigate the impact of the patch size on performance, we conducted a series of experiments comparing HS image blocks of various sizes in CD tasks. Figure 8 illustrates the experimental results using patch sizes, such as 7 × 7, 9 × 9, 11 × 11, and 13 × 13, under the proposed method. The figure demonstrates that the patch size significantly influenced the algorithm performance, with an optimal patch size of 9 × 9 identified.

4.2.3. Learning Rate Comparison

The magnitude of the neural network’s learning rate plays a pivotal role in DL. The selection of an appropriate learning rate profoundly influences the training speed, stability, and ultimate performance of the model. A suitable learning rate can expedite model convergence, enhance training stability, and facilitate better generalization to novel data. Conversely, excessively large or small learning rates may precipitate training challenges, including convergence issues, overfitting, and model instability. Optimal learning rate selection is paramount for enhancing model convergence speed, stability, and generalization performance. Figure 9 depicts the algorithm’s performance across varying learning rates, highlighting an optimal rate of 0.0005.

4.3. Comparison and Analysis of CD Performance

To assess the effectiveness of MsFNet, a comparative analysis was performed against SOTA methods, including change vector analysis (CVA) [4], deep slow feature analysis (DSFA) [46], multiple morphological profiles (MMP) [9], the attention-guided feature fusion network (AgF2Net) [31], a double-branch dual-attention network (DBDA) [47], a multiscale diff-changed feature fusion network (MSDFFN) [48], the gated transmitting-based multi-scale Siamese network (GTMSiam) [23] and the cross-temporal interaction symmetric attention network (CSANet) [21]. Among these methods, CVA, DSFA, and MMP employ unsupervised learning, while the remaining approaches are supervised DL methods. Throughout the experiment, the parameters yielding the best results in the reference literature were utilized to ensure fairness in the comparative analysis. The number of training and testing samples remained consistent across all methods.

The experimental results depicted in Figure 10, Figure 11 and Figure 12 showcase the advantages of the proposed method, MsFNet, across three datasets. The IA dataset primarily documents the temporal and spatial progression of agricultural land, characterized by relatively straightforward developments. Consequently, the texture details within this dataset are relatively sparse. Nonetheless, the MsFNet results exhibited clearer and more defined edges, with fewer instances of pixel misjudgment or omission compared with the GT map. Similarly, in the WA dataset, MsFNet demonstrated minimal blurring or confusion when contrasted with the comparison algorithms. Few cases of pixel misjudgment or omission relative to the GT map underscore its ability to effectively discern between changing and unchanged pixels, showcasing a degree of superiority. Notably, the superiority of the proposed method was most evident in the River dataset. The residential area in the upper right corner of the dataset has a large number of scattered change points, which is highly challenging for change detection tasks. It can be seen that many comparison algorithms mistakenly classified the invariant points as changes, while MsFNet enhanced its ability to perceive and capture complex spectral features in HS images by using a dynamic convolution module based on spectral attention. In addition, the multi-scale fusion module improved its ability to sense local information, enabling it to better identify subtle changes in residential areas than the comparison algorithms, thereby improving the performance of change detection.

In order to evaluate the advantages and disadvantages of different algorithms, we selected the overall accuracy (OA), kappa coefficient (Kappa), precision, recall, and F1-score as objective evaluation indicators [49,50]. The specific description of each objective evaluation index is as follows:

(1) The OA is the most intuitive evaluation criterion, representing the proportion of samples correctly classified by the model. Its formula is

OA = \frac{TP + TN}{TP + TN + FP + FN}

(12)

where true positive (TP) denotes the number of correctly detected changed pixels, true negative (TN) represents the number of correctly identified unchanged pixels, false positive (FP) refers to the number of false-alarm pixels, and false negative (FN) indicates the number of missed changed pixels.

(2) The kappa coefficient provides a more robust accuracy measurement than the OA, evaluating model performance by comparing the observed accuracy with the expected accuracy under random conditions. More suitable for change detection tasks with imbalanced categories, it can better reflect the true performance of the model and is the most reflective indicator of changes in detection performance. Its formula is

P_{C} = \frac{(TP + FP) (TP + FN) + (FN + TN) (FP + TN)}{{(TP + FP + TN + FN)}^{2}}

(13)

Kappa = \frac{OA - P_{C}}{1 - P_{C}}

(14)

(3) Precision is the ratio of the number of correctly detected changed pixels to the total number of detected changed pixels. A higher precision indicates fewer FPs, meaning the model is better at avoiding incorrectly classifying unchanged pixels as changed. Its formula is

\Pr = \frac{TP}{TP + FP}

(15)

(4) Recall is the ratio of the number of correctly detected changed pixels to the number of all actually changed pixels. A higher recall indicates that the model is better at capturing all changed pixels, even if it might also include more FPs. Its formula is

Re = \frac{TP}{TP + FN}

(16)

(5) The F1 score is the harmonic mean of the accuracy and recall, taking into account both the accuracy and completeness of the model. The reason for proposing this indicator is precisely because there is a contradiction between the accuracy and recall which makes it difficult to judge the quality of the model. When both indicators need to be optimized simultaneously, the F1 score provides a balanced evaluation. Its formula is

F 1 = 2 \times \frac{Pr \times Re}{Pr + Re}

(17)

From Table 1, Table 2 and Table 3, it can be seen that the proposed algorithm achieved the best results in terms of OA, Kappa, and F1-score. The best results are highlighted in bold, while the second best results are highlighted in underline. For CD tasks, the kappa coefficient takes into account both the classification accuracy and expected accuracy of random opportunities, which can more accurately evaluate algorithm performance. The accuracy was the best indicator on both datasets and the second best for the River dataset, which proves the reliability of the CD results. Compared with the recall rate, the results were slightly weaker, being second best on both datasets and slightly weaker on the River dataset, indicating that the model had some minor missed detections. In order to comprehensively consider the accuracy and completeness of the model, the F1-score was used to measure the performance of the model. MsFNet showed significantly enhanced F1 scores in all three datasets, highlighting its advantage in CD tasks. MsFNet showed a significantly enhanced Kappa and F1 score in all three datasets, highlighting its advantage in CD tasks. The introduction of a spectral dynamic convolution module in MsFNet enhanced the adaptability of the network in emphasizing key bands, thereby improving the network’s ability to recognize and capture complex spectral features in HS images. It could better utilize the rich spectral information of HS images, thereby increasing the quality of the extracted spectral features. In addition, MsFNet also introduced a multi-scale feature fusion module to extract and integrate features of different scales from deep representations, enhancing the model’s ability to recognize local features and improving the quality of the spatial features compared with single-scale features. The collaborative application of these modules improved the quality of feature extraction from both spectral and spatial dimensions, thereby enhancing the accuracy of the CD.

Meanwhile, we illustrate the confusion matrix of MsFNet across three datasets, shown in Figure 13. These matrices expedited the analysis of misclassification scenarios within each category. Predominantly, the correct classification of most pixels was observed. In the context of CD tasks, the primary objective was to identify a reduced number of change points, which emphasized accuracy. The principal errors entailed misclassifying change pixels as invariant pixels, while the likelihood of misclassifying invariant pixels as change pixels was exceedingly low. This observation underscores the suitability of the proposed algorithm for CD tasks.

To evaluate the robustness of the proposed model in scenarios with limited training data, systematic experiments were conducted using various sample sizes. The detailed experimental results are illustrated in Figure 14. Notably, despite the scarcity of training data, the proposed approach exhibited encouraging performance. As the number of training samples grew, greater consistency in the change detection outcomes was observed, consistently favoring MsFNet over comparative algorithms. In the realm of change detection tasks for multi-temporal hyperspectral imagery, where acquiring labeled samples incurred substantial costs and workload, MsFNet demonstrated heightened accuracy with reduced labeling requirements, thereby bolstering its competitive advantage.

4.4. Ablation Study

The proposed neural network framework incorporates a spectral dynamic feature extraction module and a multi-scale feature fusion module. These modules adjust the receptive field size adaptively based on the features of distinct bands. Unlike the original dynamic convolution approach, the spectral attention mechanism emphasizes inter-band relationships, enhancing the network’s flexibility to prioritize significant bands. This improvement augments the model’s capability to perceive and capture intricate spectral features in HS images. Additionally, the multi-scale feature extraction operates on deep feature maps and integrates multi-scale features to enrich local information within the feature maps, thereby enhancing the model’s perception of local details. To demonstrate the effectiveness of the dynamic feature extraction module and the multi-scale feature fusion module, we controlled the use of the dynamic feature extraction module and the multi-scale feature fusion module. Table 4 shows the results of the ablation experiment. The best results are highlighted in bold. It is not difficult to see that these two modules had a significant impact on improving the performance of the algorithm proposed in this chapter. The combined use of the two modules could achieve the best results without the phenomenon of mutual exclusion. To further confirm the effectiveness of our proposed modules, we conducted separate ablation studies on each module. The results of the ablation experiment further confirmed the effectiveness of the proposed module.

As the two proposed modules may bring about computing challenges, the calculation time and calculation results of each proposed module were compared, as shown in Figure 15. It can be seen that although the two proposed modules did require some additional computing resources, the performance improvement was greater than the linear improvement, which made up for this shortcoming. Given the improved performance of the CD, this additional computing resource is acceptable.

4.4.1. Effect of Spectral Dynamic Feature Extraction Module

The spectral dynamic feature extraction module proposed in this article adaptively adjusts the receptive field size based on the features of different bands. Compared with the attention in the original dynamic convolution (DC), the use of spectral attention focuses more on processing the relationships between different bands, making the network more flexible in focusing on more important bands and improving the model’s perception and ability to capture complex spectral features in HS images. To validate the presence of information transmission within the proposed module, we performed ablation experiments. The results of these experiments are presented in Table 5. The data reveal that the DC module, which utilizes spectral attention, achieved the most favorable outcomes, thereby demonstrating the efficacy of the proposed module.

Figure 16 depicts the visual outcomes from the ablation experiments involving spectral dynamic convolution across three datasets. The baseline results exhibited notably deficient detection of detailed change points. Moreover, the traditional dynamic convolution experiments yielded a considerable number of false alarms, which was particularly prominent in the WA dataset, indicating suboptimal performance. In contrast, visualizations employing spectral dynamic convolution exhibited clearer boundaries, underscoring the module’s efficacy in enhancing detection precision.

4.4.2. Effect of Multi-Scale Feature Fusion Module

The multi-scale feature fusion module extracts diverse scale features from deep feature maps and integrates them to enhance local information within the feature maps, thereby augmenting the model’s local perception capability. To validate its effectiveness, ablation experiments were conducted across three datasets, with the results detailed in Table 6. The findings demonstrate substantial performance enhancements attributed to the module’s application, affirming its rationale and practical utility.

Figure 17 displays the visual results from the three datasets to highlight the role of the multi-scale feature fusion module in MsFNet. The absence of the module revealed numerous false positives in the results. Specifically, in the WA dataset, the upper right region incorrectly identified numerous unchanged pixel points as changing, while in the River dataset, the middle right region misclassified numerous changing pixel points as unchanged. Integration of the multi-scale feature fusion module enabled pixel discrimination across multiple scales, thereby mitigating false positives and enhancing the results’ accuracy.

At the same time, we conducted ablation experiments on commonly used upsampling methods, such as the 2× nearest neighbor method (2× NN), linear interpolation method (LI), and weighted average interpolation method (WAI), and the results are shown in Table 7. It can be seen that the 2× nearest neighbor method achieved the best results on all three datasets. The linear interpolation method increased the dimension according to the weights of adjacent points, thus making the edges of image objects smoother. However, this method made the edge of the target incredibly smooth, weakened the difference between the edge and the background, and was not conducive to the edge detection of change information. Thus, it is not suitable for the feature extraction of change detection tasks. The weighted average interpolation method estimates the new pixel value with the weighted average of the surrounding pixels. Although weighted average interpolation has its advantages, it also has the disadvantages of edge blurring and excessive smoothing, and weighted average interpolation is more sensitive to noise in an image.

5. Conclusions

In this article, we proposed a novel multi-scale fusion network based on dynamic spectral features (MsFNet) for multi-temporal HS image CD. MsFNet adopts a dual-branch network to extract the features of multi-temporal phases. At the same time, in order to accurately capture complex spectral features and consider them at multiple scales, a spectral dynamic convolution module is used. By introducing a spectral attention mechanism, the adaptive adjustment of features in different bands was achieved, improving the model’s perception and ability to capture complex spectral features in HS data. The multi-scale feature fusion module operates on deep features, enhancing the model’s perception of local information via multi-scale feature extraction and fusion. Through experiments on three real-world HS image CD datasets, the results show that MsFNet exhibited significant superiority in terms of both effectiveness and performance compared with current advanced CD methods. The source code of the MsFNet will be public at https://github.com/SYFYN0317/MsFNet 16 August 2024.

Author Contributions

Formal analysis and writing—original draft preparation, Y.F.; Investigation validation, Data curation, W.N.; Supervision, Writing—review and editing, L.S.; Conceptualization and methodology, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 42371338) and the Key Scientific Research Project of Liaoning Provincial Education Department (Grant No.JYTZD2023101).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

Zhang, L.; Wu, C. Advance and future development of change detection for multi-temporal remote sensing imagery. Acta Geod. Cartogr. Sin. 2017, 46, 1447. [Google Scholar]
Song, R.; Feng, Y.; Cheng, W.; Wang, X. Advance in Hyperspectral Images Change Detection. Spectrosc. Spectr. Anal. 2023, 43, 2354–2362. [Google Scholar]
Liu, S.; Marinelli, D.; Bruzzone, L.; Bovolo, F. A review of change detection in multitemporal hyperspectral images: Current techniques, applications, and challenges. IEEE Geosci. Remote Sens. Mag. 2019, 7, 140–158. [Google Scholar] [CrossRef]
Johnson, R.D.; Kasischke, E. Change vector analysis: A technique for the multispectral monitoring of land cover and condition. Int. J. Remote Sens. 1998, 19, 411–426. [Google Scholar] [CrossRef]
Deng, J.; Wang, K.; Deng, Y.; Qi, G. PCA-based land-use change detection and analysis using multitemporal and multisensor satellite data. Int. J. Remote Sens. 2008, 29, 4823–4838. [Google Scholar] [CrossRef]
Celik, T. Unsupervised change detection in satellite images using principal component analysis and k-means clustering. IEEE Geosci. Remote Sens. Lett. 2009, 6, 772–776. [Google Scholar] [CrossRef]
Wu, C.; Zhang, L.; Du, B. Targeted change detection for stacked multi-temporal hyperspectral image. In Proceedings of the 2012 4th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), Shanghai, China, 4–7 June 2012; pp. 1–4. [Google Scholar]
Ertürk, A.; Ertürk, S.; Plaza, A. Unmixing with SLIC superpixels for hyperspectral change detection. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 11–15 July 2016; pp. 3370–3373. [Google Scholar]
Hou, Z.; Li, W.; Li, L.; Tao, R.; Du, Q. Hyperspectral change detection based on multiple morphological profiles. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5507312. [Google Scholar] [CrossRef]
Shi, W.; Zhang, M.; Zhang, R.; Chen, S.; Zhan, Z. Change detection based on artificial intelligence: State-of-the-art and challenges. Remote Sens. 2020, 12, 1688. [Google Scholar] [CrossRef]
Shafique, A.; Cao, G.; Khan, Z.; Asad, M.; Aslam, M. Deep learning-based change detection in remote sensing images: A review. Remote Sens. 2022, 14, 871. [Google Scholar] [CrossRef]
Wang, Q.; Yuan, Z.; Du, Q.; Li, X. GETNET: A general end-to-end 2-D CNN framework for hyperspectral image change detection. IEEE Trans. Geosci. Remote Sens. 2018, 57, 3–13. [Google Scholar] [CrossRef]
Zhan, T.; Song, B.; Sun, L.; Jia, X.; Wan, M.; Yang, G.; Wu, Z. TDSSC: A three-directions spectral–spatial convolution neural network for hyperspectral image change detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 14, 377–388. [Google Scholar] [CrossRef]
Moustafa, M.S.; Mohamed, S.A.; Ahmed, S.; Nasr, A.H. Hyperspectral change detection based on modification of UNet neural networks. J. Appl. Remote Sens. 2021, 15, 028505. [Google Scholar] [CrossRef]
Zhan, T.; Song, B.; Xu, Y.; Wan, M.; Wang, X.; Yang, G.; Wu, Z. SSCNN-S: A spectral-spatial convolution neural network with Siamese architecture for change detection. Remote Sens. 2021, 13, 895. [Google Scholar] [CrossRef]
Song, R.; Feng, Y.; Xing, C.; Mu, Z.; Wang, X. Hyperspectral image change detection based on active convolutional neural network and spatial–spectral affinity graph learning. Appl. Soft Comput. 2022, 125, 109130. [Google Scholar] [CrossRef]
Dong, W.; Zhao, J.; Qu, J.; Xiao, S.; Li, N.; Hou, S.; Li, Y. Abundance matrix correlation analysis network based on hierarchical multihead self-cross-hybrid attention for hyperspectral change detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5501513. [Google Scholar] [CrossRef]
Dong, W.; Yang, Y.; Qu, J.; Xiao, S.; Li, Y. Local information-enhanced graph-transformer for hyperspectral image change detection with limited training samples. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5509814. [Google Scholar] [CrossRef]
Li, Y.; Ren, J.; Yan, Y.; Liu, Q.; Ma, P.; Petrovski, A.; Sun, H. CBANet: An end-to-end cross-band 2-D attention network for hyperspectral change detection in remote sensing. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5513011. [Google Scholar] [CrossRef]
Wang, X.; Zhao, K.; Zhao, X.; Li, S. CSDBF: Dual-branch framework based on temporal–spatial joint graph attention with complement strategy for hyperspectral image change detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5540118. [Google Scholar] [CrossRef]
Song, R.; Ni, W.; Cheng, W.; Wang, X. CSANet: Cross-temporal interaction symmetric attention network for hyperspectral image change detection. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6010105. [Google Scholar] [CrossRef]
Zhao, X.; Li, S.; Geng, T.; Wang, X. GTransCD: Graph Transformer-Guided Multitemporal Information United Framework for Hyperspectral Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5500313. [Google Scholar] [CrossRef]
Wang, X.; Zhao, K.; Zhao, X.; Li, S. GTMSiam: Gated Transmitting Based Multiscale Siamese Network for Hyperspectral Image Change Detection. IEEE Geosci. Remote. Sens. Lett. 2023, 20, 5511805. [Google Scholar] [CrossRef]
Wang, X.; Zhao, K.; Zhao, X.; Li, S. TriTF: A triplet transformer framework based on parents and brother attention for hyperspectral image change detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5507213. [Google Scholar] [CrossRef]
Le Hégarat-Mascle, S.; Seltz, R. Automatic change detection by evidential fusion of change indices. Remote Sens. Environ. 2004, 91, 390–404. [Google Scholar] [CrossRef]
Nemmour, H.; Chibani, Y. Multiple support vector machines for land cover change detection: An application for mapping urban extensions. ISPRS J. Photogramm. Remote Sens. 2006, 61, 125–133. [Google Scholar] [CrossRef]
Wang, B.; Choi, S.; Byun, Y.; Lee, S.; Choi, J. Object-based change detection of very high resolution satellite imagery using the cross-sharpening of multitemporal data. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1151–1155. [Google Scholar] [CrossRef]
Du, P.; Liu, S.; Xia, J.; Zhao, Y. Information fusion techniques for change detection from multi-temporal remote sensing images. Inf. Fusion 2013, 14, 19–27. [Google Scholar] [CrossRef]
Lyu, H.; Lu, H.; Mou, L. Learning a transferable change rule from a recurrent neural network for land cover change detection. Remote Sens. 2016, 8, 506. [Google Scholar] [CrossRef]
Song, A.; Choi, J.; Han, Y.; Kim, Y. Change detection in hyperspectral images using recurrent 3D fully convolutional networks. Remote Sens. 2018, 10, 1827. [Google Scholar] [CrossRef]
Wang, X.; Ni, W.; Feng, Y.; Song, L. AgF2Net: Attention-guided Feature Fusion Network for Multi-temporal Hyperspectral Image Change Detection. IEEE Geosci. Remote. Sens. Lett. 2023, 20, 5507805. [Google Scholar]
Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503. [Google Scholar] [CrossRef]
Chen, G.; Yang, W.; Liu, Y.; Wang, H.; Huang, X. Salt structure elastic full waveform inversion based on the multiscale signed envelope. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4508912. [Google Scholar] [CrossRef]
Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. Remote Sens. 2020, 166, 183–200. [Google Scholar] [CrossRef]
Chen, J.; Chen, G.; Li, J.; Du, R.; Qi, Y.; Li, C.; Wang, N. Efficient seismic data denoising via deep learning with improved MCA-SCUNet. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5903614. [Google Scholar] [CrossRef]
Guo, Q.; Zhang, J.; Li, T.; Lu, X. Change detection for high-resolution remote sensing imagery based on multi-scale segmentation and fusion. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 1919–1922. [Google Scholar]
Chen, H.; Wu, C.; Du, B.; Zhang, L. Deep Siamese multi-scale convolutional network for change detection in multi-temporal VHR images. In Proceedings of the 2019 10th International Workshop on the Analysis of Multitemporal Remote Sensing Images (MultiTemp), Shanghai, China, 5–7 August 2019; pp. 1–4. [Google Scholar]
Yu, X.; Fan, J.; Chen, J.; Zhang, P.; Zhou, Y.; Han, L. NestNet: A multiscale convolutional neural network for remote sensing image change detection. Int. J. Remote Sens. 2021, 42, 4898–4921. [Google Scholar] [CrossRef]
Lu, D.; Cheng, S.; Wang, L.; Song, S. Multi-scale feature progressive fusion network for remote sensing image change detection. Sci. Rep. 2022, 12, 11968. [Google Scholar] [CrossRef] [PubMed]
Li, W.; Xue, L.; Wang, X.; Li, G. ConvTransNet: A CNN-transformer network for change detection with multi-scale global-local representations. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5610315. [Google Scholar] [CrossRef]
Xu, X.; Yang, Z.; Li, J. AMCA: Attention-guided multi-scale context aggregation network for remote sensing image change detection. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 5908619. [Google Scholar] [CrossRef]
Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11030–11039. [Google Scholar]
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
Hasanlou, M.; Seydi, S.T. Hyperspectral change detection: An experimental comparative study. Int. J. Remote Sens. 2018, 39, 7029–7083. [Google Scholar] [CrossRef]
Du, B.; Ru, L.; Wu, C.; Zhang, L. Unsupervised deep slow feature analysis for change detection in multi-temporal remote sensing images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 9976–9992. [Google Scholar] [CrossRef]
Li, R.; Zheng, S.; Duan, C.; Yang, Y.; Wang, X. Classification of hyperspectral image based on double-branch dual-attention mechanism network. Remote Sens. 2020, 12, 582. [Google Scholar] [CrossRef]
Luo, F.; Zhou, T.; Liu, J.; Guo, T.; Gong, X.; Ren, J. Multiscale diff-changed feature fusion network for hyperspectral image change detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5502713. [Google Scholar] [CrossRef]
Yao, Q.; Zhou, Y.; Tang, C.; Xiang, W.; Zheng, G. End-to-end Hyperspectral Image Change Detection Based on Band Selection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5617614. [Google Scholar] [CrossRef]
Jian, P.; Ou, Y.; Chen, K. Uncertainty Aware Graph Self-Supervised Learning for Hyperspectral Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5509019. [Google Scholar] [CrossRef]

Figure 1. Overview of the proposed MsFNet.

Figure 2. Schematic diagram of spectral attention module.

Figure 3. Schematic diagram of the spectral dynamic convolution module.

Figure 4. Schematic diagram of the multi-scale feature fusion module.

Figure 5. The 2× nearest neighbor upsampling process.

Figure 6. Dual temporal HS image cubes, with label distribution of torfa itnraininginagn danteds ting samples. Among them, black represents unchanged pixels, white represents changed pixels, and the green part is the background. (a–d) IA dataset. (e–h) WA dataset. (i–l) River dataset.

Figure 7. Performance comparison of different multi-scale feature numbers. (a) OA. (b) Kappa.

Figure 8. Performance comparison of different patch sizes. (a) OA. (b) Kappa.

Figure 9. Performance comparison of different learning rates. (a) OA. (b) Kappa.

Figure 10. CD maps of IA dataset under different methods. (a) Pseudo-color image for the before temporal HS image. (b) Pseudo-color image for the after temporal HS image. (c) CVA. (d) DSFA. (e) MMP. (f) AgF2Net. (g) DBDA. (h) MSDFFN. (i) GTMSiam. (j) CSANet. (k) Proposed model. (l) GT.

Figure 11. CD maps of WA dataset under different methods. (a) Pseudo-color image for the before temporal HS image. (b) Pseudo-color image for the after temporal HS image. (c) CVA. (d) DSFA. (e) MMP. (f) AgF2Net. (g) DBDA. (h) MSDFFN. (i) GTMSiam. (j) CSANet. (k) Proposed model. (l) GT.

Figure 12. CD maps of River dataset under different methods. (a) Pseudo-color image for the before temporal HS image. (b) Pseudo-color image for the after temporal HS image. (c) CVA. (d) DSFA. (e) MMP. (f) AgF2Net. (g) DBDA. (h) MSDFFN. (i) GTMSiam. (j) CSANet. (k) Proposed model. (l) GT.

Figure 13. Confusion matrix for (a) IA, (b) WA, and (c) River datasets.

Figure 14. OA of different methods with varying training sample sizes: (a) IA, (b) WA, and (c) River datasets.

Figure 15. Comparison of CD results and calculation times: (a) IA, (b) WA, and (c) River datasets.

Figure 16. Comparison of ablation results of spectral dynamic feature extraction module.

Figure 17. Comparison of ablation results of multi-scale feature fusion module.

Table 1. Evaluation indicators of IA dataset under different methods.

Method	OA	Kappa	Precision	Recall	F1 Score
CVA	0.931	0.782	0.782	0.898	0.837
DSFA	0.949	0.834	0.882	0.905	0.894
MMP	0.922	0.799	0.921	0.893	0.907
AgF2Net	0.971	0.918	0.937	0.938	0.938
DBDA	0.963	0.892	0.915	0.897	0.905
MSDFFN	0.967	0.905	0.919	0.924	0.921
GTMSiam	0.963	0.892	0.939	0.908	0.927
CSANet	0.969	0.913	0.947	0.911	0.931
Proposed	0.973	0.922	0.947	0.934	0.941

Table 2. Evaluation indicators of WA dataset under different methods.

Method	OA	Kappa	Precision	Recall	F1 Score
CVA	0.790	0.499	0.478	0.869	0.617
DSFA	0.783	0.673	0.626	0.816	0.708
MMP	0.960	0.902	0.988	0.985	0.986
AgF2Net	0.983	0.959	0.986	0.989	0.988
DBDA	0.978	0.947	0.942	0.977	0.959
MSDFFN	0.982	0.957	0.948	0.983	0.965
GTMSiam	0.956	0.914	0.962	0.956	0.960
CSANet	0.979	0.950	0.962	0.941	0.951
Proposed	0.984	0.961	0.992	0.986	0.989

Table 3. Evaluation indicators of River dataset under different methods.

Method	OA	Kappa	Precision	Recall	F1 Score
CVA	0.812	0.337	0.761	0.815	0.787
DSFA	0.942	0.727	0.603	0.870	0.712
MMP	0.949	0.728	0.716	0.828	0.767
AgF2Net	0.967	0.773	0.872	0.663	0.753
DBDA	0.959	0.726	0.748	0.726	0.737
MSDFFN	0.963	0.756	0.700	0.891	0.785
GTMSiam	0.968	0.780	0.659	0.801	0.723
CSANet	0.963	0.762	0.839	0.735	0.786
Proposed	0.973	0.795	0.844	0.769	0.802

Table 4. Ablation study of proposed modules.

Dateset	SDFEM	MsFFM	OA	Kappa	Precision	Recall	F1 Score
IA	×	×	0.955	0.874	0.895	0.942	0.917
	✓	×	0.967	0.892	0.896	0.961	0.927
	×	✓	0.961	0.878	0.936	0.949	0.940
	✓	✓	0.973	0.922	0.947	0.934	0.941
WA	×	×	0.960	0.912	0.971	0.942	0.955
	✓	×	0.977	0.934	0.982	0.960	0.968
	×	✓	0.973	0.927	0.984	0.944	0.964
	✓	✓	0.984	0.961	0.992	0.986	0.989
River	×	×	0.952	0.709	0.785	0.791	0.787
	✓	×	0.955	0.753	0.797	0.794	0.793
	×	✓	0.964	0.752	0.818	0.749	0.782
	✓	✓	0.973	0.795	0.844	0.769	0.802

Table 5. Ablation study of spectral dynamic feature extraction module.

Dataset		Without DC	With Original DC	With Spectral DC
IA	OA	0.961	0.970	0.973
	Kappa	0.878	0.919	0.922
	Precision	0.918	0.913	0.947
	Recall	0.960	0.955	0.934
	F1 score	0.937	0.934	0.941
WA	OA	0.973	0.982	0.984
	Kappa	0.927	0.959	0.961
	Precision	0.984	0.987	0.992
	Recall	0.944	0.959	0.986
	F1 score	0.964	0.972	0.989
River	OA	0.964	0.972	0.973
	Kappa	0.752	0.789	0.795
	Precision	0.818	0.819	0.844
	Recall	0.749	0.759	0.769
	F1 score	0.782	0.789	0.802

Table 6. Ablation study of multi-scale feature fusion module.

Dataset		Without MsFF	With MsFF
IA	OA	0.967	0.973
	Kappa	0.892	0.922
	Precision	0.896	0.947
	Recall	0.961	0.934
	F1 score	0.927	0.941
WA	OA	0.977	0.984
	Kappa	0.934	0.961
	Precision	0.982	0.992
	Recall	0.960	0.986
	F1 score	0.968	0.989
River	OA	0.955	0.973
	Kappa	0.753	0.795
	Precision	0.797	0.844
	Recall	0.794	0.769
	F1 score	0.793	0.802

Table 7. Ablation study of upsampling methods.

Dataset		LI	WAI	2× NN
IA	OA	0.972	0.969	0.973
	Kappa	0.920	0.909	0.922
	Precision	0.935	0.918	0.947
	Recall	0.941	0.960	0.934
	F1 score	0.937	0.937	0.941
WA	OA	0.971	0.981	0.984
	Kappa	0.927	0.953	0.961
	Precision	0.982	0.981	0.992
	Recall	0.947	0.969	0.986
	F1 score	0.963	0.974	0.989
River	OA	0.972	0.965	0.973
	Kappa	0.789	0.774	0.795
	Precision	0.807	0.790	0.844
	Recall	0.793	0.792	0.769
	F1 score	0.799	0.789	0.802

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Feng, Y.; Ni, W.; Song, L.; Wang, X. MsFNet: Multi-Scale Fusion Network Based on Dynamic Spectral Features for Multi-Temporal Hyperspectral Image Change Detection. Remote Sens. 2024, 16, 3037. https://doi.org/10.3390/rs16163037

AMA Style

Feng Y, Ni W, Song L, Wang X. MsFNet: Multi-Scale Fusion Network Based on Dynamic Spectral Features for Multi-Temporal Hyperspectral Image Change Detection. Remote Sensing. 2024; 16(16):3037. https://doi.org/10.3390/rs16163037

Chicago/Turabian Style

Feng, Yining, Weihan Ni, Liyang Song, and Xianghai Wang. 2024. "MsFNet: Multi-Scale Fusion Network Based on Dynamic Spectral Features for Multi-Temporal Hyperspectral Image Change Detection" Remote Sensing 16, no. 16: 3037. https://doi.org/10.3390/rs16163037

APA Style

Feng, Y., Ni, W., Song, L., & Wang, X. (2024). MsFNet: Multi-Scale Fusion Network Based on Dynamic Spectral Features for Multi-Temporal Hyperspectral Image Change Detection. Remote Sensing, 16(16), 3037. https://doi.org/10.3390/rs16163037

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

MsFNet: Multi-Scale Fusion Network Based on Dynamic Spectral Features for Multi-Temporal Hyperspectral Image Change Detection

Abstract

1. Introduction

2. Related Work

2.1. RS-CD Based on Feature Fusion

2.2. Multi-Scale Learning

3. Methodology

3.1. Overall Architecture

3.2. Spectral Dynamic Feature Extraction Module (SDFEM)

3.3. Multi-Temporal Deep Feature Encoder

3.4. Multi-Scale Feature Fusion Module (MsFFM)

3.5. CD Module

3.6. Network Training

4. Experiments and Analysis

4.1. Dataset Description

4.1.1. Irrigated Agricultural Area (IA)

4.1.2. Wetland Agricultural Area (WA)

4.1.3. River Area (River)

4.2. Parameter Tuning

4.2.1. Multi-Scale Feature Number Comparison

4.2.2. Input Patch Size Comparison

4.2.3. Learning Rate Comparison

4.3. Comparison and Analysis of CD Performance

4.4. Ablation Study

4.4.1. Effect of Spectral Dynamic Feature Extraction Module

4.4.2. Effect of Multi-Scale Feature Fusion Module

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI