3.1. Framework of MHGAN
The primary objective of heterogeneous remote sensing image change detection is to quantitatively analyze and identify substantial changes on the Earth’s surface by comparing image data acquired at different times and from different sensors, while filtering out minor or low-confidence changes. A key challenge in this process is addressing “false changes”, which arise from factors such as noise, radiometric discrepancies, lighting variations, and adverse weather conditions. Drawing inspiration from the concept of image-to-image (I2I) translation [37,38], which aims to transform images with differing styles into a shared domain with uniform feature representations, heterogeneous remote sensing images can similarly be translated into a common domain. By doing so, data from different temporal and modal sources become directly comparable. In this context, heterogeneous remote sensing images from dual temporal instances are treated as stylistically distinct representations of the same geographic region.
Thus, the core task in heterogeneous remote sensing image change detection is to develop a robust one-to-one mapping mechanism for heterogeneous data. This approach should effectively establish relationships between images with different visual and spectral characteristics while preserving essential semantic and structural information. Moreover, the mapping process must minimize the impact of “false changes” to enhance the reliability and precision of CD outcomes. To address this, a new bidirectional adversarial autoencoder network model (MHGAN) based on image style transfer is proposed for data transformation of heterogeneous remote sensing images, enabling precise detection of change areas. Specifically, let $X$ and $Y$ represent a pair of input dual-temporal remote sensing images. The goal is to achieve two transformations $F$ and $G$, where $F: X \to Y$ and $G: Y \to X$, thus enabling data mapping between the image domains. Through this approach, input image $X$ (or $Y$) can be mapped to the opposite domain as $\hat{Y} = F(X)$ (or $\hat{X} = G(Y)$), allowing for CD by computing the difference image $d$ as a weighted average:
$$d = \frac{1}{2}\left[\frac{1}{C_X}\,\delta\!\left(X, \hat{X}\right) + \frac{1}{C_Y}\,\delta\!\left(Y, \hat{Y}\right)\right]$$
where $\delta(\cdot,\cdot)$ represents a pixel-level distance metric, for which the Euclidean distance is used owing to the computational cost of large-scale data; $C_X$ represents the number of channels in image $X$, and $C_Y$ represents the number of channels in image $Y$.
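To make the weighted average concrete, the following minimal sketch computes the difference image from the two inputs and their cross-domain translations. The PyTorch framework, the channel-first tensor layout, and the function name are assumptions made for illustration, not details specified in this paper.

```python
import torch

def difference_image(X, X_hat, Y, Y_hat):
    """Weighted-average difference image from a co-registered image pair and
    their cross-domain translations X_hat = G(Y), Y_hat = F(X).
    Assumes channel-first tensors of shape [C, H, W]."""
    C_X, C_Y = X.shape[0], Y.shape[0]
    # Per-pixel Euclidean distance, normalized by the channel count of each domain.
    d_X = torch.linalg.norm(X - X_hat, dim=0) / C_X   # [H, W]
    d_Y = torch.linalg.norm(Y - Y_hat, dim=0) / C_Y   # [H, W]
    return 0.5 * (d_X + d_Y)                          # difference image d, [H, W]
```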
To implement $F$ and $G$, a framework composed of two autoencoders is used, with each autoencoder associated with one of the image domains, $X$ and $Y$. The bidirectional autoencoder network model consists of two sets of encoder–decoder sub-networks:
The encoder $E_X$ and its corresponding decoder $D_X$, denoted as $D_X \circ E_X$;
The encoder $E_Y$ and its corresponding decoder $D_Y$, denoted as $D_Y \circ E_Y$.
The overall architecture is shown in Figure 1, which realizes domain transformation and feature alignment of heterogeneous images through the bidirectional mapping of the encoder–decoder pairs. The two encoder–decoder pairs are constructed using deep fully convolutional networks, where $Z_X$ and $Z_Y$ represent the encoding layers, or latent spaces, of encoders $E_X$ and $E_Y$, respectively, and denote the encoding-layer representations of the tensors in $X$ and $Y$.
With appropriate regularization, the bidirectional autoencoders can be jointly trained to learn the mapping of their inputs to the latent space domain. The latent features are then projected back into their original domains through the cascading of the encoder and corresponding decoder, producing high-fidelity reconstructions. Additionally, the data are mapped through the opposite decoder, which leads to the desired style transformation, as illustrated in
Figure 2.
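The paragraph above describes two same-domain reconstruction paths and two cross-domain translation paths obtained by routing each latent code through the opposite decoder. A minimal sketch of that wiring is given below, assuming PyTorch and single-layer placeholder blocks in place of the deep fully convolutional sub-networks; names such as conv_block and the latent width are illustrative.

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # Placeholder fully convolutional block; the actual layer configuration
    # of the encoders and decoders is not reproduced here.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())

class BidirectionalAutoencoder(nn.Module):
    """Two encoder-decoder pairs sharing a common latent dimensionality."""
    def __init__(self, c_x, c_y, c_latent=64):
        super().__init__()
        self.E_X, self.E_Y = conv_block(c_x, c_latent), conv_block(c_y, c_latent)
        self.D_X, self.D_Y = conv_block(c_latent, c_x), conv_block(c_latent, c_y)

    def forward(self, X, Y):
        Z_X, Z_Y = self.E_X(X), self.E_Y(Y)              # latent codes
        rec_X, rec_Y = self.D_X(Z_X), self.D_Y(Z_Y)      # same-domain reconstructions
        X_hat, Y_hat = self.D_X(Z_Y), self.D_Y(Z_X)      # cross-domain translations
        return Z_X, Z_Y, rec_X, rec_Y, X_hat, Y_hat
```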
However, without any external guidance, the feature representations of $X$ and $\hat{X}$, and of $Y$ and $\hat{Y}$, are typically not aligned, making it impossible to use them directly for generating difference maps. To address this, we introduce a loss term to enhance their transformation alignment. If the features of $X$ and $\hat{X}$, and of $Y$ and $\hat{Y}$, are successfully aligned, changes can be detected more accurately.
Specifically, the proposed MHGAN for heterogeneous remote sensing image change detection is based on two sets of autoencoder networks that fit the functions $F$ and $G$, respectively, to achieve domain transformation for dual-temporal remote sensing images. The training of the bidirectional autoencoders is carried out by minimizing a loss function with respect to the parameters of the constructed network. The loss function consists of four components: reconstruction loss, code correlation loss, graph attention loss, and adversarial loss.
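For orientation, a minimal sketch of how the four terms might be combined into a single training objective is given below; the weighting factors are illustrative placeholders and are not values reported for MHGAN.

```python
def total_loss(l_recon, l_code, l_graph, l_adv,
               w_recon=1.0, w_code=1.0, w_graph=1.0, w_adv=1.0):
    # Weighted sum of the four loss components; the weights shown here are
    # assumed for illustration only.
    return (w_recon * l_recon + w_code * l_code
            + w_graph * l_graph + w_adv * l_adv)
```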
3.3. Code Correlation Loss
To align the feature representations of the encoding layer, a specific loss term known as the code correlation loss [2] has been introduced, as depicted in Figure 3. The objective is to synchronize the code layers of the two autoencoders, treating them as a shared latent space. This allows the output of one encoder to serve as the input for both decoders, enabling the reconstruction of data within its native domain as well as its transformation into the other domain. When pixel pairs $(i, j)$ are similar in the input space, their distances in the code space should also be small. Conversely, when pixel pairs $(i, j)$ are dissimilar in the input space, their distances in the code space should be large.
For either modality $X$ or $Y$, the feature distance for all pixel pairs $(i, j)$ within the same training patch is defined as
$$D^{X}_{i,j} = d_X\!\left(x_i, x_j\right), \qquad D^{Y}_{i,j} = d_Y\!\left(y_i, y_j\right)$$
where $x_i$ and $x_j$ are the feature vectors of the $i$-th and $j$-th pixels in patch $p_X$, respectively; $y_i$ and $y_j$ are the feature vectors of the $i$-th and $j$-th pixels in patch $p_Y$, respectively. $d_X(\cdot,\cdot)$ and $d_Y(\cdot,\cdot)$ are specific distance metrics, with the Euclidean distance being the choice in this paper, formulated as
$$d_X\!\left(x_i, x_j\right) = \left\|x_i - x_j\right\|_2, \qquad d_Y\!\left(y_i, y_j\right) = \left\|y_i - y_j\right\|_2$$
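A compact way to obtain all pairwise distances of a patch at once is sketched below; the [N, C] pixel-feature layout is an assumption of this illustration.

```python
import torch

def pairwise_distances(patch):
    """Euclidean distances D[i, j] = ||x_i - x_j|| between all pixel pairs of
    one training patch, given as an [N, C] tensor of N pixel feature vectors."""
    return torch.cdist(patch, patch, p=2)   # [N, N]
```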
Based on the aforementioned distance, the affinity matrix for local pixel pairs is calculated as follows:
$$A^{X}_{i,j} = \exp\!\left(-\frac{\left(D^{X}_{i,j}\right)^{2}}{2\sigma^{2}}\right), \qquad A^{Y}_{i,j} = \exp\!\left(-\frac{\left(D^{Y}_{i,j}\right)^{2}}{2\sigma^{2}}\right)$$
where $A^{X}_{i,j}$ and $A^{Y}_{i,j}$ represent the similarity between pixel pairs (with values closer to 1 indicating higher similarity). The Gaussian kernel function smoothly maps distances in the feature space to similarity scores, enhancing robustness against noise. $\sigma$ is the kernel width; in CAA, $\sigma$ is set to a fixed value, typically the mean distance of the $k$-nearest neighbors. However, determining the kernel width in this manner depends on the ranked distances of neighboring points (e.g., the closest 25% of data points) and does not explicitly account for the density distribution of the entire image. Since data often exhibit varying densities across different regions, using a fixed kernel width may lead to the following issues:
High-density regions: If the kernel width is too large, local structural information may become blurred, making it difficult to capture fine details.
Low-density regions: If the kernel width is too small, the kernel function may fail to adequately cover the data, resulting in insufficient information.
To address this issue, this study adopts a density estimation-based approach, in which a dynamic kernel width is assigned to each pixel based on its local density. High-density regions are assigned smaller kernel widths, while low-density regions are assigned larger kernel widths, thereby better accommodating the non-uniform distribution of the data. The local density $\rho_i$ of each pixel $i$ is first calculated through kernel density estimation based on the surrounding pixels:
$$\rho_i = \frac{1}{N}\sum_{j=1}^{N}\exp\!\left(-\frac{D_{i,j}^{2}}{2}\right)$$
where $N$ represents the total number of surrounding pixels, and $D_{i,j}$ denotes the Euclidean distance between pixel $i$ and pixel $j$. The kernel width $\sigma_i$ is then calculated as the inverse of the local density:
$$\sigma_i = \frac{\beta}{\rho_i}$$
where $\beta$ is a tuning parameter determined via cross-validation on the training subsets of all four datasets. It controls the sensitivity of the kernel width to the local density: smaller values of $\beta$ (e.g., 0.1) make the kernel width more responsive to density variations, while larger values (e.g., 0.5) produce a smoother effect. Based on the validation results, the selected value of $\beta$ achieves a balance between capturing fine local details in high-density regions and ensuring robustness to noise in low-density areas.
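The following sketch combines the reconstructed formulas above into one routine. The default value of beta and the row-wise use of the per-pixel width sigma_i inside the Gaussian kernel are assumptions of this illustration.

```python
import torch

def dynamic_affinity(patch, beta=0.25):
    """Affinity matrix with a density-adaptive kernel width per pixel.
    `patch` is an [N, C] tensor of pixel feature vectors."""
    D = torch.cdist(patch, patch, p=2)                 # [N, N] pairwise distances
    rho = torch.exp(-D.pow(2) / 2).mean(dim=1)         # local density rho_i
    sigma = beta / (rho + 1e-8)                        # dynamic kernel width sigma_i
    return torch.exp(-D.pow(2) / (2 * sigma.unsqueeze(1).pow(2)))  # A[i, j]
```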
The affinity calculation used in this study is better adapted to the local characteristics of the data distribution. High-density regions capture finer local differences, while low-density regions mitigate the effects of excessive noise amplification. Compared to fixed kernel widths, dynamic kernel widths can more effectively reflect both local and global variations within the image. This method is particularly suitable for scenarios where significant distribution differences exist between modalities, such as SAR and optical imagery, infrared and multispectral imagery, etc.
Each row of the affinity matrix is considered as the feature representation of a pixel:
$$A^{X}_{i,\cdot} = \left[A^{X}_{i,1}, A^{X}_{i,2}, \ldots, A^{X}_{i,N}\right], \qquad A^{Y}_{j,\cdot} = \left[A^{Y}_{j,1}, A^{Y}_{j,2}, \ldots, A^{Y}_{j,N}\right]$$
where $A^{X}_{i,\cdot}$ is a vector that encompasses the similarity values of pixel $i$ with all other pixels in modality $X$. Similarly, $A^{Y}_{j,\cdot}$ contains the similarity values of pixel $j$ with all other pixels in modality $Y$. By calculating the cross-modal distance, the following distance in the cross-modal space is obtained:
$$D^{XY}_{i,j} = \frac{1}{N}\left\|A^{X}_{i,\cdot} - A^{Y}_{j,\cdot}\right\|_{1}$$
where $D^{XY}_{i,j} \in [0,1]$ represents the normalized distance that indicates the degree of similarity between pixels across modalities. The essence of this formula is that if two pixels exhibit similar patterns of similarity with other pixels in their respective modalities (i.e., their similarity vectors are alike), then the cross-modal distance between these two pixels should be small. Conversely, if there is a significant difference between the similarity vectors, the cross-modal distance will be larger. By representing each pixel with its affinity vector relative to other pixels, a relational feature is formed for each pixel. These relational features reflect the distribution characteristics of pixels within the local context. The features of a single pixel may not be sufficient to capture global or local changes, but considering its relationships (affinities) with other pixels provides a more comprehensive description of the structural information between pixels. These relational features are well suited for transfer to cross-modal alignment tasks.
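A short sketch of this cross-modal comparison of affinity rows is given below; the normalization by an L1 mean difference follows the reconstructed formula above and should be read as an assumption.

```python
import torch

def cross_modal_distance(A_X, A_Y):
    """Normalized cross-modal distance between the affinity-row (relational)
    features of pixels; entries lie in [0, 1] because the affinities do."""
    N = A_X.shape[1]
    # Entry (i, j) compares row i of A_X with row j of A_Y.
    return torch.cdist(A_X, A_Y, p=1) / N
```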
To ensure that the similarity in the input space is maintained in the encoding layer, a similarity matrix $S^{XY}$ is defined as follows:
$$S^{XY}_{i,j} = 1 - D^{XY}_{i,j}$$
As Figure 3 illustrates, $S^{XY}_{i,j} \in [0,1]$. The encodings in the encoding layer are denoted as $z^{X}_{i}$ and $z^{Y}_{j}$, corresponding to the encodings of pixel $i$ in modality $X$ and pixel $j$ in modality $Y$, respectively. By optimizing the encoding alignment loss, the similarity between $z^{X}_{i}$ and $z^{Y}_{j}$ is constrained to be consistent with $S^{XY}_{i,j}$:
$$\hat{S}^{XY}_{i,j} = \cos\!\left(z^{X}_{i}, z^{Y}_{j}\right) = \frac{z^{X}_{i}\cdot z^{Y}_{j}}{\left\|z^{X}_{i}\right\|_{2}\left\|z^{Y}_{j}\right\|_{2}}$$
where $\hat{S}^{XY}_{i,j}$ represents the cosine similarity between the encodings, with the goal of making $\hat{S}^{XY}_{i,j} \approx S^{XY}_{i,j}$. The cosine similarity $\hat{S}^{XY}_{i,j}$ between pixel pairs in the encoding space serves as the representation of similarity at the encoding layer.
Ultimately, the similarity $\hat{S}^{XY}$ between pixel pairs in the encoding space is enforced to be consistent with the similarity $S^{XY}$ in the input space. The code correlation loss $L_{\mathrm{code}}$ is defined as
$$L_{\mathrm{code}} = \mathbb{E}\!\left[\left\|\hat{S}^{XY} - S^{XY}\right\|_{F}^{2}\right]$$
where $\hat{S}^{XY}$ represents the similarity matrix of all pixel pairs in the encoding layer, and $S^{XY}$ denotes the similarity matrix of pixel pairs between the input images $X$ and $Y$. This loss function $L_{\mathrm{code}}$ is the expected value of the squared Frobenius norm of the difference between the similarity matrices $\hat{S}^{XY}$ and $S^{XY}$, which encourages the encodings to preserve the input-space similarities.
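A minimal per-patch sketch of this loss is shown below, assuming the pixel encodings are flattened to [N, d] and the input-space similarity matrix has already been computed as described above.

```python
import torch
import torch.nn.functional as F

def code_correlation_loss(Z_X, Z_Y, S_input):
    """Code correlation loss: cosine similarities between the two code layers
    are pushed toward the input-space similarity matrix.
    Z_X, Z_Y: [N, d] pixel encodings; S_input: [N, N] target similarities."""
    S_code = F.normalize(Z_X, dim=1) @ F.normalize(Z_Y, dim=1).t()  # cosine similarities
    # Squared Frobenius mismatch, averaged over all pixel pairs of the patch.
    return (S_code - S_input).pow(2).mean()
```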
3.4. Graph Attention Loss
As illustrated in
Figure 2, the methodology for data transformation employing a unidirectional convolutional autoencoder is outlined. For image $X$, a random image patch $p_X$ is selected and encoded into the latent space using encoder $E_X$. Subsequently, the decoder $D_Y$, associated with image $Y$, is utilized to decode the features from the latent space, resulting in an image patch $\hat{p}_Y$ that mirrors the style of image $Y$. This transformation is mathematically expressed as $\hat{p}_Y = D_Y\!\left(E_X\!\left(p_X\right)\right)$. Analogously, for image $Y$, an image patch $p_Y$ is randomly extracted and encoded using encoder $E_Y$. Then, the decoder $D_X$, corresponding to image $X$, decodes the features to generate an image patch $\hat{p}_X$ that aligns with the style of image $X$. This process is represented as $\hat{p}_X = D_X\!\left(E_Y\!\left(p_Y\right)\right)$. To synchronize the image representations of $\hat{p}_Y$ with $p_Y$ and of $\hat{p}_X$ with $p_X$, a novel structure based on the MHGAM is introduced, as shown in Figure 4.
This approach leverages the MHGAM to uncover latent spatial feature relationships within the spatial domain. It is independent of manual labeling and does not require specific prior knowledge. The assumption is that regions that remain unchanged can capture a greater degree of spatial correlation, thereby enhancing the effectiveness of feature representation.
Given encoders $E_X$ and $E_Y$ producing feature maps $Z_X$ and $Z_Y$, respectively, these feature maps are subjected to a linear transformation using a $1\times 1$ convolutional kernel to derive node representations. This transformation results in the creation of a graph that represents the relationships between the nodes, where $C$ denotes the number of feature channels. Let $V_X$ represent the set of all nodes contained within $Z_X$ and $V_Y$ represent the set of all nodes contained within $Z_Y$. A bipartite graph is constructed to model the images before and after the change for each node pair $(m, n)$, where $m \in V_X$ and $n \in V_Y$, and the coefficient $e_{m,n}$ signifies the correlation between nodes $m$ and $n$. The feature vectors $h^{X}_{m}$ and $h^{Y}_{n}$ represent the nodes $m$ and $n$ in the latent space. The greater the similarity between the features of the images before and after the change, the higher the likelihood that the region remains unchanged. Consequently, more spatial correlation information should be conveyed to such regions. The correlation $e_{m,n}$ is thus defined to be directly proportional to the similarity of the feature vectors of nodes $m$ and $n$. The inner product of the feature vectors is used to calculate this similarity:
$$e_{m,n} = \left(W_X h^{X}_{m}\right)^{\!\top}\left(W_Y h^{Y}_{n}\right)$$
where $W_X$ and $W_Y$ are linear transformations applied to each node in $V_X$ and $V_Y$, respectively; they can also be interpreted as two weight matrices. This formula quantifies the correlation between nodes $m$ and $n$ using the inner product of the linearly transformed feature vectors. From a technical mechanism perspective, the linear transformations (via the weight matrices $W_X$ and $W_Y$) project the features into a shared subspace, enabling cross-domain comparison. The inner product then measures the similarity of these projected features, reflecting the spatial coupling strength between the bitemporal features. A higher value indicates stronger similarity in structural characteristics (e.g., unchanged regions such as stable terrain), as the mechanism effectively captures consistent spatial patterns across domains.
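A compact sketch of this correlation computation over whole node sets is shown below; the latent dimensionality and the use of bias-free linear layers as the weight matrices are assumptions of this illustration.

```python
import torch
import torch.nn as nn

d = 64                              # latent channel count (illustrative)
W_X = nn.Linear(d, d, bias=False)   # weight matrix for X-domain nodes
W_Y = nn.Linear(d, d, bias=False)   # weight matrix for Y-domain nodes

def correlation_scores(H_X, H_Y):
    """e[m, n] = <W_X h_m, W_Y h_n> for node features H_X: [M, d], H_Y: [N, d]."""
    return W_X(H_X) @ W_Y(H_Y).t()  # [M, N] correlation coefficients
```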
Based on the correlation coefficient $e_{m,n}$, a self-attention mechanism is applied to the nodes. The influence of node $n$ on node $m$ is represented by $\alpha_{m,n}$, which is obtained by normalizing the correlation coefficient $e_{m,n}$ using the SoftMax function:
$$\alpha_{m,n} = \operatorname{SoftMax}_{n}\!\left(e_{m,n}\right) = \frac{\exp\!\left(e_{m,n}\right)}{\sum_{k \in V_Y}\exp\!\left(e_{m,k}\right)}$$
where the SoftMax normalization ensures that the attention weights $\alpha_{m,n}$ sum to 1, quantifying the proportion of information from node $n$ that propagates to node $m$. Technically, the SoftMax operates by exponentiating and normalizing the raw similarity scores $e_{m,n}$ defined above, which inherently prioritizes nodes with larger similarity values. This mechanism leverages the structural consistency encoded in the graph: nodes with stronger spatial correlations (e.g., adjacent pixels in unchanged urban areas) receive higher weights, while noise-induced false correlations (from speckle or sensor artifacts) are suppressed due to their lower relative similarity.
The coefficient $\alpha_{m,n}$ determines the extent to which information from all nodes in $V_Y$ is propagated to the $m$-th node in $V_X$. A linear transformation matrix $W_V$ is applied to the features of $V_Y$, resulting in the aggregation of information for node $m$ as follows:
$$g_m = \sum_{n \in V_Y}\alpha_{m,n}\,W_V h^{Y}_{n}$$
By integrating the aggregated representation $g_m$ with the node feature $h^{X}_{m}$, a more robust feature representation $\tilde{h}^{X}_{m}$ is obtained:
$$\tilde{h}^{X}_{m} = h^{X}_{m} \,\Vert\, g_m$$
where $\Vert$ denotes the vector concatenation operator. This attention mechanism functions as a unidirectional feed-forward neural layer. The process is parallelized for all $m$, where $m \in V_X$, resulting in a new feature map $\tilde{Z}_X$. Similarly, a new feature map $\tilde{Z}_Y$ can be computed for image $Y$ using the same method. These formulas aggregate context-aware information from cross-domain nodes and fuse it with the original features. From a technical standpoint, the aggregation leverages the attention weights to selectively combine relevant cross-domain features, while the fusion with the original features balances the acquisition of new context against the preservation of the raw input information. This enhances the representation of consistent regions by reinforcing shared structural patterns while preserving the discriminative features of changed regions (e.g., newly built roads) through the retention of unique local characteristics, ultimately balancing stability and sensitivity to changes.
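Putting the correlation, normalization, aggregation, and concatenation steps together, one attention head that updates the X-domain nodes with Y-domain context might look as follows. This is a sketch under the reconstructed formulas above; the class name, layer sizes, and bias-free projections are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossGraphAttentionHead(nn.Module):
    """One attention head updating X-domain nodes with Y-domain context."""
    def __init__(self, d):
        super().__init__()
        self.W_X = nn.Linear(d, d, bias=False)
        self.W_Y = nn.Linear(d, d, bias=False)
        self.W_V = nn.Linear(d, d, bias=False)   # value transform for aggregation

    def forward(self, H_X, H_Y):                 # H_X: [M, d], H_Y: [N, d]
        e = self.W_X(H_X) @ self.W_Y(H_Y).t()    # correlation coefficients e_{m,n}
        alpha = F.softmax(e, dim=1)              # attention weights sum to 1 over n
        g = alpha @ self.W_V(H_Y)                # aggregated cross-domain context g_m
        return torch.cat([H_X, g], dim=1)        # fused node features, [M, 2d]
```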
After this step, the feature maps $\tilde{Z}_X$ and $\tilde{Z}_Y$ have been updated once. However, in the context of CD in heterogeneous remote sensing images, the features of the images before and after the change may differ significantly, and a single update of the feature maps could overlook important feature information. To address this, the MHGAM is designed to include additional attention heads. By repeating the process for each attention head $h = 1, \ldots, H$ and then concatenating the feature maps obtained from each attention head, the final output of the MHGAM is produced:
$$Z^{\mathrm{MH}}_{X} = \big\Vert_{h=1}^{H}\,\tilde{Z}^{(h)}_{X}, \qquad Z^{\mathrm{MH}}_{Y} = \big\Vert_{h=1}^{H}\,\tilde{Z}^{(h)}_{Y}$$
The aggregated feature representations $\bar{Z}_X$ and $\bar{Z}_Y$ are calculated by summing the feature maps from the individual attention heads and then taking their average, and the graph attention loss $L_{\mathrm{GA}}$ is defined on them:
$$\bar{Z}_X = \frac{1}{H}\sum_{h=1}^{H}\tilde{Z}^{(h)}_{X}, \qquad \bar{Z}_Y = \frac{1}{H}\sum_{h=1}^{H}\tilde{Z}^{(h)}_{Y}, \qquad L_{\mathrm{GA}} = \delta\!\left(\bar{Z}_X, \bar{Z}_Y\right)$$
where $\delta(\cdot,\cdot)$ is a pixel-level distance metric function; to meet computational cost constraints, the Euclidean distance is preferred. The number of attention heads $H$ was determined experimentally. We tested $H$ = 2, 4, and 8 and found that $H$ = 4 provides the optimal balance: fewer heads (e.g., 2) fail to capture diverse feature relationships (e.g., both spectral and spatial correlations), while a larger number of heads (e.g., 8) increases the computational cost without yielding significant performance improvements. In this aggregation, the features from the $H$ heads are fused by averaging, effectively combining complementary information from the different subspaces (each head learns distinct spatial correlation patterns). This strategy avoids over-reliance on a single attention head and enhances the robustness of the feature representation.
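Building on the single-head sketch above, a multi-head variant and the resulting graph attention loss could be assembled as follows. Sharing the same heads for both update directions and the specific distance reduction are assumptions of this sketch, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class MultiHeadGraphAttention(nn.Module):
    """H independent attention heads; their outputs are averaged to form the
    aggregated representations used by the graph attention loss (H = 4
    following the ablation reported in the text)."""
    def __init__(self, d, heads=4):
        super().__init__()
        self.heads = nn.ModuleList([CrossGraphAttentionHead(d) for _ in range(heads)])

    def forward(self, H_X, H_Y):
        outs_X = [head(H_X, H_Y) for head in self.heads]  # X nodes with Y context
        outs_Y = [head(H_Y, H_X) for head in self.heads]  # Y nodes with X context
        Z_bar_X = torch.stack(outs_X).mean(dim=0)         # average over heads
        Z_bar_Y = torch.stack(outs_Y).mean(dim=0)
        # Node-wise Euclidean distance, averaged over nodes, as the loss value
        # (assumes equal node counts in the two co-registered feature maps).
        return torch.linalg.norm(Z_bar_X - Z_bar_Y, dim=1).mean()
```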
By utilizing the graph attention loss, a node matching relationship between feature images is established, which helps to reduce the impact of spectral feature differences between image pairs on the analysis of the target area. The MHGAM allows the model to learn features in different spaces. It processes input features in parallel using multiple independent attention heads, with each head capable of learning different feature representations and relationships, thereby enhancing the model’s expressive power. The MHGAM maps spatial information from one feature domain to another, resulting in more effective image embedding representations. This approach better handles the heterogeneity between cross-domain image pairs.