Article

CAGMC-Defence: A Cross-Attention-Guided Multimodal Collaborative Defence Method for Multimodal Remote Sensing Image Target Recognition

1 School of Geosciences and Info-Physics, Central South University, Changsha 410083, China
2 Research and Development Centre, China Academy of Launch Vehicle Technology, Beijing 100076, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(19), 3300; https://doi.org/10.3390/rs17193300
Submission received: 9 July 2025 / Revised: 19 September 2025 / Accepted: 24 September 2025 / Published: 25 September 2025


Highlights

What are the main findings?
  • Proposed CAGMC-Defence, a cross-attention-guided multimodal collaborative defence framework integrating feature fusion (MFEF) and Multimodal Adversarial Training (MAT).
  • Achieved 85.74% accuracy under the strongest white-box MIM attack ( ϵ = 0.05 ) on the WHU-OPT-SAR dataset, consistently outperforming seven state-of-the-art baselines with lower inference cost.
What is the implication of the main finding?
  • Demonstrates a scalable and efficient paradigm for enhancing adversarial robustness of multimodal remote sensing models.
  • Provides a generalizable defence strategy applicable to other modality combinations and safety-critical tasks such as disaster response and environmental monitoring.

Abstract

With the increasing diversity of remote sensing modalities, multimodal image fusion improves target recognition accuracy but also introduces new security risks. Adversaries can inject small, imperceptible perturbations into a single modality to mislead model predictions, which undermines system reliability. Most existing defences are designed for single-modal inputs and face two key challenges in multimodal settings: (1) vulnerability to cross-modal perturbation propagation caused by static fusion strategies, and (2) the absence of collaborative defence mechanisms, which ties overall robustness to the weakest modality. To address these issues, we propose CAGMC-Defence, a cross-attention-guided multimodal collaborative defence framework for multimodal remote sensing. It contains two main modules. The Multimodal Feature Enhancement and Fusion (MFEF) module adopts a pseudo-Siamese network and cross-attention to decouple features, capture intermodal dependencies, and suppress perturbation propagation through weighted regulation and consistency alignment. The Multimodal Adversarial Training (MAT) module jointly generates optical and SAR adversarial examples and optimizes network parameters under a consistency loss, enhancing robustness and generalization. Experiments on the WHU-OPT-SAR dataset show that CAGMC-Defence maintains stable performance under typical adversarial attacks such as FGSM, PGD, and MIM, retaining 85.74% overall accuracy even under the strongest white-box MIM attack (ε = 0.05) and significantly outperforming existing multimodal defence baselines.

1. Introduction

Deep neural network-based multimodal remote sensing image recognition integrates multisource data from heterogeneous sensors, such as RGB, SAR, and LiDAR, to achieve cross-modality collaborative perception and to compensate for the physical and perceptual limitations of any single modality [1,2,3]. In representative applications such as disaster response and environmental monitoring [4,5], this approach enables high-precision, highly robust land cover identification and dynamic change analysis and has become a key driver of more intelligent Earth observation systems. However, the high-dimensional input space and complex feature interactions in multimodal fusion introduce new attack surfaces. An adversary can inject imperceptible, norm-bounded perturbations into a single modality (e.g., RGB), misleading deep neural networks into incorrect predictions and severely compromising the decision reliability of multimodal remote sensing recognition models [6,7,8]. This security dilemma highlights the urgent need for defences that address collaborative multimodal attacks; consequently, devising reliable multimodal adversarial defence strategies has become critical to safeguarding the safety and stability of intelligent multimodal remote sensing systems.
In recent years, significant progress has been made in multimodal adversarial defence methods. These approaches can generally be categorized into two main strategies: enhancing data robustness and enhancing model robustness. Data-centric methods focus on mitigating the effects of adversarial perturbations through input preprocessing techniques such as data reconstruction or by improving input quality via multimodal information collaboration. By improving the quality of the input data, these approaches indirectly strengthen the model’s resilience to attacks [9,10,11]. In contrast, model-centric methods aim to directly improve the model’s ability to recognize and defend against adversarial examples by innovating network architectures or optimizing training strategies. This includes, but is not limited to, architectural modifications, advanced training schemes, and adversarial training techniques [12,13].
Although various adversarial defence strategies have been developed to protect deep learning-based multimodal remote sensing classification models, they still face significant limitations when confronted with collaborative multimodal attacks. First, most existing defences are designed from a single-modality perspective, focusing primarily on enhancing the robustness of convolutional neural networks without modelling intermodality collaboration mechanisms, making them inadequate for effectively addressing multimodal threat scenarios. Second, current multimodal classification models typically rely on static fusion strategies, which lack the ability to dynamically perceive and regulate modality quality or adversarial perturbations. This often leads to cascading perturbation propagation across modalities during fusion, which amplifies the attack's impact and severely degrades overall defence performance.
To overcome the absence of collaborative defences and the fusion fragility observed in current multimodal remote sensing recognition models, we introduce CAGMC-Defence, a cross-attention-guided multimodal collaborative defence framework. CAGMC-Defence fully leverages the complementary strengths of heterogeneous data sources to increase robustness against adversarial attacks while retaining strong generalization on clean inputs. The framework contains two key components. Multimodal Feature Enhancement and Fusion (MFEF): To counter the vulnerability of static fusion, MFEF first employs a pseudo-Siamese network that models optical and SAR modalities independently, preventing perturbations from propagating across modalities in shallow layers. It then introduces a cross-attention mechanism [14] to dynamically capture intermodal correlations. This mechanism assigns attention weights based on feature consistency and stability. When one modality is locally degraded, its weight is reduced automatically. Simultaneously, the other modality receives a higher weight. This enables robust adaptive fusion and effectively suppresses cascading cross-modal perturbations. Multimodal Adversarial Training (MAT): MAT builds a joint perturbation–optimization and mixed-training scheme that simultaneously generates optical and SAR adversarial examples during training, guiding the network to learn resilient features under cooperative perturbations. In addition, a prediction-consistency loss further improves adaptation to distribution shifts, reinforcing the system’s resistance to both known and unseen attacks. The main contributions of this work are as follows:
  • This paper proposes a novel multimodal adversarial defence method, namely, CAGMC-Defence, which integrates feature enhancement and fusion with robust Multimodal Adversarial Training into a unified modelling framework. By guiding the model to learn complementary features under collaborative perturbations, the method effectively improves the robustness of multimodal remote sensing models in complex adversarial scenarios and accelerates the convergence of the defence strategy.
  • In the feature fusion stage, this work introduces a pseudo-Siamese architecture and a multimodal cross-attention mechanism. Optical and SAR features are first extracted independently to prevent the cascading propagation of perturbations in shallow network layers. Dynamic attention weights are subsequently used to enhance robust features and suppress abnormal regions, enabling more adaptive and resilient feature fusion. This mechanism effectively improves model stability under adversarial conditions and is particularly applicable to satellite-based sensing systems with high reliability requirements.
  • We conducted a comprehensive evaluation of the proposed method on the WHU-OPT-SAR multimodal remote sensing dataset [15], assessing its robustness under different levels of adversarial attack. CAGMC-Defence maintains stable performance under various typical adversarial attacks, such as FGSM, PGD, and MIM, retaining an overall accuracy of 0.8574 even under a strong white-box MIM attack ( ϵ = 0.05 ), demonstrating its effectiveness and broad applicability in real-world adversarial remote sensing scenarios.

2. Related Work

2.1. Remote Sensing Image Fusion Based on Optical and SAR Data

In recent years, with the rapid development of deep learning, intelligent recognition methods based on multisource remote sensing images have attracted widespread attention. Optical and synthetic aperture radar (SAR) images are complementary in terms of imaging mechanisms and have shown significant advantages in tasks such as target recognition and scene understanding. However, notable differences exist between the two types of images in terms of spatial structure, texture representation, and noise distribution. These differences pose considerable challenges for multimodal information fusion and joint recognition. To fully exploit the complementary information from optical and SAR images, researchers have proposed various fusion strategies under deep learning frameworks. These strategies are mainly categorized into three levels: pixel-level fusion, feature-level fusion, and decision-level fusion [16].
Pixel-level fusion integrates optical and SAR images at the input stage to generate a composite image as the input to deep learning models [17]. This approach typically relies on end-to-end deep networks to learn joint representations across modalities automatically, improving both perceptual consistency and discriminative capability. For example, Chen et al. [18] proposed a self-supervised fusion framework based on multiview contrastive loss. It incorporates early, middle, and late fusion strategies at both the image and superpixel levels to jointly extract pixel-level features from SAR and optical images under unlabelled conditions. Combined with spectral indices, the method achieves accurate land cover classification. Irfan et al. [19] used a dual-branch neural network to separately process SAR and optical inputs. They applied pixel-level fusion at the input stage to integrate low-level information, which significantly improved the model’s ability to distinguish land cover types and adapt to complex scenes.
Feature-level fusion extracts deep features from optical and SAR images separately during the intermediate representation stage. It aligns and jointly models these features in the feature space to enhance the expression of complementary information and improve semantic consistency between modalities [20]. For example, Xu et al. [21] proposed a dual-branch convolutional network that independently extracts features from optical and SAR images and performs multiscale fusion in intermediate layers to improve classification performance over complex surfaces. Liu et al. [22] introduced JoiTriNet, a joint network that integrates encoder- and decoder-level features from both modalities. By incorporating a dual-attention fusion module, the model enables discriminative feature learning and semantic enhancement, significantly improving land cover classification and model robustness. In addition, Gao et al. [23] developed a dual-encoder fusion network (DEN), where optical and SAR features are extracted independently to address the modality mismatch caused by imaging differences. They introduced a detail attention module (DAM) to capture key textures and contours while suppressing SAR speckle noise. An adaptive weighted loss function was also designed to preserve discriminative features and maintain image quality. Compared with pixel-level fusion, which directly operates on image data, feature-level fusion can maintain the independence of modality-specific features while extracting more discriminative joint representations. It offers stronger adaptability and robustness, particularly in tasks such as heterogeneous modality alignment, complex object recognition, and missing information reconstruction.
Decision-level fusion is a strategy that integrates the outputs of different modalities after feature extraction and task inference have been completed independently. This approach is suitable for scenarios where optical and SAR images differ significantly in their semantic representations. It enhances the robustness of the final classification or recognition results while preserving the individual discriminative capabilities of each modality. For example, Lin et al. [24] proposed a unified U-Net (UU-Net) model that fuses Sentinel-1 SAR and Sentinel-2 optical data for large-scale road extraction. When trained on a multisource road extraction dataset, the model significantly outperformed single-modality methods, achieving higher accuracy in 80% of the 200 evaluated regions. This demonstrates the complementarity and effectiveness of combining optical and SAR data for road extraction. Salehian et al. [25] introduced a decision-level fusion method based on majority voting. By combining optical and polarimetric SAR images and integrating change maps produced by multiple change detection algorithms, their method improved the overall accuracy and robustness of urban change detection. In addition, Chen et al. [26] proposed a decision-level fusion algorithm for multisource image-based object detection. Their approach applies object detection and semantic segmentation separately to visible and SAR images and then merges the results at the decision layer. This method effectively improves both the detection accuracy and the recall in multisource fusion scenarios.

2.2. Multimodal Adversarial Defence

Research on multimodal adversarial defences can be broadly categorized into two strategic lines: data-level robustness enhancement and model-level robustness enhancement. Data-centric approaches mitigate the impact of adversarial perturbations by preprocessing input samples or leveraging multimodal information synergy, thereby indirectly improving overall model resilience. For example, Yang et al. [27] reported that redundancy across modalities provides inherent robustness against single-modality perturbations. On this basis, they proposed an adversarial–robust fusion strategy. The method measures intermodal consistency to detect corrupted modalities. It then suppresses the propagation of their features and allows only clean modalities to influence the final decision. This approach significantly improves robustness under single-source attacks. Zhang et al. [28] proposed a frequency-domain-based defence method that enhances model robustness in image classification tasks by suppressing high-frequency components in images to mitigate the impact of adversarial perturbations. Similarly, Bansal et al. [29] introduced the Clean-CLIP framework, which employs modality-independent feature reconstruction to weaken spurious associations inserted by backdoor attacks. Experiments show that combining multimodal contrastive objectives with single-modal self-supervised targets during unsupervised fine-tuning both alleviates backdoor effects and preserves performance on benign samples. Karim et al. [9] proposed the MCAE model, which combines PixelCNN with a convolutional autoencoder to reconstruct adversarial inputs into their clean counterparts, providing a robustness-enhanced image restoration mechanism.
Waseda et al. [30] reported that adversarial training in vision-language models (VLMs) tends to overfit the fixed one-to-one (1:1) image–text pairing common in typical datasets, and that introducing one-to-many (1:N) or many-to-one (N:1) augmentation strategies can significantly improve adversarial robustness. Based on this insight, they proposed a novel defence method for image–text retrieval (ITR) that leverages N:N mappings and combines basic data augmentation with generative augmentation, creating diverse and tightly aligned image–text pairs and offering a promising new direction for defending VLMs. Fares et al. [31] further explored text-to-image (T2I) models for adversarial detection: they regenerated images from captions produced by the target model and identified potential adversarial examples by measuring the embedding similarity between the original and regenerated images. Empirical evaluations on multiple datasets show that this approach outperforms baselines transferred directly from image classification tasks while remaining model-agnostic and task-flexible. In addition, Ramesh et al. [32] designed a universal evasion detector capable of blocking adversarial inputs across diverse attack scenarios. Li et al. [11] reported that both attack success and defence effectiveness in VLMs are highly sensitive to the choice of textual prompts. Based on this observation, they proposed Adversarial Prompt Tuning (APT), which optimizes robust prompts to improve model stability under adversarial conditions without introducing significant computational or data overhead, making it an efficient and effective defence strategy.
Although data-level approaches can suppress adversarial perturbations before they reach the model and enhance robustness at the input stage, they often rely on strong assumptions, such as the availability of clean modalities, consistent cross-modal correlations, or complex data reconstruction procedures. As a result, their generalizability tends to degrade under adaptive or unforeseen attacks. Moreover, these methods may introduce information loss and high computational overhead, which limits their practicality in real-time applications.
Defence strategies aimed at enhancing model robustness focus primarily on architectural improvements and training optimization, enabling the model to learn the distribution of adversarial perturbations directly and thereby improve its ability to distinguish adversarial examples. Among these, adversarial training is a widely adopted and well-established method in image classification and has been proven effective in improving model robustness. However, Mao et al. [12] reported that existing adversarial training methods exhibit certain limitations when facing different types of attacks. Owing to the diverse nature of perturbations generated by various attack methods, a single adversarial training strategy struggles to provide unified protection against multiple threats. To address this, they proposed a hybrid adversarial training approach that jointly optimizes multiple perturbations, enabling the model to simultaneously learn and defend against a broader range of adversarial features and thereby enhancing multimodal adversarial robustness. In addition, Kuang et al. [33] explored the potential link between adversarial and backdoor attacks and proposed a defence method called Adversarial Backdoor Defence (ABD). By generating adversarial perturbations aligned with backdoor features and incorporating them into the fine-tuning stage as a data augmentation technique, ABD effectively improves the model’s ability to identify and suppress backdoor samples. In the context of model compression and knowledge distillation, Nagarajan et al. [13] incorporated defensive distillation into the adversarial training process. They introduced adversarial examples during training to generate robust soft labels from the teacher model. These soft labels were subsequently used to guide the student model in learning more resilient prediction strategies. This approach significantly improved the model’s defence capability under adversarial conditions.
Moreover, although the following methods focus primarily on unimodal tasks, their design principles in adversarial training offer valuable insights for multimodal adversarial defence. For example, Azizmalayeri et al. [34] introduced an adversarial training approach based on a Lagrangian objective function, which jointly optimizes a fixed norm regularization term and the classification loss to improve robustness against training attacks and significantly enhance generalization to unseen adversarial attacks. Xu et al. [35] proposed a dynamic divide-and-conquer adversarial training strategy (DDC-AT), which introduces a multibranch structure during training to differentiate the processing of pixels that are sensitive or insensitive to adversarial perturbations, thereby enhancing the robustness of semantic segmentation models on both clean and adversarial samples. Cui et al. [36] proposed an adversarial training method guided by the logits of a clean model, which constrains the predictions of a robust model on adversarial examples to align with the outputs of the clean model on corresponding natural inputs, thereby improving robustness while effectively preserving accuracy on clean data.
Building upon this, Cui et al. [37] further enhanced adversarial robustness by introducing the Improved Kullback–Leibler (IKL) loss, which incorporates symmetric gradient optimization and classwise global information, achieving state-of-the-art performance across multiple datasets. Zan et al. [38] proposed a gradual adversarial training method (GAT), which introduces multiple intermediate domains to guide the model to gradually adapt to adversarial features, thus significantly improving the robustness of remote sensing image segmentation models without modifying the network architecture or input data. Liu et al. [39] introduced an uncertainty-aware adversarial training method (AT-UR) that combines beta-weighted loss with an entropy minimization regularizer, aiming to improve adversarial robustness while effectively reducing the size of prediction sets in conformal prediction, thus enhancing model reliability in safety-critical scenarios. Despite their effectiveness, these adversarial training-based methods often suffer from high training costs, sensitivity to hyperparameter settings, and potential trade-offs between robustness and generalization, which may limit their applicability in real-time or resource-constrained scenarios.
Beyond adversarial training, enhancing the robustness of multimodal models can also be achieved by improving model architectures and optimizing training strategies. For instance, Sur et al. [40] proposed TIJO, a multimodal backdoor defence technique that jointly reverses triggers in both image and text modalities, effectively countering dual-key multimodal attacks. Hossain et al. [41] introduced SimCLIP+, a defence mechanism based on a pseudo-Siamese architecture, which performs adversarial fine-tuning on the CLIP visual encoder to maximize the cosine similarity between clean and perturbed examples, thereby improving robustness to adversarial perturbations. Zhang et al. [42] proposed Heterocentric Loss, which constrains the distance between class centres of different modalities to improve fusion consistency. They also introduced a guided complementary entropy loss to suppress the model’s confidence in incorrect labels, further enhancing adversarial robustness during heterogeneous feature fusion. He et al. [43], in the context of visual question answering, designed a regularization term called Contrastive Fusion Representation (CFR) to reduce the model’s sensitivity to perturbations in both visual and linguistic inputs. Additionally, Xu et al. [44] proposed Cross-modal Information Detector (CIDER), a method for detecting cross-modal jailbreak attacks. It identifies adversarial examples by measuring the semantic similarity between malicious queries and adversarial images. CIDER is plug-and-play and highly transferable. It defends against both white-box and black-box attacks without requiring access to the underlying multimodal large language model (MLLM).

3. Methods

3.1. CAGMC-Defence Framework

To increase the robustness of multimodal remote sensing image recognition models against adversarial attacks, this paper proposes a defence framework named CAGMC-Defence. The method integrates multimodal adversarial perturbation modelling with a robust feature fusion mechanism and consists of two core components: the Multimodal Feature Enhancement and Fusion (MFEF) module and the Multimodal Adversarial Training (MAT) module. The design of the MFEF module is inspired by the multimodal feature extraction and fusion strategies in the MCANet framework [15]. It consists of three key submodules. First, the Pseudo-Siamese Feature Extractor (PSFE) independently extracts features from optical and SAR modalities. Second, the Multimodal Cross-Attention (MCA) submodule models dynamic dependencies between modalities. Third, the Cross-Modal Feature Fusion (CMFF) submodule aligns the latent feature spaces and performs semantic-level integration. Together, these components enable the model to learn more discriminative and robust fused feature representations.
Specifically, the preprocessed optical and SAR remote sensing images are first fed into the MFEF module, where the PSFE submodule performs modality-specific encoding to prevent cross-modal propagation of adversarial perturbations at shallow stages. Afterwards, the MCA submodule introduces a cross-attention mechanism and employs a Hadamard product to achieve second-order attention interaction, enabling the modelling of complex nonlinear intermodal dependencies. Based on differences in modality stability, the MCA submodule adaptively regulates feature responses—enhancing robust regional information while suppressing noise. Finally, the CMFF submodule semantically aligns and integrates the fused features, incorporating a consistency loss to constrain the shared representation between modalities and improve generalization under adversarial conditions. The resulting fused feature map has dimensions of 64 × 64 × 304 and is passed through two upsampling convolutional layers to produce the final classification predictions.
Building on this foundation, the MAT module is introduced to further enhance the model’s adversarial robustness. This module generates multimodal adversarial examples based on the current classifier and constructs a mixed training set to enable min–max adversarial training optimization. It integrates multiple improved perturbation generation strategies, using the current MFEF model as the target. Representative adversarial examples are crafted by optimizing perturbation directions and are then combined with clean inputs to form a robust training set. By training on the joint distribution of clean and adversarial examples, the model learns stable and discriminative features. This significantly enhances its robustness to known attacks and improves its generalization to unseen perturbations. The overall architecture of the CAGMC-Defence framework is illustrated in Figure 1.

3.2. Multimodal Feature Enhancement and Fusion Module (MFEF Module)

As illustrated in Figure 2, the MFEF module consists of three sequential submodules: PSFE, MCA, and CMFF. Together, they form a deep feature extraction and fusion architecture designed for multimodal remote sensing imagery. Optical and SAR images are first processed by a dual-branch pseudo-Siamese network based on ResNet101 [45], which extracts both low- and high-level features. The MCA module is introduced at multiple semantic levels to model the dependencies between modalities. The fused features are then unified and integrated in the CMFF module before being passed to the decoder for final classification prediction. The following sections provide a detailed explanation of the implementation of each submodule.

3.2.1. Pseudo-Siamese Feature Extraction Submodule (PSFE Submodule)

To address the issues of modality feature mismatch caused by the distinct imaging mechanisms of optical and SAR data, as well as the cross-modal propagation of adversarial perturbations during early feature extraction stages, this paper designs a dual-branch pseudo-Siamese network. The network adopts two parallel convolutional streams with independent parameters to process the optical image x o and the SAR image x s . Deep semantic features are gradually extracted through multilevel residual connections, resulting in discriminative intramodal representations. This structure improves multimodal feature alignment and classification accuracy by suppressing semantic discrepancies in object representation. It also mitigates the cross-modal spread of adversarial noise in shallow layers through decoupled modelling. As a result, the model’s robustness is significantly enhanced.
Taking the optical image as an example, its high-level feature representation can be formalized as follows:
$$F_o^h = F_o^l + \sum_{i=l}^{L-1} \mathcal{F}_o\!\left(F_o^i, \omega_o^i\right)$$
where $F_o^l$ denotes the shallow feature map at layer $l$, $\mathcal{F}_o(\cdot)$ represents the nonlinear feature transformation function at layer $i$, and $\omega_o^i$ is the set of convolutional parameters of the $i$-th layer. This formulation reflects the progressive aggregation of semantic features via residual stacking from layer $l$ onwards.
Similarly, the feature extraction process for the SAR image x s can be expressed as follows:
$$F_s^h = F_s^l + \sum_{i=l}^{L-1} \mathcal{F}_s\!\left(F_s^i, \omega_s^i\right)$$
where $\mathcal{F}_s(\cdot)$ denotes the nonlinear transformation function in the SAR branch, which is used to extract the structural and scattering features of ground objects.
In summary, the PSFE submodule provides a structurally clear and semantically purified feature foundation for subsequent cross-modal collaborative modelling through modality decoupling and feature isolation strategies. This design not only facilitates more effective intermodal interactions but also provides a solid foundation for enhancing the overall model’s robustness against adversarial perturbations.
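As a concrete illustration, the following PyTorch sketch shows one way such a dual-branch pseudo-Siamese extractor could be organised. It assumes ResNet-101 backbones with independent (non-shared) weights, a single-channel SAR stem, and dilated later stages so that the low- and high-level map sizes match those quoted in Section 3.2.3; all class and variable names are illustrative rather than the authors' implementation.

```python
import torch.nn as nn
from torchvision.models import resnet101

class PseudoSiameseExtractor(nn.Module):
    """Two ResNet-101 streams with independent parameters: one for optical
    (3-channel) input and one for SAR (1-channel) input. Keeping the branches
    separate prevents adversarial noise in one modality from contaminating the
    other modality's shallow features."""

    def __init__(self):
        super().__init__()
        # Dilated stages keep the high-level maps at 1/8 resolution
        # (32 x 32 for a 256 x 256 input); this is an assumption made to
        # match the feature sizes reported for the CMFF submodule.
        self.opt_branch = resnet101(weights=None,
                                    replace_stride_with_dilation=[False, True, True])
        self.sar_branch = resnet101(weights=None,
                                    replace_stride_with_dilation=[False, True, True])
        # SAR imagery is single-channel, so replace the first convolution.
        self.sar_branch.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                          padding=3, bias=False)

    @staticmethod
    def _features(backbone, x):
        # Stem plus the four residual stages; return the low-level (layer1)
        # and high-level (layer4) feature maps.
        x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
        low = backbone.layer1(x)                                       # 256 channels
        high = backbone.layer4(backbone.layer3(backbone.layer2(low)))  # 2048 channels
        return low, high

    def forward(self, x_opt, x_sar):
        f_o_low, f_o_high = self._features(self.opt_branch, x_opt)
        f_s_low, f_s_high = self._features(self.sar_branch, x_sar)
        return (f_o_low, f_o_high), (f_s_low, f_s_high)
```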

3.2.2. Multimodal Cross-Attention Submodule (MCA Submodule)

Because different modalities can represent the same geographic scene from complementary perspectives, their semantic and structural features are inherently synergistic. Effectively integrating the advantageous features of each modality can help mitigate performance degradation when one modality is compromised by adversarial perturbations. To enable robust multimodal perception and feature extraction under such conditions, we introduce a Multimodal Cross-Attention Submodule (Figure 3). This submodule establishes explicit interaction pathways between modalities, allowing the model to fully exploit the complementary nature of semantic and structural information from optical and SAR images. It guides adaptive fusion and weighted modulation at the feature level, thereby enhancing the discriminative power and robustness of the fused representation. Specifically, the mechanism comprises the following five key steps:
Feature Projection: To construct the query (Q), key (K), and value (V) vectors required for the multimodal cross-attention mechanism, the optical feature map F o and SAR feature map F s are independently projected through linear transformations. Specifically, each modality’s feature map is passed through three independent 1 × 1 convolutional layers (with identical channel dimensions) to generate their respective projections:
$$Q_o, K_o, V_o = \mathrm{Conv}^{1 \times 1}_{Q,K,V}(F_o), \qquad Q_s, K_s, V_s = \mathrm{Conv}^{1 \times 1}_{Q,K,V}(F_s)$$
Here, $\mathrm{Conv}^{1 \times 1}_{Q,K,V}$ denotes three independent $1 \times 1$ convolutional operations used to generate the query, key, and value feature maps. $Q_o, K_o, V_o$ represent the query, key, and value features derived from the optical modality, while $Q_s, K_s, V_s$ denote the corresponding features extracted from the SAR modality.
Intramodal Attention Modelling: To capture semantic dependencies within each modality, attention weight matrices are constructed independently for the optical and SAR branches. Specifically, for each modality, the correlation between spatial positions is computed via the matrix multiplication of the query matrix Q and the transpose of the key matrix K , followed by normalization using the softmax function. The attention weights are defined as follows:
$$S_o = \mathrm{softmax}\!\left(Q_o K_o^{\top}\right), \qquad S_s = \mathrm{softmax}\!\left(Q_s K_s^{\top}\right)$$
Here, $S_o$ and $S_s$ denote the attention weight matrices for the optical and SAR modalities, respectively. This mechanism captures long-range dependencies within each modality, enhancing the expressiveness and robustness of feature representations by integrating both local and global information.
Cross-Modal Attention Fusion: Optical and SAR images encode complementary information, such as semantic cues and structural details, while exhibiting significant heterogeneity. To enable collaborative learning and deep interaction between multimodal features, we introduce a cross-modal attention fusion mechanism. Specifically, the attention weight matrices of the optical modality S o and the SAR modality S s are fused using Hadamard (elementwise) multiplication to achieve nonlinear interaction and joint modelling. The fused attention matrix S u is computed as follows:
$$S_u = S_o \odot S_s$$
This unified attention matrix explicitly encodes spatial correspondences between the two modalities, facilitating the discovery of complementary cross-modal features and laying a foundation for subsequent weighted fusion and robust feature construction.
Feature Reweighting: To further enhance the modality-specific focus on semantic regions, the joint attention matrix S u is used to reweight the value vectors from both optical and SAR modalities. Specifically, the joint attention weights are multiplied elementwise by the value feature maps V o and V s . This operation adaptively recalibrates feature importance, strengthens the response of robust regions, and suppresses interference from adversarially perturbed areas. The reweighted features are computed as follows:
$$W_o = S_u \odot V_o, \qquad W_s = S_u \odot V_s$$
where ⊙ denotes elementwise multiplication. This step enables dynamic adjustment of spatial feature responses across modalities based on cross-modal attention distribution, thereby improving the discriminative power and robustness of the learned feature representations.
Feature Interaction: After modality-specific reweighting, the attention-weighted feature maps of the optical and SAR modalities, denoted as W o and W s , are obtained. To fuse the discriminative information from both modalities and enhance the completeness and robustness of the multimodal feature representation, elementwise multiplication is applied to produce the unified attention-enhanced feature map F u , which is defined as follows:
$$F_u = W_o \odot W_s$$
Here, $F_u$ integrates the semantic content from the optical image and the structural characteristics from the SAR image, explicitly modelling the coupling relationships between modalities and providing a more robust and informative representation for classification tasks.
This cross-modal attention interaction mechanism can be embedded into various hidden layers of multimodal deep neural networks. It can either function as an independent auxiliary module or be integrated with the backbone network, supporting both standard supervised learning and adversarial robust training. This design enhances the network’s ability to collaboratively represent heterogeneous information from multiple sources.
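A minimal PyTorch sketch of the five steps above is given below. The 1 × 1 projections, intramodal attention, Hadamard fusion of the two attention maps, and elementwise feature interaction follow the equations in this subsection; applying the joint attention matrix to the value features is written here as a matrix product so that tensor shapes stay consistent, which is one possible reading rather than the authors' exact implementation, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalCrossAttention(nn.Module):
    """Sketch of the five MCA steps: per-modality Q/K/V projection,
    intramodal attention, Hadamard fusion of the attention maps,
    value reweighting, and elementwise feature interaction."""

    def __init__(self, channels):
        super().__init__()
        self.q_o, self.k_o, self.v_o = (nn.Conv2d(channels, channels, 1) for _ in range(3))
        self.q_s, self.k_s, self.v_s = (nn.Conv2d(channels, channels, 1) for _ in range(3))

    @staticmethod
    def _flatten(t):
        b, c, h, w = t.shape
        return t.view(b, c, h * w)                       # (B, C, HW)

    def forward(self, f_o, f_s):
        b, c, h, w = f_o.shape
        q_o, k_o, v_o = map(self._flatten, (self.q_o(f_o), self.k_o(f_o), self.v_o(f_o)))
        q_s, k_s, v_s = map(self._flatten, (self.q_s(f_s), self.k_s(f_s), self.v_s(f_s)))

        # Step 2: intramodal spatial attention, one (HW x HW) map per modality.
        s_o = F.softmax(torch.bmm(q_o.transpose(1, 2), k_o), dim=-1)
        s_s = F.softmax(torch.bmm(q_s.transpose(1, 2), k_s), dim=-1)

        # Step 3: Hadamard fusion of the two attention maps.
        s_u = s_o * s_s

        # Step 4: reweight each modality's values with the joint attention
        # (applied as a matrix product so the shapes remain consistent).
        w_o = torch.bmm(v_o, s_u.transpose(1, 2)).view(b, c, h, w)
        w_s = torch.bmm(v_s, s_u.transpose(1, 2)).view(b, c, h, w)

        # Step 5: elementwise interaction of the reweighted features.
        return w_o * w_s
```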

3.2.3. Cross-Modal Feature Fusion Submodule (CMFF Submodule)

The Cross-Modal Feature Fusion (CMFF) submodule is designed to deeply integrate the multiscale semantic features extracted from optical and SAR images by the PSFE module, thereby enhancing the model’s representation capability and robustness in multimodal perception tasks. By combining multilevel feature processing with fusion mechanisms, this module performs cross-attention operations at both low- and high-dimensional scales to fully exploit the complementary characteristics across modalities (Figure 4).
At the low-dimensional feature-processing stage, the optical and SAR features extracted by the pseudo-Siamese network are denoted by $F_o^l \in \mathbb{R}^{64 \times 64 \times 256}$ and $F_s^l \in \mathbb{R}^{64 \times 64 \times 256}$, respectively. These features are fed into the MCA submodule to obtain the joint low-level attention feature map $F_u^l \in \mathbb{R}^{64 \times 64 \times 256}$. The CMFF submodule then concatenates $F_o^l$, $F_s^l$, and $F_u^l$ along the channel dimension to produce a fused feature map $F_{osu}^{l} \in \mathbb{R}^{64 \times 64 \times 768}$. A $1 \times 1$ convolution is subsequently applied to compress the channel dimension, yielding the low-dimensional fused feature map $F_{osu}^{l\text{-}d} \in \mathbb{R}^{64 \times 64 \times 48}$.
At the high-dimensional feature-processing stage, the low-level optical and SAR feature maps $F_o^l$ and $F_s^l$ are fed into deeper convolutional units to extract richer semantic information, yielding high-level representations $F_o^h \in \mathbb{R}^{32 \times 32 \times 2048}$ and $F_s^h \in \mathbb{R}^{32 \times 32 \times 2048}$. A $1 \times 1$ convolution compresses the channel dimension of each map, producing reduced versions $F_o^{hr}, F_s^{hr} \in \mathbb{R}^{32 \times 32 \times 256}$, which lowers computational complexity and improves fusion efficiency. Next, the original high-level features $F_o^h$ and $F_s^h$ are fed into the MCA submodule to generate a joint high-level attention feature map $F_u^h \in \mathbb{R}^{32 \times 32 \times 2048}$. The CMFF submodule then concatenates $F_o^{hr}$, $F_s^{hr}$, and $F_u^h$ along the channel dimension to obtain a fused feature map $F_{osu}^{h} \in \mathbb{R}^{32 \times 32 \times 2560}$. Finally, this fused feature map is passed into the Atrous Spatial Pyramid Pooling (ASPP) module to capture multiscale contextual information and enhance semantic representation. The ASPP output, of size $\mathbb{R}^{32 \times 32 \times 256}$, is upsampled to $\mathbb{R}^{64 \times 64 \times 256}$, forming the final high-level fused representation $F_{osu}^{l\text{-}u}$.
Subsequently, the low-level fused feature $F_{osu}^{l\text{-}d}$ and the high-level fused feature $F_{osu}^{l\text{-}u}$ are concatenated along the channel dimension to obtain the comprehensive fused representation $F_x \in \mathbb{R}^{64 \times 64 \times 304}$ (48 + 256 channels). This feature is passed through two consecutive convolutional layers to further integrate multilevel information, and a bilinear upsampling operation finally restores the feature map from $64 \times 64$ to the original input resolution, producing the classification prediction map.
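The fusion path described above can be sketched as follows, assuming the MCA module from the previous subsection and any ASPP implementation that maps 2560 input channels to 256 output channels; the layer names, the 3 × 3 head, and the seven-class output are illustrative choices consistent with the stated feature sizes, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CMFF(nn.Module):
    """Sketch of the CMFF fusion path. `mca_low`/`mca_high` are assumed to be
    MultimodalCrossAttention modules; `aspp` is any ASPP block mapping
    2560 -> 256 channels."""

    def __init__(self, mca_low, mca_high, aspp, num_classes=7):
        super().__init__()
        self.mca_low, self.mca_high, self.aspp = mca_low, mca_high, aspp
        self.reduce_low = nn.Conv2d(768, 48, 1)     # 3 x 256 -> 48 channels
        self.reduce_o = nn.Conv2d(2048, 256, 1)
        self.reduce_s = nn.Conv2d(2048, 256, 1)
        self.head = nn.Sequential(                  # the two final convolutions
            nn.Conv2d(304, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 3, padding=1))

    def forward(self, f_o_low, f_s_low, f_o_high, f_s_high, out_size):
        # Low-dimensional branch: joint attention, concatenation, compression.
        f_u_low = self.mca_low(f_o_low, f_s_low)                      # 64x64x256
        low = self.reduce_low(torch.cat([f_o_low, f_s_low, f_u_low], dim=1))

        # High-dimensional branch: reduced modality features + joint attention.
        f_u_high = self.mca_high(f_o_high, f_s_high)                  # 32x32x2048
        high = torch.cat([self.reduce_o(f_o_high), self.reduce_s(f_s_high),
                          f_u_high], dim=1)                           # 32x32x2560
        high = self.aspp(high)                                        # 32x32x256
        high = F.interpolate(high, size=low.shape[-2:], mode='bilinear',
                             align_corners=False)                     # 64x64x256

        # Final fusion (64x64x304), two convolutions, upsample to input size.
        logits = self.head(torch.cat([low, high], dim=1))
        return F.interpolate(logits, size=out_size, mode='bilinear',
                             align_corners=False)
```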

3.3. Multimodal Adversarial Training Module (MAT Module)

Deep neural networks (DNNs) often exhibit poor robustness when exposed to adversarial examples, making them vulnerable to sophisticated perturbation-based attacks. To improve the overall stability of the defence mechanisms and ensure that the model yields consistent predictions for both clean inputs x and their adversarial counterparts x adv , we propose a Multimodal Adversarial Training module (MAT). By incorporating adversarial examples into the training process, this module effectively enhances the model’s resistance to perturbed inputs. The strategy not only improves the classification performance on clean and adversarial examples but also significantly mitigates performance degradation under adversarial conditions, thereby increasing the overall recognition accuracy.
The classical adversarial training objective can be formulated as the following minimax optimization problem:
$$\min_{\theta} \; \mathbb{E}_{(x,y) \sim \mathcal{D}} \; \max_{x^{\mathrm{adv}} \in H_{\epsilon}(x)} \mathcal{L}\!\left(f_{\theta}(x^{\mathrm{adv}}), y\right)$$
where $\theta$ represents the model parameters and $(x, y) \sim \mathcal{D}$ denotes a data–label pair sampled from the true distribution $\mathcal{D}$. $\mathbb{E}_{(x,y) \sim \mathcal{D}}$ denotes the expectation over input–label pairs $(x, y)$ drawn from $\mathcal{D}$. The adversarial example $x^{\mathrm{adv}}$ is generated within a perturbation neighbourhood $H_{\epsilon}(x)$ around the original input $x$. The classifier $f_{\theta}(\cdot)$ outputs the model's prediction, and $\mathcal{L}(\cdot, \cdot)$ is the loss function measuring the prediction error with respect to the ground truth $y$.
This minimax training process can be broken down into two stages. First, during the inner maximization, the adversary constructs an adversarial input x adv that maximizes the loss within the allowed perturbation range. Second, in the outer minimization, the model parameters θ are updated by minimizing the loss on these adversarial inputs, thereby improving the model’s adversarial robustness.
To address multimodal data, such as optical images x o and SAR images x s , we extend the adversarial training objective into a joint perturbation-based minimax optimization problem:
$$\min_{\theta} \; \mathbb{E}_{(x_o, x_s, y) \sim \mathcal{D}} \; \max_{(x_o^{\mathrm{adv}}, x_s^{\mathrm{adv}})} \mathcal{L}\!\left(f_{\theta}(x_o^{\mathrm{adv}}, x_s^{\mathrm{adv}}), y\right) \quad \mathrm{s.t.} \;\; x_o^{\mathrm{adv}} \in H_{\epsilon}(x_o), \;\; x_s^{\mathrm{adv}} \in H_{\epsilon}(x_s)$$
Here, $\mathbb{E}_{(x_o, x_s, y) \sim \mathcal{D}}$ denotes the expectation over input triplets sampled from the true data distribution $\mathcal{D}$. $H_{\epsilon}(x_o)$ and $H_{\epsilon}(x_s)$ represent the perturbation neighbourhoods for the optical and SAR images, respectively, with $\epsilon$ as the maximum perturbation magnitude. The loss function $\mathcal{L}$ denotes the cross-entropy loss, which quantifies the discrepancy between the predicted and true labels.
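In a training loop, this joint minimax objective can be realised roughly as sketched below: the inner maximization crafts optical and SAR adversarial examples against the current model, and the outer minimization updates the parameters on a mix of clean and adversarial batches. The equal weighting of the two loss terms and all function names are assumptions; `attack_fn` stands for any of the joint attack routines described next.

```python
import torch

def mat_training_step(model, attack_fn, criterion, optimizer,
                      x_o, x_s, y, eps=0.01):
    """One adversarial-training step (sketch): inner maximization via attack_fn,
    outer minimization on clean + adversarial inputs."""
    # Inner maximization: craft joint optical/SAR adversarial examples within
    # an eps-ball around the clean inputs.
    model.eval()
    x_o_adv, x_s_adv = attack_fn(model, x_o, x_s, y, eps=eps)

    # Outer minimization on the mixed batch (clean and adversarial losses are
    # weighted equally here, which is an assumption).
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x_o, x_s), y) + criterion(model(x_o_adv, x_s_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```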
To generate effective multimodal adversarial examples, we incorporate multiple attack strategies within adversarial training to maximize the model’s loss, thereby encouraging the learning of more generalizable and robust representations. Specifically, for optical images x o and SAR images x s , we construct joint adversarial examples ( x o adv , x s adv ) under a unified loss objective using three methods: Fast Gradient Sign Method (FGSM) [46], Projected Gradient Descent (PGD) [47], and Momentum Iterative Method (MIM) [48].
FGSM generates adversarial examples via a single-step gradient-based update, as follows:
$$x_o^{\mathrm{adv}} = x_o + \epsilon \cdot \mathrm{sign}\!\left(\nabla_{x_o} \mathcal{L}(f_{\theta}(x_o, x_s), y)\right), \qquad x_s^{\mathrm{adv}} = x_s + \epsilon \cdot \mathrm{sign}\!\left(\nabla_{x_s} \mathcal{L}(f_{\theta}(x_o, x_s), y)\right)$$
Here, $\mathcal{L}$ denotes the loss function and $f_{\theta}$ is the classification model. The operator $\mathrm{sign}(\cdot)$ extracts the gradient direction. The gradients $\nabla_{x_o} \mathcal{L}(\cdot)$ and $\nabla_{x_s} \mathcal{L}(\cdot)$ represent the loss gradients with respect to the optical and SAR images, respectively, which guide the direction of the adversarial perturbations. The parameter $\epsilon$ controls the perturbation magnitude.
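A joint single-step FGSM consistent with the update above might look like the following sketch; clamping to the [0, 1] input range is an added assumption about how the images are normalised, and the function name is illustrative.

```python
import torch

def fgsm_joint(model, x_o, x_s, y, eps=0.01, criterion=None):
    """Single-step FGSM applied jointly to the optical and SAR inputs."""
    criterion = criterion or torch.nn.CrossEntropyLoss()
    x_o = x_o.clone().detach().requires_grad_(True)
    x_s = x_s.clone().detach().requires_grad_(True)
    loss = criterion(model(x_o, x_s), y)
    grad_o, grad_s = torch.autograd.grad(loss, [x_o, x_s])
    # One signed-gradient step per modality, then clamp to the valid range
    # (assumed to be [0, 1]).
    x_o_adv = (x_o + eps * grad_o.sign()).clamp(0, 1).detach()
    x_s_adv = (x_s + eps * grad_s.sign()).clamp(0, 1).detach()
    return x_o_adv, x_s_adv
```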
To enhance the attack strength and diversity of adversarial examples, we further incorporate PGD and MIM into the adversarial training process. PGD follows an iterative optimization framework, in which the perturbations for both the optical and SAR modalities are simultaneously updated in each iteration as follows:
$$x_o^{t+1} = \Pi_{H_{\epsilon}(x_o)}\!\left(x_o^{t} + \alpha \cdot \mathrm{sign}\!\left(\nabla_{x_o} \mathcal{L}(f_{\theta}(x_o^{t}, x_s^{t}), y)\right)\right), \qquad x_s^{t+1} = \Pi_{H_{\epsilon}(x_s)}\!\left(x_s^{t} + \alpha \cdot \mathrm{sign}\!\left(\nabla_{x_s} \mathcal{L}(f_{\theta}(x_o^{t}, x_s^{t}), y)\right)\right)$$
Here, $\alpha$ denotes the step size, and $x_o^{t}$ and $x_s^{t}$ represent the optical and SAR images at the $t$-th iteration, respectively. $\Pi_{H_{\epsilon}(x)}(\cdot)$ represents the projection operator that constrains perturbed examples within an $\epsilon$-ball centred at the original input.
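A corresponding joint PGD sketch is shown below; the default step size of eps/8 and 10 iterations mirror the experimental settings in Section 4.1, while the [0, 1] clamp is again an assumption.

```python
import torch

def pgd_joint(model, x_o, x_s, y, eps=0.01, alpha=None, steps=10, criterion=None):
    """Iterative PGD on both modalities with projection onto the eps-ball."""
    criterion = criterion or torch.nn.CrossEntropyLoss()
    alpha = eps / 8 if alpha is None else alpha
    x_o_adv, x_s_adv = x_o.clone().detach(), x_s.clone().detach()
    for _ in range(steps):
        x_o_adv.requires_grad_(True)
        x_s_adv.requires_grad_(True)
        loss = criterion(model(x_o_adv, x_s_adv), y)
        grad_o, grad_s = torch.autograd.grad(loss, [x_o_adv, x_s_adv])
        # Signed-gradient step on each modality.
        x_o_adv = x_o_adv.detach() + alpha * grad_o.sign()
        x_s_adv = x_s_adv.detach() + alpha * grad_s.sign()
        # Project back into the eps-ball and the (assumed) [0, 1] input range.
        x_o_adv = torch.clamp(torch.min(torch.max(x_o_adv, x_o - eps), x_o + eps), 0, 1)
        x_s_adv = torch.clamp(torch.min(torch.max(x_s_adv, x_s - eps), x_s + eps), 0, 1)
    return x_o_adv, x_s_adv
```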
Furthermore, the MIM extends PGD by incorporating a momentum term to enhance the transferability of perturbations. The MIM update rules are defined as follows:
$$g_o^{t+1} = \mu \cdot g_o^{t} + \frac{\nabla_{x_o} \mathcal{L}(f_{\theta}(x_o^{t}, x_s^{t}), y)}{\left\|\nabla_{x_o} \mathcal{L}(f_{\theta}(x_o^{t}, x_s^{t}), y)\right\|_1}, \qquad x_o^{t+1} = \Pi_{H_{\epsilon}(x_o)}\!\left(x_o^{t} + \alpha \cdot \mathrm{sign}(g_o^{t+1})\right)$$
$$g_s^{t+1} = \mu \cdot g_s^{t} + \frac{\nabla_{x_s} \mathcal{L}(f_{\theta}(x_o^{t}, x_s^{t}), y)}{\left\|\nabla_{x_s} \mathcal{L}(f_{\theta}(x_o^{t}, x_s^{t}), y)\right\|_1}, \qquad x_s^{t+1} = \Pi_{H_{\epsilon}(x_s)}\!\left(x_s^{t} + \alpha \cdot \mathrm{sign}(g_s^{t+1})\right)$$
Here, $\mu$ denotes the momentum factor, and $g_o^{t}$ and $g_s^{t}$ represent the accumulated gradient momentum for the optical and SAR images at the $t$-th iteration, respectively. $\|\cdot\|_1$ denotes the $L_1$ norm of a vector (i.e., the sum of the absolute values of all its elements), which is used to normalize the gradient direction. The above methods are used to construct strong adversarial examples during training, thereby enhancing the model's ability to handle inner maximization in the minimax optimization process. This provides a solid foundation for improving adversarial robustness in subsequent optimization.
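MIM differs from PGD only in its L1-normalised momentum accumulation, as in the sketch below (same assumptions as for the previous two sketches; the small constant added to the denominator merely guards against division by zero).

```python
import torch

def mim_joint(model, x_o, x_s, y, eps=0.01, alpha=None, steps=10, mu=1.0,
              criterion=None):
    """Momentum Iterative Method applied jointly to both modalities."""
    criterion = criterion or torch.nn.CrossEntropyLoss()
    alpha = eps / 8 if alpha is None else alpha
    x_o_adv, x_s_adv = x_o.clone().detach(), x_s.clone().detach()
    g_o, g_s = torch.zeros_like(x_o), torch.zeros_like(x_s)
    for _ in range(steps):
        x_o_adv.requires_grad_(True)
        x_s_adv.requires_grad_(True)
        loss = criterion(model(x_o_adv, x_s_adv), y)
        grad_o, grad_s = torch.autograd.grad(loss, [x_o_adv, x_s_adv])
        # Accumulate L1-normalised gradients with decay factor mu.
        g_o = mu * g_o + grad_o / (grad_o.abs().sum(dim=(1, 2, 3), keepdim=True) + 1e-12)
        g_s = mu * g_s + grad_s / (grad_s.abs().sum(dim=(1, 2, 3), keepdim=True) + 1e-12)
        # Signed-momentum step, then projection onto the eps-ball and [0, 1].
        x_o_adv = torch.clamp(torch.min(torch.max(
            x_o_adv.detach() + alpha * g_o.sign(), x_o - eps), x_o + eps), 0, 1)
        x_s_adv = torch.clamp(torch.min(torch.max(
            x_s_adv.detach() + alpha * g_s.sign(), x_s - eps), x_s + eps), 0, 1)
    return x_o_adv, x_s_adv
```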
To further improve the model’s robustness against the aforementioned adversarial attacks, we introduce a consistency training mechanism to mitigate the input distribution shift caused by adversarial perturbations. This mechanism enforces the model to produce consistent predictions for both the original inputs ( x o , x s ) and their adversarial counterparts ( x o adv , x s adv ) , thereby enhancing stability and generalization under adversarial conditions. Specifically, after the input data pass through the Multimodal Feature Enhancement and Fusion module (MFEF), the model’s prediction for the clean input is formulated as follows:
$$p\!\left(y \mid (x_o, x_s); \theta\right) = \mathrm{softmax}\!\left(f_{\theta}(F_x)\right)$$
Here, $F_x$ denotes the fused multimodal feature map, and $p(y \mid (x_o, x_s); \theta)$ represents the pixelwise class probability distribution predicted by the model under parameters $\theta$, given the joint input of the optical image $x_o$ and the SAR image $x_s$. This probability distribution is obtained by applying the softmax function over the output features at each spatial location, yielding the per-pixel likelihoods across semantic categories; the corresponding prediction on the adversarial inputs $(x_o^{\mathrm{adv}}, x_s^{\mathrm{adv}})$ is denoted as $p(y \mid (x_o^{\mathrm{adv}}, x_s^{\mathrm{adv}}); \theta)$.
To ensure prediction stability before and after perturbation, this work adopts the prediction consistency loss as the total loss function during training to supervise the pixelwise discrepancy between the ground-truth labels y and the predictions on adversarial samples. The loss function is defined as follows:
$$\mathcal{L}\!\left(f_{\theta}(x_o^{\mathrm{adv}}, x_s^{\mathrm{adv}}), y\right) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \mathbb{1}[y_i = c] \log p\!\left(y_i = c \mid x_o^{\mathrm{adv}}, x_s^{\mathrm{adv}}; \theta\right)$$
where $N$ denotes the total number of pixels in the image, $C$ is the number of semantic categories, and $\mathbb{1}[y_i = c]$ is the indicator function that equals 1 if pixel $i$ belongs to class $c$ and 0 otherwise. The term $p(y_i = c \mid x_o^{\mathrm{adv}}, x_s^{\mathrm{adv}}; \theta)$ represents the conditional probability that pixel $i$ is predicted as class $c$, computed via the softmax function. By minimizing this loss, the model is encouraged to maintain prediction consistency in the presence of adversarial perturbations, thereby improving its robustness in semantic classification tasks.
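In practice, this objective is the pixel-averaged cross-entropy on the adversarial pair and could be computed as in the short sketch below (function and variable names are illustrative).

```python
import torch.nn.functional as F

def adversarial_consistency_loss(model, x_o_adv, x_s_adv, y):
    """Pixel-wise cross-entropy between the ground-truth map y (B x H x W,
    integer class indices) and the prediction on the adversarial optical/SAR
    pair; the mean over pixels corresponds to the 1/N sum in the formula."""
    logits = model(x_o_adv, x_s_adv)   # (B, C, H, W)
    return F.cross_entropy(logits, y)
```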

4. Experiments

4.1. Dataset and Evaluation Metrics

Dataset: The experimental data used in this study are derived from the WHU-OPT-SAR dataset [15], which is the first publicly available optical–SAR paired dataset for land-use classification. Released by Wuhan University, this dataset covers approximately 50,000 km² of Hubei Province (30°–33°N, 108°–117°E) and contains 100 coregistered image pairs, each with a spatial resolution of 5556 × 3704 pixels. The optical images are acquired from the GF-1 satellite (with a resolution of 2 m), while the SAR images are obtained from the GF-3 satellite (with a resolution of 5 m). To ensure pixelwise alignment, the optical images are resampled to a 5 m resolution using bilinear interpolation and aligned with the SAR images at subpixel accuracy. Pixel-level land use labels are generated based on the 2017 National Land Use Change Survey in China. The dataset contains seven land cover categories: cropland, urban, rural, water, forestland, road, and others.
Data Processing and Experimental Details: During the preprocessing phase, the original remote sensing images were cropped into nonoverlapping patches of size 256 × 256 . From these, a total of 8850 image slices were randomly selected to construct the experimental sample set. The dataset was subsequently split into training and testing subsets at a ratio of 8:2 to evaluate the performance of multimodal adversarial attacks and defence strategies.
All the experiments were implemented using the PyTorch 1.13.1 framework and were conducted on a workstation equipped with an NVIDIA RTX A6000 GPU (NVIDIA Corporation, Santa Clara, CA, USA). To simulate realistic adversarial scenarios, three classical attack methods (FGSM, PGD, and MIM) were adapted for multimodal inputs and injected into the dataset. These attacks introduce subtle perturbations to the input to mislead the model, with the perturbation magnitude set to 0.01. For the multimodal PGD and MIM attacks, the step size was set to one-eighth of the maximum perturbation, and the number of iterations was fixed at 10. During adversarial training, we employed the Adam optimizer with an initial learning rate of 1 × 10⁻³, a weight decay of 1 × 10⁻⁴, a batch size of 64, and a total of 40 training epochs. These settings ensured thorough optimization of the proposed adversarial defence method.
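For reference, the training and attack settings above translate into configuration code along the following lines (a sketch; the variable names and the model object are placeholders).

```python
import torch

# Attack settings for the multimodal FGSM/PGD/MIM attacks.
EPS = 0.01          # maximum perturbation magnitude
ALPHA = EPS / 8     # step size for the iterative PGD/MIM attacks
STEPS = 10          # number of PGD/MIM iterations

def make_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    """Adam optimizer as configured for adversarial training:
    lr 1e-3, weight decay 1e-4 (batch size 64, 40 epochs)."""
    return torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```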
Evaluation Metrics: To assess the classification performance of the model under adversarial attacks quantitatively, we adopt two widely used metrics: overall accuracy (OA) and the kappa coefficient. OA measures the proportion of correctly classified samples and is defined as follows:
$$\mathrm{OA} = \frac{\sum_{i=1}^{C} n_{ii}}{N}$$
where $C$ denotes the total number of classes, $n_{ii}$ represents the number of correctly classified samples for class $i$ in the confusion matrix, and $N$ is the total number of samples.
The kappa coefficient is used to evaluate the agreement between the classification results and those obtained by random chance and is defined as follows:
$$\kappa = \frac{P_o - P_e}{1 - P_e}$$
where $P_o = \frac{\sum_{i=1}^{C} n_{ii}}{N}$ is the observed agreement and $P_e = \frac{\sum_{i=1}^{C} (n_{i\cdot} \cdot n_{\cdot i})}{N^2}$ is the expected agreement by chance. Here, $n_{i\cdot}$ and $n_{\cdot i}$ denote the sums of the $i$-th row and $i$-th column of the confusion matrix, respectively. By computing OA and $\kappa$ under different levels of adversarial perturbation (FGSM, PGD, and MIM), we comprehensively evaluate the effectiveness of the proposed adversarial defence method.
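Both metrics follow directly from the confusion matrix, as in the following sketch.

```python
import numpy as np

def oa_and_kappa(conf_mat: np.ndarray):
    """Overall accuracy and kappa coefficient from a C x C confusion matrix."""
    n = conf_mat.sum()
    p_o = np.trace(conf_mat) / n                 # observed agreement (= OA)
    row_sums = conf_mat.sum(axis=1)              # n_{i.}
    col_sums = conf_mat.sum(axis=0)              # n_{.i}
    p_e = (row_sums * col_sums).sum() / n ** 2   # expected agreement by chance
    return p_o, (p_o - p_e) / (1 - p_e)

# Toy example with a 3-class confusion matrix.
print(oa_and_kappa(np.array([[50, 2, 3], [4, 40, 1], [2, 3, 45]])))
```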

4.2. Adversarial Attack Experimental Results

We first evaluate the robustness of both unimodal and multimodal remote sensing image classification models under three classical white-box adversarial attack methods: FGSM, PGD, and MIM. The experimental results are summarized in Table 1. With respect to unimodal models, adding slight perturbations to the input images leads to a significant degradation in classification performance. Specifically, the unimodal model using only RGB images exhibits a dramatic decrease in mean overall accuracy (OA), which decreases from 0.8237 on clean examples to 0.4313 under adversarial attacks—an absolute reduction of 0.3924. This result indicates that the RGB-based model is highly vulnerable to adversarial perturbations and lacks robustness. In contrast, the unimodal model using only SAR images experiences a smaller decrease in mean OA, from 0.8456 to 0.5925. The relatively lower sensitivity of SAR-based models to adversarial noise suggests that the unique statistical characteristics of SAR data may limit the effectiveness of adversarial perturbation generation, resulting in less severe performance degradation compared with their RGB counterparts.
In the multimodal attack experiments, the remote sensing classification model that integrates both RGB and SAR imagery achieves an OA of 0.8779 under clean conditions. However, when adversarial perturbations are simultaneously applied to both modalities, the average OA decreases to 0.6517, resulting in a performance degradation of 0.3262. These results indicate that although the multimodal model offers superior classification performance under benign conditions, it still suffers from substantial accuracy loss when subjected to adversarial attacks, reflecting its limited robustness. Furthermore, the degree of performance degradation observed in the multimodal model lies between that of the RGB-only and SAR-only unimodal models. This suggests that although multimodal fusion enhances robustness to some extent, it remains insufficient to effectively resist highly targeted adversarial attacks.
We further evaluated the robustness of the multimodal classification model under unimodal adversarial attacks, with the experimental results presented in Table 2. As shown in the table, perturbations applied to the optical modality exhibit a significantly stronger attack effect than those targeting the SAR modality do. For example, under the FGSM attack, the model’s OA decreases from 0.8779 to 0.7178 when the optical modality is attacked, whereas attacking only the SAR modality results in a smaller decline to 0.7682. Iterative attacks such as PGD and MIM, which leverage multistep optimization, are generally more destructive than single-step FGSM. Nevertheless, the OA of the SAR modality remains relatively high even under these stronger attacks, demonstrating superior adversarial robustness. This performance gap may stem from the intrinsic differences in sensing mechanisms and feature representations between the two modalities. Optical imagery heavily relies on fine-grained visual features such as colour and texture, making it more sensitive to pixel-level perturbations. In contrast, SAR imagery, owing to its unique imaging process and inherent robustness to noise, preserves semantic consistency more effectively when exposed to adversarial interference, thereby exhibiting stronger resistance to attacks.

4.3. Adversarial Defence Experimental Results

To validate the effectiveness of the proposed CAGMC-Defence method, we conducted a systematic evaluation of its defence performance under three typical adversarial attacks (FGSM, PGD, and MIM) and compared it with seven mainstream defence methods, namely, HFS [28], Lagrangian-AT [34], DDC-AT [35], LBGAT [36], DKL [37], GAT [38], and AT-UR [39]. All methods were trained under the same configuration to ensure a fair comparison, with 40 training epochs, a learning rate of 1 × 10⁻³, and the Adam optimizer. Table 3, Table 4 and Table 5 report the overall accuracy (OA) and kappa coefficient of each method under varying perturbation strengths, providing a quantitative assessment of their adversarial robustness.
Under adversarial attacks, all methods showed decreased classification performance as the perturbation strength increased, but the extent of degradation varied. HFS and DDC-AT maintained moderate accuracy under small perturbations, yet their robustness decreased sharply with stronger attacks. Under PGD and MIM attacks with ϵ = 0.05 , the OA of HFS decreased to 0.4324 and 0.4474, respectively, whereas that of DDC-AT decreased to 0.5500 and 0.5441, respectively. These results indicate their limited effectiveness against high-intensity and complex attacks. In contrast, the Lagrangian-AT, LBGAT, and DKL performed well under FGSM attacks. The OA of the Lagrangian-AT was above 0.84 across all levels of perturbation strength and still reached 0.8433 at ϵ = 0.05 . Both the LBGAT and DKL maintained OA levels consistently above 0.87, reflecting their good adaptability to single-step attacks. However, all three methods experienced significant performance decreases under PGD and MIM attacks. In particular, the OA of DKL decreased to 0.5630 under the MIM with ϵ = 0.05 , which was notably lower than those of the other methods, revealing its limitations in handling complex adversarial scenarios.
Furthermore, AT-UR and GAT demonstrated stronger robustness under multistep attacks. For example, AT-UR remained stable under PGD attacks, achieving an OA of 0.8645 when ϵ = 0.05 . However, its robustness decreased under MIM attacks. When ϵ = 0.03 , the OA decreased to 0.7843, and the kappa coefficient decreased to 0.3149, indicating a weakness in adapting to higher-order attacks. In contrast, the GAT delivered consistently strong performance across all three types of attacks. In particular, under PGD and MIM attacks with ϵ = 0.05 , OA values of 0.8584 and 0.8477, respectively, were achieved. These results were higher than those of all the other methods except CAGMC-Defence, demonstrating GAT’s superior generalization ability and stable defence performance.
Overall, CAGMC-Defence consistently demonstrates the most robust defence performance across all three types of adversarial attacks. It not only achieves the highest classification accuracy under clean conditions but also exhibits markedly less performance degradation under strong perturbations. For example, under PGD attacks with a perturbation strength of 0.05, CAGMC-Defence maintains an OA of 0.8682, outperforming both AT-UR and LBGAT. Similarly, under MIM attacks it achieves the highest OA at the maximum perturbation level, surpassing the second-best method (LBGAT) and indicating a better balance between clean accuracy and adversarial robustness. Further statistical analysis shows that the average decrease in the OA of CAGMC-Defence across the three attacks is only 0.0341, notably smaller than those of the compared methods. Moreover, it shows the smallest reduction in the kappa coefficient, suggesting that CAGMC-Defence maintains high classification consistency even under adversarial conditions.
In summary, the above experimental results comprehensively demonstrate the robustness advantages of CAGMC-Defence in multimodal remote sensing image recognition tasks under various adversarial attacks. CAGMC-Defence not only significantly improves recognition accuracy under clean conditions but also maintains stable defence performance against a range of complex attack strategies. These results highlight its strong generalizability and practical applicability. Visual comparisons of different defence methods under various adversarial attacks are illustrated in Figure 5, Figure 6 and Figure 7.
To further evaluate the defence performance of the CAGMC-Defence model against single-modality adversarial attacks, we applied the three typical attack methods to either the optical or SAR modality under a fixed perturbation strength of 0.01. The overall classification accuracies (OAs) after defence are reported in Table 6. To highlight the effectiveness of the proposed defence strategy, we also compare these results with the corresponding attack outcomes without defence, as shown in Table 2.
Specifically, when the optical modality was attacked by FGSM, PGD, and MIM, the OA of the undefended model dropped to 0.7178, 0.6815, and 0.6637, respectively. With CAGMC-Defence, the OA rose to 0.8841, 0.8813, and 0.8741, improvements of 16.63, 19.98, and 21.04 percentage points, effectively mitigating the performance degradation caused by the perturbations. Similarly, when the SAR modality was attacked, the OA values of the undefended model were 0.7682, 0.7671, and 0.7658, whereas those of the defended model reached 0.8820, 0.8791, and 0.8649, gains of 11.37, 11.20, and 9.91 percentage points, respectively. These results highlight the model’s strong cross-modal robustness. Notably, the defended model even exceeded the clean-sample baseline in certain unimodal attack scenarios, demonstrating its effective use of redundant modalities and its ability to suppress corrupted inputs. Overall, the results confirm that CAGMC-Defence is not only effective against joint multimodal attacks but also provides stable and significant protection under unimodal attacks, greatly reducing the impact of adversarial perturbations on model performance.
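The reported gains are absolute OA differences between the defended and undefended model under the same attack. The short calculation below reproduces the optical-modality figures from Tables 2 and 6.

```python
# Worked arithmetic for the reported gains (optical modality attacked, eps = 0.01).
undefended = {"FGSM": 0.7178, "PGD": 0.6815, "MIM": 0.6637}   # Table 2, no defence
defended   = {"FGSM": 0.8841, "PGD": 0.8813, "MIM": 0.8741}   # Table 6, with CAGMC-Defence

for attack in undefended:
    gain = defended[attack] - undefended[attack]
    print(f"{attack}: +{gain * 100:.2f} percentage points")   # 16.63, 19.98, 21.04
```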
Furthermore, we evaluated the practicality of CAGMC-Defence by comparing the inference efficiency of various defence methods under different adversarial attack scenarios, as summarized in Table 7. All methods used their respective pretrained defence models, and during the testing phase, adversarial examples were regenerated for each model under FGSM, PGD, and MIM attacks. The total processing time, including both adversarial example generation and model inference, was recorded to fairly and comprehensively simulate the defence cost in real-world deployment. The values in the table represent the average processing time per adversarial example (in seconds), with lower values indicating higher efficiency.
The results show that CAGMC-Defence consistently achieves high inference efficiency across all attack types. Under PGD and MIM attacks, its average processing times are 2.87 s and 2.67 s, respectively, significantly lower than those of most multistep adversarial training methods such as LBGAT (10.64 s/4.03 s) and GAT (12.28 s/4.77 s). Overall, CAGMC-Defence achieves a favourable trade-off between robustness and computational cost, offering both strong defence performance and practical deployment efficiency.
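The timing protocol can be summarized as follows: the per-example cost covers adversarial example generation plus defended inference. In the sketch below, `generate_adversarial`, `model`, and the data loader interface are placeholders standing in for whatever attack and defence model are under test; this is not the actual benchmarking harness.

```python
# Sketch of measuring defence cost per adversarial example, as reported in Table 7.
import time
import torch

@torch.no_grad()
def _infer(model, optical, sar):
    return model(optical, sar).argmax(dim=1)

def average_defence_cost(model, generate_adversarial, loader, device="cuda"):
    total_time, n_examples = 0.0, 0
    for optical, sar, labels in loader:                       # assumed (optical, sar, label) batches
        optical, sar, labels = optical.to(device), sar.to(device), labels.to(device)
        start = time.perf_counter()
        adv_opt, adv_sar = generate_adversarial(model, optical, sar, labels)  # attack generation
        _infer(model, adv_opt, adv_sar)                                       # defended inference
        if device == "cuda":
            torch.cuda.synchronize()                          # wait for GPU work before timing
        total_time += time.perf_counter() - start
        n_examples += labels.size(0)
    return total_time / n_examples                            # seconds per adversarial example
```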

4.4. Iterations

To evaluate the convergence speed and defence performance of the proposed multimodal adversarial defence method, we tested the CAGMC-Defence method both on clean examples and under various adversarial attack scenarios. As shown in Figure 8, under clean conditions, the model reached an overall accuracy above 0.87 within just 10 training epochs. It then quickly converged and maintained stable accuracy throughout the remaining training process, demonstrating strong classification performance. This rapid convergence significantly reduced the computational overhead, thereby enhancing the model’s real-time capability and robustness.
In addition, under FGSM attacks, CAGMC-Defence consistently maintained an OA of approximately 0.87, with kappa values ranging narrowly between 0.65 and 0.68, indicating the high robustness of the defence model against this type of attack. In contrast, under PGD attacks, the model experienced a notable performance decrease at epoch 20, with the OA decreasing to 0.7465 and the kappa coefficient decreasing to 0.4011, suggesting that the attack effectively compromised the model’s defence at this stage. However, as training progressed to epoch 40, the OA recovered to 0.8682, and the kappa value increased to 0.6518, demonstrating that the model gradually strengthens its resistance to stronger attacks through adversarial training. In the case of MIM attacks, the OA remained below 0.75 between epochs 10 and 30, indicating that the momentum-based gradient accumulation mechanism can more easily overcome static defence strategies. Nevertheless, by epoch 40, the OA increased to 0.8574, reflecting the robust defence capability of CAGMC-Defence even under high-order adversarial attacks.

4.5. Ablation Studies

To gain a deeper understanding of the mechanisms by which each module enhances model robustness, we first revisit the design rationale of the proposed components prior to conducting the ablation experiments. The MFEF module leverages a multimodal cross-attention mechanism and feature fusion strategy to jointly model the representations of optical and SAR modalities. This design not only strengthens the collaborative perception between modalities but also helps mitigate abnormal activations induced by adversarial perturbations. In parallel, the MAT module introduces multimodal adversarial examples during training, guiding the model to adapt to perturbation distributions from an optimization perspective, thereby fundamentally improving adversarial robustness.
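To make the cross-attention idea concrete, the following sketch exchanges information between optical and SAR feature tokens in both directions. It is a minimal illustration built on standard multi-head attention; the channel sizes, token layout, and normalization choices are assumptions and do not reproduce the exact MFEF architecture.

```python
# Minimal bidirectional cross-attention between optical and SAR feature maps (PyTorch).
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Optical features query SAR features and vice versa.
        self.opt_from_sar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sar_from_opt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_opt = nn.LayerNorm(dim)
        self.norm_sar = nn.LayerNorm(dim)

    def forward(self, f_opt, f_sar):
        # f_opt, f_sar: (batch, tokens, dim), e.g. flattened spatial feature maps.
        opt_att, _ = self.opt_from_sar(query=f_opt, key=f_sar, value=f_sar)
        sar_att, _ = self.sar_from_opt(query=f_sar, key=f_opt, value=f_opt)
        # Residual connections retain modality-specific features while injecting
        # cross-modal context, which lets one modality temper the other's
        # adversarially perturbed activations.
        return self.norm_opt(f_opt + opt_att), self.norm_sar(f_sar + sar_att)

# Usage with toy shapes: 4 samples, 64 spatial tokens, 256 channels per modality.
f_opt, f_sar = torch.randn(4, 64, 256), torch.randn(4, 64, 256)
fused_opt, fused_sar = CrossModalAttention()(f_opt, f_sar)
```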
Building upon these design principles, we constructed four model variants and performed a series of ablation studies to systematically evaluate the individual contributions of each module to the overall robustness of the model: (1) NO_ACTION, the baseline model without the Multimodal Feature Enhancement and Fusion (MFEF) and Multimodal Adversarial Training (MAT) modules; (2) NO_MFEF, which retains MAT but removes MFEF; (3) NO_MAT, which retains MFEF but excludes MAT; and (4) ALL, the complete model integrating both MFEF and MAT. By comparing the performance of these variants under different adversarial attack scenarios, we assess the individual and joint effects of each module, thereby validating the robustness improvements introduced by our method. The experimental results are shown in Table 8, Table 9 and Table 10.
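Conceptually, the four variants amount to toggling two switches, as in the configuration sketch below. The variant names follow the paper, while the flag-based encoding is only an illustrative convention; how the flags are consumed by the training pipeline is left open.

```python
# Illustrative encoding of the four ablation variants as two boolean switches.
from dataclasses import dataclass

@dataclass
class AblationConfig:
    use_mfef: bool   # cross-attention feature enhancement and fusion
    use_mat: bool    # multimodal adversarial training

VARIANTS = {
    "NO_ACTION": AblationConfig(use_mfef=False, use_mat=False),  # plain baseline
    "NO_MFEF":   AblationConfig(use_mfef=False, use_mat=True),
    "NO_MAT":    AblationConfig(use_mfef=True,  use_mat=False),
    "ALL":       AblationConfig(use_mfef=True,  use_mat=True),   # full CAGMC-Defence
}
```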
According to the experimental results presented in Table 8, Table 9 and Table 10, all three types of white-box attacks significantly degraded the performance of the baseline model, with the damage increasing as the perturbation strength ϵ rose from 0 to 0.05. Among them, the MIM attack was the most destructive. For instance, at ϵ = 0.05, the OA of the baseline model without any defence modules (NO_ACTION) dropped to only 0.2578, substantially lower than under the FGSM (0.4688) and PGD (0.2825) attacks. Under clean conditions (ϵ = 0), the baseline model achieved an OA of 0.8779 and a kappa coefficient of 0.6964. Introducing only the Multimodal Feature Enhancement and Fusion module (NO_MAT) increased the OA to 0.9018 and the kappa coefficient to 0.7776, demonstrating that the MFEF module effectively improves the model’s discriminative capacity. However, in high-strength attack scenarios (ϵ ≥ 0.03), the robustness of the NO_MAT configuration diminished rapidly: under the FGSM attack at ϵ = 0.05, its OA dropped to merely 0.4886, indicating that the MFEF module alone cannot reinforce the model’s vulnerable decision boundaries against strong adversarial perturbations.
In contrast, incorporating only the Multimodal Adversarial Training module (NO_MFEF) significantly enhanced robustness. For instance, under FGSM attacks with a perturbation strength of ϵ = 0.05 , the model achieved an OA of 0.8437 and a kappa coefficient of 0.6653. Similarly, under PGD attacks with the same ϵ level, the OA and kappa coefficient remained at 0.8218 and 0.5109, respectively. Although the kappa value exhibited a more pronounced decline, both metrics still demonstrated substantial improvements over the baseline. These results indicate that the MAT module effectively reshapes the decision boundary through Multimodal Adversarial Training, thereby improving the model’s resistance to both gradient-based and iterative attacks.
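The following sketch illustrates one Multimodal Adversarial Training step of the kind described above: PGD perturbations are generated jointly for the optical and SAR inputs, and the network is then updated with a classification loss plus a clean/adversarial consistency term. The PGD settings, the KL-based consistency term, and its weighting are illustrative assumptions rather than the paper’s exact MAT formulation.

```python
# Sketch of one multimodal adversarial training step (PyTorch).
import torch
import torch.nn.functional as F

def pgd_joint(model, optical, sar, labels, eps=0.05, alpha=0.01, steps=10):
    """Generate PGD adversarial examples for both modalities simultaneously."""
    adv_opt, adv_sar = optical.clone().detach(), sar.clone().detach()
    for _ in range(steps):
        adv_opt.requires_grad_(True)
        adv_sar.requires_grad_(True)
        loss = F.cross_entropy(model(adv_opt, adv_sar), labels)
        g_opt, g_sar = torch.autograd.grad(loss, [adv_opt, adv_sar])
        # Signed gradient step, then projection into the eps-ball and the valid range.
        adv_opt = (adv_opt + alpha * g_opt.sign()).clamp(optical - eps, optical + eps).clamp(0, 1).detach()
        adv_sar = (adv_sar + alpha * g_sar.sign()).clamp(sar - eps, sar + eps).clamp(0, 1).detach()
    return adv_opt, adv_sar

def mat_step(model, optimizer, optical, sar, labels, consistency_weight=1.0):
    adv_opt, adv_sar = pgd_joint(model, optical, sar, labels)
    clean_logits = model(optical, sar)
    adv_logits = model(adv_opt, adv_sar)
    # Adversarial classification loss plus a clean/adversarial prediction consistency term.
    loss = (F.cross_entropy(adv_logits, labels)
            + consistency_weight * F.kl_div(F.log_softmax(adv_logits, dim=1),
                                            F.softmax(clean_logits, dim=1),
                                            reduction="batchmean"))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```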
The complete model (ALL), which integrates both the MAT and MFEF modules, demonstrated the most robust and stable defence performance. Under FGSM, PGD, and MIM attacks with a perturbation strength of ϵ = 0.05, its OA reached 0.8788, 0.8682, and 0.8574, respectively, surpassing every single-module variant, and its kappa coefficient remained no lower than 0.6284 across the three attacks. Compared with NO_MFEF, the ALL model achieved a 0.0464 improvement in OA and a 0.1409 increase in kappa under PGD attacks with ϵ = 0.05, highlighting the complementary synergy between the MFEF and MAT modules. Overall, the MAT module serves as the primary driver of robustness, whereas the MFEF module enhances classification consistency by leveraging the complementary characteristics of the optical and SAR modalities to smooth decision boundaries. Their integration significantly strengthens the multimodal model’s resilience against various adversarial perturbations without compromising its clean-data performance.

4.6. Transferability Experiments

To assess the transferability of the proposed CAGMC-Defence method under different adversarial training strategies, we adopted three representative attack methods—FGSM, PGD, and MIM—to independently perform adversarial training on the model. The defence performance was then evaluated using all three attack types under a fixed perturbation strength ( ϵ = 0.05 ). This setup enables a systematic evaluation of the generalization capability and stability of each training strategy (see Table 11).
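The evaluation protocol reduces to a small train-attack by test-attack grid, as sketched below. Here `train_with_attack` and `evaluate_under_attack` are placeholder functions standing in for the adversarial training and evaluation pipelines; the sketch only shows how the matrix in Table 11 is assembled.

```python
# Sketch of the transferability protocol: train with one attack, test against all three.
ATTACKS = ["FGSM", "PGD", "MIM"]

def transferability_matrix(train_with_attack, evaluate_under_attack, eps=0.05):
    results = {}
    for train_attack in ATTACKS:
        model = train_with_attack(train_attack, eps=eps)      # adversarial training with one attack
        results[train_attack] = {
            test_attack: evaluate_under_attack(model, test_attack, eps=eps)  # returns (OA, kappa)
            for test_attack in ATTACKS
        }
    return results
```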
The experimental results show that each adversarial training strategy significantly increased the model’s robustness against its own attack type. For instance, with MIM adversarial training, the model achieved its best defence performance against MIM attacks, attaining an OA of 0.8574, notably higher than its performance under FGSM (0.8072) and PGD (0.7473) attacks. This observation suggests that robustness is highest when the adversarial example generation method used during training matches the attack encountered at test time. Further analysis shows that PGD adversarial training exhibited the best transferability across attack scenarios, achieving OAs of 0.8669 and 0.8587 under FGSM and MIM attacks, respectively, with the smallest performance degradation. This advantage is likely attributable to the multistep iterative optimization in PGD adversarial example generation, which produces more diverse and representative perturbations and thereby improves the model’s generalizability. In contrast, although MIM training achieved strong defence within its native attack domain, its generalization under heterogeneous attacks was limited: against PGD attacks, its OA fell to 0.7473 and its kappa coefficient to only 0.1499.
In summary, CAGMC-Defence substantially enhances model robustness across different adversarial training paradigms. Notably, PGD-based adversarial training achieves a well-balanced trade-off between defence against PGD attacks and generalization to other attack types (e.g., FGSM and MIM), highlighting its effectiveness in complex adversarial environments.

5. Conclusions

To address the challenge of insufficient adversarial robustness in multimodal remote sensing image classification, this paper proposes CAGMC-Defence, a cooperative defence framework that integrates Multimodal Adversarial Training with a cross-attention mechanism. CAGMC-Defence comprises two key modules, the Multimodal Feature Enhancement and Fusion (MFEF) module and the Multimodal Adversarial Training (MAT) module, which jointly increase robustness from the perspectives of feature modelling and training strategy. Specifically, the MFEF module adopts a pseudo-Siamese architecture combined with cross-attention to decouple and dynamically reweight modality-specific features, thereby mitigating perturbation propagation across modalities and improving the robustness of the fused representations. The MAT module jointly generates adversarial examples for both optical and SAR modalities and enforces a prediction consistency constraint to promote stable and accurate classification under perturbations. Experimental results on the WHU-OPT-SAR dataset demonstrate that CAGMC-Defence consistently outperforms state-of-the-art defence methods, exhibiting lower performance degradation and higher classification consistency under three representative adversarial attacks: FGSM, PGD, and MIM. Ablation studies further verify the standalone effectiveness and complementary benefits of the MFEF and MAT modules. Transfer experiments also confirm the strong generalization ability of CAGMC-Defence under diverse adversarial training strategies, with PGD-based adversarial training yielding particularly robust performance across different attack types.

6. Discussion

Although the proposed CAGMC-Defence framework demonstrates strong adversarial robustness on the WHU-OPT-SAR dataset, there are still some limitations and directions worth exploring in future work.
First, the experiments in this study are conducted solely on the WHU-OPT-SAR dataset. To the best of our knowledge, this dataset is the first publicly available multimodal remote sensing benchmark that provides coregistered optical–SAR image pairs with pixel-level annotations. It offers high-quality alignment and labelling accuracy. Owing to the current scarcity of such high-quality optical–SAR datasets, we selected it as the foundation for our study to ensure accurate modality alignment and label consistency. However, as only one dataset was used, the adaptability of our method to other modality combinations and more complex data distributions has not yet been fully verified. In the future, we plan to extend our method to other multimodal scenarios, such as optical–LiDAR and SAR–hyperspectral scenarios, to evaluate its robustness and generalization ability comprehensively across different modality structures.
Second, the proposed method introduces a dual-stream architecture and a cross-modal attention mechanism, which increases model complexity. As a result, training convergence on large-scale remote sensing datasets still has room for improvement, which may limit scalability in high-resolution scenes or large-sample applications. Future research could focus on improving training efficiency through lightweight network designs and more efficient attention mechanisms.
Finally, this study focuses on multimodal remote sensing image classification. However, other downstream tasks in remote sensing, such as change detection, object detection, and causal representation learning [49,50], also face threats from adversarial attacks. Extending the CAGMC-Defence framework to these tasks and designing task-specific modules or hierarchical fusion mechanisms will be a promising direction for future research.

Author Contributions

Conceptualization, J.C. and H.C.; methodology, J.C.; software, J.C. and H.C.; validation, J.C., H.C. and L.M.; formal analysis, J.C.; investigation, J.C.; resources, C.C. and H.L.; data curation, Q.W. and K.Z.; writing—original draft preparation, J.C. and H.C.; writing—review and editing, W.G., C.C. and H.L.; visualization, J.C., H.C. and L.M.; supervision, H.L.; project administration, C.C. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant Number: 42271481) and the Natural Science Foundation of Hunan Province of China under Grant no. 2024JJ6496.

Data Availability Statement

The data associated with this research are available online. The WHU-OPT-SAR dataset, which includes optical and SAR images with pixel-level annotations, is available at https://github.com/AmberHen/WHU-OPT-SAR-dataset.git (accessed on 10 July 2025).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

1. Hong, D.; Gao, L.; Yokoya, N.; Yao, J.; Chanussot, J.; Du, Q.; Zhang, B. More diverse means better: Multimodal deep learning meets remote-sensing imagery classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4340–4354.
2. Wu, X.; Hong, D.; Chanussot, J. Convolutional neural networks for multimodal remote sensing data classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5517010.
3. Shao, R.; Yang, C.; Li, Q.; Xu, L.; Yang, X.; Li, X.; Li, M.; Zhu, Q.; Zhang, Y.; Li, Y.; et al. AllSpark: A Multimodal Spatio-Temporal General Intelligence Model with Ten Modalities via Language as a Reference Framework. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–20.
4. Algiriyage, N.; Prasanna, R.; Stock, K.; Doyle, E.E.; Johnston, D. Multi-source multimodal data and deep learning for disaster response: A systematic review. SN Comput. Sci. 2022, 3, 92.
5. Himeur, Y.; Rimal, B.; Tiwary, A.; Amira, A. Using artificial intelligence and data fusion for environmental monitoring: A review and future perspectives. Inf. Fusion 2022, 86, 44–75.
6. Zhou, Y.; Jiang, W.; Jiang, X.; Chen, L.; Liu, X. Camonet: A target camouflage network for remote sensing images based on adversarial attack. Remote Sens. 2023, 15, 5131.
7. Cui, J.; Guo, W.; Huang, H.; Lv, X.; Cao, H.; Li, H. Adversarial examples for vehicle detection with projection transformation. IEEE Trans. Geosci. Remote Sens. 2024, 63, 1–18.
8. Peng, X.; Zhou, J.; Wu, X. Distillation-Based Cross-Model Transferable Adversarial Attack for Remote Sensing Image Classification. Remote Sens. 2025, 17, 1700.
9. Karim, M.R.; Islam, T.; Lange, C.; Rebholz-Schuhmann, D.; Decker, S. Adversary-aware multimodal neural networks for cancer susceptibility prediction from multiomics data. IEEE Access 2022, 10, 54386–54409.
10. Xue, W.; Chen, Z.; Tian, W.; Wu, Y.; Hua, B. A cascade defense method for multidomain adversarial attacks under remote sensing detection. Remote Sens. 2022, 14, 3559.
11. Li, L.; Guan, H.; Qiu, J.; Spratling, M. One prompt word is enough to boost adversarial robustness for pre-trained vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 24408–24419.
12. Mao, J.; Weng, B.; Huang, T.; Ye, F.; Huang, L. Research on multimodality face antispoofing model based on adversarial attacks. Secur. Commun. Netw. 2021, 2021, 3670339.
13. Nagarajan, S.M.; Devarajan, G.G.; Ramana, T.V.M.; Asha, J.M.; Bashir, A.K.; Al-Otaibi, Y.D. Adversarial deep learning based Dampster–Shafer data fusion model for intelligent transportation system. Inf. Fusion 2024, 102, 102050.
14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
15. Li, X.; Zhang, G.; Cui, H.; Hou, S.; Wang, S.; Li, X.; Chen, Y.; Li, Z.; Zhang, L. MCANet: A joint semantic segmentation framework of optical and SAR images for land use classification. Int. J. Appl. Earth Obs. Geoinf. 2022, 106, 102638.
16. Hall, D.L.; Llinas, J. An introduction to multisensor data fusion. Proc. IEEE 2002, 85, 6–23.
17. Li, J.; Zhang, J.; Yang, C.; Liu, H.; Zhao, Y.; Ye, Y. Comparative analysis of pixel-level fusion algorithms and a new high-resolution dataset for SAR and optical image fusion. Remote Sens. 2023, 15, 5514.
18. Chen, Y.; Bruzzone, L. Self-supervised SAR-optical data fusion and land-cover mapping using Sentinel-1/-2 images. arXiv 2021, arXiv:2103.05543.
19. Irfan, A.; Li, Y.; E, X.; Sun, G. Land Use and Land Cover classification with deep learning-based fusion of SAR and optical data. Remote Sens. 2025, 17, 1298.
20. Yuan, L.; Zhu, G. Research on remote sensing image classification based on feature level fusion. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2018, 42, 2185–2189.
21. Xu, D.; Li, Z.; Feng, H.; Wu, F.; Wang, Y. Multi-scale feature fusion network with symmetric attention for land cover classification using sar and optical images. Remote Sens. 2024, 16, 957.
22. Liu, X.; Zou, H.; Wang, S.; Lin, Y.; Zuo, X. Joint network combining dual-attention fusion modality and two specific modalities for land cover classification using optical and SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 3236–3250.
23. Gao, G.; Wang, M.; Zhang, X.; Li, G. DEN: A new method for SAR and optical image fusion and intelligent classification. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5201118.
24. Lin, Y.; Wan, L.; Zhang, H.; Wei, S.; Ma, P.; Li, Y.; Zhao, Z. Leveraging optical and SAR data with a UU-Net for large-scale road extraction. Int. J. Appl. Earth Obs. Geoinf. 2021, 103, 102498.
25. Salehian, S.; Arefi, H.; Shah Hosseini, R. Change Detection in Urban Area Using Decision Level Fusion of Change Maps Extracted from Optic and SAR Images. J. Geomat. Sci. Technol. 2019, 8, 71–90.
26. Chen, J.; Xu, X.; Zhang, J.; Xu, G.; Zhu, Y.; Liang, B.; Yang, D. Ship target detection algorithm based on decision-level fusion of visible and SAR images. IEEE J. Miniaturiz. Air Space Syst. 2023, 4, 242–249.
27. Yang, K.; Lin, W.Y.; Barman, M.; Condessa, F.; Kolter, Z. Defending multimodal fusion models against single-source adversaries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3340–3349.
28. Zhang, Z.; Jung, C.; Liang, X. Adversarial defense by suppressing high-frequency components. arXiv 2019, arXiv:1908.06566.
29. Bansal, H.; Singhi, N.; Yang, Y.; Yin, F.; Grover, A.; Chang, K.W. Cleanclip: Mitigating data poisoning attacks in multimodal contrastive learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 112–123.
30. Waseda, F.; Tejero-de Pablos, A. Leveraging Many-To-Many Relationships for Defending Against Visual-Language Adversarial Attacks. arXiv 2024, arXiv:2405.18770.
31. Fares, S.; Ziu, K.; Aremu, T.; Durasov, N.; Takáč, M.; Fua, P.; Nandakumar, K.; Laptev, I. Mirrorcheck: Efficient adversarial defense for vision-language models. arXiv 2024, arXiv:2406.09250.
32. Ramesh, K. Multimodal Spoofing and Adversarial Examples Countermeasure for Speaker Verification. Master’s Thesis, University of Waterloo, Waterloo, ON, Canada, 2022.
33. Kuang, J.; Liang, S.; Liang, J.; Liu, K.; Cao, X. Adversarial backdoor defense in clip. arXiv 2024, arXiv:2409.15968.
34. Azizmalayeri, M.; Rohban, M.H. Lagrangian objective function leads to improved unforeseen attack generalization in adversarial training. arXiv 2021, arXiv:2103.15385.
35. Xu, X.; Zhao, H.; Jia, J. Dynamic divide-and-conquer adversarial training for robust semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7486–7495.
36. Cui, J.; Liu, S.; Wang, L.; Jia, J. Learnable boundary guided adversarial training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 15721–15730.
37. Cui, J.; Tian, Z.; Zhong, Z.; Qi, X.; Yu, B.; Zhang, H. Decoupled kullback-leibler divergence loss. Adv. Neural Inf. Process. Syst. 2024, 37, 74461–74486.
38. Zan, Y.; Lu, P.; Meng, T. A Gradual Adversarial Training Method for Semantic Segmentation. Remote Sens. 2024, 16, 4277.
39. Liu, Z.; Cui, Y.; Yan, Y.; Xu, Y.; Ji, X.; Liu, X.; Chan, A.B. The pitfalls and promise of conformal inference under adversarial attacks. arXiv 2024, arXiv:2405.08886.
40. Sur, I.; Sikka, K.; Walmer, M.; Koneripalli, K.; Roy, A.; Lin, X.; Divakaran, A.; Jha, S. Tijo: Trigger inversion with joint optimization for defending multimodal backdoored models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 165–175.
41. Hossain, M.Z.; Imteaj, A. Securing vision-language models with a robust encoder against jailbreak and adversarial attacks. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; pp. 6250–6259.
42. Zhang, J.; Maeda, K.; Ogawa, T.; Haseyama, M. Defense Against Black-Box Adversarial Attacks Via Heterogeneous Fusion Features. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
43. He, J.; Qin, Z.; Liu, H.; Guo, S.; Chen, B.; Wang, N.; Xiang, T. Contrastive Fusion Representation: Mitigating Adversarial Attacks on VQA Models. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 354–359.
44. Xu, Y.; Qi, X.; Qin, Z.; Wang, W. Cross-modality information check for detecting jailbreaking in multimodal large language models. arXiv 2024, arXiv:2407.21659.
45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
46. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. arXiv 2014, arXiv:1412.6572.
47. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv 2017, arXiv:1706.06083.
48. Dong, Y.; Liao, F.; Pang, T.; Su, H.; Zhu, J.; Hu, X.; Li, J. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9185–9193.
49. He, S.; Shen, P.; Xu, P.; Luo, Q.; Li, H. STDCformer: A transformer-based model with a spatial-temporal causal de-confounding strategy for crowd flow prediction. Inf. Fusion 2025, 126, 103645.
50. Wang, Y.; He, S.; Luo, Q.; Yuan, H.; Zhao, L.; Zhu, J.; Li, H. Causal invariant geographic network representations with feature and structural distribution shifts. Future Gener. Comput. Syst. 2025, 169, 107814.
Figure 1. Overall architecture of the proposed CAGMC-Defence framework.
Figure 2. Architecture of the Multimodal Feature Enhancement and Fusion module.
Figure 3. Architecture of the Multimodal Cross-Attention Submodule.
Figure 4. Architecture of the Cross-Modal Feature Fusion submodule.
Figure 5. Visualized classification results of different defence methods under FGSM attacks.
Figure 6. Visualized classification results of different defence methods under PGD attacks.
Figure 7. Visualized classification results of different defence methods under MIM attacks.
Figure 8. Classification performance of CAGMC-Defence across training epochs under different attack scenarios (clean, FGSM, PGD, and MIM). All the results are produced by the proposed method.
Table 1. Overall accuracy (OA) of unimodal and multimodal classification models under different white-box attacks.

Model | Clean | FGSM | PGD | MIM | Mean
RGB Unimodal | 0.8237 | 0.4753 | 0.4217 | 0.3986 | 0.4313
SAR Unimodal | 0.8456 | 0.6390 | 0.5824 | 0.5562 | 0.5925
Multimodal | 0.8779 | 0.6969 | 0.6452 | 0.6130 | 0.6517
Table 2. OA of the multimodal classification model under unimodal attacks.

Attack Method | Optical Modality Attacked | SAR Modality Attacked
Clean | 0.8779 | 0.8779
FGSM | 0.7178 | 0.7682
PGD | 0.6815 | 0.7671
MIM | 0.6637 | 0.7658
Table 3. Comparison of defence performance under FGSM attacks with varying perturbation strengths.

Epsilon | Metric | FGSM (no defence) | HFS | Lagrangian-AT | DDC-AT | LBGAT | DKL | GAT | AT-UR | CAGMC-Defence
0.01 | OA | 0.6969 | 0.7528 | 0.8720 | 0.8776 | 0.8819 | 0.8827 | 0.8898 | 0.8814 | 0.8911
0.01 | Kappa | 0.4630 | 0.4863 | 0.6485 | 0.6528 | 0.6757 | 0.6759 | 0.6955 | 0.6780 | 0.7266
0.02 | OA | 0.6104 | 0.6878 | 0.8711 | 0.8767 | 0.8809 | 0.8822 | 0.8868 | 0.8783 | 0.8904
0.02 | Kappa | 0.3503 | 0.3973 | 0.6412 | 0.6490 | 0.6755 | 0.6742 | 0.6855 | 0.6710 | 0.7138
0.03 | OA | 0.5546 | 0.6582 | 0.8670 | 0.8766 | 0.8790 | 0.8849 | 0.8841 | 0.8759 | 0.8878
0.03 | Kappa | 0.2886 | 0.3583 | 0.6234 | 0.6487 | 0.6726 | 0.6833 | 0.6768 | 0.6637 | 0.7037
0.04 | OA | 0.5163 | 0.6403 | 0.8566 | 0.8537 | 0.8761 | 0.8863 | 0.8811 | 0.8716 | 0.8833
0.04 | Kappa | 0.2559 | 0.3343 | 0.5775 | 0.6264 | 0.6678 | 0.6874 | 0.6669 | 0.6566 | 0.6910
0.05 | OA | 0.4688 | 0.6327 | 0.8433 | 0.8215 | 0.8730 | 0.8861 | 0.8691 | 0.8678 | 0.8788
0.05 | Kappa | 0.2213 | 0.3214 | 0.5236 | 0.6196 | 0.6628 | 0.6859 | 0.6608 | 0.6487 | 0.6783
Table 4. Comparison of defence performance under PGD attacks with varying perturbation strengths.

Epsilon | Metric | PGD (no defence) | HFS | Lagrangian-AT | DDC-AT | LBGAT | DKL | GAT | AT-UR | CAGMC-Defence
0.01 | OA | 0.6452 | 0.7362 | 0.7923 | 0.8793 | 0.8744 | 0.8791 | 0.8895 | 0.8781 | 0.8904
0.01 | Kappa | 0.3946 | 0.4628 | 0.2941 | 0.6471 | 0.6508 | 0.6594 | 0.6946 | 0.6642 | 0.7246
0.02 | OA | 0.4733 | 0.6202 | 0.7815 | 0.8175 | 0.8707 | 0.8388 | 0.8857 | 0.8748 | 0.8892
0.02 | Kappa | 0.2137 | 0.3232 | 0.2542 | 0.5345 | 0.6422 | 0.5731 | 0.6820 | 0.6557 | 0.7075
0.03 | OA | 0.3798 | 0.5203 | 0.7723 | 0.6868 | 0.8680 | 0.7644 | 0.8818 | 0.8711 | 0.8830
0.03 | Kappa | 0.1355 | 0.2300 | 0.2170 | 0.3504 | 0.6378 | 0.4503 | 0.6689 | 0.6464 | 0.6901
0.04 | OA | 0.3197 | 0.4607 | 0.7645 | 0.5757 | 0.8637 | 0.7009 | 0.8692 | 0.8680 | 0.8758
0.04 | Kappa | 0.0911 | 0.1812 | 0.1893 | 0.2407 | 0.6296 | 0.3668 | 0.6607 | 0.6393 | 0.6698
0.05 | OA | 0.2825 | 0.4324 | 0.7602 | 0.5500 | 0.8591 | 0.6215 | 0.8584 | 0.8645 | 0.8682
0.05 | Kappa | 0.0704 | 0.1573 | 0.1794 | 0.2192 | 0.6211 | 0.2820 | 0.6483 | 0.6318 | 0.6518
Table 5. Comparison of defence performance under MIM attacks with varying perturbation strengths.

Epsilon | Metric | MIM (no defence) | HFS | Lagrangian-AT | DDC-AT | LBGAT | DKL | GAT | AT-UR | CAGMC-Defence
0.01 | OA | 0.6130 | 0.7176 | 0.7849 | 0.8759 | 0.8735 | 0.8788 | 0.8887 | 0.8577 | 0.8894
0.01 | Kappa | 0.3581 | 0.4375 | 0.2614 | 0.6471 | 0.6486 | 0.6588 | 0.6919 | 0.5951 | 0.7210
0.02 | OA | 0.4991 | 0.5994 | 0.7645 | 0.7577 | 0.8692 | 0.8056 | 0.8842 | 0.8321 | 0.8868
0.02 | Kappa | 0.1913 | 0.3018 | 0.1713 | 0.4407 | 0.6391 | 0.5146 | 0.6772 | 0.5056 | 0.7006
0.03 | OA | 0.3538 | 0.5131 | 0.7539 | 0.5998 | 0.8647 | 0.7224 | 0.8697 | 0.7843 | 0.8790
0.03 | Kappa | 0.1161 | 0.2227 | 0.1297 | 0.2618 | 0.6309 | 0.3933 | 0.6622 | 0.3149 | 0.6787
0.04 | OA | 0.2963 | 0.4711 | 0.7502 | 0.5494 | 0.8595 | 0.6300 | 0.8686 | 0.7478 | 0.8710
0.04 | Kappa | 0.0793 | 0.1863 | 0.1239 | 0.2181 | 0.6214 | 0.2902 | 0.6492 | 0.1526 | 0.6578
0.05 | OA | 0.2578 | 0.4474 | 0.7469 | 0.5441 | 0.8511 | 0.5630 | 0.8477 | 0.7337 | 0.8574
0.05 | Kappa | 0.0628 | 0.1654 | 0.1268 | 0.2104 | 0.6060 | 0.2286 | 0.6266 | 0.0958 | 0.6284
Table 6. OA of the CAGMC-Defence model under unimodal attacks.

Attack Method | Optical Modality Attacked | SAR Modality Attacked
Clean | 0.8779 | 0.8779
FGSM | 0.8841 | 0.8820
PGD | 0.8813 | 0.8791
MIM | 0.8741 | 0.8649
Table 7. Inference time comparison (unit: seconds) of different defence methods under three types of attacks.

Attack Method | HFS | Lagrangian-AT | DDC-AT | LBGAT | DKL | GAT | AT-UR | CAGMC-Defence
FGSM | 1.04 | 1.06 | 1.05 | 1.06 | 1.09 | 1.19 | 1.12 | 1.00
PGD | 5.17 | 3.36 | 3.51 | 10.64 | 4.61 | 12.28 | 3.58 | 2.87
MIM | 3.71 | 3.19 | 2.88 | 4.03 | 2.85 | 4.77 | 2.86 | 2.67
Table 8. Impact of defence module configurations on CAGMC-Defence performance under FGSM attacks.

Epsilon | Metric | NO_ACTION | NO_MFEF | NO_MAT | ALL
0.00 | OA | 0.8779 | 0.8597 | 0.9018 | 0.8915
0.00 | Kappa | 0.6964 | 0.6876 | 0.7776 | 0.7415
0.01 | OA | 0.6969 | 0.8570 | 0.8006 | 0.8911
0.01 | Kappa | 0.4630 | 0.6812 | 0.5480 | 0.7266
0.02 | OA | 0.6104 | 0.8545 | 0.6615 | 0.8904
0.02 | Kappa | 0.3503 | 0.6851 | 0.3647 | 0.7138
0.03 | OA | 0.5546 | 0.8481 | 0.5814 | 0.8878
0.03 | Kappa | 0.2886 | 0.6811 | 0.2860 | 0.7037
0.04 | OA | 0.5163 | 0.8483 | 0.5147 | 0.8833
0.04 | Kappa | 0.2559 | 0.6780 | 0.2340 | 0.6910
0.05 | OA | 0.4688 | 0.8437 | 0.4886 | 0.8788
0.05 | Kappa | 0.2213 | 0.6653 | 0.2322 | 0.6783
Table 9. Impact of defence module configurations on CAGMC-Defence performance under PGD attacks.

Epsilon | Metric | NO_ACTION | NO_MFEF | NO_MAT | ALL
0.00 | OA | 0.8779 | 0.8653 | 0.9018 | 0.8914
0.00 | Kappa | 0.6964 | 0.6607 | 0.7776 | 0.7384
0.01 | OA | 0.6452 | 0.8696 | 0.7383 | 0.8904
0.01 | Kappa | 0.3946 | 0.6461 | 0.4626 | 0.7246
0.02 | OA | 0.4733 | 0.8620 | 0.4626 | 0.8892
0.02 | Kappa | 0.2137 | 0.6255 | 0.1899 | 0.7075
0.03 | OA | 0.3798 | 0.8519 | 0.3858 | 0.8830
0.03 | Kappa | 0.1355 | 0.5976 | 0.1395 | 0.6901
0.04 | OA | 0.3197 | 0.8380 | 0.3513 | 0.8758
0.04 | Kappa | 0.0911 | 0.5577 | 0.1170 | 0.6698
0.05 | OA | 0.2825 | 0.8218 | 0.3255 | 0.8682
0.05 | Kappa | 0.0704 | 0.5109 | 0.0994 | 0.6518
Table 10. Impact of defence module configurations on CAGMC-Defence performance under MIM attacks.

Epsilon | Metric | NO_ACTION | NO_MFEF | NO_MAT | ALL
0.00 | OA | 0.8779 | 0.8405 | 0.9018 | 0.8915
0.00 | Kappa | 0.6964 | 0.5191 | 0.7776 | 0.7415
0.01 | OA | 0.6130 | 0.8157 | 0.6779 | 0.8894
0.01 | Kappa | 0.3581 | 0.4235 | 0.3843 | 0.7210
0.02 | OA | 0.4991 | 0.7867 | 0.4403 | 0.8868
0.02 | Kappa | 0.1913 | 0.3038 | 0.1765 | 0.7006
0.03 | OA | 0.3538 | 0.7623 | 0.3802 | 0.8790
0.03 | Kappa | 0.1161 | 0.1998 | 0.1403 | 0.6787
0.04 | OA | 0.2963 | 0.7322 | 0.3504 | 0.8710
0.04 | Kappa | 0.0793 | 0.1125 | 0.1231 | 0.6578
0.05 | OA | 0.2578 | 0.7258 | 0.3252 | 0.8574
0.05 | Kappa | 0.0628 | 0.0950 | 0.1098 | 0.6284
Table 11. Defence performance of CAGMC-Defence under a fixed perturbation strength of ϵ = 0.05.

Adversarial Training | FGSM OA | FGSM Kappa | PGD OA | PGD Kappa | MIM OA | MIM Kappa
FGSM | 0.8788 | 0.6783 | 0.8623 | 0.6324 | 0.8538 | 0.6152
PGD | 0.8669 | 0.6388 | 0.8682 | 0.6518 | 0.8587 | 0.6191
MIM | 0.8072 | 0.4272 | 0.7473 | 0.1499 | 0.8574 | 0.6284