Article

Contrastive–Transfer-Synergized Dual-Stream Transformer for Hyperspectral Anomaly Detection

1 Shijiazhuang Campus, Army Engineering University of PLA, Shijiazhuang 050000, China
2 National Key Laboratory of Test & Evaluation of ElectroMagnetic Space Security, Luoyang 471003, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(3), 516; https://doi.org/10.3390/rs18030516
Submission received: 28 November 2025 / Revised: 20 January 2026 / Accepted: 23 January 2026 / Published: 5 February 2026

Highlights

What are the main findings?
  • CTDST-HAD achieves an average AUC of 0.988 across nine real hyperspectral datasets, outperforming ten state-of-the-art methods; accuracy remains >0.95 even in complex near-ground jungle scenes.
  • Ablation shows that the contrastive–transfer two-stage pre-training, physics-based VAE augmentation, adaptive EWC, and focal loss each contribute 1.5–3.2 AUC points and are all indispensable.
What are the implications of the main findings?
  • Each hyperspectral image does not require retraining and can be directly used for fast inference, providing a scalable paradigm for real-time and low-cost applications of hyperspectral anomaly detection.
  • The strategy of combining physics guidance and transfer learning can be extended to other remote sensing tasks, providing a general idea for intelligent interpretation under scarce annotation conditions.

Abstract

Hyperspectral anomaly detection (HAD) aims to identify pixels that significantly differ from the background without prior knowledge. While deep learning-based reconstruction methods have shown promise, they often suffer from limited feature representation, inefficient training cycles, and sensitivity to imbalanced data distributions. To address these challenges, this paper proposes a novel contrastive–transfer-synergized dual-stream transformer for hyperspectral anomaly detection (CTDST-HAD). The framework integrates contrastive learning and transfer learning within a dual-stream architecture, comprising a spatial stream and a spectral stream, which are pre-trained separately and synergistically fine-tuned. Specifically, the spatial stream leverages general visual and hyperspectral-view datasets with adaptive elastic weight consolidation (EWC) to mitigate catastrophic forgetting. The spectral stream employs a variational autoencoder (VAE) enhanced with the RossThick–LiSparseR (R-L) physical-kernel-driven model for spectrally realistic data augmentation. During fine-tuning, spatial and spectral features are fused for pixel-level anomaly detection, with focal loss addressing class imbalance. Extensive experiments on nine real hyperspectral datasets demonstrate that CTDST-HAD outperforms state-of-the-art methods in detection accuracy and efficiency, particularly in complex backgrounds, while maintaining competitive inference speed.

1. Introduction

Hyperspectral images are images of a target area acquired by imaging spectrometers and have a three-dimensional data structure. Unlike the imaging mechanism of color cameras, imaging spectrometers cover a wide spectral range with numerous extremely narrow spectral bands [1,2,3,4,5]. Each pixel in the spatial distribution corresponds to a spectral vector that reflects the essential physical properties of the underlying terrain. Hyperspectral imagery is therefore widely used in mineral mapping, precision agriculture, ecological protection, marine monitoring, food safety, urban and rural planning, and military reconnaissance [6,7,8,9]. Anomaly detection is one of the most active applications in the field of hyperspectral imaging, aiming to identify and locate anomalous targets without prior information. These targets usually occupy a small proportion of the image and differ significantly in spectrum from the background [10,11,12,13].
Early hyperspectral anomaly detection methods relied on statistical assumptions, distance metrics, and reconstruction optimization. The classic Reed–Xiaoli (RX) algorithm and its variants assume that the background follows a multivariate normal distribution and detect anomalies by measuring the deviation of each pixel from that distribution [14]. The improved Local RX (LRX) introduces a dual-window mechanism, with an inner window isolating test pixels and an outer window dynamically selecting background areas to reduce anomaly contamination; although it enhances small-target detection, it is sensitive to window size. Weighting methods assign lower weights to abnormal samples to optimize background estimation [15]. Kernel RX (KRX) maps data into a high-dimensional space through kernel functions, enhancing nonlinear separability but significantly increasing computational complexity and requiring kernel-parameter tuning [16]. Support vector data description (SVDD) constructs a minimum hypersphere enclosing the background samples; pixels falling outside the hypersphere are flagged as anomalies, without assuming a background distribution but depending on the kernel choice [17]. Density peak clustering (DPC) calculates pixel density within a local window and directly filters extremely low-density points as anomalies, which is simple and efficient but susceptible to noise interference [18]. Reconstruction methods emphasize the representational power of a background dictionary: sparse representation detection (SRD) uses an overcomplete dictionary to force each test pixel to be reconstructed as a sparse linear combination of dictionary atoms [19]. The reconstruction residual reflects the anomaly probability, and non-negativity constraints or weighting strategies are needed to suppress dictionary contamination.
Collaborative representation detection (CRD) allows all dictionary atoms to collaborate and quickly calculate residuals through closed-form solutions, with high efficiency but sensitivity to mixed pixels [20]. Low-rank and sparse matrix decomposition (LRaSMD) decomposes the data matrix into low-rank background and sparse anomalies, relying on convex optimization algorithms to solve. It is suitable for large-scale uniform backgrounds but struggles to handle sub-pixel-level anomalies [21]. Subspace methods such as orthogonal subspace projection (OSP) extract background feature basis vectors to construct an orthogonal projection space. Anomalies are captured due to prominent energy in the orthogonal direction, and local adaptive strategies need to be combined to deal with spectral variations [22,23]. The attribute- and edge-preserving filter detector (AED) performs a series of erosion and expansion operations using morphological attribute filtering on the feature images extracted after dimensionality reduction and uses Boolean graph fusion to obtain candidate anomalies. Finally, edge-preserving filtering considering local spatial correlation is adopted to reduce candidate false positives and obtain reliable results [24]. With the development of neural networks, data scales, and GPUs, deep learning-based hyperspectral anomaly detection technology is gradually gaining attention [25]. The stacked autoencoder (SAE) is a typical unsupervised deep learning network that reconstructs hyperspectral images through an encoder–decoder structure [26]. Its variants, adversarial AE (AAE) [27] and variational AE [28], serve as feature extractors, providing more discriminative features for subsequent anomaly detectors. The autonomous hyperspectral anomaly detection network (Auto-AD) method uses fully convolutional layers to construct autoencoders, which can suppress abnormal reconstruction [29]. Fan et al. 
[30] proposed the robust graph autoencoder (RGAE), which preserves the geometric structure between training samples and is robust to noise and anomalies during the training process. Liu et al. solved the limited generalization problem of the HAD method based on DNN by detecting anomalies in the frequency domain of images [31]. He et al. introduced a convolutional transformer-inspired autoencoder (CTA), which integrates convolutional layers with transformer modules to capture both local and global features, thereby enhancing anomaly detection performance [32]. Lian et al. proposed the gated transformer for hyperspectral anomaly detection (GT-HAD), a gated transformer architecture that leverages spatial–spectral similarity through dual-branch modeling and adaptive gating mechanisms [33]. Additionally, the transformer-based autoencoder framework (TAEF) was developed to model nonlinear mixing phenomena via a hybrid autoencoder–transformer structure [34]. Cheng et al. proposed a deep feature aggregation network (DFAN) for hyperspectral anomaly detection, which uses adaptive spectral–spatial aggregation and multi-separation loss to model diverse background patterns and improve anomaly separation [35]. Liu et al. introduced a self-supervised multiscale network (MSNet) for hyperspectral anomaly detection, which uses multiscale feature learning and a separation training strategy to reconstruct background and suppress anomalies without manual labels [36]. Wang et al. proposed PUNNet, a self-supervised blind-spot network that uses patch–shuffle downsampling and dilated convolution to improve background reconstruction and anomaly detection in hyperspectral imagery [37]. Fu et al. proposed a deep plug-and-play denoising CNN regularization method (DeCNN-AD) for HAD, which innovatively integrates a convolutional neural network denoiser as a prior within a low-rank representation framework via the plug-and-play scheme, bypassing the need for handcrafted regularizers [38]. 
Lin et al. proposed dynamic low-rank- and sparse-priors-constrained deep autoencoders (DLRSPs-DAEs), which integrate low-rank and sparse priors into autoencoder training via an end-to-end joint optimization strategy, thereby enhancing both background reconstruction and anomaly suppression [39]. Xiang et al. devised a pixel-associated autoencoder that employs superpixel-based dictionary construction and global similarity metrics as network input, coupled with a dual-hidden-layer feature similarity constraint to significantly improve reconstruction discriminativity [40]. Chen et al. introduced the dual-window spectral diffusion detector (DWSDiff), which utilizes a spectral diffusion process together with a dual-window neighborhood strategy for background estimation and synthesizes anomaly samples via principal component analysis and linear mixing to alleviate training data scarcity [41]. He et al. pioneered the adoption of state-space models in this task by proposing a cross-scale windowed integration state-space model (CWIMamba), which employs a cross-scale windowed design and a Haar wavelet convolution module to effectively capture long-range spatial–spectral dependencies while maintaining linear complexity and enhancing local feature representation [42]. The above-mentioned deep learning-based hyperspectral anomaly detection methods are all based on the idea of reconstruction, and each detection essentially includes two parts: training and inference [43,44]. Although they do not require a large number of hyperspectral images for training, their ability to extract and understand semantic features is much lower than that of supervised learning, and each new image requires repetitive training from scratch with no knowledge retained, resulting in low detection efficiency [45,46,47,48,49,50,51]. As a branch of unsupervised learning, self-supervised learning designs auxiliary tasks or generates labels from the internal structure of the data, enabling training similar to supervised methods [52,53].
Self-supervised learning provides a solution for the insufficient learning ability and low detection efficiency of current hyperspectral anomaly detection models [54,55].
The advantage of self-supervised learning is that it can use a large amount of unlabeled data to design auxiliary tasks for training models, learn the inherent structure and characteristics of the data, and exploit existing data resources more effectively, greatly improving the model’s generalization ability [56]. In addition, self-supervised learning can help alleviate the problem of scarce or expensive hyperspectral image annotation in supervised learning [57]. The key to self-supervised learning is to design auxiliary tasks and perform model transfer based on data features, using the pre-trained weights in the final task to improve training efficiency and generalization ability [58]. At present, most studies work on same-domain datasets, with pre-training and fine-tuning both carried out on hyperspectral images [59]. In fact, training on large-scale hyperspectral images is extremely difficult and computationally demanding [60]. Li et al. proposed a supervised anomaly detection framework based on a transferred deep CNN, which is trained on differential pixel pairs formed by different categories of ground objects in reference images and takes the difference between a center pixel and its surrounding pixels as the anomaly level of the center pixel [61]. Guan et al. proposed a cross-domain hyperspectral image classification method that transfers pre-trained weights from general color images to improve the model’s generalization ability [62]. Combining this cross-domain strategy with the transferred deep CNN framework of Li et al., we propose a contrastive–transfer-synergized dual-stream transformer for hyperspectral anomaly detection.
Current deep learning-based HAD methods have made significant progress, yet they remain constrained by several inherent limitations that impede their practical application. A critical examination reveals three core challenges:
(1) Limited Feature Representation with Poor Generalization: Prevailing methods, particularly those based on unsupervised reconstruction, are optimized to reconstruct the dominant background. This objective inherently limits their ability to learn discriminative features that effectively separate subtle anomalies from complex, heterogeneous backgrounds. Consequently, these models often exhibit performance degradation in challenging, non-homogeneous scenes.
(2) Inefficient, Non-Transferable Training Cycles: The dominant paradigm requires training a dedicated model from scratch for each new hyperspectral image (HSI) scene. This process fails to accumulate or transfer learned knowledge across different domains or sensing conditions. It results in repetitive, computationally expensive training cycles, which fundamentally contradicts the need for rapid, low-cost analysis in real-world scenarios.
(3) Severe Sensitivity to Extreme Class Imbalance: Anomaly detection is intrinsically a rare-event detection problem, where target pixels often constitute less than 0.1% of the image. Standard training objectives (e.g., Binary Cross-Entropy) are overwhelmed by the vast majority of background pixels, causing model convergence to trivial solutions that favor the background and yield high false-negative rates (missed detections).
To systematically address these interconnected challenges, we propose the contrastive–transfer-synergized dual-stream transformer for HAD (CTDST-HAD). Our framework is explicitly designed with modular components that target each limitation:
(1) We introduce a dual-stream transformer architecture pre-trained with contrastive learning. The spatial and spectral streams are separately pre-trained on large-scale general imagery and hyperspectral-derived data using SimSiam loss. This strategy forces the model to learn transferable and semantically rich representations that capture essential spatial contexts and spectral characteristics, moving beyond the simplistic background modeling of reconstruction-based approaches.
(2) We devise a Two-Stage Contrastive-Transfer Learning strategy, stabilized by adaptive elastic weight consolidation (EWC). The model first acquires general visual priors, then transfers to the hyperspectral domain. The novel adaptive EWC mechanism dynamically consolidates crucial parameters from previous tasks, mitigating catastrophic forgetting. This enables direct inference on novel HSI scenes without retraining, breaking the inefficient per-image training cycle and offering a scalable deployment paradigm.
(3) We incorporate a Physics-Guided Spectral Augmentation module and employ focal loss. The spectral stream utilizes a variational autoencoder (VAE) enhanced with the RossThick–LiSparseR (R-L) physical kernel model to generate spectrally realistic variations, effectively augmenting the latent feature space. During fine-tuning, focal loss is applied to down-weight the loss from easily classified background pixels, thereby refocusing the model’s learning capacity on the scarce and challenging anomaly samples.
The key gap we fill is the synergistic integration of cross-domain contrastive learning, physics-aware spectral modeling, and anti-forgetting transfer learning into a unified HAD framework, which has not been explored in prior work.
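As a concrete illustration of the class-imbalance remedy in contribution (3), the binary focal loss can be sketched in NumPy as follows. The defaults γ = 2 and α = 0.25 are the commonly used values from the focal loss literature, not necessarily the exact settings of CTDST-HAD:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: the (1 - pt)^gamma factor down-weights easily
    classified background pixels so the scarce anomaly pixels dominate
    the gradient. p = predicted anomaly probability, y = {0, 1} label."""
    p = np.clip(p, 1e-7, 1 - 1e-7)          # numerical safety for log
    pt = np.where(y == 1, p, 1 - p)          # probability of the true class
    w = np.where(y == 1, alpha, 1 - alpha)   # class-balancing weight
    return float(np.mean(-w * (1 - pt) ** gamma * np.log(pt)))
```

A confidently correct background pixel (small p, y = 0) contributes almost nothing, while a misclassified anomaly keeps a large loss, which is exactly the refocusing effect described above.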

2. Related Work

2.1. Vision Transformer

The vision transformer (ViT) [63] is a groundbreaking model that adapts the transformer architecture, originally designed for natural language processing, to computer vision tasks. Unlike traditional convolutional neural networks, ViT segments an input image into a sequence of fixed-size patches, linearly embeds them, and adds position embeddings. This transformation allows the image to be processed as a sequence, effectively turning image classification into a “sequence-to-sequence” problem [64].
The core component of the ViT is the transformer encoder, which relies heavily on the Multi-Head Self-Attention (MSA) mechanism. This mechanism enables the model to dynamically focus on different parts of the image and capture global dependencies between all image patches. In the self-attention (SA) module, for each input sequence X, three learnable weight matrices, Wq, Wk, and Wv, are used to project the input into query (Q), key (K), and value (V) representations:
$Q = XW_{q}, \quad K = XW_{k}, \quad V = XW_{v}$
The output of the self-attention is computed as
$\mathrm{SA} = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$
The product QKᵀ represents the similarity between different elements in the sequence, while V carries the actual feature information. Multi-head attention employs multiple such self-attention mechanisms in parallel. The input X is split, and each head computes self-attention independently according to Equations (1) and (2). The outputs of all heads are then concatenated and linearly projected:
$\mathrm{MSA}(X) = \mathrm{Concat}(\mathrm{SA}_{1}, \mathrm{SA}_{2}, \ldots, \mathrm{SA}_{h})W^{O}$
where h is the number of attention heads. This architecture allows the ViT to jointly attend to information from different representation subspaces, making it highly effective for visual recognition tasks.
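The mechanism above can be sketched in NumPy as follows. The per-head split of Q, K, V and the output projection `Wo` follow the standard ViT formulation and are illustrative rather than the exact implementation used in the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, h):
    """Minimal MSA: project X to Q, K, V (Equation (1)), run scaled
    dot-product attention per head (Equation (2)), then concatenate the
    head outputs and linearly project them (Equation (3))."""
    n, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_head = d // h
    heads = []
    for i in range(h):
        s = slice(i * d_head, (i + 1) * d_head)
        attn = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_head))  # (n, n) weights
        heads.append(attn @ V[:, s])
    return np.concatenate(heads, axis=1) @ Wo
```

Each row of the attention matrix sums to one, so every patch output is a weighted mixture of all patch values, which is what gives the ViT its global receptive field.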

2.2. Autoencoder

The autoencoder is an unsupervised neural network model designed to learn deep features of data and reconstruct the input. Essentially, it is a symmetric multi-layer neural network comprising an input layer, hidden layers, and a reconstruction layer. The output of the l-th layer of the autoencoder is
$y^{l} = f\left(W^{l}y^{l-1} + b^{l}\right)$
where $y^{l}$ denotes the output vector of the neurons in the $l$-th layer; $W^{l}$ and $b^{l}$ represent the weight matrix and bias vector of the $l$-th layer, respectively; and $f[\cdot]$ denotes a non-linear activation function (ReLU).
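Equation (4) amounts to a single affine map followed by a ReLU; a minimal NumPy sketch:

```python
import numpy as np

def ae_layer(y_prev, W, b):
    """One autoencoder layer per Equation (4): y^l = ReLU(W^l y^{l-1} + b^l)."""
    return np.maximum(0.0, W @ y_prev + b)
```

Stacking such layers in a mirror-symmetric encoder–decoder arrangement yields the full autoencoder described above.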

3. Proposed Method

In this section, we introduce the proposed contrastive–transfer-synergized dual-stream transformer framework in detail. Figure 1 shows the implementation process and model structure of the proposed framework. Contrastive learning and transfer learning are used in the training framework for hyperspectral anomaly detection, and a dual-stream transformer network serves as the encoder to extract spatial–spectral features. In the first pre-training of the spatial stream, a single sample from the general visual dataset undergoes spatial data augmentation to form two views, which are then passed through a spatial encoder and a spatial projection head to generate two feature vectors. The parameters of the spatial encoder are trained by minimizing the contrastive loss between the two feature vectors. The second pre-training of the spatial stream follows the same process as the first, except that it uses grayscale images from the perspective of a hyperspectral sensor. In the pre-training of the spectral stream, a single spectral sample is augmented through the R-L-enhanced VAE to generate two views. Two spectral feature vectors are obtained through the spectral encoder and spectral projection head, respectively. The spectral encoder is trained by minimizing the contrastive loss between the spectral feature vectors. In the fine-tuning stage, the complete hyperspectral image is decomposed into grayscale images from the hyperspectral perspective and a spectral dataset. The spatial feature vectors of the sub-blocks and the spectral feature vectors of the pixels are obtained through the spatial and spectral streams, respectively. The spatial feature vector of a sub-block serves as the individual spatial feature vector of every pixel in that sub-block.
The spatial feature vector and the spectral feature vector of each pixel are concatenated and then binary-classified to obtain the prediction result; minimizing the loss between the predictions and the labels completes the fine-tuning of the binary classification head. After fine-tuning, the complete hyperspectral image is inferred once according to the fine-tuning process to obtain the final detection result.
The dual-stream transformer architecture is adopted primarily due to the inherent characteristics of hyperspectral data and the specific requirements of the anomaly detection task: (1) Hyperspectral images contain both rich spatial structural information and continuous spectral discriminative information; the dual-stream design allows the spatial encoder and the spectral encoder to focus on learning local/global spatial contexts and interspectral variation features separately, thereby avoiding unintended blending or mutual interference of spatial and spectral features in a single encoder. (2) The spatial and spectral streams can be pre-trained independently via contrastive learning, acquiring transferable and robust representations from large-scale general visual data and physics-augmented spectral data, respectively. This separate learning strategy is more conducive to capturing deep features of each modality than joint training. (3) During fine-tuning, the features from the two streams can be fused flexibly and effectively, simultaneously leveraging spatial structural cues and spectral anomaly responses, which enables more accurate pixel-level detection in complex backgrounds and in low-altitude scenes where the phenomena of “same object with different spectra, different objects with similar spectra” are prominent. Compared to single-stream or unimodal methods that rely solely on either spatial or spectral information, the dual-stream architecture provides a natural and efficient solution for modeling the dual attributes of hyperspectral imagery simultaneously.
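The per-pixel fusion step can be sketched as follows, assuming a minimal linear-plus-sigmoid head as a hypothetical stand-in for the fine-tuned binary classification head:

```python
import numpy as np

def fuse_and_classify(spatial_feat, spectral_feat, W, b):
    """Concatenate a pixel's spatial and spectral feature vectors and score
    it with a linear layer followed by a sigmoid, yielding an anomaly
    probability in (0, 1). W and b are illustrative head parameters."""
    fused = np.concatenate([spatial_feat, spectral_feat])
    logit = float(W @ fused + b)
    return 1.0 / (1.0 + np.exp(-logit))
```

The concatenation keeps the two modalities' contributions separable at the head, which is consistent with the late-fusion design motivated above.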

3.1. Separate Pre-Training

In the pre-training stage, the spatial stream and spectral stream are trained completely separately. They have similar architectures and modules, both based on contrastive learning training, but their data input patterns and model parameters are different. The two-stage pre-training design is intended to systematically build robust representations that incorporate both general visual priors and sensor-specific characteristics. The first pre-training utilizes large-scale general visual datasets, aiming to enable the spatial encoder to learn universal spatial structures and contextual representations from natural images, knowledge that is highly transferable and provides a strong visual foundation for the model. The second pre-training shifts to a grayscale image dataset constructed from a hyperspectral sensor perspective. Its necessity lies in the following: (1) the imaging mechanism of hyperspectral sensors differs significantly from that of natural cameras, and directly using general visual features would lead to domain shift; (2) the second pre-training adapts the model to the specific imaging pattern and spatial distribution bias of hyperspectral sensors, thereby maintaining detection stability in complex near-ground scenes; and (3) through the adaptive elastic weight consolidation (EWC) strategy, the model effectively retains the general spatial priors learned in the first pre-training while acquiring sensor-specific features, thus avoiding catastrophic forgetting.

3.1.1. Spatial Stream

Considering the limited computing resources and limited training samples, SimSiam is chosen as the contrastive learning architecture. SimSiam consists of an encoder and a predictor, which directly maximizes the similarity between two enhanced views of the same image to learn feature representations. It is simple and efficient and does not require negative samples or momentum encoders. The choice of SimSiam as the contrastive learning framework is based on the following considerations: (1) SimSiam does not rely on negative sample queues or momentum encoders, avoiding the computational overhead of maintaining large negative sample banks or momentum updates, and its training procedure is simple and efficient, particularly suitable for scenarios with limited computing resources; (2) SimSiam maximizes the similarity between two augmented views of the same image via a simple symmetric structure and can still learn robust representations even with few samples, which aligns well with the annotation-scarce nature of hyperspectral data; and (3) compared to contrastive methods requiring large negative samples or complex momentum mechanisms, SimSiam significantly reduces implementation complexity and GPU memory requirements while maintaining competitive performance, thereby facilitating stable training of both spatial and spectral encoders within our dual-stream framework. The general visual dataset is used for the first pre-training of the spatial stream. In fact, the task of the spatial stream is to encode features of images, which is very similar to image classification tasks for general visual datasets. Firstly, spatial data augmentation is applied to the general visual dataset to generate two enhanced views. The operations of spatial data augmentation include random cropping, random color jitter, random grayscale conversion, random Gaussian blur, and random horizontal flipping. Each augmented view is mapped to a feature vector by the spatial encoder.
The spatial encoder is a standard vision transformer model with its classification head removed. The spatial projection head consists of two fully connected layers that output a feature vector with the same dimension as the input, used to predict the representation of the other view. The first pre-training uses SimSiam’s standard loss function:
$L_{S} = -\frac{1}{2}\,\mathrm{sim}\left(p_{1}, \mathrm{sg}\left(z_{2}\right)\right) - \frac{1}{2}\,\mathrm{sim}\left(p_{2}, \mathrm{sg}\left(z_{1}\right)\right)$
where $\mathrm{sim}(a, b) = \frac{a^{\top} b}{\left\| a \right\|_{2} \left\| b \right\|_{2}}$ denotes the cosine similarity between feature vectors $a$ and $b$; $p_{1}$ and $p_{2}$ are the projected feature vectors from the two augmented views; $z_{1}$ and $z_{2}$ are the encoded feature vectors; and $\mathrm{sg}(\cdot)$ is the stop-gradient operation.
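A minimal NumPy sketch of this loss; in a pure forward pass the stop-gradient simply means $z_{1}$ and $z_{2}$ are treated as constants during backpropagation:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def simsiam_loss(p1, p2, z1, z2):
    """Symmetric negative cosine similarity of SimSiam (Equation (5)):
    each predictor output p is pulled toward the stop-gradient encoding z
    of the other view. The minimum value is -1 (perfect alignment)."""
    return -0.5 * cos_sim(p1, z2) - 0.5 * cos_sim(p2, z1)
```

When both predicted and encoded vectors coincide, the loss reaches its minimum of −1, which is what the optimization drives toward.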
The spatial encoder is implemented as a standard vision transformer (ViT). It processes input images with a fixed size of 224 × 224 pixels. Each image is divided into a grid of non-overlapping patches, where each patch has a size of 32 × 32 pixels, resulting in a sequence of (224/32)² = 49 patches. Each patch is linearly projected into an embedding space of dimension feature_dim = 128. The core of the encoder is a stack of depth = 6 identical transformer layers. Within each layer, the Multi-Head Self-Attention (MSA) mechanism employs heads = 16 parallel attention heads, with a dimension of head_dim = 64 per head. A dropout rate of 0.1 is applied after both the attention and the subsequent MLP block for regularization. An additional embedding dropout (emb_dropout = 0.1) is also applied to the patch embeddings. This configuration enables the model to capture robust hierarchical spatial representations. Table 1 shows the parameter settings of the spatial encoder and spectral encoder, reporting the parameters for normalizing the image size of the spatial encoder, as well as important parameters such as the input spectral dimension and window size of the spectral encoder.
The second pre-training of the spatial stream uses a grayscale image dataset from the perspective of a hyperspectral sensor. A complete hyperspectral image is segmented into several subimages for each band, with each subimage treated as a sample, to construct a spatial dataset from a hyperspectral perspective. The process of the second pre-training is the same as that of the first one. The universal spatial prior is covered by a specific imaging mode of the hyperspectral sensor, resulting in catastrophic forgetting of the model [65]. To consolidate the learned knowledge in model transfer and adapt to the spatial distribution bias from the perspective of hyperspectral sensors, a regularization strategy based on parameter importance perception is required. EWC quantifies the contribution of pre-trained parameters to general spatial modeling based on the Fisher information matrix, constrains their fine-tuning offset amplitude, and establishes a balance between preserving cross-domain spatial generalization knowledge and fusing sensor-specific spatial features. The loss function for the second pre-training is
$L = L_{S} + \frac{\lambda_{t}}{2} \sum_{i} F_{i} \left(\theta_{i} - \theta_{i}^{*}\right)^{2}$
where $L_{S}$ is the SimSiam standard loss function, $\lambda_{t}$ is the adaptive penalty coefficient, $F_{i}$ is the importance of the $i$-th parameter, $\theta_{i}$ is the $i$-th parameter during the second pre-training, and $\theta_{i}^{*}$ is the corresponding parameter learned in the first pre-training.
The use of adaptive penalty coefficients can avoid manual parameter tuning, adapt to training differences across datasets, avoid underfitting or overforgetting, and satisfy the following equation:
$\lambda_{t} = \lambda_{0} \exp\left(-\tau u\right)$
$u = \frac{1}{\left| C \right|} \sum_{x \in C} \cos\left(f(x), f'(x)\right)$
where $C$ is a batch of samples, $\lambda_{0}$ is the base penalty coefficient, $\tau$ is the exponential decay rate, $f$ is the current model encoder, and $f'$ is the old (first-stage) encoder.
$F_{i}$ can be estimated using the second derivative of the loss near the parameter extremum:
$F_{i} = \frac{1}{\left| D \right|} \sum_{d \in D} \frac{\partial^{2} L_{S}\left(d, \theta_{i}\right)}{\partial \theta_{i}^{2}}$
where D is the dataset, and d is the batch data in it.
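The adaptive EWC terms can be sketched as follows, under the assumption (consistent with the "avoid underfitting or overforgetting" goal above) that the penalty relaxes as the old/new feature similarity u increases:

```python
import numpy as np

def adaptive_lambda(lam0, tau, u):
    """Adaptive penalty coefficient: u is the mean cosine similarity between
    current and old encoder features. Higher u (little forgetting) yields a
    weaker constraint; lower u strengthens it."""
    return lam0 * np.exp(-tau * u)

def ewc_penalty(theta, theta_star, fisher, lam_t):
    """Quadratic EWC regularizer of Equation (6): parameters with large
    Fisher importance F_i are pulled back toward their first-stage values
    theta_star, mitigating catastrophic forgetting."""
    return 0.5 * lam_t * float(np.sum(fisher * (theta - theta_star) ** 2))
```

In training, `ewc_penalty` is simply added to the SimSiam loss; parameters that were unimportant in the first stage (small F_i) remain free to adapt to the hyperspectral-sensor domain.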

3.1.2. Spectral Stream

The training of the spectral stream also adopts contrastive learning, with a vision transformer as the spectral encoder. Figure 2 shows the acquisition of a single sample in the spectral dataset, which relies on sliding a dual window over the hyperspectral image. The specific parameters are as follows: an inner window of 9 × 9 pixels and an outer window of 11 × 11 pixels. The pixel at the center of the inner window is treated as the target, while the annular region between the inner and outer windows is defined as its local background. This dual-window structure slides across the entire hyperspectral image with a stride of 5 pixels to collect samples. The variational autoencoder serves as a data augmentation method for the spectral dataset; because each sample carries a “center pixel–surrounding pixels” correspondence, data augmentation may affect sample independence. By repeatedly resetting the weights of the variational autoencoder, each sample is re-augmented to reduce the similarity between augmented samples.
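The dual-window sampling can be sketched as follows, using the stated window sizes and stride; returning (target, background-ring) pairs is an assumption about the sample format:

```python
import numpy as np

def dual_window_samples(hsi, inner=9, outer=11, stride=5):
    """Slide an outer x outer window over an HSI cube of shape (H, W, B).
    The centre pixel of the inner window is the target; the ring between
    the inner and outer windows is its local background."""
    H, W, B = hsi.shape
    half = outer // 2
    margin = (outer - inner) // 2
    samples = []
    for r in range(half, H - half, stride):
        for c in range(half, W - half, stride):
            patch = hsi[r - half:r + half + 1, c - half:c + half + 1, :]
            target = patch[half, half, :]
            mask = np.ones((outer, outer), dtype=bool)
            mask[margin:outer - margin, margin:outer - margin] = False
            background = patch[mask]  # ring pixels, (outer^2 - inner^2, B)
            samples.append((target, background))
    return samples
```

With a 9 × 9 inner and 11 × 11 outer window, each background ring contains 121 − 81 = 40 pixels.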
The bidirectional reflectance distribution function (BRDF) is often used to describe how the relative positions of the imaging spectrometer, the target area, and the sun affect spectral curves. Semiempirical models have concise expressions, clear physical meaning, and a wide inversion range, which makes them valuable guides for spectral data generation. The RossThick–LiSparseR model (R-L model) is a typical kernel-driven model with strong inversion ability. It decomposes the bidirectional reflectance into three parts: isotropic reflection, volume scattering, and geometric reflection. It satisfies the following equation:
$$R = f_{iso} + f_{vol} K_{vol} + f_{geo} K_{geo}$$
where $f_{iso}$, $f_{vol}$, and $f_{geo} \in \mathbb{R}$ are constant coefficients, and the RossThick kernel satisfies the following system of equations:
$$K_{vol} = \frac{\left(\frac{\pi}{2} - \gamma\right)\cos\gamma + \sin\gamma}{\cos\alpha + \cos\beta} - \frac{\pi}{4}$$
$$\cos\gamma = \cos\alpha\cos\beta + \sin\alpha\sin\beta\cos\varphi$$
The LiSparseR kernel satisfies the following system of equations:
$$K_{geo} = \Omega - \sec\alpha - \sec\beta + \frac{\left(1 + \cos\gamma\right)\sec\alpha\sec\beta}{2}$$
$$\Omega = \frac{1}{\pi}\left(\eta - \sin\eta\cos\eta\right)\left(\sec\alpha + \sec\beta\right)$$
$$\cos\eta = \frac{2\sqrt{\chi^2 + \left(\tan\alpha\tan\beta\sin\varphi\right)^2}}{\sec\alpha + \sec\beta}$$
$$\chi = \sqrt{\tan^2\alpha + \tan^2\beta - 2\tan\alpha\tan\beta\cos\varphi}$$
where α is the zenith angle of light irradiation, β is the observation zenith angle, and φ is the relative azimuth angle. γ is the phase angle. Ω is the overlap integral. χ and η are intermediate variables for kernel calculation.
The R-L model is embedded into the decoder of the variational autoencoder. Figure 3 shows the training process for a single sample. The parameters $\mu$ and $\sigma$ of the latent variables are obtained from a single spectral curve through a standard encoder, determining the variational posterior distribution $q_\phi(z \mid x)$. Sampling from $q_\phi(z \mid x)$ via the reparameterization trick yields the latent variable $z = (\alpha, \beta, \varphi)$, which essentially represents the angular information of the spectrum. In the R-L-enhanced decoder, the latent variable $z$ is used to compute $K_{vol}$ and $K_{geo}$ from Equations (11) and (13); $f_{iso}$, $f_{vol}$, and $f_{geo}$ are trainable parameter sets, and the reconstructed spectral curve is generated from Equation (10).
The detailed layer-wise architecture and data flow dimensions of this R-L-enhanced VAE are summarized in Table 2. Specifically, the encoder compresses the input spectrum through three fully connected layers (89→128→64→32). The resulting 32-dimensional feature is then mapped by two parallel linear layers to the mean and log-variance vectors, each of dimension 3, defining the latent distribution. During reparameterization, the 3D latent variable z is sampled. In the decoder, z is transformed into the two physical kernels K (dimension 2), which subsequently interact with the trainable spectral coefficients f to reconstruct the 89-dimensional spectrum via the physical Equation (10). This design explicitly integrates the deterministic R-L physical model into the stochastic generation process of the VAE.
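Under the equations above, the physical decoder of the R-L-enhanced VAE can be sketched as follows. This is an illustrative NumPy version (the paper's module is a trainable PyTorch network); it assumes the factor of 2 appearing in the cos η equation and per-band coefficient vectors f:

```python
import numpy as np

def rl_kernels(alpha, beta, phi):
    """RossThick volume kernel and LiSparseR geometric kernel (Eqs. 11-16).
    alpha: solar zenith, beta: view zenith, phi: relative azimuth (radians)."""
    cos_g = np.cos(alpha) * np.cos(beta) + np.sin(alpha) * np.sin(beta) * np.cos(phi)
    gamma = np.arccos(np.clip(cos_g, -1.0, 1.0))          # phase angle
    k_vol = ((np.pi / 2 - gamma) * np.cos(gamma) + np.sin(gamma)) / (
        np.cos(alpha) + np.cos(beta)) - np.pi / 4
    chi = np.sqrt(np.tan(alpha) ** 2 + np.tan(beta) ** 2
                  - 2 * np.tan(alpha) * np.tan(beta) * np.cos(phi))
    sec_sum = 1 / np.cos(alpha) + 1 / np.cos(beta)
    cos_eta = 2 * np.sqrt(chi ** 2 + (np.tan(alpha) * np.tan(beta) * np.sin(phi)) ** 2) / sec_sum
    eta = np.arccos(np.clip(cos_eta, -1.0, 1.0))
    omega = (eta - np.sin(eta) * np.cos(eta)) * sec_sum / np.pi   # overlap integral
    k_geo = omega - 1 / np.cos(alpha) - 1 / np.cos(beta) + (
        (1 + cos_g) / (2 * np.cos(alpha) * np.cos(beta)))
    return k_vol, k_geo

def rl_decode(z, f_iso, f_vol, f_geo):
    """Physical decoder: latent z = (alpha, beta, phi) plus trainable per-band
    coefficients -> reconstructed reflectance per band (Eq. 10)."""
    k_vol, k_geo = rl_kernels(*z)
    return f_iso + f_vol * k_vol + f_geo * k_geo
```

At nadir geometry (α = β = φ = 0) both kernels vanish and the decoder returns the isotropic term alone, which is a convenient sanity check on the implementation.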
The enhanced spectral data are transformed by the spectral encoder into feature vectors representing the differential features between the center and background pixels. The spectral projection head also consists of two fully connected layers and ultimately outputs a feature vector with the same dimension as the input, which is used to predict the representation of the other spectrally enhanced view. The contrastive loss of the spectral stream is calculated using Formula (5).

3.2. Synergistic Fine-Tuning

In the fine-tuning stage, the input is a complete hyperspectral image, and the spatial and spectral features extracted by the two streams are fused for pixel-level detection. First, the input hyperspectral image is reduced to three bands through PCA and then spatially segmented into several patches to fit the input of the spatial stream. The hyperspectral image is segmented into patches of the same size, and the sliding dual window extracts spectral data within each patch. In effect, the spatial–spectral feature of a point in the hyperspectral image is the combination of its spatial patch features and the "center–background" differential features selected by the dual window. Algorithm 1 summarizes the calculation process. Finally, a binary classifier classifies the deep features. This classifier is implemented as a lightweight fully connected network. It takes the fused spatial–spectral feature vector of dimension d1 + d2 = 256 as input. The network consists of two layers: a hidden layer with 128 neurons followed by a ReLU activation and dropout, and an output layer with 2 neurons corresponding to the background and anomaly classes. The output logits are normalized via a softmax function to obtain the final anomaly probability.
The proportion of anomalous targets in hyperspectral images is small, producing a significant imbalance between target and background samples when training on full-frame images. To mitigate the impact of this imbalance, focal loss is used for backpropagation:
$$\mathrm{FocalLoss} = -\alpha\left(1 - p_t\right)^{\gamma}\log\left(p_t\right)$$
where α is the category balance weight, used to adjust the loss contribution ratio of positive and negative samples; γ is a focus parameter used to reduce the weight of easy-to-classify samples, making the model more focused on difficult-to-classify samples; and pt is the probability of the model predicting the correct category.
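As a concrete reference, the focal loss above can be written directly. This is a minimal NumPy sketch of the per-sample loss, not the training code:

```python
import numpy as np

def focal_loss(p_t, alpha=0.25, gamma=2.0):
    """Per-sample focal loss: FL = -alpha * (1 - p_t)**gamma * log(p_t),
    where p_t is the model's predicted probability of the correct class.
    gamma down-weights easy samples; alpha balances the classes."""
    p_t = np.asarray(p_t, dtype=float)
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)
```

With γ = 0 the expression reduces to an α-weighted cross-entropy; with γ = 2 a confidently classified background pixel (p_t ≈ 0.9) contributes roughly two orders of magnitude less loss than an uncertain one, which is what lets the scarce anomaly pixels dominate the gradient.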
Algorithm 1: Concatenate spatial–spectral features
Input: HSI cube $H \in \mathbb{R}^{M \times N \times B}$
Output: Fused feature map $F \in \mathbb{R}^{M \times N \times D}$, where $D = d_1 + d_2$
1. Spatial stream
    Irgb←PCA(H,3)
    P←split_into_patches(Irgb,size = n)
    Fspa←SpatialEncoder(P)
    Fspa←upsample(Fspa,(M,N))
2. Spectral stream
    for  i = 1 to M do
        for j = 1 to N do
              W←extract_window(H,i,j,size = p)
              Fspe(i,j)←SpectralEncoder(W)
         end for
     end for
3. Fusion
    F←concat(Fspa,Fspe,axis = −1)
4. return F
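Algorithm 1 can be sketched as runnable code. This version is illustrative only: the encoders are passed in as callables, PCA is performed via SVD, patch-wise spatial features are broadcast back to full resolution as a stand-in for upsampling, and the patch/window sizes are toy values rather than the paper's settings:

```python
import numpy as np

def pca_to_rgb(hsi, k=3):
    """Project the B spectral bands onto their top-k principal components."""
    M, N, B = hsi.shape
    X = hsi.reshape(-1, B)
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return (X @ Vt[:k].T).reshape(M, N, k)

def fuse_features(hsi, spatial_encoder, spectral_encoder, patch=4, win=3):
    """Algorithm 1 sketch: per-patch spatial features (broadcast to full
    resolution) are concatenated with per-pixel spectral features."""
    M, N, B = hsi.shape
    rgb = pca_to_rgb(hsi)
    # spatial stream: one feature vector per non-overlapping patch
    d1 = spatial_encoder(rgb[:patch, :patch]).shape[0]
    f_spa = np.zeros((M, N, d1))
    for i in range(0, M, patch):
        for j in range(0, N, patch):
            f_spa[i:i + patch, j:j + patch] = spatial_encoder(rgb[i:i + patch, j:j + patch])
    # spectral stream: one feature vector per pixel from its local window
    r = win // 2
    padded = np.pad(hsi, ((r, r), (r, r), (0, 0)), mode="reflect")
    d2 = spectral_encoder(padded[:win, :win]).shape[0]
    f_spe = np.zeros((M, N, d2))
    for i in range(M):
        for j in range(N):
            f_spe[i, j] = spectral_encoder(padded[i:i + win, j:j + win])
    return np.concatenate([f_spa, f_spe], axis=-1)
```

Any callables mapping an image patch (or spectral window) to a fixed-length vector can stand in for the transformer encoders when exercising the fusion logic.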

4. Experiment and Analysis

4.1. Datasets and Parameter Settings

In this section, we comprehensively evaluate the proposed framework. The datasets used in this article include universal visual datasets and hyperspectral datasets.

4.1.1. Universal Visual Dataset

The following datasets were used:
(1) Mini-ImageNet: A subset of images extracted from the ImageNet dataset for few-shot research, with 100 categories and 600 color images per category. The image size is not fixed, and the images cover various objects such as people, animals, tanks, cannons, tools, and ships.
(2) Military images: A high-quality image classification dataset for military scenarios, widely used in small-sample learning and military target recognition research, containing 1502 samples.
(3) MAR20: A military aircraft dataset for military applications such as reconnaissance and early warning, intelligence analysis, and battlefield situational awareness, containing 20 main models of the United States and Russia, with a total of 3842 images and 22,341 examples.

4.1.2. Hyperspectral Dataset

Nine hyperspectral images were used for the experiment: four self-test datasets and five publicly available datasets. The self-test data come from the HSI-300 hyperspectral imaging system, whose core imaging component is an acousto-optic tunable filter (AOTF). Table 3 shows the technical details of the sensor. The number of bands for self-test data collection was set to 89.
Figure 4 shows the pseudocolor images and the corresponding ground-truth maps of the nine hyperspectral datasets. Detailed information about the data is provided below:
(1) Cement Street: This was acquired on 6 April 2024 at the same coordinates, GSD 1.5 mm and 500 × 300 pixels; a cement road hosts two alloy aircraft models.
(2) Holly: This has a resolution of 600 × 496; synthetic turf-B is the anomaly, and the backdrop includes holly flowerbeds and architectural shadows. Meadow and Holly were captured on 26 May 2022 at 38°27′N, 114°30′E, with a ground sampling distance (GSD) of ≈8.4 mm, representing typical urban near-ground views.
(3) Jungle-I and Jungle-II: These are also from 6 April 2024, at the same location, with a GSD of 3.3 mm and 1000 × 1000 pixels; the close-range jungle settings contain synthetic turf-C in Jungle-I and synthetic turf-D in Jungle-II.
(4) Gulfport: This was captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor, with a spatial resolution of 3.4 m. The size of the hyperspectral image is 100 × 100 × 191. Due to the water absorption region and lower signal-to-noise ratio, after removing the bad bands, the spectral wavelength ranges from 400 to 2500 nm, with a total of 191 bands. The anomaly types in this HSI dataset are the three planes at the bottom of the image, occupying 60 pixels.
(5) HYDICE: This was captured by the Hyperspectral Digital Image Collection Experiment (HYDICE) sensor in urban areas of California, USA. The size of the hyperspectral image is 80 × 100 × 175, with noise bands removed from the image. This dataset includes 21 outlier pixels, namely cars and roofs.
(6) Texas Coast: A scene captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Texas coastal region. The size of the hyperspectral image is 100 × 100 × 207, with 207 bands in the range of 0.45 to 1.35 μm. This dataset suffers from stripe noise and is considered challenging.
(7) San Diego: This was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) at San Diego Airport in California, USA. The size of the hyperspectral image is 400 × 400 × 224. The spectral resolution is 10 nm, and the wavelength range is 0.37–2.51 μm. After removing the low signal-to-noise ratio and absorption bands (1–6, 33–35, 97, 107–113, 153–166, and 221–224), 189 bands were retained in the experiment. Two 100 × 100 regions were cropped from the HSI and used in the experiment. Aircraft are the anomalous targets in each image. The two HSIs are named San Diego-I and San Diego-II.
Figure 4. Pseudocolor images (left) and the corresponding ground-truth maps (right) of nine real hyperspectral datasets. (a) Cement Street. (b) Holly. (c) Jungle-I. (d) Jungle-II. (e) Gulfport. (f) HYDICE. (g) Texas Coast. (h) San Diego-I. (i) San Diego-II.

4.1.3. Environment and Model Settings

All experiments were implemented on Ubuntu using PyTorch 1.8 and Python 3.8, with two GeForce RTX 2080 GPUs. The detailed architecture parameters of the proposed model components (spatial/spectral encoders and the R-L-enhanced VAE) are shown in Table 1 and Table 2.

4.2. Ablation Experiment

This section conducts ablation experiments to quantitatively evaluate the impact of the proposed components and hyperparameters on the overall framework and to determine the optimal parameters. Because the overall framework of CTDST-HAD already performs very well on the public data, ablation there is less informative; the four self-test datasets are therefore used for these experiments.

4.2.1. Quality of Stage Training

We evaluated the quality of stage training in the proposed contrastive transfer learning framework. The component strategies involved in the evaluation are as follows: (1) first pre-training and synergistic fine-tuning (FSF), (2) second pre-training and synergistic fine-tuning (SSF), (3) synergistic fine-tuning (SF), and (4) first pre-training and second pre-training and synergistic fine-tuning (FSSF). To validate the effectiveness of our proposed framework in hyperspectral anomaly detection tasks, we removed training components at different stages. Figure 5 shows the detection performance of four strategies. The FSSF strategy significantly improves the detection performance, achieving optimal results in Cement Street, Jungle-I, and Jungle-II. The performance of the FSSF strategy in Holly is suboptimal, but the detection results of all four strategies are greater than 0.9975, and the magnitude of model performance improvement and degradation is very small. Based on experiments on four datasets, the effectiveness of the FSSF strategy can be demonstrated. In addition, the comparison between SF and FSF proves that using a general visual dataset for the first pre-training can improve the detection performance of the model. The improvement effect of the second pre-training is better than that of the first pre-training because it not only includes spatial datasets from hyperspectral perspectives but also undergoes spectral pre-training.

4.2.2. The Influence of Adaptive EWC Weights

This section discusses the impact of adaptive EWC weights on the model. To further enhance the knowledge-retention effect of EWC for cross-domain learning, an adaptive penalty coefficient is introduced. Figure 6 shows the effect of the base penalty coefficient on adaptive EWC when the decay rate is 5; the horizontal axis is on a base-10 logarithmic scale. The curve peaks locally at λ0 = 10−4, 1, 102, and 106, with average AUC values of 0.9687, 0.9794, 0.9779, and 0.9710, respectively; the best result is obtained at λ0 = 1. Figure 7 shows the effect of the decay rate on adaptive EWC when λ0 is set to 1; the model achieves optimal performance when the decay rate is 5.
Table 4 compares our proposed adaptive EWC with standard EWC and with no constraint. The improvement from adaptive EWC is clear, and it outperforms standard EWC. Without a regularization constraint, the gradients of the new task can overwrite the old parameter space arbitrarily, so old knowledge is inevitably lost. Standard EWC applies a fixed penalty of uniform strength, which improves performance: it suppresses the overwriting of old knowledge, but because it neglects parameter importance and task differences, it struggles to balance memory retention against the absorption of new knowledge. Adaptive EWC first measures each parameter's criticality to old tasks using Fisher information and then dynamically scales the penalty intensity with a task-difference indicator, so that key parameters remain strongly protected when the difference is small, while non-key parameters quickly relax their constraints when the difference is large. In theory, it therefore satisfies the dual requirements that important information is not lost and new knowledge can be fully learned, and it automatically adapts to different distribution shifts without manually tuning global hyperparameters for each task.

4.2.3. The Impact of R-L-Enhanced VAE

The generation of spectrally enhanced views plays an important role in contrastive learning, as it affects the training performance of the model. Table 5 shows the effects of several conventional spectral-enhancement operations and the R-L-enhanced VAE on model performance. The random wavelength offset is two bands. Compared with the R-L-enhanced VAE, which explicitly embeds the RossThick–LiSparseR physical mechanism into the decoding path, pure data-perturbation methods such as random channel masking, spectral Gaussian noise, random intensity scaling, and random wavelength shift essentially perform unstructured local deformations in the high-dimensional observation space, without introducing any prior constraint on the deterministic mapping between solar-observation geometry and reflectance. Although these conventional strategies expand sample diversity, they also significantly amplify deviations from the real physical manifold, forcing representation learning to search a redundant and inconsistent hypothesis space. This exposes insufficient robustness and blurred discriminative boundaries in cross-scene generalization, ultimately yielding an average accuracy 1.5–2.1 percentage points lower than that of the R-L-enhanced VAE.

4.2.4. The Impact of Focal Loss

The positive and negative samples in hyperspectral anomaly detection datasets are usually extremely imbalanced, and the focal loss function addresses the impact of this imbalance on the model. Figure 8 shows the effect of γ on the model when α in the focal loss is 0.25; the optimal value of γ is 2. Figure 9 shows the impact of α when γ is 2; the optimal α is 0.25. We also compare focal loss with BCE loss, the most common choice in binary classification. Table 6 shows the performance of the model under each loss. Focal loss outperforms BCE loss, which directly demonstrates its effectiveness in the model. Hyperspectral anomaly detection datasets exhibit a large pixel-count gap between target and background, so training samples are extremely imbalanced; as a result, BCE loss is dominated by the large number of easily separable negative samples and its gradients are "overwhelmed". Focal loss dynamically down-weights easily separable samples through its modulating factor, allowing the model to focus on scarce anomalous positive samples and learn more discriminative feature boundaries. In addition, hyperspectral data have many bands and high redundancy, and anomalous spectra often exhibit weak, sparse responses; the focusing mechanism of focal loss effectively amplifies these weak signals, avoiding the performance degradation that sample imbalance causes under BCE loss.

4.3. Comparative Experiment

In this section, we compare the proposed method with numerous anomaly detection methods. These anomaly detection methods cover statistical, representation-based, and deep learning-based approaches, including the following:
(1) Reed–Xiaoli algorithm (RX) [13]: This is a statistical-distance approach that assumes a Gaussian background, estimates its pdf via the mean vector and covariance matrix, and scores anomalies by the Mahalanobis deviation of test pixels from this model.
(2) Collaborative representation detection (CRD) [19]: This is a sparse representation method that builds separate dictionaries for targets and background, expresses targets as sparse linear combinations of atoms, and separates the two components by coefficient optimization.
(3) Robust graph autoencoder (RGAE) [29]: This is a graph-regularized autoencoder framework that preserves sample geometry during training and retains robustness against noise and anomalies.
(4) Autonomous hyperspectral anomaly detection network (Auto-AD) [28]: This is a reconstruction suppression network employing a skip-connected fully convolutional autoencoder and an adaptively weighted loss to down-weight anomalous pixels during image recovery.
(5) Convolutional transformer-inspired autoencoder (CTA) [31]: This is a self-supervised clustering–reconstruction hybrid which uses K-means to generate pseudo-background/anomaly samples and a convolutional transformer autoencoder to capture local–global features for improved separability.
(6) Gated transformer for hyperspectral anomaly detection (GT-HAD) [32]: This is a dual-branch gating network that enforces spatial–spectral similarity during reconstruction, models background and anomaly features separately, and applies a Content Matching Method to activate each branch adaptively.
(7) Transformer-based autoencoder framework (TAEF) [33]: This is a high-order mixing network that embeds an Extended Multilinear Mixture Model in the decoder while using a transformer encoder to reconstruct background pixels and characterize nonlinear spectral interactions.
(8) Deep feature aggregation network for hyperspectral anomaly detection (DFAN-HAD) [34]: This involves deep learning multibackground disentanglement that fuses orthogonal spectral attention with background anomaly category statistics to model diverse background patterns while intrasimilarity interdifference multiple aggregation separation loss amplifies the background and suppresses anomaly representation for prior-free HAD.
(9) Multiscale network (MSNet) [35]: This involves self-supervised multiscale-separation reconstruction, employs an untrained multiscale convolutional encoder–decoder to recover background under few-shot conditions, and introduces a soft separator to suppress anomalies, boosting HAD robustness in unseen scenes.
(10) PUNNet [36]: This is a self-supervised blind-spot reconstruction network with patch–shuffle downsampling and dilated NAFNet blocks, enlarging the receptive field while masking the center pixel to amplify the anomaly reconstruction error for prior-free HAD.
Table 7 shows the parameter settings for each method. Figure 10 shows the detection results of each method on various data.
RX and CRD are representative classic non-deep-learning methods, but their detection performance is poor. On the self-test data, they assign high anomaly scores to a few target pixels but also to large portions of the background, resulting in high false-alarm rates and difficulty distinguishing target from background. On the public data they are more effective: RX performs better on Gulfport, Texas Coast, and San Diego-I, where it can fully recognize the targets. CRD's recognition of targets is weaker; on San Diego-II it can roughly recognize the targets but does not suppress the background sufficiently.
On the self-test data, deep learning-based methods can effectively suppress the background. MSNet can vaguely detect target edges on the Cement Street dataset but has no detection effect on the other datasets. The detection effect of CTA is similar to that of MSNet: it can detect the target contour but cannot fully highlight the target, or fails to detect it at all. The detection performance of RGAE, Auto-AD, GT-HAD, TAEF, DFAN-HAD, and PUNNet is broadly comparable, although target boundaries in the TAEF results show some blurring. On the Jungle-I dataset, only our proposed CTDST-HAD fully detected the target; the other methods only indicated the location of anomalous targets without recovering their complete contours. This is because the Jungle-I background is complex and near-ground hyperspectral images exhibit strong spectral uncertainty, with the phenomenon of "same material, different spectra; different materials, same spectrum" being more pronounced. On Jungle-II, our method again performed best, while the other methods had very low target detection rates. On the public datasets, TAEF shows poor detection performance and generates many false alarms; it only recognizes targets in Texas Coast and cannot distinguish targets from background in the other data. RGAE, Auto-AD, and CTA detect targets to some extent on all public data, but their detection rates need improvement. GT-HAD and DFAN-HAD achieve higher recognition rates, but their false alarms are non-negligible, which degrades overall detection performance. MSNet, PUNNet, and CTDST-HAD suppress the background well, but MSNet and PUNNet have low target recognition rates on Gulfport and San Diego-I, while CTDST-HAD detects well on all data.
Figure 11 shows the detection performance comparison of representative methods on various datasets, both on a double logarithmic scale and from a 3D perspective. This visualization simultaneously magnifies the low-false-alarm and low-detection-rate regions, allowing for a detailed examination of performance nuances. The 3D-ROC curves further provide a comprehensive view of the performance surface across detection rates, false-alarm rates, and thresholds, revealing the stability and overall behavior of each method. The results are consistent with the subjective observations from Figure 10. Table 8, Table 9 and Table 10 present the AUC values from different perspectives for the detection results of each method on various datasets. Specifically, RX and CRD exhibit poor detection performance across all self-test datasets, which is corroborated by their low average AUC values of 0.2174 and 0.2625, respectively, in Table 9. While CTA demonstrates precise background suppression, evidenced by its extremely low average AUC for false alarms of only 0.0065 in Table 10, it struggles with incomplete target detection, leading to a modest average detection AUC of 0.2489. Our proposed CTDST-HAD achieves the overall best performance across nearly all datasets in this comprehensive view. A notable observation is that its performance on the Texas Coast dataset, while still strong with a detection AUC of 0.4130, is marginally surpassed by that of MSNet, which achieves 0.5484, as evident in the magnified detail provided by this plot.
The backgrounds of Holly and Cement Street are relatively simple, and most methods can perform well, while Jungle-I and Jungle-II include close-range jungle backgrounds that are complex, which leads to a decrease in detection performance. For instance, in the complex Jungle-II scene, the best-performing traditional method CRD only reached a detection AUC of 0.2797, whereas CTDST-HAD achieved 0.3509. Our proposed method has the highest average overall AUC of 0.9884 in Table 8 and the highest average detection AUC of 0.4667 in Table 9. In the Holly dataset, the detection AUC value of PUN-Net is 0.3031, which is higher than that of many other methods but still lower than CTDST-HAD’s 0.4902, a result that may be due to PUN-Net’s strong background suppression ability reflected in its very low false-alarm AUC of 0.0083. TAEF and GT-HAD performed well in the Texas Coast with detection AUCs of 0.5359 and 0.2149, respectively, but had almost no detection effect in other data, with average detection AUCs as low as 0.3127 and 0.1526. Although CTDST-HAD may not perform the best on every single data point, its AUC values are consistently among the highest and show stable performance across different datasets, making it the overall best. The average overall AUC value of CTDST-HAD reached 0.9884, significantly outperforming the traditional method average of only 0.9244 and the highest other deep learning method average of 0.9619, demonstrating its superior and robust detection capability.
To provide a rigorous statistical evaluation of the comparative results, we conducted the Friedman test, a non-parametric statistical method suitable for comparing multiple algorithms across different datasets. The statistical analysis reveals that the eleven compared methods exhibit statistically significant differences in their detection performance (χ2(10) = 23.82, p = 0.0081). This confirms that the observed performance variations in AUC values are not due to random fluctuations but represent genuine differences in algorithmic efficacy. Further statistical analysis shows that the proposed CTDST-HAD method consistently outperforms the comparison methods across all nine hyperspectral datasets. The statistical significance (p < 0.01) indicates that this performance advantage is unlikely to occur by chance, confirming the robustness and effectiveness of our approach. Moreover, the minimal performance variance of CTDST-HAD across different datasets suggests that its superiority is not dependent on specific data.
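The Friedman statistic reported above can be computed from a dataset-by-method score matrix. This is a minimal NumPy sketch (ties are ignored for simplicity; scipy.stats.friedmanchisquare provides a complete implementation with tie handling and the p-value):

```python
import numpy as np

def friedman_statistic(scores):
    """Friedman chi-square statistic for an (n_datasets, k_methods) score
    matrix: methods are ranked within each dataset, and the statistic
    measures how far the mean ranks deviate from the no-difference
    expectation (k + 1) / 2. Ties are not averaged in this sketch."""
    n, k = scores.shape
    ranks = scores.argsort(axis=1).argsort(axis=1) + 1  # rank 1 = lowest score
    mean_ranks = ranks.mean(axis=0)
    return 12.0 * n / (k * (k + 1)) * np.sum((mean_ranks - (k + 1) / 2) ** 2)
```

With eleven methods over nine datasets, the statistic is referred to a χ² distribution with k − 1 = 10 degrees of freedom, matching the χ²(10) figure reported above.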

4.4. Running Time and Model Complexity

This section discusses the real-time performance of the proposed model and the comparison methods. Table 11 shows the running time of each method on each dataset. The running time of RGAE is very long, which may be due to its implementation on the Matlab platform; the other deep learning models were implemented in PyTorch. On the self-test data, our proposed CTDST-HAD is second only to RX and GT-HAD in real-time performance. On the public data, RX, GT-HAD, and CTA perform best. However, these deep learning comparison models are all reconstruction-based: detecting each hyperspectral image comprises two parts, training and inference, which implicitly increases detection time. CTDST-HAD uses transfer learning based on spectral differences to achieve anomaly detection, so detecting a new hyperspectral image does not require retraining.
Table 12 shows the complexity of the deep learning-based models, using San Diego-I to calculate the floating-point operations (FLOPs) of each model. The proposed CTDST-HAD is built on a contrastive- and transfer-learning framework whose spatial and spectral encoders both use vision transformers, so the model is relatively complex. However, its FLOPs are moderate despite the large parameter count, yielding fast inference.

5. Conclusions

This paper has introduced CTDST-HAD, a self-supervised framework that synergizes contrastive learning and transfer learning within a dual-stream transformer architecture for hyperspectral anomaly detection. The spatial stream acquires robust spatial priors from both general and hyperspectral-specific imagery, while the spectral stream learns discriminative spectral features through physics-informed data augmentation. The incorporation of adaptive EWC effectively preserves cross-domain knowledge during transfer, and the R-L-enhanced VAE ensures physically consistent spectral variations. The use of focal loss further enhances model robustness under severe class imbalance. Ablation studies validated the contribution of each component, showing that FSSF, adaptive EWC, R-L-enhanced VAE, and focal loss all significantly improve performance. Comparative experiments against a range of statistical, representation-based, and deep learning methods confirmed that CTDST-HAD achieves superior AUC scores across diverse scenes, especially in challenging near-ground jungle environments where other methods struggle. Additionally, the framework offers efficient inference by eliminating per-image retraining, making it suitable for practical applications. The limitations of CTDST-HAD lie in its complex framework and large number of parameters. Further work should focus on lightweighting the model and simplifying the overall framework.

Author Contributions

Conceptualization, L.D. and B.Z.; methodology, L.D. and J.Y.; software, J.Y.; validation, L.D., Y.C. and Q.W.; formal analysis, Y.C.; investigation, L.D. and Q.W.; resources, B.Z.; data curation, L.D. and Y.C.; writing—original draft preparation, L.D.; writing—review and editing, L.D.; visualization, Q.W.; supervision, B.Z.; project administration, B.Z. and J.Y.; funding acquisition, B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The source code and self-test data can be found at https://github.com/aosilu/CTDST-HAD.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, P.; Bu, Y.; Zhao, Y.; Kong, S.G. Progressive self-supervised framework for anomaly detection in hyperspectral images. Eng. Appl. Artif. Intell. 2025, 156, 111151. [Google Scholar] [CrossRef]
  2. Li, X.; Shang, W. Hyperspectral anomaly detection based on spectral similarity variability feature. Sensors 2024, 24, 5664. [Google Scholar] [CrossRef] [PubMed]
  3. Zhao, D.; Zhang, H.; Huang, K.; Zhu, X.; Arun, P.V.; Jiang, W.; Li, S.; Pei, X.; Zhou, H. SASU-Net: Hyperspectral video tracker based on spectral adaptive aggregation weighting and scale updating. Expert Syst. Appl. 2025, 272. [Google Scholar] [CrossRef]
  4. Jiang, W.; Zhao, D.; Wang, C.; Yu, X.; Arun, P.V.; Asano, Y.; Xiang, P.; Zhou, H. Hyperspectral video object tracking with cross-modal spectral complementary and memory prompt network. Knowl.-Based Syst. 2025, 330, 114595. [Google Scholar] [CrossRef]
  5. Zhao, D.; Wang, M.; Huang, K.; Zhong, W.; Arun, P.V.; Li, Y.; Asano, Y.; Wu, L.; Zhou, H. OCSCNet-Tracker: Hyperspectral video tracker based on octave convolution and spatial-spectral capsule network. Remote. Sens. 2025, 17, 693. [Google Scholar] [CrossRef]
  6. Zhao, X.; Liu, K.; Wang, X.; Zhao, S.; Gao, K.; Lin, H.; Zong, Y.; Li, W. Tensor adaptive reconstruction cascaded with global and local feature fusion for hyperspectral target detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2025, 18, 607–620. [Google Scholar] [CrossRef]
  7. Gan, Y.; Li, X.; Wu, S.; Wang, M. MACNet: A multiscale attention-guided contextual network for hyperspectral anomaly detection. IEEE Geosci. Remote Sens. Lett. 2025, 22, 5508905. [Google Scholar] [CrossRef]
  8. Xu, M.; Zhang, J.; Liu, S.; Sheng, H. Hyperspectral anomaly detection based on adaptive background dictionary construction and collaborative representation. Int. J. Remote. Sens. 2024, 45, 3349–3369. [Google Scholar] [CrossRef]
  9. Zhao, D.; Yan, W.; You, M.; Zhang, J.; Arun, P.V.; Jiao, C.; Wang, Q.; Zhou, H. Hyperspectral anomaly detection based on empirical mode decomposition and local weighted contrast. IEEE Sens. J. 2024, 24, 33847–33861. [Google Scholar] [CrossRef]
  10. He, K.; Jiang, Z.; Liu, B.; Zhang, X. Fast hyperspectral image anomaly detection based on orthogonal projection. Laser Optoelectron. Prog. 2024, 61, 233–240. [Google Scholar] [CrossRef]
  11. Zhu, X.; Zhang, H.; Hu, B.; Huang, K.; Arun, P.V.; Jia, X.; Zhao, D.; Wang, Q.; Zhou, H.; Yang, S. DSP-Net: A dynamic spectral-spatial joint perception network for hyperspectral target tracking. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5510905. [Google Scholar] [CrossRef]
  12. Zhao, D.; Zhong, W.; Ge, M.; Jiang, W.; Zhu, X.; Arun, P.V.; Zhou, H. SiamBSI: Hyperspectral video tracker based on band correlation grouping and spatial-spectral information interaction. Infrared Phys. Technol. 2025, 151, 106063. [Google Scholar] [CrossRef]
  13. Fu, X.; Zhang, T.; Cheng, J.; Jia, S. MMR-HAD: Multiscale Mamba reconstruction network for hyperspectral anomaly detection. IEEE Trans. Geosci. Remote. Sens. 2025, 63, 5516914. [Google Scholar] [CrossRef]
  14. Reed, I.; Yu, X. Adaptive multiple-band CFAR detection of an optical pattern with unknown spectral distribution. IEEE Trans. Acoust. Speech Signal Process. 1990, 38, 1760–1770. [Google Scholar] [CrossRef]
  15. Kwon, H.; Der, S.Z.; Nasrabadi, N.M. Adaptive anomaly detection using subspace separation for hyperspectral imagery. Opt. Eng. 2003, 1, 3342–3351. [Google Scholar] [CrossRef]
  16. Kwon, H.; Nasrabadi, N. Kernel RX-algorithm: A nonlinear anomaly detector for hyperspectral imagery. IEEE Trans. Geosci. Remote. Sens. 2005, 43, 388–397. [Google Scholar] [CrossRef]
  17. Banerjee, A.; Burlina, P.; Meth, R. Fast hyperspectral anomaly detection via SVDD. In Proceedings of the 2007 IEEE International Conference on Image Processing, San Antonio, TX, USA, 16 September–19 October 2007; IEEE: New York, NY, USA, 2007; pp. 101–104. [Google Scholar] [CrossRef]
  18. Tu, B.; Yang, X.; Li, N.; Zhou, C.; He, D. Hyperspectral anomaly detection via density peak clustering. Pattern Recognit. Lett. 2020, 129, 144–149. [Google Scholar] [CrossRef]
  19. Ling, Q.; Guo, Y.; Lin, Z.; An, W. A Constrained Sparse Representation Model for Hyperspectral Anomaly Detection. IEEE Trans. Geosci. Remote. Sens. 2019, 57, 2358–2371. [Google Scholar] [CrossRef]
  20. Li, W.; Du, Q. Collaborative Representation for Hyperspectral Anomaly Detection. IEEE Trans. Geosci. Remote. Sens. 2015, 53, 1463–1474. [Google Scholar] [CrossRef]
  21. Yang, Y.; Zhang, J.; Liu, D.; Wu, X. Low-rank and sparse matrix decomposition with background position estimation for hyperspectral anomaly detection. Infrared Phys. Technol. 2019, 96, 213–227. [Google Scholar] [CrossRef]
  22. Chang, C.-I.; Cao, H.; Song, M. Orthogonal subspace projection target detector for hyperspectral anomaly detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2021, 14, 4915–4932. [Google Scholar] [CrossRef]
  23. Yang, Y.; Zhang, J.; Song, S.; Zhang, C.; Liu, D. Low-rank and sparse matrix decomposition with orthogonal subspace projection-based background suppression for hyperspectral anomaly detection. IEEE Geosci. Remote. Sens. Lett. 2020, 17, 1378–1382. [Google Scholar] [CrossRef]
  24. Kang, X.; Zhang, X.; Li, S.; Li, K.; Li, J.; Benediktsson, J.A. Hyperspectral Anomaly Detection With Attribute and Edge-Preserving Filters. IEEE Trans. Geosci. Remote. Sens. 2017, 55, 5600–5611. [Google Scholar] [CrossRef]
  25. Hu, X.; Xie, C.; Fan, Z.; Duan, Q.; Zhang, D.; Jiang, L.; Wei, X.; Hong, D.; Li, G.; Zeng, X.; et al. Hyperspectral anomaly detection using deep learning: A review. Remote. Sens. 2022, 14, 1973. [Google Scholar] [CrossRef]
  26. Bati, E.; Çalışkan, A.; Koz, A.; Alatan, A.A. Hyperspectral anomaly detection method based on auto-encoder. Proc. SPIE 2015, 9643, 96430S. [Google Scholar] [CrossRef]
  27. Xie, W.; Lei, J.; Liu, B.; Li, Y.; Jia, X. Spectral constraint adversarial autoencoders approach to feature representation in hyperspectral anomaly detection. Neural Netw. 2019, 119, 222–234. [Google Scholar] [CrossRef]
  28. Ojha, N.; Sinha, I.K.; Singh, K.P. VAE-AD: Unsupervised variational autoencoder for anomaly detection in hyperspectral images. In Neural Information Process Communications in Computer and Information Science; Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A., Eds.; Springer: Singapore, 2023; Volume 1794, pp. 127–139. [Google Scholar] [CrossRef]
  29. Wang, S.; Wang, X.; Zhang, L.; Zhong, Y. Auto-AD: Autonomous hyperspectral anomaly detection network based on fully convolutional autoencoder. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 5503314. [Google Scholar] [CrossRef]
  30. Fan, G.; Ma, Y.; Mei, X.; Fan, F.; Huang, J.; Ma, J. Hyperspectral Anomaly Detection With Robust Graph Autoencoders. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
  31. Liu, H.; Su, X.; Shen, X.; Zhou, C.; Chen, X.; Zhou, X. BiGSeT: Binary mask-guided separation training for DNN-based hyperspectral anomaly detection. arXiv 2023. [Google Scholar] [CrossRef]
  32. He, Z.; He, D.; Xiao, M.; Lou, A.; Lai, G. Convolutional Transformer-Inspired Autoencoder for Hyperspectral Anomaly Detection. IEEE Geosci. Remote. Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  33. Lian, J.; Wang, L.; Sun, H.; Huang, H. GT-HAD: Gated Transformer for Hyperspectral Anomaly Detection. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 3631–3645. [Google Scholar] [CrossRef] [PubMed]
  34. Wu, Z.; Wang, B. Transformer-Based Autoencoder Framework for Nonlinear Hyperspectral Anomaly Detection. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  35. Cheng, X.; Huo, Y.; Lin, S.; Dong, Y.; Zhao, S.; Zhang, M.; Wang, H. Deep feature aggregation network for hyperspectral anomaly detection. IEEE Trans. Instrum. Meas. 2024, 73, 5033016. [Google Scholar] [CrossRef]
  36. Liu, H.; Su, X.; Shen, X.; Zhou, X. MSNet: Self-supervised multiscale network with enhanced separation training for hyperspectral anomaly detection. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 5520313. [Google Scholar] [CrossRef]
  37. Wang, D.; Zhuang, L.; Gao, L.; Sun, X.; Zhao, X. Global feature-injected blind-spot network for hyperspectral anomaly detection. IEEE Geosci. Remote. Sens. Lett. 2024, 21, 5509305. [Google Scholar] [CrossRef]
  38. Fu, X.; Jia, S.; Zhuang, L.; Xu, M.; Zhou, J.; Li, Q. Hyperspectral anomaly detection via deep plug-and-play denoising CNN regularization. IEEE Trans. Geosci. Remote. Sens. 2021, 59, 9553–9568. [Google Scholar] [CrossRef]
  39. Lin, S.; Zhang, M.; Cheng, X.; Shi, L.; Gamba, P.; Wang, H. Dynamic low-rank and sparse priors constrained deep autoencoders for hyperspectral anomaly detection. IEEE Trans. Instrum. Meas. 2024, 73, 2500518. [Google Scholar] [CrossRef]
  40. Xiang, P.; Ali, S.; Zhang, J.; Jung, S.K.; Zhou, H. Pixel-associated autoencoder for hyperspectral anomaly detection. Int. J. Appl. Earth Obs. Geoinf. 2024, 129, 103816. [Google Scholar] [CrossRef]
  41. Chen, W.; Zhi, X.; Jiang, S.; Huang, Y.; Han, Q.; Zhang, W. DWSDiff: Dual-window spectral diffusion for hyperspectral anomaly detection. IEEE Trans. Geosci. Remote. Sens. 2025, 63, 5504617. [Google Scholar] [CrossRef]
  42. He, X.; An, W.; Wang, Y.; Ling, Q.; Li, M.; Lin, Z.; Zhou, S. CWIMamba: Cross-scale windowed integration state space model for hyperspectral anomaly detection. IEEE Trans. Geosci. Remote. Sens. 2025, 63, 5527820. [Google Scholar] [CrossRef]
  43. Zhao, D.; Hu, B.; Jiang, W.; Zhong, W.; Arun, P.V.; Cheng, K.; Zhao, Z.; Zhou, H. Hyperspectral video tracker based on spectral difference matching reduction and deep spectral target perception features. Opt. Lasers Eng. 2025, 194, 109124. [Google Scholar] [CrossRef]
  44. Zhao, D.; Zhang, H.; Arun, P.V.; Jiao, C.; Zhou, H.; Xiang, P.; Cheng, K. SiamSTU: Hyperspectral video tracker based on spectral spatial angle mapping enhancement and state aware template update. Infrared Phys. Technol. 2025, 150, 105919. [Google Scholar] [CrossRef]
  45. Li, L.; Wu, Z.; Wang, B. Hyperspectral anomaly detection via merging total variation into low-rank representation. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2024, 17, 14894–14907. [Google Scholar] [CrossRef]
  46. Xiang, P.; Zhang, J.; Qi, S.; Jung, S.K.; Zhou, H.; Zhao, D. Hyperspectral anomaly detection using Taylor expansion and weighted irregular block filter. Infrared Phys. Technol. 2025, 150, 105942. [Google Scholar] [CrossRef]
  47. Wu, Y.; Li, Z.; Zhao, B.; Song, Y.; Zhang, B. Transfer learning of spatial features from high-resolution RGB images for large-scale and robust hyperspectral remote sensing target detection. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 5505732. [Google Scholar] [CrossRef]
  48. Zou, Q.; Zhou, J.; Ma, Y.; Luo, M. Random projection-based sub-pixel target detection for hyperspectral image with t-distribution background. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 5538218. [Google Scholar] [CrossRef]
  49. Sun, X.; Zhuang, L.; Gao, L.; Gao, H.; Sun, X.; Zhang, B. A parameter-free topological disassembly-guided method for hyperspectral target detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2025, 18, 17875–17888. [Google Scholar] [CrossRef]
  50. Li, C.; Wang, R.; Chen, Z.; Gao, H.; Xu, S. Transformer-inspired stacked-GAN for hyperspectral target detection. Int. J. Remote. Sens. 2024, 45, 4961–4982. [Google Scholar] [CrossRef]
  51. Jiang, W.; Zhong, W.; Arun, P.V.; Xiang, P.; Zhao, D. SRTE-Net: Spectral-spatial similarity reduction and reorganized texture encoding for hyperspectral video tracking. IEEE Signal Process. Lett. 2025, 32, 3390–3394. [Google Scholar] [CrossRef]
  52. Liu, Y.; Jiang, K.; Xie, W.; Zhang, J.; Li, Y.; Fang, L. Hyperspectral anomaly detection with self-supervised anomaly prior. Neural. Netw. 2025, 187, 107294. [Google Scholar] [CrossRef]
  53. Zhao, D.; Zhou, L.; Li, Y.; He, W.; Arun, P.V.; Zhu, X.; Hu, J. Visibility estimation via near-infrared bispectral real-time imaging in bad weather. Infrared Phys. Technol. 2024, 136, 105008. [Google Scholar] [CrossRef]
  54. Wang, D.; Ren, L.; Sun, X.; Gao, L.; Chanussot, J. Nonlocal and local feature-coupled self-supervised network for hyperspectral anomaly detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2025, 18, 6981–6993. [Google Scholar] [CrossRef]
  55. Zhao, D.; Zhu, X.; Zhang, Z.; Arun, P.V.; Cao, J.; Wang, Q.; Zhou, H.; Jiang, H.; Hu, J.; Qian, K. Hyperspectral video target tracking based on pixel-wise spectral matching reduction and deep spectral cascading texture features. Signal Process. 2023, 209, 109033. [Google Scholar] [CrossRef]
  56. Sun, X.; Zhang, Y.; Dong, Y.; Du, B. Contrastive self-supervised learning-based background reconstruction for hyperspectral anomaly detection. IEEE Trans. Geosci. Remote. Sens. 2025, 63, 5504312. [Google Scholar] [CrossRef]
  57. Wang, Z.; Ma, D.; Yue, G.; Li, B.; Cong, R.; Wu, Z. Self-supervised hyperspectral anomaly detection based on finite spatialwise attention. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 5502918. [Google Scholar] [CrossRef]
  58. Li, M.; Fu, Y.; Zhang, T.; Wen, G. Supervise-assisted self-supervised deep-learning method for hyperspectral image restoration. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 7331–7344. [Google Scholar] [CrossRef]
  59. Gao, L.; Wang, D.; Zhuang, L.; Sun, X.; Huang, M.; Plaza, A. BS3LNet: A new blind-spot self-supervised learning network for hyperspectral anomaly detection. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 5504218. [Google Scholar] [CrossRef]
  60. Hu, M.; Wu, C.; Zhang, L. HyperNet: Self-supervised hyperspectral spatial-spectral feature understanding network for hyperspectral change detection. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 5543017. [Google Scholar] [CrossRef]
  61. Li, W.; Wu, G.; Du, Q. Transferred Deep Learning for Anomaly Detection in Hyperspectral Imagery. IEEE Geosci. Remote. Sens. Lett. 2017, 14, 597–601. [Google Scholar] [CrossRef]
  62. Guan, P.; Lam, E.Y. Progressive self-supervised pretraining for hyperspectral image classification. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 5517713. [Google Scholar] [CrossRef]
  63. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017. [Google Scholar] [CrossRef]
  64. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.H.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020. [Google Scholar] [CrossRef]
  65. Zhuang, H.; Yan, Y.; He, R.; Zeng, Z. Class incremental learning with analytic learning for hyperspectral image classification. J. Frankl. Inst. 2024, 361, 107285. [Google Scholar] [CrossRef]
Figure 1. Overall framework diagram.
Figure 2. Single sample acquisition from spectral dataset.
Figure 3. VAE structure embedded with physical-core-driven model.
Figure 5. Detection performance of four strategies on each dataset. (a) Cement Street. (b) Holly. (c) Jungle-I. (d) Jungle-II.
Figure 6. The impact of the basic penalty coefficient with a decay rate of 5 on adaptive EWC.
Figure 7. The impact of the attenuation rate on adaptive EWC when the basic penalty coefficient λ0 is set.
Figure 8. The influence of γ on the model when α in the focal loss function is 0.25.
Figure 9. The influence of α on the model when γ in the focal loss function is 2.
Figure 10. The detection results of each method. (a) Cement Street. (b) Holly. (c) Jungle-I. (d) Jungle-II. (e) Gulfport. (f) HYDICE. (g) Texas Coast. (h) San Diego-I. (i) San Diego-II. RX [13], CRD [19], RGAE [29], Auto-AD [28], CTA [31], GT-HAD [32], TAEF [33], DFAN-HAD [34], MSNet [35], PUNNet [36], CTDST-HAD, Ground truth.
Figure 11. Performance comparison under double logarithmic scale (left) and 3D-ROC curve (right). (a) Cement Street. (b) Holly. (c) Jungle-I. (d) Jungle-II. (e) Gulfport. (f) HYDICE. (g) Texas Coast. (h) San Diego-I. (i) San Diego-II.
Table 1. Parameter settings for spatial encoder and spectral encoder.
Spatial Encoder                    Spectral Encoder
image_size      224                spectral_band   89
patch_size      32                 window_size     40
feature_dim     128                feature_dim     128
depth           6                  depth           5
dropout         0.1                dropout         0.1
emb_dropout     0.1                emb_dropout     0.1
head_dim        64                 head_dim        16
heads           16                 heads           6
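As a sanity check on Table 1, the following sketch traces how the listed hyperparameters determine token counts and attention width. It assumes a standard ViT-style patch embedding for the spatial stream; the stride of the spectral windowing is not specified here, and the config names are illustrative rather than taken from the released code.

```python
# Hypothetical reconstruction of the Table 1 settings; names are illustrative.
spatial_cfg = dict(image_size=224, patch_size=32, feature_dim=128,
                   depth=6, heads=16, head_dim=64,
                   dropout=0.1, emb_dropout=0.1)
spectral_cfg = dict(spectral_band=89, window_size=40, feature_dim=128,
                    depth=5, heads=6, head_dim=16,
                    dropout=0.1, emb_dropout=0.1)

# A standard ViT splits a 224 x 224 input into (224 / 32)^2 = 49 patch tokens.
num_patches = (spatial_cfg["image_size"] // spatial_cfg["patch_size"]) ** 2

# Multi-head attention operates at an inner width of heads * head_dim.
spatial_attn_width = spatial_cfg["heads"] * spatial_cfg["head_dim"]     # 1024
spectral_attn_width = spectral_cfg["heads"] * spectral_cfg["head_dim"]  # 96
```

Note that both streams project back to the same feature_dim of 128, which is what allows their outputs to be fused during fine-tuning.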
Table 2. Parameter settings for each layer of R-L-enhanced VAE.
Layer                                      Dimensions
Encoder               FC + ReLU            [89, 128]
                      FC + ReLU            [128, 64]
                      FC + ReLU            [64, 32]
                      FC_mean, FC_var      [32, 3], [32, 3]
                      Reparametrize        [6, 3]
R-L-Enhanced Decoder  z ⊕ K                [3, 2]
                      K ⊕ f                [2, 89]
Note: ⊕ is the operation based on Equation (10).
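The layer dimensions in Table 2 can be traced with a shape-only forward pass. This is a sketch with random weights; it assumes the decoder maps the 3-dimensional latent to 2 kernel coefficients and then expands them over 89 bands through a fixed R-L kernel basis (stand-in matrix F_rl below), whereas the actual ⊕ combination follows Equation (10) in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda a: np.maximum(a, 0.0)

def linear(d_in, d_out):
    # randomly initialized weight matrix; shapes are what matter here
    return rng.standard_normal((d_in, d_out)) * 0.1

W1, W2, W3 = linear(89, 128), linear(128, 64), linear(64, 32)
W_mu, W_var = linear(32, 3), linear(32, 3)
W_k = linear(3, 2)                    # latent -> 2 kernel coefficients (assumed)
F_rl = rng.standard_normal((2, 89))   # stand-in for the fixed R-L kernel basis

def forward(x):
    h = relu(relu(relu(x @ W1) @ W2) @ W3)   # 89 -> 128 -> 64 -> 32
    mu, logvar = h @ W_mu, h @ W_var         # 32 -> 3 each
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)  # reparameterize
    k = z @ W_k                              # 3 -> 2
    return k @ F_rl                          # 2 -> 89 reconstructed spectrum

x = rng.standard_normal((4, 89))
recon = forward(x)                           # shape (4, 89)
```

Constraining the reconstruction to a 2-coefficient kernel expansion is what keeps the augmented spectra physically plausible rather than free-form.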
Table 3. Details of sensor technology.
Parameter                                                         Value
Dimension                                                         20.3 cm (W) × 14.0 cm (H) × 41.7 cm (D)
Wavelength range                                                  450 nm to 800 nm
Bandwidth (typical)                                               1.5 nm (at 450 nm), 3.5 nm (at 800 nm)
Accuracy                                                          ±1 nm (estimated temperature variance ±5 °C)
Repeatability                                                     ±0.5 nm
Out-of-band rejection                                             1:10⁻³
Numerical aperture                                                0.05
Imaging (entrance aperture to CCD image plane)                    1:1 image
Switching speed                                                   <100 μs
Operating temperature range                                       10 °C to 35 °C
Camera type                                                       Electron-multiplying charge-coupled device
Camera cooling                                                    Internal thermoelectric cooler
Minimum camera cooling temperature (air cooling at 25 °C ambient) −60 °C
Camera controller card                                            PC plug-in card
Camera cooler power input                                         7.5 VDC
AC input (supplied cooler power module)                           100 VAC to 240 VAC, 50–60 Hz
Typical case operating temperature rise (above ambient)           5 °C
Table 4. Comparison between adaptive EWC, standard EWC, and no constraints.
                 Cement Street   Holly    Jungle-I   Jungle-II
Adaptive EWC     0.9832          0.9990   0.9835     0.9520
EWC              0.9612          0.9533   0.9701     0.9299
No constraints   0.9574          0.9424   0.9669     0.9208
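A minimal sketch of the quadratic EWC regularizer compared in Table 4, with an assumed exponentially decaying penalty schedule built from the basic coefficient λ0 = 1 and decay rate τ = 5 reported in Table 7; the exact adaptation rule follows the paper's method section, so the schedule here is illustrative only.

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam):
    # Standard EWC: Fisher-weighted quadratic pull toward the source-task optimum.
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)

def adaptive_lambda(step, lam0=1.0, tau=5.0):
    # Assumed schedule: relax the constraint as fine-tuning progresses.
    return lam0 * np.exp(-step / tau)

theta_star = np.zeros(4)                   # parameters after source-domain pre-training
fisher = np.array([1.0, 0.5, 0.1, 0.0])    # toy per-parameter importance estimates
theta = np.array([0.2, 0.2, 0.2, 0.2])     # parameters during fine-tuning

p0 = ewc_penalty(theta, theta_star, fisher, adaptive_lambda(0))
p10 = ewc_penalty(theta, theta_star, fisher, adaptive_lambda(10))  # weaker pull later
```

Parameters with zero Fisher importance are free to move, which is how EWC retains source knowledge without freezing the whole network.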
Table 5. The impact of some conventional spectral-enhancement operations and R-L-enhanced VAE on model performance.
                           Cement Street   Holly    Jungle-I   Jungle-II   Average
Random channel masking     0.9632          0.9797   0.9468     0.9296      0.9548
Spectral Gaussian noise    0.9744          0.9832   0.9587     0.9398      0.9640
Random intensity scaling   0.9623          0.9713   0.9479     0.9299      0.9529
Random wavelength shift    0.9583          0.9788   0.9435     0.9198      0.9501
R-L-enhanced VAE           0.9832          0.9990   0.9835     0.9520      0.9794
Table 6. Performance of the model when using focal loss and BCE loss.
             Cement Street   Holly    Jungle-I   Jungle-II   Average
Focal loss   0.9832          0.9990   0.9835     0.9520      0.9794
BCE loss     0.9711          0.9981   0.9633     0.9213      0.9635
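The comparison in Table 6 uses focal loss with γ = 2 and α = 0.25 (Table 7). A minimal numpy sketch of binary focal loss, which reduces to α-weighted BCE at γ = 0 and increasingly down-weights easy, well-classified background pixels as γ grows:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: y = 1 marks anomaly pixels, p is the predicted probability."""
    p = np.clip(p, eps, 1.0 - eps)
    pos = -alpha * (1.0 - p) ** gamma * np.log(p)          # anomaly term
    neg = -(1.0 - alpha) * p ** gamma * np.log(1.0 - p)    # background term
    return float(np.mean(np.where(y == 1, pos, neg)))

y = np.array([1, 1, 0, 0])
easy = focal_loss(np.array([0.9, 0.9, 0.1, 0.1]), y)   # confident predictions
hard = focal_loss(np.array([0.6, 0.6, 0.4, 0.4]), y)   # uncertain predictions
# hard examples dominate the loss; easy pixels contribute little
```

This is why focal loss suits HAD: the overwhelming majority of pixels are easy background, and the (1 − p)^γ factor keeps them from swamping the gradient of the rare anomaly class.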
Table 7. Parameter settings for each method.
Method          Parameter Settings                                          Source
RX [13]         —                                                           —
CRD [19]        Outer window size: 11; inner window size: 9;                Original paper
                regularization coefficient: 0.1
RGAE [29]       Regularization parameter: 0.01; number of superpixels:      Author's open code
                150; number of hidden layer nodes: 20
Auto-AD [28]    Number of channels: 5; loss change threshold: 1.5 × 10⁻⁵    Author's open code
CTA [31]        Learning rate: 1 × 10⁻⁴; total training epochs: 100         Author's open code
GT-HAD [32]     Embedding dimension: 64; patch size: 3                      Author's open code
TAEF [33]       Low-pass filter bandwidth: 7; total training epochs: 20;    Original paper
                learning rate: 5 × 10⁻³
DFAN-HAD [34]   Latent layer dimension: 64; learning rate: 1 × 10⁻⁴         Author's open code
MSNet [35]      Number of selected bands: 64; regularization coefficient    Author's open code
                in loss: 1 × 10⁻³
PUNNet [36]     PD stride factor: 2; number of NAFNet blocks: 4;            Author's open code
                dilation factor: 2; loss function: L1 loss
CTDST-HAD       Regularization: adaptive EWC (λ₀ = 1, τ = 5);               —
                loss function: focal loss (γ = 2, α = 0.25)
Table 8. AUC (PD,PF) of different methods for different data.
                RX      CRD     RGAE    Auto-AD  CTA     GT-HAD  TAEF    DFAN-HAD  MSNet   PUNNet  CTDST-HAD
                [13]    [19]    [29]    [28]     [31]    [32]    [33]    [34]      [35]    [36]
Cement Street   0.7505  0.7693  0.9893  0.9887   0.9506  0.9895  0.9845  0.8574    0.5734  0.9889  0.9832
Holly           0.9978  0.9947  0.9993  0.9935   0.9454  0.9988  0.9980  0.9975    0.7289  0.9996  0.9990
Jungle-I        0.8693  0.8622  0.9072  0.8785   0.6553  0.9104  0.9015  0.9640    0.7704  0.9194  0.9835
Jungle-II       0.9214  0.8753  0.9134  0.9006   0.7596  0.8824  0.9074  0.9141    0.7874  0.9103  0.9520
Gulfport        0.9521  0.9593  0.7484  0.9429   0.9890  0.6162  0.5724  0.9943    0.9875  0.9209  0.9983
HYDICE          0.9855  0.8970  0.7633  0.8922   0.9946  0.6224  0.6855  0.9934    0.9809  0.8982  0.9921
Texas Coast     0.9906  0.9948  0.9822  0.9887   0.9891  0.9353  0.9850  0.9904    0.9977  0.9856  0.9921
San Diego-I     0.9118  0.9532  0.9891  0.9949   0.9657  0.6012  0.3234  0.9583    0.9623  0.9833  0.9963
San Diego-II    0.9404  0.9467  0.9918  0.9789   0.9893  0.7813  0.7120  0.9877    0.9811  0.9760  0.9993
Average         0.9244  0.9169  0.9204  0.9510   0.9154  0.8153  0.7855  0.9619    0.8633  0.9536  0.9884
Table 9. AUC (PD,τ) of different methods for different data.
                RX      CRD     RGAE    Auto-AD  CTA     GT-HAD  TAEF    DFAN-HAD  MSNet   PUNNet  CTDST-HAD
                [13]    [19]    [29]    [28]     [31]    [32]    [33]    [34]      [35]    [36]
Cement Street   0.4759  0.5441  0.3643  0.3729   0.2272  0.3354  0.3446  0.2267    0.0685  0.3299  0.5279
Holly           0.3477  0.4156  0.3256  0.1801   0.0957  0.1995  0.4006  0.3619    0.0752  0.3031  0.4902
Jungle-I        0.1484  0.2773  0.2542  0.2419   0.0261  0.1097  0.3232  0.2099    0.0051  0.2506  0.3973
Jungle-II       0.1111  0.2797  0.0821  0.0836   0.0273  0.0497  0.1255  0.2358    0.0051  0.0313  0.3509
Gulfport        0.0727  0.0887  0.0988  0.2456   0.3333  0.0744  0.2242  0.5121    0.2328  0.1177  0.5374
HYDICE          0.2330  0.2975  0.1118  0.2686   0.2965  0.0426  0.3148  0.6304    0.1186  0.3033  0.6871
Texas Coast     0.3117  0.1293  0.3764  0.3372   0.4608  0.2149  0.5359  0.5484    0.3176  0.3682  0.4130
San Diego-I     0.0789  0.1153  0.2146  0.3821   0.0456  0.1125  0.1825  0.2261    0.0343  0.1367  0.4512
San Diego-II    0.1768  0.2151  0.1661  0.1281   0.2168  0.2350  0.3631  0.5167    0.0678  0.1184  0.3454
Average         0.2174  0.2625  0.2215  0.2489   0.1922  0.1526  0.3127  0.3853    0.1028  0.2177  0.4667
Table 10. AUC (PF,τ) of different methods for different data.
                RX      CRD     RGAE    Auto-AD  CTA     GT-HAD  TAEF    DFAN-HAD  MSNet   PUNNet  CTDST-HAD
                [13]    [19]    [29]    [28]     [31]    [32]    [33]    [34]      [35]    [36]
Cement Street   0.3661  0.4532  0.0152  0.0166   0.0065  0.0169  0.0765  0.1217    0.0514  0.0371  0.1489
Holly           0.0472  0.1684  0.0101  0.0065   0.0051  0.0059  0.0280  0.0855    0.0051  0.0083  0.0622
Jungle-I        0.0531  0.1985  0.0113  0.0585   0.0051  0.0265  0.0464  0.0945    0.0051  0.0099  0.0712
Jungle-II       0.0429  0.2016  0.0075  0.0099   0.0051  0.0190  0.0381  0.1133    0.0051  0.0057  0.0794
Gulfport        0.0247  0.0401  0.0761  0.0409   0.0324  0.0683  0.2101  0.1036    0.0096  0.0383  0.0564
HYDICE          0.0351  0.0270  0.0611  0.0299   0.0055  0.0462  0.2197  0.1262    0.0069  0.0402  0.0692
Texas Coast     0.0554  0.0118  0.0197  0.0101   0.0416  0.0277  0.0719  0.1512    0.0052  0.0159  0.0393
San Diego-I     0.0405  0.0584  0.0971  0.0145   0.0144  0.0909  0.3145  0.0646    0.0069  0.0194  0.0409
San Diego-II    0.0589  0.0808  0.0676  0.0093   0.0120  0.1195  0.2053  0.0878    0.0070  0.0097  0.0141
Average         0.0247  0.0118  0.0075  0.0065   0.0051  0.0059  0.0280  0.0646    0.0051  0.0057  0.0141
Table 11. Running time of each method.
                RX      CRD     RGAE      Auto-AD  CTA     GT-HAD  TAEF    DFAN-HAD  MSNet    PUNNet   CTDST-HAD
                [13]    [19]    [29]      [28]     [31]    [32]    [33]    [34]      [35]     [36]
Cement Street   0.2810  0.3448  4.31      66.22    12.71   3.37    71.37   31.53     79.94    181.82   6.67
Holly           0.5619  2.13    1184.38   139.24   24.53   6.71    29.25   41.32     85.11    182.89   8.98
Jungle-I        1.856   28.59   7870.62   397.00   83.33   21.53   85.57   1851.31   357.04   586.82   24.36
Gulfport        0.06    17.45   63.13     68.15    1.79    0.30    7.30    36.12     20.05    89.40    2.20
HYDICE          0.04    13.81   48.03     43.74    1.20    0.20    6.39    29.29     27.06    73.00    2.91
Texas Coast     0.05    17.41   63.90     57.10    1.91    0.30    7.09    45.83     20.46    61.02    3.31
San Diego-I     0.07    23.23   74.33     47.26    2.66    0.37    7.56    71.63     21.12    98.75    3.80
San Diego-II    0.05    17.33   62.21     43.83    1.74    0.30    7.31    45.52     19.63    85.14    3.26
Table 12. Model complexity.
             RGAE    Auto-AD  CTA          GT-HAD  TAEF   DFAN-HAD  MSNet      PUNNet  CTDST-HAD
             [29]    [28]     [31]         [32]    [33]   [34]      [35]       [36]
Params (M)   0.038   13.25    1.097        0.26    0.24   0.08      0.04       0.15    3.94
FLOPs (G)    13.61   11.97    2.195×10⁻³   2.64    0.02   2×10⁻⁴    7.5×10⁻⁵   0.47    2.68
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Deng, L.; Ying, J.; Wang, Q.; Cheng, Y.; Zhou, B. Contrastive–Transfer-Synergized Dual-Stream Transformer for Hyperspectral Anomaly Detection. Remote Sens. 2026, 18, 516. https://doi.org/10.3390/rs18030516

