1. Introduction
A hyperspectral image is a three-dimensional data cube that contains rich spatial and spectral information and underpins many advanced vision tasks, including target detection [1,2,3], anomaly detection [4,5,6], land cover classification [7,8,9], and change detection [10,11,12]. Hyperspectral target detection exploits discriminative information from prior spectra to identify potential targets precisely, even in complex backgrounds, with pixel-level or sub-pixel accuracy. As a result, hyperspectral target detection methods have attracted extensive attention in a wide range of practical applications, including remote sensing monitoring, military reconnaissance, land and marine target observation, mineral exploration, and criminal investigation [13]. However, the practical effectiveness of hyperspectral target detection is fundamentally constrained by two intertwined challenges: the severe scarcity of labeled training samples, as manually annotating pixel-level targets across vast hyperspectral scenes is prohibitively expensive and often infeasible; and the complex, heterogeneous, and dominant nature of background clutter, which makes isolating faint target signals extraordinarily difficult. To address these challenges, a wide range of methods has been explored.
Traditional methods distinguish between target and background by mathematically exploring the differences between spectral vectors and a priori spectra. Constrained Energy Minimization (CEM) is one of the most widely used methods in this domain. It constructs a finite impulse response filter that passes the target signature under a linear constraint while minimizing the filter output energy [14], thereby suppressing background responses. Given its notable advantages in speed and accuracy, researchers have proposed numerous variants of CEM to more precisely capture the inherent characteristics of the target and background. Hierarchical CEM [15] is one of the most representative improvements; it stacks CEM detectors in successive layers. Through layer-by-layer filtering, the target is preserved and the background is suppressed, gradually improving detection performance. Another insightful variant is the ensemble CEM [16], which introduces cascaded detection and multi-scale scanning strategies, enhancing the generalization ability and nonlinear discriminative capacity of the hyperspectral target detector and yielding higher detection accuracy and stability. Chang [17] recently reported that iterative kernel CEM (IKCEM) achieves even better performance. The multi-directional CEM method [18] employs an adaptive neighborhood feature aggregation strategy to assess the importance of neighborhood information from different directions. Another commonly used target detection method is Orthogonal Subspace Projection (OSP) [19], which reduces the dimensionality of the hyperspectral image and suppresses insignificant spectral features. Specifically, it projects each pixel vector onto the orthogonal complement of the background subspace and then projects the residual onto the target vector of interest, thereby enhancing the signal-to-noise ratio. Ren and Chang [20] combined OSP and CEM to derive the target-constrained interference-minimized filter (TCIMF), which uses OSP to annihilate undesired targets, increasing target detectability, while using CEM to improve background suppression. To better cope with the background, Chang [21] developed a background-annihilated TCIMF (BA-TCIMF). Spectral Angle Mapper (SAM) [22] detects targets by evaluating the spectral angle between each pixel’s spectrum and the spectral signature of the target of interest. Additionally, some methods based on linear mixture models distinguish between the target and background by performing spectral decomposition on the hyperspectral image [23]. Other methods such as the Matched Subspace Detector (MSD) [24], Adaptive Cosine Estimator (ACE) [25,26], and Matched Filter [27] have also demonstrated strong applicability in hyperspectral target detection tasks. While these methods are computationally efficient and require no training data, they often fail to adequately model the complexity and heterogeneity of real hyperspectral backgrounds due to oversimplified statistical assumptions and limited capacity to capture nonlinear spatial–contextual information. Consequently, their performance degrades in scenes with highly variable backgrounds.
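To make two of the classical detectors above concrete, the following is a minimal NumPy sketch of SAM and OSP as they are described in the text. It is an illustration under stated assumptions rather than the implementation of any cited work: the function names `sam_angle` and `osp_scores`, the synthetic signatures, and the use of a pseudo-inverse to build the projector are all choices made here.

```python
import numpy as np

def sam_angle(pixels, target):
    """Spectral angle (radians) between each row of `pixels` and `target`."""
    num = pixels @ target
    den = np.linalg.norm(pixels, axis=1) * np.linalg.norm(target)
    return np.arccos(np.clip(num / den, -1.0, 1.0))

def osp_scores(pixels, target, background):
    """OSP: annihilate the background subspace, then match the target.

    `background` holds the undesired signatures as columns (bands x q).
    """
    bands = background.shape[0]
    # Projector onto the orthogonal complement of span(background).
    projector = np.eye(bands) - background @ np.linalg.pinv(background)
    return pixels @ projector @ target

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
bands = 6
target = rng.random(bands) + 0.5
background = rng.random((bands, 2))
mix = background @ np.array([0.7, 0.3])
pixels = np.vstack([
    mix,                 # pure background mixture
    mix + 0.5 * target,  # background plus a sub-pixel target
    2.0 * target,        # scaled copy of the target signature
])
angles = sam_angle(pixels, target)
scores = osp_scores(pixels, target, background)
```

In this toy setting the pure background pixel receives an OSP score of essentially zero, because it lies in the annihilated subspace, while the scaled target has a spectral angle of essentially zero, because SAM ignores magnitude.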
Deep learning–based hyperspectral target detection methods can be broadly divided into supervised and unsupervised categories. Among supervised approaches, Li et al. [28] pioneered a convolutional neural network (CNN) framework for hyperspectral target detection, laying a foundation for subsequent studies. Another representative method is HTD-net [29], which employs an improved autoencoder to extract target features and then uses a linear prediction strategy to distinguish background spectra from target samples. However, supervised methods are inherently constrained by the scarcity of high-quality labeled samples, as acquiring sufficient pixel-level annotations is costly and time-consuming, particularly when targets are small or rare. Unsupervised models, in contrast, aim to enhance the network’s ability to learn latent target representations by applying specific constraints, thereby alleviating the difficulty of collecting large-scale labeled hyperspectral datasets [30,31]. The Variational Autoencoder (VAE) [32] is one of the most representative and scalable unsupervised architectures. Xie et al. [31] proposed an enhanced VAE to capture more complex nonlinear structures in hyperspectral data, while Shi et al. [33] integrated residual learning to construct a macro–micro residual autoencoder tailored for hyperspectral target detection. In addition, several studies emphasize the role of background modeling. For instance, the adversarial autoencoder in [34] estimates background distributions through adversarial learning, and the method in [35] uses a VAE with orthogonal-subspace constraints to disentangle background components. Although these methods attempt to model background information, their modeling is typically indirect and remains limited in addressing the intrinsic complexity, heterogeneity, and noise characteristics of real-world hyperspectral backgrounds. Most approaches rely solely on spectral-domain statistics without explicitly characterizing background noise distributions or spatial–spectral dependencies. As a result, when the background exhibits high spectral variability, nonlinear mixing, or locally correlated clutter, the learned models often become insufficiently robust, leading to reduced target–background separability.
In contrast, the proposed method explicitly models background noise by estimating a multivariate Gaussian distribution from the scene and incorporating it into a forward diffusion process. This enables the diffusion model to learn both the generation and suppression mechanisms of background noise in a principled, physically grounded manner. Furthermore, we introduce a centre-weighted strategy that integrates spatial–neighborhood information to accurately capture local spectral variations. By combining noise-aware diffusion and spatial–spectral weighting, our method achieves stronger background suppression and more reliable target enhancement, particularly in highly complex, heterogeneous, or spatially correlated backgrounds. Additionally, the method naturally addresses the scarcity of labeled samples by employing background-only self-supervised training, requiring only a single hyperspectral image and a single prior target signature without pixel-level annotations. Note: In this study, “limited labeled samples” refers specifically to scenarios where the number of available positive samples is extremely scarce. Overall, the main contributions of this work are threefold.
(1) We introduce the diffusion model into the field of hyperspectral target detection and improve the noise addition process of the diffusion model. By incorporating multivariate Gaussian background noise, the denoising network can learn the distribution of background noise, thereby acquiring the capability to suppress the background. This method significantly mitigates background interference, contributing to improved detection rates and reduced false alarm rates.
(2) To obtain an accurate background noise distribution, we propose a spatial–spectral centre-weighted multivariate Gaussian background noise generation (SSBNG) strategy. By leveraging superpixel segmentation to consider local spatial neighborhood information, the spatial–spectral correlations within hyperspectral images are fully exploited. This provides high-quality and reliable noise samples for the subsequent noise addition stage, improving the accuracy and reliability of the overall method.
(3) To validate the effectiveness of the proposed method, comprehensive experiments and evaluations were conducted. The experimental results demonstrate that our method outperforms existing state-of-the-art methods in hyperspectral target detection. Notably, the proposed model adapts well to real-world scenarios with complex background information and exhibits robust background suppression.
The remainder of this paper is organized as follows.
Section 2 details the proposed novel hyperspectral target detection method.
Section 3 reports and analyzes the experimental results obtained on four real-world hyperspectral datasets.
Section 4 presents the discussion.
Section 5 concludes the paper.
2. Proposed Method
In this section, we introduce a novel framework for target detection. The core idea of the framework is to train the diffusion model by generating multivariate Gaussian background noise, so as to construct a denoising network that can effectively suppress the background interference. Considering that the performance of the denoising network is highly dependent on the accuracy of the background noise distribution, we design a concise and efficient spatial–spectral centre-weighted background noise generation (SSBNG) strategy. This strategy makes full use of the spatial–spectral properties of hyperspectral data to generate high-quality background noise samples. After the training is completed, the denoising network performs background suppression on the hyperspectral data, and the final detection results are subsequently obtained by Mahalanobis distance calculation. It should be noted that the proposed background modeling strategy uses a single multivariate Gaussian to represent the background. This simplification is reasonable for scenes where the background is relatively homogeneous or the dominant background type can be captured by a single Gaussian. Empirical checks on the tested datasets indicate that this assumption reasonably approximates the background distribution. However, in highly heterogeneous scenes containing multiple distinct background materials, the performance of the method may degrade. Future extensions could adopt mixture-of-Gaussians models to better capture multi-modal background distributions. Therefore, the proposed method is most suitable for scenarios with approximately homogeneous background structures.
Figure 1 and Figure 2 illustrate the overall architecture and workflow of the proposed framework.
2.1. Brief Review of Diffusion Model
Diffusion models, also known as diffusion probabilistic models [36], are a class of latent variable models (LVMs) [37]. Inspired by nonequilibrium thermodynamics, these models construct a Markov chain to progressively perturb the input sample into a standard Gaussian distribution. The denoising diffusion probabilistic model (DDPM) [36,38] employed in this work consists of two core processes: the forward diffusion process and the reverse denoising process. In the forward process, noise is added to the original sample $x_0$ at each time step $t$, forming a Gaussian Markov chain with the following distribution

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right) \tag{1}$$

where $\beta_t$ is a predefined noise scheduling parameter. This process can be simplified to directly sample $x_t$ at a given time step from $x_0$

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\right) \tag{2}$$

where $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$. The reverse process learns to progressively reconstruct the original image from noise using a neural network (typically a U-Net). The training objective is to minimize the noise prediction error

$$L = \mathbb{E}_{x_0,\epsilon,t}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right] \tag{3}$$

where $\epsilon$ is the real noise added, and $\epsilon_\theta(x_t, t)$ is the network’s predicted value.
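As a small illustration of the standard DDPM forward process reviewed above, the sketch below builds a linear noise schedule, draws a noisy sample at an arbitrary step in one shot, and evaluates the noise prediction error. The symbol and function names (`betas`, `alpha_bars`, `q_sample`, `noise_prediction_loss`) and the schedule end points are assumptions chosen here (they are common DDPM defaults, not values reported in this paper), and no denoising network is trained.

```python
import numpy as np

def make_schedule(steps, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule and the cumulative products of (1 - beta)."""
    betas = np.linspace(beta_start, beta_end, steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t given x_0 directly; `t` is a 0-based step index."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def noise_prediction_loss(eps, eps_pred):
    """Mean squared error between the added noise and its prediction."""
    return float(np.mean((eps - eps_pred) ** 2))

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule(steps=1000)
x0 = np.ones((4, 10))                    # four toy "spectra" with ten bands
x_t, eps = q_sample(x0, t=10, alpha_bars=alpha_bars, rng=rng)

# With the true noise in hand, x_0 can be recovered exactly from x_t.
x0_rec = (x_t - np.sqrt(1.0 - alpha_bars[10]) * eps) / np.sqrt(alpha_bars[10])
```

The recovery in the last line is the reason a network that predicts the added noise is sufficient for denoising: an accurate noise estimate determines the clean sample.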
Due to the high dimensionality of hyperspectral images, drawing fresh random noise in each iteration, as traditional diffusion models do, may lead to poor convergence of the denoising network. To address this, we use the same multivariate Gaussian noise and a fixed time step across all training iterations, which enhances network convergence, stabilizes the training process, and reduces the overall training time.
Although diffusion models have demonstrated outstanding performance in tasks such as natural language processing [39,40], time series forecasting [41,42], molecular graph modeling [43,44], and hyperspectral image classification [45], to the best of our knowledge, no literature has yet applied them to hyperspectral target detection. Unlike classification, target detection requires locating specific spatial targets within an image, so our diffusion model is adapted to suppress background noise while preserving target signals. We present a hyperspectral target detection method based on DDPM that treats the background as a modelable noise distribution. Through the diffusion–denoising mechanism, background interference is suppressed, which improves detection accuracy and reduces false alarm rates.
2.2. Spatial–Spectral Centre-Weighted Background Noise Generation
In this study, accurately obtaining background samples and generating background noise is crucial for the subsequent diffusion process. To address this, we propose the Spatial–Spectral Background Noise Generation (SSBNG) strategy, which aims to generate multivariate Gaussian background noise by integrating both spatial and spectral information. This strategy provides high-quality and reliable noise samples for the subsequent noise injection process, thereby effectively improving the accuracy and reliability of the entire methodology.
The specific implementation steps are as follows. First, three channels are selected from the hyperspectral image $H$ and transformed into a pseudo-color image. The simple linear iterative clustering (SLIC) method is then applied to segment the image into $K$ superpixel regions. In these regions, adjacent pixels exhibit significant similarity in color, brightness, and texture features. For each superpixel region, the average spectral value of all pixels in the region is computed as the initial background sample spectrum, as shown in the following equation

$$\bar{s}_i = \frac{1}{N_i}\sum_{j=1}^{N_i} x_{ij} \tag{4}$$

where $x_{ij}$ represents the spectral vector of the j-th pixel in the i-th superpixel, and $N_i$ represents the number of pixels in the i-th superpixel. To measure the difference between each pixel and the average spectrum, the Euclidean distance between each pixel in the superpixel and the average spectrum is calculated, as follows

$$d_{ij} = \left\lVert x_{ij} - \bar{s}_i \right\rVert_2 \tag{5}$$
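The per-superpixel averaging and distance computation described above can be sketched in a few lines of NumPy. The sketch assumes that a superpixel label map is already available (for example from a SLIC implementation, which is not reproduced here); the function name `superpixel_means_and_distances` and the toy cube are hypothetical.

```python
import numpy as np

def superpixel_means_and_distances(cube, labels):
    """Average spectrum per superpixel and each pixel's distance to it.

    cube:   (rows, cols, bands) hyperspectral array
    labels: (rows, cols) integer superpixel map (assumed given)
    """
    pixels = cube.reshape(-1, cube.shape[-1])
    flat = labels.ravel()
    means = {}
    dists = np.empty(flat.shape[0])
    for k in np.unique(flat):
        idx = np.flatnonzero(flat == k)
        means[int(k)] = pixels[idx].mean(axis=0)
        # Euclidean distance between each member pixel and the average spectrum.
        dists[idx] = np.linalg.norm(pixels[idx] - means[int(k)], axis=1)
    return means, dists.reshape(labels.shape)

# Toy 2 x 2 image with 3 bands and two superpixels (one per row).
cube = np.array([[[1.0, 1.0, 1.0], [3.0, 3.0, 3.0]],
                 [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]])
labels = np.array([[0, 0],
                   [1, 1]])
means, dists = superpixel_means_and_distances(cube, labels)
```

In the toy example the first superpixel averages to a flat spectrum of 2, so both of its pixels lie at the same distance from the average, while the second superpixel is perfectly homogeneous and yields zero distances.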
To further optimize the background sample spectrum and ensure it accurately reflects real background information, we use a weight adjustment function to weight the contribution of pixels within each superpixel. This process emphasizes pixels that are closer to the average spectrum while diminishing the influence of anomalous pixels, thereby yielding a more reliable background spectrum sample. The weight adjustment function of [46] (Equation (6)) assigns each pixel a weight based on its distance to the average spectrum and on an adjustable weight parameter that controls the sensitivity of the weighting function. In this work, this parameter is set to the maximum distance within each superpixel to normalize the spatial–spectral deviation and ensure stable weighting. Based on the resulting weights, the background sample spectrum for each superpixel is recalculated as a weighted combination of its pixel spectra (Equation (7)).
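The centre-weighting step can be illustrated as follows. The exact weight adjustment function of [46] is not reproduced here, so the sketch substitutes a Gaussian-shaped kernel and a normalized weighted mean; both are assumptions made purely for illustration, as is the function name `centre_weighted_spectrum`. Only the choice of the sensitivity parameter as the maximum distance within the superpixel follows the text.

```python
import numpy as np

def centre_weighted_spectrum(spectra):
    """Recompute a superpixel's background spectrum with distance-based weights.

    spectra: (N, bands) pixel spectra belonging to one superpixel.
    """
    mean = spectra.mean(axis=0)
    dist = np.linalg.norm(spectra - mean, axis=1)
    sigma = dist.max()                   # sensitivity parameter, as in the text
    if sigma == 0.0:                     # homogeneous superpixel: nothing to reweight
        return mean, np.ones(len(spectra))
    # ASSUMPTION: Gaussian-shaped kernel; the weight function of [46] may differ.
    weights = np.exp(-(dist / sigma) ** 2)
    # ASSUMPTION: weighted mean normalized by the sum of the weights.
    weighted = (weights[:, None] * spectra).sum(axis=0) / weights.sum()
    return weighted, weights

# Four similar pixels and one anomalous pixel inside the same superpixel.
spectra = np.array([[1.0, 1.0, 1.0],
                    [1.1, 1.1, 1.1],
                    [0.9, 0.9, 0.9],
                    [1.0, 1.0, 1.0],
                    [5.0, 5.0, 5.0]])
plain_mean = spectra.mean(axis=0)
weighted, weights = centre_weighted_spectrum(spectra)
```

Whatever the exact kernel, the intended effect is the one visible here: the anomalous pixel receives the smallest weight, and the recomputed spectrum moves from the plain mean toward the dominant pixel population.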
However, some superpixel regions may contain target pixels, whose spectra significantly compromise the purity of the background sample. To further purify the background sample, we introduce a CEM filter, which calculates the similarity between each background sample spectrum and the target prior spectrum and removes samples that are highly similar to the target spectrum based on the similarity score. This ensures that the generated background sample reflects the pure background characteristics as much as possible. Given a spectral matrix $S = [\tilde{s}_1, \ldots, \tilde{s}_K]$ consisting of the spectra of all superpixel blocks and a prior spectrum $d$, the output response of the CEM filter can be expressed as

$$y = w^{\top} S, \qquad w = \frac{R^{-1} d}{d^{\top} R^{-1} d} \tag{8}$$

where $R = \frac{1}{K} S S^{\top}$ is the correlation matrix, and $w$ is the optimal finite impulse response filter. After sorting the similarity scores of all samples in descending order, the top 20% of samples are removed from the background sample candidate set, following thresholding strategies commonly adopted in prior CEM-based target suppression studies as well as background-learning methods such as BLTSC and OS-VAE.
Once the background samples $S_b$ are obtained, the process of background noise generation begins. First, we perform statistical analysis on the background samples, calculating the mean $\mu_b$ and covariance matrix $\Sigma_b$. The mean reflects the average characteristics of the background samples, while the covariance matrix captures the correlations between different features. Based on these statistics, background noise samples are generated using a multivariate Gaussian distribution. The probability density function of the multivariate Gaussian distribution is given by

$$p(z) = \frac{1}{(2\pi)^{n/2}\,\lvert\Sigma_b\rvert^{1/2}} \exp\left(-\frac{1}{2}(z-\mu_b)^{\top}\Sigma_b^{-1}(z-\mu_b)\right) \tag{9}$$

where $z$ is the random variable vector corresponding to the generated background noise sample $\epsilon_b$, $n$ is the dimensionality of the data, which in this study equals the dimension of the spectral vector, and $\lvert\Sigma_b\rvert$ is the determinant of the covariance matrix $\Sigma_b$. Using this distribution, we can generate noise that conforms to the background characteristics, with the distribution of the noise in the feature space matching the statistical properties of the original background sample, thereby completing the SSBNG strategy.
In this work, we model the background distribution using a single multivariate Gaussian, which provides a tractable and efficient approximation for background noise generation. This assumption is valid for scenes where the background is relatively homogeneous or dominated by a single material class. To examine the suitability of this assumption, we performed empirical distribution checks on the background samples of the tested datasets, which show that their principal spectral variations can be reasonably captured by a unimodal Gaussian distribution. However, in highly heterogeneous scenes containing multiple distinct background materials, the background distribution may become multi-modal. In such cases, using a single Gaussian may limit modeling capacity and potentially degrade detection performance. One possible extension is to replace the single Gaussian with a mixture-of-Gaussians model to better capture background complexity in multimodal environments.
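The purification and noise generation steps can be sketched together. The code below uses the standard closed-form CEM filter, removes the 20% of candidate spectra with the highest CEM response, fits a mean and covariance to the remainder, and samples Gaussian background noise. The function names, the small ridge term added to the correlation matrix, and the synthetic spectra are assumptions introduced for this example and are not taken from the paper.

```python
import numpy as np

def cem_scores(samples, target):
    """CEM response of each row of `samples` (K, bands) to `target` (bands,)."""
    R = samples.T @ samples / samples.shape[0]          # sample correlation matrix
    # ASSUMPTION: tiny ridge for numerical stability; not stated in the text.
    R = R + 1e-6 * np.trace(R) / R.shape[0] * np.eye(R.shape[0])
    Rinv_d = np.linalg.solve(R, target)
    w = Rinv_d / (target @ Rinv_d)                      # filter with w^T d = 1
    return samples @ w

def purify_and_fit(samples, target, drop_fraction=0.2):
    """Drop the most target-like samples, then estimate mean and covariance."""
    scores = cem_scores(samples, target)
    n_drop = int(round(drop_fraction * len(samples)))
    keep = np.argsort(scores)[: len(samples) - n_drop]  # discard highest scores
    background = samples[keep]
    mu = background.mean(axis=0)
    sigma = np.cov(background, rowvar=False)
    return background, mu, sigma

rng = np.random.default_rng(0)
bands = 5
target = np.array([1.0, 0.2, 0.9, 0.1, 0.8])
clean = 0.5 + 0.05 * rng.standard_normal((40, bands))   # background-like spectra
contaminated = 0.5 * clean[:5] + 0.5 * target           # spectra mixed with target
samples = np.vstack([clean, contaminated])

background, mu_b, sigma_b = purify_and_fit(samples, target)
noise = rng.multivariate_normal(mu_b, sigma_b, size=1000)
```

In this toy setting the five target-contaminated spectra have by far the largest CEM responses, so they fall inside the discarded 20%, and the sampled noise reproduces the statistics of the remaining background spectra.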
2.3. Multivariate Background Noise Estimated
In the current mainstream diffusion model framework, standard Gaussian noise is the conventional choice for noise introduction. However, in this study, through an in-depth analysis of the characteristics of hyperspectral data and the requirements of target detection tasks, we break from tradition by focusing the diffusion process specifically on background noise. This innovation arises from a profound understanding of the complex relationship between background and target in hyperspectral data. From the training point of view, the denoising network is able to learn and capture these features more efficiently as the background noise distribution features and their statistical laws in hyperspectral data have been accurately modelled through the pre-generation process. Compared to extracting target noise, the task of extracting background noise is significantly less complex. This allows the denoising network to accurately model the patterns of background noise during training, thus providing a solid foundation for the subsequent inference process. During the inference phase, based on the effective learning of background noise during training, the denoising network is able to accurately identify and treat the background as noise. This precise background extraction mechanism greatly enhances the differentiation between the background and the target, making the target more prominent in the data and providing high-quality data representations for subsequent hyperspectral target detection tasks. It is important to note that since background noise does not follow a standard Gaussian distribution, the noise diffusion equation commonly relied upon in traditional DDPM cannot be directly applied to our method. To address this critical issue, we have carefully designed a specific noise diffusion method tailored to background noise, which accounts for the actual distribution characteristics of the background noise and the complex background structure of hyperspectral data. 
The details of this approach are described as follows.
2.3.1. Diffusion Process
Given multivariate Gaussian noise generated from the hyperspectral background samples, we define the diffusion process as follows. The diffusion process at each step can be expressed as

$$x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_b \tag{10}$$

where $x_t$ represents the data state at the t-th step of the diffusion process, which is the core variable that we track and analyze. $\alpha_t$ serves as a control parameter that regulates the proportion of original data retained at each step and thus directly influences the trend of the data during the diffusion process. In our implementation, $\alpha_t$ is generated using a linear schedule across the diffusion steps. The noise satisfies $\epsilon_b \sim \mathcal{N}(\mu_b, \Sigma_b)$, meaning that the noise added at each step is not drawn from a standard Gaussian but follows the distribution determined by the background samples, with mean $\mu_b$ and covariance $\Sigma_b$. Based on the above formula, we can derive the transition probability at each step of the diffusion process. The transition probability $q(x_t \mid x_{t-1})$ characterizes the probability distribution of the transition from state $x_{t-1}$ to $x_t$, and it also follows a multivariate Gaussian distribution, expressed as

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\mu_b,\ (1-\alpha_t)\,\Sigma_b\right) \tag{11}$$

In this equation, the mean term depends not only on the previous state $x_{t-1}$ but also integrates the background mean $\mu_b$. The coefficients $\sqrt{\alpha_t}$ and $\sqrt{1-\alpha_t}$ balance the influences of the previous data and the background information. The covariance term $(1-\alpha_t)\,\Sigma_b$ indicates that, as the diffusion steps progress, the noise fluctuation level is determined by $\alpha_t$ and the background covariance $\Sigma_b$.
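A single step of the background-noise diffusion can be checked numerically. The sketch below assumes the one-step form suggested by the description, in which the previous state is scaled by the square root of a retention parameter and background noise with mean `mu_b` and covariance `sigma_b` is added with the complementary weight; the symbol names, the schedule end points, and the toy statistics are assumptions made for this example. It then compares the empirical mean and covariance of many noisy copies with the Gaussian transition described in the text.

```python
import numpy as np

def linear_alpha_schedule(steps, beta_start=1e-4, beta_end=2e-2):
    """Retention parameters alpha_t = 1 - beta_t from a linear beta schedule."""
    return 1.0 - np.linspace(beta_start, beta_end, steps)

def diffuse_step(x_prev, alpha_t, mu_b, sigma_b, rng):
    """One forward step with background noise eps_b ~ N(mu_b, sigma_b)."""
    eps_b = rng.multivariate_normal(mu_b, sigma_b, size=x_prev.shape[0])
    x_t = np.sqrt(alpha_t) * x_prev + np.sqrt(1.0 - alpha_t) * eps_b
    return x_t, eps_b

rng = np.random.default_rng(0)
mu_b = np.array([0.3, 0.5, 0.2])
sigma_b = np.array([[0.040, 0.010, 0.000],
                    [0.010, 0.030, 0.005],
                    [0.000, 0.005, 0.020]])
alphas = linear_alpha_schedule(1000)

# Many copies of one spectrum, diffused with a deliberately large step.
alpha_t = 0.9
x_prev = np.tile(np.array([1.0, 0.5, 0.2]), (20000, 1))
x_t, eps_b = diffuse_step(x_prev, alpha_t, mu_b, sigma_b, rng)

expected_mean = np.sqrt(alpha_t) * x_prev[0] + np.sqrt(1.0 - alpha_t) * mu_b
expected_cov = (1.0 - alpha_t) * sigma_b
empirical_mean = x_t.mean(axis=0)
empirical_cov = np.cov(x_t, rowvar=False)
```

Unlike standard DDPM, the noisy state is pulled toward the background mean rather than toward zero, which is what allows the denoiser to associate the injected noise with background content.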
2.3.2. Denoising Process
After completing the diffusion process, we proceed to the denoising process. The core task of this stage is to learn a model $p_\theta$ that gradually restores the data from the noisy state $x_t$ to a state close to the original data. The conditional distribution $p_\theta(x_{t-1} \mid x_t)$ in the denoising process is also assumed to follow a multivariate Gaussian distribution, expressed as

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right) \tag{12}$$

Here, the mean $\mu_\theta(x_t, t)$ and covariance $\Sigma_\theta(x_t, t)$ are the key parameters of the model, and they are parameterized through a neural network. This allows the model to adjust the mean and covariance based on the input $x_t$ and the current time step $t$, adapting to different noise contamination levels and data characteristics. To solve and optimize this model more efficiently, the mean $\mu_\theta(x_t, t)$ is expressed through the predicted noise in Equation (13).
In this equation, $\epsilon_\theta(x_t, t)$ is the output of the neural network, which predicts the noise component. The cumulative decay factor $\bar{\alpha}_t$ is defined as the product of $\alpha_s$ from $s = 1$ to $t$, i.e., $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$. This reflects the cumulative decay level of the data from the initial state to the t-th step due to the influence of $\alpha_s$ at each diffusion step. Equation (13) scales the predicted noise component, combines it with the current state $x_t$, and then incorporates the background mean $\mu_b$, which results in a mean expression that considers both the current noise state and the background mean. This design allows the model to leverage the available information during the denoising process, providing a more accurate estimate of the previous state $x_{t-1}$.
The denoising network contains a pre-embedding layer with 256 channels, followed by three residual layers with 200, 100, and 50 channels. Each layer employs Tanh activation and LayerNorm normalization. A skip connection adds the input to the output. The network predicts the noise term at each diffusion time step.
2.3.3. Loss Function
To ensure that the denoising model accurately learns to remove background noise, we need a suitable loss function. The core goal of this loss function is to make the noise prediction $\epsilon_\theta(x_t, t)$ output by the neural network as close as possible to the true background noise $\epsilon_b$. We define the loss function in the form of a weighted norm, expressed as

$$L = \mathbb{E}_{x_0,\epsilon_b,t}\left[\left\lVert \epsilon_b - \epsilon_\theta(x_t, t) \right\rVert^2_{\Sigma_b^{-1}}\right] \tag{14}$$

In this expression, $\mathbb{E}_{x_0,\epsilon_b,t}$ represents the expectation over the initial data, noise, and time steps. The term $\lVert\cdot\rVert^2_{\Sigma_b^{-1}}$ represents the weighted norm, where $\Sigma_b^{-1}$ acts as the weighting matrix. This weights the noise errors across different dimensions according to the covariance characteristics of the background noise, so that the loss function more accurately reflects the difference between the predicted and true noise. The above loss function is mathematically equivalent to

$$L = \mathbb{E}_{x_0,\epsilon_b,t}\left[\left(\epsilon_b - \epsilon_\theta(x_t, t)\right)^{\top}\Sigma_b^{-1}\left(\epsilon_b - \epsilon_\theta(x_t, t)\right)\right] \tag{15}$$

This matrix form is more concise and lends itself to matrix operations during computation and optimization, which improves computational efficiency during model training.
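The equivalence between the weighted-norm form and the matrix form of the loss can be verified numerically. The sketch below assumes that the weighting matrix is the inverse of the background covariance, which is the reading suggested by the description; the function names and the toy data are hypothetical. It computes the quadratic form directly and again through Cholesky whitening, which turns the weighted norm into an ordinary squared Euclidean norm.

```python
import numpy as np

def weighted_noise_loss(eps_true, eps_pred, sigma_b):
    """Batch mean of r^T sigma_b^{-1} r, with r = eps_true - eps_pred."""
    resid = eps_true - eps_pred
    solved = np.linalg.solve(sigma_b, resid.T).T        # sigma_b^{-1} r, row-wise
    return float(np.mean(np.sum(resid * solved, axis=1)))

def whitened_loss(eps_true, eps_pred, sigma_b):
    """Same quantity as ||L^{-1} r||^2, where sigma_b = L L^T (Cholesky)."""
    chol = np.linalg.cholesky(sigma_b)
    white = np.linalg.solve(chol, (eps_true - eps_pred).T)
    return float(np.mean(np.sum(white ** 2, axis=0)))

rng = np.random.default_rng(0)
mu_b = np.array([0.3, 0.5, 0.2])
sigma_b = np.array([[0.040, 0.010, 0.000],
                    [0.010, 0.030, 0.005],
                    [0.000, 0.005, 0.020]])
eps_true = rng.multivariate_normal(mu_b, sigma_b, size=64)
eps_pred = eps_true + 0.01 * rng.standard_normal(eps_true.shape)  # imperfect guess

loss_matrix_form = weighted_noise_loss(eps_true, eps_pred, sigma_b)
loss_whitened = whitened_loss(eps_true, eps_pred, sigma_b)
loss_identity = weighted_noise_loss(eps_true, eps_pred, np.eye(3))
```

With an identity weighting matrix the loss reduces to the usual squared error, so the covariance weighting can be viewed as rescaling the residual along the directions in which the background noise varies least.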
2.4. Target Detection
In the target detection process, the hyperspectral data is fed into the trained diffusion model. The diffusion network treats the complex background as noise and removes it, producing a background-suppressed hyperspectral image $\hat{H}$. However, due to the spectral similarity between some targets and background materials in certain bands, the background suppression process may cause partial loss of target spectral information. To address this issue, we employ a Mahalanobis distance-based detection strategy that quantitatively measures the difference between each pixel and the background model.
The background model is constructed from the background-suppressed data $\hat{H}$ using the background regions whose spatial locations were determined before the diffusion process. Based on the spectral information of these background regions, the mean and covariance of the background distribution are re-estimated to describe the global spectral characteristics after background suppression. Each pixel is then compared to this background model using the Mahalanobis distance

$$D(\hat{x}) = \sqrt{(\hat{x} - \hat{\mu}_b)^{\top}\,\hat{\Sigma}_b^{-1}\,(\hat{x} - \hat{\mu}_b)} \tag{16}$$

where $D(\hat{x})$ represents the Mahalanobis distance between the pixel spectrum $\hat{x}$ and the background distribution, and $\hat{\mu}_b$ and $\hat{\Sigma}_b$ denote the mean vector and covariance matrix of the background, respectively. A larger $D(\hat{x})$ indicates that the pixel is more dissimilar to the background and thus more likely to belong to a target region. The overall procedure of the proposed target detection framework is summarized in Algorithm 1.
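The final scoring step can be sketched as follows. The function `mahalanobis_map`, the boolean background mask, the small ridge added to the covariance, and the synthetic cube are all assumptions introduced for this illustration; in the proposed framework the input would be the background-suppressed image and the mask would come from the background regions located before diffusion.

```python
import numpy as np

def mahalanobis_map(cube, background_mask):
    """Mahalanobis distance of every pixel to the background statistics.

    cube:            (rows, cols, bands) background-suppressed image
    background_mask: (rows, cols) boolean array, True for background pixels
    """
    rows, cols, bands = cube.shape
    pixels = cube.reshape(-1, bands)
    bg = pixels[background_mask.ravel()]
    mu = bg.mean(axis=0)
    # ASSUMPTION: small ridge keeps the covariance invertible.
    cov = np.cov(bg, rowvar=False) + 1e-6 * np.eye(bands)
    diff = pixels - mu
    d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return np.sqrt(np.maximum(d2, 0.0)).reshape(rows, cols)

# Toy scene: low-variance background with one implanted target pixel.
rng = np.random.default_rng(0)
cube = rng.normal(0.1, 0.01, size=(8, 8, 4))
cube[3, 4] += np.array([1.0, 0.5, 0.8, 0.3])
mask = np.ones((8, 8), dtype=bool)
mask[3, 4] = False                       # target location excluded from statistics

dmap = mahalanobis_map(cube, mask)
peak = np.unravel_index(dmap.argmax(), dmap.shape)
```

The detection map is then thresholded or evaluated directly; in this toy scene the implanted pixel is the location with the largest distance.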
| Algorithm 1: Hyperspectral Target Detection |
| Input: Hyperspectral image tensor, prior target spectrum |
| 1: Perform superpixel segmentation and compute the average spectrum by (4) |
| 2: Calculate centre-weighted spectral data by (5), (6), and (7) |
| 3: Filter out target spectra to obtain pure background spectra by (8) |
| 4: Generate multivariate Gaussian background noise by (9) |
| For each epoch |
| Add background noise by (10) and (11) |
| Train the denoising network by (12), (13), and (15) |
| End for |
| 5: Use the trained denoising network for background suppression |
| 6: Compute Mahalanobis distance through background modeling to get detection results by (16) |
| Output: Final detection result |