Article

NCSS-Net: A Negatively Constrained Network with Self-Supervised Band Selection for Hyperspectral Image Underwater Target Detection

PCA Laboratory, The Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, The Jiangsu Key Laboratory of Image and Video Understanding for Social Security, and the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(3), 418; https://doi.org/10.3390/rs18030418
Submission received: 27 December 2025 / Revised: 21 January 2026 / Accepted: 24 January 2026 / Published: 27 January 2026
(This article belongs to the Special Issue Underwater Remote Sensing: Status, New Challenges and Opportunities)

Highlights

What are the main findings?
  • This study proposes a novel unsupervised framework for hyperspectral underwater target detection, integrating self-supervised band selection with a physically-constrained autoencoder to enhance discriminability between subtle target signals and complex nearshore backgrounds.
  • The method achieves state-of-the-art detection performance across three challenging nearshore scenes, demonstrating superior robustness and accuracy compared to existing methods.
What are the implications of the main findings?
  • NCSS-Net provides a practical, annotation-free solution for real-world nearshore monitoring applications, eliminating the need for manual labeling or accurate estimation of environmental parameters.
  • The framework integrates spectral unmixing, band selection, and deep learning in a synergistic manner, offering a generalizable, adaptable framework for other hyperspectral detection tasks.

Abstract

Detecting nearshore underwater targets in hyperspectral imagery is challenging because of complex background clutter and weak, distorted target signals. Extracting discriminative features is therefore a critical step. Current methods are often constrained by high spectral redundancy and reliance on manual annotations, leading to suboptimal detection performance. To address these problems, this paper proposes a novel underwater target detection framework that integrates self-supervised band selection with physically constrained detection, called the negatively constrained network with self-supervised band selection (NCSS-Net). Specifically, NCSS-Net first generates a target-prior abundance map via the Normalized Difference Water Index and spectral unmixing. This abundance map is then converted into a binary target mask through adaptive thresholding. The binary target mask serves as pseudo-labels and guides an Artificial Bee Colony algorithm to identify a maximally discriminative band subset. These bands are then fed into a negatively constrained autoencoder. This network is trained with a specialized loss function to enforce negative correlation between the target and water endmembers, thereby enhancing their separability. Experimental results demonstrate that NCSS-Net outperforms existing state-of-the-art methods, offering an effective and practical solution for nearshore underwater monitoring applications. Our code will be available online upon acceptance.

1. Introduction

Hyperspectral imaging (HSI) is an advanced remote sensing technology that acquires spectral data across numerous contiguous, narrow wavelength intervals [1,2,3]. By integrating detailed spectral and spatial information, HSI offers superior capabilities for target detection in complex environments. Hyperspectral target detection (HTD) is defined as the process of identifying and localizing specific targets of interest within a hyperspectral image based on their known spectral signatures. In recent years, HTD has been extensively studied in various fields, including mineral exploration, environmental monitoring, oceanography, and marine biology [4,5].
However, the application of HTD technology in underwater environments faces significant challenges. Specifically, the unique nonlinear spectral mixing effects in underwater environments result in a received spectrum that is essentially a complex combination of the target, water, and seabed background [6]. This complexity, coupled with the prevalence of mixed pixels in complex nearshore environments, makes it extremely difficult to obtain key prior information and to define target boundaries in practice. Consequently, hyperspectral underwater target detection (HUTD) has emerged as a distinct and challenging research subfield. These challenges render traditional hyperspectral target detection methods that rely on shallow feature extraction and spectral matching particularly inadequate. Their inefficacy stems primarily from the intense absorption and scattering of light in water, which causes the observed target spectrum to deviate significantly from pre-obtained reference spectra and become highly similar to the water spectrum itself [7].
To address this issue, researchers have focused on water-optical models, which relate the water-leaving radiation to the inherent optical properties (IOPs) of water bodies. Jay et al. [8,9] introduced bathymetric models to address spectral distortion caused by water bodies and adopted a traditional land-based algorithm to detect underwater targets. Qi et al. [10] further integrated spectral-spatial information with iterative depth estimation for underwater detection. However, accurately estimating IOPs in unknown water environments remains challenging, limiting the practical utility of these methods.
In contrast, hyperspectral unmixing (HU) offers a promising path forward. Unlike model-driven approaches, HU adaptively learns the spectral signatures of constituent materials from the imagery, including the water background and potential targets. This data-driven approach facilitates separating the target signal from the mixed spectrum. Consequently, HU-based methods for HUTD have attracted attention. Qi et al. [11] first applied hyperspectral unmixing to underwater target detection and designed the UTD-Net network. Li et al. [12] addressed the unsupervised nonlinear unmixing problem with a dual-stream network incorporating the EMLM. Liu et al. [13] proposed a nonlinear unmixing autoencoder (NUN-UTD) trained on pseudo-mixed data. Despite these advances, most existing methods still rely on supervised paradigms or require accurate prior knowledge, restricting their applicability in scenarios where such information is scarce.
Although HUTD has shown application potential, research in this area still faces many challenges. First, the lack of publicly available standard datasets limits evaluation, casting doubt on practical robustness. Second, most existing methods rely heavily on prior knowledge or parameter estimation, which is often unavailable. Third, hyperspectral images typically contain hundreds of bands, many of which are redundant or noisy, so band selection becomes crucial to improving detection efficiency. Recent work has begun to explore unsupervised band selection. Liu et al. [14] proposed a robust positive and unlabeled learning framework with key band selection (PU-KBS). Zhang et al. [15] proposed an unsupervised feature-band-based detection method (FBUD) that selects discriminative bands via NDWI-guided unmixing.
Building upon these insights, we propose a novel unsupervised detection framework named the negatively constrained network with self-supervised band selection (NCSS-Net) for HUTD. Fundamentally distinguishing itself from NUN-UTD, which relies on prior target spectra and pseudo-mixed data generation, our proposed NCSS-Net adopts a fully self-supervised paradigm. It introduces a novel self-supervised band-selection strategy that dynamically identifies discriminative spectral bands without prior spectral information. Furthermore, it incorporates a negatively constrained autoencoder that enforces a physically meaningful negative correlation between target and water endmembers, a constraint absent in NUN-UTD. This enables NCSS-Net to operate without synthetic data or accurate prior knowledge, enhancing its adaptability in complex nearshore scenarios. The main contributions of this work are summarized as follows:
1. We introduce a novel self-supervised band selection strategy that uses a target mask generated from NDWI-driven spectral unmixing to guide the ABC algorithm in selecting a maximally discriminative band subset.
2. We design a negatively constrained autoencoder network that incorporates a physically meaningful negative correlation mechanism derived from unmixing results, enhancing discriminability between targets and complex backgrounds.
3. We construct the NCSS-Net framework, which synergistically integrates spectral unmixing, band selection, and pseudo-label augmentation. It achieves robust detection performance in nearshore scenes through iterative refinement of pseudo-labels and band subsets.
The remainder of this article is organized as follows: Section 2 reviews related work, and Section 3 describes the framework of the proposed network. Section 4 then presents the experimental results, and Section 5 discusses them. Finally, Section 6 concludes the article.

2. Related Work

Traditional HTD methods primarily rely on spectral matching, such as the matched filter and adaptive cosine estimator. The main idea in these methods is to detect targets by calculating the similarity between the spectrum of each pixel in the image and a known target-prior spectrum. However, in underwater environments, the strong absorption and scattering of light by water cause severe distortion of the target spectrum, leading to significant differences between the measured target spectrum and the measured prior spectrum. This fundamental limitation has prompted the development of two different technical routes to address the challenge of hyperspectral underwater target detection.

2.1. Water Optical Model-Based Methods

To address water interference, researchers have developed optical models of water. These model-driven methods associate water-leaving radiation with IOPs of water bodies.
Seminal work by Jay et al. [8,9] introduced bathymetric models and adopted a traditional land-based algorithm to detect underwater targets. They further proposed a statistical method using neighborhood pixel information [16], improving the inversion accuracy of depth and water-quality parameters. Gillis [6] combined physical models with manifold learning to achieve geometric dimension reduction by mining target-space structures. Qi et al. [10] proposed optimizing the fusion of spectral and spatial information and applying iterative joint depth estimation for underwater target detection. In addition, Xia and Gu [17] proposed using the inverted water depth and the seabed reference reflectance as characteristic parameters of underwater targets. Parallel developments in algorithmic optimization included band-selection methods. Fu et al. [18] demonstrated real-time potential through speed-optimized detection, and Qi et al. [19] introduced an unsupervised approach that leverages bathymetric models to learn discriminative subspaces.
More recently, the focus has shifted towards integrating these physical models with data-driven techniques. To compensate for the lack of real labeled data, Li et al. [20] used the bathymetric model to generate synthetic data, while the work in [21] introduced a 3D convolutional and depth-estimation network to improve feature extraction and object discrimination. In addition, Liu et al. [22] proposed an internal scanning system with a two-stage mirror, achieving wide-field-of-view imaging without platform motion. Li et al. [23] proposed an end-to-end conditional diffusion model that effectively bypasses the need for precise environmental parameter estimation. A growing trend involves embedding physical priors directly into deep architectures. In this vein, the authors of [24] proposed IOPE-IPD, a network that integrates an IOP physical model with an autoencoder to estimate absorption and backscattering coefficients in a physically constrained manner.
Despite these advances, the performance of model-driven methods in complex, dynamic nearshore waters depends heavily on accurate estimates of environmental parameters, which are notoriously difficult to obtain. This limitation has motivated a shift toward data-driven techniques and end-to-end models that reduce or eliminate this dependency.

2.2. Unmixing-Based Methods

HU has become a promising method for underwater target detection. By decomposing mixed pixels into endmembers and their corresponding abundances, HU offers an alternative approach that avoids the dependency on IOPs and prior knowledge of target spectra or water optical parameters. Instead, it adaptively learns the spectral characteristics of constituent components directly from the data. This advantage has led to widespread study of HU-based methods for underwater target detection.
A significant branch of HU research leverages autoencoders within deep learning frameworks [25,26,27,28]. Early efforts often relied on the Linear Mixing Model (LMM), which assumes that the spectrum of each pixel is a linear combination of pure endmember spectra [29]. Classic endmember extraction algorithms, such as VCA [30] and N-FINDR [31], are used to extract endmembers representing different material categories from underwater hyperspectral images and estimate abundance maps for each pixel. In this framework, target detection is transformed into analyzing the target abundance map. Qi et al. [11] pioneered the application of HU and developed the UTD-Net, a network that combines a bathymetric model with an autoencoder to generate target abundance maps. Huang et al. [32] applied hyperspectral imaging to the detection of microplastics on the seabed, demonstrating its practical viability in underwater conditions. In addition, Qi et al. [33] proposed a hybrid-level contrastive learning framework (HUCLNet) that integrates reliability-guided clustering and self-paced learning to address nearshore spectral distortions.
However, the LMM fails to account for multiple scattering and nonlinear light interactions, leading to errors in abundance estimates in nearshore underwater environments. This has motivated the exploration of nonlinear unmixing autoencoders [34,35,36,37]. Li et al. [12] combined the EMLM physical model with a dual-stream deep learning architecture to solve unsupervised nonlinear unmixing problems. Zhu et al. [38,39] proposed an abundance learning approach that emphasizes collaboration among endmembers to improve abundance accuracy. Recent advances continue to enrich this field: Liu et al. [13] proposed a nonlinear unmixing autoencoder (NUN-UTD) that incorporates target-prior spectrum preservation and pseudo-mixed data training, while Zhang et al. [15] presented FBUD, which selects discriminative spectral bands via NDWI-guided unmixing and spectral difference analysis.
While nonlinear methods offer superior theoretical modeling, they often depend on precise assumptions about underwater optics and large volumes of annotated data, which are often scarce in underwater scenarios.
Furthermore, deep learning has significantly advanced underwater visual understanding. Comprehensive surveys have documented the evolution of AI-driven optical underwater target detection [40] and methods based on architectures like YOLO and CNN [41]. For semantic understanding of underwater scenes, researchers have proposed the AquaSketch-enhanced cross-scale information fusion method [42], which introduces sketch-structural priors and employs a top-down dual-branch pyramid to fuse multi-scale features, significantly improving underwater image captioning accuracy. Concurrently, Li et al. [43] proposed an attention-based model that fuses CLIP-based textual features with Faster R-CNN-extracted visual features to enhance semantic description capability. These methods demonstrate how structural priors and cross-modal fusion can advance the understanding of complex underwater environments.
In addition to improving unmixing models, band selection has been recognized as a key strategy for enhancing efficiency and robustness. FBUD [15] demonstrates the effectiveness of unsupervised band selection, which inspires our approach to jointly optimize band selection and unmixing within a unified self-supervised framework.

3. Materials and Methods

Figure 1 illustrates the architecture of the proposed NCSS-Net, an unsupervised framework designed for efficient nearshore underwater target detection. The core idea is to integrate self-supervised band selection with a physically-constrained deep network to significantly enhance the discriminability between subtle target signals and complex background clutter. The network primarily consists of three components: target-prior unmixing, artificial bee colony-based band selection, and a negatively constrained autoencoder network. The process begins by generating a target mask through NDWI and spectral unmixing. This mask serves as pseudo-labels to guide an optimization algorithm in identifying a maximally discriminative band subset. These optimal bands are then processed by a negatively constrained autoencoder, which is trained using the same target mask as the supervision. A specialized loss function ensures a negative correlation between target and water endmembers during training. The final output is the detection map for underwater targets.

3.1. Target-Prior Unmixing

In the HUTD task, land–water separation is the primary prerequisite for subsequent procedures. Given an input hyperspectral image $I \in \mathbb{R}^{M \times N \times B}$, the Normalized Difference Water Index (NDWI) [44] is first applied to extract the water region. The index is computed from the green band ($I_G$) and the near-infrared band ($I_{NIR}$) as follows:
$$\mathrm{NDWI} = \frac{I_G - I_{NIR}}{I_G + I_{NIR}}.$$
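As a minimal sketch of the index above, the NumPy snippet below computes a per-pixel NDWI map from an $M \times N \times B$ cube. The function name `ndwi`, the band positions `green_idx` and `nir_idx`, and the small `eps` guard against division by zero are illustrative assumptions; the actual band indices depend on the sensor.

```python
import numpy as np

def ndwi(cube, green_idx, nir_idx, eps=1e-12):
    """Per-pixel NDWI for an (M, N, B) reflectance cube."""
    green = cube[..., green_idx].astype(np.float64)
    nir = cube[..., nir_idx].astype(np.float64)
    # eps only guards against 0/0 on dark pixels; it is not part of the index.
    return (green - nir) / (green + nir + eps)

# Toy 1 x 2 x 2 cube: band 0 plays the role of green, band 1 of NIR.
cube = np.array([[[0.30, 0.10],    # water-like pixel: green > NIR
                  [0.10, 0.30]]])  # land-like pixel: NIR > green
index = ndwi(cube, green_idx=0, nir_idx=1)
```

Water-like pixels end up with positive values and land-like pixels with negative values, which is what the subsequent thresholding step relies on.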
To obtain the water mask $M_W$ and land mask $M_L$ from a nearshore hyperspectral image, an initial binary segmentation is obtained by applying an adaptive threshold to the NDWI. We employ Otsu's method [45] to automatically determine the optimal threshold for each scene. This method analyzes the NDWI histogram and selects the threshold $\tau$ that maximizes the separability between the water and land pixel classes, which keeps the water-land separation robust across scenes despite differences in nearshore turbidity and bottom reflectance. The initial land mask $M'_L$ is defined as the complement of the initial water mask $M'_W$. Subsequently, a series of morphological operations is applied to correct misclassifications and smooth the mask contours. Because the noise characteristics of water and land regions differ, for the water mask we first apply an opening operation to remove small isolated noise, followed by a closing operation to fill small holes. Conversely, for the land mask, we first apply a closing operation to fill holes and then an opening operation to remove isolated pixels. Specifically, the water mask $M_W$ and the land mask $M_L$ are obtained by the following equations:
$$M'_W = \begin{cases} 1, & \text{if } \mathrm{NDWI} > \tau, \\ 0, & \text{otherwise}, \end{cases} \qquad M_W = (M'_W \circ W_{se}) \bullet W_{se},$$
$$M_L = (M'_L \bullet W_{se}) \circ W_{se},$$
where $M'_W$ and $M'_L$ represent the initial water mask and initial land mask, respectively. Morphological opening and closing operations are denoted by $\circ$ and $\bullet$, respectively. The structuring element $W_{se}$ is defined as a flat disk with a radius of 3 pixels.
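The mask clean-up can be illustrated with a small NumPy-only sketch of binary opening and closing with a radius-3 disk. The helper names (`disk`, `_sweep`, `opening`, `closing`) and the border handling are assumptions made for this example, not the authors' implementation; a library routine such as those in `scipy.ndimage` would normally be used instead.

```python
import numpy as np

def disk(radius):
    """Flat disk structuring element of the given pixel radius."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return (x * x + y * y) <= radius * radius

def _sweep(mask, se, dilate):
    """Dilation (OR over shifted copies) or erosion (AND over shifted copies)."""
    mask = np.asarray(mask, dtype=bool)
    r = se.shape[0] // 2
    h, w = mask.shape
    # Assumption: outside the image counts as background for dilation and as
    # foreground for erosion, so regions touching the border are not eaten away.
    padded = np.pad(mask, r, mode="constant", constant_values=not dilate)
    out = np.zeros((h, w), dtype=bool) if dilate else np.ones((h, w), dtype=bool)
    for dy, dx in zip(*np.nonzero(se)):
        window = padded[dy:dy + h, dx:dx + w]
        out = (out | window) if dilate else (out & window)
    return out

def opening(mask, se):
    return _sweep(_sweep(mask, se, dilate=False), se, dilate=True)

def closing(mask, se):
    return _sweep(_sweep(mask, se, dilate=True), se, dilate=False)

se = disk(3)
initial_water = np.zeros((32, 32), dtype=bool)
initial_water[:, :16] = True   # water occupies the left half of the toy scene
initial_water[5, 25] = True    # isolated false-positive water pixel
initial_water[16, 8] = False   # one-pixel hole inside the water
initial_land = ~initial_water

water_mask = closing(opening(initial_water, se), se)  # open, then close
land_mask = opening(closing(initial_land, se), se)    # close, then open
```

In this toy scene the isolated pixel is removed and the hole is filled, so the cleaned water and land masks are exact complements again.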
Since the targets are submerged, we can use the obtained water mask $M_W$ to preliminarily determine their locations. We then perform endmember extraction and unmixing on the pixels within $M_W$ to obtain the corresponding abundance maps. The N-FINDR algorithm [31] is a typical endmember extraction method based on the Linear Mixing Model (LMM). It operates on a key geometric assumption: the set of all pixels forms a simplex in the high-dimensional spectral feature space, and the endmembers correspond to the vertices of the simplex that encloses the maximum volume of the pixel data. Thus, endmember extraction amounts to finding the set of pixels that forms the simplex with the largest volume. However, this method is known to be sensitive to noisy pixels, shadows, and spectral outliers, which may be incorrectly identified as endmembers. To mitigate this sensitivity, we first employ PCA-guided endmember initialization and strategic pixel sampling in the N-FINDR procedure to reduce the influence of spectral anomalies. Furthermore, the extracted endmembers are validated using physical priors such as near-infrared reflectance characteristics and spatial distribution consistency checks. Endmembers exhibiting low NIR reflectance or high abundance values in land areas are excluded from the target endmember candidate set. Finally, the negatively constrained autoencoder structure enforces a meaningful negative correlation between the target and water endmembers, thereby correcting potential semantic misassignments from the initial unmixing.
HUTD scenes are fundamentally composed of three classes: land, underwater target, and pure water. Accordingly, the number of endmembers is set to three, and the abundance map $a_i$ for each endmember $e_i$ is derived through the unmixing process. The pixels in the hyperspectral image can be represented as follows:
$$x = \sum_{i=1}^{m} e_i a_i + n,$$
where $e_i$ denotes the $i$-th endmember, $a_i$ is its corresponding abundance, $m$ is the number of endmembers, and $n$ represents additive noise.
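To make the mixing equation concrete, the sketch below synthesizes one pixel from three hypothetical endmember spectra and recovers the abundances with ordinary least squares. The random spectra, the noise level, and the use of `np.linalg.lstsq` are assumptions for illustration; unconstrained least squares does not enforce non-negativity or sum-to-one, and it is not the unmixing procedure used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
B, m = 50, 3                                  # bands, endmembers (toy sizes)

# Columns of E are hypothetical endmember spectra e_1..e_m.
E = rng.uniform(0.05, 0.9, size=(B, m))
a_true = np.array([0.2, 0.1, 0.7])            # abundances, sum to one

# Forward model: x = sum_i e_i * a_i + n
noise = rng.normal(0.0, 1e-3, size=B)
x = E @ a_true + noise

# Inverse step (illustrative only): least-squares abundance estimate.
a_est, *_ = np.linalg.lstsq(E, x, rcond=None)
```

With low noise the estimate stays close to `a_true`, which is why the abundance map of a well-chosen endmember can act as a target prior.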
Although unsupervised unmixing algorithms can separate distinct spectral signatures, the resulting endmembers are not inherently assigned physical semantics. To resolve this, semantic labels are assigned based on known physical and spatial properties. Specifically, the water endmember exhibits strong absorption in the NIR band (reflectance close to zero). The target endmember $EM_T$ is selected based on two criteria: it must exhibit a weak abundance response within the land mask $M_L$, and its reflectance must not be the lowest in the near-infrared band. The corresponding abundance map is designated $AM_T$, with values in the range $[0, 1]$, representing the fractional abundance of the target endmember for each pixel. Then the target mask $M_T \in \{0, 1\}^{M \times N}$ is generated by applying Otsu's algorithm [45] for adaptive thresholding to the target abundance map $AM_T$. Pixels with value 1 indicate potential target regions, while 0 indicates background. This target mask $M_T$ serves as a pseudo-label in the subsequent self-supervised band selection process.
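The labeling rule and the thresholding step can be sketched as follows. This is one possible reading under stated assumptions: the water endmember is taken as the one with the lowest NIR reflectance, the target endmember as the remaining one with the weakest mean abundance over land, and the Otsu threshold is returned as the upper edge of the best split bin. The function names and the toy arrays are hypothetical.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Histogram-based Otsu threshold (maximizes between-class variance)."""
    hist, edges = np.histogram(np.ravel(values), bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    prob = hist / hist.sum()
    w0 = np.cumsum(prob)                 # weight of the lower class per split
    w1 = 1.0 - w0
    cum_mean = np.cumsum(prob * centers)
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (cum_mean[-1] * w0 - cum_mean) ** 2 / (w0 * w1)
    between = np.nan_to_num(between, nan=0.0, posinf=0.0, neginf=0.0)
    return edges[np.argmax(between) + 1]  # upper edge of the best split bin

def assign_semantics(endmembers, abundances, land_mask, nir_idx):
    """endmembers: (m, B); abundances: (m, M, N); land_mask: (M, N) bool."""
    water = int(np.argmin(endmembers[:, nir_idx]))        # lowest NIR reflectance
    land_response = abundances[:, land_mask].mean(axis=1)
    candidates = [i for i in range(len(endmembers)) if i != water]
    target = min(candidates, key=lambda i: land_response[i])
    return water, target

# Toy scene: rows of `endmembers` are water, land, target; band 3 acts as NIR.
endmembers = np.array([[0.10, 0.20, 0.10, 0.01],
                       [0.20, 0.30, 0.30, 0.50],
                       [0.20, 0.30, 0.20, 0.10]])
land_mask = np.zeros((8, 8), dtype=bool)
land_mask[:, 6:] = True
abundances = np.zeros((3, 8, 8))
abundances[1] = np.where(land_mask, 0.90, 0.05)
abundances[2] = np.where(land_mask, 0.05, 0.10)
abundances[2, 2:4, 2:4] = 0.80            # a small submerged target
abundances[0] = 1.0 - abundances[1] - abundances[2]

water_idx, target_idx = assign_semantics(endmembers, abundances, land_mask, nir_idx=3)
am_t = abundances[target_idx]
target_mask = am_t > otsu_threshold(am_t)
```

The resulting `target_mask` marks only the 2 x 2 block with high target abundance, mirroring how $M_T$ is derived from $AM_T$.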

3.2. Artificial Bee Colony-Based Band Selection

Following the generation of the target mask $M_T$, band selection is conducted to reduce data dimensionality by selecting the most discriminative and least redundant key bands. The core of this step is a self-supervised strategy that leverages $M_T$ as pseudo-labels to guide the search. The band selection is accomplished using the Artificial Bee Colony (ABC) algorithm [46].
The ABC algorithm is an intelligent optimization technique designed to solve complex optimization problems in multi-dimensional spaces. In this approach, each potential solution is defined as a band combination $x_i = [b_1, b_2, \ldots, b_k]$ containing $k$ band indices, where $b_j \in \{1, 2, \ldots, B\}$ represents the $j$-th selected band index. Since the ABC update mechanism operates in continuous space while band indices must be integers, we incorporate explicit feasibility constraints to bridge this gap. After a candidate solution is generated by the ABC update rule, we first round the continuous values to obtain integer band indices. Each index is then constrained to lie within the valid range $[1, B]$. Next, duplicate indices are removed through set operations; if this removal reduces the number of distinct bands below the required size $k$, missing bands are randomly selected from the remaining pool. Finally, the resulting indices are sorted in ascending order, treating the band combination as an unordered set to ensure consistency across all candidate solutions. Only solutions that satisfy this complete validation chain proceed to fitness evaluation.
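The validation chain described above (round, clip, de-duplicate, refill, sort) can be sketched in a few lines of standard-library Python. The function name `repair` and its parameters are hypothetical; note also that Python's `round` uses round-half-to-even, which may differ from the rounding rule the authors used.

```python
import random

def repair(candidate, num_bands, k, rng=random):
    """Turn a continuous ABC candidate into k distinct, sorted band indices."""
    rounded = [int(round(v)) for v in candidate]            # 1) round to integers
    clipped = [min(max(b, 1), num_bands) for b in rounded]  # 2) clip to [1, B]
    unique = set(clipped)                                   # 3) drop duplicates
    if len(unique) < k:                                     # 4) refill at random
        pool = [b for b in range(1, num_bands + 1) if b not in unique]
        unique.update(rng.sample(pool, k - len(unique)))
    return sorted(unique)                                   # 5) ascending order

# Two entries round to band 3, one falls below 1 and one above B = 200.
bands = repair([3.4, 3.2, -5.0, 250.7], num_bands=200, k=4, rng=random.Random(0))
```

Here the repaired subset always contains bands 1, 3 and 200, plus one randomly drawn band that restores the required size of four.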
The AUC computed against the target mask is used as the fitness function to automatically and iteratively optimize the band combination within the search space. The optimization objective is to maximize the fitness function:
$$\mathrm{Fitness}(x_i) = \mathrm{AUC}\big(\mathrm{HyperspecAE}(x_i),\, M_T\big),$$
where $M_T$ denotes the target mask, serving as pseudo-labels, and $\mathrm{HyperspecAE}(x_i)$ is the predicted target abundance map obtained from an autoencoder using the band subset $x_i$ as input. The detector shares the architecture described in Section 3.3, with its input dimension adapted to $|x_i|$. Its weights are initialized via VCA-based endmember extraction for the current subset, and only a single forward pass is performed on a fixed batch of 256 randomly sampled pixels. This design reduces the evaluation time per candidate, making the ABC loop computationally tractable. Finally, $\mathrm{AUC}(\cdot, \cdot)$ computes the area under the ROC curve obtained by thresholding the output map produced by $\mathrm{HyperspecAE}(x_i)$ and comparing it with the binary labels in $M_T$.
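A hedged sketch of this fitness evaluation is given below. The AUC is computed in its pairwise (Mann-Whitney) form, which equals the area under the ROC curve; the `detector` argument is a hypothetical stand-in for the autoencoder forward pass, replaced here by a simple band mean so that the example is self-contained.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparisons (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one target and one background pixel")
    diff = pos[:, None] - neg[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

def fitness(band_subset, pixels, mask_labels, detector):
    """band_subset: 1-based band indices; pixels: (n, B) sampled spectra."""
    cols = np.asarray(band_subset) - 1
    return auc(detector(pixels[:, cols]), mask_labels)

# Stand-in detector (assumption): mean reflectance over the selected bands.
mean_detector = lambda sub: sub.mean(axis=1)

pixels = np.array([[0.5, 0.1, 0.3],
                   [0.5, 0.2, 0.1],
                   [0.5, 0.8, 0.2],
                   [0.5, 0.9, 0.3]])
labels = np.array([0, 0, 1, 1])            # pseudo-labels sampled from M_T

fit_band2 = fitness([2], pixels, labels, mean_detector)  # informative band
fit_band1 = fitness([1], pixels, labels, mean_detector)  # constant band
```

An informative band separates the two pseudo-label classes perfectly, whereas a constant band scores at chance level, which is the signal the ABC search exploits.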
The bee colony is divided into three roles:
1. Employed Bees explore the vicinity of current solutions by randomly perturbing one dimension to generate new candidate solutions. For a solution $x_i$, a new candidate $v_i$ is generated by the following:
$$v_{ij} = x_{ij} + \phi_{ij}\,(x_{ij} - x_{kj}),$$
where $j$ is a randomly chosen dimension, $k$ is a randomly selected neighboring solution, and $\phi_{ij}$ is a random number in $[-1, 1]$. The resulting values are then rounded, constrained, de-duplicated, and sorted as described above to yield a feasible integer band subset. A greedy selection strategy is then used to decide whether to retain the improved solution.
2. Onlooker Bees evaluate the fitness values shared by the employed bees. To select promising solutions for further exploitation, we use a standard roulette-wheel selection based on linear normalization of AUC values. The selection probability for each solution is calculated as follows:
$$p_i = \frac{\mathrm{Fitness}(x_i)}{\sum_{m=1}^{N_{employed}} \mathrm{Fitness}(x_m) + \epsilon}.$$
3. Scout Bees are triggered when a solution becomes stagnant in a local optimum, i.e., when its trial counter exceeds a limit $L$. They abandon it and replace it with a randomly generated band combination, which is produced using the following equation:
$$x_{new} = x_{min} + \mathrm{rand}(0, 1) \cdot (x_{max} - x_{min}).$$
This mechanism enhances the algorithm's ability to escape from local optima.
The algorithm iterates until convergence or a maximum number of iterations is reached. Ultimately, the solution with the highest historical fitness is selected as the final, maximally discriminative set of key bands $x_{best}$.
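The three roles can be put together in a compact, standard-library sketch of the search loop. Everything here is illustrative: the colony size, the limit, the iteration count, and the toy fitness (which rewards overlap with a hidden set of "informative" bands and stands in for the AUC-based fitness) are assumptions, and the roulette wheel assumes at least one strictly positive fitness value.

```python
import random

def repair(candidate, num_bands, k, rng):
    """Round, clip to [1, num_bands], de-duplicate, refill to k bands, sort."""
    bands = {min(max(int(round(v)), 1), num_bands) for v in candidate}
    pool = [b for b in range(1, num_bands + 1) if b not in bands]
    bands.update(rng.sample(pool, k - len(bands)))
    return sorted(bands)

def abc_select(fitness, num_bands, k, colony=10, limit=5, iters=30, seed=0):
    rng = random.Random(seed)

    def random_source():
        # x_new = x_min + rand(0, 1) * (x_max - x_min), with x_min = 1, x_max = B
        raw = [1 + rng.random() * (num_bands - 1) for _ in range(k)]
        return repair(raw, num_bands, k, rng)

    sources = [random_source() for _ in range(colony)]
    scores = [fitness(s) for s in sources]
    trials = [0] * colony

    def explore(i):
        j = rng.randrange(k)                                  # random dimension
        n = rng.choice([s for s in range(colony) if s != i])  # random neighbor
        cand = list(sources[i])
        cand[j] += rng.uniform(-1.0, 1.0) * (sources[i][j] - sources[n][j])
        cand = repair(cand, num_bands, k, rng)
        score = fitness(cand)
        if score > scores[i]:                                 # greedy selection
            sources[i], scores[i], trials[i] = cand, score, 0
        else:
            trials[i] += 1

    best_score, best = max(zip(scores, sources))
    for _ in range(iters):
        for i in range(colony):                               # employed bees
            explore(i)
        total = sum(scores) + 1e-12
        probs = [s / total for s in scores]                   # roulette wheel
        for _ in range(colony):                               # onlooker bees
            explore(rng.choices(range(colony), weights=probs)[0])
        best_score, best = max((best_score, best), max(zip(scores, sources)))
        for i in range(colony):                               # scout bees
            if trials[i] > limit:
                sources[i] = random_source()
                scores[i], trials[i] = fitness(sources[i]), 0
    best_score, best = max((best_score, best), max(zip(scores, sources)))
    return list(best), best_score

informative = {5, 17, 42}   # hidden toy optimum, purely for demonstration

def toy_fitness(bands):
    return (len(informative & set(bands)) + 1) / (len(bands) + 1)

best_bands, best_fit = abc_select(toy_fitness, num_bands=60, k=3)
```

The best-so-far subset is recorded before the scout phase so that abandoning a stagnant source never discards the highest historical fitness.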

3.3. Negatively-Constrained Autoencoder Network

To accurately characterize the spectral mixing in underwater hyperspectral images, our method employs an autoencoder architecture based on an additive post-nonlinear mixing model. This architecture is constructed upon the selected band subset $x_{best}$ and explicitly decomposes the decoder into two distinct components, linear and nonlinear, which respectively capture different physical effects in the spectral mixing process.
The network is trained in a self-supervised manner, using the input hyperspectral image restricted to the band subset $x_{best}$ together with a pseudo-dataset. The pseudo-dataset is constructed by extracting spectral vectors of target pixels (where $M_T = 1$) from the selected band subset $x_{best}$, with their corresponding abundance values from $AM_T$ serving as soft supervision labels. The encoder is guided to learn abundance maps that incorporate the optimal bands by minimizing the reconstruction loss. The feature compression block in the encoder consists of five fully connected layers with Leaky ReLU activation functions that progressively reduce the spectral dimensionality. The specific structural details are given in Table 1, where $B$ is the number of bands and $m$ is the number of endmembers. Because the network incorporates endmember unmixing characteristics, an ASC-ANC layer is applied to enforce non-negativity and sum-to-one constraints on the abundances. Finally, the encoder outputs the abundance values, which are passed as input to the decoder.
The decoder consists of linear and nonlinear layers. The linear layer directly reflects the physical mixing of endmember spectra through linear weighted summation, while the nonlinear layer captures complex nonlinear interactions among spectra using a multilayer perceptron (MLP). Simultaneously, the decoder's weight matrix is initialized by an endmember extraction algorithm. This can be expressed as follows:
$$X_{linear} = EA + n,$$
$$X_{nonlinear} = f_{MLP}(X_{linear}),$$
$$\hat{X} = X_{linear} + X_{nonlinear}.$$
The final reconstructed output is the additive fusion of the linear and nonlinear layers, where $E \in \mathbb{R}^{B \times m}$ denotes the endmember matrix ($B$ representing the number of bands and $m$ the number of endmembers), $A \in \mathbb{R}^{m \times N}$ is the abundance matrix ($N$ being the number of pixels), and $f_{MLP}(\cdot)$ represents a nonlinear mapping composed of multiple fully connected layers and a sigmoid activation function.
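The forward pass of the abundance constraint and the two-branch decoder can be sketched with plain NumPy. Several details are assumptions made for the example: the ASC-ANC layer is implemented as a softmax over endmembers, the MLP has two layers with a sigmoid after each, the weights are random rather than initialized from extracted endmembers, and the noise term $n$ is omitted because it is part of the mixing model rather than something the decoder generates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def asc_anc(logits):
    """Softmax over endmembers: abundances are non-negative and sum to one."""
    shifted = logits - logits.max(axis=0, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

def decode(E, A, W1, b1, W2, b2):
    """Additive post-nonlinear decoder: X_hat = X_linear + f_MLP(X_linear)."""
    X_linear = E @ A                           # (B, N) linear mixing
    hidden = sigmoid(W1 @ X_linear + b1)       # (H, N)
    X_nonlinear = sigmoid(W2 @ hidden + b2)    # (B, N) nonlinear correction
    return X_linear, X_nonlinear, X_linear + X_nonlinear

rng = np.random.default_rng(0)
B, m, N, H = 12, 3, 5, 8                       # toy sizes; H is a hidden width

E = rng.uniform(0.0, 1.0, size=(B, m))         # endmember matrix
A = asc_anc(rng.normal(size=(m, N)))           # encoder output after ASC-ANC
W1, b1 = rng.normal(scale=0.1, size=(H, B)), np.zeros((H, 1))
W2, b2 = rng.normal(scale=0.1, size=(B, H)), np.zeros((B, 1))

X_linear, X_nonlinear, X_hat = decode(E, A, W1, b1, W2, b2)
```

Keeping the linear branch as a plain matrix product means the decoder weights `E` remain directly interpretable as endmember spectra.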
To address endmember semantic ambiguity in unsupervised learning, we propose a physical-constraint mechanism based on negative correlations in abundance ratios. This constraint is derived from the physical exclusivity between the target and water endmembers. In underwater environments, the signal received by the sensor at a given pixel is a composite of light reflected from the target and backscattered from the water. In pixels containing a target, the target-reflected signal dominates; in pure water pixels, the water-backscattered signal dominates. Consequently, target pixels should exhibit a high target-to-water abundance ratio. By penalizing deviations from this expected ratio, the optical prior is embedded into the loss function. This forces the network to push the target and water endmembers in opposite directions within the abundance feature representation, thereby maximizing their separability in the feature space. This approach not only enhances the semantic separability of the unmixing results but also reduces reliance on annotated data, providing a physically interpretable learning framework for unsupervised underwater target detection. Let $a_{target}$ and $a_{water}$ be the abundances of the target and water endmembers in a pixel, respectively. We use the target mask $M_T$ as an independent and fixed reference to distinguish between target and background. Note that the target mask $M_T$ is fixed and not updated during training. This prevents a circular dependency where the mask and the optimized ratio could influence each other iteratively. The specific expression is as follows:
$$\mathrm{ratio} = \frac{a_{target} + \epsilon}{a_{water} + \epsilon},$$
$$x_i = \begin{cases} x_{target}, & \text{if } M_T(i) = 1, \\ x_{water}, & \text{otherwise}. \end{cases}$$
To enforce this negative correlation throughout the learning process, we incorporate it into the loss function. Specifically, we define a negative correlation constraint loss $L_{neg}$ to enhance the discriminative power between the target and background. This is achieved by penalizing deviations of the abundance ratio from ideal values, which are 2 for target pixels and 0 for background pixels:
$$L_{neg} = \frac{\sum_{j \in T} (\text{ratio}_j - 2)^2}{|T|} + \frac{\sum_{k \in B} (\text{ratio}_k - 0)^2}{|B|},$$
where $T$ and $B$ are the sets of target and background pixels identified by $M_T$. Furthermore, a warm-up scheduling strategy is adopted to prevent the premature introduction of strong constraints from interfering with model convergence.
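The ratio and the negative correlation loss defined above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the value of `EPS`, and the use of flat per-pixel lists are assumptions made for the example, while the ideal values 2 and 0 are taken from the text.

```python
EPS = 1e-6  # assumed value; the text only calls it a small constant


def abundance_ratio(a_target, a_water, eps=EPS):
    """Target-to-water abundance ratio for a single pixel."""
    return (a_target + eps) / (a_water + eps)


def negative_correlation_loss(a_target, a_water, mask,
                              ideal_target=2.0, ideal_background=0.0):
    """Mean squared deviation of the ratio from its ideal value,
    averaged separately over target (mask == 1) and background pixels."""
    target_terms, background_terms = [], []
    for a_t, a_w, m in zip(a_target, a_water, mask):
        r = abundance_ratio(a_t, a_w)
        if m == 1:
            target_terms.append((r - ideal_target) ** 2)
        else:
            background_terms.append((r - ideal_background) ** 2)
    loss = 0.0
    if target_terms:  # guard against an empty target set
        loss += sum(target_terms) / len(target_terms)
    if background_terms:  # guard against an empty background set
        loss += sum(background_terms) / len(background_terms)
    return loss
```

In this sketch, a target pixel with abundances (2/3, 1/3) and a background pixel with abundances (0, 1) give a loss close to zero, whereas swapping the roles of the two endmembers produces a large penalty.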
The model uses a composite loss function that integrates reconstruction error, abundance regularization, the negative correlation constraint, and pseudo-label supervision:
$$L = L_{rec} + \mu L_{reg} + \lambda L_{neg} + L_{pseudo},$$
where $\mu$ and $\lambda$ are hyperparameters that balance the contribution of each loss term. The components are specified as follows:
$$L_{rec} = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - \hat{x}_i \right\|_2^2,$$
$$L_{reg} = \sum_{i=1}^{N} \log \frac{1}{\mathrm{Var}(a_i) + \epsilon},$$
$$L_{pseudo} = f_{\text{BCE}}\left(\hat{a}_i^{\text{target}}, a_i^{\text{target}}\right),$$
$$f_{\text{BCE}} = -\frac{1}{|T|} \sum_{i \in T} \left[ a_i^{\text{target}} \log\left(\hat{a}_i^{\text{target}}\right) + \left(1 - a_i^{\text{target}}\right) \log\left(1 - \hat{a}_i^{\text{target}}\right) \right],$$
where $x_i \in \mathbb{R}^B$ denotes the spectral vector of the $i$-th pixel; $\hat{x}_i$ is its reconstruction; $a_i \in \mathbb{R}^m$ denotes the abundance vector of all endmembers for the $i$-th pixel; and $m$ is the number of endmembers. $a_i^{\text{target}} \in [0, 1]$ is the ground-truth target abundance for the $i$-th pixel; $\hat{a}_i^{\text{target}}$ is the corresponding network prediction; $T$ denotes the set of target pixels; and $\epsilon$ is a small constant for numerical stability. In this formulation, the reconstruction error $L_{rec}$ employs the Mean Squared Error to quantify the discrepancy between the decoder's output and the original input spectrum. The abundance regularization $L_{reg}$ applies a logarithmic function to the variance of the endmember abundance matrix, imposing sparsity and diversity constraints to prevent the abundances of all pixels from converging to a uniform state. Pseudo-data are constructed using spectral vectors from pixels identified by $M_T$. The network's predicted target abundances for these pseudo-data are supervised against $AM_T$ values via BCE loss, formulated as $L_{pseudo}$. The NCSS-Net workflow is presented in Algorithm 1. The framework follows a linear pipeline in which each component executes sequentially. The iterative mechanisms are confined to internal operations.
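To make the composite loss concrete, the following sketch evaluates each term with the Python standard library. It is illustrative only and rests on several assumptions: the variance in the regularizer is taken across the endmember abundances of each pixel (one possible reading of the text), predictions are clipped before the logarithm for numerical safety, and all function names are hypothetical. The default weights follow the values of $\mu$ and $\lambda$ that the paper reports selecting.

```python
import math

EPS = 1e-6  # assumed value of the small stabilizing constant


def reconstruction_loss(x, x_hat):
    """Mean over pixels of the squared L2 distance between spectra."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(px, ph)) for px, ph in zip(x, x_hat)
    ) / len(x)


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def abundance_regularization(abundances, eps=EPS):
    """Sum over pixels of log(1 / (Var(a_i) + eps)).
    Assumption: Var is computed across the endmembers of each pixel."""
    return sum(math.log(1.0 / (variance(a) + eps)) for a in abundances)


def bce_loss(pred, ref, eps=EPS):
    """Binary cross-entropy averaged over the supplied pixels."""
    total = 0.0
    for p, r in zip(pred, ref):
        p = min(max(p, eps), 1.0 - eps)  # clipping is an added safeguard
        total += r * math.log(p) + (1.0 - r) * math.log(1.0 - p)
    return -total / len(pred)


def total_loss(l_rec, l_reg, l_neg, l_pseudo, mu=1e-4, lam=1e-2):
    """Weighted sum of the four loss terms."""
    return l_rec + mu * l_reg + lam * l_neg + l_pseudo
```

Under this reading, a uniform abundance vector such as [0.5, 0.5] is penalized more heavily by the regularizer than a peaked one such as [1.0, 0.0], which matches the stated goal of keeping abundances away from a uniform state.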
Algorithm 1 Proposed NCSS-Net framework
Input: Hyperspectral image $I \in \mathbb{R}^{M \times N \times B}$
Output: Final detection map
 1: Generate $M_W$ and $M_L$ via Equation (3)
 2: Identify target endmember and estimate $AM_T$ via linear unmixing
 3: Generate pseudo target mask $M_T$ via Otsu thresholding on $AM_T$
 4: Initialize ABC population with $M_T$ as pseudo-labels
 5: while max iterations not reached do
 6:    For Employed Bees, explore neighborhood via Equation (6)
 7:    For Onlooker Bees, select via Equation (7) and explore
 8:    For Scout Bees, replace stagnant solutions via Equation (8)
 9: end while
10: Select optimal band subset $x_{best}$ and corresponding data $X_{selected}$
11: Initialize autoencoder with additive post-nonlinear structure
12: Construct pseudo-dataset with soft thresholding using $M_T$ and $AM_T$
13: for epoch = 1, 2, …, n do
14:    Encode abundances $A$ and decode reconstruction $\hat{X}$
15:    Compute negative constraint loss via Equation (14) and total loss via Equation (15)
16:    Backpropagation
17: end for
18: Obtain final detection map
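Step 3 of Algorithm 1 derives the pseudo target mask by Otsu thresholding of the target abundance map. A minimal sketch of that step is given below. It assumes the abundance values lie in [0, 1] and uses a 256-bin histogram; both choices, along with the function names, are assumptions for illustration rather than details stated in the paper.

```python
def otsu_threshold(values, bins=256):
    """Return the threshold in [0, 1] that maximizes between-class variance."""
    hist = [0] * bins
    for v in values:
        idx = min(max(int(v * bins), 0), bins - 1)  # clamp into valid bins
        hist[idx] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_idx, best_var = 0, -1.0
    w0, sum0 = 0, 0.0
    for i in range(bins):
        w0 += hist[i]          # pixels at or below bin i
        sum0 += i * hist[i]
        w1 = total - w0        # pixels above bin i
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_idx = between, i
    return (best_idx + 1) / bins


def target_mask(abundances):
    """Binary pseudo-mask: 1 where the abundance reaches the Otsu threshold."""
    t = otsu_threshold(abundances)
    return [1 if v >= t else 0 for v in abundances]
```

For a flattened abundance map with 90 low values and 10 high values, `target_mask` marks only the 10 high-abundance pixels as target.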

4. Results

In this section, the proposed method is evaluated on River Scene, an open nearshore hyperspectral underwater target detection dataset.

4.1. Datasets

To validate the effectiveness of the proposed method, we conduct experiments on the publicly available River Scene dataset, which is specifically designed for HUTD. The dataset was originally introduced in [34] and has been widely adopted for evaluating hyperspectral UTD algorithms. The River Scene dataset was acquired using a Headwall Nano-Hyperspec imaging sensor mounted on a DJI Matrice 300 RTK unmanned aerial vehicle (UAV). The data were collected at the Qianlu Lake Reservoir in Liuyang, Hunan Province, China, on 31 July 2021. The sensor captures 270 spectral bands covering a wavelength range of 400–1000 nm, with high spatial resolution ranging from 5.55 cm to 9.25 cm per pixel depending on the flight altitude.
As shown in Figure 2, the River Scene datasets include three different real-world nearshore scenes: River Scene 1 measures 242 × 341 pixels with 9.25 cm spatial resolution, River Scene 2 contains 255 × 261 pixels at 5.55 cm resolution, and River Scene 3 comprises 137 × 178 pixels with 5.55 cm resolution.
The RGB images in Figure 2a–c demonstrate the complex nearshore environments with varying water conditions and background clutter. The underwater targets are black metal plates of size approximately 1 × 1 m, deployed at depths of 1–3 m. Due to the similar color between the targets and the riverbed, the targets are visually subtle in RGB imagery, posing significant challenges for detection algorithms. The target-prior spectra are collected on land under controlled conditions. All images were preprocessed with radiance calibration, reflectance conversion, and Savitzky–Golay smoothing to enhance spectral fidelity. Note that the target spectra are used only to support the manual annotation of the ground truth masks (as shown in Figure 2d–f) for quantitative evaluation of detection performance. They are neither employed as input to nor utilized in the NCSS-Net detection pipeline.

4.2. Experimental Settings

NCSS-Net was implemented in Python 3.6.9 using PyTorch 2.3.0 and trained on a computational server equipped with dual NVIDIA GeForce RTX 2080 Ti GPUs (11 GB VRAM each) and a 12-core CPU with 64 GB RAM. The operating environment was Ubuntu 18.04 LTS with CUDA 10.1 and cuDNN 7 for GPU acceleration.
The network is a negatively constrained autoencoder. The encoder comprises five fully connected layers with Leaky ReLU activations (negative slope 0.2), and the decoder adopts an additive post-nonlinear mixing model with separate linear and nonlinear pathways. The structure integrates batch normalization, Gaussian dropout with a variance of 1.0, and an ASC-ANC layer, which enforces the abundance sum-to-one and nonnegativity constraints on the abundance estimates. For band selection, the ABC algorithm was configured with a population size of 20, using the area-under-the-curve value computed from the target mask as the fitness function to identify an optimal subset of six spectral bands. The key hyperparameters of NCSS-Net are summarized in Table 2.
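The two activation-level ingredients mentioned above can be illustrated in isolation. The sketch below shows a Leaky ReLU with the stated negative slope of 0.2 and one common way to satisfy the sum-to-one and nonnegativity constraints, namely a softmax over the encoder outputs. The paper does not say how its ASC-ANC layer is realized, so the softmax is an assumption, as are the function names.

```python
import math


def leaky_relu(x, negative_slope=0.2):
    """Leaky ReLU with the negative slope reported in the text."""
    return x if x >= 0 else negative_slope * x


def asc_anc(scores):
    """Map raw scores to abundances that are nonnegative and sum to one.
    Assumption: a softmax is used; other normalizations are possible."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Any real-valued score vector passed through `asc_anc` yields a valid abundance vector, which is what allows ratios such as target-to-water abundance to be interpreted physically.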
Due to weak target signals, color similarity between targets and underwater backgrounds, and spectral distortion in nearshore environments, we employed the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) as the primary evaluation metric. To elaborate, the AUC is computed by comparing the detection scores from each method’s output against the manually annotated groundtruth. For each detection map, pixel-wise scores are normalized and a ROC curve is constructed by systematically varying the decision threshold from 0 to 1, calculating the True Positive Rate (TPR) and False Positive Rate (FPR) at each threshold. The AUC is computed using the entire hyperspectral image, as the network operates in an unsupervised manner without explicit train-test splitting. The AUC is then defined mathematically as the integral of the ROC curve:
$$\text{AUC} = \int_0^1 \text{TPR}(t)\, d\big(\text{FPR}(t)\big),$$
which provides a single scalar value summarizing the overall detection performance across all possible thresholds. This threshold-independent metric is particularly suitable for underwater target detection tasks, where targets are sparse and class imbalance is prevalent, thereby ensuring robust assessment of the discriminability between subtle target signals and complex backgrounds. Under consistent experimental conditions, comparative analysis was conducted with state-of-the-art methods, including conventional detectors (ACE, CEM) and the recent nearshore underwater target detection network NUN-Net.
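The AUC computation described above can be sketched as follows: scores are min-max normalized, the threshold is swept over every distinct score, and the ROC curve is integrated with the trapezoidal rule. This is a generic illustration rather than the evaluation code used in the experiments; the function name is hypothetical, and the sketch assumes that both classes are present in the labels.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for pixel scores and binary labels (1 = target)."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero for constant maps
    norm = [(s - lo) / span for s in scores]
    pairs = sorted(zip(norm, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    auc = 0.0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every pixel tied at this score before adding a ROC point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0  # trapezoid area
        prev_tpr, prev_fpr = tpr, fpr
    return auc
```

A detection map that ranks every target pixel above every background pixel gives an AUC of 1.0, and a constant map gives 0.5.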

4.3. Experiment Results and Analysis

In our study, NCSS-Net was compared with four traditional signature-based detection algorithms: SAM [47], ACE [48], CEM [49], and OSP [50]. We also compared it with two recently proposed, representative deep learning UTD algorithms: UTD-Net [11] and TUTDF [20]. Both of these deep learning methods are applicable only to scenes with a pure water background, so we first applied a water mask to the nearshore dataset before running TUTDF and UTD-Net in the comparative experiments. Additionally, we compared NCSS-Net with the latest HUTD method, NUN-UTD [13].
Table 3 summarizes the comparative detection performance between the proposed method and the competing methods on the River Scene datasets. NCSS-Net achieves the best performance across all datasets, as indicated by the bold AUC values. The detection maps of our method and all competitors are presented in Figure 3, Figure 4 and Figure 5, while the corresponding ROC curves are depicted in Figure 6.
The nearshore environment in the River Scene 1 dataset (Figure 3) is complex, which causes severe distortion of the target-prior spectra and poses a formidable challenge for detection algorithms. Among the traditional algorithms, OSP produces a weak response at the target location, but the background noise in its detection map is obvious. ACE and CEM are rendered essentially ineffective, with target signals almost completely obscured by the background. Among the deep learning approaches, UTD-Net fails to separate underwater targets from background interference. In contrast, NUN-UTD, which was specifically developed for nearshore underwater target detection, achieves an outstanding AUC. Nonetheless, NCSS-Net outperforms all competing methods, achieving an AUC of 0.9848. Its detection map exhibits the sharpest target contours and minimal background noise, underscoring its adaptability and robustness in complex nearshore environments.
In the River Scene 2 dataset, as shown in Figure 4, the scene is relatively simple owing to better lighting conditions and lower water turbidity. The traditional detectors ACE and CEM show significant performance improvements, with their detection maps demonstrating strong target responses. However, these methods generally struggle to distinguish nearshore land features from underwater targets, resulting in numerous false alarms in land areas. Deep learning-based methods demonstrate limited effectiveness in this scenario: UTD-Net and TUTDF exhibit constrained detection capabilities, though NUN-UTD maintains a high AUC score. NCSS-Net achieves the highest performance with an AUC of 0.9775. Its detection map (Figure 4j) accurately identifies targets and effectively suppresses false alarms in land areas, confirming that NCSS-Net maintains leading detection accuracy under relatively ideal conditions.
Compared to the River Scene 1 and River Scene 2 datasets, the River Scene 3 dataset in Figure 5 has a smaller scene size, but the complex underwater material causes severe false alarm interference. This is particularly evident in the performance of the SAM algorithm, which detects a large number of underwater materials as targets. Although OSP and UTD-Net can generate responses at target locations, their detection confidence is weak. TUTDF and NUN-UTD perform well, but NUN-UTD remains affected by underwater interference, resulting in some false alarms. In this context, NCSS-Net achieves a near-perfect AUC value of 0.9993. Its detection map (Figure 5j) not only detects all targets but also strongly suppresses the complex underwater background, almost completely eliminating false alarm responses. Its false alarm control is significantly better than that of all other compared algorithms.
Figure 6 shows the ROC curves of all compared methods on the three River Scene datasets. In all scenarios, NCSS-Net consistently achieves the ROC curve closest to the top-left corner, reflecting its ability to maintain a high true positive rate while effectively suppressing false positives. On the most challenging River Scene 1 dataset, the ROC curve of NCSS-Net significantly outperforms those of other methods. For the simpler River Scene 2 dataset, the performance of NCSS-Net and NUN-UTD is similar, but NCSS-Net performs better in the low false-positive-rate region. On the River Scene 3 dataset, the ROC curve of NCSS-Net nearly reaches the top-left corner of the plot, further confirming the method's strong generalization ability in complex underwater environments. In summary, NCSS-Net exhibits stable and excellent performance across all test scenes, and its detection maps perform best in terms of target sharpness and background suppression.
As shown in Table 4, while NCSS-Net’s runtime is higher than traditional methods and some deep learning approaches such as UTD-Net and NUN-UTD, it achieves the highest detection accuracy. The runtime of NCSS-Net is comparable to that of TUTDF, making them the most computationally intensive category among the compared methods. This increased time cost is primarily due to the iterative optimization of the self-supervised band selection and the negatively constrained autoencoder, which are essential for achieving high discriminative power in complex nearshore environments.
The leading accuracy of NCSS-Net is primarily due to the synergistic effect of its core components. The self-supervised band selection strategy leverages prior information derived solely from the aquatic zone, thereby excluding interference from terrestrial materials. By focusing on this purified data, the strategy identifies and retains the most discriminative spectral bands, reducing redundancy and enhancing the signal-to-clutter ratio at the outset. More critically, the negatively constrained autoencoder explicitly enforces a negative correlation between the target and water endmembers during the learning process. This physically driven constraint, embedded in the loss function, significantly amplifies the separability between the subtle target signals and the complex water background. The trade-off of this design is the higher time cost noted above. Nonetheless, NCSS-Net is a highly effective solution for applications where maximizing detection performance is the paramount objective.

4.4. Ablation Study

To evaluate the effectiveness of different components in NCSS-Net, we conducted a series of ablation studies on the three River Scene datasets. We first establish a baseline model that employs the same autoencoder backbone as NCSS-Net but uses all 270 spectral bands as input and incorporates none of the three core modules. The loss function of the baseline includes only the reconstruction loss and the abundance regularization term. It is trained with the Adam optimizer (learning rate $1 \times 10^{-4}$, weight decay $1 \times 10^{-4}$) and a batch size of 256 for 100 epochs. By systematically combining the three core modules of target-prior unmixing, band selection, and negative constraint autoencoder, we conducted an in-depth analysis of the contribution of each component to the detection performance. The quantitative results shown in Table 5 reveal the complex dependencies and synergistic effects among the modules.
  • Fundamental Role of Target-Prior Unmixing: The target-prior unmixing module generates a target-prior abundance map, providing crucial initial information for subsequent processing. However, the model with only the target-prior unmixing module performs identically to the baseline model. This occurs because the module merely produces an initial target mask via NDWI-driven spectral unmixing. The mask is not used directly for detection but serves as pseudo-labels to guide subsequent band selection and autoencoder training. Its role as an enabling component is revealed in combination with other modules. Therefore, adding it alone yields identical results to the baseline. This clearly indicates that the utility of this module is contingent on the presence of subsequent processing stages. While it cannot enhance performance in isolation, it provides the essential initial information the other modules need to function effectively.
  • Critical Dependence of Band Selection: The band selection module is designed to filter the most discriminative subset of bands and exhibits a critical dependence on the full network. When combined with the target-prior unmixing module but without the negative-constraint autoencoder, the results remain identical to those obtained with band selection alone. This is because the band selection process relies heavily on target masks to guide its search. Without the correcting mechanism provided by the negatively-constrained autoencoder, the band selection process may be misled by noise or errors in the pseudo-labels, especially in complex nearshore scenarios. The selected bands alone, without subsequent discriminative feature learning and physical constraint, cannot be effectively utilized, rendering the combination’s performance equivalent to that of band selection alone. When used without the negatively constrained autoencoder, it causes catastrophic failure on River Scene 3, suggesting that the selected bands alone are insufficient for discrimination. This failure persists even when band selection is combined with the autoencoder, but without the target-prior. This suggests the band selection process may discard crucial spectral target information and lead to model failure without strong discriminative network constraints.
  • Contributions of Negatively-Constrained Autoencoder: The negative correlation constraint autoencoder enhances the discriminative ability between targets and water endmembers through a physics-driven loss function. Its standalone performance is unstable, but it is the only module that, when added to other combinations, prevents the model from failing completely on River Scene 3. This shows that the autoencoder provides the necessary nonlinear representation and physical constraint to exploit the information from both the target-prior and the selected bands, transforming them into a highly discriminative feature set. This highlights its role as the final feature enhancer and constraint enforcer.
  • Synergistic Integration for Robust Performance: The ablation study results clearly show that the three components of NCSS-Net constitute an organic whole, and their synergistic effect is far greater than the independent contribution. This synergy stems from a cascaded dependency among the modules: the target-prior unmixing module does not directly improve detection performance; rather, it serves solely to provide pseudo-labels for subsequent processing. The band-selection module then filters discriminative spectral bands based on these pseudo-labels, yet its effectiveness heavily depends on both the quality of the pseudo-labels and the physical constraints imposed by the subsequent network. Finally, the negatively-constrained autoencoder leverages the selected bands and enforces a physics-driven constraint to learn discriminative features and produce the final detection map. In essence, the target-prior unmixing provides a physically-grounded starting point, the band selection reduces dimensionality and focuses on discriminative features, and the autoencoder delivers the final discriminative power. This multi-stage, physically-guided architecture achieves robust and accurate target detection in nearshore underwater environments of varying complexity.

4.5. Parameter Sensitivity

We conducted sensitivity analyses on three categories of hyperparameters in NCSS-Net: the loss function weighting coefficients μ and λ , the number of selected bands k, and the ideal target/background ratio values used in the negative correlation constraint.

4.5.1. Sensitivity to Loss Function Coefficients

The loss function parameters $\mu$ and $\lambda$ set the trade-off between abundance regularization and the negative correlation constraint. $\mu$ controls the sparsity and diversity of the abundance estimates through the variance-based regularization term, while $\lambda$ governs the strength of the physically-driven negative correlation constraint. To evaluate their impact, we tested 25 configurations on a 5 × 5 grid. The parameter $\mu$ ranges from $1 \times 10^{-5}$ to $45 \times 10^{-5}$ and is sampled at five values: $1 \times 10^{-5}$, $3 \times 10^{-5}$, $1 \times 10^{-4}$, $2 \times 10^{-4}$, and $4.5 \times 10^{-4}$. Similarly, the parameter $\lambda$ ranges from $1 \times 10^{-3}$ to $1 \times 10^{-1}$, with five approximately logarithmically spaced values: $1 \times 10^{-3}$, $3 \times 10^{-3}$, $1 \times 10^{-2}$, $3 \times 10^{-2}$, and $1 \times 10^{-1}$.
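The 25-point grid can be enumerated programmatically, as in the sketch below. The value lists are copied from the text, reading the exponents as negative. The `evaluate` callable is a hypothetical stand-in for training NCSS-Net with one configuration and returning its AUC, and it is not part of the original work.

```python
from itertools import product

# Grid values as listed in the text (exponents read as negative).
MU_VALUES = [1e-5, 3e-5, 1e-4, 2e-4, 4.5e-4]
LAMBDA_VALUES = [1e-3, 3e-3, 1e-2, 3e-2, 1e-1]


def build_grid(mu_values=MU_VALUES, lambda_values=LAMBDA_VALUES):
    """Cartesian product of the two coefficient lists."""
    return [
        {"mu": mu, "lambda": lam}
        for mu, lam in product(mu_values, lambda_values)
    ]


def best_config(grid, evaluate):
    """Return the configuration with the highest score.
    `evaluate` is a hypothetical callable mapping a configuration to an AUC."""
    return max(grid, key=evaluate)
```

A typical use would be `best_config(build_grid(), evaluate)`, where `evaluate` runs one full training and scoring pass per configuration.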
As visualized in Figure 7, NCSS-Net maintains robust performance across a wide range of $\mu$ and $\lambda$ values. AUC remains high and stable around $\mu = 1 \times 10^{-4}$ and $\lambda = 1 \times 10^{-2}$, indicating that the model is not overly sensitive to minor perturbations in these parameters. However, performance degradation is observed when either parameter deviates excessively from this optimal region. Excessively large $\mu$ values may over-constrain the abundance distributions, while overly strong $\lambda$ can distort the physically meaningful negative correlation. This analysis confirms the stability of NCSS-Net and justifies the selected parameter values. To maintain consistent parameter settings that also yield favorable detection performance across the employed datasets, we ultimately set $\mu$ to $1 \times 10^{-4}$ and $\lambda$ to $1 \times 10^{-2}$.

4.5.2. Sensitivity to Band Subset Size

Beyond internal model parameters, the size of the selected band subset k critically influences input data dimensionality and feature representation. To determine the optimal k, we evaluated the complete NCSS-Net pipeline with k ranging from 4 to 10 across all three River Scene datasets.
As shown in Figure 8, the detection performance, measured by AUC, consistently peaks at k = 6 across all scenes. When k < 6 , performance degrades with fewer bands due to insufficient spectral discriminative information. Conversely, when k > 6 , using more bands introduces redundant or noisy channels, which may dilute salient features and increase the risk of overfitting. Six bands provide sufficient spectral resolution to capture subtle differences between target and water spectra. This empirical analysis justifies our choice to fix k at 6.

4.5.3. Sensitivity to Ideal Ratio Values

The choice of ideal ratio values for the negative correlation constraint, $\text{ratio}_{\text{target}} = 2$ for target pixels and $\text{ratio}_{\text{background}} = 0$ for background pixels, is empirically justified through a sensitivity analysis across the three River Scene datasets. We systematically varied the ideal target ratio from 0.5 to 3.0 and the ideal background ratio from 0 to 0.3, and evaluated the resulting AUC performance.
As shown in Figure 9, the highest AUC values consistently cluster around $\text{ratio}_{\text{target}} = 2$ and $\text{ratio}_{\text{background}} = 0$. In River Scene 1, the peak AUC is achieved with $(2, 0)$. In River Scene 2, the best performance also corresponds to $(2, 0)$. In River Scene 3, the perfect AUC is attained with the same setting. These results indicate that a target-to-water abundance ratio of approximately 2 optimally captures the spectral dominance of submerged targets relative to the surrounding water, while a ratio of 0 effectively represents the negligible target presence in water-dominated background regions. Therefore, the selection of $(2, 0)$ is physically interpretable, contributing to the stability and discriminative power of NCSS-Net.
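As a rough worked example of what these ideal ratios imply, consider the simplifying assumption (made here for illustration only) that a pixel's abundance is shared entirely between the target and water endmembers, so that the two abundances sum to one, and that $\epsilon$ is negligible. Under that assumption, the ideal ratios correspond to the following abundances:

```latex
\frac{a_{\text{target}}}{a_{\text{water}}} = 2, \quad a_{\text{target}} + a_{\text{water}} = 1
\;\Longrightarrow\; a_{\text{target}} = \tfrac{2}{3}, \quad a_{\text{water}} = \tfrac{1}{3};
\qquad
\frac{a_{\text{target}}}{a_{\text{water}}} = 0
\;\Longrightarrow\; a_{\text{target}} = 0, \quad a_{\text{water}} = 1.
```

In this simplified two-endmember picture, the setting $(2, 0)$ asks target pixels to be about two-thirds target and background pixels to contain no target at all. With more than two active endmembers the individual abundances would be smaller, but the ratio interpretation is unchanged.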

5. Discussion

5.1. Interpretation of Superior Detection Performance

The experimental results demonstrate that NCSS-Net achieves optimal detection performance on all three nearshore River Scene datasets. This superiority stems fundamentally from the synergistic integration of its three core components, which operate not in isolation but as an interdependent system.
Ablation studies reveal a cascaded dependency among the modules. Specifically, in Table 5, the experimental results for Models 4 and 1 are identical, indicating that unmixing alone cannot improve detection performance. The target-prior unmixing module generates an initial target mask through NDWI-driven spectral unmixing. While it essentially serves as pseudo-labels to guide subsequent stages, this mask alone cannot enhance detection. The band selection module relies on these pseudo-labels to identify discriminative spectral subsets. Model 5 shows poorer results than Models 4 and 2. It demonstrates that without the correction mechanism of the autoencoder, the band selection module is highly susceptible to noise and inaccuracies in the pseudo-labels. The negatively-constrained autoencoder alone attempts to enforce physical constraints but lacks both the spectral focus provided by band selection and the initialization from target-prior unmixing, resulting in unstable feature learning.
Only when all three components are integrated does the framework achieve its full potential. The target-prior unmixing provides pseudo-labels to guide subsequent stages. Band selection leverages these labels to identify the most discriminative spectral bands. The autoencoder, guided by a physics-driven negative-correlation constraint, transforms these refined inputs into a highly separable representation. This negative correlation significantly amplifies the separability between subtle targets and complex water backgrounds.
Together, the three modules constitute a multi-stage, physically guided architecture that delivers robust and accurate detection across varying nearshore river environments, as consistently validated on all three River Scene datasets.

5.2. Parameter Sensitivity and Model Stability

Our parameter sensitivity analyses demonstrate the robustness of NCSS-Net across three critical hyperparameter dimensions: the loss function coefficients $\mu$ and $\lambda$, the number of selected bands $k$, and the ideal target/background ratio values used in the negative correlation constraint.
First, the loss parameter sensitivity analysis shows that NCSS-Net maintains robust performance across a wide range of the hyperparameters $\mu$ (around $1 \times 10^{-4}$) and $\lambda$ (around $1 \times 10^{-2}$), confirming the model's stability. $\mu$ controls the sparsity and diversity of abundance estimates through variance-based regularization, while $\lambda$ governs the strength of the physically-driven negative correlation constraint. The analysis indicates that performance degradation occurs only when either parameter deviates excessively from the optimal region: excessively large $\mu$ may over-constrain abundance distributions, and overly strong $\lambda$ may distort the physically meaningful negative correlation.
Similarly, band-number sensitivity analysis (Figure 8) reveals that detection performance consistently peaks when k = 6 across all three nearshore scenes. Performance degrades with fewer bands ( k < 6 ) due to insufficient spectral discriminative information to separate subtle target signals from complex water backgrounds. Conversely, using more bands ( k > 6 ) introduces redundant or noisy spectral channels that may dilute salient features and increase the risk of overfitting without improving discriminability. This empirical finding indicates that six bands represent the optimal trade-off between spectral richness and minimal redundancy for the specific challenge of nearshore underwater target detection.
Furthermore, the sensitivity analysis of ideal ratio values (Figure 9) demonstrates that $\text{ratio}_{\text{target}} = 2$ and $\text{ratio}_{\text{background}} = 0$ yield optimal detection performance. These values reflect the physical relationship between the target and water endmembers, with target regions exhibiting significantly higher target-to-water abundance ratios than water-dominated backgrounds.
These sensitivity analyses provide clear guidance for parameter tuning in practical applications, and they collectively justify our selected hyperparameter values ($k = 6$, $\mu = 1 \times 10^{-4}$, $\lambda = 1 \times 10^{-2}$, $\text{ratio}_{\text{target}} = 2$, $\text{ratio}_{\text{background}} = 0$) for consistent performance across diverse nearshore datasets.

5.3. Fault Tolerance Mechanisms

A key challenge in self-supervised underwater target detection is that incorrect pseudo-labels, especially under extremely turbid water or high spectral similarity between target and background, can mislead subsequent band selection and network training. To mitigate this risk, NCSS-Net incorporates the following fault-tolerance mechanisms:
1. A negative-correlation term is added to the loss function, enforcing a physically meaningful anti-correlation between the target and water endmembers. Target regions are expected to exhibit a high target-to-water abundance ratio, while background water regions should show a low ratio. During training, deviations from this ideal ratio are penalized. Even when the initial pseudo-labels are noisy, this constraint steers the model toward more discriminative feature representations, thereby reducing over-reliance on the initial pseudo-labels.
2. The weight of the negative-correlation constraint is gradually increased via a warm-up strategy. This prevents the strong physical constraint from overwhelming the learning process in the early stages when pseudo-labels may be unreliable. The model is thus allowed to first capture the underlying spectral structures more freely, and the constraint is progressively strengthened as training stabilizes.
3. Gradient clipping is applied to prevent gradient explosion caused by noisy labels or outlier samples, ensuring numerical stability and reliable convergence throughout the training process.
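The second and third mechanisms can be sketched as two small helpers. The paper does not specify the shape or length of the warm-up schedule or the clipping threshold, so the linear ramp, the clipping by global L2 norm, and all names and parameters below are assumptions chosen for illustration.

```python
import math


def warmup_weight(epoch, warmup_epochs, final_weight):
    """Linearly ramp the constraint weight from 0 to final_weight,
    then hold it constant. The linear shape is an assumption."""
    if warmup_epochs <= 0:
        return final_weight
    return final_weight * min(1.0, epoch / warmup_epochs)


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so its L2 norm
    does not exceed max_norm; smaller gradients pass through unchanged."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```

In a training loop, the weight returned by `warmup_weight` would multiply the negative correlation loss at each epoch, and the clipping helper would be applied to the gradients before the optimizer step.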

5.4. Limitations and Future Research Directions

The experimental validation of NCSS-Net is conducted on the River Scene datasets. While these datasets represent a valuable and publicly available benchmark for nearshore HUTD, containing three scenes with variations in turbidity and target depth, they are geographically and hydrologically similar. Due to the scarcity of large-scale annotated HUTD data, this focused evaluation scope is common in the field, inherently limiting the strength of claims regarding the generalization to vastly different underwater scenarios. The self-supervised and prior-free design in NCSS-Net is a strategic response to this data-scarcity challenge, aiming to enhance its potential adaptability to unlabeled or novel environments. Nevertheless, rigorous validation on more diverse datasets is required to fully establish its broad applicability.
Beyond this fundamental data-related challenge, the current framework has several other technical limitations. First, although fault tolerance mechanisms are incorporated, their effectiveness largely depends on the quality of the initial target-prior mask, which is generated through NDWI and linear unmixing. Under conditions of extremely turbid water or highly similar spectra between targets and the background, significant errors may occur during the initial mask generation, subsequently affecting band selection and network training. Second, the current method primarily handles static underwater targets; its detection capability for moving or partially occluded targets has not yet been validated. Third, the endmember extraction step relies on the N-FINDR algorithm, which may be susceptible to noise and outliers in complex nearshore scenes. This could compromise the physical representativeness of the extracted spectral signatures. Future work should therefore pursue several directions: adopting more robust endmember extraction techniques to improve the physical consistency and reliability of the unmixing foundation, integrating multimodal sensors for robustness in extreme conditions, extending the model to handle dynamic targets and environments, incorporating more sophisticated water optical models for broader generalization, and validating performance on larger, more diverse public datasets.

6. Conclusions

In this paper, we propose a novel unsupervised framework, NCSS-Net, for HUTD. The framework integrates self-supervised band selection with a physically constrained autoencoder to address spectral redundancy, weak target signals, and complex background interference in nearshore environments. Specifically, NCSS-Net introduces three key components: target-prior unmixing for generating an initial target abundance map, an artificial bee colony-based band selection mechanism that identifies the most discriminative spectral subset, and a negatively-constrained autoencoder that enhances target-background discriminability through a physics-driven loss function. Experiments on the three River Scene datasets show that NCSS-Net achieves the highest AUC among the compared methods (0.9848, 0.9775, and 0.9993; Table 3).

Author Contributions

Conceptualization, S.Z.; methodology, M.L.; software, M.L.; validation, M.L.; formal analysis, M.L.; investigation, M.L.; resources, M.L.; data curation, M.L.; writing—original draft preparation, M.L.; writing—review and editing, S.Z.; visualization, M.L.; supervision, S.Z.; project administration, S.Z.; funding acquisition, S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science Foundation (NSF) of China, grant number 62571246.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are derived from the River Scene dataset, available at https://drive.google.com/file/d/1eDJZW20TebuEE9Sa4yFB7Sze-N_Chxh3/view?usp=sharing (accessed on 25 December 2025).

Acknowledgments

We thank the authors of the NUN-Net project for making their code open source, which made reproduction easier.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

  1. Cheng, G.; Yang, C.; Yao, X.; Guo, L.; Han, J. When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative CNNs. IEEE Trans. Geosci. Remote Sens. 2018, 56, 1421–1431.
  2. Xu, X.; Li, J.; Li, S.; Plaza, A. Curvelet transform domain-based sparse nonnegative matrix factorization for hyperspectral unmixing. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 2020, 13, 4908–4924.
  3. Mei, S.; Zhang, G.; Li, J.; Zhang, Y.; Du, Q. Improving spectral-based endmember finding by exploring spatial context for hyperspectral unmixing. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 2020, 13, 3336–3349.
  4. Chen, L.; Liu, J.; Sun, S.; Chen, W.; Du, B.; Liu, R. An iterative GLRT for hyperspectral target detection based on spectral similarity and spatial connectivity characteristics. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5505811.
  5. Chang, C.-I. Hyperspectral target detection: Hypothesis testing, signal-to-noise ratio, and spectral angle theories. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5505223.
  6. Gillis, D.B. An underwater target detection framework for hyperspectral imagery. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 2020, 13, 1798–1810.
  7. Clevers, J.G.P.W.; Kooistra, L. Using hyperspectral remote sensing data for retrieving canopy chlorophyll and nitrogen content. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 2012, 5, 574–583.
  8. Jay, S.; Guillaume, M. Underwater target detection with hyperspectral remote-sensing imagery. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Honolulu, HI, USA, 25–30 July 2010; pp. 2820–2823.
  9. Jay, S.; Guillaume, M.; Blanc-Talon, J. Underwater target detection with hyperspectral data: Solutions for both known and unknown water quality. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 2012, 5, 1213–1221.
  10. Qi, J.; Gong, Z.; Xue, W.; Liu, X.; Yao, A.; Zhong, P. A self-improving framework for joint depth estimation and underwater target detection from hyperspectral imagery. Remote Sens. 2021, 13, 1721.
  11. Qi, J.; Gong, Z.; Xue, W.; Liu, X.; Yao, A.; Zhong, P. An unmixing based network for underwater target detection from hyperspectral imagery. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 2021, 14, 5470–5487.
  12. Li, M.; Yang, B.; Wang, B. EMLM-net: An extended multilinear mixing model-inspired dual-stream network for unsupervised nonlinear hyperspectral unmixing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5509116.
  13. Liu, J.; Qi, J.; Zhu, D.; Wen, H.; Jiang, H.; Zhong, P. Detecting nearshore underwater targets with hyperspectral nonlinear unmixing autoencoder. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5529615.
  14. Liu, Z.; Zhao, H.; Wang, X.; Wang, S.; Li, J.; Zhong, Y. PU-KBS: A robust positive and unlabeled learning framework with key band selection for one-class hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5517915.
  15. Zhang, S.; Duan, P.; Kang, X.; Mo, Y.; Li, S. Feature-band-based unsupervised hyperspectral underwater target detection near the coastline. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5510410.
  16. Jay, S.; Guillaume, M. A novel maximum likelihood based method for mapping depth and water quality from hyperspectral remote-sensing data. Remote Sens. Environ. 2014, 147, 121–132.
  17. Xia, Z.; Gu, Y. Parameter feature extraction for hyperspectral detection of the shallow underwater target. Sci. China Technol. Sci. 2021, 64, 1092–1100.
  18. Fu, X.; Shang, X.; Sun, X.; Yu, H.; Song, M.; Chang, C.-I. Underwater hyperspectral target detection with band selection. Remote Sens. 2020, 12, 1056.
  19. Qi, J.; Gong, Z.; Yao, A.; Liu, X.; Li, Y.; Zhang, Y.; Zhong, P. Bathymetric-based band selection method for hyperspectral underwater target detection. Remote Sens. 2021, 13, 3798.
  20. Li, Z.; Chen, Y.; Wang, H.; Zhang, L.; Liu, J. A transfer-based framework for underwater target detection from hyperspectral imagery. Remote Sens. 2023, 15, 1023.
  21. Li, Q.; Li, J.; Li, T.; Li, Z.; Zhang, P. Spectral–spatial depth-based framework for hyperspectral underwater target detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4204615.
  22. Liu, B.; Men, S.; Yu, Q.; Li, D.; Ding, Z.; Liu, Z. Internal scanning hyperspectral imaging system for deep sea target detection. Opt. Laser Eng. 2025, 185, 108722.
  23. Li, Q.; Li, J.; Li, T.; Feng, Y. A joint framework for underwater hyperspectral image restoration and target detection with conditional diffusion model. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 2024, 17, 17263–17277.
  24. Li, Q.; Gao, M.; Zhang, M.; Wang, J.; Chen, J.; Li, J. IOPE-IPD: Water Properties Estimation Network Integrating Physical Model and Deep Learning for Hyperspectral Imagery. Remote Sens. 2025, 17, 3546.
  25. Wang, M.; Zhao, M.; Chen, J.; Rahardja, S. Nonlinear unmixing of hyperspectral data via deep autoencoder networks. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1467–1471.
  26. Zhao, M.; Wang, M.; Chen, J.; Rahardja, S. Hyperspectral unmixing for additive nonlinear models with a 3-D-CNN autoencoder network. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5509415.
  27. Shen, D.; Ma, X.; Kong, W.; Liu, J.; Wang, J.; Wang, H. Hyperspectral target detection based on interpretable representation network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5519416.
  28. Liu, R.; Lei, C.; Xie, L.; Qin, X. A novel endmember bundle extraction framework for capturing endmember variability by dynamic optimization. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5505217.
  29. Guo, T.; He, L.; Luo, F.; Gong, X.; Zhang, L.; Gao, X. Learnable background endmember with subspace representation for hyperspectral anomaly detection. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5501513.
  30. Nascimento, J.M.P.; Bioucas-Dias, J.M. Vertex component analysis: A fast algorithm to unmix hyperspectral data. IEEE Trans. Geosci. Remote Sens. 2005, 43, 898–910.
  31. Winter, M.E. N-FINDR: An algorithm for fast autonomous spectral end-member determination in hyperspectral data. Imaging Spectrom. V 1999, 3753, 266–275.
  32. Huang, H.; Sun, Z.; Liu, S.; Di, Y.; Xu, J.; Liu, C.; Wu, J. Underwater hyperspectral imaging for in situ underwater microplastic detection. Sci. Total Environ. 2021, 776, 145960.
  33. Qi, J.; Zhou, C.; Liu, X.; Li, Y.; Zhang, M.; Zhong, P. Nearshore Underwater Target Detection Meets UAV-borne Hyperspectral Remote Sensing: A Novel Hybrid-level Contrastive Learning Framework and Benchmark Dataset. arXiv 2025, arXiv:2502.14495.
  34. Hensman, P.; Masko, D. The Impact of Imbalanced Training Data for Convolutional Neural Networks. Ph.D. Thesis, KTH Royal Institute of Technology, Stockholm, Sweden, 2015.
  35. Su, Y.; Xu, X.; Li, J.; Qi, H.; Gamba, P.; Plaza, A. Deep autoencoders with multitask learning for bilinear hyperspectral unmixing. IEEE Trans. Geosci. Remote Sens. 2021, 59, 8615–8629.
  36. Li, M.; Yang, B.; Wang, B. A coarse-to-fine scheme for unsupervised nonlinear hyperspectral unmixing based on an extended multilinear mixing model. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5521415.
  37. Fang, T.; Zhu, F.; Chen, J. Hyperspectral unmixing based on multilinear mixing model using convolutional autoencoders. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5507316.
  38. Zhu, D.; Du, B.; Hu, M.; Dong, Y.; Zhang, L. Collaborative-guided spectral abundance learning with bilinear mixing model for hyperspectral subpixel target detection. Neural Netw. 2023, 163, 205–218.
  39. Zhu, D.; Du, B.; Zhang, L. Learning single spectral abundance for hyperspectral subpixel target detection. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 10134–10144.
  40. Chen, L.; Huang, Y.; Dong, J.; Xu, Q.; Kwong, S.; Lu, H.; Li, C. Underwater Object Detection in the Era of Artificial Intelligence: Current, Challenge, and Future. arXiv 2024, arXiv:2410.05577.
  41. Khan, A.; Fouda, M.M.; Do, D.-T.; Almaleh, A.; Aloahtan, A.M.; Rahman, A.U. Underwater Target Detection Using Deep Learning: Methodologies, Challenges, Applications, and Future Evolution. IEEE Access 2024, 12, 12618–12634.
  42. Li, H.; Li, L.; Wang, H.; Zhang, W.; Ren, P. Underwater image captioning with AquaSketch-enhanced cross-scale information fusion. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4208718.
  43. Li, L.; Li, H.; Ren, P. Underwater image caption via attention mechanism based fusion of visual and textual information. Inf. Fusion 2024, 123, 103269.
  44. McFeeters, S.K. The use of the normalized difference water index (NDWI) in the delineation of open water features. Int. J. Remote Sens. 1996, 17, 1425–1432.
  45. Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 1979, SMC-9, 62–66.
  46. Karaboga, D.; Akay, B. A comparative study of artificial bee colony algorithm. Appl. Math. Comput. 2009, 214, 108–132.
  47. Kruse, F.A.; Lefkoff, A.B.; Boardman, J.W.; Heidebrecht, K.B.; Shapiro, A.T.; Barloon, P.J.; Goetz, A.F.H. The spectral image processing system (SIPS)-interactive visualization and analysis of imaging spectrometer data. Remote Sens. Environ. 1993, 44, 145–163.
  48. Manolakis, D.; Shaw, G. Detection algorithms for hyperspectral imaging applications. IEEE Signal Process. Mag. 2002, 19, 29–43.
  49. Harsanyi, J.C. Detection and Classification of Subpixel Spectral Signatures in Hyperspectral Image Sequences; University of Maryland, Baltimore County: Baltimore, MD, USA, 1993.
  50. Harsanyi, J.C.; Chang, C.-I. Hyperspectral image classification and dimensionality reduction: An orthogonal subspace projection approach. IEEE Trans. Geosci. Remote Sens. 1994, 32, 779–785.
Figure 1. The NCSS-Net framework, comprising: (1) Target-Prior Unmixing for pseudo-label generation, (2) ABC-Based Band Selection for feature reduction, and (3) Negatively-Constrained Autoencoder for the final detection map production.
Figure 2. Visualization of the three River Scene datasets: RGB images (a–c) and their corresponding ground truth masks (d–f).
Figure 3. Performance of the proposed and compared algorithms on River Scene 1: (a) RGB image; (b) ground truth; (c) ACE; (d) CEM; (e) SAM; (f) OSP; (g) UTD-NET; (h) TUTDF; (i) NUN-UTD; (j) proposed method.
Figure 4. Performance of the proposed and compared algorithms on River Scene 2: (a) RGB image; (b) ground truth; (c) ACE; (d) CEM; (e) SAM; (f) OSP; (g) UTD-NET; (h) TUTDF; (i) NUN-UTD; (j) proposed method.
Figure 5. Performance of the proposed and compared algorithms on River Scene 3: (a) RGB image; (b) ground truth; (c) ACE; (d) CEM; (e) SAM; (f) OSP; (g) UTD-NET; (h) TUTDF; (i) NUN-UTD; (j) proposed method.
Figure 6. ROC curves comparison of different algorithms on River Scenes: (a) River Scene 1; (b) River Scene 2; (c) River Scene 3.
Figure 7. AUC values for different weighting coefficients in the loss function: (a) River Scene 1; (b) River Scene 2; (c) River Scene 3.
Figure 8. AUC values for different numbers of selected bands k: (a) River Scene 1, (b) River Scene 2, and (c) River Scene 3.
Figure 9. AUC values for different ideal ratio values: (a) River Scene 1, (b) River Scene 2, and (c) River Scene 3.
Table 1. Structure of the feature compression block.
| Layer Type | Input | Output | Activation Function |
|---|---|---|---|
| Fully-connected Layer | (batch size, B) | (batch size, 18 × m) | LeakyReLU |
| Fully-connected Layer | (batch size, 18 × m) | (batch size, 9 × m) | LeakyReLU |
| Fully-connected Layer | (batch size, 9 × m) | (batch size, 6 × m) | LeakyReLU |
| Fully-connected Layer | (batch size, 6 × m) | (batch size, 3 × m) | LeakyReLU |
| Fully-connected Layer | (batch size, 3 × m) | (batch size, 18 × m) | LeakyReLU |
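The layer widths in Table 1 can be checked with a small forward-pass sketch. This is only a shape illustration in NumPy under stated assumptions, not the authors' implementation: B is taken to be the number of input bands, m = 4 is an arbitrary placeholder, the LeakyReLU slope of 0.01 and the random weight initialization are assumed, and `build_block` and `forward` are hypothetical helper names.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """LeakyReLU; the slope value is an assumption, not taken from the paper."""
    return np.where(x > 0, x, slope * x)

def build_block(num_bands, m, rng):
    """Create (weight, bias) pairs with the layer widths listed in Table 1."""
    widths = [num_bands, 18 * m, 9 * m, 6 * m, 3 * m, 18 * m]
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(widths[:-1], widths[1:])
    ]

def forward(x, layers):
    """Apply each fully-connected layer followed by LeakyReLU."""
    for weight, bias in layers:
        x = leaky_relu(x @ weight + bias)
    return x

rng = np.random.default_rng(0)
# B = 6 and batch size 256 follow Table 2; m = 4 is a placeholder value.
layers = build_block(num_bands=6, m=4, rng=rng)
batch = rng.normal(size=(256, 6))
out = forward(batch, layers)
print(out.shape)  # (256, 72), i.e. (batch size, 18 * m)
```

With these placeholder values the intermediate widths are 72, 36, 24, and 12 before the final layer expands back to 72.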
Table 2. Key hyperparameter settings of NCSS-Net.
| Dataset | Learning Rate | Weight Decay | Batch Size | Epochs | Band Subset Size | μ | λ |
|---|---|---|---|---|---|---|---|
| River Scene 1 | 1 × 10⁻⁴ | 1 × 10⁻⁴ | 256 | 100 | 6 | 1 × 10 4 | 1 × 10 2 |
| River Scene 2 | 1 × 10⁻⁴ | 1 × 10⁻⁴ | 256 | 100 | 6 | 1 × 10 4 | 1 × 10 2 |
| River Scene 3 | 1 × 10⁻⁴ | 1 × 10⁻⁴ | 256 | 100 | 6 | 1 × 10 4 | 1 × 10 2 |
Table 3. Comparison of AUC values on the River Scene datasets.
| Dataset | ACE [48] | CEM [49] | OSP [50] | SAM [47] | UTD-Net [11] | TUTDF [20] | NUN-UTD [13] | NCSS-Net |
|---|---|---|---|---|---|---|---|---|
| River Scene 1 | 0.4739 | 0.5832 | 0.7936 | 0.5469 | 0.4331 | 0.7492 | 0.9710 | 0.9848 |
| River Scene 2 | 0.8784 | 0.8978 | 0.8661 | 0.3109 | 0.5300 | 0.7934 | 0.9661 | 0.9775 |
| River Scene 3 | 0.4247 | 0.5754 | 0.7800 | 0.5470 | 0.7528 | 0.8126 | 0.9883 | 0.9993 |
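For readers who want to reproduce an AUC value from a detection map and a ground truth mask, the sketch below computes the area under the ROC curve from its pairwise-ranking definition. It is an illustrative NumPy implementation, not the evaluation script used in the paper; the function name `roc_auc` and the toy scores are hypothetical.

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC as the probability that a target pixel outscores a background pixel.

    Ties between a target and a background score count as one half.
    """
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels).ravel().astype(bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

# Toy detection scores for four pixels; 1 marks a target pixel.
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels))  # 0.75: three of the four target/background pairs are ranked correctly
```

The pairwise comparison uses memory proportional to the number of target pixels times the number of background pixels, so a rank-based formulation would be preferable for full-size scenes.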
Table 4. Runtime comparison for River Scene datasets.
| Dataset | ACE [48] | CEM [49] | OSP [50] | SAM [47] | UTD-Net [11] | TUTDF [20] | NUN-UTD [13] | NCSS-Net |
|---|---|---|---|---|---|---|---|---|
| River Scene 1 | 6.07 | 4.04 | 3.48 | 2.16 | 556.43 | 2999.57 | 1481.88 | 2857.11 |
| River Scene 2 | 5.16 | 2.89 | 3.58 | 2.10 | 430.06 | 1869.24 | 1142.48 | 2252.63 |
| River Scene 3 | 2.95 | 1.90 | 2.78 | 1.94 | 232.40 | 1108.80 | 489.25 | 1036.53 |
Table 5. AUC performance comparison of several variants of the proposed model on different datasets.
| Model | Target-Prior Unmixing / Band Selection / Negatively-Constrained Autoencoder | River Scene 1 | River Scene 2 | River Scene 3 |
|---|---|---|---|---|
| 1 | × × × | 0.8592 | 0.8753 | 0.5320 |
| 2 | × × | 0.7189 | 0.6871 | 0.0079 |
| 3 | × × | 0.5578 | 0.8633 | 0.4321 |
| 4 | × × | 0.8592 | 0.8753 | 0.5320 |
| 5 | × | 0.7189 | 0.6871 | 0.0079 |
| 6 | × | 0.6493 | 0.5228 | 0.5498 |
| 7 | × | 0.7133 | 0.8640 | 0.0128 |
| 8 | | 0.9848 | 0.9775 | 0.9993 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, M.; Zhong, S. NCSS-Net: A Negatively Constrained Network with Self-Supervised Band Selection for Hyperspectral Image Underwater Target Detection. Remote Sens. 2026, 18, 418. https://doi.org/10.3390/rs18030418

AMA Style

Liu M, Zhong S. NCSS-Net: A Negatively Constrained Network with Self-Supervised Band Selection for Hyperspectral Image Underwater Target Detection. Remote Sensing. 2026; 18(3):418. https://doi.org/10.3390/rs18030418

Chicago/Turabian Style

Liu, Mengxin, and Shengwei Zhong. 2026. "NCSS-Net: A Negatively Constrained Network with Self-Supervised Band Selection for Hyperspectral Image Underwater Target Detection" Remote Sensing 18, no. 3: 418. https://doi.org/10.3390/rs18030418

APA Style

Liu, M., & Zhong, S. (2026). NCSS-Net: A Negatively Constrained Network with Self-Supervised Band Selection for Hyperspectral Image Underwater Target Detection. Remote Sensing, 18(3), 418. https://doi.org/10.3390/rs18030418
