SIG-Net: A Spectral-Index-Guided Network for Red Tide Extraction from Sentinel-2 Multispectral Imagery

Zhou, Lei; Li, Hongping; Chen, Xiaojun; Li, Zhanqiang

doi:10.3390/rs18121928

College of Marine Technology, Ocean University of China, Qingdao 266100, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(12), 1928; https://doi.org/10.3390/rs18121928

Submission received: 16 April 2026 / Revised: 3 June 2026 / Accepted: 8 June 2026 / Published: 11 June 2026

(This article belongs to the Special Issue Remote Sensing Technologies and Methods for Ocean Monitoring and Surveillance)

Highlights

What are the main findings?

SIG-Net integrates Sentinel-2 bands and spectral indices through a dual-branch architecture.
SIGF uses cross-attention and adaptive gating for spectral-index-guided feature fusion.

What are the implications of the main findings?

NDNI provides useful red-edge and blue-band spectral contrast for Noctiluca-related red tide extraction.
SIG-Net improves red tide extraction over U-Net, DeepLabV3+, and SegFormer.

Abstract

Red tide events pose substantial threats to marine ecosystems, aquaculture, and coastal public health. Timely and accurate delineation of red tide extent from satellite imagery is therefore essential for operational monitoring and early warning. However, existing deep learning-based semantic segmentation methods generally treat multispectral bands as homogeneous inputs and do not fully exploit the domain knowledge embodied in spectral indices commonly used in traditional remote sensing analysis. To address this limitation, this study proposes a spectral-index-guided network (SIG-Net) that explicitly incorporates spectral-index priors into deep feature extraction through a dual-branch architecture. SIG-Net comprises three components: a spectral encoder based on a Mix Vision Transformer (MiT-B2) that learns spatial-spectral representations from the original Sentinel-2 bands; a lightweight CNN-based index encoder that extracts discriminative features from four spectral indices, namely the red-green index (RGI), blue-green index (BGI), normalized difference vegetation index (NDVI), and the normalized difference Noctiluca index (NDNI) proposed in this study; and a spectral-index-guided fusion (SIGF) module that adaptively integrates multi-scale features from the two branches using spatial-reduction cross-attention and a gated fusion mechanism. Experiments on a Sentinel-2 red tide dataset show that SIG-Net outperforms single-branch baselines, including U-Net, DeepLabV3+, and SegFormer, as well as naive multi-source fusion strategies. Ablation studies further confirm the contributions of the SIGF module, the gating mechanism, and the proposed NDNI to performance improvements. The proposed method provides an effective framework for integrating domain knowledge with deep learning for red tide remote sensing monitoring.

Keywords: red tide detection; spectral indices; dual-branch network; cross-attention fusion; Sentinel-2; remote sensing; deep learning

1. Introduction

Red tide, also referred to as a harmful algal bloom (HAB), is a natural phenomenon characterized by rapid phytoplankton proliferation in marine and coastal waters [1]. Such events can cause large-scale fish mortality, shellfish biotoxin contamination, water-quality deterioration, and substantial economic losses to the aquaculture and tourism sectors [2,3]. Driven by intensified coastal eutrophication associated with human activities and climate change, the frequency, intensity, and geographic extent of red tide events have increased markedly over recent decades [4,5]. Accordingly, the development of rapid, large-area, and accurate red tide monitoring capabilities has become an urgent need in marine environmental management [6,7].

Satellite remote sensing offers a unique synoptic perspective for monitoring large-scale marine red tide dynamics [8]. Medium-resolution multispectral sensors, such as the Multispectral Instrument (MSI) aboard Sentinel-2, provide 10–20 m spatial resolution, a 5-day revisit cycle, and free data access, making them particularly suitable for operational red tide monitoring [9,10]. The visible and near-infrared (VNIR) bands of Sentinel-2 capture spectral signals associated with red tide characteristics, including elevated chlorophyll-a concentrations, changes in water-leaving radiance, and the formation of surface scums [11,12].

Recent studies using medium- and high-spatial-resolution satellite imagery have further advanced red tide and algal-bloom monitoring in optically complex coastal waters. Sentinel-2 MSI has been used to map small dinoflagellate blooms in complex coastal waters, characterize optical types of highly concentrated red tides, monitor bloom status in riverine and coastal environments, and develop new spectral or color-space indices for mixed harmful algal blooms [10,11,12,13,14]. In Chinese coastal waters, high-spatial-resolution broad-band optical satellite data have been applied to fine-scale red tide detection, while recent GOCI-II studies have improved the detection and species-related discrimination of algal blooms in the East China Sea [15,16,17]. Recent Sentinel-2 index-based and machine-learning studies also show that combining spectral indices with data-driven classifiers can improve floating algal-bloom mapping [18]. These studies demonstrate the value of higher spatial resolution and richer spectral information for delineating narrow, patchy, and nearshore bloom features, but they also show that spectral variability, turbidity, sunglint, clouds, and mixed pixels remain important sources of uncertainty.

Traditional remote sensing methods for red tide detection rely heavily on expert-designed spectral indices and threshold-based classification [19]. The normalized difference vegetation index (NDVI), red-green index (RGI), blue-green index (BGI), and various chlorophyll-absorption indices have been widely used to distinguish bloom pixels from background water [20,21]. Although these indices effectively encode domain-specific spectral knowledge, such as the red-edge reflectance peak of chlorophyll-containing organisms, their performance is often constrained by threshold selection, atmospheric conditions, water turbidity, and spectral variability among algal species.

In recent years, deep learning-based semantic segmentation has emerged as a powerful alternative for pixel-level classification of remote sensing imagery [22]. Architectures such as U-Net, DeepLabV3+, and, more recently, SegFormer have achieved strong performance across a range of remote sensing tasks, including land-cover mapping, building extraction, and crop classification [23,24,25]. These methods learn hierarchical feature representations directly from raw pixel data and can capture complex spatial-spectral patterns that are difficult to express using handcrafted indices alone [26].

From a methodological perspective, existing red tide detection approaches can be broadly divided into spectral-feature-based methods and learning-based methods. Spectral-feature-based methods use characteristic band responses, spectral indices, and threshold rules to exploit the optical signatures of bloom waters, and they are physically interpretable and computationally efficient. However, their transferability can be limited by regional water optical conditions, dominant algal species, atmospheric correction uncertainty, and manually selected thresholds. Learning-based methods, especially CNN- and Transformer-based semantic segmentation networks, can learn nonlinear spatial-spectral representations from training samples, but many existing models treat all multispectral bands or index channels as homogeneous inputs and do not explicitly model the prior knowledge embedded in spectral indices. This gap motivates the spectral-index-guided dual-branch design proposed in this study.

However, applying existing deep learning methods to red tide extraction from multispectral imagery still presents several challenges [27]. First, the limited spectral dimensionality of multispectral data constrains the discriminative power of purely data-driven feature learning, especially when training samples are scarce [28]. Second, standard architectures treat all input channels equally and do not exploit the physical meaning carried by specific band combinations, namely spectral indices [29]. Third, red tide pixels usually occupy only a small proportion of an image, resulting in severe class imbalance and further increasing the difficulty of learning [30].

A natural strategy for addressing these limitations is to combine spectral indices with the original bands as auxiliary inputs. The simplest approach, namely early fusion, concatenates the indices with the original bands to form an extended input tensor. Although straightforward, early fusion forces the network to learn representations from heterogeneous data sources (reflectance values and index ratios) within a single encoder, which may not be optimal. An alternative is a dual-branch architecture, in which separate encoders process the two data sources independently before feature fusion. The key question is how to fuse the two branches effectively so that spectral-index information genuinely guides spectral feature learning rather than merely increasing channel dimensionality.

This study proposes SIG-Net (Spectral-Index-Guided Network), a novel dual-branch semantic segmentation framework that explicitly leverages spectral-index priors to guide multispectral feature extraction for red tide mapping. The main contributions of this work are as follows:

(1) A dual-branch backbone architecture is designed, in which a MiT-B2 spectral encoder processes the original Sentinel-2 bands and a lightweight four-stage CNN index encoder processes the spectral indices. The two branches share the same spatial downsampling scheme (1/4, 1/8, 1/16, and 1/32), enabling multi-scale feature alignment and fusion.

(2) A spectral-index-guided fusion (SIGF) module is proposed. It employs a spatially reduced cross-attention mechanism in which index-branch features serve as keys and values to guide query features from the spectral branch. A learnable gating mechanism adaptively controls the fusion ratio between the original spectral features and the attention-enhanced features at each scale.

(3) A normalized difference Noctiluca index (NDNI), defined as (B5 − B2)/(B5 + B2), is introduced. It exploits the red-edge scattering and blue-band absorption characteristics of Noctiluca-type red tide organisms and complements the traditional RGI, BGI, and NDVI indices.

(4) Comprehensive experiments are conducted on a Sentinel-2 red tide dataset covering multiple coastal regions, demonstrating that SIG-Net significantly outperforms single-branch baselines (U-Net, DeepLabV3+, and SegFormer) and alternative fusion strategies (early fusion, concatenation, and element-wise addition). Detailed ablation studies further verify the effectiveness of each component.

The remainder of this paper is organized as follows. Section 2 describes the study area, Sentinel-2 data, the proposed method, and experimental settings. Section 3 presents the experimental results, including baseline comparisons and ablation studies. Section 4 discusses the findings and their implications. Section 5 concludes the paper.

2. Materials and Methods

2.1. Study Area

The study area comprises the coastal waters of Jiangsu, Guangdong, and Guangxi, which are among the major red-tide-prone regions of China (Figure 1). Under the combined influence of coastal eutrophication, complex hydrodynamic processes, and intensive human activities, red tide events occur frequently during the warm season in these regions [31]. Specifically, the Jiangsu coast is strongly affected by Yangtze River diluted water and high nearshore suspended-sediment concentrations; the Guangdong coast is influenced by Pearl River runoff and the dispersion of the Pearl River estuarine plume; and the Guangxi coast is mainly characterized by the semi-enclosed environment of the Beibu Gulf and regional alongshore transport processes. Together, these environmental controls create favorable conditions for rapid phytoplankton growth and aggregation from May to September [32,33]. Common dominant red tide organisms in the region include Noctiluca scintillans, Prorocentrum donghaiense, and Karenia mikimotoi, and different algal species exhibit distinct spectral response characteristics from the visible to near-infrared bands [34].

The study area is approximately located between 20° and 35°N and between 108° and 122°E and encompasses a range of representative marine environments, including the nearshore shallow waters of Jiangsu, the waters adjacent to the Pearl River Estuary in Guangdong, and the coastal waters of the Beibu Gulf in Guangxi [35]. These regions include estuarine mixed waters, highly turbid nearshore waters, semi-enclosed bays, and outer shelf seas and therefore exhibit pronounced regional heterogeneity and optical complexity. Using these areas as study sites facilitates a systematic evaluation of the robustness and applicability of the proposed method under different nutrient regimes, suspended-sediment concentrations, water-depth conditions, and hydrodynamic settings, thereby improving the model’s generalization ability for typical coastal red tide monitoring scenarios in China.

2.2. Sentinel-2 Data and Preprocessing

Sentinel-2 is an optical Earth-observation satellite mission implemented by the European Space Agency (ESA) under the European Union’s Copernicus Programme and consists of two satellites, Sentinel-2A and Sentinel-2B [36]. Although primarily designed for land monitoring, it has also been widely used for remote sensing of coastal and inland waters, providing high-resolution multispectral imagery for agricultural monitoring, forest management, land-use change detection, water-resource assessment, and water-environment monitoring [37]. Sentinel-2 carries the Multispectral Instrument (MSI), which acquires 13 spectral bands spanning the visible, near-infrared, and shortwave infrared regions at spatial resolutions of 10, 20, and 60 m [38]. The coordinated operation of the twin satellites provides an approximately 5-day revisit interval at the equator, enabling continuous and stable remote sensing observations of global land surfaces and coastal regions [39,40,41,42,43,44].

In this study, Sentinel-2 Multispectral Instrument (MSI) Level-2A (L2A) surface reflectance products were used as the primary remote sensing data source. These products are atmospherically corrected using the Sen2Cor processor and provide bottom-of-atmosphere (BOA) reflectance. In view of the spectral response characteristics of red tide targets in the visible to near-infrared range, seven bands closely related to aquatic remote sensing and algal bloom identification were selected for subsequent analysis, namely B2, B3, B4, B5, B6, B7, and B8 (Table 1).

The original Sentinel-2 L2A imagery was acquired from 2020 to 2022. Image samples were selected according to red tide event information reported in the China Marine Disaster Bulletin and cover several representative red tide occurrence areas along the coasts of Jiangsu, Guangdong, and Guangxi (Table 2). Because the selected Sentinel-2 MSI bands have different native spatial resolutions, B5, B6, and B7 were resampled from 20 m to the 10 m reference grid of B2/B3/B4/B8 before model input construction. This resampling was conducted for band co-registration and pixel-wise tensor alignment, and it does not imply that new spatial information beyond the native 20 m resolution of the red-edge bands was generated. The resampled red-edge bands were used mainly as spectral cues for red tide discrimination rather than as independently enhanced 10 m spatial features. This common-grid preprocessing is consistent with common Sentinel-2 multispectral analysis workflows, including previous Sentinel-2 water-quality studies in which 20 m bands were resampled to a 10 m grid before analysis [45], and was applied consistently to all compared models, ensuring that the relative comparison among methods remains fair. No-data pixels were masked during subsequent training, inference, and statistical analysis.

To improve sample usability and training consistency, the imagery was further cropped, registered, and normalized after band resampling and reflectance conversion, and a red tide semantic segmentation dataset was constructed in combination with manual annotations. The preprocessed multispectral bands and corresponding labels jointly served as the basis for training and testing the SIG-Net model.

The dataset used in this study was derived from nine Sentinel-2 L2A scenes acquired at three time points. To meet the training requirements of deep learning models, each scene was cropped into 512 × 512-pixel patches using a sliding-window strategy with 128-pixel overlap between adjacent patches, generating a total of 5148 image patches (Figure 2). The ground-truth labels were constructed through a combined event-record, spectral-threshold, and expert-interpretation procedure. First, red tide event records from the China Marine Disaster Bulletin and related regional reports were used to determine the approximate occurrence dates and sea areas. Second, Sentinel-2 true-color and false-color composites, together with red-tide-related spectral indices, were visually inspected to identify candidate bloom regions. Third, semi-automatic threshold segmentation was used to generate preliminary red tide masks based on the spectral contrast between bloom water and background water. These masks were then manually checked and corrected by interpreters with remote-sensing and marine-environment knowledge according to spectral response, spatial continuity, coastline context, and cloud or invalid-pixel masks. Finally, the corrected masks were converted into binary semantic labels: red tide pixels were assigned class 1, background water and other non-red-tide areas were assigned class 0, and invalid pixels were excluded where appropriate.
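The sliding-window cropping described above can be sketched as follows. This is only an illustration: the function names, the 2048 × 2048 example size, and the handling of the image border (shifting the last window back so that it ends at the border) are assumptions and are not specified in the text.

```python
def patch_origins(length, patch=512, overlap=128):
    """Return start offsets of patches along one image axis.

    Assumption (not stated in the text): the last window is shifted back so
    that it ends exactly at the image border instead of being padded.
    """
    stride = patch - overlap  # 512 - 128 = 384 pixels between patch starts
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] + patch < length:  # cover the remaining border strip
        starts.append(length - patch)
    return starts


def tile_scene(height, width, patch=512, overlap=128):
    """Return (row, col) upper-left corners of all patches in a scene."""
    return [(r, c)
            for r in patch_origins(height, patch, overlap)
            for c in patch_origins(width, patch, overlap)]


# Hypothetical 2048 x 2048 crop, used only to show the call.
corners = tile_scene(2048, 2048)
```

With a 128-pixel overlap, adjacent patch origins are 384 pixels apart, so neighbouring patches share a quarter of their width.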

In constructing the labels, we conservatively labeled red tide pixels with clear visual, spectral, and spatial continuity characteristics. Ambiguous transition areas, weak-response pixels, and regions strongly affected by clouds, sunglint, suspended sediments, or mixed water signals were treated cautiously to reduce label noise. This strategy improves the reliability of positive samples but may lead to under-representation of weak or early-stage blooms and may make the model more conservative near red tide boundaries.

Adjacent patches from the same scene are spatially autocorrelated, so a random patch-level split would leak information among the training, validation, and test sets and bias model evaluation. A scene-level split strategy was therefore adopted. Specifically, four scenes (2272 patches) were used for training, three scenes (1309 patches) for validation, and two scenes (1567 patches) for testing. This partitioning strategy enables a more rigorous evaluation of the model’s generalization ability to unseen scenes.
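A scene-level split of this kind can be sketched as below. The scene identifiers, the toy patch list, and the function name are hypothetical; only the 4/3/2 scene allocation follows the text.

```python
def split_by_scene(patches, assignment):
    """Group patches by the subset assigned to their source scene.

    patches:    list of (patch_id, scene_id) pairs
    assignment: dict mapping scene_id -> "train" | "val" | "test"
    """
    subsets = {"train": [], "val": [], "test": []}
    for patch_id, scene_id in patches:
        subsets[assignment[scene_id]].append(patch_id)
    return subsets


# Hypothetical scene identifiers; the 4/3/2 scene counts follow the text.
assignment = {f"scene_{i}": "train" for i in range(4)}
assignment.update({f"scene_{i}": "val" for i in range(4, 7)})
assignment.update({f"scene_{i}": "test" for i in range(7, 9)})

# Toy patch list (three patches per scene), purely for illustration.
patches = [(f"scene_{s}_patch_{p}", f"scene_{s}")
           for s in range(9) for p in range(3)]
subsets = split_by_scene(patches, assignment)
```

Because the subset is chosen per scene rather than per patch, overlapping patches from one scene can never end up on both sides of the train/test boundary.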

2.3. Spectral Indices

A key innovation of this study is the explicit incorporation of spectral indices as an independent input branch. Given the band-dependent response of red tide targets and the prior information encoded in spectral indices, four indices closely related to red tide identification were selected as auxiliary features (Table 3).

The red-green index (RGI) is defined as B4/B3. This index characterizes the relative difference in reflectance between the red and green bands and is sensitive to changes in the visible-band reflectance characteristics of red tide waters [46]. Under high-concentration bloom conditions, RGI often increases and can therefore provide auxiliary discriminative information for red tide identification.

The blue-green index (BGI) is defined as B2/B3. This index reflects the relative difference in reflectance between the blue and green bands and can be used to characterize blue-light attenuation caused by the absorption of pigments such as chlorophyll and carotenoids [47]. In some red tide scenarios, lower BGI values usually correspond to stronger algal-pigment absorption signals and can therefore serve as a supplementary indicator for identifying anomalous bloom waters.

The normalized difference vegetation index (NDVI) is defined as (B8 − B4)/(B8 + B4). Although NDVI was originally developed for terrestrial vegetation monitoring, previous studies have extended its application to certain water-environment identification tasks. In red tide scenarios, when high-density near-surface algal aggregations or floating surface coverage are present, near-infrared reflectance may be relatively enhanced, allowing NDVI to respond to anomalous water targets [48].

The normalized difference Noctiluca index (NDNI) is defined as (B5 − B2)/(B5 + B2). This index is proposed in this study to enhance the spectral response of Noctiluca scintillans-dominated red tides commonly found in China’s coastal waters [49]. Carotenoids in Noctiluca cells exhibit strong absorption in the blue band (B2, 490 nm), whereas their relatively large cell size may lead to more pronounced scattering in the red-edge band (B5, 705 nm) [50]. Therefore, constructing a normalized difference index using B5 and B2 helps highlight the spectral contrast between blue-band absorption and the red-edge response in Noctiluca red tides.

To fully exploit the complementarity of these spectral indices, they were not simply concatenated with the original bands as additional channels. Instead, features were extracted through an independent index branch, and the spectral indices were then used to guide learning of the original multispectral features during the fusion stage, thereby improving the model’s ability to identify red tide targets.

During data preprocessing, the four spectral indices were calculated from the seven reflectance bands and prepared together with the original bands as an 11-channel tensor for each image patch (7 original bands + 4 spectral indices). This unified tensor was used only to maintain consistent preprocessing and augmentation before the network input was split by modality. To avoid numerical instability caused by denominators close to zero, a small constant ε = 10⁻⁶ was added to the denominator during index calculation, and the resulting index values were clipped to the range [−10, 10]. Any residual NaN or Inf values generated during the calculation were uniformly replaced with 0.
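The index calculation and the 11-channel stacking can be sketched in NumPy as follows. The channel order (B2, B3, B4, B5, B6, B7, B8), the function names, and the toy reflectance values are assumptions made for illustration; the index definitions, ε, the clipping range, and the NaN/Inf replacement follow the text.

```python
import numpy as np

EPS = 1e-6  # small constant added to every denominator, as in the text


def spectral_indices(bands):
    """Compute RGI, BGI, NDVI, and NDNI from a (7, H, W) reflectance array.

    Assumed channel order: B2, B3, B4, B5, B6, B7, B8.
    """
    b2, b3, b4, b5, _, _, b8 = bands.astype(np.float32)
    with np.errstate(divide="ignore", invalid="ignore"):
        rgi = b4 / (b3 + EPS)
        bgi = b2 / (b3 + EPS)
        ndvi = (b8 - b4) / (b8 + b4 + EPS)
        ndni = (b5 - b2) / (b5 + b2 + EPS)
    idx = np.stack([rgi, bgi, ndvi, ndni])
    idx = np.clip(idx, -10.0, 10.0)  # clip to [-10, 10]
    # Replace any residual NaN/Inf values with 0.
    return np.nan_to_num(idx, nan=0.0, posinf=0.0, neginf=0.0)


def build_input(bands):
    """Stack 7 bands and 4 indices into an (11, H, W) array."""
    return np.concatenate([bands.astype(np.float32), spectral_indices(bands)])


# Toy patch with constant reflectance per band (illustrative values only).
values = [0.02, 0.04, 0.06, 0.08, 0.09, 0.10, 0.10]
toy = np.stack([np.full((4, 4), v, dtype=np.float32) for v in values])
x = build_input(toy)
spectral_part, index_part = x[:7], x[7:]  # split by modality
```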

In the SIG-Net architecture, once the 11-channel input enters the network, it is split by modality: the first seven channels are fed into the spectral encoder, and the remaining four channels are fed into the index encoder. Therefore, although the original bands and spectral indices are prepared in a unified tensor before input, the model does not adopt simple early-concatenation fusion. Instead, a dual-branch structure is used to extract the two types of features separately, followed by guided fusion at later stages. This design preserves the independent representational capacity of different input modalities while allowing spectral indices to guide multispectral feature learning.

2.4. Proposed Method: SIG-Net Architecture

This section describes the architecture of SIG-Net in detail. As shown in Figure 3, SIG-Net adopts a dual-branch encoder–decoder structure comprising three core components: (1) a spectral encoder based on a Mix Vision Transformer (MiT-B2), which extracts hierarchical spatial-spectral representations from the original Sentinel-2 bands; (2) a lightweight CNN-based index encoder, which extracts multi-scale features from the spectral indices; and (3) a set of spectral-index-guided fusion (SIGF) modules, which adaptively integrate features from the two branches at multiple scales through spatial-reduction cross-attention and a gated fusion mechanism. The fused multi-scale features are then fed into a SegFormer-style MLP decoder head for pixel-wise red tide classification.

2.4.1. Spectral Encoder

The spectral encoder adopts the Mix Vision Transformer (MiT-B2) architecture, which has shown strong performance in dense prediction tasks. MiT-B2 is a hierarchical Vision Transformer that generates multi-scale feature maps through four successive stages, each composed of Patch Embedding, efficient self-attention layers, and Mix-FFN (a feed-forward network incorporating depthwise convolution).

In this implementation, the MiT-B2 encoder processes the 7-channel spectral input with the following configuration: the initial Patch Embedding uses a 7 × 7 convolution with a stride of 4 (stage 1), followed by 3 × 3 convolutions with a stride of 2 in the next three stages. The number of Transformer layers in the four stages is (3, 4, 6, 3), with corresponding numbers of attention heads (1, 2, 5, 8). The base embedding dimension is 64, and the channel dimensions of the four stages are (64, 128, 320, 512). To reduce the computational cost of self-attention on high-resolution feature maps, efficient self-attention uses spatial reduction ratios of (8, 4, 2, 1) at the four stages, respectively. The MLP expansion ratio is 4, and the maximum drop-path rate is set to 0.1 for regularization during training.
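The stage configuration above can be collected in a plain dictionary and used to trace the feature-map shapes produced for a given input size. The dictionary keys and the function name are illustrative choices, not the configuration schema of any particular library; the numerical values are those listed in the text.

```python
# Values from the text; key names are illustrative only.
MIT_B2 = {
    "in_channels": 7,
    "embed_dims": (64, 128, 320, 512),
    "depths": (3, 4, 6, 3),
    "num_heads": (1, 2, 5, 8),
    "sr_ratios": (8, 4, 2, 1),
    "patch_strides": (4, 2, 2, 2),
    "mlp_ratio": 4,
    "drop_path_rate": 0.1,
}


def stage_shapes(height, width, cfg=MIT_B2):
    """Return the (channels, H, W) shape of each stage output."""
    shapes = []
    h, w = height, width
    for dim, stride in zip(cfg["embed_dims"], cfg["patch_strides"]):
        h, w = h // stride, w // stride  # cumulative strides 4, 8, 16, 32
        shapes.append((dim, h, w))
    return shapes


shapes = stage_shapes(512, 512)
# Channels per attention head at each stage.
head_dims = [d // n for d, n in zip(MIT_B2["embed_dims"], MIT_B2["num_heads"])]
```

For a 512 × 512 patch this gives feature maps at 1/4, 1/8, 1/16, and 1/32 of the input resolution, and every stage uses 64 channels per attention head.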

Compared with the standard Vision Transformer (ViT), a key advantage of MiT is that it natively produces multi-scale features without requiring a feature pyramid network (FPN), making it naturally compatible with the proposed dual-branch multi-scale fusion strategy. In addition, Mix-FFN uses 3 × 3 depthwise convolutions to supply local positional information, so no explicit positional encoding is needed, which is particularly beneficial for dense prediction tasks in remote sensing scenarios.

2.4.2. Index Encoder

The index encoder is implemented as a lightweight four-stage convolutional neural network for extracting multi-scale features from the four spectral-index channels. Unlike the spectral encoder, which uses MiT-B2 to model long-range spatial-spectral dependencies, the index encoder adopts a relatively simple CNN structure. This asymmetric design is motivated by two considerations. First, spectral indices already encode strong prior information, and the index branch therefore does not require a feature extractor as complex as that of the spectral branch. Second, keeping the index branch lightweight helps control the additional computational overhead introduced by the dual-branch structure, thereby improving overall network efficiency.

Each stage of the index encoder consists of two sequential convolutional modules. The first module performs spatial downsampling using a convolution with a stride s, where the stride is 4 in stage 1 and 2 in stages 2–4, followed by batch normalization and a ReLU activation function. The second module uses a 3 × 3 convolution with a stride of 1 to further refine local features. The initial convolution in stage 1 uses a 7 × 7 kernel to capture broader spatial context, whereas all subsequent stages use 3 × 3 kernels. The output channel dimensions of the four stages are set to (64, 128, 320, 512) to match the feature dimensions of the spectral encoder at the corresponding scales, thereby facilitating subsequent cross-branch feature fusion.
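To check that the index branch yields feature maps aligned with the spectral branch, the spatial sizes can be traced with the standard convolution output-size formula. The padding values (kernel size // 2) are an assumption, since the text does not state them; kernel sizes, strides, and channel widths follow the description above.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


# (out_channels, kernel of the downsampling conv, stride) per stage.
INDEX_STAGES = [(64, 7, 4), (128, 3, 2), (320, 3, 2), (512, 3, 2)]


def index_encoder_shapes(size):
    """Trace (channels, H, W) after each stage for a square input."""
    shapes = []
    for channels, kernel, stride in INDEX_STAGES:
        # Downsampling conv; padding = kernel // 2 is an assumption.
        size = conv_out(size, kernel, stride, kernel // 2)
        # 3 x 3 refinement conv with stride 1 keeps the spatial size.
        size = conv_out(size, 3, 1, 1)
        shapes.append((channels, size, size))
    return shapes


index_shapes = index_encoder_shapes(512)
```

Under these padding assumptions, a 512 × 512 input produces the same (64, 128, 320, 512)-channel pyramid at 1/4 to 1/32 resolution as the spectral encoder, which is what the stage-wise fusion requires.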

Compared with the spectral encoder, the index encoder introduces only a limited additional computational cost. Because each stage contains only two convolutional layers and no self-attention modules, it increases the parameter count by approximately 3.2 M in the current implementation, corresponding to about a 13% increase relative to the MiT-B2 backbone (approximately 24.7 M parameters). This lightweight design allows the dual-branch architecture to maintain strong representational capacity while retaining good deployment feasibility, making it suitable for real-time or near-real-time red tide monitoring scenarios.

2.4.3. Spectral-Index-Guided Fusion (SIGF) Module

The core innovation of SIG-Net lies in the spectral-index-guided fusion (SIGF) module, which is instantiated at four spatial scales. Unlike conventional symmetric fusion strategies (e.g., feature concatenation or element-wise addition), SIGF explicitly models the asymmetric relationship between spectral features and index features, treating spectral-index features as guidance information that modulates spectral-branch features through a cross-attention mechanism. The underlying rationale is that spectral indices encode prior knowledge relevant to red tide identification and can therefore serve as auxiliary signals that guide the learning of more discriminative representations from the original multispectral features.

Let the spectral feature and the index feature at a given scale be denoted as $F_{s} \in \mathbb{R}^{C \times H \times W}$ and $F_{i} \in \mathbb{R}^{C \times H \times W}$, respectively. First, the SIGF module generates the query, key, and value through 1 × 1 convolutions:

$$Q = W_{Q} F_{s}, \quad K = W_{K}\,\mathrm{SR}(F_{i}), \quad V = W_{V}\,\mathrm{SR}(F_{i}) \tag{1}$$

where $W_{Q}$, $W_{K}$, and $W_{V}$ denote 1 × 1 linear projections, and $\mathrm{SR}(\cdot)$ denotes the spatial reduction operation, which is implemented by average pooling in this study. For the four scales, the spatial reduction ratios $r$ are set to (8, 4, 2, 1), respectively. Let the original number of spatial tokens be $N = H \times W$; then the number of reduced tokens is $N_{r} = N / r^{2}$, and the computational complexity of the attention operation is reduced from $O(N^{2})$ to $O(N N_{r})$.
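As a worked example of the token reduction, the following sketch computes N, N_r, and the sizes of the full and reduced attention matrices at the four scales. The 512 × 512 input size is taken from the patch size used in this study; the function name and dictionary keys are illustrative.

```python
def attention_cost(height, width, ratios=(8, 4, 2, 1), strides=(4, 8, 16, 32)):
    """Token counts and attention-matrix sizes at the four fusion scales."""
    rows = []
    for stride, r in zip(strides, ratios):
        n = (height // stride) * (width // stride)  # N = H x W at this scale
        n_r = n // (r * r)                          # N_r = N / r^2
        rows.append({"N": n, "N_r": n_r,
                     "full": n * n,                 # entries for O(N^2)
                     "reduced": n * n_r})           # entries for O(N N_r)
    return rows


costs = attention_cost(512, 512)
```

For a 512 × 512 patch, N is 16384, 4096, 1024, and 256 at the four scales, while N_r is 256 at every scale, so the attention matrix shrinks by a factor of r² (64, 16, 4, and 1, respectively).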

Before attention computation, Q, K, and V are rearranged into a two-dimensional token form, where $Q \in \mathbb{R}^{N \times C}$ and $K, V \in \mathbb{R}^{N_{r} \times C}$. The cross-attention output is then defined as:

$$F_{att} = \mathrm{BN}\left(\mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{C}}\right) V\right) \tag{2}$$

where $\sqrt{C}$ is the scaling factor used to avoid excessively large dot-product values, and BN denotes the batch normalization operation applied after the token sequence is reshaped back into a two-dimensional feature map. In this study, BatchNorm2d is used to normalize the attention output so that it remains consistent with the feature representation of the convolution branch and yields more stable training performance.
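The spatial-reduction cross-attention of Equations (1) and (2) can be sketched in NumPy for a single feature map. This is a simplified illustration, not the authors' implementation: it is single-head, treats each 1 × 1 projection as a C × C matrix without bias, omits the final batch normalization, and assumes H and W are divisible by the reduction ratio r. All function names are hypothetical.

```python
import numpy as np


def avg_pool(x, r):
    """Average-pool a (C, H, W) array with an r x r window and stride r."""
    c, h, w = x.shape
    return x.reshape(c, h // r, r, w // r, r).mean(axis=(2, 4))


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(f_s, f_i, w_q, w_k, w_v, r):
    """Index-guided cross-attention without the trailing BN.

    f_s, f_i: (C, H, W) spectral and index features.
    w_q, w_k, w_v: (C, C) matrices standing in for 1 x 1 convolutions.
    """
    c, h, w = f_s.shape
    q = (w_q @ f_s.reshape(c, -1)).T               # (N, C) queries
    f_i_red = avg_pool(f_i, r).reshape(c, -1)      # SR(F_i): (C, N_r)
    k = (w_k @ f_i_red).T                          # (N_r, C) keys
    v = (w_v @ f_i_red).T                          # (N_r, C) values
    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)  # (N, N_r) weights
    out = attn @ v                                 # (N, C)
    return out.T.reshape(c, h, w), attn
```

Each spectral-branch location attends over the N_r pooled index-branch tokens, so the output keeps the spatial layout of the spectral feature while its content is drawn from the index feature.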

After obtaining the attention-enhanced feature, SIGF further adaptively controls the fusion ratio between the original spectral feature and the attention-enhanced feature through a learnable gating mechanism:

$$G = \sigma\left(\mathrm{BN}\left(\mathrm{Conv}_{3 \times 3}\left([F_{s}; F_{att}]\right)\right)\right) \tag{3}$$

$$F_{fused} = G \odot F_{att} + (1 - G) \odot F_{s} \tag{4}$$

where $\sigma$ denotes the Sigmoid activation, $[\cdot\,;\cdot]$ denotes channel-wise concatenation, and $\odot$ denotes element-wise multiplication. The gating map $G \in [0, 1]^{C \times H \times W}$ is obtained through end-to-end learning, enabling the network to adaptively regulate the fusion strength of index-guided information in both the spatial and channel dimensions. In general, in regions where the spectral indices are more discriminative, the model tends to assign higher weights to the attention-enhanced features, whereas in regions with ambiguous boundaries or weak index responses, more of the original spectral features can be retained. The ablation experiments in Section 3.2 further verify the effectiveness of this gating mechanism in improving model performance.
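The blending step of Equation (4) can be illustrated with a short NumPy sketch. Here `gate_logits` is a hypothetical stand-in for the output of the 3 × 3 convolution and batch normalization in Equation (3), which are not reproduced; only the sigmoid and the convex combination are shown.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(f_s, f_att, gate_logits):
    """Blend spectral and attention-enhanced features per element.

    gate_logits stands in for BN(Conv3x3([F_s; F_att])) and has the same
    (C, H, W) shape as the two feature maps.
    """
    g = sigmoid(gate_logits)                 # G in [0, 1]
    return g * f_att + (1.0 - g) * f_s, g    # Eq. (4)
```

Because G is bounded in [0, 1], every fused element is a convex combination of the two inputs: a gate near 1 passes the index-guided feature, a gate near 0 keeps the original spectral feature, and a gate of 0.5 averages the two.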

2.4.4. Decoder Head

After multi-scale feature fusion, the four fused feature maps are fed into the SegFormer-style decoder head. The decoder first maps features at each scale into a unified 256-dimensional channel space through projection layers. In this implementation, each projection layer consists of a 1 × 1 convolution, batch normalization, and ReLU activation. Subsequently, features from all scales are upsampled to 1/4 resolution (H/4 × W/4) by bilinear interpolation and concatenated along the channel dimension to form 1024-dimensional multi-scale fused features.

The concatenated features are compressed to 256 dimensions through a 1 × 1 convolutional fusion layer with batch normalization, followed by dropout with a ratio of 0.1 for regularization. Finally, a 1 × 1 convolutional classification layer outputs pixel-level predictions for two classes; during both training and inference, the predictions are further upsampled to the original input resolution to obtain the final segmentation map.
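The tensor shapes implied by this description can be traced with a short Python helper. The per-stage channel widths (64, 128, 320, 512) are the usual MiT-B2 stage widths and are assumed here to carry over to the fused feature maps; the function name `decoder_shapes` is hypothetical.

```python
def decoder_shapes(height, width, in_channels=(64, 128, 320, 512),
                   strides=(4, 8, 16, 32), embed_dim=256, num_classes=2):
    """Trace (channels, height, width) through the decoder head."""
    stages = [(c, height // s, width // s)
              for c, s in zip(in_channels, strides)]
    # 1 x 1 conv + BN + ReLU maps every scale to embed_dim channels.
    projected = [(embed_dim, h, w) for _, h, w in stages]
    # Bilinear upsampling brings every scale to 1/4 resolution.
    h4, w4 = height // 4, width // 4
    upsampled = [(embed_dim, h4, w4) for _ in projected]
    return {
        "stages": stages,
        "projected": projected,
        "concatenated": (embed_dim * len(upsampled), h4, w4),
        "fused": (embed_dim, h4, w4),            # 1 x 1 conv + BN, dropout 0.1
        "logits": (num_classes, h4, w4),         # 1 x 1 classification layer
        "output": (num_classes, height, width),  # upsampled to the input size
    }

shapes = decoder_shapes(512, 512)
print(shapes["concatenated"])  # (1024, 128, 128)
print(shapes["output"])        # (2, 512, 512)
```

For a 512 × 512 training crop, the four 256-channel maps are concatenated at 128 × 128, which gives the 1024 channels mentioned above.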

2.4.5. Loss Function

The model is trained using a pixel-wise cross-entropy loss function with class weights:

$$L_{CE} = -\frac{1}{N} \sum_{n} \sum_{c} w_{c}\, y_{n,c} \log \hat{p}_{n,c} \tag{5}$$

Here, $N$ denotes the total number of valid pixels involved in loss computation; $c$ denotes the class index (background class and red tide class); $w_{c}$ denotes the class weight, set in this study to $w_{0} = 1.0$ for the background class and $w_{1} = 50.0$ for the red tide class; $y_{n,c}$ denotes the one-hot encoded ground-truth label; and $\hat{p}_{n,c}$ denotes the predicted class probability. The larger weight assigned to the red tide class is used to mitigate the pronounced class imbalance problem and thereby enhance the model’s focus on minority-class targets.

To further enhance the model’s ability to identify the red tide class, this study combines cross-entropy loss and Dice loss in a weighted manner to obtain the total loss function:

$$L_{total} = L_{CE} + 3.0\, L_{Dice} \tag{6}$$

Here, Dice loss [51] directly measures the regional overlap between the predicted and ground-truth segmentations and is generally more robust to class imbalance than cross-entropy loss alone. By combining the two, the model can simultaneously optimize pixel-wise classification accuracy and the overall overlap quality of the target region, thereby improving the completeness and boundary consistency of red tide segmentation results.
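The combined objective in Eqs. (5) and (6) can be sketched in NumPy as follows. The cross-entropy term follows Eq. (5) directly, including the division by $N$. The Dice term is not spelled out in the text, so a common soft-Dice form computed on the red tide class with a smoothing constant `eps` is assumed here; the function names are hypothetical.

```python
import numpy as np

def weighted_ce(probs, labels, class_weights=(1.0, 50.0)):
    """Eq. (5): probs is (N, C) class probabilities, labels is (N,) integers."""
    weights = np.asarray(class_weights, dtype=float)
    n = labels.shape[0]
    # Probability assigned to the true class of each pixel, clipped for log().
    picked = np.clip(probs[np.arange(n), labels], 1e-12, 1.0)
    return float(-(weights[labels] * np.log(picked)).sum() / n)

def dice_loss(probs, labels, positive=1, eps=1.0):
    """Soft Dice loss on the red tide class (assumed form, not from the paper)."""
    p = probs[:, positive]
    y = (labels == positive).astype(float)
    return float(1.0 - (2.0 * (p * y).sum() + eps) / (p.sum() + y.sum() + eps))

def total_loss(probs, labels, class_weights=(1.0, 50.0), dice_weight=3.0):
    """Eq. (6): weighted cross-entropy plus 3.0 times the Dice loss."""
    return (weighted_ce(probs, labels, class_weights)
            + dice_weight * dice_loss(probs, labels))

# Four hypothetical pixels: two background (0) and two red tide (1).
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
labels = np.array([0, 0, 1, 1])
print(total_loss(probs, labels))
```

With these weights, a misclassified red tide pixel contributes 50 times as much to the cross-entropy term as a background pixel with the same predicted probability.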

2.5. Experimental Settings

2.5.1. Implementation Details

All models were implemented in MMSegmentation v1.2.2, with PyTorch 2.1.2 and CUDA 11.8 as the runtime environment. Experiments were conducted on a single NVIDIA RTX 4090 GPU with 24 GB of memory. AdamW [52] was used as the optimizer, with an initial learning rate of 1 × 10⁻³ and a weight decay of 0.01. The learning-rate schedule included 500 iterations of linear warm-up (from 1 × 10⁻⁶ to 1 × 10⁻³), followed by polynomial decay (power = 0.9) to a minimum learning rate of 1 × 10⁻⁵. The models were trained for 20,000 iterations with a batch size of 16. To improve training efficiency and reduce GPU memory usage, automatic mixed precision (AMP) and dynamic loss scaling were enabled during training.
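The learning-rate schedule described above can be written as a small closed-form function. This is a sketch under two assumptions that the text does not settle: the polynomial decay is measured from the end of warm-up rather than from iteration 0, and it uses the usual form in which the rate falls from the base value to the minimum as `(1 - progress) ** power`. The function name `learning_rate` is hypothetical.

```python
def learning_rate(it, base_lr=1e-3, warmup_start=1e-6, warmup_iters=500,
                  max_iters=20_000, power=0.9, min_lr=1e-5):
    """Linear warm-up followed by polynomial decay to min_lr."""
    if it < warmup_iters:
        frac = it / warmup_iters
        return warmup_start + frac * (base_lr - warmup_start)
    # Fraction of the post-warm-up schedule that has elapsed, in [0, 1].
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    return (base_lr - min_lr) * (1.0 - progress) ** power + min_lr

for step in (0, 250, 500, 10_000, 20_000):
    print(step, f"{learning_rate(step):.3e}")
```

The schedule starts at 1 × 10⁻⁶, reaches 1 × 10⁻³ at iteration 500, and ends at 1 × 10⁻⁵ at iteration 20,000.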

Data augmentation included random cropping (512 × 512, with a maximum class-ratio threshold of 0.75), random horizontal flipping (p = 0.5), random vertical flipping (p = 0.5), and random rotation (angle range [−90°, +90°], p = 0.5). The loss function combined weighted cross-entropy loss and Dice loss, with class weights set to [1.0, 50.0] to alleviate class imbalance and the Dice-loss weight set to 3.0; the specific definition is given in Section 2.4.5. All models were compared under the same data split and training settings. Mean intersection over union (mIoU), mean precision (mPrecision), mean recall (mRecall), and mean F1-score (mF1) were used as the primary evaluation metrics. These metrics are widely used for semantic segmentation and classification performance evaluation because they characterize region overlap, false positives, false negatives, and the balance between precision and recall [53]. The final results are reported on the independent test set.
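For reference, the four metrics can be computed from a confusion matrix as sketched below, with each quantity evaluated per class and then averaged over the classes. The confusion-matrix values are made up purely to exercise the function and are not results from this study.

```python
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j] = number of pixels of true class i predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp     # predicted as the class but belonging elsewhere
    fn = conf.sum(axis=1) - tp     # belonging to the class but predicted elsewhere
    per_class = {
        "mIoU": tp / (tp + fp + fn),
        "mPrecision": tp / (tp + fp),
        "mRecall": tp / (tp + fn),
    }
    p, r = per_class["mPrecision"], per_class["mRecall"]
    per_class["mF1"] = 2 * p * r / (p + r)
    # Average over classes and express as percentages.
    return {name: 100.0 * float(values.mean())
            for name, values in per_class.items()}

# Hypothetical counts: rows are true classes (background, red tide).
example = [[9500, 100],
           [50, 350]]
metrics = segmentation_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
```

Because mF1 is the mean of the per-class F1 scores, it generally differs from the harmonic mean of mPrecision and mRecall.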

2.5.2. Baseline Methods

To assess the effectiveness of SIG-Net, it was compared with the following baseline methods.

U-Net: a classical encoder–decoder segmentation architecture with symmetric downsampling and upsampling paths and skip connections. In this implementation, a five-stage structure was used, with the base number of channels set to 64. To ensure the same amount of input information as in the proposed method, U-Net adopts an 11-channel input, in which the seven original spectral bands and four spectral indices are directly concatenated into a multi-channel input tensor.

DeepLabV3+ [54]: an encoder–decoder architecture combined with atrous spatial pyramid pooling (ASPP) to enhance multi-scale context modeling. In this study, ResNet-50 was used as the backbone, and the input was likewise an 11-channel tensor formed by concatenating the seven original spectral bands and four spectral indices.

SegFormer (MiT-B2) [55]: a representative Transformer-based semantic segmentation architecture whose backbone is identical to the spectral encoder of SIG-Net. For a fair comparison, two configurations were evaluated in this study:

(a) SegFormer-7ch: only the seven original spectral bands are used as input;

(b) SegFormer-11ch: the seven original spectral bands and four spectral indices are directly concatenated into an 11-channel input and fed into a single encoder in order to evaluate the difference between a simple early-fusion strategy and the dual-branch guided-fusion strategy of SIG-Net.

All baseline methods used the same training settings as SIG-Net, including the optimizer (AdamW), initial learning rate (1 × 10⁻³), data-augmentation strategy, and loss function (a combination of cross-entropy loss and Dice loss), to ensure a fair comparison. For models whose default input is 3-channel RGB (U-Net, DeepLabV3+, and SegFormer), the input layers were modified to accommodate multispectral input: U-Net, DeepLabV3+, and SegFormer-11ch received 11-channel input, whereas SegFormer-7ch received 7-channel input. To avoid additional influences introduced by different pretraining strategies, all models were randomly initialized and trained from scratch.

3. Results

3.1. Comparison with Baseline Methods

To verify the effectiveness of the proposed method, SIG-Net was compared with representative semantic segmentation models, including U-Net, DeepLabV3+ (R50), and SegFormer (MiT-B2); the results are summarized in Table 4. SIG-Net delivers the best overall performance, achieving mIoU, mPrecision, mRecall, and mF1 values of 75.31%, 81.60%, 86.03%, and 83.67%, respectively. Its mIoU, mPrecision, and mF1 are the highest among all methods, indicating that the proposed dual-branch feature modeling and fusion strategy effectively improves red tide segmentation performance.

In terms of individual metrics, different methods exhibit different trade-offs between precision and recall. SegFormer-7ch achieves the highest mRecall (86.23%), indicating a strong capability for detecting target regions; however, its mPrecision is only 75.88%, suggesting a relatively high false-positive rate, and its mIoU and mF1 are therefore only 71.51% and 80.18%, respectively. By contrast, SIG-Net attains an mRecall of 86.03%, close to the highest value, while raising mPrecision to 81.60%, higher than that of every comparison method. As a result, it suppresses false detections while maintaining high recall, ultimately achieving the best mIoU and mF1. This indicates that the performance gain of the proposed method does not arise merely from increased recall, but from a better balance between precision and recall.

In addition, when the input of SegFormer is expanded from 7 channels to 11 channels, mIoU decreases from 71.51% to 70.42%, and mF1 decreases from 80.18% to 79.11%. This result does not indicate that spectral indices are ineffective; rather, it suggests that simply concatenating the original spectral bands and spectral indices into a single encoder does not guarantee performance improvement. A possible reason is that spectral indices are nonlinear ratio features derived from the original bands and have value distributions and noise characteristics that differ from those of reflectance bands. A single patch-embedding layer treats all 11 channels as homogeneous input and may therefore mix redundant or unstable index responses with original spectral information at the earliest stage of feature extraction. Under limited training samples and strong class imbalance, this early-fusion strategy can increase optimization difficulty and lead to more false responses. Meanwhile, under the 11-channel early-fusion setting, U-Net and DeepLabV3+ achieve mIoU values of 73.27% and 72.94%, respectively, which are still lower than that of SIG-Net, further indicating that direct concatenation alone cannot fully exploit the auxiliary role of spectral indices.

By contrast, SIG-Net adopts a 7+4-channel dual-branch input scheme and performs collaborative modeling and guided fusion of original spectral features and index features through the specially designed SIGF module, thereby more fully exploiting the complementary information between different modalities. In terms of model complexity, SIG-Net achieves the best performance with only a moderate increase in parameter count. Compared with the 24.7 M parameters of SegFormer (MiT-B2), the total parameter count of SIG-Net is approximately 27.9 M, an increase of only about 3.2 M parameters (roughly 13%), while mIoU improves by 3.80 percentage points and mF1 by 3.49 percentage points relative to SegFormer-7ch. This indicates that the lightweight index encoder and SIGF module can substantially enhance segmentation capability while adding relatively few parameters, demonstrating a favorable balance between performance and complexity. The detailed results are presented in Figure 4 and Figure 5.

3.2. Ablation Studies

Systematic ablation experiments were conducted to evaluate the contributions of the individual components of SIG-Net. All ablation studies used the same training configuration as that described in Section 2.5.1.

3.2.1. Effect of Fusion Strategy

Table 5 compares three fusion strategies within the dual-branch framework of SIG-Net, including element-wise addition, channel concatenation followed by a 1 × 1 convolution for dimensionality reduction, and the SIGF module proposed in this study. All variants use the same spectral encoder (MiT-B2) and index encoder (lightweight CNN), ensuring a fair comparison.

The results show that the gated SIGF module achieves the best overall performance, indicating that the proposed guided fusion mechanism is superior to simpler symmetric fusion methods. Element-wise addition introduces no extra parameters, but its fusion ratio is fixed and lacks adaptive capability in both the spatial and channel dimensions. Concatenation followed by a 1 × 1 convolution can learn channel-mapping relationships, but it still treats the two branches symmetrically and does not explicitly characterize the guiding role of spectral-index features in spectral-feature learning.

A further comparison between the gated and ungated SIGF variants verifies the importance of the gating mechanism. Without gating, the fusion process lacks adaptive control over the degree to which original spectral features are preserved, and when spectral-index responses are unstable or insufficiently discriminative, the effective use of spectral information may be impaired. With the gating mechanism, the network can adaptively control the fusion ratio between attention-enhanced features and original spectral features according to spatial location and channel response, thereby improving overall segmentation performance. In particular, gated SIGF reaches 81.60% in mPrecision, clearly higher than that of the other three strategies (77.93–78.25%), indicating that the mechanism can effectively suppress false detections. By contrast, the differences in mRecall among the methods are relatively small (85.68–87.13%), suggesting that the performance gain mainly stems from a better balance between precision and recall rather than from simply increasing recall. The qualitative results are shown in Figure 6.

3.2.2. Effect of Spectral Indices

Table 6 examines the contribution of each spectral index by systematically removing one index at a time from the full index set (RGI, BGI, NDVI, and NDNI). Except for the “No indices (spectral only)” configuration, all models adopt the complete SIG-Net architecture (including SIGF fusion and gating). “Remove” indicates that the specified index is removed from the four-index set. mPrecision, mRecall, and mF1 are all calculated separately for each class and then averaged.

To evaluate the influence of different spectral indices on segmentation performance, an ablation experiment was designed in which one index was removed at a time, and the results were compared with those of the full index configuration and a control model using only the original spectral features. The results show that the full index configuration (RGI + BGI + NDVI + NDNI) achieves the best overall performance, with mIoU and mF1 reaching 75.31% and 83.67%, respectively, both higher than those of the other configurations. This indicates that the joint use of multiple indices can effectively improve the overall segmentation capability of the model.

From the perspective of individual metrics, the full index configuration does not yield the highest value for either mPrecision or mRecall, which indicates that the advantage of multi-index fusion does not lie in maximizing a single metric, but in achieving a better overall balance between precision and recall. For example, after removing NDVI, the model’s mPrecision increases to 83.74%, the highest among all configurations, but mRecall decreases to 82.30%, resulting in lower mIoU and mF1 than those of the full configuration. After removing NDNI, the model’s mRecall increases to 86.96%, the highest value, but mPrecision drops to 78.48%, indicating more false detections and thus a marked decrease in overall performance. By contrast, the full index configuration achieves a better balance between precision and recall and therefore performs best on the comprehensive metrics mIoU and mF1.

Further analysis of the contribution of each spectral index shows that removing NDNI causes mIoU to decrease from 75.31% to 73.62%, representing the largest decline and indicating that NDNI has strong discriminative ability in the current task. After removing BGI and NDVI, mIoU decreases to 74.77% and 74.56%, respectively, indicating that both indices also make stable positive contributions to model performance. After removing RGI, mIoU is 75.03%, only 0.28 percentage points lower than that of the full configuration, indicating that RGI contributes relatively less to overall performance but still provides complementary information. Overall, clear complementarity exists among the indices, and multi-index fusion helps improve the model’s representation of target regions, thereby yielding better segmentation results. The qualitative results are shown in Figure 7.
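The mIoU drops quoted in this paragraph can be reproduced with a few lines of Python; the values are those stated in the text, and only the variable names are introduced here.

```python
full_miou = 75.31
# mIoU (%) after removing one index from the full four-index set.
miou_without = {"NDNI": 73.62, "NDVI": 74.56, "BGI": 74.77, "RGI": 75.03}

# Drop in percentage points relative to the full configuration.
drops = {name: round(full_miou - value, 2)
         for name, value in miou_without.items()}

for name, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"remove {name}: -{drop:.2f} pp")
# remove NDNI: -1.69 pp
# remove NDVI: -0.75 pp
# remove BGI: -0.54 pp
# remove RGI: -0.28 pp
```

The ranking by mIoU drop is therefore NDNI, NDVI, BGI, and RGI, from largest to smallest contribution.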

4. Discussion

4.1. Effectiveness of Spectral Index Guidance

The experimental results indicate that explicitly integrating spectral indices as an independent guidance branch, rather than treating them as additional input channels in an early-fusion scheme, yields better overall performance for red tide semantic segmentation. This suggests that, in remote sensing image analysis, incorporating domain knowledge into deep feature learning in a structured manner is often more effective than simple channel-level concatenation. This design strategy may also provide useful guidance for other remote sensing tasks that rely on empirical indices, such as vegetation mapping, water-quality assessment, and urban thermal-environment analysis [56].

An important reason why the SIGF fusion mechanism outperforms simpler alternatives (Table 5) lies in its asymmetric cross-attention design. By using features from the spectral encoder as queries and features from the index encoder as keys and values, SIGF explicitly models how spectral indices guide spectral-feature learning, rather than treating the two branches as fully equivalent information sources. This asymmetry is consistent with the functional role of spectral indices: they are derived from the original bands and encode prior knowledge relevant to red tide identification. They are therefore better suited to serve as auxiliary guidance signals for modulating deep representation learning than to be indiscriminately fused with the original band features.

4.2. Role of the Gating Mechanism

The gating mechanism in SIGF plays an important role in improving the adaptivity of the fusion process. The ablation experiments (Table 5) show that model performance declines when the gating mechanism is removed, indicating that relying solely on attention-enhanced features in a fixed manner is not optimal. This is particularly important in complex coastal scenes. For example, in shallow turbid waters, suspended sediments may introduce spectral anomalies similar to those of red tides; in low-signal-to-noise regions, the stability of spectral indices may also decrease. In such cases, if the original spectral features are not effectively preserved, the fusion results may be affected by unreliable guidance information.

After the introduction of gating, the model can adaptively regulate the contribution of index-guided information in both the spatial and channel dimensions, thereby fully exploiting prior information in regions with strong index responses while preserving more original spectral features in regions with ambiguous boundaries or weak index discriminability. This adaptive weighting mechanism helps alleviate the effects of spectral-index instability under varying environmental conditions and is one of the key reasons why gated SIGF outperforms fixed-weight fusion strategies.

4.3. Value of the Proposed NDNI

The ablation study of spectral-index selection (Table 6) shows that among the four indices, NDNI makes the most prominent contribution to overall segmentation performance. This result supports our basic hypothesis that the spectral contrast between the red-edge band (B5, 705 nm) and the blue band (B2, 490 nm) can provide strongly discriminative supplementary information for Noctiluca scintillans-dominated red tides. Its potential physical basis may be related to the relatively large cell size of Noctiluca, the more pronounced red-edge response, and the absorption characteristics of carotenoids in the blue band [57].
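To make the red-edge versus blue contrast concrete, the sketch below evaluates a normalized-difference index on B5 and B2. It assumes that NDNI takes the usual normalized-difference form (B5 − B2) / (B5 + B2), as its name suggests; the reflectance values are invented for illustration, and the small `eps` term is added here only to avoid division by zero.

```python
import numpy as np

def normalized_difference(a, b, eps=1e-6):
    """Generic normalized-difference index (a - b) / (a + b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return (a - b) / (a + b + eps)

# Hypothetical surface reflectances for three pixels.
b2 = np.array([0.030, 0.028, 0.035])   # blue band (B2)
b5 = np.array([0.020, 0.060, 0.034])   # red-edge band (B5)

ndni = normalized_difference(b5, b2)   # assumed NDNI form
print(np.round(ndni, 3))
```

Under this assumed form, a pixel whose red-edge reflectance exceeds its blue reflectance receives a positive value, and the index stays within [−1, 1] for non-negative reflectances.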

It should be noted that the effectiveness of NDNI may exhibit a certain degree of species dependence. For red tides dominated by other species, the optimal index formulation may not be the same. Therefore, the more important conclusion of this study does not lie in NDNI itself, but in the overall framework embodied by SIG-Net, namely the use of task-relevant domain indices as prior information to guide deep feature learning. Within this framework, the index set can be flexibly replaced with other, more suitable ones according to target algal species, regional environments, and sensor characteristics.

4.4. Limitations and Future Work

Although the proposed method achieves favorable experimental results, several limitations remain. First, although the dataset used in this study covers multiple coastal regions and time periods, it is still constructed only from a single sensor platform (Sentinel-2 MSI). The applicability of SIG-Net to other multispectral or hyperspectral sensors, such as Landsat-8/9 OLI, MODIS, and GOCI-II, requires further verification, especially with regard to the transferability of spectral indices under different band settings.

Second, although B5, B6, and B7 were resampled to the 10 m reference grid for band co-registration and pixel-wise tensor construction, their effective spatial information remains constrained by their native 20 m resolution. The resampled red-edge bands therefore provide spectral cues rather than newly generated 10 m spatial details. Future work may further examine super-resolution or multi-resolution fusion strategies for Sentinel-2 red-edge bands in red tide extraction.

Third, the current ground-truth labels mainly rely on expert visual interpretation and semi-automatic threshold segmentation, which may introduce a certain degree of subjectivity and boundary uncertainty. Because only red tide pixels with relatively clear visual and spectral responses were labeled as high-confidence positive samples, weak-response or early-stage bloom pixels may be under-represented in the current dataset. This conservative labeling strategy helps reduce false-positive labels but may also make the model less sensitive to ambiguous bloom boundaries. In the future, incorporating field-validation data, such as ship-based sampling, buoy observations, and fluorescence measurements, would help establish higher-quality ground-truth labels and improve the rigor of the evaluation.

Fourth, this study does not explicitly model the temporal evolution of red tide events, including stages such as occurrence, development, and decline. Extending SIG-Net to a multi-temporal remote sensing analysis framework is expected to further improve its ability to characterize dynamic red tide changes and enhance the detection of early red tide signals.

Finally, although SIG-Net achieves favorable performance while maintaining relatively low model complexity, further efforts are still needed before deployment in operational red tide monitoring scenarios, including near-real-time processing, automated data-stream ingestion, inference-efficiency optimization, and integration with existing early-warning systems. In addition, the current experimental results are still mainly based on limited regions and sample sizes. Future work should further validate the stability and generalization capability of the model across more sea areas, more red tide types, and cross-regional scenarios.

5. Conclusions

This study proposes SIG-Net (Spectral-Index-Guided Network), a dual-branch semantic segmentation architecture for red tide extraction from Sentinel-2 multispectral imagery. The core idea is to explicitly construct domain-relevant spectral indices, namely RGI, BGI, NDVI, and the NDNI proposed in this study, as an independent input branch and to incorporate index prior information into the deep feature learning process through the spectral-index-guided fusion (SIGF) module.

Through spatial-reduction cross-attention and a learnable gating mechanism, the SIGF module achieves adaptive fusion of spectral features and index features at multiple scales. Compared with simple early concatenation or symmetric fusion strategies, this design makes more effective use of the prior knowledge carried by spectral indices, thereby enhancing the model’s discriminative capability for red tide targets.

Experimental results on the Sentinel-2 red tide dataset demonstrate that SIG-Net outperforms baseline models such as U-Net, DeepLabV3+, and SegFormer in overall performance, as well as simpler dual-branch fusion strategies such as element-wise addition and concatenation. Ablation studies further verify the effectiveness of the individual components of the proposed method, among which both the SIGF module and the gating mechanism deliver stable performance gains, while NDNI shows the strongest contribution among the four spectral indices.

Overall, SIG-Net provides an effective framework for integrating remote sensing domain knowledge with deep learning for fine-grained red tide extraction. Although this study focuses on red tide detection, the dual-branch guided-fusion strategy also has potential for other remote sensing tasks that require the incorporation of domain prior information. Future work will further focus on multi-temporal analysis, multi-sensor fusion, and deployment for operational marine environmental monitoring.

Author Contributions

Conceptualization, H.L. and L.Z.; methodology, L.Z.; validation, L.Z., X.C. and Z.L.; formal analysis, L.Z.; writing—original draft preparation, L.Z.; writing—review and editing, H.L., X.C. and Z.L.; supervision, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The Sentinel-2 Level-2A data used in this study are freely available from the Copernicus Open Access platform (https://browser.dataspace.copernicus.eu/, accessed on 7 June 2026).

Acknowledgments

The authors sincerely thank the European Space Agency (ESA) for providing free Sentinel-2 satellite data and the Ministry of Natural Resources for providing the red tide event records.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zohdi, E.; Abbaspour, M. Harmful algal blooms (red tide): A review of causes, impacts, and approaches to monitoring and prediction. Int. J. Environ. Sci. Technol. 2019, 16, 1789–1806.
Hallegraeff, G.; Enevoldsen, H.; Zingone, A. Global harmful algal bloom status reporting. Harmful Algae 2021, 102, 101992.
Kershaw, J.; Jensen, S.; McConnell, B.; Fraser, S.; Cummings, C.; Lacaze, J.; Hermann, G.; Bresnan, E.; Dean, K.J.; Turner, A.; et al. Toxins from harmful algae in fish from Scottish coastal waters. Harmful Algae 2021, 105, 102068.
Zhang, S.; Arhonditsis, G.; Ji, Y.; Bryan, B.A.; Peng, J.; Zhang, Y.; Gao, J.; Zhang, J.; Cho, K.H.; Huang, J. Climate change promotes harmful algal blooms in China’s lakes and reservoirs despite significant nutrient control efforts. Water Res. 2025, 277, 123307.
Wiley, D.; McPherson, R.A. The role of climate change in the proliferation of freshwater harmful algal blooms in inland waterbodies of the United States. Earth Interact. 2024, 28, e230008.
Lan, J.; Liu, P.; Hu, X.; Zhu, S. Harmful algal blooms in eutrophic marine environments: Causes, monitoring, and treatment. Water 2024, 16, 2525.
Stauffer, B.; Bowers, H.; Buckley, E.; Davis, T.; Johengen, T.; Kudela, R.; McManus, M.; Purcell, H.; Smith, G.; Woude, A.V.; et al. Considerations in harmful algal bloom research and monitoring: Perspectives from a consensus-building workshop and technology testing. Front. Mar. Sci. 2019, 6, 399.
Wang, S.; Qin, B. Application of optical remote sensing in harmful algal blooms in lakes: A review. Remote Sens. 2025, 17, 1381.
Ahmad, H. High-resolution spatiotemporal monitoring of water quality and trophic status in Bay St. Louis using Sentinel-2 NDCI time series on Google Earth Engine. Trans. GIS 2025, 29, e70166.
Gernez, P.; Zoffoli, M.L.; Lacour, T.; Hernández Fariñas, T.; Navarro, G.; Caballero, I.; Harmel, T. The many shades of red tides: Sentinel-2 optical types of highly-concentrated harmful algal blooms. Remote Sens. Environ. 2023, 287, 113486.
Yue, L.; Luo, H.; Zhou, Z.; Xu, J.; Shen, H. Detection of the status of diatom blooms in the tributaries of the Yangtze River based on Sentinel-2 images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15.
Natarajan, L.; Chandrasekaran, M.; Vajravelu, M.; Shah, C.; Sivadas, S.K.; Ramu, K.; Ramana Murthy, M.V. High resolution Sentinel-2 and Sentinel-3 satellite imagery in monitoring Green Noctiluca scintillans blooms in complex coastal waters: A case study in Gulf of Mannar. J. Indian Soc. Remote Sens. 2025, 53, 791–802.
Caballero, I.; Fernández, R.; Escalante, O.M.; Mamán, L.; Navarro, G. New capabilities of Sentinel-2A/B satellites combined with in situ data for monitoring small harmful algal blooms in complex coastal waters. Sci. Rep. 2020, 10, 8743.
Ou, Z.; Li, X.; Jin, F.; Peng, S.; Liu, W.; Li, E.; Zhang, L. MABI: A novel Mixed Algal Blooms Index based on color space transformation. Mar. Pollut. Bull. 2025, 210, 117321.
Liu, R.; Xiao, Y.; Ma, Y.; Cui, T.; An, J. Red tide detection based on high spatial resolution broad band optical satellite data. ISPRS J. Photogramm. Remote Sens. 2022, 184, 131–147.
Jing, Y.; Feng, C.; Chen, T.; Zhu, Y.; Li, C.; Tao, B.; Song, Q. Use of GOCI-II images for detection of harmful algal blooms in the East China Sea. Geosci. Lett. 2024, 11, 2.
Zhang, C.; Tao, B.; Li, Y.; Ai, L.; Zhu, Y.; Liang, L.; Huang, H.; Li, C. Evaluation of Rayleigh-corrected reflectance on remote detection of algal blooms in optically complex coasts of East China Sea. Remote Sens. 2024, 16, 2304.
Colkesen, I.; Ozturk, M.Y.; Altuntas, O.Y. Comparative evaluation of performances of algae indices, pixel- and object-based machine learning algorithms in mapping floating algal blooms using Sentinel-2 imagery. Stoch. Environ. Res. Risk Assess. 2024, 38, 1613–1634.
Wang, N.; Tan, Z.; Yang, C.; Ma, J.; Duan, H. A deep learning approach for extracting cyanobacterial blooms in eutrophic lakes from satellite imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 13719–13732.
Sujatha, M.G.; Ranganathan, P.; Marsh, R.; Reza, H.; Korom, S. CMEAP: Cluster modeling with explainability for threshold identification of spectral indices in algal bloom prediction. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 24066–24087.
Bartelt, G.; You, J.; Hondzo, M. Remote cyanobacteria detection by multispectral drone imagery. Lake Reserv. Manag. 2024, 40, 236–247.
Li, J.; Cai, Y.; Li, Q.; Kou, M.; Zhang, T. A review of remote sensing image segmentation by deep learning methods. Int. J. Digit. Earth 2024, 17, 2328827.
Huang, L.; Jiang, B.; Lv, S.; Liu, Y.; Fu, Y. Deep-learning-based semantic segmentation of remote sensing images: A survey. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8370–8396.
Jonnala, N.S.; Bheemana, R.C.; Prakash, K.; Bansal, S.; Jain, A.; Pandey, V.; Faruque, M.R.I.; Al-Mugren, K.S. DSIA U-Net: Deep shallow interaction with attention mechanism UNet for remote sensing satellite images. Sci. Rep. 2025, 15, 549.
Zhu, E.; Chen, Z.; Wang, D.; Shi, H.; Liu, X.; Wang, L. UNetMamba: An efficient UNet-like Mamba for semantic segmentation of high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2024, 22, 1–5.
Shen, Y.; Shi, L.; Zhao, J.; Dong, Y.; Wang, L. Fully convolutional spectral-spatial fusion network integrating supervised contrastive learning for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 9077–9088.
Parelius, E.J. A review of deep-learning methods for change detection in multispectral remote sensing images. Remote Sens. 2023, 15, 2092.
Wang, J.; Miao, J.; Li, G.; Tan, Y.; Yu, S.; Liu, X.; Zeng, L.; Li, G. Pan-sharpening network of multi-spectral remote sensing images using two-stream attention feature extractor and multi-detail injection (TAMINet). Remote Sens. 2024, 16, 75.
Lei, D.; Luo, X.; Chen, H.; Zhang, L.; Liu, Q.; Li, W. A deep unrolling pansharpening method based on spectral consistency and double spatial priors. Int. J. Remote Sens. 2023, 44, 1842–1871.
Sharma, S.; Gosain, A. Addressing class imbalance in remote sensing using deep learning approaches: A systematic literature review. Evol. Intell. 2025, 18, 23.
Zhang, C. Interannual and decadal changes in harmful algal blooms in the coastal waters of Fujian, China. Toxins 2022, 14, 578.
Song, Y.; Li, M.; Liu, X.; Li, W.; Yao, H.; Liu, Y.; Chen, J. Effects of riverine nutrient enrichment and sediment reduction on high primary productivity zone in the Yangtze River estuary: Historical reconstruction and future perspective. Front. Mar. Sci. 2025, 12, 1529744.
Gao, L.; Li, D.; Ishizaka, J.; Zhang, Y.; Zong, H.; Guo, L. Nutrient dynamics across the river-sea interface in the Changjiang (Yangtze River) estuary-East China Sea region. Limnol. Oceanogr. 2015, 60, 1855–2235.
Hou, W.; Chen, X.; Ba, M.; Yu, J.; Chen, T.; Zhu, Y.; Bai, J. Characteristics of harmful algal species in the coastal waters of China from 1990 to 2017. Toxins 2022, 14, 160.
Li, X.; Duan, X.; He, X.; Xie, Y.; Yang, L.; Yin, P.; Cao, K.; Chen, B.; Gao, F.; Li, F. The relationships between vertical variations of shallow gas and pore water geochemical characteristics in boreholes from the inner shelf of the East China Sea. Front. Mar. Sci. 2024, 11, 1343701.
Li, J.; Roy, D. A global analysis of Sentinel-2A, Sentinel-2B and Landsat-8 data revisit intervals and implications for terrestrial monitoring. Remote Sens. 2017, 9, 902.
Kumari, A.; Karthikeyan, S. Sentinel-2 data for land use/land cover mapping: A meta-analysis and review. SN Comput. Sci. 2023, 4, 815.
Shoko, C.; Mutanga, O. Examining the strength of the newly launched Sentinel-2 MSI sensor in detecting and discriminating subtle differences between C3 and C4 grass species. ISPRS J. Photogramm. Remote Sens. 2017, 129, 32–40.
Wang, Q.; Liu, H.; Wang, D.; Li, D.; Liu, W.; Si, Y.; Liu, Y.; Li, J.; Duan, H.; Shen, M. Assessment of atmospheric correction algorithms for correcting sunglint effects in Sentinel-2 MSI imagery: A case study in clean lakes. Remote Sens. 2024, 16, 3060.
Lacroix, P.; Bièvre, G.; Pathier, E.; Kniess, U.; Jongmans, D. Use of Sentinel-2 images for the detection of precursory motions before landslide failures. Remote Sens. Environ. 2018, 215, 507–516.
Baetens, L.; Desjardins, C.; Hagolle, O. Validation of Copernicus Sentinel-2 cloud masks obtained from MAJA, Sen2Cor, and FMask processors using reference cloud masks generated with a supervised active learning procedure. Remote Sens. 2019, 11, 433.
Smith, M.E.; Lemley, D.; Whitfield, E.; Adams, J. Evaluation of Sentinel-2 for water quality monitoring in a eutrophic estuary in South Africa. Water SA 2025, 51, 181–190. [Google Scholar] [CrossRef]
Kabir, S.; Saranathan, A.M.; Barnes, B.B.; Ashapure, A.; O’Shea, R.E.; Stengel, V. Feasibility of PlanetScope SuperDove constellation for water quality monitoring of inland and coastal waters. Front. Remote Sens. 2025, 6, 1624783. [Google Scholar] [CrossRef]
Paulino, C.; Sánchez, S.; Alburqueque, E.; Lorenzo, A.; Grados, D. Detection of harmful algal blooms from satellite-based inherent optical properties of the ocean in Paracas Bay, Peru. Mar. Pollut. Bull. 2024, 201, 116173. [Google Scholar]
Ogashawara, I.; Kiel, C.; Jechow, A.; Kohnert, K.; Ruhtz, T.; Grossart, H.-P.; Hölker, F.; Nejstgaard, J.C.; Berger, S.A.; Wollrab, S. The use of Sentinel-2 for chlorophyll-a spatial dynamics assessment: A comparative study on different lakes in Northern Germany. Remote Sens. 2021, 13, 1542. [Google Scholar]
Moradi, M.; Kabiri, K. Spatio-temporal variability of red-green chlorophyll-a index from MODIS data: Case study of Chabahar Bay, SE of Iran. Cont. Shelf Res. 2019, 184, 1–9. [Google Scholar]
Narayanan, A.; Reynolds, R.A.; Stramski, D. Variations in phytoplankton assemblages in the western Arctic seas as evidenced by changes in pigment composition and associated spectra of light absorption. J. Geophys. Res. Oceans 2025, 130, e2024JC021769. [Google Scholar] [CrossRef]
Oyama, Y.; Matsushita, B.; Fukushima, T. Distinguishing surface cyanobacterial blooms and aquatic macrophytes using Landsat/TM and ETM+ shortwave infrared bands. Remote Sens. Environ. 2015, 157, 35–47. [Google Scholar] [CrossRef]
Qi, L.; Tsai, S.; Chen, Y.; Le, C.; Hu, C. In search of red Noctiluca scintillans blooms in the East China Sea. Geophys. Res. Lett. 2019, 46, 5997–6004. [Google Scholar] [CrossRef]
Shaju, S.S.; Akula, R.; Jabir, T. Characterization of light absorption coefficient of red Noctiluca scintillans bloom in the southeastern Arabian Sea. Oceanologia 2018, 60, 419–425. [Google Scholar] [CrossRef]
Yeung, M.; Sala, E.; Schönlieb, C.; Rundo, L. Unified focal loss: Generalising Dice and cross entropy-based losses to handle class imbalanced medical image segmentation. Comput. Med. Imaging Graph. 2022, 95, 102026. [Google Scholar] [CrossRef]
Abdian, A.Z.; Javidi, M.; Mansouri, N. Nova: A novel optimizer integrating Nesterov momentum, AMSGrad, and decoupled weight decay for deep learning. Signal Image Video Process. 2025, 19, 722. [Google Scholar] [CrossRef]
Everingham, M.; Eslami, S.M.A.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
Wang, J.; Zhang, X.; Yan, T.; Tan, A. DPNet: Dual-pyramid semantic segmentation network based on improved Deeplabv3 Plus. Electronics 2023, 12, 3161. [Google Scholar] [CrossRef]
Elmessery, W.M.; Maklakov, D.; El-Messery, T.M.; Baranenko, D.A.; Gutiérrez, J.; Shams, M.Y.; El-Hafeez, T.A.; Elsayed, S.; Alhag, S.K.; Moghanm, F.; et al. Semantic segmentation of microbial alterations based on SegFormer. Front. Plant Sci. 2024, 15, 1352935. [Google Scholar] [CrossRef] [PubMed]
Hikmah, N.I.; Manurung, P. Application of spectral indices and deep learning (convolutional neural network model) on land cover change analysis. Appl. Environ. Sci. 2025, 3, 39–60. [Google Scholar] [CrossRef]
Martinez-Vicente, V.; Kurekin, A.; Sá, C.; Brotas, V.; Amorim, A.; Veloso, V.; Lin, J.; Miller, P.I. Sensitivity of a satellite algorithm for harmful algal bloom discrimination to the use of laboratory bio-optical data for training. Front. Mar. Sci. 2020, 7, 582960. [Google Scholar] [CrossRef]

Figure 1. Study area and Sentinel-2 scenes used in this study. The light-yellow squares denote Sentinel-2A MSI L2A scenes, and the green squares denote Sentinel-2B MSI L2A scenes. The Yangtze River, Pearl River Estuary, and Beibu Gulf are marked to provide geographic context for the study areas.

Figure 2. Locations of the training samples (a) and the corresponding ground truth (b). Only a small example is shown here; the full-scene data were cropped into patches for actual training. The Sentinel-2 L2A image is a true-color composite of bands 4 (R), 3 (G), and 2 (B). The red pixels in (b) represent red tide, whereas the black pixels represent seawater and clouds.

Figure 3. Overall architecture of SIG-Net.

Figure 4. Qualitative comparison of different methods on the Sentinel-2 red tide test set. (a) Original RGB; (b) Ground truth; (c) U-Net; (d) DeepLabV3+ (R50); (e) SegFormer (MiT-B2), 7 bands; (f) SegFormer (MiT-B2), 11 bands; (g) SIG-Net (this study).

Figure 5. Enlarged local comparison of different methods on the Sentinel-2 red tide test set. (a) Original RGB; (b) Ground truth; (c) U-Net; (d) DeepLabV3+ (R50); (e) SegFormer (MiT-B2), 7 bands; (f) SegFormer (MiT-B2), 11 bands; (g) SIG-Net (this study).

Figure 6. Comparison of segmentation performance under different feature-fusion strategies. (a) Original RGB; (b) Element-wise addition; (c) Concatenation followed by 1 × 1 convolution; (d) SIGF (cross-attention) without gating; (e) SIGF (cross-attention) with gating.

Figure 7. Comparison of segmentation performance under different spectral-index configurations. (a) RGB image; (b) Full index set (RGI+BGI+NDVI+NDNI); (c) Remove BGI; (d) Remove NDVI; (e) Remove NDNI; (f) Remove RGI; (g) No indices (spectral only).

Table 1. Sentinel-2A/2B band specifications.

Band No.	Band Name	Spectral Range (nm)	Central Wavelength (nm)	Resolution (m)
1	Coastal aerosol	433–453	443	60
2	Blue	458–523	490	10
3	Green	543–578	560	10
4	Red	650–680	665	10
5	Red Edge 1	698–713	705	20
6	Red Edge 2	733–748	740	20
7	Red Edge 3	773–793	783	20
8	NIR	785–900	842	10
8A	Narrow NIR	855–875	865	20
9	Water vapour	935–955	945	60
10	SWIR-Cirrus	1360–1390	1375	60
11	SWIR 1	1565–1655	1610	20
12	SWIR 2	2100–2280	2190	20

Table 2. Sentinel-2 scenes used in this study.

Satellite	Date	Data	Region	Purpose
Sentinel-2A MSI	18 August 2020	S2A_MSIL2A_20200818T022601_N0500_R046_T51SWS_20230319T041945	Yellow Sea	Applicability test
	18 August 2020	S2A_MSIL2A_20200818T022601_N0500_R046_T51SXR_20230319T041945	Yellow Sea	Design/validation
	18 August 2020	S2A_MSIL2A_20200818T022601_N0500_R046_T51SXS_20230319T041945	Yellow Sea	Design/validation
Sentinel-2B MSI	14 February 2021	S2B_MSIL2A_20210214T031829_N0500_R118_T48QZH_20230519T010352	South China Sea	Applicability test
	14 February 2021	S2B_MSIL2A_20210214T031829_N0500_R118_T48QZJ_20230519T010352	South China Sea	Design/validation
	14 February 2021	S2B_MSIL2A_20210214T031829_N0500_R118_T49QBC_20230519T010352	South China Sea	Design/validation
	14 February 2021	S2B_MSIL2A_20210214T031829_N0500_R118_T49QBD_20230519T010352	South China Sea	Design/validation
	12 March 2022	S2B_MSIL2A_20220312T024549_N0510_R132_T49QHF_20240529T184709	South China Sea	Design/validation
	12 March 2022	S2B_MSIL2A_20220312T024549_N0510_R132_T50QKL_20240529T184709	South China Sea	Design/validation

Table 3. Spectral indices used in SIG-Net.

Index	Formula	Description
RGI	B4/B3	Ratio of red-band to green-band reflectance, used to characterize differences in visible-band reflectance
BGI	B2/B3	Ratio of blue-band to green-band reflectance, used to characterize blue–green differences caused by pigment absorption
NDVI	(B8 − B4)/(B8 + B4)	Normalized difference between the near-infrared and red bands, used to characterize anomalous near-surface targets
NDNI	(B5 − B2)/(B5 + B2)	Normalized difference between the red-edge and blue bands, used to enhance the spectral contrast of Noctiluca red tides
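
The formulas in Table 3 can be evaluated per pixel from band reflectances. The following is a minimal sketch, not the authors' implementation: the function name `spectral_indices`, the argument names, and the small `eps` term that guards against division by zero are assumptions introduced here for illustration. Because only arithmetic operators are used, the same function should also work element-wise on NumPy arrays of equal shape.

```python
def spectral_indices(b2, b3, b4, b5, b8, eps=1e-6):
    """Compute the four Table 3 indices from Sentinel-2 band reflectances.

    b2, b3, b4, b5, b8: reflectance of bands 2 (blue), 3 (green), 4 (red),
    5 (red edge 1), and 8 (NIR). Scalars or same-shaped arrays.
    eps: hypothetical stabiliser (not in Table 3) to avoid division by zero.
    """
    rgi = b4 / (b3 + eps)                # red-green ratio, B4/B3
    bgi = b2 / (b3 + eps)                # blue-green ratio, B2/B3
    ndvi = (b8 - b4) / (b8 + b4 + eps)   # (B8 - B4)/(B8 + B4)
    ndni = (b5 - b2) / (b5 + b2 + eps)   # (B5 - B2)/(B5 + B2)
    return {"RGI": rgi, "BGI": bgi, "NDVI": ndvi, "NDNI": ndni}


# Illustrative reflectance values only (not taken from the dataset).
indices = spectral_indices(b2=0.04, b3=0.05, b4=0.06, b5=0.08, b8=0.02)
print(indices)
```

In this made-up pixel the red-edge reflectance exceeds the blue reflectance, so NDNI is positive, which is the kind of contrast the index is designed to enhance.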

Table 4. Quantitative comparison of SIG-Net and baseline methods on the Sentinel-2 red tide test set. The best value in each metric column is shown in bold.

Method	Input	mIoU (%)	mPrecision (%)	mRecall (%)	mF1 (%)	Parameters (M)
U-Net	11ch	73.27	76.69	84.91	80.02	14.8
DeepLabV3+ (R50)	11ch	72.94	75.56	84.37	81.22	43.6
SegFormer (MiT-B2)	7ch	71.51	75.88	86.23	80.18	24.7
SegFormer (MiT-B2)	11ch	70.42	75.32	84.28	79.11	24.7
SIG-Net (this study)	7+4ch	75.31	81.60	86.03	83.67	~27.9
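
For readers who want to reproduce metrics of the kind reported in Table 4, the sketch below computes class-averaged IoU, precision, recall, and F1 from flattened label masks. It assumes that the "m" prefix denotes an unweighted mean over the two classes (background and red tide) and that F1 is computed per class before averaging; these conventions, together with the function name `segmentation_metrics`, are assumptions rather than details confirmed in this section.

```python
def segmentation_metrics(pred, truth, num_classes=2):
    """Macro-averaged IoU, precision, recall, and F1 for label sequences.

    pred, truth: equal-length sequences of integer class labels
    (assumed here: 0 = background, 1 = red tide).
    """
    sums = {"iou": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0}
    for c in range(num_classes):
        pairs = list(zip(pred, truth))
        tp = sum(1 for p, t in pairs if p == c and t == c)
        fp = sum(1 for p, t in pairs if p == c and t != c)
        fn = sum(1 for p, t in pairs if p != c and t == c)
        # Empty denominators are mapped to 0.0 (an illustrative choice).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        sums["iou"] += iou
        sums["precision"] += precision
        sums["recall"] += recall
        sums["f1"] += f1
    # Unweighted mean over classes gives the "m" metrics.
    return {"m" + name: total / num_classes for name, total in sums.items()}


# Toy masks for illustration only.
truth = [1, 1, 0, 0, 0, 0]
pred = [1, 0, 0, 0, 0, 1]
print(segmentation_metrics(pred, truth))
```

Under this convention, mF1 is the mean of per-class F1 scores, so it need not equal the harmonic mean of mPrecision and mRecall.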

Table 5. Ablation study of fusion strategies.

Fusion Strategy	Gating	mIoU (%)	mPrecision (%)	mRecall (%)	mF1 (%)
Element-wise addition	—	73.39	78.07	87.13	81.96
Concatenation + 1 × 1 convolution	—	73.18	78.25	86.27	81.75
SIGF (cross-attention)	✘	72.72	77.93	85.68	81.33
SIGF (cross-attention)	✔	75.31	81.60	86.03	83.67

Table 6. Ablation study of spectral index selection.

Index Configuration	No. of Additional Index Channels	mIoU (%)	mPrecision (%)	mRecall (%)	mF1 (%)
Full set (RGI+BGI+NDVI+NDNI)	4	75.31	81.60	86.03	83.67
Remove BGI	3	74.77	81.81	84.71	83.20
Remove NDVI	3	74.56	83.74	82.30	83.00
Remove NDNI	3	73.62	78.48	86.96	82.17
Remove RGI	3	75.03	81.13	86.07	83.42
No indices (spectral only)	0	71.51	75.88	86.23	80.18


