1. Introduction
Red tide, also referred to as a harmful algal bloom (HAB), is a natural phenomenon characterized by rapid phytoplankton proliferation in marine and coastal waters [1]. Such events can cause large-scale fish mortality, shellfish biotoxin contamination, water-quality deterioration, and substantial economic losses to the aquaculture and tourism sectors [2,3]. Driven by intensified coastal eutrophication associated with human activities and climate change, the frequency, intensity, and geographic extent of red tide events have increased markedly over recent decades [4,5]. Accordingly, the development of rapid, large-area, and accurate red tide monitoring capabilities has become an urgent need in marine environmental management [6,7].
Satellite remote sensing offers a unique synoptic perspective for monitoring large-scale marine red tide dynamics [8]. Medium-resolution multispectral sensors, such as the Multispectral Instrument (MSI) aboard Sentinel-2, provide 10–20 m spatial resolution, a 5-day revisit cycle, and free data access, making them particularly suitable for operational red tide monitoring [9,10]. The visible and near-infrared (VNIR) bands of Sentinel-2 capture spectral signals associated with red tide characteristics, including elevated chlorophyll-a concentrations, changes in water-leaving radiance, and the formation of surface scums [11,12].
Recent studies using medium- and high-spatial-resolution satellite imagery have further advanced red tide and algal-bloom monitoring in optically complex coastal waters. Sentinel-2 MSI has been used to map small dinoflagellate blooms in complex coastal waters, characterize optical types of highly concentrated red tides, monitor bloom status in riverine and coastal environments, and develop new spectral or color-space indices for mixed harmful algal blooms [10,11,12,13,14]. In Chinese coastal waters, high-spatial-resolution broad-band optical satellite data have been applied to fine-scale red tide detection, while recent GOCI-II studies have improved the detection and species-related discrimination of algal blooms in the East China Sea [15,16,17]. Recent Sentinel-2 index-based and machine-learning studies also show that combining spectral indices with data-driven classifiers can improve floating algal-bloom mapping [18]. These studies demonstrate the value of higher spatial resolution and richer spectral information for delineating narrow, patchy, and nearshore bloom features, but they also show that spectral variability, turbidity, sunglint, clouds, and mixed pixels remain important sources of uncertainty.
Traditional remote sensing methods for red tide detection rely heavily on expert-designed spectral indices and threshold-based classification [19]. The normalized difference vegetation index (NDVI), red-green index (RGI), blue-green index (BGI), and various chlorophyll-absorption indices have been widely used to distinguish bloom pixels from background water [20,21]. Although these indices effectively encode domain-specific spectral knowledge, such as the red-edge reflectance peak of chlorophyll-containing organisms, their performance is often constrained by threshold selection, atmospheric conditions, water turbidity, and spectral variability among algal species.
In recent years, deep learning-based semantic segmentation has emerged as a powerful alternative for pixel-level classification of remote sensing imagery [22]. Architectures such as U-Net, DeepLabV3+, and, more recently, SegFormer have achieved strong performance across a range of remote sensing tasks, including land-cover mapping, building extraction, and crop classification [23,24,25]. These methods learn hierarchical feature representations directly from raw pixel data and can capture complex spatial-spectral patterns that are difficult to express using handcrafted indices alone [26].
From a methodological perspective, existing red tide detection approaches can be broadly divided into spectral-feature-based methods and learning-based methods. Spectral-feature-based methods use characteristic band responses, spectral indices, and threshold rules to exploit the optical signatures of bloom waters, and they are physically interpretable and computationally efficient. However, their transferability can be limited by regional water optical conditions, dominant algal species, atmospheric correction uncertainty, and manually selected thresholds. Learning-based methods, especially CNN- and Transformer-based semantic segmentation networks, can learn nonlinear spatial-spectral representations from training samples, but many existing models treat all multispectral bands or index channels as homogeneous inputs and do not explicitly model the prior knowledge embedded in spectral indices. This gap motivates the spectral-index-guided dual-branch design proposed in this study.
However, applying existing deep learning methods to red tide extraction from multispectral imagery still presents several challenges [27]. First, the limited spectral dimensionality of multispectral data constrains the discriminative power of purely data-driven feature learning, especially when training samples are scarce [28]. Second, standard architectures treat all input channels equally and do not exploit the physical meaning carried by specific band combinations, namely spectral indices [29]. Third, red tide pixels usually occupy only a small proportion of an image, resulting in severe class imbalance and further increasing the difficulty of learning [30].
A natural strategy for addressing these limitations is to combine spectral indices with the original bands as auxiliary inputs. The simplest approach, namely early fusion, concatenates the indices with the original bands to form an extended input tensor. Although straightforward, early fusion forces the network to learn representations from heterogeneous data sources (reflectance values and index ratios) within a single encoder, which may not be optimal. An alternative is a dual-branch architecture, in which separate encoders process the two data sources independently before feature fusion. The key question is how to fuse the two branches effectively so that spectral-index information genuinely guides spectral feature learning rather than merely increasing channel dimensionality.
This study proposes SIG-Net (Spectral-Index-Guided Network), a novel dual-branch semantic segmentation framework that explicitly leverages spectral-index priors to guide multispectral feature extraction for red tide mapping. The main contributions of this work are as follows:
(1) A dual-branch backbone architecture is designed, in which a MiT-B2 spectral encoder processes the original Sentinel-2 bands and a lightweight four-stage CNN index encoder processes the spectral indices. The two branches share the same spatial downsampling scheme (1/4, 1/8, 1/16, and 1/32), enabling multi-scale feature alignment and fusion.
(2) A spectral-index-guided fusion (SIGF) module is proposed. It employs a spatially reduced cross-attention mechanism in which index-branch features serve as keys and values to guide query features from the spectral branch. A learnable gating mechanism adaptively controls the fusion ratio between the original spectral features and the attention-enhanced features at each scale.
(3) A normalized difference Noctiluca index (NDNI), defined as (B5 − B2)/(B5 + B2), is introduced. It exploits the red-edge scattering and blue-band absorption characteristics of Noctiluca-type red tide organisms and complements the traditional RGI, BGI, and NDVI indices.
(4) Comprehensive experiments are conducted on a Sentinel-2 red tide dataset covering multiple coastal regions, demonstrating that SIG-Net significantly outperforms single-branch baselines (U-Net, DeepLabV3+, and SegFormer) and alternative fusion strategies (early fusion, concatenation, and element-wise addition). Detailed ablation studies further verify the effectiveness of each component.
The remainder of this paper is organized as follows. Section 2 describes the study area, Sentinel-2 data, the proposed method, and experimental settings. Section 3 presents the experimental results, including baseline comparisons and ablation studies. Section 4 discusses the findings and their implications. Section 5 concludes the paper.
2. Materials and Methods
2.1. Study Area
The study area comprises the coastal waters of Jiangsu, Guangdong, and Guangxi, which are among the major red-tide-prone regions of China (Figure 1). Under the combined influence of coastal eutrophication, complex hydrodynamic processes, and intensive human activities, red tide events occur frequently during the warm season in these regions [31]. Specifically, the Jiangsu coast is strongly affected by Yangtze River diluted water and high nearshore suspended-sediment concentrations; the Guangdong coast is influenced by Pearl River runoff and the dispersion of the Pearl River estuarine plume; and the Guangxi coast is mainly characterized by the semi-enclosed environment of the Beibu Gulf and regional alongshore transport processes. Together, these environmental controls create favorable conditions for rapid phytoplankton growth and aggregation from May to September [32,33]. Common dominant red tide organisms in the region include Noctiluca scintillans, Prorocentrum donghaiense, and Karenia mikimotoi, and different algal species exhibit distinct spectral response characteristics from the visible to near-infrared bands [34].
The study area is approximately located between 20° and 35°N and between 108° and 122°E and encompasses a range of representative marine environments, including the nearshore shallow waters of Jiangsu, the waters adjacent to the Pearl River Estuary in Guangdong, and the coastal waters of the Beibu Gulf in Guangxi [35]. These regions include estuarine mixed waters, highly turbid nearshore waters, semi-enclosed bays, and outer shelf seas and therefore exhibit pronounced regional heterogeneity and optical complexity. Using these areas as study sites facilitates a systematic evaluation of the robustness and applicability of the proposed method under different nutrient regimes, suspended-sediment concentrations, water-depth conditions, and hydrodynamic settings, thereby improving the model’s generalization ability for typical coastal red tide monitoring scenarios in China.
2.2. Sentinel-2 Data and Preprocessing
Sentinel-2 is an optical Earth-observation satellite mission implemented by the European Space Agency (ESA) under the European Union’s Copernicus Programme and consists of two satellites, Sentinel-2A and Sentinel-2B [36]. Although primarily designed for land monitoring, it has also been widely used for remote sensing of coastal and inland waters, providing high-resolution multispectral imagery for agricultural monitoring, forest management, land-use change detection, water-resource assessment, and water-environment monitoring [37]. Sentinel-2 carries the Multispectral Instrument (MSI), which acquires 13 spectral bands spanning the visible, near-infrared, and shortwave infrared regions at spatial resolutions of 10, 20, and 60 m [38]. The coordinated operation of the twin satellites provides an approximately 5-day revisit interval at the equator, enabling continuous and stable remote sensing observations of global land surfaces and coastal regions [39,40,41,42,43,44].
In this study, Sentinel-2 Multispectral Instrument (MSI) Level-2A (L2A) surface reflectance products were used as the primary remote sensing data source. These products are atmospherically corrected using the Sen2Cor processor and provide bottom-of-atmosphere (BOA) reflectance. In view of the spectral response characteristics of red tide targets in the visible to near-infrared range, seven bands closely related to aquatic remote sensing and algal bloom identification were selected for subsequent analysis, namely B2, B3, B4, B5, B6, B7, and B8 (Table 1).
The original Sentinel-2 L2A imagery was acquired from 2020 to 2022. Image samples were selected according to red tide event information reported in the China Marine Disaster Bulletin and cover several representative red tide occurrence areas along the coasts of Jiangsu, Guangdong, and Guangxi (Table 2). Because the selected Sentinel-2 MSI bands have different native spatial resolutions, B5, B6, and B7 were resampled from 20 m to the 10 m reference grid of B2/B3/B4/B8 before model input construction. This resampling was conducted for band co-registration and pixel-wise tensor alignment, and it does not imply that new spatial information beyond the native 20 m resolution of the red-edge bands was generated. The resampled red-edge bands were used mainly as spectral cues for red tide discrimination rather than as independently enhanced 10 m spatial features. This common-grid preprocessing is consistent with common Sentinel-2 multispectral analysis workflows, including previous Sentinel-2 water-quality studies in which 20 m bands were resampled to a 10 m grid before analysis [45], and was applied consistently to all compared models, ensuring that the relative comparison among methods remains fair. No-data pixels were masked during subsequent training, inference, and statistical analysis.
To improve sample usability and training consistency, the imagery was further cropped, registered, and normalized after band resampling and reflectance conversion, and a red tide semantic segmentation dataset was constructed in combination with manual annotations. The preprocessed multispectral bands and corresponding labels jointly served as the basis for training and testing the SIG-Net model.
The dataset used in this study was derived from nine Sentinel-2 L2A scenes acquired at three time points. To meet the training requirements of deep learning models, each scene was cropped into 512 × 512-pixel patches using a sliding-window strategy with 128-pixel overlap between adjacent patches, generating a total of 5148 image patches (Figure 2). The ground-truth labels were constructed through a combined event-record, spectral-threshold, and expert-interpretation procedure. First, red tide event records from the China Marine Disaster Bulletin and related regional reports were used to determine the approximate occurrence dates and sea areas. Second, Sentinel-2 true-color and false-color composites, together with red-tide-related spectral indices, were visually inspected to identify candidate bloom regions. Third, semi-automatic threshold segmentation was used to generate preliminary red tide masks based on the spectral contrast between bloom water and background water. These masks were then manually checked and corrected by interpreters with remote-sensing and marine-environment knowledge according to spectral response, spatial continuity, coastline context, and cloud masks or invalid-pixel masks. Finally, the corrected masks were converted into binary semantic labels, in which red tide pixels were assigned class 1 and background water and other non-red-tide areas were assigned class 0, with invalid pixels excluded where appropriate.
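The sliding-window cropping can be illustrated with a minimal sketch. It assumes that a 128-pixel overlap corresponds to a stride of 512 − 128 = 384 pixels and that a final window is shifted back to the image border so that no pixels are dropped; this border handling and the function names `window_origins` and `patch_boxes` are illustrative assumptions rather than details reported here.

```python
def window_origins(length, patch=512, overlap=128):
    """Start offsets of the sliding windows along one image axis."""
    stride = patch - overlap  # 384 px between adjacent window starts
    if length <= patch:
        return [0]
    origins = list(range(0, length - patch + 1, stride))
    # Assumption: add one last window aligned to the border if pixels remain.
    if origins[-1] + patch < length:
        origins.append(length - patch)
    return origins


def patch_boxes(height, width, patch=512, overlap=128):
    """(row, col, row_end, col_end) boxes for every patch of one scene."""
    return [
        (r, c, r + patch, c + patch)
        for r in window_origins(height, patch, overlap)
        for c in window_origins(width, patch, overlap)
    ]


if __name__ == "__main__":
    # Hypothetical 1280 x 1000 pixel crop, used only to show the arithmetic.
    print(window_origins(1280))          # row offsets
    print(window_origins(1000))          # column offsets
    print(len(patch_boxes(1280, 1000)))  # number of patches
```

Under these assumptions, a 1280-pixel axis is covered by three windows and a 1000-pixel axis by three windows, the last of which is shifted back to the border.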
In constructing the labels, we conservatively labeled as red tide only those pixels with clear visual, spectral, and spatial continuity characteristics. Ambiguous transition areas, weak-response pixels, and regions strongly affected by clouds, sunglint, suspended sediments, or mixed water signals were treated cautiously to reduce label noise. This strategy improves the reliability of positive samples but may lead to under-representation of weak or early-stage blooms and may make the model more conservative near red tide boundaries.
Adjacent patches from the same scene are spatially autocorrelated, so a patch-level random split would leak information among the training, validation, and test sets and bias model evaluation. To avoid this, a scene-level split strategy was adopted. Specifically, four scenes (2272 patches) were used for training, three scenes (1309 patches) for validation, and two scenes (1567 patches) for testing. This partitioning strategy enables a more rigorous evaluation of the model’s generalization ability to unseen scenes.
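A scene-level split can be sketched as follows. The function name `scene_level_split` and the scene identifiers in the usage example are hypothetical; the sketch only shows that every patch inherits the subset of its parent scene and that no scene is shared between subsets.

```python
def scene_level_split(patch_scene_ids, train_scenes, val_scenes, test_scenes):
    """Assign patch indices to subsets according to their parent scene."""
    subsets = {
        "train": set(train_scenes),
        "val": set(val_scenes),
        "test": set(test_scenes),
    }
    # A scene shared by two subsets would reintroduce spatial leakage.
    names = list(subsets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = subsets[first] & subsets[second]
            if shared:
                raise ValueError(f"scenes in both {first} and {second}: {shared}")

    split = {name: [] for name in subsets}
    for index, scene in enumerate(patch_scene_ids):
        for name, scenes in subsets.items():
            if scene in scenes:
                split[name].append(index)
    return split


if __name__ == "__main__":
    # Hypothetical scene identifiers for six patches.
    ids = ["s1", "s1", "s2", "s3", "s3", "s3"]
    print(scene_level_split(ids, ["s1"], ["s2"], ["s3"]))
    # The reported subset sizes add up to the full dataset.
    print(2272 + 1309 + 1567)
```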
2.3. Spectral Indices
A key innovation of this study is the explicit incorporation of spectral indices as an independent input branch. Given the band-dependent response of red tide targets and the prior information encoded in spectral indices, four indices closely related to red tide identification were selected as auxiliary features (Table 3).
The red-green index (RGI) is defined as B4/B3. This index characterizes the relative difference in reflectance between the red and green bands and is sensitive to changes in the visible-band reflectance characteristics of red tide waters [46]. Under high-concentration bloom conditions, RGI often increases and can therefore provide auxiliary discriminative information for red tide identification.
The blue-green index (BGI) is defined as B2/B3. This index reflects the relative difference in reflectance between the blue and green bands and can be used to characterize blue-light attenuation caused by the absorption of pigments such as chlorophyll and carotenoids [47]. In some red tide scenarios, lower BGI values usually correspond to stronger algal-pigment absorption signals and can therefore serve as a supplementary indicator for identifying anomalous bloom waters.
The normalized difference vegetation index (NDVI) is defined as (B8 − B4)/(B8 + B4). Although NDVI was originally developed for terrestrial vegetation monitoring, previous studies have extended its application to certain water-environment identification tasks. In red tide scenarios, when high-density near-surface algal aggregations or floating surface coverage are present, near-infrared reflectance may be relatively enhanced, allowing NDVI to respond to anomalous water targets [48].
The normalized difference Noctiluca index (NDNI) is defined as (B5 − B2)/(B5 + B2). This index is proposed in this study to enhance the spectral response of Noctiluca scintillans-dominated red tides commonly found in China’s coastal waters [49]. Carotenoids in Noctiluca cells exhibit strong absorption in the blue band (B2, 490 nm), whereas their relatively large cell size may lead to more pronounced scattering in the red-edge band (B5, 705 nm) [50]. Therefore, constructing a normalized difference index using B5 and B2 helps highlight the spectral contrast between blue-band absorption and the red-edge response in Noctiluca red tides.
To fully exploit the complementarity of these spectral indices, they were not simply concatenated with the original bands as additional channels. Instead, features were extracted through an independent index branch, and the spectral indices were then used to guide learning of the original multispectral features during the fusion stage, thereby improving the model’s ability to identify red tide targets.
During data preprocessing, four spectral indices were calculated from the seven reflectance bands and prepared together with the original bands as an 11-channel tensor for each image patch (7 original bands + 4 spectral indices). This unified tensor was used only to maintain consistent preprocessing and augmentation before the network input was split by modality. To avoid numerical instability caused by denominators close to zero, a small constant ε = 10⁻⁶ was added to the denominator during index calculation, and the resulting index values were clipped to the range [−10, 10]. Any residual NaN or Inf values generated during the calculation were uniformly replaced with 0.
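The index calculation and its numerical safeguards can be sketched in NumPy as follows. The band order (B2, B3, B4, B5, B6, B7, B8), the index order (RGI, BGI, NDVI, NDNI), and the function names are assumptions made for illustration; the definitions, the constant ε, the clipping range, and the NaN/Inf replacement follow the description above.

```python
import numpy as np

EPS = 1e-6  # small constant added to every denominator


def spectral_indices(bands):
    """bands: float array (7, H, W), assumed order B2, B3, B4, B5, B6, B7, B8."""
    b2, b3, b4, b5, _b6, _b7, b8 = bands
    with np.errstate(divide="ignore", invalid="ignore"):
        rgi = b4 / (b3 + EPS)
        bgi = b2 / (b3 + EPS)
        ndvi = (b8 - b4) / (b8 + b4 + EPS)
        ndni = (b5 - b2) / (b5 + b2 + EPS)
    indices = np.stack([rgi, bgi, ndvi, ndni])
    # Clipping bounds extreme ratios (and +/-Inf) to [-10, 10]; NaN passes through.
    indices = np.clip(indices, -10.0, 10.0)
    # Any value that is still non-finite (e.g. from no-data pixels) becomes 0.
    return np.nan_to_num(indices, nan=0.0, posinf=0.0, neginf=0.0)


def build_input(bands):
    """Stack 7 bands and 4 indices into the 11-channel patch tensor."""
    return np.concatenate([bands, spectral_indices(bands)], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch = rng.uniform(0.0, 0.3, size=(7, 4, 4))  # synthetic reflectance
    print(build_input(patch).shape)
```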
In the SIG-Net architecture, once the 11-channel input enters the network, it is split by modality: the first seven channels are fed into the spectral encoder, and the remaining four channels are fed into the index encoder. Therefore, although the original bands and spectral indices are prepared in a unified tensor before input, the model does not adopt simple early-concatenation fusion. Instead, a dual-branch structure is used to extract the two types of features separately, followed by guided fusion at later stages. This design preserves the independent representational capacity of different input modalities while allowing spectral indices to guide multispectral feature learning.
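The modality split amounts to slicing the channel axis. A minimal sketch, assuming a batch layout of (B, 11, H, W) with the seven spectral channels first; the function name `split_modalities` is hypothetical:

```python
import numpy as np

N_SPECTRAL = 7  # original Sentinel-2 bands
N_INDEX = 4     # spectral indices


def split_modalities(x):
    """Split a (B, 11, H, W) batch into spectral and index inputs."""
    if x.shape[1] != N_SPECTRAL + N_INDEX:
        raise ValueError(f"expected 11 channels, got {x.shape[1]}")
    spectral = x[:, :N_SPECTRAL]  # fed to the spectral encoder
    index = x[:, N_SPECTRAL:]     # fed to the index encoder
    return spectral, index


if __name__ == "__main__":
    batch = np.zeros((2, 11, 512, 512), dtype=np.float32)
    spectral, index = split_modalities(batch)
    print(spectral.shape, index.shape)
```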
2.4. Proposed Method: SIG-Net Architecture
This section describes the architecture of SIG-Net in detail. As shown in Figure 3, SIG-Net adopts a dual-branch encoder–decoder structure comprising three core components: (1) a spectral encoder based on a Mix Vision Transformer (MiT-B2), which extracts hierarchical spatial-spectral representations from the original Sentinel-2 bands; (2) a lightweight CNN-based index encoder, which extracts multi-scale features from the spectral indices; and (3) a set of spectral-index-guided fusion (SIGF) modules, which adaptively integrate features from the two branches at multiple scales through spatial-reduction cross-attention and a gated fusion mechanism. The fused multi-scale features are then fed into a SegFormer-style MLP decoder head for pixel-wise red tide classification.
2.4.1. Spectral Encoder
The spectral encoder adopts the Mix Vision Transformer (MiT-B2) architecture, which has shown strong performance in dense prediction tasks. MiT-B2 is a hierarchical Vision Transformer that generates multi-scale feature maps through four successive stages, each composed of Patch Embedding, efficient self-attention layers, and Mix-FFN (a feed-forward network incorporating depthwise convolution).
In this implementation, the MiT-B2 encoder processes the 7-channel spectral input with the following configuration: the initial Patch Embedding uses a 7 × 7 convolution with a stride of 4 (stage 1), followed by 3 × 3 convolutions with a stride of 2 in the next three stages. The number of Transformer layers in the four stages is (3, 4, 6, 3), with corresponding numbers of attention heads (1, 2, 5, 8). The base embedding dimension is 64, and the channel dimensions of the four stages are (64, 128, 320, 512). To reduce the computational cost of self-attention on high-resolution feature maps, efficient self-attention uses spatial reduction ratios of (8, 4, 2, 1) at the four stages, respectively. The MLP expansion ratio is 4, and the maximum drop-path rate is set to 0.1 for regularization during training.
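The effect of this configuration on feature-map and attention sizes can be checked with a short calculation. The sketch below assumes a 512 × 512 input and that spatial reduction shrinks each side of the key/value map by the stated ratio; the dictionary and function names are illustrative.

```python
MIT_B2 = {
    "strides": (4, 8, 16, 32),        # cumulative downsampling per stage
    "channels": (64, 128, 320, 512),
    "depths": (3, 4, 6, 3),
    "heads": (1, 2, 5, 8),
    "sr_ratios": (8, 4, 2, 1),
}


def stage_summary(height=512, width=512, cfg=MIT_B2):
    """Feature-map size and attention token counts for each encoder stage."""
    rows = []
    for stride, channels, heads, sr in zip(
        cfg["strides"], cfg["channels"], cfg["heads"], cfg["sr_ratios"]
    ):
        h, w = height // stride, width // stride
        rows.append({
            "size": (h, w),
            "channels": channels,
            "head_dim": channels // heads,
            "queries": h * w,                # full-resolution query tokens
            "keys": (h // sr) * (w // sr),   # spatially reduced key/value tokens
        })
    return rows


if __name__ == "__main__":
    for stage, row in enumerate(stage_summary(), start=1):
        print(stage, row)
```

For a 512 × 512 patch, this gives 256 key/value tokens and a per-head dimension of 64 at every stage, while the number of query tokens falls from 16,384 to 256.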
Compared with the standard Vision Transformer (ViT), a key advantage of MiT is that it natively produces multi-scale features without requiring a feature pyramid network (FPN), making it naturally compatible with the proposed dual-branch multi-scale fusion strategy. In addition, Mix-FFN uses 3 × 3 depthwise convolutions to supply local positional information in place of explicit positional encoding, which is particularly beneficial for dense prediction tasks in remote sensing scenarios.
2.4.2. Index Encoder
The index encoder is implemented as a lightweight four-stage convolutional neural network for extracting multi-scale features from the four spectral-index channels. Unlike the spectral encoder, which uses MiT-B2 to model long-range spatial-spectral dependencies, the index encoder adopts a relatively simple CNN structure. This asymmetric design is motivated by two considerations. First, spectral indices already encode strong prior information, and the index branch therefore does not require a feature extractor as complex as that of the spectral branch. Second, keeping the index branch lightweight helps control the additional computational overhead introduced by the dual-branch structure, thereby improving overall network efficiency.
Each stage of the index encoder consists of two sequential convolutional modules. The first module performs spatial downsampling using a convolution with a stride s, where the stride is 4 in stage 1 and 2 in stages 2–4, followed by batch normalization and a ReLU activation function. The second module uses a 3 × 3 convolution with a stride of 1 to further refine local features. The initial convolution in stage 1 uses a 7 × 7 kernel to capture broader spatial context, whereas all subsequent stages use 3 × 3 kernels. The output channel dimensions of the four stages are set to (64, 128, 320, 512) to match the feature dimensions of the spectral encoder at the corresponding scales, thereby facilitating subsequent cross-branch feature fusion.
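The spatial alignment between the two branches can be verified with the standard convolution output-size formula. The sketch assumes a padding of kernel // 2 for every convolution, which is not stated above, and a 512 × 512 input; the names are illustrative.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride, out_channels) of the downsampling convolution in each stage
INDEX_STAGES = [(7, 4, 64), (3, 2, 128), (3, 2, 320), (3, 2, 512)]


def index_encoder_shapes(size=512):
    """(channels, height, width) after each stage of the index encoder."""
    shapes = []
    for kernel, stride, out_channels in INDEX_STAGES:
        size = conv_out(size, kernel, stride, kernel // 2)  # downsampling conv
        size = conv_out(size, 3, 1, 1)                      # refinement conv
        shapes.append((out_channels, size, size))
    return shapes


if __name__ == "__main__":
    for stage, shape in enumerate(index_encoder_shapes(), start=1):
        print(stage, shape)
```

With these assumptions, the four stages produce 128, 64, 32, and 16 pixels per side, that is, 1/4, 1/8, 1/16, and 1/32 of the input, matching the spectral encoder.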
Compared with the spectral encoder, the index encoder introduces only a limited additional computational cost. Because each stage contains only two convolutional layers and no self-attention modules, it increases the parameter count by approximately 3.2 M in the current implementation, corresponding to about a 13% increase relative to the MiT-B2 backbone (approximately 24.7 M parameters). This lightweight design allows the dual-branch architecture to maintain strong representational capacity while retaining good deployment feasibility, making it suitable for real-time or near-real-time red tide monitoring scenarios.
2.4.3. Spectral-Index-Guided Fusion (SIGF) Module
The core innovation of SIG-Net lies in the spectral-index-guided fusion (SIGF) module, which is instantiated at four spatial scales. Unlike conventional symmetric fusion strategies (e.g., feature concatenation or element-wise addition), SIGF explicitly models the asymmetric relationship between spectral features and index features, treating spectral-index features as guidance information that modulates spectral-branch features through a cross-attention mechanism. The underlying rationale is that spectral indices encode prior knowledge relevant to red tide identification and can therefore serve as auxiliary signals that guide the learning of more discriminative representations from the original multispectral features.
Let the spectral feature and the index feature at a given scale be denoted as $F_s \in \mathbb{R}^{C \times H \times W}$ and $F_i \in \mathbb{R}^{C \times H \times W}$, respectively. First, the SIGF module generates the query, key, and value through 1 × 1 convolutions:

$$Q = W_Q F_s, \qquad K = W_K\,\mathrm{SR}(F_i), \qquad V = W_V\,\mathrm{SR}(F_i)$$

where $W_Q$, $W_K$, and $W_V$ denote 1 × 1 linear projections; $\mathrm{SR}(\cdot)$ denotes the spatial reduction operation, which is implemented by average pooling in this study. For the four scales, the spatial reduction ratios are set to (8, 4, 2, 1), respectively. Let the original number of spatial tokens be $N = H \times W$ and the spatial reduction ratio be $R$; then the number of reduced tokens is $N' = N / R^2$, and the computational complexity of the attention operation is reduced from $O(N^2)$ to $O(N^2 / R^2)$.
Before attention computation, Q, K, and V are rearranged into a two-dimensional token form, where $Q \in \mathbb{R}^{N \times d}$ and $K, V \in \mathbb{R}^{N' \times d}$. The cross-attention output is then defined as:

$$F_{att} = \mathrm{BN}\left(\mathrm{Reshape}\left(\mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V\right)\right)$$

where $\sqrt{d}$ is the scaling factor used to avoid excessively large dot-product values, with $d$ the channel dimension of the tokens; $\mathrm{BN}(\cdot)$ denotes the batch normalization operation applied after restoring the feature map to two dimensions. In this study, BatchNorm2d is used to normalize the attention output so that it remains consistent with the feature representation of the convolution branch and yields more stable training performance.
After obtaining the attention-enhanced feature, SIGF further adaptively controls the fusion ratio between the original spectral feature and the attention-enhanced feature through a learnable gating mechanism:

$$G = \sigma\left(\mathrm{Conv}\left([F_s, F_{att}]\right)\right), \qquad F_{out} = G \odot F_{att} + (1 - G) \odot F_s$$

where $\sigma$ denotes the Sigmoid activation, $[\cdot,\cdot]$ denotes channel-wise concatenation, and $\odot$ denotes element-wise multiplication. The gating map $G \in \mathbb{R}^{C \times H \times W}$ is obtained through end-to-end learning, enabling the network to adaptively regulate the fusion strength of index-guided information in both the spatial and channel dimensions. In general, in regions where the spectral indices are more discriminative, the model tends to assign higher weights to the attention-enhanced features, whereas in regions with ambiguous boundaries or weak index responses, more original spectral features can be retained. The ablation experiments in Section 3.2 further verify the effectiveness of this gating mechanism in improving model performance.
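The data flow of one SIGF module can be sketched in NumPy for a single sample. This is an illustration under several assumptions rather than the released implementation: single-head attention, random weights, a per-channel standardization as a stand-in for BatchNorm2d, a 1 × 1 gate projection, and a convex gate that blends the attention-enhanced and original spectral features. All function and parameter names are hypothetical.

```python
import numpy as np


def conv1x1(x, weight):
    """1 x 1 convolution on a (C, H, W) map; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def avg_pool(x, ratio):
    """Spatial reduction by non-overlapping average pooling."""
    c, h, w = x.shape
    return x.reshape(c, h // ratio, ratio, w // ratio, ratio).mean(axis=(2, 4))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(c, rng):
    """Random projection weights (illustrative only)."""
    scale = c ** -0.5
    return {
        "wq": rng.normal(scale=scale, size=(c, c)),
        "wk": rng.normal(scale=scale, size=(c, c)),
        "wv": rng.normal(scale=scale, size=(c, c)),
        "wg": rng.normal(scale=scale, size=(c, 2 * c)),
    }


def sigf(f_spec, f_idx, params, ratio):
    """Index-guided cross-attention followed by gated fusion."""
    c, h, w = f_spec.shape
    q = conv1x1(f_spec, params["wq"]).reshape(c, -1).T   # (N, C) queries
    reduced = avg_pool(f_idx, ratio)                     # index features, reduced
    k = conv1x1(reduced, params["wk"]).reshape(c, -1).T  # (N', C) keys
    v = conv1x1(reduced, params["wv"]).reshape(c, -1).T  # (N', C) values
    attn = softmax(q @ k.T / np.sqrt(c))                 # (N, N')
    f_att = (attn @ v).T.reshape(c, h, w)
    # Stand-in for BatchNorm2d: per-channel standardization of this sample.
    mean = f_att.mean(axis=(1, 2), keepdims=True)
    std = f_att.std(axis=(1, 2), keepdims=True)
    f_att = (f_att - mean) / (std + 1e-5)
    # Assumed gate: sigmoid of a 1 x 1 projection of the concatenated features.
    logits = conv1x1(np.concatenate([f_spec, f_att]), params["wg"])
    gate = 1.0 / (1.0 + np.exp(-logits))
    return gate * f_att + (1.0 - gate) * f_spec, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c, h, w = 8, 16, 16
    fused, attn = sigf(rng.normal(size=(c, h, w)), rng.normal(size=(c, h, w)),
                       init_params(c, rng), ratio=4)
    print(fused.shape, attn.shape)
```

In this toy setting, 256 query tokens attend to 16 reduced index tokens, which shows how the spatial reduction shrinks the attention matrix.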
2.4.4. Decoder Head
After multi-scale feature fusion, the four fused feature maps are fed into the SegFormer-style decoder head. The decoder first maps features at each scale into a unified 256-dimensional channel space through projection layers. In this implementation, each projection layer consists of a 1 × 1 convolution, batch normalization, and ReLU activation. Subsequently, features from all scales are upsampled to 1/4 resolution (H/4 × W/4) by bilinear interpolation and concatenated along the channel dimension to form 1024-dimensional multi-scale fused features.
The concatenated features are compressed to 256 dimensions through a 1 × 1 convolutional fusion layer with batch normalization, followed by dropout with a ratio of 0.1 for regularization. Finally, a 1 × 1 convolutional classification layer outputs pixel-level predictions for two classes; during both training and inference, the predictions are further upsampled to the original input resolution to obtain the final segmentation map.
2.4.5. Loss Function
The model is trained using a pixel-wise cross-entropy loss function with class weights:

$$\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=0}^{1} w_c\, y_{i,c} \log p_{i,c}$$

Here, $N$ denotes the total number of valid pixels involved in loss computation, $c$ denotes the class index (background class $c = 0$ and red tide class $c = 1$), and $w_c$ denotes the class weight; in this study, the background class is set to $w_0 = 1.0$, and the red tide class to $w_1 = 50.0$; $y_{i,c}$ denotes the one-hot encoded ground-truth label, and $p_{i,c}$ denotes the predicted class probability. The larger weight assigned to the red tide class is used to mitigate the pronounced class imbalance problem and thereby enhance the model’s focus on minority-class targets.
To further enhance the model’s ability to identify the red tide class, this study combines cross-entropy loss and Dice loss in a weighted manner to obtain the total loss function:

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda\,\mathcal{L}_{Dice}$$

Here, $\lambda$ is the Dice-loss weight (set to 3.0; see Section 2.5.1). Dice loss [51] directly measures the regional overlap between the predicted and ground-truth segmentations and is generally more robust to class imbalance than cross-entropy loss alone. By combining the two, the model can simultaneously optimize pixel-wise classification accuracy and the overall overlap quality of the target region, thereby improving the completeness and boundary consistency of red tide segmentation results.
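A minimal NumPy sketch of the combined loss is given below, using the class weights [1.0, 50.0] and the Dice-loss weight 3.0 reported in Section 2.5.1. It assumes that the cross-entropy term is averaged over valid pixels, that the Dice term is computed on the red tide class only with a smoothing constant of 1, and that the cross-entropy term has unit weight; the framework implementation may differ in these details.

```python
import numpy as np

CLASS_WEIGHTS = np.array([1.0, 50.0])  # background, red tide
DICE_WEIGHT = 3.0


def weighted_ce(probs, labels, valid, weights=CLASS_WEIGHTS):
    """probs: (2, H, W) class probabilities; labels: (H, W); valid: (H, W) bool."""
    y = labels[valid].astype(int)                  # labels of valid pixels
    p_true = probs[:, valid][y, np.arange(y.size)]  # probability of the true class
    return float(np.mean(-weights[y] * np.log(np.clip(p_true, 1e-12, 1.0))))


def dice_loss(probs, labels, valid, smooth=1.0):
    """Soft Dice loss on the red tide class (assumed formulation)."""
    p = probs[1][valid]
    y = (labels[valid] == 1).astype(float)
    intersection = (p * y).sum()
    return float(1.0 - (2.0 * intersection + smooth) / (p.sum() + y.sum() + smooth))


def total_loss(probs, labels, valid):
    return weighted_ce(probs, labels, valid) + DICE_WEIGHT * dice_loss(probs, labels, valid)


if __name__ == "__main__":
    labels = np.array([[0, 1], [0, 0]])
    valid = np.ones_like(labels, dtype=bool)
    uniform = np.full((2, 2, 2), 0.5)  # uninformative prediction
    print(total_loss(uniform, labels, valid))
```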
2.5. Experimental Settings
2.5.1. Implementation Details
All models were implemented in MMSegmentation v1.2.2, with PyTorch 2.1.2 and CUDA 11.8 as the runtime environment. Experiments were conducted on a single NVIDIA RTX 4090 GPU with 24 GB of memory. AdamW [52] was used as the optimizer, with an initial learning rate of 1 × 10⁻³ and a weight decay of 0.01. The learning-rate schedule included 500 iterations of linear warm-up (from 1 × 10⁻⁶ to 1 × 10⁻³), followed by polynomial decay (power = 0.9) to a minimum learning rate of 1 × 10⁻⁵. The models were trained for 20,000 iterations with a batch size of 16. To improve training efficiency and reduce GPU memory usage, automatic mixed precision (AMP) and dynamic loss scaling were enabled during training.
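The learning-rate schedule can be written as a small function. The sketch assumes that the polynomial decay spans the iterations after warm-up (500 to 20,000); the exact way the framework composes the two schedulers may differ, and the function name is illustrative.

```python
def learning_rate(iteration, base_lr=1e-3, warmup_iters=500, warmup_start=1e-6,
                  max_iters=20_000, power=0.9, min_lr=1e-5):
    """Linear warm-up followed by polynomial decay to a minimum rate."""
    if iteration < warmup_iters:
        fraction = iteration / warmup_iters
        return warmup_start + fraction * (base_lr - warmup_start)
    # Assumption: decay progress is measured from the end of warm-up.
    progress = min((iteration - warmup_iters) / (max_iters - warmup_iters), 1.0)
    return (base_lr - min_lr) * (1.0 - progress) ** power + min_lr


if __name__ == "__main__":
    for it in (0, 250, 500, 10_000, 20_000):
        print(it, learning_rate(it))
```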
Data augmentation included random cropping (512 × 512, with a maximum class-ratio threshold of 0.75), random horizontal flipping (p = 0.5), random vertical flipping (p = 0.5), and random rotation (angle range [−90°, +90°], p = 0.5). The loss function combined weighted cross-entropy loss and Dice loss, with class weights set to [1.0, 50.0] to alleviate class imbalance and the Dice-loss weight set to 3.0; the specific definition is given in Section 2.4.5. All models were compared under the same data split and training settings. Mean intersection over union (mIoU), mean precision (mPrecision), mean recall (mRecall), and mean F1-score (mF1) were used as the primary evaluation metrics. These metrics are widely used for semantic segmentation and classification performance evaluation because they characterize region overlap, false positives, false negatives, and the balance between precision and recall [53]. The final results are reported on the independent test set.
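The four metrics can be derived from a two-class confusion matrix, as in the sketch below. It assumes that each "mean" metric is the unweighted average over the background and red tide classes and that both classes occur in the evaluated data (otherwise a division by zero arises); the function names are illustrative.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes=2):
    """Rows are ground-truth classes, columns are predicted classes."""
    flat = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(flat, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def mean_metrics(cm):
    """Class-averaged IoU, precision, recall, and F1 from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp  # predicted as the class but labeled otherwise
    fn = cm.sum(axis=1) - tp  # labeled as the class but predicted otherwise
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2.0 * precision * recall / (precision + recall)
    return {
        "mIoU": float(iou.mean()),
        "mPrecision": float(precision.mean()),
        "mRecall": float(recall.mean()),
        "mF1": float(f1.mean()),
    }


if __name__ == "__main__":
    target = np.array([0, 0, 1, 1])
    pred = np.array([0, 1, 1, 1])
    print(mean_metrics(confusion_matrix(pred, target)))
```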
2.5.2. Baseline Methods
To assess the effectiveness of SIG-Net, it was compared with the following baseline methods.
U-Net: a classical encoder–decoder segmentation architecture with symmetric downsampling and upsampling paths and skip connections. In this implementation, a five-stage structure was used, with the base number of channels set to 64. To ensure the same amount of input information as in the proposed method, U-Net adopts an 11-channel input, in which the seven original spectral bands and four spectral indices are directly concatenated into a multi-channel input tensor.
DeepLabV3+ [54]: an encoder–decoder architecture combined with atrous spatial pyramid pooling (ASPP) to enhance multi-scale context modeling. In this study, ResNet-50 was used as the backbone, and the input was likewise an 11-channel tensor formed by concatenating the seven original spectral bands and four spectral indices.
SegFormer (MiT-B2) [55]: a representative Transformer-based semantic segmentation architecture whose backbone is identical to the spectral encoder of SIG-Net. For a fair comparison, two configurations were evaluated in this study:
(a) SegFormer-7ch: only the seven original spectral bands are used as input;
(b) SegFormer-11ch: the seven original spectral bands and four spectral indices are directly concatenated into an 11-channel input and fed into a single encoder in order to evaluate the difference between a simple early-fusion strategy and the dual-branch guided-fusion strategy of SIG-Net.
All baseline methods used the same training settings as SIG-Net, including the optimizer (AdamW), initial learning rate (1 × 10⁻³), data-augmentation strategy, and loss function (a combination of cross-entropy loss and Dice loss), to ensure a fair comparison. For models whose default input is 3-channel RGB (U-Net, DeepLabV3+, and SegFormer), the input layers were modified to accommodate multispectral input: U-Net, DeepLabV3+, and SegFormer-11ch received 11-channel input, whereas SegFormer-7ch received 7-channel input. To avoid additional influences introduced by different pretraining strategies, all models were randomly initialized and trained from scratch.
4. Discussion
4.1. Effectiveness of Spectral Index Guidance
The experimental results indicate that explicitly integrating spectral indices as an independent guidance branch, rather than treating them as additional input channels in an early-fusion scheme, yields better overall performance for red tide semantic segmentation. This suggests that, in remote sensing image analysis, incorporating domain knowledge into deep feature learning in a structured manner is often more effective than simple channel-level concatenation. This design strategy may also provide useful guidance for other remote sensing tasks that rely on empirical indices, such as vegetation mapping, water-quality assessment, and urban thermal-environment analysis [56].
An important reason why the SIGF fusion mechanism outperforms simpler alternatives (Table 5) lies in its asymmetric cross-attention design. By using features from the spectral encoder as queries and features from the index encoder as keys and values, SIGF explicitly models how spectral indices guide spectral-feature learning, rather than treating the two branches as fully equivalent information sources. This asymmetry is consistent with the functional role of spectral indices: they are derived from the original bands and encode prior knowledge relevant to red tide identification. They are therefore better suited to serve as auxiliary guidance signals for modulating deep representation learning than to be indiscriminately fused with the original band features.
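The asymmetry can be illustrated with a deliberately simplified, single-head cross-attention in plain Python, in which spectral tokens act as queries and index tokens act as keys and values. This is a sketch only: it omits the learned projections, the spatial-reduction step, and the multi-scale structure of SIGF, and the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(spectral_tokens, index_tokens):
    """Queries from the spectral branch; keys and values from the index branch.

    spectral_tokens : list of N vectors of dimension d
    index_tokens    : list of M vectors of dimension d
    Returns N vectors, each a weighted average of the index tokens.
    """
    d = len(spectral_tokens[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in spectral_tokens:
        # Similarity of this spectral query to every index key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in index_tokens]
        weights = softmax(scores)
        # Aggregate index values with the attention weights.
        out.append([sum(w * v[j] for w, v in zip(weights, index_tokens))
                    for j in range(d)])
    return out

guided = cross_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                         [[2.0, 0.0], [0.0, 2.0]])
```

The output has one vector per spectral token, so the spatial layout of the spectral branch is preserved, while the content of each vector is drawn from the index branch. Swapping the two arguments would instead produce one vector per index token, which is why the two branches are not interchangeable in this design.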
4.2. Role of the Gating Mechanism
The gating mechanism in SIGF plays an important role in improving the adaptivity of the fusion process. The ablation experiments (Table 5) show that model performance declines when the gating mechanism is removed, indicating that relying solely on attention-enhanced features in a fixed manner is not optimal. This is particularly important in complex coastal scenes. For example, in shallow turbid waters, suspended sediments may introduce spectral anomalies similar to those of red tides; in low-signal-to-noise regions, the stability of spectral indices may also decrease. In such cases, if the original spectral features are not effectively preserved, the fusion results may be affected by unreliable guidance information.
After the introduction of gating, the model can adaptively regulate the contribution of index-guided information in both the spatial and channel dimensions, thereby fully exploiting prior information in regions with strong index responses while preserving more original spectral features in regions with ambiguous boundaries or weak index discriminability. This adaptive weighting helps alleviate the effects of spectral-index instability under varying environmental conditions and is one of the key reasons why gated fusion outperforms fixed-weight fusion strategies.
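One common way to realize such adaptive weighting is an element-wise convex blend controlled by a sigmoid gate. The sketch below assumes this form, with a separate gate value for each spatial position and channel; the exact gate used in SIGF is defined in the method section and may differ. In a real module the gate logits would be predicted by a learned layer, whereas here they are passed in directly, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(spectral, guided, gate_logits):
    """Blend per element: g * guided + (1 - g) * spectral, g = sigmoid(logit).

    All arguments are nested lists indexed as [position][channel].
    """
    fused = []
    for s_row, a_row, z_row in zip(spectral, guided, gate_logits):
        row = []
        for s, a, z in zip(s_row, a_row, z_row):
            g = sigmoid(z)  # gate in (0, 1) for this position and channel
            row.append(g * a + (1.0 - g) * s)
        fused.append(row)
    return fused

# Strongly negative logit keeps the spectral feature, strongly positive
# logit adopts the index-guided feature, zero gives an equal blend.
fused = gated_fusion([[1.0, 1.0, 1.0]], [[3.0, 3.0, 3.0]], [[-50.0, 0.0, 50.0]])
```

A fixed-weight fusion corresponds to holding the gate constant everywhere, which removes the ability to fall back on the original spectral features where the index response is unreliable.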
4.3. Value of the Proposed NDNI
The ablation study of spectral-index selection (Table 6) shows that among the four indices, NDNI makes the most prominent contribution to overall segmentation performance. This result supports our basic hypothesis that the spectral contrast between the red-edge band (B5, 705 nm) and the blue band (B2, 490 nm) can provide strongly discriminative supplementary information for Noctiluca scintillans-dominated red tides. Its potential physical basis may be related to the relatively large cell size of Noctiluca, the more pronounced red-edge response, and the absorption characteristics of carotenoids in the blue band [57].
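To make the notion of a band contrast concrete, the sketch below implements a generic normalized-difference operator. Applied to the near-infrared and red bands it gives the standard NDVI. Applying the same operator to B5 and B2 is shown purely as an assumed illustration of a red-edge/blue contrast: the exact NDNI formula is defined earlier in the paper and is not reproduced here, and the reflectance values are hypothetical.

```python
def normalized_difference(a, b, eps=1e-12):
    """Generic normalized-difference contrast (a - b) / (a + b).

    Returns 0.0 when the denominator is numerically zero.
    """
    denom = a + b
    if abs(denom) < eps:
        return 0.0
    return (a - b) / denom

# Hypothetical surface reflectances for a single pixel.
b2_blue, b4_red, b5_red_edge, b8_nir = 0.04, 0.06, 0.12, 0.10

# Standard NDVI: near-infrared (B8) versus red (B4).
ndvi = normalized_difference(b8_nir, b4_red)

# ASSUMPTION: a normalized-difference form for the B5/B2 contrast,
# used only to illustrate the idea behind NDNI.
red_edge_blue_contrast = normalized_difference(b5_red_edge, b2_blue)
```

Because a normalized difference is bounded in [−1, 1] for non-negative inputs, such indices are comparatively insensitive to overall brightness changes, which is one reason they are convenient as prior inputs to a guidance branch.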
It should be noted that the effectiveness of NDNI may be species-dependent. For red tides dominated by other species, the optimal index formulation may differ. Therefore, the more important conclusion of this study lies not in NDNI itself but in the overall framework embodied by SIG-Net, namely the use of task-relevant domain indices as prior information to guide deep feature learning. Within this framework, the index set can be flexibly replaced with more suitable indices according to the target algal species, regional environment, and sensor characteristics.
4.4. Limitations and Future Work
Although the proposed method achieves favorable experimental results, several limitations remain. First, although the dataset used in this study covers multiple coastal regions and time periods, it is still constructed only from a single sensor platform (Sentinel-2 MSI). The applicability of SIG-Net to other multispectral or hyperspectral sensors, such as Landsat-8/9 OLI, MODIS, and GOCI-II, requires further verification, especially with regard to the transferability of spectral indices under different band settings.
Second, although B5, B6, and B7 were resampled to the 10 m reference grid for band co-registration and pixel-wise tensor construction, their effective spatial information remains constrained by their native 20 m resolution. The resampled red-edge bands therefore provide spectral cues rather than newly generated 10 m spatial details. Future work may further examine super-resolution or multi-resolution fusion strategies for Sentinel-2 red-edge bands in red tide extraction.
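The point that resampling adds no spatial detail can be seen in a toy example. The sketch below assumes nearest-neighbor resampling by a factor of 2 (20 m to 10 m); the resampling method actually used is not restated here, and the function name and values are hypothetical.

```python
def upsample_nearest(grid, factor=2):
    """Nearest-neighbor upsampling of a 2-D list by an integer factor."""
    out = []
    for row in grid:
        # Repeat each value horizontally, then repeat the row vertically.
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

coarse = [[0.10, 0.20],
          [0.30, 0.40]]          # 2 x 2 block of 20 m pixels
fine = upsample_nearest(coarse)  # 4 x 4 block on the 10 m grid

# The finer grid contains exactly the same set of values as the coarse one.
same_values = {v for row in fine for v in row} == {v for row in coarse for v in row}
```

Each 20 m value is simply replicated over four 10 m cells, so the resampled band aligns pixel-wise with the native 10 m bands without containing any additional spatial information.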
Third, the current ground-truth labels mainly rely on expert visual interpretation and semi-automatic threshold segmentation, which may introduce a certain degree of subjectivity and boundary uncertainty. Because only red tide pixels with relatively clear visual and spectral responses were labeled as high-confidence positive samples, weak-response or early-stage bloom pixels may be under-represented in the current dataset. This conservative labeling strategy helps reduce false-positive labels but may also make the model less sensitive to ambiguous bloom boundaries. In the future, incorporating field-validation data, such as ship-based sampling, buoy observations, and fluorescence measurements, would help establish higher-quality ground-truth labels and improve the rigor of the evaluation.
Fourth, this study does not explicitly model the temporal evolution of red tide events, including stages such as occurrence, development, and decline. Extending SIG-Net to a multi-temporal remote sensing analysis framework is expected to further improve its ability to characterize dynamic red tide changes and enhance the detection of early red tide signals.
Finally, although SIG-Net achieves favorable performance while maintaining relatively low model complexity, further efforts are still needed before deployment in operational red tide monitoring scenarios, including near-real-time processing, automated data-stream ingestion, inference-efficiency optimization, and integration with existing early-warning systems. In addition, the current experimental results are still mainly based on limited regions and sample sizes. Future work should further validate the stability and generalization capability of the model across more sea areas, more red tide types, and cross-regional scenarios.
5. Conclusions
This study proposes SIG-Net (Spectral-Index-Guided Network), a dual-branch semantic segmentation architecture for red tide extraction from Sentinel-2 multispectral imagery. The core idea is to explicitly construct domain-relevant spectral indices, namely RGI, BGI, NDVI, and the NDNI proposed in this study, as an independent input branch and to incorporate index prior information into the deep feature learning process through the spectral-index-guided fusion (SIGF) module.
Through spatial-reduction cross-attention and a learnable gating mechanism, the SIGF module achieves adaptive fusion of spectral features and index features at multiple scales. Compared with simple early concatenation or symmetric fusion strategies, this design makes more effective use of the prior knowledge carried by spectral indices, thereby enhancing the model’s discriminative capability for red tide targets.
Experimental results on the Sentinel-2 red tide dataset demonstrate that SIG-Net outperforms baseline models such as U-Net, DeepLabV3+, and SegFormer in overall performance, as well as simpler dual-branch fusion strategies such as element-wise addition and concatenation. Ablation studies further verify the effectiveness of the individual components of the proposed method, among which both the SIGF module and the gating mechanism deliver stable performance gains, while NDNI shows the strongest contribution among the four spectral indices.
Overall, SIG-Net provides an effective framework for integrating remote sensing domain knowledge with deep learning for fine-grained red tide extraction. Although this study focuses on red tide detection, the dual-branch guided-fusion strategy also has potential for other remote sensing tasks that require the incorporation of domain prior information. Future work will further focus on multi-temporal analysis, multi-sensor fusion, and deployment for operational marine environmental monitoring.