Article

State-Space Model Meets Linear Attention: A Hybrid Architecture for Internal Wave Segmentation

Zhijie An, Zhao Li, Saheya Barintag, Hongyu Zhao, Yanqing Yao, Licheng Jiao and Maoguo Gong
1 College of Mathematics Science, Inner Mongolia Normal University, Hohhot 010028, China
2 School of Automation, Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(17), 2969; https://doi.org/10.3390/rs17172969
Submission received: 14 July 2025 / Revised: 18 August 2025 / Accepted: 24 August 2025 / Published: 27 August 2025
(This article belongs to the Special Issue Advancements of Vision-Language Models (VLMs) in Remote Sensing)

Abstract

Internal waves (IWs) play a crucial role in the transport of energy and matter within the ocean while also posing significant risks to marine engineering, navigation, and underwater communication systems. Consequently, effective segmentation methods are essential for mitigating their adverse impacts and minimizing associated hazards. A promising strategy involves applying remote sensing image segmentation techniques to accurately identify IWs, thereby enabling predictions of their propagation velocity and direction. However, current IWs segmentation models struggle to balance computational efficiency and segmentation accuracy, often resulting in either excessive computational costs or inadequate performance. Motivated by recent developments in the Mamba2 architecture, this paper introduces the state-space model meets linear attention (SMLA), a novel segmentation framework specifically designed for IWs. The proposed hybrid architecture effectively integrates three key components: a feature-aware serialization (FAS) block to efficiently convert spatial features into sequences; a state-space model with linear attention (SSM-LA) block that synergizes a state-space model with linear attention for comprehensive feature extraction; and a decoder driven by hierarchical fusion and upsampling, which performs channel alignment and scale unification across multi-level features to ensure high-fidelity spatial detail recovery. Experiments conducted on a dataset of 484 synthetic-aperture radar (SAR) images containing IWs from the South China Sea achieved a mean Intersection over Union (MIoU) of 74.3%, surpassing competing methods evaluated on the same dataset. These results demonstrate the superior effectiveness of SMLA in extracting features of IWs from SAR imagery.

1. Introduction

Internal waves (IWs), as a widespread oceanic phenomenon, are distributed across global marine regions. Large-scale IWs carry significant amounts of energy as they propagate toward coastlines, posing substantial impacts on marine ecosystems and maritime navigation [1,2,3]. Consequently, the observation of IWs has become a significant research topic in oceanography. This study aims to develop a novel deep learning network for IWs image segmentation, enabling efficient and accurate observation of IWs characteristics.
Currently, with the rapid advancement of remote sensing technology, satellite observation data has become a crucial resource for studying IWs [4,5], with the two primary remote sensing methods being satellite optical remote sensing and synthetic-aperture radar (SAR) technology [6,7]. Optical remote sensing operates in a passive imaging mode by capturing solar radiation reflected from the sea surface to analyze IW surface expressions [8]. In contrast, synthetic-aperture radar (SAR) functions as an active microwave sensor capable of penetrating clouds and rain, enabling all-weather, day–night imaging with sustained high spatial resolution over long distances. SAR achieves this by detecting IWs-induced modulations of sea surface current fields, which manifest as alternating bright–dark stripe patterns in radar backscatter intensity [9,10]. Quality limitations of remote sensing images (e.g., resolution constraints, noise interference, and atmospheric disturbances) make precise observation of IWs structures challenging. Thus, an increasing number of studies have utilized remote sensing images to extract and segment IWs [11]. Before performing segmentation on remote sensing images, a series of optimization processes, such as normalization, image augmentation, and denoising [12], are typically applied to improve segmentation performance and enhance the training efficiency of the model. The goal of IWs image segmentation is to employ deep learning algorithms to extract the structures and characteristics of IWs from remote sensing images, thereby supporting the analysis of IWs behavior and enabling the prediction of their speed and direction [13]. As a result, current research on segmentation models continues to focus on improving segmentation accuracy and efficiency while minimizing dependence on large-scale labeled datasets and computational resources.
In the past decade, large models for image processing have continued to evolve. Convolutional neural networks (CNNs) have long held a dominant position in the field and have provided the foundation for the development of numerous derivative architectures. The generative adversarial network (GAN) framework [14], based on game theory principles, achieves image segmentation through an adversarial training mechanism between the generator and discriminator. Meanwhile, the attention-based deep convolutional segmentation approach extracts hierarchical features through multi-level convolutional layers and adaptively enhances target feature representation using attention modules [15], effectively integrating local details with global contextual information. Although CNNs play an indispensable role in deep learning models, their limited local receptive fields hinder the modeling of long-range dependencies. Transformers address this issue by capturing global information [16,17], but their high computational complexity limits their applicability to high-resolution remote sensing images.
Recently, Mamba, an efficient sequence modeling method, has shown significant advantages in processing long sequences [18,19]. Various models based on the Mamba architecture, such as U-Mamba, which integrates CNN residual blocks with Mamba blocks, LocalMamba, which performs selective scanning in local windows, and VM-UNET-V2, which introduces VSS blocks and spatial dependency injection (SDI), have demonstrated strong segmentation performance. These models collectively highlight the potential of combining Mamba with other architectures to improve both the efficiency and effectiveness of image segmentation. However, the selective state-space models (SSMs) in Mamba face challenges when scaling to larger models due to the computational complexity of state-space discretization. To address this, Mamba2 introduces structured state-space decomposition and a parameter-shared projection mechanism, leveraging matrix representations and block-based computations to maximize GPU parallelism, reduce complexity, and accelerate training. Although Mamba2 demonstrates outstanding performance in feature modeling, it may still suffer from overfitting and incur high computational costs in resource-constrained scenarios. To overcome these limitations, this paper proposes an optimized image segmentation framework for IWs, aiming to enhance segmentation accuracy and computational efficiency under limited training data conditions. To improve global context modeling and local feature representation, we design a novel state-space model that integrates the visual Mamba2 module, combining Mamba2’s state-space modeling with a linear attention mechanism, thereby facilitating efficient global feature interactions. Moreover, the introduction of structured state matrix decomposition, block-wise computation, a non-causal linear attention (NCLA) block, and a compression gating design further enhances the overall computational efficiency through architectural simplification. The main contributions of this paper are as follows:
(1)
A decoder structure driven by hierarchical fusion and upsampling is constructed, which performs channel alignment and scale unification on multi-level features to achieve high-precision spatial detail recovery. This significantly improves the boundary segmentation accuracy of IWs. In addition, a feature-aware serialization (FAS) block is designed to compress the spatial dimension while effectively preserving and enhancing multi-scale salient feature representations, providing high-quality inputs for subsequent sequence modeling.
(2)
A state-space model block integrated with a linear attention mechanism (SSM-LA) is introduced. This module innovatively combines the linear properties of Mamba2 with visual feature modeling, achieving a unified representation of fine-grained local details and global contextual information, balancing performance and computational efficiency.
(3)
The proposed architecture fully integrates the advantages of structured state-space modeling and linear attention mechanisms, effectively enhancing the representation capability for complex ripple patterns while significantly reducing the computational redundancy typically associated with conventional Transformer architectures in remote sensing image processing. The resulting lightweight and efficient segmentation framework achieves a dual optimization of accuracy and computational efficiency, demonstrating strong adaptability for practical remote sensing applications in resource-constrained environments.
The remainder of this paper is organized as follows: Section 2 reviews related work on image segmentation and IWs detection in remote sensing imagery. Section 3 presents the overall architecture and methodology of the proposed model. Section 4 details the experimental procedures and results. Section 5 provides a comprehensive conclusion and discusses future research directions.

2. Related Work

Accurate image feature segmentation remains a fundamental challenge in computer vision. During the early development of deep learning, convolutional neural networks (CNNs) [20] emerged as the dominant architectural paradigm due to their strong adaptability and scalability. The fully convolutional network (FCN) [21] pioneered this field by introducing transposed convolutional layers and skip connections, enabling CNNs to handle inputs of arbitrary size and perform end-to-end pixel-level classification, thereby establishing the cornerstone of modern segmentation networks. U-Net [22] further advanced this framework with its encoder–decoder architecture, enhancing segmentation performance in small-sample scenarios through multi-scale feature fusion via skip connections. However, U-Net’s reliance on traditional convolutional operations inherently limited its capacity to model long-range dependencies due to its localized receptive fields. Subsequent improvements, including attention gating mechanisms for dynamically suppressing background interference [23], multi-scale context aggregation modules at the encoder stage [24], and atrous spatial pyramid pooling (ASPP) in decoder design [25], still proved to be less effective than self-attention mechanisms and their variants in capturing global contextual information.
As noted above, the limited local receptive fields of CNNs hinder the modeling of long-range dependencies. In parallel, research in oriented object detection has addressed similar challenges of capturing complex spatial layouts [26]. For instance, advanced architectures utilize hierarchical mask prompting and robust integrated regression to accurately delineate objects with irregular orientations [27], demonstrating the importance of multi-scale feature integration and precise boundary definition, which are also critical for IWs segmentation.
The advent of vision Transformers (ViTs) [28] reinvigorated the field by leveraging global attention weights to model pixel-wise relationships across entire images, demonstrating superior capabilities in capturing long-range dependencies, as well as offering architectural flexibility and scalability compared to U-Net. Nevertheless, the quadratic computational complexity of self-attention mechanisms has hindered their application in high-resolution image processing. To address this limitation, researchers have proposed localized attention constraints [29], hierarchical downsampling architectures [30], and linearized attention approximations [31], although these approaches still underperform relative to standard quadratic attention variants.
Recent advancements in selective state-space models (SSMs) [32], characterized by linear complexity, global modeling capacity, and hardware efficiency, have begun to challenge the dominance of transformers in long sequence tasks. Mamba [33], a prominent SSM variant, achieves performance comparable to or surpassing that of ViTs in image processing through its S6 module, which enables input-dependent parameter adaptation. Subsequent adaptations for vision tasks include bidirectional scanning mechanisms for 2D structural compatibility [34], vision mamba (Vim) [35], which incorporates positional encodings into pure SSM backbones for high-resolution processing, and extensions to 3D spatiotemporal data and multimodal scientific computing [36]. Building upon these developments, Mamba2 [37], proposed by Albert Gu’s team in 2024, introduced structured state matrix decomposition, significantly reducing computational complexity and accelerating training while preserving the core advantages of the original architecture, marking a breakthrough in efficiency performance co-optimization.
The successful application of the aforementioned models in related fields highlights the vast potential of deep learning in various domains. In the field of oceanic IWs detection, several deep learning-based methods have been proposed. In 2021, S. Vasavi et al. introduced a U-Net model to segment IWs regions from SAR images and generate binary masks [38]. The segmentation results were subsequently fed into a KdV solver to invert IWs parameters such as amplitude, wavelength, and propagation velocity, with physical constraints applied to correct segmentation errors. Also in 2021, Zhang et al. proposed a backpropagation neural network (BPNN)-based model [13] to predict IWs propagation in the Andaman Sea, departing from traditional approaches that heavily relied on physical models. In 2023, G. Yuan et al. developed an automatic IWs recognition algorithm based on a one-dimensional convolutional neural network (1D-CNN) [39], incorporating feature extraction and classification modules, which is suitable for marine buoy systems in automated IWs detection. In the same year, Jiang et al. proposed a modified deep convolutional generative adversarial network (DCGAN) [40] to enhance local contrast in MODIS images, thereby highlighting IWs signals and improving detection performance in low-resolution MODIS images. Figure 1 provides a visual summary of the key outcomes and model architectures from the aforementioned related works.
While existing segmentation models predominantly focus on medical imaging, industrial applications remain underexplored. To bridge this gap, we propose an encoder–decoder architecture specifically designed for the precise segmentation of IWs patterns in marine imagery. Inspired by Mamba2, our model incorporates stacked Vision Mamba2 blocks to construct a multi-level feature dependency capture mechanism, enabling comprehensive modeling of complex spatial relationships in visual data. This innovative solution delivers enhanced segmentation precision for oceanographic observation scenarios.

3. Method

3.1. Architecture Overview

The proposed segmentation framework follows a hybrid network structure that efficiently captures both local features and long-range contextual information. Figure 2 presents an overview of the SMLA architecture. The encoder begins with a feature-aware serialization (FAS) block, which is optimized to reduce spatial dimensions while efficiently extracting salient visual features. This is followed by a series of stacked state-space model with linear attention (SSM-LA) blocks designed to comprehensively model intricate feature dependencies, enabling detailed representations of complex spatial relationships within visual data. Collectively, these encoder components transform the input image into robust sequential feature representations, effectively preserving rich multi-scale spatial and contextual information. The decoder then reconstructs high-resolution segmentation maps through a hierarchical fusion process that integrates multi-scale features generated by the encoder and progressively upsamples the feature maps. This hierarchical strategy ensures the accurate restoration of detailed spatial information, significantly enhancing the delineation precision of IWs boundaries in the final segmentation output. Detailed descriptions of the FAS block, SSM-LA block, and the decoder’s patch-expanding mechanisms are provided in the subsequent subsections.

3.2. Feature-Aware Serialization Block

The FAS block is illustrated in the upper left part of Figure 2. This block compresses the spatial dimensions of the feature map while performing comprehensive visual feature extraction. Specifically, given an input image $x \in \mathbb{R}^{C \times H \times W}$, the process begins with a single convolutional operation $C_1$, which consists of dropout, a 3 × 3 convolution, normalization, and activation functions, resulting in $f_1 \in \mathbb{R}^{2C \times \frac{H}{2} \times \frac{W}{2}}$. Two convolutional operations $C_2$ are then applied, and their output is combined with $f_1$ through a residual connection to produce $f_2 \in \mathbb{R}^{2C \times \frac{H}{2} \times \frac{W}{2}}$. Subsequently, $C_2$ is applied again to obtain $f_3 \in \mathbb{R}^{4C \times \frac{H}{4} \times \frac{W}{4}}$, followed by a flattening operation that yields the sequence $S \in \mathbb{R}^{L \times C}$. Through this block, the spatial resolution of the feature map is progressively reduced while the channel depth is enriched to preserve detailed information.
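To make the FAS block concrete, the following PyTorch sketch reproduces the $C_1$/$C_2$ structure described above. The normalization type, activation, dropout rate, stride choices, and the base channel width are illustrative assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class FASBlock(nn.Module):
    """Sketch of the feature-aware serialization (FAS) block.

    Assumptions: BatchNorm2d as the normalization, GELU as the activation,
    dropout p = 0.1, and stride-2 convolutions for the two downsampling
    steps; `base_ch` stands in for the channel count C of the text.
    """

    def __init__(self, in_ch: int = 3, base_ch: int = 32, p_drop: float = 0.1):
        super().__init__()
        # C1: dropout -> 3x3 conv (stride 2) -> norm -> activation, C -> 2C
        self.c1 = nn.Sequential(
            nn.Dropout2d(p_drop),
            nn.Conv2d(in_ch, 2 * base_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(2 * base_ch),
            nn.GELU(),
        )
        # C2 (first use): two 3x3 convs at constant resolution, merged with f1 by a residual
        self.c2a = nn.Sequential(
            nn.Conv2d(2 * base_ch, 2 * base_ch, 3, padding=1),
            nn.BatchNorm2d(2 * base_ch),
            nn.GELU(),
            nn.Conv2d(2 * base_ch, 2 * base_ch, 3, padding=1),
            nn.BatchNorm2d(2 * base_ch),
        )
        # C2 (second use): downsampling variant, 2C -> 4C, H/2 x W/2 -> H/4 x W/4
        self.c2b = nn.Sequential(
            nn.Conv2d(2 * base_ch, 4 * base_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(4 * base_ch),
            nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.c1(x)                        # (B, 2C, H/2, W/2)
        f2 = torch.relu(self.c2a(f1) + f1)     # residual connection, same shape as f1
        f3 = self.c2b(f2)                      # (B, 4C, H/4, W/4)
        return f3.flatten(2).transpose(1, 2)   # sequence (B, L, 4C), L = (H/4)*(W/4)

print(FASBlock()(torch.randn(2, 3, 256, 256)).shape)  # torch.Size([2, 4096, 128])
```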

3.3. State-Space Models with Linear Attention Block

The SSM-LA block, illustrated in the lower left part of Figure 2, incorporates a 3 × 3 convolution layer to enhance deep feature representations. This operation is fused with the original input to preserve richer semantic details. Following batch normalization, a linear projection is applied to generate several key intermediate representations.
$[Z, X, B, C] = \mathrm{Linear}_{\mathrm{in}}(u),$ (1)
where $Z$ represents the gating branch, while the non-causal linear attention (NCLA) layer is conceptualized as a mapping $(A, X, B, C) \mapsto Y$. Accordingly, it is meaningful to generate $A$, $X$, $B$, and $C$ in parallel at the beginning of the block. Here, $X$, $B$, and $C$ correspond to the $V$, $K$, and $Q$ projections, respectively; only the creation of $Q$, $K$, and $V$ is depicted in the figure. The state transition matrix $A$ of the underlying state-space model is initialized with strictly negative eigenvalues as follows:
$A = -\exp(A_{\log}), \qquad A_{\log} \sim \mathcal{U}\left[\log(a_{\min}), \log(a_{\max})\right].$ (2)
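A minimal sketch of this initialization is given below; the default range $[a_{\min}, a_{\max}] = [1, 16]$ follows the Mamba2 reference implementation and is an assumption here, since the text does not state the values.

```python
import math
import torch

def init_state_matrix(n_heads: int, a_min: float = 1.0, a_max: float = 16.0) -> torch.Tensor:
    """Draw A_log uniformly in [log(a_min), log(a_max)] and return A = -exp(A_log),
    so every entry of the (diagonal) state-transition parameter is strictly negative,
    which keeps the SSM recurrence stable (decaying)."""
    a_log = torch.empty(n_heads).uniform_(math.log(a_min), math.log(a_max))
    return -torch.exp(a_log)

print(init_state_matrix(8))  # eight negative scalars, one per head
```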
In the NCLA layer, we adopt the corresponding tensor contraction algorithm and notation in the linear form, following the Mamba2 framework [37]:
$Z = \mathrm{contract}(\mathrm{LD}, \mathrm{LN} \rightarrow \mathrm{LND})(V, K),$ (3)
$H = \mathrm{contract}(\mathrm{LL}, \mathrm{LND} \rightarrow \mathrm{ND})(M, Z),$ (4)
$Y = \mathrm{contract}(\mathrm{LN}, \mathrm{ND} \rightarrow \mathrm{LD})(Q, H),$ (5)
where L is sequence length, N denotes the state dimension, and D is the head dimension. This algorithm involves three steps: the first step (3) performs an “expansion” into more features by a factor of the feature dimension. The third step (5) contracts the expanded feature dimension away. The second step (4) is particularly critical, as it captures the linear characteristic of linear attention. It unrolls scalar SSM recurrences to create a global hidden state H. For clarity, NCLA is shown in the bottom right part of Figure 2. In the NCLA, the projection structure originally employed in the SSD module is seamlessly integrated into the linear attention mechanism. Leveraging the kernel functions inherent to linear attention, the input features are further modeled and enhanced within a state-space dynamic framework. Ultimately, by aggregating the outputs across all channels, the overall output can be expressed as
$Y = Q\left(M\left(K^{\top}V\right)\right),$ (6)
and this modeling approach enables NCLA to retain fine-grained local representation capabilities while simultaneously benefiting from the efficiency of linear attention in global context modeling, thereby achieving both improved performance and computational efficiency. Additionally, a feed-forward network (FFN) is integrated subsequent to the NCLA block to facilitate enhanced information exchange across channels and to maintain alignment with the established practices of classical vision Transformers.
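The contraction chain (3)–(5) can be written directly with einsum, which makes the linear cost explicit. In the sketch below the mask $M$ is collapsed to a per-token scalar weight (an assumption made for brevity; the full formulation applies $M$ through an $L \times L$ contraction), so the code illustrates the non-causal, global-state structure rather than the exact SSD mask.

```python
import torch

def ncla(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
         weight: torch.Tensor) -> torch.Tensor:
    """Sketch of the NCLA contraction chain of Eqs. (3)-(5).

    Shapes: q, k are (B, L, N); v is (B, L, D); `weight` is (B, L), a per-token
    scalar standing in for the mask M (assumption). Cost is O(L*N*D) rather than
    the O(L^2) of softmax attention."""
    z = torch.einsum("bld,bln->blnd", v, k)      # (3) expand: per-token outer products
    h = torch.einsum("bl,blnd->bnd", weight, z)  # (4) pool over L into a global state H
    y = torch.einsum("bln,bnd->bld", q, h)       # (5) contract: read H out with the queries
    return y

q = torch.randn(2, 1024, 64)
k = torch.randn(2, 1024, 64)
v = torch.randn(2, 1024, 128)
print(ncla(q, k, v, torch.ones(2, 1024)).shape)  # torch.Size([2, 1024, 128])
```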

3.4. Decoder

This section first introduces the sample-to-example (STE) block. After passing through the SSM-LA blocks, the network produces feature maps $P_i$ of varying sizes, which require channel alignment before feature decoding. The following equation gives the main computation of the STE module:
$\tilde{P}_1 = U_4(\mathrm{Conv}(P_1)), \quad \tilde{P}_2 = U_2(\mathrm{Conv}(P_2)), \quad \tilde{P}_3 = \mathrm{Conv}(P_3), \quad \tilde{P}_4 = P_{\mathrm{avg}}(\mathrm{Conv}(P_4)).$ (7)
Here, $P_{\mathrm{avg}}$ denotes 2 × 2 average pooling, $\mathrm{Conv}$ a 3 × 3 convolution, and $U_2$ and $U_4$ upsampling with scaling factors of 2 and 4, respectively.
Firstly, the deep-layer features $P_1$ and $P_2$, characterized by lower resolutions and rich semantic information, restore spatial details through upsampling to facilitate their integration with shallower features. Secondly, the intermediate-layer feature $P_3$ contains both moderate spatial details and initial semantic information, and its channels are adjusted via convolution to align with the other features. Finally, the shallow-layer feature $P_4$ reduces spatial resolution through downsampling, thereby decreasing the subsequent computational load while preserving primary features and enhancing the semantic expression of local details.
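A compact PyTorch sketch of Eq. (7) follows. The shared output channel width and the bilinear upsampling are assumptions; only the 3 × 3 convolutions, the 2 × 2 average pooling, and the ×2/×4 upsampling factors are specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STEBlock(nn.Module):
    """Sketch of the sample-to-example (STE) alignment step, Eq. (7)."""

    def __init__(self, chs: tuple, out_ch: int = 64):
        super().__init__()
        # one 3x3 conv per level to align channel counts (out_ch is an assumption)
        self.convs = nn.ModuleList([nn.Conv2d(c, out_ch, 3, padding=1) for c in chs])

    def forward(self, p1, p2, p3, p4):
        p1 = F.interpolate(self.convs[0](p1), scale_factor=4, mode="bilinear")  # U4
        p2 = F.interpolate(self.convs[1](p2), scale_factor=2, mode="bilinear")  # U2
        p3 = self.convs[2](p3)                                                  # keep scale
        p4 = F.avg_pool2d(self.convs[3](p4), kernel_size=2)                     # 2x2 avg pool
        return torch.cat([p1, p2, p3, p4], dim=1)  # concatenated multi-scale feature P

# Deep-to-shallow features whose aligned resolution becomes 32 x 32 in this example.
ste = STEBlock(chs=(512, 256, 128, 64))
feats = [torch.randn(1, c, s, s) for c, s in [(512, 8), (256, 16), (128, 32), (64, 64)]]
print(ste(*feats).shape)  # torch.Size([1, 256, 32, 32])
```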
These aligned features are subsequently concatenated, creating a comprehensive feature representation $P$ encompassing detailed and contextual information across multiple scales. The concatenated feature tensor is progressively refined through multiple decoding stages, each incorporating transposed convolutions (ConvTrans), followed by standard convolutions, batch normalization (BN), activation functions ($\sigma$), and channel shuffle operations ($\psi$):
$P^{(k)} = T^{(k)}\left(P^{(k-1)}\right) = \mathrm{BN}\left(\sigma\left(\mathrm{Conv}^{(k)}\left(\psi\left(\mathrm{ConvTrans}^{(k)}\left(P^{(k-1)}\right)\right)\right)\right)\right).$ (8)
The hierarchical decoding strategy gradually restores intricate spatial details from compressed latent representations, progressively refining feature resolution until the original image size is reached. Ultimately, the final segmentation probability map is generated by applying a 1 × 1 convolution followed by an activation function, producing precise, pixel-wise segmentation predictions that accurately delineate IWs.
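As a sketch of one decoding stage $T^{(k)}$ from Eq. (8): the kernel sizes, the ReLU activation, and the number of shuffle groups below are illustrative assumptions, not values fixed by the text.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int = 4) -> torch.Tensor:
    """Channel shuffle (psi): interleave channels across `groups` groups."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class DecoderStage(nn.Module):
    """One decoding stage T^(k) of Eq. (8): ConvTrans -> shuffle -> Conv -> sigma -> BN."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch, kernel_size=2, stride=2)  # x2 upsampling
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = channel_shuffle(self.up(x))
        return self.bn(self.act(self.conv(x)))

stage = DecoderStage(256, 128)
print(stage(torch.randn(1, 256, 32, 32)).shape)  # torch.Size([1, 128, 64, 64])
```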

4. Experiment and Results

4.1. Data

In this paper, we collected a total of 484 IWs images captured by the Environmental Satellite Advanced Synthetic-Aperture Radar (ENVISAT ASAR) over the South China Sea region between 1 May 2003 and 31 March 2012. Among them, 444 IWs images were used for training and 40 for testing. These data are available for download from the European Space Agency (https://esar-ds.eo.esa.int/oads/access/collection). ENVISAT ASAR operated in the C-band at a wavelength of 5.6 cm and employed the wide swath mode (WS), offering a spatial resolution of 150 m with VV/HH polarization. The swath width was approximately 400 km.

4.2. Implementation Details

The experiments were conducted on a Windows 10 operating system. We utilized the PyTorch 2.6.0+cu126 framework and leveraged an NVIDIA RTX 4090 GPU from NVIDIA Corporation, Santa Clara, CA, USA to accelerate computations. Python version 3.9 was used for scripting and execution. The training process was configured to run for 100 epochs with a batch size of 32. The Adam optimizer was adopted to perform parameter optimization during training.

4.3. Metrics

To comprehensively evaluate the performance of the proposed framework, we adopt a set of widely used semantic segmentation metrics, including Mean Intersection over Union (MIoU), Frequency Weighted Intersection over Union (FWIoU), overall accuracy, precision, and F1-score. MIoU is a primary indicator, calculating the average IoU for all classes to fairly assess performance, especially on minority classes in imbalanced datasets. In contrast, FWIoU weights each class’s IoU by its frequency, making it more sensitive to the performance on dominant regions like the background, but potentially obscuring poor results on rarer classes. While accuracy—the percentage of correctly classified pixels—offers a high-level overview, it is less reliable in imbalanced scenarios and serves mainly as a supplementary reference. Finally, precision measures the proportion of predicted positives that are correct, thereby penalizing false positives, while the F1-score, as the harmonic mean of precision and recall, provides a more comprehensive evaluation by balancing both false positives and false negatives.
The formulas for computing these metrics are presented below.
$\mathrm{MIoU} = \frac{1}{2}\left(\frac{TP}{TP+FP+FN} + \frac{TN}{TN+FN+FP}\right),$ (9)
$n_0 = TN + FP, \qquad n_1 = TP + FN,$ (10)
$\mathrm{FWIoU} = \frac{n_1}{n_0+n_1} \times \frac{TP}{TP+FP+FN} + \frac{n_0}{n_0+n_1} \times \frac{TN}{TN+FN+FP},$ (11)
$\mathrm{Accuracy} = \frac{TP+TN}{n_0+n_1},$ (12)
$\mathrm{Precision} = \frac{TP}{TP+FP},$ (13)
$\mathrm{Recall} = \frac{TP}{TP+FN},$ (14)
$\text{F1-score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$ (15)
True Positive (TP) signifies the count of pixels accurately identified as “edge”; False Negative (FN) indicates the number of pixels mistakenly not identified as “edge”; False Positive (FP) represents the quantity of “non-edge” pixels in the ground truth dataset erroneously labeled as “edge” by the model; True Negative (TN) reflects the number of pixels correctly identified as “non-edge”.
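For reference, a small NumPy sketch of Eqs. (9)–(15) for the binary edge / non-edge setting is given below; the helper name and the boolean-mask interface are ours, not part of the evaluation code used in the paper.

```python
import numpy as np

def binary_segmentation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute the metrics of Eqs. (9)-(15) from boolean masks of equal shape."""
    tp = np.sum(pred & gt)        # edge pixels predicted as edge
    fp = np.sum(pred & ~gt)       # non-edge pixels predicted as edge
    fn = np.sum(~pred & gt)       # edge pixels missed
    tn = np.sum(~pred & ~gt)      # non-edge pixels predicted as non-edge
    n0, n1 = tn + fp, tp + fn     # per-class pixel counts
    iou_fg = tp / (tp + fp + fn)  # IoU of the "edge" class
    iou_bg = tn / (tn + fn + fp)  # IoU of the "non-edge" class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "MIoU": (iou_fg + iou_bg) / 2,
        "FWIoU": (n1 * iou_fg + n0 * iou_bg) / (n0 + n1),
        "Accuracy": (tp + tn) / (n0 + n1),
        "Precision": precision,
        "F1-score": 2 * precision * recall / (precision + recall),
    }

pred = np.array([[1, 0], [1, 1]], dtype=bool)
gt = np.array([[1, 0], [0, 1]], dtype=bool)
print(binary_segmentation_metrics(pred, gt))
```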

4.4. Comparative Experiments

To comprehensively evaluate the performance of the proposed framework, we conducted comparative experiments against several state-of-the-art segmentation networks, including DeepLabV3 [41], FPN [42], LinkNet [43], MAnet [44], PSPNet [45], Unet [22], Unet++ [46], UPerNet [47], and MTU2-Net [15]. These models were implemented with six commonly used backbones: densenet161, dpn131, resnet101, resnet152, se_resnext101_32x4d, and vgg19_bn. The performance results are summarized in Table 1, which reports multiple evaluation metrics, including MIoU, FWIoU, accuracy, precision, and F1-score.
As shown in Table 1, other semantic segmentation models exhibit a discernible performance plateau when applied to the IWs dataset, achieving an average MIoU of approximately 68%. This performance ceiling is primarily attributable to the intrinsic complexities of the dataset, such as a severe class imbalance and the prevalence of fine-scale, low-contrast IWs features. A key observation is the consistent underperformance of models utilizing ResNet backbones. This suggests that the feature extraction mechanism of standard ResNet architectures may be suboptimal for capturing the subtle, high-frequency textural information characteristic of IWs in remote sensing imagery. While established strategies such as increasing network depth (exemplified by MTU2-Net, which achieved a 4% MIoU gain) can offer marginal improvements, they do so at the expense of a substantial increase in parametric complexity and computational cost, limiting their practicality for operational applications.
To further elaborate on the experimental results, we designed a heat map and a box plot. This comprehensive quantitative analysis revealed a clear performance hierarchy and highlighted critical architectural limitations among existing models. As shown in Figure 3, we identified three distinct tiers of backbone efficacy, with densenet161, dpn131, and se_resnext101_32x4d emerging as top-tier performers (MIoU of 0.66–0.68), while resnet152 proved consistently unsuitable for this task (MIoU ≈ 0.49). Among segmentation models, MAnet, Unet++, and UPerNet demonstrated the most robust performance. While this figure provides a clear quantitative ranking, a qualitative inspection of the segmentation maps is essential to understand the practical implications behind these scores. A striking divergence in output quality becomes apparent, particularly when comparing our proposed model to even the top-performing baseline configurations. Our framework generates segmentation maps with superior spatial coherence, excelling in the accurate delineation of the fine, contiguous crests characteristic of IWs.
In stark contrast, even high-performing baselines, such as an MAnet model paired with a densenet161 backbone, often produce fragmented or incomplete predictions. These models exhibit a tendency to fail on low-contrast wave fronts and produce a higher rate of both false positives in areas of background clutter and false negatives along the most subtle wave features. This qualitative superiority demonstrates that our model’s architecture is more effective at learning the relevant spatial context and long-range dependencies inherent in IWs structures. It suggests that while SOTA models can achieve high quantitative scores, their generic feature extraction mechanisms may be insufficient to fully resolve the unique morphological challenges presented by IWs. The ability of our framework to translate a quantitative edge into a visually obvious and significant improvement in segmentation integrity confirms its practical advantages for reliable oceanographic analysis.
On the other hand, from the analysis in Figure 4, our proposed model demonstrates a marked advantage over leading SOTA models. While the MIoU scores for robust architectures like DeepLabV3, FPN, and Linknet are concentrated in the 0.6–0.8 range, our model not only achieves a significantly higher median MIoU but also displays a more compact interquartile range (IQR) and fewer outliers. This signifies superior predictive accuracy and exceptional stability across the validation set. Quantitatively, our framework achieves an average MIoU improvement of approximately 6% over the majority of baseline configurations.

4.5. Ablation Studies

This section presents ablation studies designed to evaluate the contributions of individual components within the proposed segmentation framework. Figure 5 shows the visualization of the ablation experiment. It can be seen that the proposed model outperforms the other variants in terms of continuity and wave-packet segmentation capability. The detailed quantitative results are summarized in Table 2. Specifically, the model architecture was divided into two key blocks: the FAS block and the SSM-LA block. In the baseline configuration, the original FAS block was replaced with direct pixel-wise input, while the state-space model employed in the SSM-LA block was substituted with a standard attention mechanism. Table 2 reports three metrics: MIoU, GFLOPs, and the number of model parameters.
Experimental findings reveal that although removing the FAS block in favor of direct pixel-wise inputs theoretically preserves finer spatial details, it leads to a sharp increase in sequence length, thereby significantly intensifying computational demands. As a result, substantial reductions in either batch size or input resolution were required to control the quadratic growth in per-batch training cost. Although this measure enhances computational efficiency, the reduction in input resolution inevitably leads to coarser feature details and degrades model accuracy. This observation aligns with the experimental results presented in the first and second rows of Table 2. In summary, the inclusion of the FAS block results in a marginal increase in computational cost, adding only 0.1 M to the parameter count and 0.69 to the GFLOPs. In return, it achieves a substantial 4.8 percentage-point improvement in MIoU over the baseline.
Substituting the SSM-LA with standard attention mechanisms produced comparable performance. This observation is consistent with recent studies showing that specific state-space models, such as VSSD [48] and MLLA [49], can match or even exceed the performance of similarly sized Transformer architectures under moderate-scale conditions. As shown in Table 2, the integration of the SSM-LA block, while increasing the parameter count by 5.51 M compared to the baseline, yielded substantial performance gains. Specifically, it boosted the model’s MIoU by 6.1 percentage points and simultaneously reduced the GFLOPs by 4.59, indicating a significant improvement in computational efficiency.
Finally, in contrast to the aforementioned configurations, our proposed segmentation model achieves a favorable balance. It secures a high accuracy of 74.3% MIoU at a computational cost of 8.77 GFLOPs, without unduly increasing the model’s parameter count, which stands at 32.10 M.

4.6. Generalization Analysis

To evaluate the generalization capability of the proposed framework, we conducted inference experiments using SAR images acquired from the Andaman Sea and the Sulu Sea. It is important to note that these datasets were not involved in the training process, thereby enabling an unbiased assessment of the model’s performance on unseen scenes.
The prediction results for the Andaman Sea are illustrated in Figure 6. Taking the first row as an example, Figure 6a presents the original test image, Figure 6b shows the segmentation output predicted by the model, and Figure 6c displays the final overlay visualization, where the predicted IWs fronts are superimposed on the original image using green contours. The SAR image in Figure 6a was captured by Sentinel-1 over the Andaman Sea from 11:45:17 to 11:45:31 on 23 March 2019, corresponding to absolute orbit number 026465. Figure 6d was also acquired by Sentinel-1 over the Andaman Sea from 11:45:01 to 11:45:20 on 28 September 2015, with an absolute orbit number of 007915.
The prediction results for the Sulu Sea are shown in Figure 7. The SAR image in Figure 7a was acquired by Sentinel-1 over the Sulu Sea from 21:41:11 to 21:41:40 on 20 February 2019, with an absolute orbit number of 015035. Figure 7d was captured from 21:42:10 to 21:42:35 on 17 December 2020, corresponding to absolute orbit number 035731. The structure of the visualization is consistent with that of the Andaman Sea, comprising the original image, the prediction output, and the final overlay.
In addition, Table 3 presents the quantitative analysis results, including the study area, the quantity of images, and the metrics. These results demonstrate that the proposed segmentation framework maintains robust performance across different geographical regions and acquisition times, confirming its generalization ability and effectiveness in practical applications involving diverse SAR scenes.

5. Conclusions

This paper introduced SMLA, a novel segmentation framework for IWs that successfully integrates a feature-aware serialization (FAS) block, a core state-space model with linear attention (SSM-LA) encoder, and a hierarchical feature decoder. Our hybrid architecture demonstrates a superior ability to model the complex, multi-scale structures of IWs by capturing both long-range spatial dependencies and fine-grained textural features with linear time complexity. Experimental results on a challenging SAR dataset of the South China Sea confirm the state-of-the-art performance of our approach, achieving a 74.3% MIoU and showcasing significant improvements in segmentation continuity and boundary accuracy compared to existing models. The success of SMLA extends beyond a simple performance metric; it signifies a promising direction for oceanic remote sensing analysis. By effectively synergizing state-space modeling with linear attention, our work provides a powerful and computationally efficient paradigm for segmenting complex, quasi-linear phenomena in large-scale geophysical imagery.
Despite its strong performance, we acknowledge certain limitations. The current design, which stacks four SSM-LA blocks to ensure high precision, results in a model with 32.10M parameters, leading to considerable computational overhead during training. This may limit its accessibility for researchers with constrained computational resources. Furthermore, while the model demonstrated robust generalization on unseen data from the Andaman and Sulu Seas, its performance might still be sensitive to SAR imaging conditions not represented in the training set, such as extreme sea states or specific sensor noise patterns.
Future research will proceed along several promising avenues. To address the computational cost, we will explore model compression techniques, such as knowledge distillation and quantization, to develop a lightweight yet powerful version of SMLA suitable for edge computing or rapid deployment. Another critical direction is the integration of physical constraints into the learning process. Finally, we plan to extend the SMLA framework to a multi-modal setting, fusing SAR data with optical or altimetry data to build a more comprehensive and robust IWs segmentation and analysis system.

Author Contributions

Conceptualization, Z.A., S.B., L.J. and M.G.; methodology, Z.A. and Z.L.; software, Z.A. and Z.L.; validation, Z.A., S.B., H.Z. and Y.Y.; formal analysis, Z.A. and L.J.; investigation, Z.A., S.B. and H.Z.; resources, Z.A. and M.G.; data curation, Z.A., Z.L. and S.B.; writing—original draft preparation, Z.A., Z.L. and S.B.; writing—review and editing, Z.A., Z.L., S.B., H.Z., Y.Y. and M.G.; visualization, Z.A. and Z.L.; supervision, L.J., M.G., Y.Y. and H.Z.; funding acquisition, S.B. and H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62161044, 62161045), Natural Science Foundation of Inner Mongolia (2023MS06003, 2025QN06034), Key Laboratory of Infinite-dimensional Hamiltonian System and Its Algorithm Application (IMNU), Ministry of Education (2023KFYB06, 2023KFGJ01, 2023KFGJ02), and Student Science and Technology Innovation Project for the Inner Mongolia Normal University (2025XSKC12).

Data Availability Statement

ENVISAT ASAR data acquired between 1 May 2003 and 31 March 2012 are available from the European Space Agency (https://esar-ds.eo.esa.int/oads/access/collection). Generalization analysis data are available for download from EarthData (https://search.asf.alaska.edu/).

Acknowledgments

The authors would like to thank the developers in the PyTorch community for their open-source deep learning projects. Special thanks are due to the anonymous reviewers and editors for their valuable comments on the improvement of the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

ENVISAT ASAR   Environmental Satellite Advanced Synthetic-Aperture Radar
IWs            Internal Waves
SMLA           State-Space Model Meets Linear Attention
MIoU           Mean Intersection over Union
FAS            Feature-Aware Serialization
SSM-LA         State-Space Model Block with Linear Attention
NCLA           Non-Causal Linear Attention
SAR            Synthetic-Aperture Radar
STE            Sample to Example
FWIoU          Frequency-Weighted Intersection over Union
TP             True Positive
FN             False Negative
FP             False Positive
TN             True Negative

References

  1. Kozlov, I.; Romanenkov, D.; Zimin, A.; Chapron, B. SAR observing large-scale nonlinear internal waves in the White Sea. Remote Sens. Environ. 2014, 147, 99–107. [Google Scholar] [CrossRef]
  2. Pan, J.; Jay, D.A.; Orton, P.M. Analyses of internal solitary waves generated at the Columbia River plume front using SAR imagery. J. Geophys. Res. Oceans 2007, 112. [Google Scholar] [CrossRef]
  3. Alford, M.H.; Peacock, T.; MacKinnon, J.A.; Nash, J.D.; Buijsman, M.C.; Centurioni, L.R.; Chao, S.Y.; Chang, M.H.; Farmer, D.M.; Fringer, O.B.; et al. The formation and fate of internal waves in the South China Sea. Nature 2015, 521, 65–69. [Google Scholar] [CrossRef] [PubMed]
  4. Jackson, C. Internal wave detection using the moderate resolution imaging spectroradiometer (MODIS). J. Geophys. Res. Ocean. 2007, 112. [Google Scholar] [CrossRef]
  5. Zheng, Q.; Susanto, R.D.; Ho, C.R.; Song, Y.T.; Xu, Q. Statistical and dynamical analyses of generation mechanisms of solitary internal waves in the northern South China Sea. J. Geophys. Res. Ocean. 2007, 112. [Google Scholar] [CrossRef]
  6. Crisp, D.J. The State-of-the-Art in Ship Detection in Synthetic Aperture Radar Imagery; Defense Technical Information Center: Fort Belvoir, VA, USA, 2004. [Google Scholar]
  7. Mandal, A.K.; Seemanth, M.; Ratheesh, R. Characterization of internal solitary waves in the Andaman Sea and Arabian Sea using EOS-04 and sentinel observations. Int. J. Remote Sens. 2024, 45, 1201–1219. [Google Scholar] [CrossRef]
  8. Surampudi, S.; Sasanka, S. Internal wave detection and characterization with SAR data. In Proceedings of the 2019 IEEE Recent Advances in Geoscience and Remote Sensing: Technologies, Standards and Applications (TENGARSS), Kochi, India, 17–20 October 2019; pp. 104–108. [Google Scholar]
  9. Sun, L.; Liu, Y.; Meng, J.; Fang, Y.; Su, Q.; Li, C.; Zhang, H. Internal solitary waves in the central Andaman sea observed by combining mooring data and satellite remote sensing. Cont. Shelf Res. 2024, 277, 105249. [Google Scholar] [CrossRef]
  10. Santos-Ferreira, A.M.; Da Silva, J.C.; Magalhaes, J.M. SAR mode altimetry observations of internal solitary waves in the tropical ocean Part 1: Case studies. Remote Sens. 2018, 10, 644. [Google Scholar] [CrossRef]
  11. Zhang, S.; Li, X.; Zhang, X. Internal wave signature extraction from SAR and optical satellite imagery based on deep learning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  12. Saheya, B.; Cai, R.; Zhao, H.; Gong, M.; Li, X. MCS Filter: A Multichannel Structure-Aware Speckle Filter for SAR Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–14. [Google Scholar] [CrossRef]
  13. Zhang, X.; Li, X.; Zheng, Q. A machine-learning model for forecasting internal wave propagation in the Andaman Sea. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 3095–3106. [Google Scholar] [CrossRef]
  14. Saheya, B.; Ren, X.; Gong, M.; Li, X. IW Extraction From SAR Images Based on Generative Networks with Small Datasets. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 16473–16487. [Google Scholar] [CrossRef]
  15. Barintag, S.; An, Z.; Jin, Q.; Chen, X.; Gong, M.; Zeng, T. MTU2-Net: Extracting Internal Solitary Waves from SAR Images. Remote Sens. 2023, 15, 5441. [Google Scholar] [CrossRef]
  16. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
  17. Aleissaee, A.A.; Kumar, A.; Anwer, R.M.; Khan, S.; Cholakkal, H.; Xia, G.S.; Khan, F.S. Transformers in remote sensing: A survey. Remote Sens. 2023, 15, 1860. [Google Scholar] [CrossRef]
  18. Chen, H.; Song, J.; Han, C.; Xia, J.; Yokoya, N. Changemamba: Remote sensing change detection with spatio-temporal state space model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4409720. [Google Scholar] [CrossRef]
  19. Zhu, Q.; Cai, Y.; Fang, Y.; Yang, Y.; Chen, C.; Fan, L.; Nguyen, A. Samba: Semantic segmentation of remotely sensed images with state space model. Heliyon 2024, 10, e38495. [Google Scholar] [CrossRef]
  20. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
  21. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the CVPR, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  22. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: New York, NY, USA, 2015; pp. 234–241. [Google Scholar] [CrossRef]
  23. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar] [CrossRef]
  24. Gu, Z.; Cheng, J.; Fu, H.; Zhou, K.; Hao, H.; Zhao, Y.; Zhang, T.; Gao, S.; Liu, J. Ce-net: Context encoder network for 2d medical image segmentation. IEEE Trans. Med. Imaging 2019, 38, 2281–2292. [Google Scholar] [CrossRef]
  25. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the ICCV, Montreal, QC, Canada, 11–17 October 2021; pp. 568–578. [Google Scholar]
  26. Yao, Y.; Cheng, G.; Lang, C.; Xie, X.; Han, J. Centric Probability-Based Sample Selection for Oriented Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 3290–3302. [Google Scholar] [CrossRef]
  27. Yao, Y.; Cheng, G.; Lang, C.; Yuan, X.; Xie, X.; Han, J. Hierarchical Mask Prompting and Robust Integrated Regression for Oriented Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 13071–13084. [Google Scholar] [CrossRef]
  28. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  29. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the ICCV, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  30. Wang, W.; Chen, W.; Qiu, Q.; Chen, L.; Wu, B.; Lin, B.; He, X.; Liu, W. Crossformer++: A versatile vision transformer hinging on cross-scale attention. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 3123–3136. [Google Scholar] [CrossRef]
  31. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar] [CrossRef]
  32. Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396. [Google Scholar]
  33. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
  34. He, R.; Zheng, W.; Zhao, L.; Wang, Y.; Zhu, D.; Wu, D.; Hu, B. Surface Vision Mamba: Leveraging Bidirectional State Space Model for Efficient Spherical Manifold Representation. arXiv 2025, arXiv:2501.14679. [Google Scholar]
  35. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. NeurIPS 2024, 37, 103031–103063. [Google Scholar]
  36. Lin, J.; Hu, H. Audio mamba: Pretrained audio state space model for audio tagging. arXiv 2024, arXiv:2405.13636. [Google Scholar] [CrossRef]
  37. Dao, T.; Gu, A. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv 2024, arXiv:2405.21060. [Google Scholar] [CrossRef]
  38. Vasavi, S.; Divya, C.; Sarma, A.S. Detection of solitary ocean internal waves from SAR images by using U-Net and KDV solver technique. Glob. Transitions Proc. 2021, 2, 145–151. [Google Scholar] [CrossRef]
  39. Yuan, G.; Ning, C.; Liu, L.; Li, C.; Liu, Y.; Sangmanee, C.; Cui, X.; Zhao, J.; Wang, J.; Yu, W. An automatic internal wave recognition algorithm based on CNN applicable to an ocean data buoy system. J. Mar. Sci. Eng. 2023, 11, 2110. [Google Scholar] [CrossRef]
  40. Jiang, Z.; Gao, X.; Shi, L.; Li, N.; Zou, L. Detection of Ocean Internal Waves Based on Modified Deep Convolutional Generative Adversarial Network and WaveNet in Moderate Resolution Imaging Spectroradiometer Images. Appl. Sci. 2023, 13, 11235. [Google Scholar] [CrossRef]
  41. Chen, L.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. CoRR 2017, arXiv:1706.05587. [Google Scholar]
  42. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  43. Chaurasia, A.; Culurciello, E. LinkNet: Exploiting encoder representations for efficient semantic segmentation. In Proceedings of the 2017 VCIR, St. Petersburg, FL, USA, 10–13 December 2017; pp. 1–4. [Google Scholar] [CrossRef]
  44. Liang, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Mutual Affine Network for Spatially Variant Kernel Estimation in Blind Image Super-Resolution. In Proceedings of the ICCV, Montreal, QC, Canada, 10–17 October 2021; pp. 4096–4105. [Google Scholar]
  45. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  46. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; Proceedings 4. Springer: New York, NY, USA, 2018; pp. 3–11. [Google Scholar]
  47. Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified Perceptual Parsing for Scene Understanding. CoRR 2018, arXiv:1807.10221. [Google Scholar]
  48. Shi, Y.; Dong, M.; Li, M.; Xu, C. VSSD: Vision Mamba with Non-Causal State Space Duality. arXiv 2024, arXiv:2407.18559. [Google Scholar]
  49. Han, D.; Wang, Z.; Xia, Z.; Han, Y.; Pu, Y.; Ge, C.; Song, J.; Song, S.; Zheng, B.; Huang, G. Demystify Mamba in Vision: A Linear Attention Perspective. arXiv 2024, arXiv:2405.16605. [Google Scholar] [CrossRef]
Figure 1. With the continuous evolution of fundamental network architectures for image segmentation—such as CNNs, U-Net, Transformers, and SSMs—segmentation algorithms have become increasingly integrated and cross-disciplinary. These foundational frameworks provide robust algorithmic support for subsequent derivative techniques, including FCN, PVT, Swin Transformer, and Mamba. These “next-generation” models achieve lightweight design while significantly improving the efficiency and accuracy of both remote sensing and medical image segmentation. Based on this, introducing the concept of SSMs into internal waves (IWs) segmentation in remote sensing imagery is expected to further enhance segmentation accuracy and processing speed. All structural diagrams shown in the figure are excerpted from relevant literature.
Figure 2. The overall architecture of the proposed SMLA network. The framework consists of four main components: feature-aware serialization, state-space models with linear attention (SSM-LAs), non-causal linear attention (NCLA), and a feature decoder. The input image is first processed by the serialization module to extract and sequence multi-scale spatial features. These features are then encoded by multiple SSM-LA blocks, followed by NCLA modules to enhance global contextual dependencies. The resulting features are unified in dimension and passed to the decoder for spatial reconstruction and final prediction.
Figure 3. The heatmap visualizes the MIoU (Mean Intersection over Union) score distributions across six backbone architectures (densenet161, dpn131, resnet101, resnet152, se_resnext101_32x4d, and vgg19_bn) paired with various semantic segmentation models. The color gradient provides an intuitive representation of performance variations among different architecture combinations, enabling comparative evaluation of backbone-network effectiveness for segmentation tasks.
Figure 4. The boxplot presents the distribution of MIoU (Mean Intersection over Union) scores for ten mainstream semantic segmentation models (DeepLabV3, FPN, Linknet, MAnet, PSPNet, Unet, Unet++, UPerNet, MTU2, and ours). Through statistical features including the box (interquartile range), median line, and outliers, this visualization systematically compares the performance stability and generalization capability of each model on the test dataset.
Figure 5. Visualization results of the ablation experiment. Image is the original image; GT is the ground truth; Base is the result of the baseline model; FAS is the result of the model with the feature-aware serialization block added; SSM-LA is the result of the model with the state-space model with linear attention added; Ours is the result of the proposed model.
Figure 6. Generalization experimental results on the Andaman Sea. (a,d) are original SAR images; (b,e) are prediction results; (c,f) are the final overlay results.
Figure 7. Generalization experimental results; the Sulu Sea and Celebes Sea scenes in the figure never appeared in the training data. (a,d) are original images; (b,e) are prediction results; (c,f) are the final overlay results.
Table 1. Comparative experiment results. The optimal results are highlighted in red.
Model | Backbone | MIoU | FWIoU | Accuracy | Precision | F1-Score
DeepLabV3 [41] | densenet161 | 0.664 | 0.979 | 0.988 | 0.994 | 0.994
DeepLabV3 [41] | dpn131 | 0.684 | 0.981 | 0.988 | 0.994 | 0.994
DeepLabV3 [41] | resnet101 | 0.668 | 0.979 | 0.988 | 0.994 | 0.994
DeepLabV3 [41] | resnet152 | 0.668 | 0.978 | 0.987 | 0.994 | 0.993
DeepLabV3 [41] | se_resnext101_32x4d | 0.676 | 0.980 | 0.988 | 0.994 | 0.994
DeepLabV3 [41] | vgg19_bn | 0.649 | 0.979 | 0.988 | 0.993 | 0.994
FPN [42] | densenet161 | 0.674 | 0.980 | 0.988 | 0.994 | 0.994
FPN [42] | dpn131 | 0.663 | 0.980 | 0.989 | 0.993 | 0.994
FPN [42] | resnet101 | 0.683 | 0.981 | 0.988 | 0.994 | 0.994
FPN [42] | resnet152 | 0.493 | 0.974 | 0.987 | 0.987 | 0.993
FPN [42] | se_resnext101_32x4d | 0.672 | 0.979 | 0.988 | 0.994 | 0.994
FPN [42] | vgg19_bn | 0.674 | 0.979 | 0.987 | 0.994 | 0.994
Linknet [43] | densenet161 | 0.682 | 0.980 | 0.988 | 0.994 | 0.994
Linknet [43] | dpn131 | 0.671 | 0.979 | 0.987 | 0.994 | 0.994
Linknet [43] | resnet101 | 0.661 | 0.979 | 0.988 | 0.993 | 0.994
Linknet [43] | resnet152 | 0.493 | 0.974 | 0.987 | 0.987 | 0.993
Linknet [43] | se_resnext101_32x4d | 0.683 | 0.979 | 0.987 | 0.995 | 0.993
Linknet [43] | vgg19_bn | 0.677 | 0.980 | 0.988 | 0.994 | 0.994
MAnet [44] | densenet161 | 0.684 | 0.980 | 0.988 | 0.994 | 0.994
MAnet [44] | dpn131 | 0.683 | 0.979 | 0.987 | 0.995 | 0.994
MAnet [44] | resnet101 | 0.680 | 0.979 | 0.987 | 0.994 | 0.994
MAnet [44] | resnet152 | 0.494 | 0.974 | 0.987 | 0.987 | 0.993
MAnet [44] | se_resnext101_32x4d | 0.673 | 0.980 | 0.988 | 0.994 | 0.994
MAnet [44] | vgg19_bn | 0.667 | 0.978 | 0.987 | 0.994 | 0.993
PSPNet [45] | densenet161 | 0.676 | 0.980 | 0.988 | 0.994 | 0.994
PSPNet [45] | dpn131 | 0.649 | 0.979 | 0.988 | 0.993 | 0.994
PSPNet [45] | resnet101 | 0.654 | 0.979 | 0.988 | 0.993 | 0.994
PSPNet [45] | resnet152 | 0.493 | 0.974 | 0.987 | 0.987 | 0.993
PSPNet [45] | se_resnext101_32x4d | 0.656 | 0.978 | 0.987 | 0.993 | 0.993
PSPNet [45] | vgg19_bn | 0.664 | 0.979 | 0.988 | 0.994 | 0.994
Unet [22] | densenet161 | 0.684 | 0.981 | 0.988 | 0.994 | 0.994
Unet [22] | dpn131 | 0.676 | 0.980 | 0.988 | 0.994 | 0.994
Unet [22] | resnet101 | 0.659 | 0.979 | 0.988 | 0.993 | 0.994
Unet [22] | resnet152 | 0.493 | 0.974 | 0.987 | 0.987 | 0.993
Unet [22] | se_resnext101_32x4d | 0.668 | 0.980 | 0.988 | 0.993 | 0.994
Unet [22] | vgg19_bn | 0.657 | 0.979 | 0.987 | 0.994 | 0.993
Unet++ [46] | densenet161 | 0.684 | 0.980 | 0.988 | 0.994 | 0.994
Unet++ [46] | dpn131 | 0.674 | 0.980 | 0.988 | 0.994 | 0.994
Unet++ [46] | resnet101 | 0.676 | 0.979 | 0.987 | 0.994 | 0.993
Unet++ [46] | resnet152 | 0.493 | 0.974 | 0.987 | 0.987 | 0.993
Unet++ [46] | se_resnext101_32x4d | 0.664 | 0.979 | 0.987 | 0.994 | 0.993
Unet++ [46] | vgg19_bn | 0.666 | 0.980 | 0.988 | 0.993 | 0.994
UPerNet [47] | densenet161 | 0.683 | 0.981 | 0.989 | 0.994 | 0.994
UPerNet [47] | dpn131 | 0.680 | 0.980 | 0.988 | 0.994 | 0.994
UPerNet [47] | resnet101 | 0.664 | 0.979 | 0.987 | 0.993 | 0.994
UPerNet [47] | resnet152 | 0.494 | 0.974 | 0.987 | 0.987 | 0.993
UPerNet [47] | se_resnext101_32x4d | 0.675 | 0.980 | 0.988 | 0.994 | 0.994
UPerNet [47] | vgg19_bn | 0.681 | 0.980 | 0.988 | 0.994 | 0.994
MTU2-Net [15] | - | 0.721 | 0.982 | 0.989 | 0.994 | 0.996
Ours | - | 0.743 | 0.981 | 0.990 | 0.995 | 0.996
Table 2. Ablation study of the proposed framework. The optimal results are highlighted in red.
Baseline | FAS | SSM-LA | MIoU (%) | GFLOPs | Parameters (M)
✓ |   |   | 65.4 | 12.67 | 26.49
✓ | ✓ |   | 70.2 | 13.36 | 26.59
✓ |   | ✓ | 71.5 | 8.08 | 32.00
✓ | ✓ | ✓ | 74.3 | 8.77 | 32.10
Table 3. Generalization analysis of the proposed framework.
Area | Quantity | MIoU (%) | FWIoU (%) | Accuracy (%) | Precision (%) | F1-Score (%)
Andaman Sea | 234 | 69.6 | 98.1 | 98.7 | 99.3 | 99.3
Sulu Sea | 138 | 70.1 | 98.2 | 98.7 | 99.3 | 99.4
