A TransUNet-Based Intelligent Method for Identifying Internal Solitary Waves in the South China Sea

Wan, Zubiao; Zhu, Yuhang; Peng, Shiqiu; Xie, Jieshuo; Li, Shaotian; Song, Tao

doi:10.3390/jmse13061154

by Zubiao Wan ^1,2, Yuhang Zhu ^2,3,4, Shiqiu Peng ^2,3,4, Jieshuo Xie ^2, Shaotian Li ^2,3,4,* and Tao Song ^1,*

¹ Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China

² Laboratory of Tropical Oceanography, South China Sea Institute of Oceanology, Chinese Academy of Sciences, Guangzhou 510301, China

³ Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou), Guangzhou 511458, China

⁴ Guangdong Key Laboratory of Ocean Remote Sensing and Big Data, South China Sea Institute of Oceanology, Chinese Academy of Sciences, Guangzhou 510301, China

* Authors to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2025, 13(6), 1154; https://doi.org/10.3390/jmse13061154

Submission received: 16 May 2025 / Revised: 9 June 2025 / Accepted: 9 June 2025 / Published: 11 June 2025

(This article belongs to the Section Physical Oceanography)

Abstract

Internal Solitary Waves (ISWs) play a crucial role in energy transfer among multi-scale oceanic motions. They also have a significant impact on marine transportation and underwater communication. To date, ISW identification methods have been developed primarily for Synthetic Aperture Radar (SAR) imagery. However, under severe sea conditions, the surface signatures of ISWs are generally disrupted, complicating their detection in satellite imagery. To mitigate the disturbances caused by severe weather, it is essential to account for ocean thermocline variability. In this study, we propose an automatic identification method for ISWs, using the LLC4320 dataset from the South China Sea region for model training. The main innovations are: (1) the use of model data that incorporates both sea surface and underwater features, enabling accurate identification under rough sea conditions; and (2) the development of a TransUNet-based automatic identification method that incorporates the underwater features of ISWs and includes modifications such as Dynamic Snake Convolution. The experimental results demonstrate that the model accurately identifies ISWs, achieving a Dice coefficient of 66.32%, a Hausdorff_95 (HD95) of 5.27, a Mean Pixel Accuracy (MPA) of 85.42%, and a Mean Intersection over Union (MIoU) of 73.74% on our dataset, outperforming the other methods compared.

Keywords: internal solitary waves; deep learning; semantic segmentation; LLC4320; TransUNet

1. Introduction

Internal Solitary Waves (ISWs) are fluctuations caused by density differences within water bodies, typically occurring in the deep ocean. These waves result from a balance between nonlinear wave-steepening processes and linear wave dispersion. ISWs characterized by significant amplitudes and robust current velocities are extensively observed in regions surrounding the edge of the continental shelf, abrupt subsurface inclines of islands and seamounts, and sills within straits. They predominantly occur in areas where robust barotropic tidal currents traverse steep topographical features or interact with thermoclines [1]. ISWs not only alter the distribution of temperature and salinity, which impacts biological habitats, but may also threaten the stability of ecosystems [2,3]. Moreover, the abrupt flow changes induced by ISWs can heighten navigation risks for ships and jeopardize the stability of marine engineering structures [4,5,6]. Consequently, a comprehensive understanding and precise identification of oceanic ISWs are crucial for protecting marine ecosystems and ensuring the safety of marine economic activities.

Early research on oceanic ISWs primarily relied on physical models and traditional signal processing techniques, such as the Fourier transform and wavelet transform [7,8]. While effective under certain conditions, these methods often fail to capture the diversity and dynamic characteristics of ISWs, due to their distinct nonlinear characteristics and the complexity of the marine environment. With advances in remote sensing technology and data acquisition methods, researchers have increasingly analyzed ISWs using satellite imagery and ocean observation data [9,10]. The rise of deep learning in the 21st century has brought new opportunities for this field. Researchers have found that deep learning can automatically extract features, eliminating the subjectivity of manually assigned features, thus significantly enhancing the accuracy and efficiency of ISW identification.

In recent years, Convolutional Neural Networks (CNNs) have achieved remarkable success in remote sensing image processing tasks [11,12]. Using multi-layer convolutional and pooling operations, CNNs effectively capture local features within images, making them highly suitable for feature extraction from marine remote sensing data. Some CNN-based models analyze satellite images of the ocean surface, constructing multi-level feature representations that enable high performance when handling complex ocean data [13]. CNN-based identification and extraction of ISWs from marine remote sensing images primarily follows two approaches: object detection and semantic segmentation. Object detection uses bounding boxes to locate ISWs. For example, Bao et al. [14] applied the Faster R-CNN network to automatic ISW detection in the South China Sea (SCS). Semantic segmentation involves delineating the crest lines of ISWs. For instance, Zheng et al. [15] used the Mask R-CNN network to segment ISW stripes. However, current CNN-based models still face challenges in learning global features and require further development.

In addition, the U-Net architecture and its variants [16,17,18], such as TransUNet [19], have become a popular choice for ISW identification due to their superior feature extraction and image segmentation capabilities. Through its encoder–decoder structure and skip connections, U-Net effectively integrates multi-level feature information, enabling the extraction of key ISW features while preserving image details [20].

With advancements in computational power and the expansion of datasets, researchers are exploring methods that combine various deep learning techniques. For instance, combining U-Net with Long Short-Term Memory Networks (LSTMs) [21] improves the handling of time series data, capturing the time-varying characteristics of ISWs. Additionally, some studies have introduced emerging technologies, such as Generative Adversarial Networks (GANs), to enhance model robustness by generating synthetic data [22,23].

Beyond model innovations, advances in data acquisition and processing technologies have also supported research on ISW identification. The development of drone and satellite remote sensing technologies enables researchers to obtain high-resolution ocean surface images and other relevant data [24,25,26], providing a solid foundation for training deep learning models. By integrating remote sensing with deep learning models, it is possible to monitor ISWs in the ocean [27,28,29], improving spatial coverage and temporal resolution for ISW identification.

Currently, most research on ISW identification relies on Synthetic Aperture Radar (SAR) images; however, this approach often yields suboptimal results under complex sea conditions. Incorporating ocean thermocline variability is necessary to counter the disturbances caused by severe weather. This study uses an ocean dataset produced by a high-resolution numerical model, including but not limited to the sea surface height and three-dimensional sea temperature, and takes into account the slender, linear characteristics of ISWs to achieve precise identification in complex marine environments. The use of model data not only reduces the dependence of the identification method on SAR data but also accounts for the temperature gradient within the ocean, mitigating the effects of high sea states.

This paper is organized as follows: Section 1 introduces ISW identification. Section 2 describes the processing of model data and the proposed ISW identification network. Section 3 presents the experimental results and performance of the network. Section 4 discusses the distribution characteristics of ISWs and the identification effectiveness of the model in the study area, and Section 5 concludes the paper.

2. Materials and Methods

2.1. Study Area and Dataset Creation

The study area is located in the SCS, specifically between 115.31° E and 123.64° E longitude, and 9.89° N and 17.14° N latitude, focusing on the identification of ISWs within the Luzon Strait. The Luzon Strait region is characterized by unique topography and complex oceanic background, which make it conducive to the generation and propagation of ISWs [30,31]. This region provides a wealth of samples suitable for training the ISW identification network.

The data source used in this study is the LLC4320 dataset, which contains various oceanic variables, including sea surface height (SSH), flow (zonal and meridional velocities), temperature, and salinity. The LLC4320 dataset is a 1/48° global ocean model dataset produced by MITgcm, developed to support oceanographic research by providing high-resolution global ocean simulations in preparation for the Surface Water and Ocean Topography (SWOT) mission. The dataset has a spatial resolution of 2.3 km. From this source, 360 data entries, each containing different oceanic variables, were selected. We then delimited the region around the Luzon Strait, extracted the data, and manually annotated ISWs to compile the training dataset.

Because ISWs appear only subtly in the sea surface height data, labeling them manually is difficult. To address this, we apply a central difference filter [32] to the sea surface height to enhance the high-frequency features of ISWs so that they are more distinguishable from the disturbed background. Central difference filtering is a widely used numerical method that removes low-frequency signals while emphasizing high-frequency variations. It does so by calculating the difference between the neighboring points on either side of each data point, thereby enhancing high-frequency features in the data. For ISWs in sea surface height data, the wave peaks are typically characterized by prominent high-frequency fluctuations that may be obscured or difficult to detect in the raw data. Applying central difference filtering enhances the contrast of these wave peaks, allowing for clearer identification of the morphology of the ISWs. Additionally, central difference filtering effectively removes low-frequency information, such as background noise or long-period fluctuations, from the smooth sections of the waveforms. This process preserves the fast-varying components of the data, which is essential for manual labeling and for improving the performance of identification models. The formula for the central difference is shown in Equation (1).

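G(x, y) = \frac{1}{2} \sqrt{(I(x + 1, y) - I(x - 1, y))^{2}} + \frac{1}{2} \sqrt{(I(x, y + 1) - I(x, y - 1))^{2}} \qquad (1)

where I(x, y) denotes the sea surface height at grid point (x, y) and G(x, y) denotes the filtered value at that point.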
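As a minimal sketch of how Equation (1) might be applied to a gridded field, the NumPy function below evaluates the two absolute central differences on the interior of a 2D array. The function name `central_difference`, the mapping of x to the first array axis, and the choice to leave border cells at zero are assumptions made for illustration; they are not taken from the processing code used in this study.

```python
import numpy as np


def central_difference(field: np.ndarray) -> np.ndarray:
    """Evaluate Equation (1) on the interior of a 2D field.

    Border cells lack a neighbour on one side, so they are left at zero
    (an assumption of this sketch, not a choice stated in the paper).
    """
    field = np.asarray(field, dtype=float)
    out = np.zeros_like(field)
    # sqrt((I(x+1, y) - I(x-1, y))^2) is the absolute difference along x.
    diff_x = np.abs(field[2:, 1:-1] - field[:-2, 1:-1])
    # sqrt((I(x, y+1) - I(x, y-1))^2) is the absolute difference along y.
    diff_y = np.abs(field[1:-1, 2:] - field[1:-1, :-2])
    out[1:-1, 1:-1] = 0.5 * diff_x + 0.5 * diff_y
    return out


# Synthetic illustration: a flat background with one narrow ridge at column 4.
ssh = np.full((7, 9), 0.3)
ssh[:, 4] += 0.05
filtered = central_difference(ssh)
```

In this synthetic case the flat background maps to zero, while the two flanks of the ridge give a nonzero response, which is the contrast enhancement the filter is meant to provide.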

Figure 1a displays the sea surface height data in the study area before filtering, while Figure 1b illustrates the data after filtering. After filtering, the crest lines of ISWs become more distinct, facilitating the labeling of most ISWs. LabelMe can then be used to manually label these waves. Although manual labeling of ISWs is widely used, it has several limitations, particularly concerning labeling quality. First, in regions with low contrast and high noise, filtering may generate pseudo-wave peaks, leading to inaccurate labeling. Additionally, the characteristics and complexity of ISWs require the labeler to have a certain level of expertise to distinguish between genuine wave peaks and background noise or other fluctuations. In areas with sparse ISWs and subtle fluctuations, the model is highly sensitive to labeling quality. If the labeler fails to correctly identify these fluctuations, mislabeling or omission may result, which can degrade the performance of models trained with these labels. Therefore, the labeling method plays a critical role in the final identification accuracy.

Our labeling process begins by locating the high-frequency line, followed by extracting a profile of the three-dimensional ocean temperature data at the same position along a direction perpendicular to the ripples. If the isotherms within the profile are concave, the feature is identified as an ISW, and lines are drawn manually to label it. Using this method, ripples that are difficult to distinguish from sea surface height data alone can be accurately identified. This isotherm check is widely used in oceanography to identify ISWs, and it improves the accuracy and quality of manual labeling. The results of the labeling are shown in Figure 2.

2.2. Deep Learning Framework

The proposed model is developed based on the TransUNet architecture, which integrates the strengths of Transformers in capturing long-range dependencies with the advantages of U-Net in preserving fine details and high-resolution features [33]. As shown in Figure 3, TransUNet consists of two main components: the encoder and the decoder. The encoder uses convolutional layers and a Transformer encoder to progressively extract image features. In this process, convolutional layers capture local features, while the Transformer encoder acquires global context through the self-attention mechanism. This combination not only preserves U-Net’s sensitivity to local features but also improves its ability to interpret global information, thereby enhancing the model’s capacity to process complex image data. In the decoder, TransUNet adopts a structure similar to U-Net, integrating upsampling and convolutional operations to restore feature maps to the original resolution. Skip connections in the decoding process allow the model to fully use the features extracted by the encoder, improving segmentation accuracy. Compared to the traditional U-Net, TransUNet can fuse local and global information more effectively, making it better suited for complex image segmentation tasks. In our study, several modifications have been made to the original TransUNet framework to address the specific characteristics of ISWs. The modifications to the model are shown in Figure 4. The following subsections present these modifications and the rationale behind them.

2.3. Dynamic Snake Convolution

ISWs in model data manifest as elongated, curved tubular structures, and the local features of these structures are crucial for accurate ISW identification. Traditional convolution operations, with their fixed-shape kernels, struggle to capture the precise features of these complex morphologies. In contrast, Dynamic Snake Convolution (DSConv), with its adaptive kernel shape and weight adjustment mechanism, is better suited to focus on the local features of ISWs, thereby improving the accuracy and robustness of feature extraction.

The DSConv operation is inspired by serpentine curves, enabling it to focus on curved structures within images adaptively. Unlike traditional convolution kernels, which are typically rectangular or square, the shape of the DSConv kernel resembles a serpentine path, allowing for more flexible capture of complex edges and textures [34,35,36]. The core principle of the DSConv is to maintain the continuity of the region of interest through an iterative strategy, improving both segmentation accuracy and efficiency. Specifically, the DSConv introduces deformable offsets on top of standard convolution kernels, allowing the kernel to flexibly adapt to the complex geometric features of the target. Moreover, this approach enhances the ability to extract local features of tubular structures by adaptively adjusting the weights of the convolution kernel. Figure 5 demonstrates the working principle of the DSConv.

In the proposed TransUNet model, we integrate the DSConv into the encoding path to replace traditional convolutional layers. Specifically, the DSConv replaces the 3 × 3 convolutions in each encoding block. As shown, the ISW features primarily extend longitudinally, and the receptive field of the DSConv effectively captures longitudinal offsets, which enables the model to better capture the local structural information of ISWs while extracting image features.

2.4. Frequency Perception Module (FPM)

The Frequency Perception Module (FPM) is a network architecture designed to process information across different frequency ranges simultaneously. It typically utilizes octave convolution to automatically detect both high and low-frequency information in an end-to-end manner, extracting these features from images. Cong et al. [37] demonstrated that the FPM enhanced the model’s ability to capture detailed information and overall structure in images, thereby improving identification performance. The detailed structure of the FPM is shown in Figure 6.

The ripples of ISWs often share similar visual characteristics, such as color and texture, with the background, making it difficult for traditional neural networks to differentiate between the two. By incorporating the FPM, the model can decompose the ISW features into high and low-frequency components. High-frequency features represent textural details or rapidly changing areas, particularly the edges of ISWs, and significantly enhance the accuracy of edge identification. Low-frequency features help outline the overall contour of the object, aiding the model in identifying the fundamental structure of ISWs. In the identification model, the FPM is integrated into the feature extraction process. With this module, the model can independently learn features across various frequency scales, thereby improving the efficiency and accuracy of distinguishing the wave trace from the background. Ultimately, this approach enhances the identification accuracy of ISWs and strengthens the support for subsequent image processing and analysis.

2.5. Convolutional Block Attention Module (CBAM)

ISWs may vary in scale and morphology in model data and be obscured by complex background information. Cai et al. [38] showed that introducing the CBAM module enabled the model to focus on the relevant feature regions of ISWs more effectively during feature extraction and suppressed irrelevant background information. This improvement allows the model to extract more accurate features from ISWs of varying scales and morphologies. Additionally, the lightweight nature of the CBAM module makes it well-suited for real-time applications or resource-constrained environments, without significantly increasing computational overhead. Thus, the CBAM module is embedded in the model’s encoding path to improve its sensitivity to ISW features.

The Convolutional Block Attention Module (CBAM) is a lightweight attention mechanism that separately learns channel and spatial weights through two parallel sub-modules: a channel attention module and a spatial attention module [39]. As shown in Figure 7, the channel attention module aggregates spatial information from feature maps using global average pooling and max pooling and then learns inter-channel dependencies through a shared MLP network, which in turn generates a channel weight map. These channel weights are then applied to the individual channels of the input feature maps, thereby reinforcing the model’s focus on key channels. Similarly, the spatial attention module learns spatial dependencies. Ultimately, the outputs of the channel and spatial attention modules are multiplied to generate the final attention map, which is subsequently applied to the input feature map to enhance the focus on important features.
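The channel attention step described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names, the reduction ratio implied by the weight shapes, the ReLU between the two MLP layers, and the sum-then-sigmoid combination follow the commonly cited CBAM formulation and are assumptions here, not details confirmed by this paper. The spatial attention branch is omitted.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(feat: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Return one weight per channel for a feature map of shape (C, H, W).

    w1 has shape (C // r, C) and w2 has shape (C, C // r); together they form
    the shared two-layer MLP. The reduction ratio r is a free parameter here.
    """
    avg_desc = feat.mean(axis=(1, 2))  # global average pooling -> (C,)
    max_desc = feat.max(axis=(1, 2))   # global max pooling -> (C,)

    def shared_mlp(v: np.ndarray) -> np.ndarray:
        return w2 @ np.maximum(w1 @ v, 0.0)  # linear -> ReLU -> linear

    # Both descriptors pass through the same MLP; the results are summed
    # and squashed into the range 0 to 1.
    return sigmoid(shared_mlp(avg_desc) + shared_mlp(max_desc))


def apply_channel_attention(feat: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Rescale each channel of feat by its attention weight."""
    weights = channel_attention(feat, w1, w2)
    return feat * weights[:, None, None]
```

In a trained network `w1` and `w2` would be learned parameters; any values passed in when trying this sketch are placeholders.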

2.6. Matthews Correlation Coefficient Loss Function

ISW identification is a typical example of an imbalanced classification problem, where the ISW samples are relatively scarce compared to background ones. Traditional loss functions, such as cross-entropy, may perform poorly in such conditions, as they prioritize overall accuracy while neglecting the class imbalance, which results in predictions that favor the background category. By incorporating the Matthews Correlation Coefficient (MCC) loss function, the model can better address the class imbalance, thereby improving the accuracy of ISW identification [40].

The MCC [41] is a metric commonly used to evaluate the performance of binary classification models. It considers all four categories in the confusion matrix, true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), and provides a comprehensive assessment of model performance. The MCC value ranges from −1 to 1, where 1 indicates perfect prediction, 0 represents performance equivalent to random guessing, and −1 reflects an opposite prediction. MCC is particularly well-suited for binary classification tasks with imbalanced datasets, as it simultaneously accounts for the performance across all categories.

The MCC is calculated as shown in Equation (2).

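\mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}} \qquad (2)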

In this formula, TP represents the number of samples correctly classified as positive by the model; TN represents the number of samples correctly classified as negative; FP represents the number of samples incorrectly classified as positive; and FN represents the number of samples incorrectly classified as negative. This formula allows the MCC to reflect the model’s predictive ability for both positive and negative categories, thereby mitigating any bias arising from class imbalance and providing a more stable and reliable evaluation.

The computational process begins by calculating the confusion matrix, which compares the model’s predictions with the true labels to obtain the values of TP, TN, FP, and FN. The MCC value is then computed by substituting the confusion matrix values into the formula. The more accurately the model identifies both ISWs and background, the closer the MCC value is to 1. Conversely, if the model performs poorly, the MCC value will be close to 0 or negative.
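The two steps just described, counting the confusion matrix entries and substituting them into Equation (2), can be illustrated with a short pure-Python sketch. The function names are hypothetical, and returning 0 when the denominator vanishes is a common convention assumed here rather than something stated in the text.

```python
import math


def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN for two equal-length sequences of 0/1 labels."""
    tp = tn = fp = fn = 0
    for p, t in zip(pred, truth):
        if p == 1 and t == 1:
            tp += 1
        elif p == 0 and t == 0:
            tn += 1
        elif p == 1 and t == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient as in Equation (2)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Toy example: 1 marks an ISW pixel, 0 marks background.
pred = [1, 1, 0, 0, 0, 0, 0, 1]
truth = [1, 0, 0, 0, 0, 0, 1, 1]
score = mcc(*confusion_counts(pred, truth))  # counts are TP=2, TN=4, FP=1, FN=1
```

For this toy input the score is 7/15, roughly 0.47, reflecting one false alarm and one miss among three true ISW pixels.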

To use MCC as a loss function for model optimization, the MCC values must be converted to loss values. Since most optimization algorithms aim to minimize the loss function, this conversion is typically achieved by subtracting the MCC from 1. In implementing the MCC loss function, further improvements in the model’s focus on minority class samples can be achieved through class weighting. This strategy helps the model increase the accuracy of identifying the minority class without introducing bias toward the majority class.
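One way to turn this idea into a trainable objective is to replace the hard counts with "soft" counts computed from predicted probabilities, so that the expression stays differentiable. The NumPy sketch below shows only the arithmetic of such a 1 − MCC loss; the soft-count relaxation, the function name `soft_mcc_loss`, and the small constant `eps` are assumptions for illustration, and the class weighting mentioned above is not included because its form is not specified.

```python
import numpy as np


def soft_mcc_loss(prob: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Return 1 - MCC using soft confusion counts.

    prob holds predicted ISW probabilities in [0, 1]; target holds 0/1 labels.
    """
    prob = np.asarray(prob, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    tp = np.sum(prob * target)
    tn = np.sum((1.0 - prob) * (1.0 - target))
    fp = np.sum(prob * (1.0 - target))
    fn = np.sum((1.0 - prob) * target)
    # eps keeps the division defined when one of the factors is zero.
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + eps
    return float(1.0 - (tp * tn - fp * fn) / denom)
```

With this definition a perfect prediction gives a loss near 0, an uninformative constant prediction gives 1, and a fully inverted prediction gives a loss near 2. In a deep learning framework the same expressions would be written with that framework's tensor operations so that gradients can flow through them.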

3. Results

3.1. Evaluation Metrics

To evaluate the networks’ performance, this study uses four metrics common in semantic segmentation: Dice Coefficient, Hausdorff_95 (HD95), Mean Pixel Accuracy (MPA), and Mean Intersection over Union (MIoU). The Dice Coefficient quantifies the similarity between the predicted results and the ground truth as twice the size of their intersection divided by the sum of their sizes (Equation (3)). HD95 measures the 95th percentile of the distances between the predicted and ground truth boundaries; it is useful for assessing the accuracy of segmentation boundaries and is less sensitive to outliers than the maximum Hausdorff distance. MPA averages the per-category proportion of correctly classified pixels over all categories, reflecting the model’s overall pixel-level performance (Equation (4)). MIoU calculates the intersection-to-union ratio for each category and averages these values across all categories, serving as a widely adopted metric in semantic segmentation (Equation (5)).

\mathrm{Dice} = \frac{2 \times \mathrm{TP}}{2 \times \mathrm{TP} + \mathrm{FP} + \mathrm{FN}} \qquad (3)

\mathrm{MPA} = \frac{1}{N} \sum_{i = 1}^{N} \frac{\mathrm{TP}_{i}}{\mathrm{TP}_{i} + \mathrm{FP}_{i}} \qquad (4)

\mathrm{MIoU} = \frac{1}{N} \sum_{i = 1}^{N} \frac{\mathrm{TP}_{i}}{\mathrm{TP}_{i} + \mathrm{FP}_{i} + \mathrm{FN}_{i}} \qquad (5)

where True Positive (TP) is the number of pixels correctly identified as positive, False Negative (FN) is the number of pixels incorrectly classified as negative, and False Positive (FP) is the number of pixels incorrectly predicted as positive by the model. The subscript i denotes the category, and N is the number of categories.
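As a worked illustration of Equations (3) to (5), the pure-Python sketch below evaluates the three count-based metrics for a two-category problem (ISW and background). The function names and the helper that derives per-category counts from a binary confusion matrix are hypothetical; HD95 is not covered because it depends on boundary distances rather than counts, and no guard against empty categories is included.

```python
def dice(tp, fp, fn):
    """Equation (3): Dice coefficient of the positive (ISW) category."""
    return 2 * tp / (2 * tp + fp + fn)


def per_category_counts(tp, tn, fp, fn):
    """Return (TP_i, FP_i, FN_i) for [ISW, background] from binary counts.

    Seen from the background category, its true positives are the binary TN,
    its false positives are the binary FN, and its false negatives are the
    binary FP.
    """
    return [(tp, fp, fn), (tn, fn, fp)]


def mpa(categories):
    """Equation (4): mean over categories of TP_i / (TP_i + FP_i)."""
    return sum(tp_i / (tp_i + fp_i) for tp_i, fp_i, _ in categories) / len(categories)


def miou(categories):
    """Equation (5): mean over categories of TP_i / (TP_i + FP_i + FN_i)."""
    return sum(
        tp_i / (tp_i + fp_i + fn_i) for tp_i, fp_i, fn_i in categories
    ) / len(categories)


# Toy counts: TP=2, TN=4, FP=1, FN=1.
categories = per_category_counts(2, 4, 1, 1)
dice_value = dice(2, 1, 1)     # 4 / 6
mpa_value = mpa(categories)    # (2/3 + 4/5) / 2
miou_value = miou(categories)  # (2/4 + 4/6) / 2
```

Because MPA and MIoU average over both categories while Dice here is computed for the ISW category alone, the large and easily classified background can lift the averaged metrics above the Dice value.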

3.2. Data and Training Setup

The ISW dataset described in Section 2 is used to train the network. This dataset comprises 360 data samples, split 9:1 between the training and test sets, with manually annotated labels. The model input consists of the center-differenced sea surface height and, as a supplementary input, the ocean temperature at depths of about 88 to 171 m from the LLC4320 dataset. Training and testing are conducted in the PyTorch 2.0.1 framework on a server with an NVIDIA GeForce RTX 4080 Graphics Processing Unit (GPU). Adam is employed as the optimizer with an initial learning rate of 0.001. The batch size is set to 4, and the experiments are conducted over 100 epochs. Dosovitskiy et al. [42] suggested that Transformer models performed optimally only when the dataset was sufficiently large. To enable TransUNet to identify ISWs effectively from a small dataset and to address overfitting due to the model’s high complexity, we analyze the number of Transformer layers, a key parameter that influences model complexity, and apply the Dropout technique [43]. Dropout randomly deactivates some neurons during training, which reduces overfitting. Table 1 and Table 2 present the effects of the number of Transformer layers and the dropout rate on model performance, respectively. The experimental results show that the model achieves optimal performance when the number of Transformer layers is 2 with a dropout rate of 0.1.
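For orientation, the setup above can be summarized as a small configuration plus a 9:1 index split. This is a sketch under stated assumptions: the dictionary keys, the function name `split_indices`, the random shuffling, and the seed are all hypothetical, since the text does not say how samples were assigned to the two sets; only the numerical values come from this section.

```python
import random

# Values reported in this section; the key names are illustrative only.
CONFIG = {
    "n_samples": 360,
    "train_fraction": 0.9,
    "optimizer": "Adam",
    "learning_rate": 0.001,
    "batch_size": 4,
    "epochs": 100,
    "transformer_layers": 2,
    "dropout_rate": 0.1,
}


def split_indices(n_samples, train_fraction, seed=0):
    """Shuffle sample indices and split them into training and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: a random, seeded split
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(CONFIG["n_samples"], CONFIG["train_fraction"])
```

With 360 samples, a 9:1 split yields 324 training samples and 36 test samples.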

3.3. Comparison with Other Deep Learning Models

To evaluate the performance of the proposed model, we trained two deep learning networks, MTU2-Net and U-Net + CBAM, using the same dataset constructed in this study and performed a comparative analysis. MTU2-Net, a model proposed by Saheya et al. [44] specifically for ISW identification, significantly enhances the model accuracy using a novel loss function. Similarly, Cai et al. [38] integrated a CBAM with the U-Net model, which also achieves precise ISW identification. The experimental results for the proposed model, along with those for the other two models, are presented in Table 3. As shown in Table 3, the proposed model outperforms the others in terms of Dice coefficient, MPA, and MIoU, and also shows a significant improvement in HD95. These results indicate that the proposed model surpasses the compared models in both the number of identified ISWs and the accuracy of ISW location predictions.

We further selected four representative ISWs from the test set, including cases with compact wave patterns, inconspicuous waves, and transverse wave patterns, and we manually identified the ISWs in the visualized images to establish the ground truth. The visualizations of sea surface height data, ground truth data, and the identification results of various deep learning methods are shown in Figure 8. The results show that MTU2-Net tends to miss inconspicuous ISW stripes in complex scenarios and struggles to distinguish the boundaries of ISWs when the stripes are dense. For the U-Net combined with the CBAM module, the main issue is a tendency to produce false positive identifications. As expected, the proposed model maintains excellent ISW identification accuracy across various complex scenarios and effectively distinguishes the boundaries of ISWs, demonstrating its superiority.

3.4. Ablation Experiments

During the modification of the model, we conducted ablation experiments to evaluate the individual techniques. The experiments involved four models: the original TransUNet model, the TransUNet model using DSConv for feature extraction, the TransUNet model incorporating both DSConv and FPM, and a model that further integrates a CBAM on top of the previous components. The Dice coefficient was used as the evaluation metric to assess each model’s identification quality. The experimental results are shown in Table 4.

The results presented in the table demonstrate that all three components contribute to improving the model’s ability to identify ISWs accurately: the DSConv, which automatically adjusts the receptive field and is highly suitable for extracting elongated features of ISWs; the FPM, which effectively distinguishes wave patterns from the background and thereby facilitates the extraction of ISW edge features; and the CBAM, which combines channel and spatial attention to enhance feature focusing. These findings strongly support the effectiveness of our proposed method.

4. Discussion

4.1. Distribution of ISWs in the South China Sea and Model Generalization Ability

The Northern South China Sea (NSCS) is one of the regions with the highest frequency of ISW activity globally, and the distribution of ISWs in this area follows distinct patterns. According to the model identification results shown in Figure 9a, ISW activity is particularly intense in the NSCS, especially to the west of the Luzon Strait and in the northern continental shelf region. The seabed topography in these areas is complex and varied, while the pronounced water stratification creates favorable conditions for the formation and propagation of ISWs. The identification results indicate that ISWs exhibit a high density and large amplitudes in these regions, significantly affecting the marine environment. Additionally, ISWs occurring to the west of the Luzon Strait display larger amplitudes and longer wavelengths near the strait, although they are most densely distributed further to the west. Observations also reveal that ISWs in the northern SCS experience considerable deformation and attenuation during propagation, primarily due to the complex dynamic processes and topographic effects in the ocean.

To enhance the generalization ability, the proposed model employs key techniques to improve its adaptability across different regions. When processing data with distinctive waveform features, such as ISWs, the DSConv is particularly effective in extracting slender features. The waveform characteristics of ISWs remain consistent across regions, and the adaptive nature of the DSConv allows it to adjust the convolution path according to the input signal’s morphology, ensuring efficient feature extraction across diverse regions and datasets. The CBAM further enhances the model’s ability to identify significant features by incorporating spatial and channel attention mechanisms. Spatial attention enables the model to focus on critical regions of the input image, while channel attention increases sensitivity to various features by weighting the importance of different channels. By enhancing the model’s capacity to identify locally significant features, the CBAM effectively improves the model’s adaptability to variations in data across different regions. Additionally, the model’s generalization is further strengthened by dropout regularization and normalization techniques during training. Regularization controls model complexity and mitigates overfitting, while normalization stabilizes the training process by adjusting the input distribution at each layer, preventing instability caused by significant data differences between regions. By combining these techniques, the model maintains strong identification performance across diverse datasets, enhancing its robustness and overall performance.

4.2. Comparison with Traditional Identification Methods

In this study, the proposed model uses deep learning to identify ISWs in the NSCS and produces accurate identification results. Compared to traditional wavelet transform-based methods, the deep learning model demonstrates superior expressiveness and accuracy. Traditional methods rely on manual feature extraction and mathematical transformations, typically waveform decomposition and reconstruction, to identify ISWs. While the wavelet transform can capture local features to some extent, it is sensitive to waveform complexity and noise, which limits its effectiveness in edge recognition. Under severe sea conditions, traditional methods struggle to distinguish waveform features against complex backgrounds and are more prone to misidentification. In contrast, the deep learning model automatically extracts deep features from the data and integrates spatial and temporal information through end-to-end learning, enabling efficient identification of ISWs. The network adapts to various waveforms and environmental changes during training, which enhances the model’s robustness and accuracy.
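As a rough illustration of the decomposition, reconstruction, and noise sensitivity mentioned above, the sketch below applies one level of the Haar wavelet transform to a one-dimensional profile and flags large detail coefficients as edge candidates. The choice of the Haar wavelet, the threshold, and the synthetic profiles are assumptions made for illustration; they are not the traditional method evaluated in this study.

```python
import math

def haar_dwt(signal):
    """One level of the orthonormal Haar transform on an even-length sequence."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuild the original samples pair by pair."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out += [(a + d) * s, (a - d) * s]
    return out

def edge_candidates(signal, threshold):
    """Start index of each sample pair whose detail coefficient exceeds the threshold."""
    _, detail = haar_dwt(signal)
    return [2 * i for i, d in enumerate(detail) if abs(d) > threshold]

# Synthetic profile with two sharp transitions (made-up values).
clean = [0.0, 0.0, 0.0, 4.0, 4.0, 4.0, 4.0, 0.0]
# Same profile with one noisy sample added at index 1.
noisy = [0.0, 1.5, 0.0, 4.0, 4.0, 4.0, 4.0, 0.0]

print(edge_candidates(clean, threshold=1.0))
print(edge_candidates(noisy, threshold=1.0))
```

With a fixed threshold, the single noisy sample produces an extra candidate alongside the two genuine transitions, which is the kind of misidentification a hand-tuned wavelet pipeline has to suppress.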

The experimental results shown in Figure 10 indicate that the proposed model significantly outperforms the traditional wavelet transform method in identification accuracy. In the experiments, the traditional method not only produced blurred and erroneous edges but also frequently missed ISWs or misclassified waves that were not ISWs. This issue is especially evident in environments with fluctuating waveforms and high interference, where it leads to substantial identification errors. In contrast, the deep learning model maintained high identification accuracy in complex marine environments, detecting wave edges effectively and overcoming common limitations of traditional methods. These results demonstrate that the deep learning model adapts better to marine data and performs better in ISW identification tasks.

4.3. ISW Identification Under Extreme Sea States

In the context of ISW identification in the SCS under extreme sea states, we focused on the impact of input data selection on the accuracy of identification results. Traditional methods for ISW identification often rely on sea surface data, such as SAR imagery. However, wind and current fields in extreme sea states can significantly interfere with the sea surface state, obscuring the characteristics of ISWs and compromising identification accuracy. To address this challenge, we utilized three-dimensional ocean temperature data from the high-resolution LLC4320 dataset as the sole input for our identification model. Compared to the models using only sea surface height data, the proposed model is less susceptible to surface-related disturbances. As a result, we anticipate that this approach will enhance the robustness of ISW identification.
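To make concrete why subsurface temperature carries an ISW signature independent of the surface state, the sketch below locates the depth of a chosen isotherm in a vertical temperature profile by linear interpolation and compares a background profile with one under a wave of depression. The function `isotherm_depth`, the 22 °C isotherm, and both profiles are hypothetical illustrations, not the preprocessing or data used in this study.

```python
def isotherm_depth(depths, temps, target):
    """Depth (m) at which the profile first crosses `target` (deg C).

    depths must be increasing; linear interpolation is used between levels.
    Returns None if the profile never crosses the target temperature.
    """
    levels = list(zip(depths, temps))
    for (z0, t0), (z1, t1) in zip(levels, levels[1:]):
        if (t0 - target) * (t1 - target) <= 0 and t0 != t1:
            return z0 + (t0 - target) * (z1 - z0) / (t0 - t1)
    return None

# Synthetic profiles (made-up values): the thermocline is pushed downward
# when a depression-type wave passes.
depths = [0.0, 50.0, 100.0, 200.0]
background = [28.0, 26.0, 20.0, 14.0]
under_wave = [28.0, 27.5, 25.0, 16.0]

z_bg = isotherm_depth(depths, background, 22.0)
z_wave = isotherm_depth(depths, under_wave, 22.0)
displacement = z_wave - z_bg  # positive means the isotherm moved deeper
print(round(z_bg, 2), round(z_wave, 2), round(displacement, 2))
```

Because the displacement is measured within the water column, it remains informative even when wind and surface currents mask the corresponding surface expression.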

The results presented in Figure 9b indicate that when three-dimensional ocean temperature is used as the input, the proposed model maintains high performance and successfully identifies the overall distribution of ISWs. This suggests that three-dimensional ocean temperature provides reliable information for ISW identification and is minimally affected by surface conditions. The finding is significant for practical ocean monitoring and early warning, as it offers a more robust method for ISW identification. However, the current approach, which does not account for sea surface conditions, still lacks detail. As shown in Figure 9b,c, the identified ISW stripes are poorly resolved at their boundaries, and several small targets are misidentified. Moving forward, we will explore other ISW identification methods based on three-dimensional ocean data to improve accuracy and robustness under severe sea conditions.

5. Conclusions

ISWs are a significant phenomenon in ocean dynamics and are critical to marine engineering, marine ecological protection, resource development, and navigation. Their unique propagation characteristics and complex generation mechanisms have been central topics in marine scientific research for an extended period.

To accurately identify ISWs in the ocean, we propose a semantic segmentation model based on an enhanced TransUNet. The proposed model preserves the original advantages of TransUNet and further incorporates innovative modifications tailored to the characteristics of ISWs: the DSConv to improve the model’s adaptability to waveform features, the FPM for effective differentiation of ISWs from the background, and the CBAM to enhance the model’s focus on key regions. The experimental results demonstrate that the proposed model performs exceptionally well in identifying ISWs in the SCS, achieving a Dice coefficient of 66.32%, HD95 of 5.27, MPA of 85.42%, and MIoU of 73.74%, all superior to other methods. Using both sea surface height and three-dimensional ocean temperature data as inputs, the model accurately identifies the location and morphology of ISWs. Under complex sea conditions, the model maintains high identification accuracy, even when only the three-dimensional ocean temperature data are used as input, highlighting its robustness and effectiveness.
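For readers who want to reproduce the overlap metrics quoted above on their own masks, the following sketch computes the Dice coefficient, MPA, and MIoU from a binary prediction and a binary ground truth. It uses the common two-class definitions (ISW and background) and computes Dice for the ISW class only; whether these match the exact definitions in Section 3.1 is an assumption, and HD95 is omitted for brevity. The toy masks are made up.

```python
def confusion(pred, truth):
    """Count true/false positives and negatives for binary masks (lists of rows)."""
    tp = fp = fn = tn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def dice(tp, fp, fn, tn):
    """Dice coefficient of the ISW (foreground) class."""
    return 2 * tp / (2 * tp + fp + fn)

def mean_pixel_accuracy(tp, fp, fn, tn):
    """Per-class pixel accuracy averaged over ISW and background.
    Assumes both classes occur in the ground truth."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def mean_iou(tp, fp, fn, tn):
    """Intersection over union averaged over ISW and background."""
    return 0.5 * (tp / (tp + fp + fn) + tn / (tn + fp + fn))

# Toy 4 x 4 masks: 1 marks an ISW pixel.
truth = [[0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
pred = [[0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]

counts = confusion(pred, truth)
print(dice(*counts), mean_pixel_accuracy(*counts), mean_iou(*counts))
```

Because ISW stripes occupy a small fraction of each image, the background class dominates MPA and MIoU under these definitions, which is one reason a foreground Dice score can be lower than the MIoU computed on the same prediction.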

In the future, we will explore more advanced semantic segmentation techniques and algorithms to further enhance the model’s identification and generalization capabilities. In particular, additional effort will be dedicated to improving ISW identification under severe sea conditions, with a focus on recovering finer details. It will also be necessary to examine how additional oceanic factors influence ISW identification.

Author Contributions

Conceptualization, Z.W. and S.P.; methodology, Z.W.; validation, Z.W., T.S. and S.L.; formal analysis, Z.W., Y.Z. and J.X.; investigation, Z.W. and T.S.; Data curation, Z.W. and Y.Z.; writing—original draft preparation, Z.W.; writing—review and editing, S.L. and T.S.; supervision, T.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Key Research and Development Program of China (2022YFC3105005), the National Natural Science Foundation of China (U21A6001, 42206019, 42376019), the Guangdong Key Project (2019BT2H594), the Guangdong Basic and Applied Basic Research Foundation (2022A1515240081) and the special fund of South China Sea Institute of Oceanology of the Chinese Academy of Sciences (SCSIO2023QY01).

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from the South China Sea Institute of Oceanology, Chinese Academy of Sciences, and are available from S.P. with the permission of the South China Sea Institute of Oceanology, Chinese Academy of Sciences.

Acknowledgments

The authors gratefully acknowledge the use of the HPCC at the South China Sea Institute of Oceanology, Chinese Academy of Sciences.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. (a) Schematic representation of sea surface height data in the study area. (b) Schematic of sea surface height data after center differencing.

Figure 2. Schematic diagram of Internal Solitary Wave labeling.

Figure 3. Architecture diagram of Internal Solitary Wave identification model.

Figure 4. Innovations to the identification model.

Figure 5. Dynamic Snake convolution architecture.

Figure 6. Frequency Perception Module (FPM) architecture.

Figure 7. Convolutional Block Attention Module architecture.

Figure 8. Comparison of the identification effect of each model.

Figure 9. (a) ISW identification results within the South China Sea. (b) Identification results using only ocean temperature data. (c) Ground truth of the identified area.

Figure 10. (a) ISW identification results using the deep learning method. (b) ISW identification results using the conventional method. (c) Ground truth of the identified area.

Table 1. The effect of the Transformer layer.

Transformer Layer	Dice Coefficient	HD95	MPA	MIoU
1	0.6455	7.04	0.8274	0.7282
2	0.6632	5.27	0.8542	0.7374
3	0.6215	7.63	0.8591	0.7134

Table 2. The effect of dropout rate.

Dropout Rate	Dice Coefficient	HD95	MPA	MIoU
0	0.5904	10.00	0.8657	0.6955
0.1	0.6632	5.27	0.8542	0.7374
0.2	0.6445	6.49	0.8484	0.7266
0.3	0.6113	7.90	0.8571	0.7065

Table 3. Deep learning segmentation model experimental results.

Model	Dice Coefficient	HD95	MPA	MIoU
U-Net + CBAM	0.6291	13.89	0.7948	0.7196
MTU²-Net	0.6321	10.69	0.8076	0.7207
The proposed model	0.6632	5.27	0.8542	0.7374

Table 4. Results of model ablation experiments.

TransUNet	DySnakeConv	FPM	CBAM	Dice Coefficient
√				0.6125
√	√			0.6363
√	√	√		0.6450
√	√	√	√	0.6632

“√” indicates that the corresponding module is included.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).