Article

STNet: Prediction of Underwater Sound Speed Profiles with an Advanced Semi-Transformer Neural Network

1 The Faculty of Information Science and Engineering, Ocean University of China, Qingdao 266404, China
2 School of Space Science and Technology, Shandong University (Weihai), Weihai 264200, China
* Authors to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(7), 1370; https://doi.org/10.3390/jmse13071370
Submission received: 11 June 2025 / Revised: 10 July 2025 / Accepted: 16 July 2025 / Published: 18 July 2025

Abstract

The real-time acquisition of accurate underwater sound speed profiles (SSPs) is crucial for tracking the propagation trajectories of underwater acoustic signals and therefore plays a key role in ocean communication and positioning. SSPs can be measured directly by instruments or inverted from sound field data. Although measurement techniques provide good accuracy, they are constrained by limited spatial coverage and require a substantial time investment. Inversion methods based on real-time measurements of acoustic field data improve operational efficiency but sacrifice SSP estimation accuracy and suffer from limited spatial applicability owing to their stringent requirements for ocean observation infrastructure. To achieve accurate long-term ocean SSP estimation independent of real-time underwater data measurements, we propose a semi-transformer neural network (STNet) designed to model sound velocity distribution patterns from the perspective of time series prediction. The proposed architecture incorporates an optimized self-attention mechanism to effectively capture long-range temporal dependencies within historical sound velocity time series, enabling accurate estimation of current SSPs and prediction of future SSPs. Through architectural optimization of the transformer framework and integration of a time encoding mechanism, STNet effectively improves computational efficiency. For long-term forecasting (using the Pacific Ocean as a case study), STNet achieved an annual average RMSE of 0.5811 m/s, outperforming the best baseline model, H-LSTM, by 26%. In short-term forecasting for the South China Sea, STNet further reduced the RMSE to 0.1385 m/s, a 51% improvement over H-LSTM. Comparative experimental results show that STNet outperforms state-of-the-art models in predictive accuracy while maintaining good computational efficiency, demonstrating its potential for accurate long-term full-depth ocean SSP forecasting.

1. Introduction

The integrated underwater positioning, navigation, timing, and communication (PNTC) system is of great significance for marine disaster warning, rescue operations, and resource exploration [1]. Acoustic signals exhibit significantly lower energy attenuation in underwater environments than radio waves, with propagation ranges extending to tens of kilometers, making them the optimal signal carriers for underwater PNTC systems [2,3]. However, the heterogeneous distribution of underwater sound velocity bends the signal propagation path in accordance with Snell's law as the signal traverses different depth layers, compromising distance measurement accuracy and ultimately degrading the performance of acoustic positioning systems. Therefore, the rapid acquisition of the regional sound velocity distribution is very important: it enables real-time signal propagation paths to be reconstructed through ray-tracing theory, which enhances ranging and positioning precision via the calculation of equivalent line-of-sight distances [4].
The velocity of underwater acoustic signal propagation is determined by multiple environmental parameters, primarily temperature, salinity, and static pressure [5]. In shallow marine environments, seasonal and diurnal variations induce substantial temperature fluctuations, causing significant spatiotemporal nonlinearity in the sound velocity distribution. Conversely, in deep ocean regions, static pressure dominates as the primary factor influencing sound velocity, which exhibits an approximately linear relationship with increasing depth. In general, the vertical gradient of sound velocity varies more markedly than the horizontal one, making the sound speed profile (SSP) the standard representation of the underwater acoustic velocity distribution [6]. In other words, an SSP represents the vertical distribution of acoustic propagation velocities at discrete depth intervals within a specific geographic region [7].
The acquisition of underwater SSPs primarily employs two methodological approaches: direct measurement techniques and inversion-based estimation methods [1]. Direct measurement mainly relies on instruments such as a sound velocity profiler (SVP) [8] or a conductivity, temperature, and depth (CTD) profiler [9,10]. These instruments achieve high-precision sound velocity measurements, but the observation process is time-consuming and the observation range is limited, making real-time, large-scale SSP acquisition difficult. For example, measuring an SSP over a 3000 m depth with a CTD requires at least 2 h, during which the vessel must remain stationary. SSP inversion research provides an effective means for the rapid estimation of a regional sound velocity distribution, mainly including matched field processing (MFP) [11], compressed sensing (CS) [12,13], and machine learning [14] methods. Most of these models rely on acoustic field information, such as signal propagation time, to invert the sound speed distribution; this places high demands on the deployment of underwater acoustic observation equipment and makes large-scale estimation of the ocean sound speed distribution difficult.
To eliminate the need for underwater field data measurement, some scholars have studied sound speed distribution prediction methods based on time series, the main rationale being that the sound speed distribution exhibits significant seasonal characteristics. Typical methods include long short-term memory (LSTM) neural networks [15] and the transformer [16], but these models have complex structures and require large numbers of training samples, a requirement that is difficult to satisfy in many maritime areas. To achieve accurate and real-time long-term prediction of ocean SSPs without on-site data measurements, while controlling network complexity, we propose a semi-transformer neural network (STNet) model for sound velocity profile prediction in this paper. The STNet model utilizes efficient attention mechanisms to comprehensively capture long-range dependencies in historical sound velocity time series data, accurately estimating the sound velocity distribution at any past time and predicting future trends in the sound velocity distribution over the long term. The main contributions of this paper are as follows:
  • To achieve accurate and real-time long-term prediction of ocean SSPs without on-site data measurements, we propose the STNet model for SSP prediction, which overcomes the prolonged training time associated with complex encoder–decoder structures in traditional transformers.
  • To improve execution efficiency, we propose a parallel processing strategy for the training process of the STNet model. Time encoding and position encoding are sequentially applied to the sound velocity data to form a spatiotemporal distribution data matrix. Then, the attention mechanism is used to capture the inherent dependency relationship between the temporal dynamics and spatial distribution of the data.
  • To fully evaluate the effectiveness of STNet, we tested the model using historical long-period Argo observation data and short-period experimental data measured in the South China Sea in April 2023. The experimental results indicate that STNet exhibits superior performance in predicting both long- and short-period sound velocity distributions.
The remainder of this paper is organized as follows. Section 2 reviews related works on SSP acquisition and prediction. Section 3 presents the STNet-based structure for SSP estimation and details the working principles of its key modules. Section 4 analyzes and discusses the experimental results, thoroughly evaluating the feasibility and effectiveness of STNet. Finally, conclusions are given in Section 5.

2. Related Works

In 1979, Munk and Wunsch [17] first proposed the concept of ocean acoustic tomography, establishing a novel methodology for reconstructing regional sound velocity distributions through the analysis of acoustic field measurements. This approach has a shorter response time than direct measurement. To invert the SSP over the entire water depth, Tolstoy [11] applied MFP to this field and combined it with heuristic algorithms to improve the efficiency of finding optimal solutions. Taroudakis et al. [18] combined MFP with genetic algorithms (MFP-GA) for SSP inversion to improve accuracy and efficiency, and Yu et al. [19] validated the feasibility of the MFP-GA method in shallow waters. However, the MFP technique faces significant computational challenges, including high algorithmic complexity and prolonged processing time, when searching for the optimal match between simulated and measured acoustic field data. To expedite the inversion process, Bianco et al. [13] and Choo et al. [12] proposed SSP inversion methods based on CS theory, using different types of matrices to establish the mapping from the sound field distribution to the sound velocity distribution. However, because the mapping relationship is linearly simplified, accuracy is sacrificed.
In the past decade, the advent of artificial intelligence technology has overcome some of the limitations of traditional models and found extensive application in marine science [20,21]. Huang et al. [14] proposed an SSP inversion strategy that combines artificial neural networks with ray theory, which saves time in the application phase through offline training. To reduce noise interference, they subsequently introduced an autoencoder structure and utilized autonomous underwater vehicles to construct three-dimensional sound velocity distributions. Although these neural network models exhibit better accuracy and execution efficiency compared to MFP and CS, they still rely on acoustic field measurement data, resulting in limited coverage and expensive equipment costs.
To eliminate the need for underwater field data measurement and further improve SSP estimation efficiency, constructing the sound velocity field from remote sensing data and predicting it from historical data have become recent research hotspots. Jain et al. [22] demonstrated the feasibility of estimating SSPs from satellite sea surface parameters combined with artificial neural network technology. To improve accuracy, Li et al. [23] proposed a self-organizing map (SOM) deep learning model for SSP construction that combines remote sensing sea surface temperature and sea level anomalies with empirical orthogonal function (EOF) decomposition coefficients of historical SSPs. Ou et al. [24] developed an end-to-end tree boosting model, Yu et al. [25] proposed a radial basis function neural network model, and Liu et al. [26] proposed a single-EOF regression method, all of which leverage remote sensing data for SSP construction. These methods do not require underwater field data measurement, thus significantly improving the real-time performance of SSP inversion. Nevertheless, remote sensing data cannot reflect variations in deep-water sound velocity, leading to significant deviations in the estimated sound velocity [27].
Considering the seasonal variation of temperature and the periodic motion of ocean currents, the distribution of underwater sound velocity exhibits a certain periodic variation pattern. Therefore, some scholars have fitted sound velocity changes from the perspective of time series prediction. Piao et al. [15] introduced LSTM for the high-precision prediction of SSPs under the influence of internal solitary waves. Lu et al. [28] further developed a hierarchical LSTM (H-LSTM) model, establishing a separate forecasting model for each depth layer. However, LSTM focuses on short-term variation patterns and is less capable of capturing long-term ones; moreover, as the prediction horizon increases, the estimation accuracy of the sound velocity distribution decreases significantly. Recently, the transformer model has demonstrated significant advantages in time series prediction tasks [16]. The key lies in its self-attention mechanism, which allows the model to capture long-term dependencies more effectively, overcoming the limitations imposed by sequence length. This is particularly critical for time series modeling, as future values may be influenced by points far back in the past. However, compared with LSTM, the transformer has a more complex network structure and higher computational complexity. Therefore, it is necessary to develop a lightweight sound speed distribution prediction model.

3. Methodology

3.1. Overall Framework for SSP Prediction

In order to achieve rapid and accurate estimation or prediction of the sound velocity distribution without on-site data measurement, we propose the STNet model from the perspective of time series fitting: historical sound velocity data are used to fit the sound velocity variation trend in a given area. The overall framework of SSP prediction based on STNet is given in Figure 1.
For a specific region, historical SSP data are first sorted by sampling time and then preprocessed before being used for model training. Specifically, the data are divided into a training set and a testing set at a ratio of 8:2. The STNet model processes the SSPs over the full depth in parallel to improve the efficiency of sound velocity estimation; therefore, the data are time- and position-encoded before being fed into the model so that it can better extract the correlations of sound velocity across depth layers and time spans. The encoded data are then fused into a format that the model can readily ingest. The following content introduces each module in detail.
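As a minimal illustration of this preprocessing step, the NumPy sketch below performs the chronological 8:2 train/test split on an SSP matrix shaped [depth layers x samples]; the array shapes and helper name are illustrative assumptions and not part of the paper's MATLAB implementation.

```python
import numpy as np

def chronological_split(ssp_matrix: np.ndarray, train_ratio: float = 0.8):
    """Split a [depth_layers, num_samples] SSP matrix into training and testing
    sets along the time axis, preserving the chronological order of the samples."""
    num_samples = ssp_matrix.shape[1]
    split_idx = int(num_samples * train_ratio)
    train_set = ssp_matrix[:, :split_idx]   # earlier samples used for training
    test_set = ssp_matrix[:, split_idx:]    # later samples held out for testing
    return train_set, test_set

# Toy example: 58 depth layers, 120 monthly SSPs (as in the GDCSM_Argo setup)
rng = np.random.default_rng(0)
ssp_matrix = 1480.0 + 50.0 * rng.random((58, 120))   # placeholder sound speeds (m/s)
train_set, test_set = chronological_split(ssp_matrix)
print(train_set.shape, test_set.shape)               # (58, 96) (58, 24)
```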

3.2. Data and Preprocessing

3.2.1. Data Source

To evaluate the performance of STNet, two datasets were adopted: the global GDCSM_Argo dataset [29] and the SSP data measured during a 2023 South China Sea experiment (SCS-SSP). Benefiting from the implementation of the Argo project, a large number of historical SSPs has been accumulated over the past decades. The GDCSM_Argo dataset used in this paper came from the China Argo Real-time Data Center [29]; SSPs from four typical ocean regions, namely the South China Sea, the Atlantic Ocean, the Pacific Ocean, and the Indian Ocean, were selected as test objects. The spatial resolution of the GDCSM_Argo dataset is 1° × 1°, the temporal resolution is one month (the average SSP within a month), and the depth range is 0–1975 m (non-uniform sampling over 58 layers).
To further evaluate the performance of the model, a sea trial was conducted at 17.3° N, 116.2° E in the South China Sea in April 2023, covering a 10 km × 10 km area with a depth exceeding 3500 m. The data collection system was a vessel equipped with a CTD and several expendable CTDs (XCTDs). During the experimental period, a total of 14 SSPs were sampled at an average interval of 2 h, of which 5 were collected by the CTD and 9 by XCTDs. XCTDs were used to save CTD measurement time, but they only cover depths down to 2000 m, so the XCTD data need to be extended to 3500 m before use, which can be achieved through the method in [30]. The spatial positions of the data observations are illustrated in Figure 2, with detailed information about the data provided in Table 1.

3.2.2. Data Resampling

To improve the execution efficiency, the parallel computing concept is applied to STNet. However, setting an excessively high depth resolution for the sound velocity profile will lead to increased computational demands due to the larger data volume without enhancing the model’s accuracy. Therefore, we resampled the two types of data mentioned above.
In a given research area, an SSP is represented as a vector of sound velocity values over the vertical depth axis, $\mathbf{S}_m = [s_{m,d_1}, s_{m,d_2}, \ldots, s_{m,d_D}]^T$, where $m$ indexes the $m$th SSP sample and $d_D$ is the depth of the $D$th depth layer. Assuming there are $M$ SSP samples in total in this region, after sorting these sound velocity data in chronological order, the dataset can be represented as follows:
$$\mathbf{S} = \{\mathbf{S}_m \mid m = 1, 2, \ldots, M\}. \quad (1)$$
For the GDCSM_Argo dataset, SSPs were stratified into layers as follows: 0–10 m (5 m intervals), 10–180 m (10 m intervals), 180–460 m (20 m intervals), 500–1250 m (50 m intervals), 1300–1900 m (100 m intervals), and depths beyond 1900 m as a single layer. This preprocessing method resamples the full-depth profile at different intervals, which helps to effectively reduce data dimensionality while preserving the typical structural features of the ocean SSPs. The Argo SSP data are divided into 58 layers, with the stratified data matrix organized chronologically as follows:
$$\mathbf{S}^{ag,rg} = \begin{bmatrix} s^{ag,rg}_{1,d_1} & s^{ag,rg}_{2,d_1} & \cdots & s^{ag,rg}_{120,d_1} \\ s^{ag,rg}_{1,d_2} & s^{ag,rg}_{2,d_2} & \cdots & s^{ag,rg}_{120,d_2} \\ \vdots & \vdots & \ddots & \vdots \\ s^{ag,rg}_{1,d_{58}} & s^{ag,rg}_{2,d_{58}} & \cdots & s^{ag,rg}_{120,d_{58}} \end{bmatrix}, \quad (2)$$
where $rg \in \{\mathrm{SCS}, \mathrm{AtO}, \mathrm{PaO}, \mathrm{InO}\}$ denotes the four study regions (SCS = South China Sea, AtO = Atlantic Ocean, PaO = Pacific Ocean, and InO = Indian Ocean), and $s^{ag,rg}_{120,d_{58}}$ is the sound speed value at the 58th layer of the last monthly average SSP in the ten-year GDCSM_Argo dataset.
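The stratification described above can be reproduced by interpolating each measured profile onto the fixed, non-uniform depth grid. The sketch below assumes linear interpolation (the paper does not state the interpolation scheme) and uses illustrative variable names; the constructed grid contains exactly 58 layers, matching the text.

```python
import numpy as np

def argo_depth_grid() -> np.ndarray:
    """Non-uniform depth grid described in the text: 0-10 m (5 m steps),
    10-180 m (10 m), 180-460 m (20 m), 500-1250 m (50 m), 1300-1900 m (100 m),
    plus a single 1975 m layer (58 layers in total)."""
    grid = np.concatenate([
        np.arange(0, 10, 5),          # 0, 5
        np.arange(10, 180, 10),       # 10 ... 170
        np.arange(180, 461, 20),      # 180 ... 460
        np.arange(500, 1251, 50),     # 500 ... 1250
        np.arange(1300, 1901, 100),   # 1300 ... 1900
        np.array([1975]),
    ])
    return grid.astype(float)

def resample_profile(depths: np.ndarray, speeds: np.ndarray,
                     target_grid: np.ndarray) -> np.ndarray:
    """Linearly interpolate one measured SSP (depths sorted ascending)
    onto the target depth grid."""
    return np.interp(target_grid, depths, speeds)

grid = argo_depth_grid()
depths = np.arange(0.0, 2000.0, 2.0)        # synthetic measured depths (m)
speeds = 1540.0 - 0.05 * depths             # placeholder sound speeds (m/s)
print(grid.size, resample_profile(depths, speeds, grid).shape)  # 58 (58,)
```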
The ocean experiment in the South China Sea in 2023 lasted only a few days, so the data obtained were relatively stable and covered a large depth range of up to 3500 m, weakening the influence of shallow-water sound velocity changes on the overall distribution pattern. Therefore, the sea trial data from 0 to 3500 m were stratified at equal 100 m intervals, yielding 36 layers, and depths beyond 3500 m were excluded. The stratified data matrix, organized chronologically, is represented as follows:
$$\mathbf{S}^{oe} = \begin{bmatrix} s^{oe}_{1,d_1} & s^{oe}_{2,d_1} & \cdots & s^{oe}_{14,d_1} \\ s^{oe}_{1,d_2} & s^{oe}_{2,d_2} & \cdots & s^{oe}_{14,d_2} \\ \vdots & \vdots & \ddots & \vdots \\ s^{oe}_{1,d_{36}} & s^{oe}_{2,d_{36}} & \cdots & s^{oe}_{14,d_{36}} \end{bmatrix}, \quad (3)$$
where the 14 columns correspond to the 14 SSPs sampled during the sea trial.

3.3. STNet Model

In recent years, attention mechanisms have achieved great success in time series processing, especially in capturing long-term dependencies. However, the well-known transformer framework, composed of an encoder–decoder structure, is too complex and inefficient for sound velocity prediction. To construct a simplified yet effective model, we established a novel semi-transformer structure similar to the decoder component. Since it omits the encoder module, the model significantly reduces the number of parameters and shortens the training time compared with the full transformer framework.
Figure 3 illustrates the STNet model. To improve the feature capture capability of the model, we propose a fusion strategy of time and positional encoding: together they enable STNet to capture the trend of the data over time while fully accounting for spatial distribution correlations. The data then pass through four attention channels, each containing an 8-head attention module and a feed-forward neural network (FNN) layer with 128 neurons for feature extraction. Finally, a mapping from the extracted features to the sound velocity distribution is established through a fully connected layer. The following subsections introduce each module in detail.

3.3.1. Time Encoding

Because the sound velocity distribution is strongly affected by temperature, sound velocity data differ markedly across seasons, especially in shallow waters, whereas the same location exhibits strong similarity in the same month of different years. Temporal encoding embeds this time-related information into the SSP data, similar to categorizing data by units of time such as year, month, and day. Taking the Argo data as an example, the data are monthly average SSPs, so the temporal encoding interval is one month. Suppose the historical sound speed dataset contains monthly average data for $N$ years; these data are arranged in chronological order and indexed starting from 0. The time encoding is implemented as follows:
$$TE_j = (j \bmod 12) \times \frac{2}{11} - 1, \quad (4)$$
where $TE_j$ is the temporal encoding value, $j = 0, 1, \ldots, 12N-1$ is the sequential index of the sound speed data, and $\bmod$ denotes the modulo operation. Temporal encoding maps the time information into the range $[-1, 1]$. By embedding time information, the model can distinguish data from the same month in different years.
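A one-line sketch of Equation (4), assuming monthly samples indexed from 0 (the 2/11 scaling follows from the stated [-1, 1] range):

```python
import numpy as np

def time_encoding(j: np.ndarray) -> np.ndarray:
    """Equation (4): TE_j = (j mod 12) * 2/11 - 1, mapping month indices to [-1, 1]."""
    return (j % 12) * 2.0 / 11.0 - 1.0

months = np.arange(24)             # two years of monthly sample indices
print(time_encoding(months)[:13])  # runs from -1 (month 0) to 1 (month 11), then repeats
```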

3.3.2. Positional Encoding

To improve the efficiency of model operation, we propose a parallel processing strategy in which the sound velocity time series at different depths are processed simultaneously. Because the time series patterns of sound velocity differ across depth layers and SSP data have a typical vertical structure, positional encoding of the sound velocity time series is particularly important: it helps the model understand the ordering of the sequences across depth layers. The positional encoding is computed as:
$$PE_{p,2i} = \sin\!\left(\frac{p}{10000^{2i/F}}\right), \qquad PE_{p,2i+1} = \cos\!\left(\frac{p}{10000^{2i/F}}\right), \quad (5)$$
where $p$ is the position in the sequence, $i$ is the dimension index of the positional encoding vector, and $F$ is the dimension of the input vector.
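Equation (5) is the standard sinusoidal positional encoding; a NumPy sketch is given below, assuming an even embedding dimension F (the dimension values used here are illustrative):

```python
import numpy as np

def positional_encoding(seq_len: int, dim: int) -> np.ndarray:
    """Equation (5): PE[p, 2i] = sin(p / 10000^(2i/F)), PE[p, 2i+1] = cos(...),
    with F = dim assumed even."""
    pe = np.zeros((seq_len, dim))
    positions = np.arange(seq_len)[:, None]                    # p
    div = np.power(10000.0, 2.0 * np.arange(dim // 2) / dim)   # 10000^(2i/F)
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

pe = positional_encoding(seq_len=58, dim=128)   # e.g., one vector per depth layer
print(pe.shape)                                 # (58, 128)
```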

3.3.3. Self-Attention Mechanism

The self-attention mechanism excels at capturing long-range dependencies, making it well suited to predicting the ocean sound speed distribution. Let $\mathbf{S}$ be the input sound speed time series after positional and temporal encoding. Linearly transforming $\mathbf{S}$ with three weight matrices $\mathbf{W}_Q$, $\mathbf{W}_K$, and $\mathbf{W}_V$ yields, respectively, the query matrix $\mathbf{Q}$, the key matrix $\mathbf{K}$, and the value matrix $\mathbf{V}$ used to calculate attention values:
$$\mathbf{Q} = \mathbf{S}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{S}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{S}\mathbf{W}_V. \quad (6)$$
Once $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are obtained, the attention scores can be calculated through
$$\mathrm{Att}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{F_K}}\right)\mathbf{V}, \quad (7)$$
where $F_K$ is the dimension of $\mathbf{K}$.
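A NumPy sketch of Equations (6) and (7) with toy dimensions; the scaling by the square root of F_K follows the standard formulation, and the matrix sizes are illustrative assumptions:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(S: np.ndarray, W_Q: np.ndarray, W_K: np.ndarray,
                   W_V: np.ndarray) -> np.ndarray:
    """Equations (6)-(7): project the encoded sequence S into Q, K, V and apply
    scaled dot-product attention."""
    Q, K, V = S @ W_Q, S @ W_K, S @ W_V
    F_K = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(F_K))   # [seq_len, seq_len] attention weights
    return weights @ V

# Toy dimensions: a sequence of 12 encoded vectors with feature size 16
rng = np.random.default_rng(0)
S = rng.standard_normal((12, 16))
W_Q, W_K, W_V = (rng.standard_normal((16, 16)) for _ in range(3))
print(self_attention(S, W_Q, W_K, W_V).shape)   # (12, 16)
```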
Multi-Head Attention Mechanism
To focus simultaneously on different subspaces of the input sequence, the attention mechanism is implemented as a multi-head attention structure, as shown in Figure 4.
Figure 4 shows multiple independently learned attention heads operating in parallel within one attention channel, enabling the model to analyze the sound velocity data from different subspaces and comprehensively capture its key features. Through the weighted integration of multi-subspace feature representations, the model strengthens its ability to capture intrinsic correlations within the acoustic velocity time series, consequently improving the predictive accuracy of sound velocity distribution patterns. That is, multiple sets of $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are obtained using different sets of $\mathbf{W}_Q$, $\mathbf{W}_K$, and $\mathbf{W}_V$ matrices, and the attention values of the individual heads are then combined through a weighted combination and linear transformation. In this way, the model can weigh the importance of different positions in the input sound velocity sequence while capturing both local and global dependencies.
Suppose there are $U$ attention heads in each attention channel. According to (7), the attention value of the $u$th head is
$$\mathbf{H}_u = \mathrm{Att}(\mathbf{Q}_u, \mathbf{K}_u, \mathbf{V}_u), \quad (8)$$
where $\mathbf{Q}_u$, $\mathbf{K}_u$, and $\mathbf{V}_u$ are the query, key, and value matrices of the $u$th attention head. The final output feature of the multi-head attention channel is as follows:
$$\mathrm{MH}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Concat}(\mathbf{H}_1, \mathbf{H}_2, \ldots, \mathbf{H}_U)\mathbf{W}_0, \quad (9)$$
where $\mathrm{Concat}$ denotes concatenation and $\mathbf{W}_0$ is a linear transformation matrix.
Masked Multi-Head Attention Mechanism
The calculation principle of masked multi-head attention is the same as that of multi-head attention; the only difference is the addition of a mask. When processing a time series, the output at the current time $t$ should depend only on information up to time $t$ and not on values after $t$. The mask ensures that future time steps are not visible to the model during training.
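The sketch below combines Equations (8) and (9) with an optional causal mask, assuming (as is common, though not stated in the paper) that each head operates on an equal slice of the projected features:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Additive mask: entries above the diagonal are -inf, so position t cannot
    attend to positions later than t."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def multi_head_attention(S, W_Q, W_K, W_V, W_O, num_heads=8, masked=False):
    """Equations (8)-(9): attend within each head on an equal slice of the
    projected features, concatenate the heads, and mix them with W_O.
    Setting masked=True gives the masked variant."""
    seq_len, dim = S.shape
    head_dim = dim // num_heads
    Q, K, V = S @ W_Q, S @ W_K, S @ W_V
    heads = []
    for u in range(num_heads):
        sl = slice(u * head_dim, (u + 1) * head_dim)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(head_dim)
        if masked:
            scores = scores + causal_mask(seq_len)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        heads.append(weights @ V[:, sl])                 # H_u
    return np.concatenate(heads, axis=-1) @ W_O          # Concat(H_1..H_U) W_0

# Toy usage: 12 time steps, feature dimension 32, 8 heads, causal masking on
rng = np.random.default_rng(0)
S = rng.standard_normal((12, 32))
W_Q, W_K, W_V, W_O = (rng.standard_normal((32, 32)) for _ in range(4))
print(multi_head_attention(S, W_Q, W_K, W_V, W_O, masked=True).shape)  # (12, 32)
```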

3.3.4. Feed-Forward Neural Network Layer

Besides the attention layers, the STNet stack module includes an FNN layer. Let $\mathbf{Y}_{MH}$ denote the output of the attention layer. The result $\mathbf{Y}_F$ after passing through the FNN layer can be expressed as follows:
$$\mathbf{Y}_F = \max(0, \mathbf{Y}_{MH}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2, \quad (10)$$
where $\mathbf{W}_1$ and $\mathbf{W}_2$ are linear transformation matrices, and $\mathbf{b}_1$ and $\mathbf{b}_2$ are bias vectors. Thus, $\mathbf{Y}_F$ results from two linear transformations with one ReLU activation in between.
After passing through the attention layer and the FNN layer, the output features are not transmitted directly to the next layer but first undergo a residual connection and normalization. This design prevents degradation during model training while accelerating training and improving its stability.
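A NumPy sketch of Equation (10) followed by the residual-and-normalize step; the per-sample (layer-normalization style) normalization is an assumption, since the paper only states that normalization is applied:

```python
import numpy as np

def feed_forward(Y_MH: np.ndarray, W1, b1, W2, b2) -> np.ndarray:
    """Equation (10): two linear transformations with a ReLU in between."""
    return np.maximum(0.0, Y_MH @ W1 + b1) @ W2 + b2

def add_and_normalize(x: np.ndarray, sublayer_out: np.ndarray,
                      eps: float = 1e-6) -> np.ndarray:
    """Residual connection followed by per-sample normalization
    (layer-normalization style; the exact normalization scheme is assumed)."""
    y = x + sublayer_out
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)

# Toy usage: attention output of shape [12, 64], FNN hidden width 128
rng = np.random.default_rng(0)
Y_MH = rng.standard_normal((12, 64))
W1, b1 = rng.standard_normal((64, 128)), np.zeros(128)
W2, b2 = rng.standard_normal((128, 64)), np.zeros(64)
Y_F = add_and_normalize(Y_MH, feed_forward(Y_MH, W1, b1, W2, b2))
print(Y_F.shape)   # (12, 64)
```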

3.3.5. Model Parameter Updating

For a task region, a dedicated model is constructed according to Figure 3, and the training process is completed offline in advance so as to save time when the model is applied to SSP prediction. To improve training efficiency and accelerate model convergence, we adopt batch gradient descent and an adaptive learning rate adjustment strategy.
Suppose the SSP estimated by the model is $\hat{\mathbf{S}} = [\hat{s}_{t+1,d_1}, \hat{s}_{t+1,d_2}, \ldots, \hat{s}_{t+1,d_Z}]^T$, where $t+1$ denotes the next timestamp and there are $Z$ depth layers in total. The model parameters are updated by the back-propagation algorithm according to the loss function
$$L_{rmse} = \sqrt{\frac{\sum_{z=1}^{Z}\left(\hat{s}_{t+1,d_z} - s_{t+1,d_z}\right)^2}{Z}}, \quad (11)$$
where $s_{t+1,d_z}$ and $\hat{s}_{t+1,d_z}$ represent the actual and estimated sound velocity values, respectively, at the $z$th depth layer.
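For illustration, the sketch below implements the RMSE loss of Equation (11) and a minimal Adam update (the adaptive learning rate strategy listed in Table 2); the actual training uses MATLAB's built-in routines, so this NumPy version is only a conceptual stand-in:

```python
import numpy as np

def rmse_loss(s_pred: np.ndarray, s_true: np.ndarray) -> float:
    """Equation (11): RMSE over the Z depth layers of one predicted SSP."""
    return float(np.sqrt(np.mean((s_pred - s_true) ** 2)))

class Adam:
    """Minimal Adam update for a single parameter array (conceptual stand-in
    for the adaptive learning rate strategy; initial learning rate as in Table 2)."""
    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m, self.v, self.t = None, None, 0

    def step(self, param: np.ndarray, grad: np.ndarray) -> np.ndarray:
        if self.m is None:
            self.m, self.v = np.zeros_like(param), np.zeros_like(param)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)   # bias-corrected moments
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return param - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# Toy usage: one update of a weight matrix given a hypothetical gradient
rng = np.random.default_rng(0)
W, grad = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
W = Adam(lr=1e-3).step(W, grad)
print(rmse_loss(np.array([1500.2, 1498.7]), np.array([1500.0, 1499.0])))  # ~0.255
```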

4. Results and Discussions

4.1. Parameter Settings and Baselines

The data processing and model validation in this study were completed in MATLAB R2023b (CPU: Intel i7-14650H; GPU: NVIDIA GeForce RTX 4060) with the model parameter settings listed in Table 2, which were obtained through multiple experimental attempts. To comprehensively evaluate the reliability of the STNet model, H-LSTM [28], the multi-layer perceptron (MLP) [25], and polynomial fitting (PF) [31] were selected as baselines for comparison. Besides the root mean square error (RMSE) $L_{rmse}$ of (11), the mean square error (MSE) $L_{mse}$ and the mean absolute error (MAE) $L_{mae}$ are also adopted as evaluation metrics:
$$L_{mse} = \frac{\sum_{z=1}^{Z}\left(\hat{s}_{t+1,d_z} - s_{t+1,d_z}\right)^2}{Z}, \quad (12)$$
$$L_{mae} = \frac{\sum_{z=1}^{Z}\left|\hat{s}_{t+1,d_z} - s_{t+1,d_z}\right|}{Z}. \quad (13)$$
The smaller the metric value, the higher the prediction accuracy.
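A small helper computing the three metrics of Equations (11)–(13) for one predicted profile; the array sizes in the toy check are illustrative:

```python
import numpy as np

def ssp_metrics(s_pred: np.ndarray, s_true: np.ndarray) -> dict:
    """RMSE, MSE, and MAE of Equations (11)-(13) over the Z depth layers."""
    err = s_pred - s_true
    mse = float(np.mean(err ** 2))
    return {"RMSE": float(np.sqrt(mse)),
            "MSE": mse,
            "MAE": float(np.mean(np.abs(err)))}

# Toy check with a 58-layer profile
rng = np.random.default_rng(1)
s_true = 1500.0 + rng.standard_normal(58)
s_pred = s_true + 0.1 * rng.standard_normal(58)
print(ssp_metrics(s_pred, s_true))
```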

4.2. Influence of Training Time Stepping

To optimize the selection of the training sequence step, we conducted multiple experiments taking the Pacific region as an example. Table 3 lists the prediction accuracy of the model for the monthly sound velocity distribution over the following year under different training steps; the results are averages over 50 attempts. In most months, a training time step of 1 outperformed the other settings, with an average prediction error of 0.5811 m/s, so a time step of 1 was used in subsequent experiments. An interesting phenomenon is that the prediction for the 12th future month was notably better than that for the 11th month, which might have been influenced by the annual cycle of sound velocity variation.

4.3. Influence of Training Data Length

To test the influence of the historical SSP time span on the prediction accuracy of the model, we took the Argo dataset for the Pacific Ocean as an example and selected data from the 1, 3, 5, 7, and 9 years preceding the prediction task for model training. The average prediction errors over 50 attempts are given in Table 4. The results indicate that as the training time span increases, the predictive performance of the model gradually improves. With too few learning samples, the model struggles to capture the trend of the data, leading to an increased RMSE. Specifically, at least five complete annual cycles of SSP data before the forecast task should be provided for model learning to ensure relatively accurate predictions of future SSPs.

4.4. Evaluation of Periodic Capture Capability

To validate whether STNet can accurately capture the periodic variations in sound speed time series, we tested the model using ten years of historical data from the four regions listed in Table 1. The 2nd, 4th, and 6th layers (corresponding to depths of 5 m, 20 m, and 40 m, respectively) were randomly selected from the 58 depth layers to illustrate the model's training and prediction outputs, as shown in Figure 5. The comparison of periodic variation trends at different depth layers across the four regions demonstrates that STNet learns long-range dependencies within the input sequences well.

4.5. Long-Term Predictive Performance Evaluation

To evaluate the accuracy performance of the model in long-term forecasting, the first 9 years of historical Argo data from the Atlantic Ocean and the Indian Ocean were used as learning samples to predict the sound velocity distribution of the corresponding regions in the following year. The training data consisted of 58 unequally spaced layers of SSP data (formatted as [58, 108]), and the primary prediction results were likewise divided into 58 layers (formatted as [58, 12]). After interpolation at a 1 m spacing, the full-depth results (formatted as [1976, 12]) were compared with the baseline methods in Figure 6 and Figure 7. The prediction results of the H-LSTM and PF methods reflect the distribution of underwater SSPs to some extent, but their accuracy is insufficient, so these methods are suitable mainly for preliminary SSP estimation. The predictions of the MLP model are comparatively rough and cannot accurately reflect the actual distribution of SSPs in complex marine environments. In contrast, the predictions of STNet are highly consistent with the actual observations, demonstrating higher accuracy in the long-term prediction of deep-ocean SSPs. In particular, in complex shallow-sea environments (within 200 m depth), STNet effectively captures the variation patterns of the marine SSP.
To assess the prediction performance of the various models more clearly, Figure 8 illustrates the absolute errors at different depths for the Atlantic Ocean and the Indian Ocean. In the Atlantic Ocean, the maximum full-depth absolute errors of STNet, H-LSTM, MLP, and PF are approximately 3 m/s, 11 m/s, 20 m/s, and 5.5 m/s, respectively, and the absolute errors of the four methods at most depths are below 0.5 m/s, 2 m/s, 4 m/s, and 1.5 m/s, respectively. In the Indian Ocean, the maximum full-depth absolute errors of STNet, H-LSTM, MLP, and PF are approximately 4 m/s, 5.5 m/s, 12 m/s, and 7 m/s, respectively, and the absolute errors at most depths are below 0.5 m/s, 1 m/s, 2 m/s, and 1 m/s, respectively. Overall, the absolute error of the STNet model in predicting sound velocity at different depths in the Atlantic and Indian Oceans is significantly lower than that of the H-LSTM, MLP, and PF models, demonstrating its superiority in capturing the details of the sound velocity time series.
To visually compare the differences between the predicted SSP and the actual SSP for various models, Figure 9 presents a two-dimensional comparison of predicted and actual SSPs in July 2022 for the four ocean regions referring to Table 1. The results show that among the four sea areas, the SSP predicted by STNet has the best fit with the actual curve, followed by H-LSTM and PF, while MLP has the largest fitting error. The SSP predicted by the STNet model can accurately reflect the characteristic information of the actual SSP and is more stable in long-term prediction of the sound velocity.

4.6. Short-Term Predictive Performance Evaluation

To validate the short-term prediction capability of the STNet model, we collected short-term sound velocity data in the South China Sea in 2023. The model was trained on the first 13 sampled SSPs to predict the sound velocity distribution two hours ahead. Figure 10a compares the predicted SSP with the measured (label) SSP, which show high similarity. In Figure 10b, the predicted SSP is interpolated and compared with the results of the H-LSTM, MLP, and PF methods. The results indicate that the STNet predictions closely match the actual SSPs across the 0–3500 m depth range. The H-LSTM model also performs well but lags behind STNet in finer details. The PF model roughly captures the global characteristics of the actual SSPs but shows significant deviations in shallow water (within 200 m). The MLP model does not handle time series well, producing highly fluctuating predictions that fail to reflect the primary features of the actual SSPs. Table 5 lists the corresponding full-depth prediction errors over 50 attempts. Compared with the H-LSTM, MLP, and PF models, STNet significantly improves SSP prediction accuracy, reaffirming its superior capability for SSP prediction.

4.7. Comparison of Execution Efficiency

The proposed STNet employs a parallel processing strategy, enabling the simultaneous handling of multiple sound speed time series. This is of great significance for improving the efficiency of SSP prediction. Table 6 compares the average training time of the different methods over 50 attempts. Owing to its lightweight architecture, STNet reduces computational complexity by approximately 65% compared with conventional transformer models while maintaining comparable prediction accuracy. With parallel processing, STNet requires only 28.07 s for model training, nearly an order of magnitude faster than deep learning models such as H-LSTM and MLP, and the gap with the mathematically simple PF method remains small. This result highlights the efficiency of the parallel processing strategy and indicates the considerable potential of STNet for predicting ocean sound velocity.

5. Conclusions

To achieve real-time and accurate long-term prediction of full-depth ocean SSPs, we propose an STNet model. The STNet model simplifies and optimizes traditional transformer models while cleverly incorporating time encoding modules. This design enables STNet to effectively capture long-range dependencies within historical sound speed time series data and accurately track temporal trends. To validate the feasibility and effectiveness of the STNet model, extensive long-term prediction experiments were conducted across multiple ocean regions. Experimental results indicate that STNet outperforms other state-of-the-art models in both prediction accuracy and time efficiency. Consequently, STNet provides crucial technical support for achieving precise, large-scale, and full-depth predictions of ocean sound speed profiles.

Author Contributions

Conceptualization, W.H., Y.W. and J.L. (Jiajun Lu); methodology, J.L. (Junpeng Lu); software, J.L. (Junpeng Lu), J.L. (Jiajun Lu) and Y.W.; validation, W.H. and J.L. (Junpeng Lu); investigation, H.Z. and T.X.; resources, W.H.; data curation, J.L. (Junpeng Lu), J.L. (Jiajun Lu) and Y.W.; writing—original draft preparation, W.H., J.L. (Junpeng Lu), J.L. (Jiajun Lu) and Y.W.; writing—review and editing, H.Z. and T.X.; project administration, H.Z. and T.X.; funding acquisition, W.H., Y.W., H.Z. and T.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key Research and Development Program of China under Grant 2024YFB3909701, in part by the National Natural Science Foundation of China under Grants 42404001 and 62271459, in part by the Shandong Provincial Natural Science Foundation under Grant ZR2023QF128, and in part by the Stable Supporting Fund of Acoustic Science and Technology Laboratory under Grant JCKYS2025SSJS008.

Data Availability Statement

The authors acknowledge the historical SSP data support from the China Argo Real-time Data Center (https://www.argo.org.cn/, latest access: 10 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SSP: Sound speed profile
STNet: Semi-transformer neural network
MFP: Matched field processing
CS: Compressed sensing
GA: Genetic algorithm

References

  1. Huang, W.; Wu, P.; Lu, J.; Lu, J.; Xiu, Z.; Xu, Z.; Li, S.; Xu, T. Underwater SSP Measurement and Estimation: A Survey. J. Mar. Sci. Eng. 2024, 12, 2356. [Google Scholar] [CrossRef]
  2. Erol-Kantarci, M.; Mouftah, H.T.; Oktug, S. A Survey of Architectures and Localization Techniques for Underwater Acoustic Sensor Networks. IEEE Commun. Surv. Tutor. 2011, 13, 487–502. [Google Scholar] [CrossRef]
  3. Luo, J.; Yang, Y.; Wang, Z.; Chen, Y. Localization Algorithm for Underwater Sensor Network: A Review. IEEE Internet Things J. 2021, 8, 13126–13144. [Google Scholar] [CrossRef]
  4. Liu, Y.; Wang, Y.; Chen, C.; Liu, C. Unified Underwater Acoustic Localization and Sound Speed Estimation for an Isogradient Sound Speed Profile. IEEE Sens. J. 2024, 24, 3317–3327. [Google Scholar] [CrossRef]
  5. Munk, W.; Wunsch, C. Ocean acoustic tomography: Rays and modes. Rev. Geophys. 1983, 21, 777–793. [Google Scholar] [CrossRef]
  6. Jensen, F.B.; Kuperman, W.A.; Porter, M.B.; Schmidt, H. Computational Ocean Acoustics: Chapter 1; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011; pp. 3–4. [Google Scholar] [CrossRef]
  7. Liu, B.; Tang, X.; Tharmarasa, R.; Kirubarajan, T.; Jassemi, R.; Hallé, S. Underwater Target Tracking in Uncertain Multipath Ocean Environments. IEEE Trans. Aerosp. Electron. Syst. 2020, 56, 4899–4915. [Google Scholar] [CrossRef]
  8. Zhang, S.; Xu, X.; Xu, D.; Long, K.; Shen, C.; Tian, C. The design and calibration of a low-cost underwater sound velocity profiler. Front. Mar. Sci. 2022, 9, 996299. [Google Scholar] [CrossRef]
  9. Luo, C.; Wang, Y.; Wang, C.; Yang, M.; Yang, S. Analysis of Glider Motion Effects on Pumped CTD. In Proceedings of the OCEANS 2023, Limerick, Ireland, 5–8 June 2023; pp. 1–7. [Google Scholar] [CrossRef]
  10. Kirimoto, K.; Han, J.; Konashi, S. Development of High Accuracy CTD Sensor: 5EL-CTD. In Proceedings of the OCEANS 2024, Singapore, 14–18 April 2024; pp. 1–8. [Google Scholar] [CrossRef]
  11. Tolstoy, A.; Diachok, O.; Frazer, L.N. Acoustic tomography via matched field processing. J. Acoust. Soc. Am. 1991, 89, 1119–1127. [Google Scholar] [CrossRef]
  12. Choo, Y.; Seong, W. Compressive Sound Speed Profile Inversion Using Beamforming Results. Remote Sens. 2018, 10, 704. [Google Scholar] [CrossRef]
  13. Bianco, M.; Gerstoft, P. Dictionary learning of sound speed profiles. J. Acoust. Soc. Am. 2017, 141, 1749–1758. [Google Scholar] [CrossRef] [PubMed]
  14. Huang, W.; Liu, M.; Li, D.; Yin, F.; Chen, H.; Zhou, J.; Xu, H. Collaborating Ray Tracing and AI Model for AUV-Assisted 3-D Underwater Sound-Speed Inversion. IEEE J. Ocean. Eng. 2021, 46, 1372–1390. [Google Scholar] [CrossRef]
  15. Piao, S.; Yan, X.; Li, Q.; Li, Z.; Wang, Z.; Zhu, J. Time series prediction of shallow water sound speed profile in the presence of internal solitary wave trains. Ocean Eng. 2023, 283, 115058. [Google Scholar] [CrossRef]
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 4–9 December 2017; NIPS’17. pp. 6000–6010. [Google Scholar]
  17. Munk, W.; Wunsch, C. Ocean acoustic tomography: A scheme for large scale monitoring. Deep-Sea Res. Part I-Oceanogr. Res. Pap. 1979, 26, 123–161. [Google Scholar] [CrossRef]
  18. Taroudakis, M.I.; Markaki, M.G. Matched Field Ocean Acoustic Tomography Using Genetic Algorithms. In Acoustical Imaging; Tortoli, P., Masotti, L., Eds.; Springer: Boston, MA, USA, 1996; pp. 601–606. [Google Scholar] [CrossRef]
  19. Yu, Y.; Li, Z.; He, L. Matched-field inversion of sound speed profile in shallow water using a parallel genetic algorithm. Chin. J. Oceanol. Limnol. 2010, 28, 1080–1085. [Google Scholar] [CrossRef]
  20. Bianco, M.J.; Gerstoft, P.; Traer, J.; Ozanich, E.; Roch, M.A.; Gannot, S.; Deledalle, C.A. Machine learning in acoustics: Theory and applications. J. Acoust. Soc. Am. 2019, 146, 3590–3628. [Google Scholar] [CrossRef] [PubMed]
  21. Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N.; Prabhat, F. Deep learning and process understanding for data-driven Earth system science. Nature 2019, 566, 195–204. [Google Scholar] [CrossRef] [PubMed]
  22. Jain, S.; Ali, M. Estimation of Sound Speed Profiles Using Artificial Neural Networks. IEEE Geosci. Remote Sens. Lett. 2006, 3, 467–470. [Google Scholar] [CrossRef]
  23. Li, H.; Qu, K.; Zhou, J. Reconstructing Sound Speed Profile From Remote Sensing Data: Nonlinear Inversion Based on Self-Organizing Map. IEEE Access 2021, 9, 109754–109762. [Google Scholar] [CrossRef]
  24. Ou, Z.; Qu, K.; Shi, M.; Wang, Y.; Zhou, J. Estimation of sound speed profiles based on remote sensing parameters using a scalable end-to-end tree boosting model. Front. Mar. Sci. 2022, 9, 1051820. [Google Scholar] [CrossRef]
  25. Yu, X.; Xu, T.; Wang, J. Sound Velocity Profile Prediction Method Based on RBF Neural Network. In China Satellite Navigation Conference (CSNC) 2020 Proceedings: Volume III; Springer: Singapore, 2020; pp. 475–487. [Google Scholar]
  26. Liu, Y.; Chen, Y.; Meng, Z.; Chen, W. Performance of single empirical orthogonal function regression method in global sound speed profile inversion and sound field prediction. Appl. Ocean Res. 2023, 136, 103598. [Google Scholar] [CrossRef]
  27. Kim, Y.J.; Han, D.; Jang, E.; Im, J.; Sung, T. Remote sensing of sea surface salinity: Challenges and research directions. GISci. Remote Sens. 2023, 60, 2166377. [Google Scholar] [CrossRef]
  28. Lu, J.; Huang, W.; Zhang, H. Dynamic Prediction of Full-Ocean Depth SSP by a Hierarchical LSTM: An Experimental Result. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  29. Xie, C.; Miaomiao, X.; Cao, S.; Zhang, Y.; Zhang, C. Gridded Argo data set based on GDCSM analysis technique: Establishment and preliminary applications. J. Mar. Sci. 2019, 37, 24–35. [Google Scholar]
  30. Huang, W.; Lu, J.; Li, S.; Xu, T.; Wang, J.; Zhang, H. Fast Estimation of Full Depth Sound Speed Profile Based on Partial Prior Information. In Proceedings of the 2023 IEEE 6th International Conference on Electronic Information and Communication Technology (ICEICT), Qingdao, China, 21–24 July 2023; pp. 479–484. [Google Scholar] [CrossRef]
  31. Liu, F.; Ji, T.; Zhang, Q. Sound Speed Profile Inversion Based on Mode Signal and Polynomial Fitting. Acta Armamentarii 2019, 40, 2283–2295. [Google Scholar] [CrossRef]
Figure 1. Framework for SSP prediction by STNet.
Figure 2. Locations of the research areas.
Figure 3. STNet model.
Figure 4. Multi-head attention module.
Figure 5. Periodic variation fitting results, where (a–d) show the results for the Atlantic Ocean, Indian Ocean, Pacific Ocean, and South China Sea, respectively.
Figure 6. Accuracy performance of long-term SSP forecasting in the Atlantic Ocean, where (a) is the original sound velocity distribution and (b–e) are the results estimated by STNet, H-LSTM, MLP, and PF, respectively.
Figure 7. Accuracy performance of long-term SSP forecasting in the Indian Ocean, where (a) is the original sound velocity distribution and (b–e) are the results estimated by STNet, H-LSTM, MLP, and PF, respectively.
Figure 8. Absolute errors of long-term SSP prediction at different depths, where (a–d) are the errors of STNet, H-LSTM, MLP, and PF in the Atlantic Ocean, and (e–h) are the errors of STNet, H-LSTM, MLP, and PF in the Indian Ocean.
Figure 9. Comparison of predicted and actual SSPs in July 2022 for the four ocean regions, where (a–d) correspond to the Atlantic Ocean, Indian Ocean, Pacific Ocean, and South China Sea, respectively.
Figure 10. Comparison of predicted and actual SSPs in the South China Sea in 2023, where (a) compares the SSP predicted by the STNet model with the observed SSP, and (b) compares the different methods after interpolation.
Table 1. Data information.

GDCSM_Argo Data
| Area | Time Dimension | Temporal Resolution | Number of SSPs | Depth | Layers |
| South China Sea (116.5° E, 15.5° N) | 2013–2022 (120 months) | one month | 120 | 0–1975 m | unequal interval (58 layers) |
| Atlantic Ocean (44.5° W, 24.5° N) | 2013–2022 (120 months) | one month | 120 | 0–1975 m | unequal interval (58 layers) |
| Pacific Ocean (152.5° E, 18.5° N) | 2013–2022 (120 months) | one month | 120 | 0–1975 m | unequal interval (58 layers) |
| Indian Ocean (65.5° E, 20.5° S) | 2013–2022 (120 months) | one month | 120 | 0–1975 m | unequal interval (58 layers) |

SCS-SSP Data
| South China Sea (116.2° E, 17.3° N) | 12–14 April 2023 | around 2 h | 14 | 0–3500 m | equal interval (36 layers) |
Table 2. Model parameter settings.

| Parameter | Setting |
| Dimension of sequence input layer | 58/36 |
| Number of heads | 8 |
| Number of attention channels | 4 |
| Neurons of FNN layer | 128 |
| Dropout rate | 0.15 |
| Max epoch | 300 |
| Batch size | 32 |
| Optimizer | Adam |
| Initial learning rate | 0.001 |
Table 3. Prediction errors of SSPs under different training steps.

| Training Time Step | RMSE for Months 1–12 (m/s) | Average RMSE (m/s) |
| 1 time step | 0.637, 0.563, 0.751, 0.405, 0.445, 0.983, 0.444, 0.675, 0.422, 0.813, 0.529, 0.309 | 0.581 |
| 2 time steps | 0.602, 0.546, 0.808, 0.534, 0.784, 0.965, 0.904, 0.950, 0.350, 0.923, 1.307, 0.495 | 0.763 |
| 4 time steps | 0.776, 0.478, 0.468, 0.653, 1.270, 1.338, 0.485, 0.930, 0.536, 0.662, 0.998, 0.577 | 0.764 |
| 6 time steps | 0.705, 0.706, 0.800, 0.851, 0.557, 0.846, 0.784, 0.827, 0.752, 1.040, 1.132, 0.470 | 0.789 |
| 10 time steps | 0.996, 1.241, 1.357, 1.253, 0.731, 0.848, 0.455, 0.662, 0.569, 0.702, 0.663, 0.584 | 0.838 |
Table 4. Prediction errors of SSPs under different training data lengths.

| Training Data Length | RMSE for Months 1–12 (m/s) | Average RMSE (m/s) |
| 1 year | 1.647, 0.529, 0.805, 0.795, 0.858, 0.813, 1.175, 1.041, 0.880, 0.949, 0.857, 0.816 | 0.930 |
| 3 years | 0.521, 0.678, 1.320, 0.584, 0.980, 1.288, 0.781, 0.711, 0.779, 0.767, 0.739, 0.837 | 0.832 |
| 5 years | 0.667, 0.755, 0.934, 0.854, 0.556, 1.057, 0.621, 0.405, 0.548, 0.550, 0.432, 0.583 | 0.664 |
| 7 years | 0.698, 0.537, 0.557, 0.689, 0.465, 1.023, 0.629, 0.601, 0.373, 0.453, 0.487, 0.549 | 0.588 |
| 9 years | 0.637, 0.563, 0.751, 0.405, 0.445, 0.983, 0.444, 0.675, 0.422, 0.813, 0.529, 0.309 | 0.581 |
Table 5. Short-term SSP prediction errors.

| Method | STNet | H-LSTM | MLP | PF |
| RMSE (m/s) | 0.079 | 0.153 | 0.957 | 0.548 |
Table 6. Comparison of model training time for the South China Sea experiment.

| Method | STNet | H-LSTM | MLP | PF |
| Training time (s) | 28.07 | 223.19 | 232.30 | 12.33 |
