Article

High-Fidelity Decoding Method for Acoustic Data Transmission and Reception of DIFAR Sonobuoy Using Autoencoder

Department of Information and Communication Engineering, Changwon National University, Changwon 51140, Republic of Korea
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(12), 2402; https://doi.org/10.3390/jmse13122402
Submission received: 3 November 2025 / Revised: 9 December 2025 / Accepted: 16 December 2025 / Published: 18 December 2025
(This article belongs to the Section Ocean Engineering)

Abstract

The directional frequency analysis and recording (DIFAR) sonobuoy is widely used in modern underwater acoustic monitoring and surveillance. The sonobuoy is deployed in the area of interest, collects underwater acoustic data, and transmits the data to nearby aircraft for analysis. In this process, transmitting a large volume of raw data poses significant challenges due to limited communication bandwidth. To address this problem, existing autoencoder-based methods have drastically reduced the amount of information to be transmitted while incurring moderate data reconstruction errors. However, the information bottleneck inherent in these autoencoder-based methods often leads to significant fidelity degradation. To overcome this limitation, this paper proposes a novel autoencoder method focused on reconstruction fidelity. The proposed method operates with two key components: Gated Fusion (GF), which proves critical for effectively fusing multi-scale features, and Squeeze-and-Excitation (SE), an adaptive channel attention mechanism for feature refinement. Quantitative evaluations on a realistic simulated sonobuoy dataset demonstrate that the proposed model achieves up to a 90.36% reduction in spectral mean squared error for linear frequency modulation signals compared to the baseline.

1. Introduction

In the modern maritime environment, underwater acoustic surveillance is an essential defense mechanism for the early detection of and response to hostile threats. The success of Anti-Submarine Warfare (ASW), in particular, relies significantly on a comprehensive understanding of the underwater acoustic environment and the capability for precise signal analysis [1,2].
Directional Frequency Analysis and Recording (DIFAR) sonobuoys employed in ASW operations are a type of acoustic vector sensor, representing a sophisticated sensing system that captures not only acoustic pressure but also directional information of the sound field. It typically consists of one omnidirectional hydrophone, which measures a scalar pressure value, and two orthogonal dipole sensors, which measure the particle velocity vector. Combining these multi-channel outputs enables the estimation of the direction-of-arrival (DOA) in addition to the signal’s frequency characteristics, which contributes to the localization and tracking of target maneuvers [3,4].
The collected data is transmitted via wireless channels to aircraft, as illustrated in Figure 1, where it undergoes complex signal processing. This data pipeline, however, is constrained by the inherent limitations of expendable equipment: namely, limited battery life and restricted communication bandwidth [5]. Consequently, the efficient and lossless transmission of voluminous raw data to analysis platforms remains a critical technical challenge affecting the operational effectiveness of DIFAR systems.
To address this challenge, deep learning-based data compression techniques were introduced to overcome the inefficiencies of traditional methods like Frequency Division Multiplexing (FDM) [6]. Among these, the Autoencoder (AE) has been investigated as an effective solution due to its ability to transform signals into low-dimensional latent representations. Initial research applying AEs to DIFAR signal compression demonstrated potential for significant reductions in data transmission volume. However, these approaches were observed to suffer from signal quality degradation caused by the information bottleneck inherent in the encoding process [6]. Although subsequent studies improved compression efficiency by integrating model lightweighting and Residual Vector Quantization (RVQ) techniques, they also adhered to a research paradigm that prioritized the compression ratio [7]. As a result, while the communication payload was effectively reduced, the loss of fine-grained acoustic information during compression remained a persistent factor limiting the precision of DIFAR signal analysis.
In active detection scenarios utilizing DIFAR sonar systems, analytical accuracy is contingent upon the precise measurement of distortions in the received signal, which results from the transmitted waveform reflecting off a target. For the Continuous Wave (CW) and Linear Frequency Modulation (LFM) signals used in this study, preserving signal fidelity is a critical consideration. For instance, the Doppler shift induced by target motion manifests as a frequency change in the CW signal, serving as a primary cue for estimating the target’s relative velocity [8]. Likewise, the time delay and pulse width variations in LFM signals contain information regarding the target’s range and relative acceleration [9]. Conventional compression-centric approaches, in the process of achieving high compression ratios, are susceptible to the loss of information required to represent these subtle frequency and temporal variations. This information loss can consequently lead to increased errors in estimating target maneuver parameters or result in degraded detection performance. Therefore, an approach that establishes the fidelity of the reconstructed signal as the primary objective is required to ensure the precise measurement of these parameters.
To achieve high-fidelity reconstruction, the first consideration is the skip connection. This refers to a structural link that bypasses certain layers to transfer information directly and can be broadly categorized into two types: long-range and short-range connections between the encoder and the decoder. The long-range connection, as utilized in models like U-Net [10] and DPCRN [11], directly forwards low-level features extracted from the early layers of the encoder to the later layers of the decoder, skipping the deep bottleneck layers where information is compressed. This approach has been widely adopted not only in single-channel speech enhancement models [11] but also as a core framework in recent architectures for processing complex multi-channel audio signals [12,13], demonstrating its effectiveness in preserving fine-grained time-frequency structures that would otherwise be lost during compression. Subsequently, the short-range connection, typically in the form of a ‘Residual Block’, plays a pivotal role in the field of modern time-domain Neural Audio Codecs (NACs). Prominent models such as Google’s SoundStream [14] and Meta’s EnCodec [15] deeply integrate skip connections as residual blocks within their encoders and decoders. This ensures that even in very deep networks, the input signal can be passed through directly via an identity mapping, which prevents the loss of fine-grained waveform information and enables stable training. The success of these approaches suggests that structural designs minimizing information loss are a critical consideration for achieving high-quality compression and reconstruction of time-domain waveforms. In addition, the second method is the attention mechanism, a technique that enables the model to selectively focus on significant elements within the transferred information. In the field of audio signal processing, channel attention, which explicitly models the interdependencies between channels of a feature map, has been used effectively.
A representative example is the Squeeze-and-Excitation (SE) block [16]. The SE block performs ‘Squeeze’ (global information compression) and ‘Excitation’ (channel-wise weight recalibration) operations, playing a crucial role in enhancing the representational power of the model by selectively emphasizing channels that are more salient for signal reconstruction. The efficacy of this channel attention has been validated across various sub-disciplines of acoustic signal processing. In Automatic Speech Recognition (ASR), an ‘Attention-Enhanced Squeeze-and-Excitation block’ is applied to Transformer-based structures like Blockformer to improve recognition performance [17]. In speech enhancement, SE blocks have been proposed for audio-visual models to improve quality in noisy environments by fusing audio signals with visual information, such as lip movements [18]. Furthermore, in speaker verification, research has combined SE modules with deep CNNs, such as ResNeXt, to adaptively adjust channel-wise feature maps and reduce noise interference, thereby increasing the extraction efficiency of speaker-specific characteristics [19]. In speech emotion recognition, SE blocks are utilized to effectively reinforce core channel information related to emotional expression during the capture and fusion of multi-scale features [20]. Ultimately, this technique maximizes the overall representational power of the model by selectively emphasizing the most significant feature channels for signal restoration, even within limited information contexts such as noisy inputs or latent vectors.
Based on the technical background reviewed, this paper proposes a novel autoencoder architecture to achieve the high-fidelity reconstruction of encoded DIFAR waveform data. Different from previous studies in [6,7], the proposed model redesigns the latent space to enable the decoding of finer feature representations by integrating two main building blocks as follows:
  • Multi-scale Feature Fusion: Adopting a U-Net structure, we introduce skip connections that fuse hierarchical features from the encoder with their corresponding layers in the decoder. This is intended to preserve both the global structure and the local details of the signal.
  • Adaptive Feature Recalibration: SE blocks, a channel-wise attention mechanism, are applied to dynamically recalibrate the relative importance of the fused feature channels, thereby guiding the model to focus on the information most contributory to signal reconstruction.
This architecture improves the spectral accuracy of the reconstructed signal at the cost of a moderate increase in the data transmission rate. The contributions of this study are summarized as follows: First, we propose an autoencoder design that employs multi-scale feature fusion and adaptive feature recalibration to reduce acoustic data reconstruction error. Second, using data simulated to the DIFAR specification, we quantitatively show that the proposed model reduces the spectral reconstruction error compared to existing models, and we identify a technical trade-off point between communication efficiency and signal fidelity.

2. Related Works

2.1. Autoencoder-Based Sonobuoy Signal Transmission and Reception

AEs have recently been applied to sonobuoy signal transmission and reception [6,7]. Owing to the sandglass-shaped network of the AE, underwater acoustic signals are compressed into low-dimensional latent vectors by the encoder for transmission at the buoy, and the original signal is reconstructed by the decoder at the aircraft, as shown in Figure 2. Initial research applying AEs to sonobuoy signal compression demonstrated the potential for significant reductions in the amount of transmitted data. However, these methods were observed to have limitations, as the information bottleneck inherent in the process led to degradation in the quality, i.e., the Spectral Mean Squared Error (MSE), of the reconstructed signal [6]. Although subsequent studies improved compression efficiency and reduced computational complexity by integrating Residual Vector Quantization (RVQ) and model pruning techniques [7], the reconstruction performance is still limited. Consequently, while these approaches contributed to enhanced communication efficiency, the loss of fine-grained acoustic information during the encoding–decoding process remained a primary factor limiting the fidelity of the reconstructed signal.

2.2. U-Net Architecture for Preserving Multi-Scale Audio Features

To enhance the fidelity of signal reconstruction models, structural designs capable of minimizing information loss within deep networks are required. In this context, the U-Net architecture has been widely adopted in the audio signal processing field as an effective framework for information preservation [10]. Initially proposed for biomedical image segmentation, U-Net [10] is characterized by a symmetrical encoder–decoder structure and the skip connections that link these two paths. The encoder extracts high-level global context from the input signal through progressive downsampling, while the decoder reconstructs the signal to its original resolution via upsampling. The skip connections serve to directly forward multi-resolution feature maps from the encoder layers to their corresponding layers in the decoder. This enables the decoder to directly reference low-level local details, which are otherwise prone to loss during the reconstruction process, thereby significantly enhancing the precision of the final output. Furthermore, the structural advantages of U-Net have been validated in the field of speech enhancement, which operates in the Time-Frequency (T-F) domain. Models such as DPCRN [11] utilize the U-Net structure to successfully reconstruct the harmonic structure and formant information of speech from spectrograms heavily corrupted by noise. Moreover, this core U-Net framework has been extended to multi-channel speech enhancement architectures, often in combination with Graph Neural Networks (GNNs) [12,13], where it is effectively applied to process complex inter-channel spatial information while concurrently preserving fine-grained T-F structures via skip connections. These preceding studies demonstrate that skip connections are an architectural element crucial for effectively preserving the multi-scale features of audio signals, consequently enabling high-fidelity signal reconstruction.

2.3. Feature Refinement Using Channel Attention Mechanisms

Whereas skip connections provide a ‘pathway’ for information, the attention mechanism is a technique that enables the model to learn the ‘importance’ of the information traversing that path. It guides the model to selectively concentrate on features that are more salient for the current task, thereby enhancing its representational power within constrained resources. In audio processing, channel attention, which explicitly models the interdependencies among the channels of a feature map, is widely utilized [16,17,18,19,20].
SE in [16] is a representative technique with a simple yet effective architecture. The SE block first aggregates global spatial information from each channel into a single descriptor during the ‘Squeeze’ phase, typically using global average pooling. Subsequently, in the ‘Excitation’ phase, this descriptor is used to learn channel-wise dependencies, and a sigmoid function generates weights between 0 and 1, indicating the importance of each channel. Finally, these weights are applied to the original feature map via channel-wise multiplication to perform adaptive feature recalibration. Through this mechanism, salient features are emphasized while less relevant features are suppressed. Accordingly, SE blocks have been effectively integrated into the backbone networks of models across diverse audio tasks, including ASR [17], speech enhancement [18], speaker verification [19], and speech emotion recognition [20], to refine feature representations and enhance overall system performance.

3. Proposed Method

In this study, we propose a novel autoencoder architecture for the high-fidelity reconstruction of time-domain DIFAR waveforms. This architecture is based on hierarchical feature fusion and incorporates skip connections. The proposed network is designed to mitigate the information bottleneck, which is characteristic of sequential structures of conventional deep autoencoders, and to improve reconstruction performance by effectively preserving acoustic features at multiple scales. This section describes in detail the design principles and overall architecture of the proposed network, and the specific implementation of its key components, specifically the skip connections and attention mechanisms.

3.1. Architectural Design Principles and Overall Structure

The design of the proposed network is based on the following three principles: (1) End-to-End Processing, (2) Multi-Path Information forwarding for information preservation, and (3) Adaptive Feature Refinement. These principles are designed to work synergistically to reduce information loss within the deep autoencoder structure and achieve high-fidelity signal reconstruction.
At first, the proposed network is based on an end-to-end approach that directly processes the time-domain waveform. This method allows the network to learn representations directly from the raw signal, which can simplify the processing pipeline. Many audio processing models utilize the Short-Time Fourier Transform (STFT) to convert the signal into the time-frequency (T-F) domain. This approach is effective for analyzing frequency characteristics, but it includes a feature extraction step that requires manual tuning of hyperparameters, such as window size and hop length. Similarly, while Wavelet transforms offer excellent time-frequency localization, we opted for the time-domain approach to avoid the shift-variance limitations of standard discrete wavelets, while implicitly preserving the phase information essential for DOA estimation without incurring the data redundancy associated with complex wavelets. By adopting a time-domain method, our model bypasses this step involving manual parameter selection. This enables the network to automatically learn the salient features for reconstruction, including both magnitude and phase information, directly from the data and allows the entire pipeline to be tuned holistically. Subsequently, multi-path information forwarding is implemented via skip connections [10] within the hierarchical encoder–decoder structure, as illustrated in Figure 3. In the proposed network, the encoder serves as a feature extraction path, deriving hierarchical features through progressive downsampling of the input waveform. Conversely, the decoder functions as a reconstruction path, restoring the signal from the compressed features. Skip connections create shortcut connections that pass multi-scale features from the extraction path directly to the corresponding layers in the reconstruction path. This serves two crucial functions. First, it mitigates the gradient vanishing problem that can occur in deep networks, thereby supporting stable training. Second, it allows the decoder to utilize both high-level contextual information from the compressed latent vector and the rich, low-level details delivered via the skip connections when reconstructing the signal. This effectively preserves the fine-grained temporal structures of the signal, which are often susceptible to loss during compression. Finally, the adaptive feature refinement is realized through an attention mechanism. Although skip connections provide abundant information, not all transferred features are equally important for signal reconstruction. The attention mechanism provides the model with the capability to dynamically focus on information that is more salient for the current reconstruction step while suppressing information that is relatively less important. In this work, we specifically adopt the SE [16] channel attention technique. To clarify the terminology for our proposed method, ‘channel’ in this context refers to the individual dimensions (features) of a feature vector, distinguishing it from the ‘sensor channels’ (Omni, NS, EW) of the sonobuoy. This method calculates the relative importance of each channel (dimension) of the feature vector fused by the skip connections and applies this as a weight. This process serves to reduce the influence of unnecessary noise or redundant information and to emphasize the core feature representations critical for signal reconstruction.
In conclusion, this architecture organically combines these three principles. The skip connections secure the ‘structural path’ to prevent information loss, while the attention mechanism dynamically controls the ‘quality’ of the information that passes through that path. Because this entire process operates directly in the time domain, which preserves phase information, it ultimately enables the high-fidelity reconstruction of DIFAR signals.

3.2. Hierarchical Encoder–Decoder with Skip Connections

The proposed network is composed of a hierarchical encoder and decoder, which implement the design principles mentioned in Section 3.1. Letting $\theta$ be the set of their trainable parameters, the encoder and decoder can be represented as the functions $f_{\mathrm{enc}}(\cdot\,;\theta)$ and $g_{\mathrm{dec}}(\cdot\,;\theta)$, respectively. Both components are constructed from fully connected layers and are interconnected via skip connections. This section details the specific architecture of each component and the experimental design used to explore effective skip connection configurations.

3.2.1. Model Architecture

The encoder, $f_{\mathrm{enc}}$, receives a time-domain input waveform $x \in \mathbb{R}^{N}$ composed of $N$ samples. The output feature map $s_i$ of the $i$-th encoder layer ($i \in \{1, \dots, L\}$) is sequentially computed by taking the output $s_{i-1}$ of the previous layer (where $s_0 = x$) as input, as follows:
$$s_i = \sigma(W_i s_{i-1} + b_i), \tag{1}$$
where $s_i$ is the output feature vector of the $i$-th layer, $W_i$ and $b_i$ are the trainable weight matrix and bias vector of that layer, and $\sigma(\cdot)$ is the ReLU activation function. The encoder generates the final latent vector $z = s_L \in \mathbb{R}^{D_z}$ (where $D_z$ is the latent dimension) and a set of intermediate feature maps $S_{\mathrm{skip}}$ intended for skip connections.
In addition, the decoder, $g_{\mathrm{dec}}$, receives the latent vector $z$ and a selected set of skip feature maps $\{ s_k \mid k \in K \}$ to reconstruct the original waveform. The $j$-th decoder expansion layer fuses the output $d_{j-1}$ of the previous layer (where $d_0 = z$) and the encoder feature $s_k$ delivered via the skip connection. In this study, we adopted element-wise addition as the information fusion method, and the process is expressed as follows:
$$d_j = \sigma(W_j d_{j-1} + b_j + s_k), \tag{2}$$
where $W_j$ and $b_j$ are the trainable weight matrix and bias vector of the $j$-th decoder layer. The fused feature $d_j$ undergoes subsequent processing to become the final output of the $j$-th decoder layer. After this process is repeated through all decoder layers, the final layer employs a Tanh activation function to constrain the dynamic range of the reconstructed waveform $\hat{x}$ to $[-1, 1]$ and outputs it.
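As a concrete illustration, the encoder recursion and the additive skip fusion above can be sketched in NumPy. The layer widths below are toy values chosen for brevity (not the paper's actual dimensions), and the mirrored-layer indexing is one plausible reading of the symmetric structure in Figure 3.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

# Toy layer widths (assumed for illustration; the paper's layers are wider)
dims = [32, 16, 8, 4]                      # encoder: 32 -> 16 -> 8 -> 4 (latent)

def init(shape):
    return rng.standard_normal(shape) * 0.1

# Trainable (W_i, b_i) pairs for the encoder and its mirrored decoder
enc = [(init((dims[i + 1], dims[i])), np.zeros(dims[i + 1]))
       for i in range(len(dims) - 1)]
dec = [(init((dims[i], dims[i + 1])), np.zeros(dims[i]))
       for i in reversed(range(len(dims) - 1))]

def forward(x):
    skips, s = [], x
    for W, b in enc:                       # s_i = ReLU(W_i s_{i-1} + b_i)
        s = relu(W @ s + b)
        skips.append(s)
    z = skips[-1]                          # latent vector z = s_L
    d = z
    for j, (W, b) in enumerate(dec):
        pre = W @ d + b
        k = len(enc) - 2 - j               # index of the mirrored encoder layer
        if k >= 0:                         # element-wise additive skip fusion
            pre = pre + skips[k]
        # the final layer uses tanh to constrain the waveform to [-1, 1]
        d = np.tanh(pre) if j == len(dec) - 1 else relu(pre)
    return z, d

x = rng.standard_normal(32)
z, x_hat = forward(x)
```

In this sketch the outermost decoder layer receives no skip input, since the raw waveform itself is not transmitted; only intermediate encoder features are fused.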

3.2.2. Systematic Exploration of Skip Connection Configurations

The excessive transfer of low-level features can potentially impair generalization performance. Therefore, to explore a favorable trade-off point between reconstruction performance and the amount of transmitted information, we systematically designed a set of model variants with different combinations of skip connections. The index set $K_j$ for a model ($M_j$) using $j$ skip connections is defined by sequentially adding layers, starting from the one closest to the latent space, as shown in:
$$K_j = \{\, i \mid i \in \{L-j, \dots, L-1\} \,\}, \tag{3}$$
where $j$ is the number of skip connections used ($j \in \{0, 1, \dots, L-1\}$), and $j = 0$ corresponds to the baseline model with no skip connections, where $K_0 = \emptyset$. The total amount of information transmitted by each model variant, denoted as the total latent dimension $D_{\mathrm{total}}$, is calculated as the sum of the dimensions of the final latent vector $z$ and the selected skip features, as shown in:
$$D_{\mathrm{total}}(K_j) = \dim(z) + \sum_{k \in K_j} \dim(s_k). \tag{4}$$
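This bookkeeping can be made concrete using the widths later quoted for the AE-S variants in Section 4.2 (latent dimension 10, skip features of dimension 50 and 100, five encoder layers); the snippet below assumes those values apply as stated there.

```python
# Feature widths taken from the AE-S variants described in Section 4.2
dim_z = 10
dim_s = {3: 100, 4: 50}      # dimensions of encoder features s_3 and s_4
L = 5                        # number of encoder layers

def total_dim(j):
    """D_total(K_j) = dim(z) + sum of dim(s_k) over K_j = {L-j, ..., L-1}."""
    K_j = range(L - j, L)
    return dim_z + sum(dim_s[k] for k in K_j)
```

With these values, `total_dim(0)`, `total_dim(1)`, and `total_dim(2)` yield the baseline, AE-S 60, and AE-S 160 budgets, respectively.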

3.2.3. Gated Feature Fusion Mechanism

The element-wise addition method for information fusion, described in (2) of Section 3.2.1, is simple and efficient. However, it is limited by the fact that it always blends the two information sources, the decoder features and the encoder skip features, at a fixed ratio. To model the complex characteristics of the signal with greater sophistication, a gated mechanism was designed as an additional model variant as shown in Figure 4. This mechanism dynamically adjusts the mixing ratio between the two features.
In the gated fusion mechanism, given the upsampled feature $d = \sigma(W_j d_{j-1} + b_j)$ from the $j$-th decoder expansion layer and the encoder feature $s_k$ delivered via the skip connection, the fusion process is performed in the following two steps. First, the two feature vectors are concatenated to create $[d, s_k]$, which is then passed through a separate gate network to compute the gate vector $gate \in \mathbb{R}^{D}$. This gate network is composed of two fully connected layers and a sigmoid activation function, and is expressed as:
$$gate = \sigma_{\mathrm{sig}}\big( W_{g2} \, \sigma( W_{g1} [d, s_k] ) \big), \tag{5}$$
where $W_{g1}$ and $W_{g2}$ are the trainable weights of the gate network, and $\sigma_{\mathrm{sig}}$ is the sigmoid function. Each element of the gate vector $gate$ has a value in the range $[0, 1]$ and serves as a dynamic weight that determines how much of the skip feature $s_k$ to pass through. Second, the computed gate vector is used to perform a weighted sum of the two features. The finally fused feature $d_j$ is calculated as shown in:
$$d_j = gate \odot s_k + (1 - gate) \odot d, \tag{6}$$
where $\odot$ denotes element-wise multiplication. The performance impact of applying this gated fusion method is comparatively analyzed in Section 4 to verify the efficacy of this more sophisticated information fusion approach.
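A minimal NumPy sketch of this gated fusion follows. The feature width `D` and the gate network's hidden width `H` are illustrative assumptions, since the paper does not specify the hidden dimension of the gate network.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

D, H = 8, 4                                   # assumed toy widths
Wg1 = rng.standard_normal((H, 2 * D)) * 0.1   # first gate layer (on [d, s_k])
Wg2 = rng.standard_normal((D, H)) * 0.1       # second gate layer

def gated_fusion(d, s_k):
    """Blend decoder feature d and skip feature s_k with a learned gate."""
    cat = np.concatenate([d, s_k])            # concatenation [d, s_k]
    gate = sigmoid(Wg2 @ relu(Wg1 @ cat))     # per-element weight in (0, 1)
    return gate * s_k + (1.0 - gate) * d      # element-wise weighted sum

d = rng.standard_normal(D)
s_k = rng.standard_normal(D)
fused = gated_fusion(d, s_k)
```

Because the gate is a per-element convex weight, each fused element lies between the corresponding elements of `d` and `s_k`, which is what distinguishes this mechanism from fixed-ratio addition.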

3.3. Feature Refinement with Channel Attention

Although skip connections provide rich, multi-scale features from the encoder to mitigate the information bottleneck, not all transferred feature channels contribute equally to the final signal reconstruction. Some channels may contain critical signal components, whereas others might include information that is relatively less significant or redundant. To ameliorate this inefficiency and enhance the representational power of the proposed model, we selectively integrated a channel attention mechanism for adaptive feature recalibration into specific model variants to comparatively analyze its effect. The attention mechanism serves to emphasize the features contributing the most to signal reconstruction from the fused representations, while suppressing superfluous information.
For this purpose, we adopted the SE block [16] designed based on ‘Squeeze’ and ‘Excitation’, and the block diagram of the SE block is illustrated in Figure 5.

3.3.1. Squeeze: Global Information Embedding

The Squeeze operation aims to generate a low-dimensional descriptor that facilitates learning inter-channel relationships. This is achieved by summarizing the global information for each channel (dimension) of the $D$-dimensional input feature vector $u \in \mathbb{R}^{D}$. In the fully connected layer environment addressed in this study, the input feature vector $u$ itself can be regarded as implicitly containing this global information. Consequently, the Squeeze operation $F_{\mathrm{sq}}$ is implemented by projecting $u$ into a lower-dimensional space via the first fully connected (FC) layer. This process is expressed as:
$$z_{\mathrm{sq}} = F_{\mathrm{sq}}(u, W_1) = \sigma(W_1 u), \tag{7}$$
where $W_1 \in \mathbb{R}^{(D/r) \times D}$ is the trainable weight matrix, $\sigma$ is the ReLU activation function, and $r$ is the reduction ratio which controls the complexity of the model. Consequently, $z_{\mathrm{sq}}$ becomes a $(D/r)$-dimensional descriptor vector, which is then used as the input for the Excitation stage.

3.3.2. Excitation: Adaptive Channel Gating

The Excitation stage utilizes the low-dimensional descriptor $z_{\mathrm{sq}}$, generated during the Squeeze stage, to produce a gating vector that represents the relative importance of each channel. This process is implemented by restoring $z_{\mathrm{sq}}$ to the original $D$ dimensions using a second FC layer, which is then passed through a sigmoid activation function. The sigmoid function serves to normalize the weight for each channel to the range $[0, 1]$. The Excitation operation $F_{\mathrm{ex}}$ is expressed as:
$$scale = F_{\mathrm{ex}}(z_{\mathrm{sq}}, W_2) = \sigma_{\mathrm{sig}}(W_2 z_{\mathrm{sq}}), \tag{8}$$
where $z_{\mathrm{sq}}$ is the output of the Squeeze stage, $W_2 \in \mathbb{R}^{D \times (D/r)}$ is the weight matrix of the second FC layer, and $\sigma_{\mathrm{sig}}$ is the sigmoid function. The resulting gating vector $scale \in \mathbb{R}^{D}$ indicates the importance of each channel.
The final attention-applied output vector $v$ is obtained through channel-wise (element-wise) multiplication of the original input feature vector $u$ and the calculated gating vector $scale$:
$$v = scale \odot u. \tag{9}$$
Through this process, the SE block performs dynamic feature recalibration, amplifying the influence of channels important for signal reconstruction and diminishing the influence of less important channels. The quantitative performance change, based on the application of this attention mechanism, is comparatively analyzed in Section 4.
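The Squeeze, Excitation, and recalibration steps can be written compactly as below; the width `D = 16` and reduction ratio `r = 4` are illustrative assumptions, not values stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

D, r = 16, 4                                    # assumed width and reduction ratio
W1 = rng.standard_normal((D // r, D)) * 0.1     # Squeeze projection, (D/r) x D
W2 = rng.standard_normal((D, D // r)) * 0.1     # Excitation projection, D x (D/r)

def se_block(u):
    z_sq = relu(W1 @ u)          # Squeeze: compress to a D/r descriptor
    scale = sigmoid(W2 @ z_sq)   # Excitation: per-channel gate in (0, 1)
    return scale * u             # channel-wise recalibration of the input

u = rng.standard_normal(D)
v = se_block(u)
```

Since each gate value lies in (0, 1), the block can only attenuate channels relative to the input; the recalibration is multiplicative, never additive.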

4. Experiments

This section details the experimental setup, quantitative results, and qualitative analysis of the experiments conducted to validate the performance of the proposed high-fidelity autoencoder. The experiments were designed to focus on: (1) a performance comparison between the proposed model and existing autoencoder-based models, (2) an analysis of performance variation according to the amount of transmitted information, and (3) an ablation study to verify the effects of the skip connection fusion method and the attention mechanism.

4.1. Simulation Setup

The dataset used for performance validation in this study was custom-generated using the MATLAB (Version R2024b) Phased Array System Toolbox [21] to precisely simulate the actual operational environment. The data generation parameters were established, assuming an active sonar operational scenario, referencing the operational specifications of the actual source sonobuoy (based on AN/SSQ-565) and the receiving DIFAR sonobuoy (based on AN/SSQ-573) [22,23]. The key parameters used for dataset generation are summarized in Table 1.
Overall, a total of 90 h of training data were constructed using various combinations of signals and environmental parameters. The test dataset, used for all evaluations in Section 4.5, was separately constructed and consisted of 300 ten-second WAV files (150 for CW and 150 for LFM). All signals were generated as 3-channel (Omni, NS, EW) recordings with a sampling frequency of 31,250 Hz. For model training, these signals were segmented into 0.1-s (3125-sample) units. Each segment was normalized to the range [−1, 1] before being used as model input.
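The segmentation described above can be sketched as follows. Note that the paper states only that each segment is normalized to [−1, 1]; the per-segment peak scaling used here is one assumed realization of that step.

```python
import numpy as np

fs = 31250                  # sampling frequency from Section 4.1
seg_len = int(0.1 * fs)     # 0.1 s -> 3125 samples per segment

def segment_and_normalize(wav):
    """Split a (channels, samples) signal into 0.1 s segments and scale each
    segment to [-1, 1] by its own peak magnitude (assumed scheme)."""
    n_seg = wav.shape[1] // seg_len
    segs = wav[:, : n_seg * seg_len].reshape(wav.shape[0], n_seg, seg_len)
    peak = np.max(np.abs(segs), axis=-1, keepdims=True)
    return segs / np.where(peak == 0, 1.0, peak)   # avoid division by zero

rng = np.random.default_rng(3)
x = rng.standard_normal((3, 10 * fs))              # one 3-channel, 10 s file
segs = segment_and_normalize(x)                    # shape (3, 100, 3125)
```

Each ten-second, 3-channel file thus yields 100 segments per channel, matching the 3125-sample unit size used for model input.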

4.2. Compared Models

Training the proposed network is defined as the process of optimizing the model parameters $\theta$ to minimize the difference between the original signal and the reconstructed signal. This section describes the objective function for training and the specific settings and environment used for the actual training.
To systematically analyze the performance impact of each component of the proposed architecture (skip connections, fusion methods, and attention), we designed and comparatively evaluated the following models. The key characteristics and design objectives of each model are summarized in Table 2.
  • AE-Baseline: This model is based on the sequential autoencoder structure from [6]. To ensure a fair comparison with the architectures proposed in this study (e.g., AE-S, AE-GF), it was redesigned to share the same backbone structure as these models, but with the skip connections removed. It serves as the baseline model for isolating and evaluating the pure effect of introducing skip connections.
  • AE-Baseline 60 (modified): For a fair comparison with the proposed AE-S 60, this model’s bottleneck dimension was modified to 60. Specifically, the Linear (100, 50) and Linear (50, 10) encoder layers and their symmetric decoder counterparts from the original AE-Baseline were removed and replaced with a Linear (100, 60) encoder layer and a Linear (60, 100) decoder layer. This model therefore transmits the same total amount of information as AE-S 60 and serves as a critical comparison point for separating the effect of “merely increasing the latent space size” from that of “modifying the architecture with skip connections.”
  • AE-RVQ-CM: This model, proposed in previous work [7], is an architecture focused on lightweighting. It utilizes a 10-bit integer latent vector and a Compensation Module.
  • AE-S (S for Skip): This is the core model proposed in this study, which features multiple skip connections and uses element-wise addition as the feature fusion method. The variants, such as AE-S 60 and AE-S 160, are differentiated by the total amount of information (dimensionality) transmitted via the skip connections, as designed in Section 3.2.2.
For example, for the AE-S 60 variant ($j = 1$), $K_1 = \{4\}$, and the total transmitted dimension is $10 + 50 = 60$. For the AE-S 160 variant ($j = 2$), $K_2 = \{3, 4\}$, and the total dimension is $10 + 50 + 100 = 160$. The detailed configurations of the model variants compared in this study are summarized in Table 3. This systematic experimental design enables a quantitative analysis of how the amount of information and the hierarchical depth of the features transmitted via skip connections affect reconstruction performance.
  • AE-S-SE (SE for Squeeze-and-Excitation): This model integrates a channel attention (SE Block) module into the AE-S architecture to verify the effect of refining the fused features.
  • AE-GF (GF for Gated Fusion): Instead of element-wise addition, this model applies the Gated Feature Fusion mechanism, as described in Section 3.2.3, to analyze the effect of dynamically adjusting the mixing ratio of the two features.
  • AE-GF-SE: This model integrates the channel attention module into the AE-GF’s gated fusion architecture to evaluate the synergistic effect of the two mechanisms.
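The transmitted-dimension accounting for the AE-S variants described above can be expressed as a small helper. This is a sketch: the constant and function names are illustrative, with the bottleneck and per-layer skip dimensions taken from the totals quoted for AE-S 60 and AE-S 160.

```python
# Dimensions of the bottleneck and of the encoder feature maps routed
# through skip connections, as quoted for the AE-S variants.
BOTTLENECK_DIM = 10
SKIP_DIMS = {4: 50, 3: 100}  # encoder layer index -> feature dimension

def total_transmitted_dim(skip_layers):
    """Total dimensionality sent over the link: bottleneck plus selected skips."""
    return BOTTLENECK_DIM + sum(SKIP_DIMS[k] for k in skip_layers)

# AE-S 60 transmits the bottleneck plus s4; AE-S 160 additionally transmits s3.
dim_60 = total_transmitted_dim({4})
dim_160 = total_transmitted_dim({3, 4})
```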

4.3. Training and Implementation Details

This section describes the objective function, optimization method, and specific implementation environment and hyperparameters used for training all the models defined in Section 4.2. To ensure a fair comparison of performance differences arising from architectural variations, all models, with the exception of AE-RVQ-CM, were trained using an identical training procedure and data partitioning.
The general architecture described in Section 3.2.1 was implemented with the specific parameters detailed in Table 4. The notation $s_i$ denotes the feature map at the $i$-th layer of the encoder, while $d_j$ denotes the feature map at the $j$-th layer of the decoder.

4.3.1. Objective Function

The main objective function for training all models is defined by directly measuring the similarity between the original waveform $x$ and the reconstructed waveform $\hat{x}$ in the time domain. We calculate the MSE for each sample in the training dataset $\mathcal{D}$ and average it over the entire dataset to obtain the final loss function $\mathcal{L}(\theta)$. The MSE loss function is defined as Equation (10):

$$\mathcal{L}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \left\| x - g_{\mathrm{dec}}\big(f_{\mathrm{enc}}(x; \theta); \theta\big) \right\|_2^2 ,$$

where $\|\cdot\|_2^2$ denotes the squared $L_2$ norm. This loss function directly minimizes the sample-wise error between the two waveforms in the time domain, thereby guiding the model to generate a high-fidelity waveform closely resembling the original signal.
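A minimal NumPy sketch of Equation (10), in which a placeholder callable stands in for the encoder-decoder composition $g_{\mathrm{dec}}(f_{\mathrm{enc}}(x; \theta); \theta)$:

```python
import numpy as np

def mse_loss(batch, reconstruct):
    """Average time-domain MSE over a dataset, as in Equation (10).

    `reconstruct` stands in for the trained encoder-decoder composition;
    each per-sample term is the squared L2 norm of the waveform error.
    """
    errors = [np.sum((x - reconstruct(x)) ** 2) for x in batch]
    return float(np.mean(errors))

# With a perfect reconstructor (identity), the loss is exactly zero.
identity = lambda x: x
batch = [np.random.default_rng(i).standard_normal(3125) for i in range(4)]
loss = mse_loss(batch, identity)
```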

4.3.2. Implementation and Hyperparameters

All experiments were implemented using the PyTorch (Version 2.6.0) deep learning framework on a high-performance workstation equipped with an Intel Core i9-14900KF CPU, 64 GB of RAM, and an NVIDIA GeForce RTX 4090 GPU. This computing environment was utilized for both model training and the precise measurement of inference time. To ensure reproducibility, the seed for all randomized processes was fixed to 42.
To optimize the model parameters $\theta$, we used the Adam optimizer. For most models (excluding AE-RVQ-CM), the Adam optimizer’s hyperparameters were set with an initial learning rate fixed at $3 \times 10^{-4}$, and the momentum-related parameters $\beta_1$ and $\beta_2$ were set to 0.5 and 0.9, respectively. Training was conducted for a total of 600 epochs.
For the AE-RVQ-CM model [7], a separate learning rate scheduler was applied. The model was trained for an initial 300 epochs with the same learning rate to allow it to converge. Following this, fine-tuning was performed in the subsequent training phase by reducing the learning rate by a factor of 0.2. All other models used a fixed learning rate throughout the entire training process.
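The two-phase schedule applied to AE-RVQ-CM can be written as a small step function. This is a sketch; the function and parameter names are illustrative, assuming the same $3 \times 10^{-4}$ initial rate quoted above.

```python
def learning_rate(epoch, base_lr=3e-4, switch_epoch=300, factor=0.2):
    """Learning rate for AE-RVQ-CM: fixed for the first 300 epochs, then
    reduced by a factor of 0.2 for the fine-tuning phase. All other models
    keep base_lr for the full 600 epochs."""
    return base_lr if epoch < switch_epoch else base_lr * factor
```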
For the architectural hyperparameters, the reduction ratio $r$ in all Gated Fusion (GF) blocks was set to 2, and the reduction ratio in all Squeeze-and-Excitation (SE) blocks was set to 5.
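To make these two reduction ratios concrete, the following framework-agnostic NumPy sketch shows where they enter the GF and SE blocks. The paper's implementation is in PyTorch, and the hidden-layer shapes and ReLU/sigmoid choices here are illustrative assumptions consistent with the block descriptions in Section 3.2.3 and the SE literature [16].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(s, d, W1, b1, W2, b2):
    """Gated fusion of an encoder skip feature s and decoder feature d.

    The gate network compresses the concatenated features by the reduction
    ratio r = 2 (W1: 2C -> C/r, W2: C/r -> C) and emits a per-element mixing
    ratio g in (0, 1): out = g * s + (1 - g) * d.
    """
    h = np.maximum(0.0, np.concatenate([s, d]) @ W1 + b1)  # ReLU
    g = sigmoid(h @ W2 + b2)
    return g * s + (1.0 - g) * d

def se_block(x, U1, U2):
    """Squeeze-and-Excitation over channels with reduction ratio 5.

    x: (channels, time). Squeeze = global average over time; excitation =
    two linear layers (C -> C/5 -> C) ending in a sigmoid; the resulting
    per-channel weights rescale the input feature map.
    """
    z = x.mean(axis=1)                          # squeeze: (C,)
    w = sigmoid(np.maximum(0.0, z @ U1) @ U2)   # excitation weights in (0, 1)
    return x * w[:, None]

rng = np.random.default_rng(0)
C = 100
s, d = rng.standard_normal(C), rng.standard_normal(C)
W1 = rng.standard_normal((2 * C, C // 2)) * 0.1
W2 = rng.standard_normal((C // 2, C)) * 0.1
fused = gated_fusion(s, d, W1, np.zeros(C // 2), W2, np.zeros(C))

x = rng.standard_normal((10, 32))
U1 = rng.standard_normal((10, 2)) * 0.1   # C=10, reduction ratio 5 -> hidden 2
U2 = rng.standard_normal((2, 10)) * 0.1
refined = se_block(x, U1, U2)
```

Because the gate is a per-element convex combination, each fused value lies between the corresponding encoder and decoder values, which is what allows the network to learn a data-dependent mixing ratio rather than the fixed 1:1 ratio of element-wise addition.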

4.4. Evaluation Metrics

To conduct a multifaceted and quantitative evaluation of the reconstruction performance of the proposed and comparison models, this study employed the following three standard evaluation metrics from the field of acoustic signal processing.
Spectral Mean Squared Error (Spec-MSE): Spec-MSE calculates the mean squared error between the magnitude spectrograms of the original signal $x$ and the reconstructed signal $\hat{x}$, obtained via the Short-Time Fourier Transform (STFT). It is defined as Equation (11):

$$\mathrm{Spec\text{-}MSE} = \frac{1}{F \times T} \sum_{t=1}^{T} \sum_{f=1}^{F} \Big( \big| \mathrm{STFT}(x)[t, f] \big| - \big| \mathrm{STFT}(\hat{x})[t, f] \big| \Big)^2 ,$$
where F and T represent the number of frequency bins and time frames, respectively. Spec-MSE directly measures the similarity between the spectral structures of the two signals. Therefore, it is the most critical metric for evaluating the ‘high-fidelity reconstruction’ objective of this study. A lower value indicates a smaller spectral reconstruction error.
Signal-to-Noise Ratio (SNR) [dB]: SNR is a classical metric representing the ratio of signal power to noise power. In this study, it is calculated by regarding the original signal $x$ as the ‘signal’ and the difference between the original and reconstructed signals ($x - \hat{x}$) as the ‘noise’ (reconstruction error). The SNR in decibels (dB) is calculated as Equation (12):

$$\mathrm{SNR} = 10 \log_{10} \left( \frac{\sum_{n} x[n]^2}{\sum_{n} \left( x[n] - \hat{x}[n] \right)^2} \right),$$

which evaluates the energy of the reconstruction error relative to the total energy of the signal. A higher value indicates superior reconstruction performance.
Scale-Invariant Signal-to-Noise Ratio (SI-SNR) [dB]: SI-SNR is a signal quality metric designed to be independent of the absolute scale difference between the original signal and the reconstructed signal. It is computed by optimally projecting the reconstructed signal $\hat{x}$ onto the original signal $x$ to match its scale, and then calculating the signal-to-noise ratio. With $x_{\mathrm{target}} = \frac{\langle \hat{x}, x \rangle}{\| x \|^2}\, x$ and $e_{\mathrm{noise}} = \hat{x} - x_{\mathrm{target}}$, the SI-SNR is calculated as Equation (13):

$$\mathrm{SI\text{-}SNR} = 10 \log_{10} \left( \frac{\| x_{\mathrm{target}} \|^2}{\| e_{\mathrm{noise}} \|^2} \right),$$
SI-SNR reflects not only the simple energy difference but also the structural similarity of the waveforms more effectively. Therefore, it has been widely used for performance evaluation in the speech and audio processing fields in recent years [24]. A higher value signifies less signal distortion and superior quality.
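The three metrics above can be sketched in NumPy as follows. The STFT framing in `spec_mse` (rectangular window, FFT size, hop) is an illustrative assumption, since those parameters are not specified here; the SNR and SI-SNR functions follow Equations (12) and (13) directly.

```python
import numpy as np

def snr_db(x, x_hat):
    """SNR (Equation (12)): signal energy over reconstruction-error energy."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))

def si_snr_db(x, x_hat):
    """SI-SNR (Equation (13)): project x_hat onto x before computing the
    ratio, making the metric invariant to the reconstruction's scale."""
    x_target = (np.dot(x_hat, x) / np.dot(x, x)) * x
    e_noise = x_hat - x_target
    return 10.0 * np.log10(np.sum(x_target ** 2) / np.sum(e_noise ** 2))

def spec_mse(x, x_hat, n_fft=256, hop=128):
    """Spec-MSE (Equation (11)): MSE between STFT magnitudes, using a
    simple rectangular-window framing (window/hop are assumptions)."""
    def mag(sig):
        frames = np.stack([sig[i:i + n_fft]
                           for i in range(0, len(sig) - n_fft + 1, hop)])
        return np.abs(np.fft.rfft(frames, axis=1))
    return float(np.mean((mag(x) - mag(x_hat)) ** 2))
```

A quick sanity check of the scale-invariance property: multiplying the reconstruction by any nonzero constant scales $x_{\mathrm{target}}$ and $e_{\mathrm{noise}}$ identically, so the SI-SNR value is unchanged while the plain SNR would shift.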
Computational Efficiency Metric: To validate the feasibility of real-time processing for sonobuoy systems, we evaluated the model’s inference speed. Inference Time is defined as the average computational time required to process a single 0.1-s waveform sample. To ensure the reliability of the measurement, the inference time was recorded on a workstation equipped with an NVIDIA GeForce RTX 4090 GPU. The reported values represent the average of 2000 iterations following a GPU warm-up phase, thereby excluding initialization overheads.
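The measurement protocol (a warm-up phase followed by averaging many timed iterations) can be sketched framework-agnostically as follows. The function name is illustrative, and this CPU-only sketch omits the device synchronization that an actual CUDA measurement would require before each clock read.

```python
import time

def average_inference_time_ms(fn, sample, n_warmup=50, n_iters=2000):
    """Average wall-clock time per call, in milliseconds.

    Warm-up iterations are run first and discarded so that one-time
    initialization costs do not bias the average, mirroring the protocol
    described above (the paper averages 2000 iterations after GPU warm-up).
    """
    for _ in range(n_warmup):
        fn(sample)
    start = time.perf_counter()
    for _ in range(n_iters):
        fn(sample)
    return (time.perf_counter() - start) / n_iters * 1e3
```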

4.5. Results and Discussion

This section presents a quantitative and qualitative analysis of the performance of the comparison models defined in Section 4.2, conducted under the experimental environment described in Section 4.1. First, the superiority of the proposed architecture is demonstrated through an overall performance comparison. Subsequently, an ablation study is conducted to identify the sources of performance improvements of the proposed model by isolating and analyzing the effects of each structural component, specifically skip connections, fusion methods, and attention mechanisms. Furthermore, the trade-off relationship between the transmitted information volume and reconstruction performance is analyzed to explore an efficient model configuration. Finally, qualitative differences in quality are compared via spectrogram visualization.

4.5.1. Overall Performance Comparison

Table 5 summarizes the quantitative performance evaluation results for each model on the CW and LFM test datasets. A notable finding is that the AE-S 1660 model, which uses a simple architecture (AE-S) with an extremely large volume of transmitted information, recorded the highest numerical performance across most metrics. This model achieved a Spec-MSE of $2.01 \times 10^{-5}$ for CW signals and $1.61 \times 10^{-5}$ for LFM signals, demonstrating that very high reconstruction quality is attainable in the absence of information constraints. However, the AE-S 1660 model requires the transmission of a 1660-dimensional latent vector, rendering it impractical from a communication efficiency perspective. This diverges from the objective of this study, which is to achieve ‘high fidelity’ within a ‘constrained communication environment’. Consequently, this research focuses on a comparative analysis of each architecture’s performance within a practical range of information volumes, specifically 60 and 160 dimensions, that are comparable to, or only slightly larger than, that of the AE-Baseline.
When the scope of analysis is narrowed to the practical information volumes (60 and 160 dimensions), the AE-GF-SE 160 model, which integrates gated fusion and attention on top of multi-scale skip connections, exhibited the best performance. Specifically, this model recorded an SI-SNR of 20.35 dB for CW signals and 18.28 dB for LFM signals. These results carry several implications. First (comparing AE-S 60 and AE-Baseline 60), under the limited information capacity of 60 dimensions, the simple introduction of skip connections did not offer a distinct advantage over the sequential structure (LFM SI-SNR 14.43 dB vs. 14.77 dB). Second (comparing AE-Baseline 60 and AE-GF 60), even at an identical 60 dimensions, applying sophisticated structural improvements such as gated fusion (AE-GF 60) was more effective for enhancing performance (LFM SI-SNR 15.37 dB) than using simple skip connections (AE-S 60). Third, within a practical transmission range, the AE-GF-SE model (which combines gated fusion and attention) offers the most favorable trade-off between performance and efficiency when a sufficient information capacity (160 dimensions) is secured.
Furthermore, we analyzed the computational efficiency based on the measurement conditions described in Section 4.3. As shown in the last column of Table 5, the AE-Baseline 60 model achieved the fastest inference time of 0.2804 ms. This is slightly faster than the original AE-Baseline (0.2893 ms) because, as detailed in Section 4.2, the model depth was reduced by removing specific encoding/decoding layers to adjust the bottleneck dimension.
The AE-RVQ-CM model [7] recorded a latency of 2.9961 ms. It is important to note that the model pruning technique used in the previous study was intentionally excluded in this experiment to allow for a direct comparison of structural complexity. Consequently, the computational costs inherent to the iterative quantization search and convolutional layers are fully reflected in the result. Meanwhile, the proposed AE-GF-SE 160 model showed an inference time of 0.6520 ms. Although the integration of Gated Fusion and Attention modules resulted in a slight increase in latency compared to the baseline, the processing time remains negligible compared to the 100 ms frame duration. This suggests that the proposed architecture is feasible for real-time processing on high-performance computing platforms.

4.5.2. Ablation Study on Architectural Components

In this section, an Ablation Study was conducted to individually analyze the performance impact of each component of the proposed architecture. The analysis focuses on three aspects: (1) the effect of introducing skip connections, (2) a comparison of feature fusion methods (additive vs. gated), and (3) the additional effect of the attention mechanism. All comparisons are based on the quantitative results presented in Table 5.
(1) Effect of Skip Connections: To analyze the specific effect of introducing skip connections, we first compared the performance of AE-Baseline 60 (a sequential architecture) and AE-S 60 (a skip architecture). Both models possess the same 60-dimensional information capacity.
  • For CW signals, performance degraded slightly: AE-S 60 (SI-SNR 18.15 dB) was 0.29 dB below AE-Baseline 60 (SI-SNR 18.44 dB), and its Spec-MSE was also slightly worse ($2.93 \times 10^{-5}$ vs. $2.79 \times 10^{-5}$).
  • For the more complex LFM signals, AE-S 60 showed a clear performance degradation compared to AE-Baseline 60, with SI-SNR dropping from 14.77 dB to 14.43 dB and Spec-MSE worsening from $6.28 \times 10^{-5}$ to $7.10 \times 10^{-5}$.
This indicates that at a low 60-dimensional capacity, simple additive skip connections are insufficient, or even detrimental, for reconstructing complex signals. However, this trend was reversed in the AE-S 160 model, which increases the information capacity to 160 dimensions. For LFM signals, AE-S 160 achieved an SI-SNR of 17.02 dB, surpassing both the 60-dimensional baseline without skips (AE-Baseline 60, 14.77 dB) and the low-capacity skip model (AE-S 60, 14.43 dB). In conclusion, skip connections are effective only when accompanied by sufficient information capacity (dimensionality) to handle the signal complexity.
(2) Comparison of Feature Fusion Methods: To compare the different methods for fusing encoder features and decoder features, we analyzed the performance of the AE-S series (using element-wise sum) and the AE-GF series (using gated fusion).
  • Using the 160-dimensional capacity as a baseline, AE-GF 160 improved the SI-SNR for LFM signals to 17.34 dB, compared to 17.02 dB for AE-S 160.
  • This trend was even more pronounced at 60 dimensions, where AE-GF 60 (15.37 dB) outperformed AE-S 60 (14.43 dB).
This suggests that the gated mechanism, which dynamically adjusts the mixing ratio of the two information sources, enables a more sophisticated and effective feature fusion than simple element-wise addition, contributing to performance enhancement across all tested capacities.
(3) Effect of the Attention Mechanism: To evaluate the additional contribution of the attention mechanism, we compared models without attention (AE-GF) against models incorporating the SE block (AE-GF-SE).
  • Using the 160-dimensional models as an example, the AE-GF-SE 160 (with attention) improved the SI-SNR for LFM signals to 18.28 dB from 17.34 dB (compared to AE-GF 160). It also reduced the Spec-MSE from $3.79 \times 10^{-5}$ to $3.16 \times 10^{-5}$ (a reduction of approximately 16.6%).
  • However, this trend was not observed at the 60-dimensional capacity. For LFM signals, the AE-GF-SE 60 (15.12 dB) showed a slight performance degradation compared to the AE-GF 60 (15.37 dB).
This indicates that the channel attention’s ‘feature refinement’ role is most effective when operating on a sufficiently rich set of features (160 dimensions). When the information capacity is already highly constrained (60 dimensions), the attention mechanism may not provide an additional benefit.
In summary, this Ablation Study confirmed that gated fusion is a robust improvement over additive fusion. It also revealed that the effectiveness of skip connections and channel attention is highly dependent on the model’s information capacity, acting complementarily with sufficient dimensionality rather than as independent, universal enhancements.

4.5.3. Analysis of Amount of Transmitted Information-Performance Trade-Off

Figure 6 illustrates the change in reconstruction performance (Spec-MSE) as a function of the total transmitted dimension. For the AE-S series models, the reconstruction error decreases as the transmitted information is increased from 10 to 160 dimensions, implying that the additional useful multi-scale information delivered via the skip connections improves reconstruction quality. However, a notable finding is the “U-shaped” pattern that emerges: when the information volume is increased from 160 to 660 dimensions, performance does not improve but temporarily degrades, and then improves substantially again when the volume is further increased to 1660 dimensions. This phenomenon can be explained by the relationship between the informational value and the redundancy of the transmitted features. Up to the 160-dimension configuration, the new information contributes positively to performance. However, the $s_2$ feature, which is added in the 660-dimension configuration, may be highly redundant with the deeper features or may contain unrefined noise components, adversely affecting the decoder’s reconstruction process. Conversely, when a large volume of low-level features (features from the early encoder layers) is provided, as in the 1660-dimension case, the decoder can rely on this abundant low-level information to reconstruct the signal with high performance, rather than interpreting the compressed high-level information. Nevertheless, models such as AE-S 1660 are impractical because of their high communication costs. Therefore, considering practical transmission efficiency, the favorable trade-off point between performance and cost is found around the 160-dimension mark, an efficient architectural configuration that achieves a high level of reconstruction quality at a reasonable communication cost.

4.5.4. Qualitative Analysis

To complement the quantitative evaluation, a comparative analysis of spectrograms and frequency spectra was performed for representative CW and LFM test signals. For clearer visualization, the frequency axis in Figure 7 and Figure 8 is limited to the range of 0–3 kHz, focusing on the operational bandwidth of DIFAR sonobuoys (5–2400 Hz).
Figure 7 compares the spectrograms of the original signals and the signals reconstructed by each model. In the reconstruction results from the baseline model (AE-Baseline) and the previous research model (AE-RVQ-CM), shown in (b), (g), (c), and (h), faint vertical line artifacts are periodically observed. This is interpreted as a boundary discontinuity problem, occurring because the 0.1-s (3125-sample) input segments do not join seamlessly at their boundaries. Conversely, in the proposed models applying multi-scale skip connections ((d), (i), (e), (j)), these boundary artifacts are reduced. This suggests that the skip connections better preserve the local temporal information of the signal, contributing to the continuity between segments. Particularly when comparing the LFM signal reconstructions, these artifacts are even further suppressed in AE-GF-SE 160 (j) than in AE-GF-SE 60 (i), indicating that a larger volume of information positively impacts temporal precision.
Figure 8a and Figure 9a analyze the spectral reconstruction of the CW signal. As observed in Figure 8a, all comparison models successfully reconstruct the target frequency band, showing a clear peak at the correct frequency. However, a closer inspection provided in Figure 9a reveals that the proposed AE-GF-SE models demonstrate a slightly closer resemblance to the Clean signal around the target frequency peak. While the baseline models exhibit minor discrepancies in peak shape, the proposed model follows the original signal’s spectral curve more precisely.
This trend is also evident in the LFM signal analysis shown in Figure 8b and Figure 9b. Similarly to the CW case, the comparison models (AE-Baseline, AE-RVQ-CM) reconstruct the target signal component relatively well. However, as shown in the magnified view of Figure 9b, the proposed AE-GF-SE models demonstrate a higher similarity to the Clean signal within the target region. Furthermore, in the non-target regions (outside the main signal band), the proposed models follow the spectral characteristics of the original signal more accurately than the comparison models, maintaining a comparable noise level to the Clean signal.
In conclusion, this qualitative analysis visually demonstrates that the proposed skip connection and attention-based architecture preserve not only the simple waveform similarity but also the temporal continuity and the precise shape of the spectral structure with high fidelity, particularly within the region of interest. This serves as evidence to support the quantitative evaluation results.

5. Conclusions

This study proposed a novel AE architecture for the high-fidelity compression and reconstruction of DIFAR sonobuoy acoustic signals. It originated from the critical observation that autoencoder-based approaches prioritizing high compression ratios often suffer from quality limitations due to the information bottleneck. In response, this study introduced a design that prioritizes reconstruction fidelity. The proposed network is based on an encoder–decoder structure that directly processes time-domain waveforms. It integrates multi-scale skip connections and channel attention mechanisms as key elements to minimize information loss.
To validate the efficacy of the proposed architecture, we constructed an extensive dataset that simulates a realistic underwater acoustic environment based on the specifications of operational sonobuoys. Through systematic comparative experiments and an Ablation Study, we demonstrated that the proposed model exhibits improved performance across all quantitative and qualitative metrics compared to conventional deep autoencoders and lightweight models. Notably, by analyzing the trade-off relationship between transmitted information volume and reconstruction performance, we identified a practical favorable trade-off point at approximately 160 dimensions. We also revealed that the gated fusion and attention mechanisms act complementarily to contribute to the enhancement of the efficiency of information fusion and the precision of feature representation, respectively. The model at this favorable trade-off, AE-GF-SE 160, achieved a significant reduction in spectral mean squared error (Spec-MSE) compared to the baseline model (AE-Baseline), decreasing it by approximately 72.58% for CW signals and 90.36% for LFM signals. The results of this study clearly show that sophisticated fusion mechanisms (e.g., Gated Fusion) and capacity-aware attention are effective structures for information preservation in the compression-reconstruction process of time-domain acoustic signals. The proposed high-fidelity reconstruction model has the potential to improve the accuracy of Doppler shift and micro-feature analysis for underwater targets, consequently contributing to the enhanced reliability of ASW operations. In future research, we plan to validate the proposed technique using real-world measurement data to further ensure its practical robustness.

Author Contributions

Conceptualization, Y.P. and J.H.; methodology, J.H.; software, Y.P.; validation, Y.P. and J.H.; formal analysis, Y.P.; investigation, J.H.; resources, J.H.; data curation, Y.P.; writing—original draft preparation, Y.P.; writing—review and editing, Y.P. and J.H.; visualization, Y.P.; supervision, J.H.; project administration, J.H.; funding acquisition, J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Agency for Defense Development (ADD), Republic of Korea, under grant UI257009DD.

Data Availability Statement

The original data presented in the study are openly available in Zenodo at https://doi.org/10.5281/zenodo.17529063 [25].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AE: Autoencoder
ASR: Automatic Speech Recognition
ASW: Anti-Submarine Warfare
CW: Continuous Wave
DIFAR: Directional Frequency Analysis and Recording
DOA: Direction-of-Arrival
DPCRN: Dual-Path Convolution Recurrent Network
FDM: Frequency Division Multiplexing
GF: Gated Feature Fusion
GNNs: Graph Neural Networks
LFM: Linear Frequency Modulation
MPA: Maritime Patrol Aircraft
MSE: Mean Squared Error
NAC: Neural Audio Codecs
RVQ: Residual Vector Quantization
SE: Squeeze-and-Excitation
SI-SNR: Scale-Invariant Signal-to-Noise Ratio
SNR: Signal-to-Noise Ratio
Spec-MSE: Spectral Mean Squared Error
STFT: Short-Time Fourier Transform
T-F: Time-Frequency

References

  1. Urick, R.J. Principles of Underwater Sound, 3rd ed.; Peninsula Publishing: Los Altos, CA, USA, 1983. [Google Scholar]
  2. Massoud, A.; Osman, A.; Noureldin, A. DOA estimation based on fourth order cumulant beamforming for nonuniform linear array of DIFAR sonobuoys. Comput. Electr. Eng. 2012, 38, 986–993. [Google Scholar] [CrossRef]
  3. Kuzu, A.; Danis, U.; Kurt, E.; Karabulut, E.; Sahinkaya, D.; Bilgic, E.; Kaplaner, A.; Birecik, S. Calibration and test of DIFAR sonobuoys. In Proceedings of the 2011 IEEE International Symposium on Industrial Electronics (ISIE), Gdansk, Poland, 27–30 June 2011; pp. 1276–1281. [Google Scholar] [CrossRef]
  4. Lee, J.; Han, S.; Kwon, B. Development of Communication Device for Sound Signal Receiving and Controlling of Sonobuoy. J. Korea Inst. Mil. Sci. Technol. 2021, 24, 317–327. [Google Scholar] [CrossRef]
  5. Thode, A.M.; Sakai, T.; Michalec, J.; Rankin, S.; Soldevilla, M.S.; Martin, B.; Kim, K.H. Displaying bioacoustic directional information from sonobuoys using “azigrams”. J. Acoust. Soc. Am. 2019, 146, 95–102. [Google Scholar] [CrossRef] [PubMed]
  6. Park, J.; Seok, J.; Hong, J. Autoencoder-Based Signal Modulation and Demodulation Methods for Sonobuoy Signal Transmission and Reception. Sensors 2022, 22, 6510. [Google Scholar] [CrossRef] [PubMed]
  7. Park, Y.; Hong, J. Autoencoder-Based DIFAR Sonobuoy Signal Transmission and Reception Method Incorporating Residual Vector Quantization and Compensation Module: Validation Through Air Channel Modeling. Appl. Sci. 2025, 15, 92. [Google Scholar] [CrossRef]
  8. Hodges, R.P. Underwater Acoustics: Analysis, Design and Performance of Sonar; Wiley: Chichester, UK, 2011. [Google Scholar]
  9. Curlander, J.C.; McDonough, R.N. Synthetic Aperture Radar: Systems and Signal Processing; Wiley: New York, NY, USA, 1992. [Google Scholar]
  10. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015. [Google Scholar] [CrossRef]
  11. Le, X.; Chen, H.; Chen, K.; Lu, J. DPCRN: Dual-Path Convolution Recurrent Network for Single Channel Speech Enhancement. In Proceedings of the Interspeech 2021, Brno, Czechia, 30 August–3 September 2021; pp. 2811–2815. [Google Scholar] [CrossRef]
  12. Tzirakis, P.; Kumar, A.; Donley, J. Multi-Channel Speech Enhancement Using Graph Neural Networks. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021. [Google Scholar] [CrossRef]
  13. Chau, H.N.; Bui, T.D.; Nguyen, H.B.; Duong, T.T.H.; Nguyen, Q.C. A Novel Approach to Multi-Channel Speech Enhancement Based on Graph Neural Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 1133–1144. [Google Scholar] [CrossRef]
  14. Zeghidour, N.; Luebs, A.; Omran, A.; Skoglund, J.; Tagliasacchi, M. SoundStream: An End-to-End Neural Audio Codec. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 495–507. [Google Scholar] [CrossRef]
  15. Défossez, A.; Copet, J.; Synnaeve, G.; Adi, Y. High Fidelity Neural Audio Compression. arXiv 2022. [Google Scholar] [CrossRef]
  16. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef] [PubMed]
  17. Liu, W.; Zhan, T.; Xu, C. Automatic speech recognition based on attention-enhanced blockformer. In Proceedings of the Fourth International Conference on Computer Vision and Data Mining (ICCVDM 2023), Changchun, China, 22–24 October 2023. [Google Scholar] [CrossRef]
  18. Iuzzolino, M.L.; Koishida, K. AV(SE)2: Audio-Visual Squeeze-Excite Speech Enhancement. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7539–7543. [Google Scholar] [CrossRef]
  19. Rouvier, M.; Bousquet, P.M. Studying Squeeze-and-Excitation Used in CNN for Speaker Verification. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 1110–1115. [Google Scholar] [CrossRef]
  20. Li, M.; Zheng, Y.; Li, D.; Wu, Y.; Wang, Y.; Fei, H. Ms-senet: Enhancing Speech Emotion Recognition Through Multi-scale Feature Fusion with Squeeze-and-excitation Blocks. arXiv 2023. [Google Scholar] [CrossRef]
  21. Phased Array System Toolbox Documentation. Available online: https://www.mathworks.com/help/phased/index.html (accessed on 31 October 2025).
  22. Ultra Maritime Ltd. Multistatic Low Frequency Active Source & Moni Receive AN/SSQ-565, Datasheet v1-2-24; Ultra Maritime Ltd.: Dartmouth, NS, Canada, 2024. [Google Scholar]
  23. Ultra Maritime Ltd. Passive Directional Multi-Mode Sonobuoy AN/SSQ-573, Datasheet v1-2-24; Ultra Maritime Ltd.: Dartmouth, NS, Canada, 2024. [Google Scholar]
  24. Luo, Y.; Mesgarani, N. TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 696–700. [Google Scholar] [CrossRef]
  25. Park, Y. Simulated 3-Channel DIFAR Sonobuoy Acoustic Dataset (CW and LFM). Zenodo 2025. [Google Scholar] [CrossRef]
Figure 1. Active detection scenario using a DIFAR sonobuoy system.
Figure 2. Architecture of the autoencoder-based signal transmission and reception system, mapping the sonobuoy to the encoder and the aircraft to the decoder.
Figure 3. Architecture of the proposed model integrating skip connections, GF, and SE blocks.
Figure 4. Detailed block diagram of the Gated Feature Fusion (GF) block.
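The gating idea behind the GF block can be illustrated with a minimal NumPy sketch. This is an assumption about the gating form, not the paper's implementation: the gate is taken to be a sigmoid over a linear map of the concatenated decoder-path and skip-path features, yielding an element-wise convex combination of the two.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(d, s, W, b):
    """Fuse a decoder feature d with a skip feature s via a learned gate.

    g = sigmoid(W @ [d; s] + b) weighs the two paths element-wise:
    fused = g * d + (1 - g) * s. Illustrative sketch only; the paper's
    exact GF block is not reproduced here.
    """
    g = sigmoid(W @ np.concatenate([d, s]) + b)
    return g * d + (1.0 - g) * s

rng = np.random.default_rng(0)
dim = 50                      # e.g., a 50-dim decoder feature (cf. Table 4, d_1)
d = rng.standard_normal(dim)  # decoder-path feature
s = rng.standard_normal(dim)  # encoder skip feature
W = rng.standard_normal((dim, 2 * dim)) * 0.1
b = np.zeros(dim)
fused = gated_fusion(d, s, W, b)
print(fused.shape)  # (50,)
```

Because the gate lies in (0, 1), each fused element stays between the corresponding decoder and skip values, which is what makes the fusion adaptive rather than a fixed sum.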
Figure 5. Operational flow of the SE block, which learns and recalibrates the channel-wise importance of input features.
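As a rough illustration of the squeeze-and-recalibrate flow described in the Figure 5 caption, the NumPy sketch below applies a global average pool (squeeze), a two-layer excitation with ReLU and sigmoid, and channel-wise rescaling. The channel count, time length, and reduction ratio are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def se_block(x, W1, W2):
    """Squeeze-and-Excitation over a (channels, time) feature map.

    Squeeze: average over time gives one descriptor per channel.
    Excitation: bottleneck dense layer + ReLU, then expansion + sigmoid,
    producing per-channel weights in (0, 1) that rescale the input.
    """
    z = x.mean(axis=1)                     # squeeze: (C,)
    h = np.maximum(W1 @ z, 0.0)            # excitation bottleneck, ReLU
    w = 1.0 / (1.0 + np.exp(-(W2 @ h)))    # per-channel weights, sigmoid
    return x * w[:, None]                  # recalibrate each channel

rng = np.random.default_rng(1)
C, T, r = 16, 128, 4                       # channels, time steps, reduction ratio
x = rng.standard_normal((C, T))
W1 = rng.standard_normal((C // r, C)) * 0.1
W2 = rng.standard_normal((C, C // r)) * 0.1
y = se_block(x, W1, W2)
print(y.shape)  # (16, 128)
```

Since the sigmoid weights are strictly below 1, the block can only attenuate channels it deems less informative, which is the "feature refinement" role the SE block plays in the proposed model.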
Figure 6. Spectrogram MSE as a function of the total transmitted latent dimension. Performance is evaluated on (a) CW signals and (b) LFM signals.
Figure 7. Spectrogram visualization comparing the reconstruction quality of different models for representative (a–e) CW and (f–j) LFM signals. All spectrograms shown are from the Omni-directional (Omni) sensor channel. The subfigures correspond to: (a,f) Original clean signals; (b,g) AE-Baseline [7]; (c,h) AE-RVQ-CM [8]; (d,i) AE-GF-SE 60; (e,j) AE-GF-SE 160.
Figure 8. Comparison of the average frequency spectra for (a) CW and (b) LFM test signals in the operational bandwidth (0–3 kHz).
Figure 9. Magnified comparison of the average frequency spectra for (a) CW and (b) LFM test signals. The proposed AE-GF-SE models align more closely with the clean signal in the target signal band than the baseline models.
Table 1. Key parameters used for dataset generation.

| Category | Parameter | Value/Range | Reference Sonobuoy |
|---|---|---|---|
| General | Sampling Frequency (fs) | 31,250 Hz | – |
| | Sound Propagation Speed | 1500 m/s | – |
| | Acoustic Environment Depth | 500 m | – |
| Source Signal (Transmitter) | Signal Types | CW/LFM | AN/SSQ-565 [22] (Active Source) |
| | CW Frequency Range | 5–2400 Hz | AN/SSQ-573 [23] Receiver |
| | LFM Frequency Range | 5–2400 Hz | AN/SSQ-573 Receiver |
| | LFM Bandwidth | {100, 200, 300, 400} Hz | AN/SSQ-565 (Active Source) |
| | LFM Chirp Type | Up-chirp, Down-chirp | AN/SSQ-565 (Active Source) |
| | Pulse Duration | {1, 2, 4} s | AN/SSQ-565 (Active Source) |
| Acoustic Environment | Horizontal Range (Target to Receiver) | 0–5 km (Random) | – |
| | Source Depth | 10–300 m (Random) | AN/SSQ-565 selectable depths |
| | Receiver (DIFAR) Depth | {30, 60, 120, 300} m (Random Selection) | AN/SSQ-573 selectable depths [23] |
| Receiver Sensor (DIFAR) | Sensor Type | 3-Channel DIFAR (Omni, NS, EW) | AN/SSQ-573 (Passive/Active Receiver) |
| | Acoustic Frequency Range | 5–2400 Hz | AN/SSQ-573 |
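A minimal generator can exercise the LFM parameters in Table 1. The sketch below synthesizes unit-amplitude up- and down-chirps from one illustrative choice of start frequency, bandwidth, and pulse duration drawn from the table's ranges; the paper's dataset is generated with the MATLAB Phased Array System Toolbox [21], which this NumPy sketch does not reproduce.

```python
import numpy as np

fs = 31_250       # sampling frequency from Table 1 (Hz)
f0 = 1_000.0      # chirp start frequency (Hz); illustrative, within 5-2400 Hz
bw = 200.0        # bandwidth, one of {100, 200, 300, 400} Hz
T = 2.0           # pulse duration, one of {1, 2, 4} s

t = np.arange(int(fs * T)) / fs
k = bw / T                                                  # chirp rate (Hz/s)
# instantaneous frequency f0 + k*t integrates to the quadratic phase below
lfm_up = np.cos(2 * np.pi * (f0 * t + 0.5 * k * t**2))           # up-chirp
lfm_down = np.cos(2 * np.pi * ((f0 + bw) * t - 0.5 * k * t**2))  # down-chirp
print(lfm_up.shape)  # (62500,)
```

A CW pulse is the degenerate case bw = 0, i.e. `np.cos(2 * np.pi * f0 * t)`.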
Table 2. Description of the compared model architectures and their key characteristics.

| Model Abbreviation | Bottleneck/Total Dim. | Skip Connection | Feature Fusion | Attention |
|---|---|---|---|---|
| AE-Baseline [6] | 10 (float16) | N | – | N |
| AE-Baseline 60 (modified) | 60 (float16) | N | – | N |
| AE-RVQ-CM [7] | 10 (int10) | N | – | N |
| AE-S Series | 60, 160, … (float16) | Y | Element-wise Sum | N |
| AE-S-SE Series | 60, 160 (float16) | Y | Element-wise Sum | Y |
| AE-GF Series | 60, 160 (float16) | Y | Gated Fusion | N |
| AE-GF-SE Series | 60, 160 (float16) | Y | Gated Fusion | Y |

Note: "Bottleneck/Total Dim." refers to the total transmitted dimensions, and the value in parentheses indicates the data type of the latent vector.
Table 3. Configurations of the experimental model variants with different skip connection schemes. Total transmitted dimension includes the 10-dim bottleneck latent vector.

| Model Variant Name | Number of Skips (j) | Activated Skip Index Set (K_j) | Total Transmitted Dimension (D_total) |
|---|---|---|---|
| AE-Baseline [6] | 0 | – | 10 |
| AE-S 60 | 1 | {4} | 60 |
| AE-S 160 | 2 | {3, 4} | 160 |
Table 4. Detailed layer configurations of the encoder and decoder. The notation s_i represents the feature map at the i-th layer of the encoder, while d_j denotes the feature map at the j-th layer of the decoder.

| Component | Layer Name | Input | Output | Output Dim | Activation |
|---|---|---|---|---|---|
| Encoder | Linear1 | s_0 = x | s_1 | 1000 | ReLU |
| | Linear2 | s_1 | s_2 | 500 | ReLU |
| | Linear3 | s_2 | s_3 | 100 | ReLU |
| | Linear4 | s_3 | s_4 | 50 | ReLU |
| | Linear5 | s_4 | z = s_5 | 10 | ReLU |
| Decoder | Linear6 | d_0 = z | d_1 | 50 | ReLU |
| | Linear7 | d_1 | d_2 | 100 | ReLU |
| | Linear8 | d_2 | d_3 | 500 | ReLU |
| | Linear9 | d_3 | d_4 | 1000 | ReLU |
| | Linear10 (output) | d_4 | x̂ = d_5 | 3125 | Tanh |
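The layer sizes in Table 4 imply a symmetric 3125 → 10 → 3125 mapping (the 3125-sample input frame is an assumption inferred from the 3125-dim output layer). The NumPy sketch below runs a forward pass with random weights, exercising only the tabulated shapes and activations, not trained parameters.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def mlp_forward(x, dims, final_act, seed):
    """Run x through dense layers of the given sizes.

    Hidden layers use ReLU; the last layer uses final_act. Weights are
    random, so only the Table 4 layer dimensions are checked here."""
    rng = np.random.default_rng(seed)
    n = len(dims) - 1
    for i in range(n):
        W = rng.standard_normal((dims[i + 1], dims[i])) * 0.01
        x = W @ x
        x = final_act(x) if i == n - 1 else relu(x)
    return x

enc_dims = [3125, 1000, 500, 100, 50, 10]   # Linear1-Linear5, ReLU throughout
dec_dims = [10, 50, 100, 500, 1000, 3125]   # Linear6-Linear10, Tanh output

x = np.random.default_rng(0).standard_normal(3125)   # one input frame
z = mlp_forward(x, enc_dims, relu, seed=1)           # 10-dim bottleneck latent
x_hat = mlp_forward(z, dec_dims, np.tanh, seed=2)    # reconstructed frame
print(z.shape, x_hat.shape)  # (10,) (3125,)
```

Only the 10-dim z (plus any activated skip features, per Table 3) would need to be transmitted over the sonobuoy RF link; the Tanh output keeps the reconstruction bounded in [-1, 1].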
Table 5. Quantitative performance comparison of all models on the CW and LFM test datasets.

| Method | Signal Type | Latent Vector Size | Spectrogram MSE | SNR (dB) | SI-SNR (dB) | Inference Time (ms) |
|---|---|---|---|---|---|---|
| AE-Baseline [6] | CW | 10 (float 16 bit) | 8.06 × 10⁻⁵ | 13.69 | 13.53 | 0.2893 |
| | LFM | | 3.28 × 10⁻⁴ | 6.80 | 5.24 | |
| AE-Baseline 60 (modified) | CW | 60 (float 16 bit) | 2.79 × 10⁻⁵ | 18.26 | 18.44 | 0.2804 |
| | LFM | | 6.28 × 10⁻⁵ | 14.90 | 14.77 | |
| AE-RVQ-CM [7] | CW | 10 (int 10 bit) | 5.33 × 10⁻⁵ | 14.62 | 14.86 | 2.9961 |
| | LFM | | 1.43 × 10⁻⁴ | 10.09 | 9.64 | |
| AE-S 60 | CW | 60 (float 16 bit) | 2.93 × 10⁻⁵ | 17.98 | 18.15 | 0.3370 |
| | LFM | | 7.10 × 10⁻⁵ | 14.60 | 14.43 | |
| AE-S 160 | CW | 160 (float 16 bit) | 2.71 × 10⁻⁵ | 18.67 | 18.92 | 0.3480 |
| | LFM | | 3.94 × 10⁻⁵ | 17.04 | 17.02 | |
| AE-S 660 | CW | 660 (float 16 bit) | 3.83 × 10⁻⁵ | 18.31 | 18.98 | 0.3601 |
| | LFM | | 4.31 × 10⁻⁵ | 17.20 | 17.32 | |
| AE-S 1660 | CW | 1660 (float 16 bit) | 2.01 × 10⁻⁵ | 21.31 | 21.78 | 0.3842 |
| | LFM | | 1.61 × 10⁻⁵ | 21.35 | 21.53 | |
| AE-S-SE 60 | CW | 60 (float 16 bit) | 2.57 × 10⁻⁵ | 18.46 | 18.64 | 0.4632 |
| | LFM | | 6.05 × 10⁻⁵ | 15.09 | 14.94 | |
| AE-S-SE 160 | CW | 160 (float 16 bit) | 2.39 × 10⁻⁵ | 19.36 | 19.63 | 0.5041 |
| | LFM | | 3.59 × 10⁻⁵ | 17.58 | 17.60 | |
| AE-GF 60 | CW | 60 (float 16 bit) | 2.50 × 10⁻⁵ | 18.81 | 19.03 | 0.4243 |
| | LFM | | 5.73 × 10⁻⁵ | 15.50 | 15.37 | |
| AE-GF 160 | CW | 160 (float 16 bit) | 2.43 × 10⁻⁵ | 19.34 | 19.70 | 0.4310 |
| | LFM | | 3.79 × 10⁻⁵ | 17.30 | 17.34 | |
| AE-GF-SE 60 | CW | 60 (float 16 bit) | 2.50 × 10⁻⁵ | 18.61 | 18.79 | 0.4807 |
| | LFM | | 5.84 × 10⁻⁵ | 15.26 | 15.12 | |
| AE-GF-SE 160 | CW | 160 (float 16 bit) | 2.21 × 10⁻⁵ | 19.98 | 20.35 | 0.6520 |
| | LFM | | 3.16 × 10⁻⁵ | 18.18 | 18.28 | |
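The SI-SNR column in Table 5 is assumed to follow the standard scale-invariant definition used with TasNet-style models [24]: the estimate is projected onto the zero-mean reference, and the energy of that projection is compared with the energy of the residual. A minimal NumPy sketch:

```python
import numpy as np

def si_snr_db(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB (common TasNet-style definition).

    Both signals are zero-meaned, the estimate is projected onto the
    reference (s_target), and the leftover (e_noise) is treated as error."""
    ref = ref - ref.mean()
    est = est - est.mean()
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps) /
                         (np.dot(e_noise, e_noise) + eps))

rng = np.random.default_rng(4)
ref = rng.standard_normal(1000)                      # clean reference
est = 0.5 * ref + 0.05 * rng.standard_normal(1000)   # rescaled, slightly noisy
print(round(si_snr_db(est, ref), 1))
```

Unlike plain SNR, rescaling the estimate does not change this metric (a perfectly scaled copy scores near-infinite SI-SNR), which is why Table 5 reports both.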
Share and Cite

Park, Y.; Hong, J. High-Fidelity Decoding Method for Acoustic Data Transmission and Reception of DIFAR Sonobuoy Using Autoencoder. J. Mar. Sci. Eng. 2025, 13, 2402. https://doi.org/10.3390/jmse13122402