1. Introduction
In the modern maritime environment, underwater acoustic surveillance is an essential defense mechanism for the early detection of and response to hostile threats. The success of Anti-Submarine Warfare (ASW), in particular, relies significantly on a comprehensive understanding of the underwater acoustic environment and the capability for precise signal analysis [1,2].
Directional Frequency Analysis and Recording (DIFAR) sonobuoys employed in ASW operations are a type of acoustic vector sensor, representing a sophisticated sensing system that captures not only acoustic pressure but also directional information of the sound field. A DIFAR sonobuoy typically consists of one omnidirectional hydrophone, which measures a scalar pressure value, and two orthogonal dipole sensors, which measure the particle velocity vector. Combining these multi-channel outputs enables the estimation of the direction-of-arrival (DOA) in addition to the signal's frequency characteristics, which contributes to the localization and tracking of target maneuvers [3,4].
The collected data is transmitted via wireless channels to aircraft, as illustrated in Figure 1, where it undergoes complex signal processing. This data pipeline, however, is constrained by the inherent limitations of expendable equipment: namely, limited battery life and restricted communication bandwidth [5]. Consequently, the efficient and lossless transmission of voluminous raw data to analysis platforms remains a critical technical challenge affecting the operational effectiveness of DIFAR systems.
To address this challenge, deep learning-based data compression techniques were introduced to overcome the inefficiencies of traditional methods such as Frequency Division Multiplexing (FDM) [6]. Among these, the Autoencoder (AE) has been investigated as an effective solution due to its ability to transform signals into low-dimensional latent representations. Initial research applying AEs to DIFAR signal compression demonstrated potential for significant reductions in data transmission volume. However, these approaches were observed to suffer from signal quality degradation caused by the information bottleneck inherent in the encoding process [6]. Although subsequent studies improved compression efficiency by integrating model lightweighting and Residual Vector Quantization (RVQ) techniques, they also adhered to a research paradigm that prioritized the compression ratio [7]. As a result, while the communication payload was effectively reduced, the loss of fine-grained acoustic information during compression remained a persistent factor limiting the precision of DIFAR signal analysis.
In active detection scenarios utilizing DIFAR sonar systems, analytical accuracy is contingent upon the precise measurement of distortions in the received signal, which result from the transmitted waveform reflecting off a target. For the Continuous Wave (CW) and Linear Frequency Modulation (LFM) signals used in this study, preserving signal fidelity is a critical consideration. For instance, the Doppler shift induced by target motion manifests as a frequency change in the CW signal, serving as a primary cue for estimating the target's relative velocity [8]. Likewise, the time delay and pulse width variations in LFM signals contain information regarding the target's range and relative acceleration [9]. Conventional compression-centric approaches, in the process of achieving high compression ratios, are susceptible to the loss of information required to represent these subtle frequency and temporal variations. This information loss can consequently lead to increased errors in estimating target maneuver parameters or result in degraded detection performance. Therefore, an approach that establishes the fidelity of the reconstructed signal as the primary objective is required to ensure the precise measurement of these parameters.
To achieve high-fidelity reconstruction, the first consideration is the skip connection. This refers to a structural link that bypasses certain layers to transfer information directly, and it can be broadly categorized into two types: long-range and short-range connections between the encoder and the decoder. The long-range connection, as utilized in models like U-Net [10] and DPCRN [11], directly forwards low-level features extracted from the early layers of the encoder to the later layers of the decoder, skipping the deep bottleneck layers where information is compressed. This approach has been widely adopted not only in single-channel speech enhancement models [11] but also as a core framework in recent architectures for processing complex multi-channel audio signals [12,13], demonstrating its effectiveness in preserving fine-grained time-frequency structures from loss during compression. Subsequently, the short-range connection, typically in the form of a 'Residual Block', plays a pivotal role in the field of modern time-domain Neural Audio Codecs (NACs). Well-known models such as Google's SoundStream [14] and Meta's EnCodec [15] deeply integrate skip connections as residual blocks within their encoders and decoders. This ensures that even in very deep networks, the input signal can be passed through directly via an identity mapping, which prevents the loss of fine-grained waveform information and enables stable training. The success of these approaches suggests that structural designs minimizing information loss are a critical consideration for achieving high-quality compression and reconstruction of time-domain waveforms.

The second consideration is the attention mechanism, a technique that enables the model to selectively focus on significant elements within the transferred information. In the field of audio signal processing, channel attention, which explicitly models the interdependencies between channels of a feature map, has been used effectively. A representative example is the Squeeze-and-Excitation (SE) block [16]. The SE block performs 'Squeeze' (global information compression) and 'Excitation' (channel-wise weight recalibration) operations, playing a crucial role in enhancing the representational power of the model by selectively emphasizing channels that are more salient for signal reconstruction. The efficacy of this channel attention has been validated across various sub-disciplines of acoustic signal processing. In Automatic Speech Recognition (ASR), an 'Attention-Enhanced Squeeze-and-Excitation block' has been applied to Transformer-based structures such as Blockformer to improve recognition performance [17]. In speech enhancement, SE blocks have been proposed for audio-visual models to improve quality in noisy environments by fusing audio signals with visual information, such as lip movements [18]. Furthermore, in speaker verification, research has combined SE modules with deep CNNs, such as ResNeXt, to adaptively adjust channel-wise feature maps and reduce noise interference, thereby increasing the extraction efficiency of speaker-specific characteristics [19]. In speech emotion recognition, SE blocks are utilized to effectively reinforce core channel information related to emotional expression during the capture and fusion of multi-scale features [20]. Ultimately, this technique maximizes the overall representational power of the model by selectively emphasizing the most significant feature channels for signal restoration, even within limited information contexts such as noisy inputs or latent vectors.
Based on the technical background reviewed, this paper proposes a novel autoencoder architecture to achieve the high-fidelity reconstruction of encoded DIFAR waveform data. Unlike the previous studies in [6,7], the proposed model redesigns the latent space to enable the decoding of finer feature representations by integrating two main building blocks as follows:
Multi-scale Feature Fusion: Adopting a U-Net structure, we introduce skip connections that fuse hierarchical features from the encoder with their corresponding layers in the decoder. This is intended to preserve both the global structure and the local details of the signal.
Adaptive Feature Recalibration: SE blocks, a channel-wise attention mechanism, are applied to dynamically recalibrate the relative importance of the fused feature channels, thereby guiding the model to focus on the information most contributory to signal reconstruction.
This architecture improves the spectral accuracy of the reconstructed signal at the cost of a moderate increase in the data transmission rate. The contributions of this study are summarized as follows: First, we propose an autoencoder design that employs multi-scale feature fusion and adaptive feature recalibration to reduce acoustic data reconstruction error. Second, we quantitatively demonstrate that the proposed model reduces the spectral reconstruction error compared to existing models on data simulated according to the DIFAR specification, and we identify a technical trade-off point between communication efficiency and signal fidelity.
3. Proposed Method
In this study, we propose a novel autoencoder architecture for the high-fidelity reconstruction of time-domain DIFAR waveforms. This architecture is based on hierarchical feature fusion and incorporates skip connections. The proposed network is designed to mitigate the information bottleneck, which is characteristic of sequential structures of conventional deep autoencoders, and to improve reconstruction performance by effectively preserving acoustic features at multiple scales. This section describes in detail the design principles and overall architecture of the proposed network, and the specific implementation of its key components, specifically the skip connections and attention mechanisms.
3.1. Architectural Design Principles and Overall Structure
The design of the proposed network is based on the following three principles: (1) End-to-End Processing, (2) Multi-Path Information forwarding for information preservation, and (3) Adaptive Feature Refinement. These principles are designed to work synergistically to reduce information loss within the deep autoencoder structure and achieve high-fidelity signal reconstruction.
At first, the proposed network is based on an end-to-end approach that directly processes the time-domain waveform. This method allows the network to learn representations directly from the raw signal, which can simplify the processing pipeline. Many audio processing models utilize the Short-Time Fourier Transform (STFT) to convert the signal into the time-frequency (T-F) domain. This approach is effective for analyzing frequency characteristics, but it includes a feature extraction step that requires manual tuning of hyperparameters, such as window size and hop length. Similarly, while wavelet transforms offer excellent time-frequency localization, we opted for the time-domain approach to avoid the shift-variance limitations of standard discrete wavelets, while implicitly preserving the phase information essential for DOA estimation without incurring the data redundancy associated with complex wavelets. By adopting a time-domain method, our model bypasses this step involving manual parameter selection. This enables the network to automatically learn the salient features for reconstruction, including both magnitude and phase information, directly from the data and allows the entire pipeline to be tuned holistically.

Subsequently, multi-path information forwarding is implemented via skip connections [10] within the hierarchical encoder–decoder structure, as illustrated in Figure 3. In the proposed network, the encoder serves as a feature extraction path, deriving hierarchical features through progressive downsampling of the input waveform. Conversely, the decoder functions as a reconstruction path, restoring the signal from the compressed features. Skip connections create shortcuts that pass multi-scale features from the extraction path directly to the corresponding layers in the reconstruction path. This serves two crucial functions. First, it mitigates the gradient vanishing problem that can occur in deep networks, thereby supporting stable training. Second, it allows the decoder to utilize both high-level contextual information from the compressed latent vector and the rich, low-level details delivered via the skip connections when reconstructing the signal. This effectively preserves the fine-grained temporal structures of the signal, which are often susceptible to loss during compression.

Finally, adaptive feature refinement is realized through an attention mechanism. Although skip connections provide abundant information, not all transferred features are equally important for signal reconstruction. The attention mechanism provides the model with the capability to dynamically focus on information that is more salient for the current reconstruction step while suppressing information that is relatively less important. In this work, we specifically adopt the SE [16] channel attention technique. To clarify the terminology for our proposed method, 'channel' in this context refers to the individual dimensions (features) of a feature vector, distinguishing it from the 'sensor channels' (Omni, NS, EW) of the sonobuoy. This method calculates the relative importance of each channel (dimension) of the feature vector fused by the skip connections and applies this as a weight. This process serves to reduce the influence of unnecessary noise or redundant information and to emphasize the core feature representations critical for signal reconstruction.
In conclusion, this architecture organically combines these three principles. The skip connections secure the ‘structural path’ to prevent information loss, while the attention mechanism dynamically controls the ‘quality’ of the information that passes through that path. Because this entire process operates directly in the time domain, which preserves phase information, it ultimately enables the high-fidelity reconstruction of DIFAR signals.
3.2. Hierarchical Encoder–Decoder with Skip Connections
The proposed network is composed of a hierarchical encoder and decoder, which implement the design principles described in Section 3.1. Letting $\theta$ be the set of their trainable parameters, the encoder and decoder can be represented as the functions $E_\theta(\cdot)$ and $D_\theta(\cdot)$, respectively. Both components are constructed from fully connected layers and are interconnected via skip connections. This section details the specific architecture of each component and the experimental design used to explore effective skip connection configurations.
3.2.1. Model Architecture
The encoder, $E_\theta$, receives a time-domain input waveform $\mathbf{x} \in \mathbb{R}^{N}$ composed of $N$ samples. The output feature map $\mathbf{h}_l$ of the $l$-th encoder layer ($l = 1, \dots, L$) is sequentially computed by taking the output $\mathbf{h}_{l-1}$ of the previous layer (where $\mathbf{h}_0 = \mathbf{x}$) as input, as follows:

$$\mathbf{h}_l = \sigma_{\text{ReLU}}\!\left(\mathbf{W}_l \mathbf{h}_{l-1} + \mathbf{b}_l\right), \quad (1)$$

where $\mathbf{h}_l$ is the output feature vector of the $l$-th layer, $\mathbf{W}_l$ and $\mathbf{b}_l$ are the trainable weight matrix and bias vector of that layer, and $\sigma_{\text{ReLU}}(\cdot)$ is the ReLU activation function. The encoder generates the final latent vector $\mathbf{z} = \mathbf{h}_L \in \mathbb{R}^{d_z}$ (where $d_z$ is the latent dimension) and a set of intermediate feature maps $\{\mathbf{h}_l\}_{l=1}^{L-1}$ intended for skip connections.
In addition, the decoder, $D_\theta$, receives the latent vector $\mathbf{z}$ and a selected set of skip feature maps $\mathcal{S}$ to reconstruct the original waveform. The $l$-th decoder expansion layer fuses the output $\mathbf{g}_{l-1}$ of the previous layer (where $\mathbf{g}_0 = \mathbf{z}$) and the encoder feature $\mathbf{h}_{L-l}$ delivered via the skip connection. In this study, we adopted element-wise addition as the information fusion method, and the process is expressed as follows:

$$\mathbf{u}_l = \left(\mathbf{V}_l \mathbf{g}_{l-1} + \mathbf{c}_l\right) + \mathbf{h}_{L-l}, \quad (2)$$

where $\mathbf{V}_l$ and $\mathbf{c}_l$ are the trainable weight matrix and bias vector of the $l$-th decoder layer. The fused feature $\mathbf{u}_l$ undergoes subsequent processing (the ReLU activation) to become the final output $\mathbf{g}_l$ of the $l$-th decoder layer. After this process is repeated through all decoder layers, the final layer employs a Tanh activation function to constrain the dynamic range of the reconstructed waveform $\hat{\mathbf{x}}$ to $[-1, 1]$ and outputs it.
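To make the computation above concrete, the following is a minimal PyTorch sketch (not the authors' released implementation; the layer widths 3125 → 100 → 50 → 10 are assumptions borrowed from the AE-Baseline description in Section 4.2):

```python
import torch
import torch.nn as nn

class SkipAE(nn.Module):
    """Fully connected autoencoder with additive skip connections (sketch)."""
    def __init__(self, dims=(3125, 100, 50, 10)):
        super().__init__()
        self.enc = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1))
        rev = dims[::-1]
        self.dec = nn.ModuleList(
            nn.Linear(rev[i], rev[i + 1]) for i in range(len(rev) - 1))
        self.relu = nn.ReLU()

    def forward(self, x):
        feats, h = [], x
        for layer in self.enc:                  # encoder: h_l = ReLU(W_l h_{l-1} + b_l)
            h = self.relu(layer(h))
            feats.append(h)
        g = feats[-1]                           # latent vector z (last encoder output)
        for i, layer in enumerate(self.dec):
            g = layer(g)
            skip = len(self.enc) - 2 - i        # index of the symmetric encoder feature
            if skip >= 0:
                g = self.relu(g + feats[skip])  # element-wise additive fusion, Eq. (2)
            else:
                g = torch.tanh(g)               # final layer: constrain output to [-1, 1]
        return g, feats[-1]

model = SkipAE()
x_hat, z = model(torch.randn(4, 3125))          # batch of four 0.1-s segments
```

The fusion here adds the skip feature before the activation, matching the order described above; whether the authors place the activation before or after fusion is not specified, so this ordering is one plausible choice.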
3.2.2. Systematic Exploration of Skip Connection Configurations
The excessive transfer of low-level features can potentially impair generalization performance. Therefore, to explore the favorable trade-off point between reconstruction performance and the amount of transmitted information, we systematically designed a set of model variants with different combinations of skip connections. The index set $\mathcal{K}_k$ for a model ($M_k$) using $k$ skip connections is defined by sequentially adding layers, starting from the one closest to the latent space, as shown in:

$$\mathcal{K}_k = \{\,L-k,\ L-k+1,\ \dots,\ L-1\,\}, \quad (3)$$

where $k$ is the number of skip connections used ($0 \le k \le L-1$), and $k = 0$ corresponds to the baseline model with no skip connections, where $\mathcal{K}_0 = \emptyset$. The total amount of information transmitted by each model variant, denoted as the total latent dimension $D_{\text{total}}$, is calculated as the sum of the dimensions of the final latent vector $\mathbf{z}$ and the selected skip features, as shown in:

$$D_{\text{total}} = d_z + \sum_{l \in \mathcal{K}_k} \dim(\mathbf{h}_l). \quad (4)$$
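As a small illustration of the total-dimension calculation above, the helper below reproduces it in plain Python; the layer widths (100, 50) and latent dimension 10 are assumptions borrowed from the AE-Baseline description in Section 4.2:

```python
def total_latent_dim(skip_dims, latent_dim, k):
    """skip_dims: candidate skip-feature sizes ordered from the input side
    toward the bottleneck; k skips are selected starting nearest the latent."""
    selected = skip_dims[len(skip_dims) - k:] if k > 0 else []
    return latent_dim + sum(selected)

# AE-S 60:  latent (10) + nearest skip (50)            -> 60
# AE-S 160: latent (10) + two nearest skips (50 + 100) -> 160
print(total_latent_dim([100, 50], 10, 1))
print(total_latent_dim([100, 50], 10, 2))
```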
3.2.3. Gated Feature Fusion Mechanism
The element-wise addition method for information fusion, described in Equation (2) of Section 3.2.1, is simple and efficient. However, it is limited by the fact that it always blends the two information sources, the decoder features and the encoder skip features, at a fixed ratio. To model the complex characteristics of the signal with greater sophistication, a gated mechanism was designed as an additional model variant, as shown in Figure 4. This mechanism dynamically adjusts the mixing ratio between the two features.
In the gated fusion mechanism, given the up-sampled feature $\tilde{\mathbf{g}}_l$ from the $l$-th decoder expansion layer and the encoder feature $\mathbf{h}_{L-l}$ delivered via the skip connection, the fusion process is performed in the following two steps. First, the two feature vectors are concatenated to create $[\tilde{\mathbf{g}}_l;\ \mathbf{h}_{L-l}]$, which is then passed through a separate gate network to compute the 'gate vector' $\mathbf{g}_l^{\text{gate}}$. This gate network is composed of two fully connected layers and a sigmoid activation function, and is expressed as:

$$\mathbf{g}_l^{\text{gate}} = \sigma\!\left(\mathbf{W}_{2}\,\sigma_{\text{ReLU}}\!\left(\mathbf{W}_{1}\,[\tilde{\mathbf{g}}_l;\ \mathbf{h}_{L-l}]\right)\right), \quad (5)$$

where $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$ are the trainable weights of the gate network, and $\sigma(\cdot)$ is the sigmoid function. Each element of the gate vector $\mathbf{g}_l^{\text{gate}}$ has a value in the range [0, 1] and serves as a dynamic weight that determines how much of the skip feature $\mathbf{h}_{L-l}$ to 'pass' through. Second, the computed gate vector is used to perform a weighted sum of the two features. The finally fused feature $\mathbf{u}_l$ is calculated as shown in:

$$\mathbf{u}_l = \mathbf{g}_l^{\text{gate}} \odot \mathbf{h}_{L-l} + \left(\mathbf{1} - \mathbf{g}_l^{\text{gate}}\right) \odot \tilde{\mathbf{g}}_l, \quad (6)$$

where $\odot$ denotes element-wise multiplication. The performance impact of applying this gated fusion method is comparatively analyzed in Section 4 to verify the efficacy of this more sophisticated information fusion approach.
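One plausible PyTorch reading of the gated fusion step is sketched below (not the paper's exact code; the feature width of 50 is illustrative, while the reduction ratio of 2 inside the gate network follows Section 4.3.2):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gate network: two FC layers + sigmoid, producing a per-dimension gate."""
    def __init__(self, dim, reduction=2):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim // reduction),  # first FC acts on the concatenation
            nn.ReLU(),
            nn.Linear(dim // reduction, dim),      # second FC restores the feature width
            nn.Sigmoid(),                          # per-dimension gate in [0, 1]
        )

    def forward(self, g_dec, h_skip):
        gate = self.gate(torch.cat([g_dec, h_skip], dim=-1))
        # weighted sum: gate selects skip content, (1 - gate) keeps decoder content
        return gate * h_skip + (1.0 - gate) * g_dec

fuse = GatedFusion(dim=50)
u = fuse(torch.randn(4, 50), torch.randn(4, 50))
```

Note that when the decoder and skip features agree, the weighted sum returns them unchanged regardless of the gate; the gate only matters where the two sources disagree.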
3.3. Feature Refinement with Channel Attention
Although skip connections provide rich, multi-scale features from the encoder to mitigate the information bottleneck, not all transferred feature channels contribute equally to the final signal reconstruction. Some channels may contain critical signal components, whereas others might include information that is relatively less significant or redundant. To ameliorate this inefficiency and enhance the representational power of the proposed model, we selectively integrated a channel attention mechanism for adaptive feature recalibration into specific model variants to comparatively analyze its effect. The attention mechanism serves to emphasize the features contributing the most to signal reconstruction from the fused representations, while suppressing superfluous information.
For this purpose, we adopted the SE block [16], designed around the 'Squeeze' and 'Excitation' operations; the block diagram of the SE block is illustrated in Figure 5.
3.3.1. Squeeze: Global Information Embedding
The Squeeze operation aims to generate a low-dimensional descriptor that facilitates learning inter-channel relationships. This is achieved by summarizing the global information for each channel (dimension) of the $D$-dimensional input feature vector $\mathbf{u} \in \mathbb{R}^{D}$. In the fully connected layer environment addressed in this study, the input feature vector $\mathbf{u}$ itself can be regarded as implicitly containing this global information. Consequently, the Squeeze operation $F_{sq}(\cdot)$ is implemented by projecting $\mathbf{u}$ into a lower-dimensional space via the first fully connected (FC) layer. This process is expressed as:

$$\mathbf{s} = F_{sq}(\mathbf{u}) = \sigma_{\text{ReLU}}\!\left(\mathbf{W}_{sq}\,\mathbf{u}\right), \quad (7)$$

where $\mathbf{W}_{sq} \in \mathbb{R}^{(D/r) \times D}$ is the trainable weight matrix, $\sigma_{\text{ReLU}}(\cdot)$ is the ReLU activation function, and $r$ is the reduction ratio which controls the complexity of the model. Consequently, $\mathbf{s}$ becomes a $(D/r)$-dimensional descriptor vector, which is then used as the input for the Excitation stage.
3.3.2. Excitation: Adaptive Channel Gating
The Excitation stage utilizes the low-dimensional descriptor $\mathbf{s}$, generated during the Squeeze stage, to produce a gating vector that represents the relative importance of each channel. This process is implemented by restoring $\mathbf{s}$ to the original $D$ dimensions using a second FC layer, which is then passed through a sigmoid activation function. The sigmoid function serves to normalize the weight for each channel to the range [0, 1]. The Excitation operation $F_{ex}(\cdot)$ is expressed as:

$$\mathbf{a} = F_{ex}(\mathbf{s}) = \sigma\!\left(\mathbf{W}_{ex}\,\mathbf{s}\right), \quad (8)$$

where $\mathbf{s}$ is the output of the Squeeze stage, $\mathbf{W}_{ex} \in \mathbb{R}^{D \times (D/r)}$ is the weight matrix of the second FC layer, and $\sigma(\cdot)$ is the sigmoid function. The resulting gating vector $\mathbf{a} \in [0,1]^{D}$ indicates the importance of each channel.

The final attention-applied output vector $\tilde{\mathbf{u}}$ is obtained through channel-wise (element-wise) multiplication of the original input feature vector $\mathbf{u}$ and the calculated gating vector $\mathbf{a}$:

$$\tilde{\mathbf{u}} = \mathbf{a} \odot \mathbf{u}. \quad (9)$$
Through this process, the SE block performs dynamic feature recalibration, amplifying the influence of channels important for signal reconstruction and diminishing the influence of less important channels. The quantitative performance change resulting from the application of this attention mechanism is comparatively analyzed in Section 4.
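The two-stage recalibration described above can be sketched in PyTorch as follows (a minimal illustration, not the authors' code; the feature width of 160 is arbitrary, while the reduction ratio r = 5 follows Section 4.3.2):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation applied to a feature vector."""
    def __init__(self, dim, reduction=5):
        super().__init__()
        self.squeeze = nn.Linear(dim, dim // reduction)  # F_sq: D -> D/r
        self.excite = nn.Linear(dim // reduction, dim)   # F_ex: D/r -> D
        self.relu = nn.ReLU()

    def forward(self, u):
        s = self.relu(self.squeeze(u))       # low-dimensional descriptor
        a = torch.sigmoid(self.excite(s))    # per-channel gate in (0, 1)
        return a * u                         # channel-wise recalibration

se = SEBlock(dim=160)
u = torch.randn(4, 160)
out = se(u)
```

Because the sigmoid gate lies strictly between 0 and 1, the recalibrated output can only attenuate each channel, never amplify it; emphasis is therefore relative across channels.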
4. Experiments
This section details the experimental setup, quantitative results, and qualitative analysis of the experiments conducted to validate the performance of the proposed high-fidelity autoencoder. The experiments were designed to focus on: (1) a performance comparison between the proposed model and existing autoencoder-based models, (2) an analysis of performance variation according to the amount of transmitted information, and (3) an ablation study to verify the effects of the skip connection fusion method and the attention mechanism.
4.1. Simulation Setup
The dataset used for performance validation in this study was custom-generated using the MATLAB (Version R2024b) Phased Array System Toolbox [21] to precisely simulate the actual operational environment. The data generation parameters were established assuming an active sonar operational scenario, referencing the operational specifications of the actual source sonobuoy (based on AN/SSQ-565) and the receiving DIFAR sonobuoy (based on AN/SSQ-573) [22,23]. The key parameters used for dataset generation are summarized in Table 1.
Overall, a total of 90 h of training data were constructed using various combinations of signals and environmental parameters. The test dataset, used for all evaluations in Section 4.5, was separately constructed and consisted of 300 10-s WAV files (150 for CW and 150 for LFM). All signals were generated as 3-channel (Omni, NS, EW) waveforms with a sampling frequency of 31,250 Hz. For model training, these signals were segmented into 0.1-s (3125-sample) units. Each segment was normalized to the range [−1, 1] before being used as model input.
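The segmentation step can be sketched as follows (the per-segment peak normalization shown here is an assumption; the text states only that segments are normalized to [−1, 1]):

```python
import numpy as np

FS = 31250           # sampling frequency [Hz]
SEG = 3125           # 0.1-s segment length in samples

def segment_and_normalize(x):
    """x: (channels, samples) -> (n_segments, channels, SEG), each in [-1, 1]."""
    n_seg = x.shape[1] // SEG
    segs = x[:, :n_seg * SEG].reshape(x.shape[0], n_seg, SEG)
    segs = np.transpose(segs, (1, 0, 2)).astype(np.float64)
    peak = np.max(np.abs(segs), axis=(1, 2), keepdims=True)
    return segs / np.maximum(peak, 1e-12)    # per-segment peak normalization

x = np.random.randn(3, FS * 10)              # 10-s, 3-channel (Omni, NS, EW) signal
segs = segment_and_normalize(x)              # shape: (100, 3, 3125)
```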
4.2. Compared Models
To systematically analyze the performance impact of each component of the proposed architecture (skip connections, fusion methods, and attention), we designed and comparatively evaluated the following models. The key characteristics and design objectives of each model are summarized in Table 2.
AE-Baseline: This model is based on the sequential autoencoder structure from [6]. To ensure a fair comparison with the architectures proposed in this study (e.g., AE-S, AE-GF), it was redesigned to possess the identical backbone structure as these models, but with the skip connections removed. It serves as the baseline model to isolate and evaluate the pure effect of introducing skip connections.
AE-Baseline 60 (modified): To enable a fair comparison at an equal information budget, this variant of AE-Baseline has its bottleneck dimension modified to 60. Specifically, the Linear (100, 50) and Linear (50, 10) encoder layers and their symmetric decoder counterparts from the original AE-Baseline were removed. They were replaced with a Linear (100, 60) encoder layer and a Linear (60, 100) decoder layer to adjust the bottleneck dimension to 60. This model has the same total transmitted information as one of the proposed models, AE-S 60. It serves as a critical comparison point for separating the effect of merely increasing the latent space size from that of modifying the architecture with skip connections.
AE-RVQ-CM: This model, proposed in previous work [7], is an architecture focused on lightweighting. It utilizes a 10-bit integer latent vector and a Compensation Module.
AE-S (S for Skip): This is the core model proposed in this study, which features multiple skip connections and uses element-wise addition as the feature fusion method. The variants, such as AE-S 60 and AE-S 160, are differentiated by the total amount of information (dimensionality) transmitted via the skip connections, as designed in Section 3.2.2. For example, for the AE-S 60 variant ($k = 1$), $\mathcal{K}_1 = \{L-1\}$, and the total transmitted dimension is $D_{\text{total}} = 10 + 50 = 60$. For the AE-S 160 variant ($k = 2$), $\mathcal{K}_2 = \{L-2,\ L-1\}$, and the total dimension is $D_{\text{total}} = 10 + 50 + 100 = 160$. The detailed configurations of the model variants compared in this study are summarized in Table 3. This systematic experimental design enables a quantitative analysis of how the amount of information and the hierarchical depth of the features transmitted via skip connections affect reconstruction performance.
AE-S-SE (SE for Squeeze-and-Excitation): This model integrates a channel attention (SE Block) module into the AE-S architecture to verify the effect of refining the fused features.
AE-GF (GF for Gated Fusion): Instead of element-wise addition, this model applies the Gated Feature Fusion mechanism described in Section 3.2.3 to analyze the effect of dynamically adjusting the mixing ratio of the two features.
AE-GF-SE: This model integrates the channel attention module into the AE-GF’s gated fusion architecture to evaluate the synergistic effect of the two mechanisms.
4.3. Training and Implementation Details
This section describes the objective function, optimization method, and specific implementation environment and hyperparameters used for training all the models defined in Section 4.2. To ensure a fair comparison of performance differences arising from architectural variations, all models, with the exception of AE-RVQ-CM, were trained using an identical training procedure and data partitioning.
The general architecture described in Section 3.2.1 was implemented with the specific parameters detailed in Table 4. The notation $\mathbf{h}_l$ represents the feature map at the $l$-th layer of the encoder, while $\mathbf{g}_l$ denotes the feature map at the $l$-th layer of the decoder.
4.3.1. Objective Function
The main objective function for training all models is defined by directly measuring the similarity between the original waveform $\mathbf{x}$ and the reconstructed waveform $\hat{\mathbf{x}}$ in the time domain. We calculate the MSE for each sample in the training dataset $\mathcal{D}$ and average this over the entire dataset to use as the final loss function $\mathcal{L}_{\text{MSE}}$. The MSE loss function is defined as Equation (10):

$$\mathcal{L}_{\text{MSE}} = \frac{1}{|\mathcal{D}|}\sum_{\mathbf{x} \in \mathcal{D}} \frac{1}{N}\left\|\mathbf{x} - \hat{\mathbf{x}}\right\|_2^2, \quad (10)$$

where $\|\cdot\|_2^2$ signifies the square of the $\ell_2$ norm. This loss function directly minimizes the sample-wise error between the two waveforms in the time domain, thereby guiding the model to generate a high-fidelity waveform closely resembling the original signal.
4.3.2. Implementation and Hyperparameters
All experiments were implemented using the PyTorch (Version 2.6.0) deep learning framework on a high-performance workstation equipped with an Intel Core i9-14900KF CPU, 64 GB of RAM, and an NVIDIA GeForce RTX 4090 GPU. This computing environment was utilized for both model training and the precise measurement of inference time. To ensure reproducibility, the seed for all randomized processes was fixed to 42.
To optimize the model parameters $\theta$, we used the Adam optimizer. For most models (excluding AE-RVQ-CM), the Adam optimizer's hyperparameters were set with a fixed initial learning rate, and the momentum-related parameters $\beta_1$ and $\beta_2$ were set to 0.5 and 0.9, respectively. Training was conducted for a total of 600 epochs.
For the AE-RVQ-CM model [7], a separate learning rate scheduler was applied. The model was trained for an initial 300 epochs with the same learning rate to allow it to converge. Following this, fine-tuning was performed in the subsequent training phase by reducing the learning rate by a factor of 0.2. All other models used a fixed learning rate throughout the entire training process.

For the architectural hyperparameters, the reduction ratio in all Gated Fusion (GF) blocks was set to 2, and the reduction ratio in all Squeeze-and-Excitation (SE) blocks was set to 5.
4.4. Evaluation Metrics
To conduct a multifaceted and quantitative evaluation of the reconstruction performance of the proposed and comparison models, this study employed the following three standard evaluation metrics from the field of acoustic signal processing.
Spectral Mean Squared Error (Spec-MSE): Spec-MSE is a metric that calculates the mean squared error between the magnitude components of the spectrograms of the original signal $\mathbf{x}$ and the reconstructed signal $\hat{\mathbf{x}}$, following the Short-Time Fourier Transform (STFT). It is defined as Equation (11):

$$\text{Spec-MSE} = \frac{1}{FT}\sum_{f=1}^{F}\sum_{t=1}^{T}\left(\left|X(f,t)\right| - \left|\hat{X}(f,t)\right|\right)^2, \quad (11)$$

where $F$ and $T$ represent the number of frequency bins and time frames, respectively. Spec-MSE directly measures the similarity between the spectral structures of the two signals. Therefore, it is the most critical metric for evaluating the 'high-fidelity reconstruction' objective of this study. A lower value indicates a smaller spectral reconstruction error.
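A NumPy sketch of the Spec-MSE computation is given below (the Hann window, 512-point FFT, and 256-sample hop are illustrative choices, not parameters reported in the paper):

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=256):
    """Magnitude spectrogram of a 1-D signal, shape (F, T)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[t * hop:t * hop + n_fft] * win for t in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1)).T

def spec_mse(x, x_hat, **stft_kw):
    """Mean squared error between magnitude spectrograms, averaged over F x T."""
    X, X_hat = stft_mag(x, **stft_kw), stft_mag(x_hat, **stft_kw)
    return float(np.mean((X - X_hat) ** 2))

t = np.arange(31250) / 31250.0
x = np.sin(2 * np.pi * 1000 * t)      # 1 kHz CW-like tone, 1 s at 31,250 Hz
err = spec_mse(x, x)                  # 0.0 for a perfect reconstruction
```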
Signal-to-Noise Ratio (SNR) [dB]: SNR is a classical metric representing the signal-to-noise ratio. In this study, it is calculated by regarding the original signal $\mathbf{x}$ as the 'signal' and the difference between the original and reconstructed signal ($\mathbf{x} - \hat{\mathbf{x}}$) as the 'noise' (reconstruction error). The SNR in decibels (dB) is calculated as Equation (12):

$$\text{SNR} = 10\log_{10}\frac{\left\|\mathbf{x}\right\|_2^2}{\left\|\mathbf{x} - \hat{\mathbf{x}}\right\|_2^2}, \quad (12)$$

which evaluates the energy level of the reconstruction error relative to the total energy of the signal. A higher value indicates superior reconstruction performance.
Scale-Invariant Signal-to-Noise Ratio (SI-SNR) [dB]: SI-SNR is a signal quality metric designed to be independent of the absolute scale difference between the original signal and the reconstructed signal. It is computed by optimally projecting the reconstructed signal $\hat{\mathbf{x}}$ onto the original signal $\mathbf{x}$ to match its scale, and then calculating the signal-to-noise ratio. When $\mathbf{x}_{\text{target}}$ and $\mathbf{e}_{\text{noise}}$ are defined as $\mathbf{x}_{\text{target}} = \frac{\langle \hat{\mathbf{x}}, \mathbf{x} \rangle\, \mathbf{x}}{\|\mathbf{x}\|_2^2}$ and $\mathbf{e}_{\text{noise}} = \hat{\mathbf{x}} - \mathbf{x}_{\text{target}}$, the SI-SNR is calculated as Equation (13):

$$\text{SI-SNR} = 10\log_{10}\frac{\left\|\mathbf{x}_{\text{target}}\right\|_2^2}{\left\|\mathbf{e}_{\text{noise}}\right\|_2^2}. \quad (13)$$

SI-SNR reflects not only the simple energy difference but also the structural similarity of the waveforms more effectively. Therefore, it has been widely used for performance evaluation in the speech and audio processing fields in recent years [24]. A higher value signifies less signal distortion and superior quality.
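The two ratio metrics translate directly into NumPy; the small eps terms below are implementation details added here to guard against division by zero, not part of the definitions:

```python
import numpy as np

def snr_db(x, x_hat, eps=1e-12):
    """Energy of the original over energy of the reconstruction error."""
    return 10.0 * np.log10(np.sum(x ** 2) / (np.sum((x - x_hat) ** 2) + eps))

def si_snr_db(x, x_hat, eps=1e-12):
    """Project x_hat onto x first to remove any scale difference."""
    x_target = (np.dot(x_hat, x) / (np.dot(x, x) + eps)) * x
    e_noise = x_hat - x_target
    return 10.0 * np.log10(np.sum(x_target ** 2) / (np.sum(e_noise ** 2) + eps))

x = np.array([1.0, 0.0])
x_hat = np.array([1.0, 1.0])
# x_target = [1, 0] and e_noise = [0, 1], so SI-SNR = 10*log10(1/1) = 0 dB
si = si_snr_db(x, x_hat)
```

Because of the projection, rescaling the reconstruction (e.g., 0.5 x_hat) leaves SI-SNR unchanged, whereas plain SNR penalizes the gain error.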
Computational Efficiency Metric: To validate the feasibility of real-time processing for sonobuoy systems, we evaluated the model’s inference speed. Inference Time is defined as the average computational time required to process a single 0.1-s waveform sample. To ensure the reliability of the measurement, the inference time was recorded on a workstation equipped with an NVIDIA GeForce RTX 4090 GPU. The reported values represent the average of 2000 iterations following a GPU warm-up phase, thereby excluding initialization overheads.
4.5. Results and Discussion
This section presents a quantitative and qualitative analysis of the performance of the comparison models defined in Section 4.2, conducted under the experimental environment described in Section 4.1. First, the superiority of the proposed architecture is demonstrated through an overall performance comparison. Subsequently, an ablation study is conducted to identify the sources of the performance improvements of the proposed model by isolating and analyzing the effects of each structural component, specifically skip connections, fusion methods, and attention mechanisms. Furthermore, the trade-off relationship between the transmitted information volume and reconstruction performance is analyzed to explore an efficient model configuration. Finally, qualitative differences in quality are compared via spectrogram visualization.
4.5.1. Overall Performance Comparison
Table 5 summarizes the quantitative performance evaluation results for each model on the CW and LFM test datasets. A notable finding is that the AE-S 1660 model, which pairs the simple skip architecture (AE-S) with an extremely large volume of transmitted information, recorded the highest numerical performance across most metrics, achieving markedly low Spec-MSE values for both CW and LFM signals and demonstrating that very high reconstruction quality is attainable in the absence of information constraints. However, the AE-S 1660 model requires the transmission of a 1660-dimensional latent vector, rendering it impractical from a communication efficiency perspective and diverging from the objective of this study: achieving 'high fidelity' within a 'constrained communication environment'. Consequently, this research focuses on a comparative analysis of each architecture's performance within a practical range of information volumes, specifically 60 and 160 dimensions, which are comparable to, or only slightly larger than, that of the AE-Baseline.
When the scope of analysis is narrowed to the practical information volumes (60 and 160 dimensions), the AE-GF-SE 160 model, which integrates gated fusion and attention on top of multi-scale skip connections, exhibited the best performance. Specifically, this model recorded an SI-SNR of 20.35 dB for CW signals and 18.28 dB for LFM signals. These results carry several significant implications. First (comparing AE-S 60 and AE-Baseline 60), under the limited information capacity of 60 dimensions, the simple introduction of skip connections did not offer a distinct advantage over the sequential structure (LFM SI-SNR 14.43 dB vs. 14.77 dB). Second (comparing AE-Baseline 60 and AE-GF 60), even at an identical 60 dimensions, applying sophisticated structural improvements such as gated fusion (AE-GF 60) was more effective for enhancing performance (LFM SI-SNR 15.37 dB) than using simple skip connections (AE-S 60). Third, the results indicate that within a practical transmission range, the AE-GF-SE model, which combines gated fusion and attention, offers the most favorable trade-off between performance and efficiency when a sufficient information capacity (160 dimensions) is secured.
Furthermore, we analyzed the computational efficiency based on the measurement conditions described in
Section 4.3. As shown in the last column of
Table 5, the AE-Baseline 60 model achieved the fastest inference time of 0.2804 ms. This is slightly faster than the original AE-Baseline (0.2893 ms) because, as detailed in
Section 4.2, the model depth was reduced by removing specific encoding/decoding layers to adjust the bottleneck dimension.
The AE-RVQ-CM model [7] recorded a latency of 2.9961 ms. It is important to note that the model pruning technique used in the previous study was intentionally excluded in this experiment to allow for a direct comparison of structural complexity. Consequently, the computational costs inherent to the iterative quantization search and convolutional layers are fully reflected in the result. Meanwhile, the proposed AE-GF-SE 160 model showed an inference time of 0.6520 ms. Although the integration of the gated fusion and attention modules resulted in a slight increase in latency compared to the baseline, the processing time remains negligible compared to the 100 ms frame duration. This suggests that the proposed architecture is feasible for real-time processing on high-performance computing platforms.
4.5.2. Ablation Study on Architectural Components
In this section, an Ablation Study was conducted to individually analyze the performance impact of each component of the proposed architecture. The analysis focuses on three aspects: (1) the effect of introducing skip connections, (2) a comparison of feature fusion methods (additive vs. gated), and (3) the additional effect of the attention mechanism. All comparisons are based on the quantitative results presented in
Table 5.
(1) Effect of Skip Connections: To analyze the specific effect of introducing skip connections, we first compared the performance of AE-Baseline 60 (a sequential architecture) and AE-S 60 (a skip architecture). Both models possess the same 60-dimensional information capacity.
For CW signals, the performance slightly degraded: AE-S 60 (SI-SNR 18.15 dB) showed a 0.29 dB drop compared to AE-Baseline 60 (SI-SNR 18.44 dB), and its Spec-MSE was also slightly worse.
For the more complex LFM signals, AE-S 60 showed a clear performance degradation compared to AE-Baseline 60, with SI-SNR dropping from 14.77 dB to 14.43 dB and Spec-MSE worsening correspondingly.
This indicates that at a low 60-dimensional capacity, simple additive skip connections are insufficient, or even detrimental, for reconstructing complex signals. However, this trend was reversed in the AE-S 160 model, which increases the information capacity to 160 dimensions. For LFM signals, AE-S 160 achieved an SI-SNR of 17.02 dB, surpassing both the 60-dimensional baseline without skips (AE-Baseline 60, 14.77 dB) and the low-capacity skip model (AE-S 60, 14.43 dB). In conclusion, skip connections are effective only when accompanied by an information capacity (dimensionality) sufficient for the signal complexity.
(2) Comparison of Feature Fusion Methods: To compare the different methods for fusing encoder features and decoder features, we analyzed the performance of the AE-S series (using element-wise sum) and the AE-GF series (using gated fusion).
Using the 160-dimensional capacity as a baseline, AE-GF 160 improved the SI-SNR for LFM signals to 17.34 dB, compared to 17.02 dB for AE-S 160.
This trend was even more pronounced at 60 dimensions, where AE-GF 60 (15.37 dB) outperformed AE-S 60 (14.43 dB).
This suggests that the gated mechanism, which dynamically adjusts the mixing ratio of the two information sources, enables a more sophisticated and effective feature fusion than simple element-wise addition, contributing to performance enhancement across all tested capacities.
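The difference between the two fusion schemes can be illustrated with a minimal sketch. The exact gate parameterization of AE-GF is not reproduced here: `W_g` and `b_g` are hypothetical stand-ins for learned parameters, and the gate form (a sigmoid over the concatenated features) is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def additive_fusion(enc_feat, dec_feat):
    # AE-S style: fixed element-wise sum of encoder and decoder features.
    return enc_feat + dec_feat

def gated_fusion(enc_feat, dec_feat, W_g, b_g):
    # AE-GF style (sketched): a learned sigmoid gate chooses, per element,
    # how much of each source to pass. W_g (C, 2C) and b_g (C,) are
    # hypothetical stand-ins for trained parameters.
    g = sigmoid(W_g @ np.concatenate([enc_feat, dec_feat]) + b_g)
    return g * enc_feat + (1.0 - g) * dec_feat

rng = np.random.default_rng(1)
enc, dec = rng.standard_normal(4), rng.standard_normal(4)
```

A saturated gate recovers either input unchanged, while intermediate gate values interpolate between the two sources; additive fusion has no such degree of freedom.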
(3) Effect of the Attention Mechanism: To evaluate the additional contribution of the attention mechanism, we compared models without attention (AE-GF) against models incorporating the SE block (AE-GF-SE).
Using the 160-dimensional models as an example, AE-GF-SE 160 (with attention) improved the SI-SNR for LFM signals from 17.34 dB (AE-GF 160) to 18.28 dB. It also reduced the Spec-MSE by approximately 16.6%.
However, this trend was not observed at the 60-dimensional capacity. For LFM signals, the AE-GF-SE 60 (15.12 dB) showed a slight performance degradation compared to the AE-GF 60 (15.37 dB).
This indicates that the channel attention’s ‘feature refinement’ role is most effective when operating on a sufficiently rich set of features (160 dimensions). When the information capacity is already highly constrained (60 dimensions), the attention mechanism may not provide an additional benefit.
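For reference, a minimal NumPy sketch of the SE-style channel attention discussed above is given below; `W1` and `W2` stand in for the block's trained fully connected layers, and the reduction ratio implied by their shapes is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(feat, W1, W2):
    """Squeeze-and-Excitation channel attention on a (C, T) feature map.

    W1 (C/r, C) and W2 (C, C/r) stand in for the block's trained fully
    connected layers; their shapes imply the reduction ratio r.
    """
    z = feat.mean(axis=1)                       # squeeze: global average pool -> (C,)
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))   # excitation: bottleneck MLP + gate
    return feat * s[:, None]                    # scale: reweight each channel

rng = np.random.default_rng(2)
feat = rng.standard_normal((8, 5))              # 8 channels, 5 time steps
W1 = 0.1 * rng.standard_normal((2, 8))
W2 = 0.1 * rng.standard_normal((8, 2))
refined = se_block(feat, W1, W2)
```

The block leaves the feature-map shape unchanged and only rescales channels, which is why its benefit depends on how informative the channels are to begin with.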
In summary, this Ablation Study confirmed that gated fusion is a robust improvement over additive fusion. It also revealed that the effectiveness of skip connections and channel attention is highly dependent on the model’s information capacity, acting complementarily with sufficient dimensionality rather than as independent, universal enhancements.
4.5.3. Analysis of Amount of Transmitted Information-Performance Trade-Off
Figure 6 visually illustrates the change in reconstruction performance (Spec-MSE) as a function of the total transmitted dimension. Observing the performance trend of the AE-S series models, the reconstruction error decreases as the transmitted information is increased from 10 to 160 dimensions. This implies that the additional useful multi-scale information delivered via skip connections improves the reconstruction quality. A notable finding, however, is the "U-shaped" pattern that emerges: when the information volume is increased from 160 to 660 dimensions, performance does not improve but rather temporarily degrades, and it then substantially improves again when the information volume is further increased to 1660 dimensions. This phenomenon can be explained by the relationship between the informational value and the redundancy of the transmitted features. Up to the 160-dimension configuration, the new information contributes positively to performance. However, the additional encoder feature introduced in the 660-dimension configuration may have high informational redundancy with the deeper features or contain unrefined noise components, adversely affecting the decoder's reconstruction process. Conversely, when a large volume of low-level features (features from the early encoder layers) is provided, as in the 1660-dimension case, the decoder relies on this abundant low-level information to reconstruct the signal with high performance, rather than interpreting the compressed high-level information. However, models such as AE-S 1660 are impractical, as they incur high communication costs. Therefore, when considering practical transmission efficiency, the favorable trade-off point between performance and cost is found around the 160-dimension mark. This point represents an efficient architectural configuration that achieves a high level of reconstruction quality at a reasonable communication cost.
4.5.4. Qualitative Analysis
To complement the quantitative evaluation, a comparative analysis of spectrograms and frequency spectra was performed for representative CW and LFM test signals. For clearer visualization, the frequency axis in
Figure 7 and
Figure 8 is limited to the range of 0–3 kHz, focusing on the operational bandwidth of DIFAR sonobuoys (5–2400 Hz).
Figure 7 compares the spectrograms of the original signals and the signals reconstructed by each model. In the reconstruction results from the baseline model (AE-Baseline) and the previous research model (AE-RVQ-CM), shown in (b), (g), (c), and (h), faint vertical line artifacts are periodically observed. This is interpreted as a boundary discontinuity problem, occurring because the 0.1-s (3125-sample) input segments are not connected perfectly smoothly. Conversely, in the proposed models applying multi-scale skip connections ((d), (i), (e), (j)), these boundary artifacts are reduced. This suggests that the skip connections better preserve the local temporal information of the signal, contributing to the continuity between segments. Particularly when comparing the LFM signal reconstructions, these artifacts are even further suppressed in AE-GF-SE 160 (j) than in AE-GF-SE 60 (i), indicating that a larger volume of information positively impacts temporal precision.
Figure 8a and
Figure 9a analyze the spectral reconstruction of the CW signal. As observed in
Figure 8a, all comparison models successfully reconstruct the target frequency band, showing a clear peak at the correct frequency. However, a closer inspection provided in
Figure 9a reveals that the proposed AE-GF-SE models demonstrate a slightly closer resemblance to the Clean signal around the target frequency peak. While the baseline models exhibit minor discrepancies in peak shape, the proposed model follows the original signal’s spectral curve more precisely.
This trend is also evident in the LFM signal analysis shown in
Figure 8b and
Figure 9b. Similarly to the CW case, the comparison models (AE-Baseline, AE-RVQ-CM) reconstruct the target signal component relatively well. However, as shown in the magnified view of
Figure 9b, the proposed AE-GF-SE models demonstrate a higher similarity to the Clean signal within the target region. Furthermore, in the non-target regions (outside the main signal band), the proposed models follow the spectral characteristics of the original signal more accurately than the comparison models, maintaining a comparable noise level to the Clean signal.
In conclusion, this qualitative analysis visually demonstrates that the proposed skip-connection and attention-based architecture preserves not only simple waveform similarity but also the temporal continuity and the precise shape of the spectral structure with high fidelity, particularly within the region of interest. This serves as evidence supporting the quantitative evaluation results.
5. Conclusions
This study proposed a novel AE architecture for the high-fidelity compression and reconstruction of DIFAR sonobuoy acoustic signals. It originated from the critical observation that autoencoder-based approaches prioritizing high compression ratios often suffer from quality limitations due to the information bottleneck. In response, this study introduced a design that prioritizes reconstruction fidelity. The proposed network is based on an encoder–decoder structure that directly processes time-domain waveforms. It integrates multi-scale skip connections and channel attention mechanisms as key elements to minimize information loss.
To validate the efficacy of the proposed architecture, we constructed an extensive dataset that simulates a realistic underwater acoustic environment based on the specifications of operational sonobuoys. Through systematic comparative experiments and an Ablation Study, we demonstrated that the proposed model outperforms conventional deep autoencoders and lightweight models across all quantitative and qualitative metrics. Notably, by analyzing the trade-off between transmitted information volume and reconstruction performance, we identified a practically favorable trade-off point at approximately 160 dimensions. We also revealed that the gated fusion and attention mechanisms act complementarily, improving the efficiency of information fusion and the precision of feature representation, respectively. The model at this trade-off point, AE-GF-SE 160, achieved a significant reduction in spectral mean squared error (Spec-MSE) compared to the baseline model (AE-Baseline), decreasing it by approximately 72.58% for CW signals and 90.36% for LFM signals. These results clearly show that sophisticated fusion mechanisms (e.g., gated fusion) and capacity-aware attention are effective structures for preserving information in the compression-reconstruction of time-domain acoustic signals. The proposed high-fidelity reconstruction model has the potential to improve the accuracy of Doppler shift and micro-feature analysis for underwater targets, consequently contributing to the enhanced reliability of ASW operations. In future research, we plan to validate the proposed technique using real-world measurement data to further ensure its practical robustness.