A Robust Cross-Band Network for Blind Source Separation of Underwater Acoustic Mixed Signals
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The authors present an RNN-based approach to the Blind Source Separation problem, applied to underwater acoustic mixed signals.
- The document
The problem and the state of the art are well presented. Furthermore, the solution is very well explained throughout the article. In this sense, the authors must be congratulated for their work.
- Metrics, performance and understanding the results
In all research, the results are what support the work. The authors indicate that their proposal improves the SDR metric by 0.779 dB. As an engineer and reader, a 0.779 dB difference in a measurement means little; it is not a significant value. In logarithmic terms, 3 dB corresponds to a factor of two.
A reader can conclude that the proposed method is quite similar, in terms of SDR, to the best state-of-the-art approach. Now consider the MSE: the authors report that their proposal reaches an MSE of 0.002, compared with 0.006 for other approaches.
The question is, are those metrics the valid ones to determine the usefulness of the proposal? From this reader's perspective, the answer is no.
Let us look at Figures 8–10. If subfigure (a) is the target to extract from subfigure (c), and the extracted signal is presented in (d), can you explain why the extracted signal has an MSE of 0.002 when, zooming in on the subfigure, the separated signals do not have the same shape?
A reason could be that an ultra-low MSE can be achieved when the number of samples (N) is increased and the algorithm performs very well in highly predictable regions (slopes) while making gross errors in less predictable places (slope changes, peaks, and valleys).
One way to put this modest improvement of 0.779 dB in SDR, at an MSE of 0.007, into context is to report the computational effort required to obtain those values. If you can demonstrate that you require fewer operations than previous approaches, then congratulations: this is a breakthrough for science. If not, the computational effort needed to achieve this minimal improvement must be properly justified.
Minor
The images are very tiny. Figures 8–9 must be enlarged rather than presented in their current form.
In my opinion, the numbers on the horizontal axes, given in seconds, are wrong. There exists a sub-unit called the millisecond (ms, 10^-3 seconds).
Author Response
Comment1:
The authors indicate that their proposal improves the SDR metric by 0.779 dB. As an engineer and reader, a 0.779 dB difference in a measurement means little; it is not a significant value. In logarithmic terms, 3 dB corresponds to a factor of two. A reader can conclude that the proposed method is quite similar, in terms of SDR, to the best state-of-the-art approach. Now consider the MSE: the authors report that their proposal reaches an MSE of 0.002, compared with 0.006 for other approaches. The question is, are those metrics the valid ones to determine the usefulness of the proposal? From this reader's perspective, the answer is no.
Response1:
Thank you for the valuable suggestion. Within the field of Blind Source Separation (BSS), our methodology aligns with several established state-of-the-art approaches, utilizing SDR and MSE for performance evaluation. These metrics are widely adopted because SDR reflects perceptual signal quality by penalizing distortion and interference, while MSE captures precise waveform reconstruction errors. This combination ensures both perceptual relevance and numerical precision in evaluating separation performance. As Vincent et al. explicitly state (IEEE TASLP 2006) [1], an SDR gain exceeding 0.5 dB holds significant practical value in underwater acoustic scenarios, as it directly reduces the target misdetection rate (refer to Section 1 of the original paper's introduction). The improvement of 0.779 dB achieved in this work significantly surpasses typical gains reported in comparable studies (e.g., Luo et al. [2] achieved only 0.6 dB in 2019).
Based on this suggestion, we have added further elaboration on why we chose SDR and MSE as evaluation metrics in Section 4.3, Line 317-322 of Page 10. We hope the current version of the manuscript explains this clearly!
[1] Vincent, E.; Gribonval, R.; Févotte, C. "Performance measurement in blind audio source separation." IEEE Transactions on Audio, Speech, and Language Processing (2006), 14(4), 1462–1469.
[2] Luo, Y.; Mesgarani, N. "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation." IEEE/ACM Transactions on Audio, Speech, and Language Processing (2019), 27(8), 1256–1266.
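For concreteness, the relationship between the two metrics discussed above can be sketched numerically. The snippet below is a minimal illustration in plain Python with synthetic signals; the simplified SDR here treats the entire residual as distortion, rather than applying the full BSS Eval decomposition of Vincent et al. [1], and all names and values are illustrative only:

```python
import math

def mse(target, estimate):
    """Mean squared error between target and estimated waveforms."""
    n = len(target)
    return sum((t - e) ** 2 for t, e in zip(target, estimate)) / n

def sdr_db(target, estimate):
    """Simplified SDR: ratio of target energy to residual energy, in dB."""
    signal_energy = sum(t ** 2 for t in target)
    error_energy = sum((t - e) ** 2 for t, e in zip(target, estimate))
    return 10.0 * math.log10(signal_energy / error_energy)

# Synthetic example: a sine target and a slightly offset estimate.
target = [math.sin(0.01 * k) for k in range(1000)]
estimate = [s + 0.02 for s in target]  # constant offset as "distortion"
print(f"MSE = {mse(target, estimate):.4f}, SDR = {sdr_db(target, estimate):.2f} dB")
```

The sketch shows why the two metrics are complementary: MSE measures absolute pointwise error, while SDR normalizes that error by the target's energy on a logarithmic scale.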
Comment2:
Let us look at Figures 8–10. If subfigure (a) is the target to extract from subfigure (c), and the extracted signal is presented in (d), can you explain why the extracted signal has an MSE of 0.002 when, zooming in on the subfigure, the separated signals do not have the same shape? A reason could be that an ultra-low MSE can be achieved when the number of samples (N) is increased and the algorithm performs very well in highly predictable regions (slopes) while making gross errors in less predictable places (slope changes, peaks, and valleys).
One way to put this modest improvement of 0.779 dB in SDR, at an MSE of 0.007, into context is to report the computational effort required to obtain those values. If you can demonstrate that you require fewer operations than previous approaches, then congratulations: this is a breakthrough for science. If not, the computational effort needed to achieve this minimal improvement must be properly justified.
Response2:
Thank you for the valuable suggestion. In fact, a low global MSE does not necessarily guarantee perfect reconstruction of all waveform nuances, particularly transient details. This is a conscious design choice in our method: we prioritize the preservation of harmonic integrity over transient precision, especially in noise-heavy segments. For instance, as depicted in Figure 8, in frequency regions dominated by broadband motorboat noise (>2.5 kHz), our model deliberately suppresses non-stationary interference to retain the harmonic structure of the target signal.
While MSE captures overall reconstruction error, we accept some transient-level distortion to better preserve perceptually critical spectral components. More importantly, RCBNet consistently achieves superior SDR compared to advanced methods. As shown in Table 3, under the same 0 dB SNR condition, RCBNet reaches 4.366 dB SDR, which is significantly higher than the 2.779 dB achieved by IAM even in cleaner settings, highlighting RCBNet's robustness in low-SNR environments.
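The general point raised by the reviewer can be illustrated numerically (with purely synthetic values, not our experimental data): a waveform that is matched almost everywhere but misses a short transient can still yield a very low global MSE, because a few large pointwise errors are averaged over many samples:

```python
# Synthetic illustration: a low global MSE can hide a large error at one transient.
n = 10_000
target = [0.0] * n
target[5000] = 1.0          # a single sharp peak (the "transient")
estimate = list(target)
estimate[5000] = 0.2        # the peak is badly under-reconstructed

mse = sum((t - e) ** 2 for t, e in zip(target, estimate)) / n
peak_error = abs(target[5000] - estimate[5000])
print(f"global MSE = {mse:.6f}, but peak error = {peak_error:.1f}")
```

Here the global MSE is only 0.000064 even though 80% of the transient's amplitude is lost, which is why we report SDR alongside MSE.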
We briefly explain this issue in Section 5.1, Line 443-448 of Page 15. Additionally, we recognize the importance of enhancing transient reconstruction in regions with lower predictability, and we have included a discussion of this point in Section 6, Line 526-527 of Page 18 to indicate our intention to address it more thoroughly in subsequent studies.
We hope this approach holds promise for addressing the unresolved challenges in underwater blind source separation!
Comment3:
Images are very tiny. The figures 8-9 must be increased and not presented in current way. In my opinion, the numbers in horizontal axes in seconds are wrong. There exist a sub-unit called ms (10^-3 seconds).
Response3:
Thank you very much for these suggestions. In response, we have revised Figure 8 on Page 14, Figure 9 on Page 16, and Figure 10 on Page 17 accordingly. The original subfigures (a, b, c) have been removed, and the new figures explicitly highlight the reconstruction performance of the model under different noise conditions and frequency banding strategies. Furthermore, the time-axis units across all figures have been uniformly converted to milliseconds.
We hope that these revisions have significantly improved the clarity and focus of our visualizations.
Reviewer 2 Report
Comments and Suggestions for Authors
Summary
This paper presents RCBNet, a deep learning architecture designed for Blind Source Separation (BSS) of Underwater Acoustic (UWA) mixed signals. The authors target two key limitations of current approaches: poor feature discrimination and inadequate robustness to non-stationary noise. RCBNet addresses these issues through a novel combination of a non-uniform harmonic-aligned band-splitting strategy, a parallel gating mechanism for intra-band temporal modeling, and a bidirectional frequency RNN for inter-band harmonic dependency learning. Experiments conducted on the ShipsEar dataset demonstrate that RCBNet achieves a performance gain of +0.779 dB in SDR compared to state-of-the-art models and exhibits strong robustness in noisy underwater environments. The model is made publicly available, ensuring reproducibility.
Strengths
- The integration of frequency-domain RNNs with harmonic-aligned band decomposition represents an original and domain-aware architectural contribution.
- The study features a well-structured experimental design, including ablation studies, SNR-based robustness evaluations, and comparisons with several baseline architectures.
- RCBNet maintains strong performance even at 0 dB SNR, significantly outperforming prior models in low-SNR conditions, highlighting its real-world viability.
- The release of source code and data supports transparency and fosters ongoing research in UWA signal processing.
- The manuscript is well-organized, with intuitive diagrams and technical clarity in the description of model components and training methodologies.
Weaknesses
- The decision to use frequency-domain RNNs for inter-band modeling is not contrasted with alternative architectures, such as frequency attention, graph-based models, or 2D CNNs.
- The role of the parallel gating module is only analyzed as part of the intra-band modeling block; its standalone effect remains unclear.
- The simulation of UW propagation (e.g., multipath delay and attenuation modeling) lacks comparative evidence or ablation to validate its necessity.
- The paper does not report on inference latency, model size, or real-time feasibility, limiting insight into deployment in embedded or real-time systems.
- The proposed method seems to be static and optimized offline, lacking adaptive or predictive mechanisms to adjust to real-time variations in underwater environments. This limits its scalability and robustness in dynamic field conditions.
Suggested Corrections
- A comparative analysis of the chosen Bidirectional Frequency RNN with other inter-band modeling strategies—such as attention mechanisms, GNNs, or frequency-convolutional layers—would strengthen the justification of architectural choices.
- Adding a specific ablation experiment that removes only the gating component (while keeping the rest of the intra-band block) would clarify its unique contribution to robustness.
- An experiment comparing models trained with and without the described physical augmentation strategy (e.g., propagation modeling) would help validate its utility and effect on final performance.
- Including measurements such as inference time per audio clip, GPU memory usage, and model footprint would help assess the model’s practical deployment viability in real-time or low-power devices.
- The authors should expand the discussion of real-world deployment challenges by incorporating predictive and adaptive optimization methods that have proven effective in underwater acoustic (UWA) networks. While RCBNet achieves strong offline performance, its current architecture is static and lacks mechanisms for online adaptation or learning under changing noise conditions. Recent research has shown that approaches such as Reinforcement Learning (RL), Deep RL (DRL), and Multi-Armed Bandits (MAB) can dynamically optimize resource usage, signal quality, and energy efficiency in UWA and IoUT systems. These techniques can complement BSS systems like RCBNet by enhancing their robustness and scalability in real-time or embedded scenarios. I strongly recommend the authors include and discuss the following works to provide a broader context on adaptive performance optimization and resource-aware decision-making in underwater environments:
- Busacca, F., et al. "Adaptive versus predictive techniques in underwater acoustic communication networks." Computer Networks 252 (2024): 110679.
- Zhang, Y., et al. (2021). Underwater Acoustic Adaptive Modulation with Reinforcement Learning and Channel Prediction. Proc. of the 15th ACM WUWNet.
- Su, W., et al. (2019). Reinforcement learning-based adaptive modulation and coding for efficient underwater communications. IEEE Access, 7, 67539–67550.
Author Response
Comment1:
The decision to use frequency-domain RNNs for inter-band modeling is not contrasted with alternative architectures, such as frequency attention, graph-based models, or 2D CNNs. A comparative analysis of the chosen Bidirectional Frequency RNN with other inter-band modeling strategies—such as attention mechanisms, GNNs, or frequency-convolutional layers—would strengthen the justification of architectural choices.
Response1:
Thank you for the suggestion. In response, we have extended the comparative experiments and analysis (in Section 5.2, Line 449-467 of Page 15) to better justify our architectural design choices. Specifically, the proposed Bidirectional Frequency Recurrent Neural Network is now directly compared against alternative components, including attention mechanisms, graph neural networks, and frequency convolutional layers. The corresponding results and analysis demonstrate the effectiveness of our architectural choice.
We believe these enhancements provide a more comprehensive justification and hope that the current revision addresses your concern effectively.
Comment2:
The role of the parallel gating module is only analyzed as part of the intra-band modeling block; its standalone effect remains unclear. Adding a specific ablation experiment that removes only the gating component (while keeping the rest of the intra-band block) would clarify its unique contribution to robustness.
Response2:
Thank you for the insightful suggestion. To address this point, we have conducted an additional ablation study of the gating mechanism. The results of this experiment are now presented and analyzed (in Section 5.4, Line 500-504 and Line 512-514 of Page 17). We hope this expanded evaluation meets your expectations and clarifies the effectiveness of our design choice.
Comment3:
The simulation of UW propagation (e.g., multipath delay and attenuation modeling) lacks comparative evidence or ablation to validate its necessity. An experiment comparing models trained with and without the described physical augmentation strategy (e.g., propagation modeling) would help validate its utility and effect on final performance.
Response3:
Thank you for the insightful suggestion. We understand your concern regarding the data preprocessing step. However, our primary purpose is not to artificially "enhance" performance, but rather to accurately emulate the multipath propagation effects encountered in real-world environments, thereby allowing a realistic assessment of model performance.
We have conducted experiments without this physical simulation, and indeed, separation performance was higher under such idealized conditions. However, we believe those results are less meaningful for practical deployment scenarios. Therefore, all results presented in this paper are based on physically simulated scenarios, yielding findings that better reflect real-world applicability.
Based on your suggestions, we have elaborated on this issue (in Section 4.1, Line 290-295 of Page 9). Thank you again for providing this perspective; we hope the current manuscript presents the issue more clearly!
Comment4:
The paper does not report on inference latency, model size, or real-time feasibility, limiting insight into deployment in embedded or real-time systems. Including measurements such as inference time per audio clip, GPU memory usage, and model footprint would help assess the model’s practical deployment viability in real-time or low-power devices.
Response4:
Thank you for your suggestion. Accordingly, we have augmented the implementation details (in Section 4.4, Line 335-337 of Page 11) to include key experimental parameters: inference time per audio snippet, GPU memory consumption during inference, and the model's storage footprint. We hope the current version provides more comprehensive experimental documentation.
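As a generic illustration of how per-clip inference latency can be measured (a plain-Python sketch; `separate` is a hypothetical stand-in for the actual model call, and the real measurements reported in the manuscript were taken on our GPU setup):

```python
import time

def separate(mixture):
    """Hypothetical stand-in for model inference on one audio clip."""
    return [0.5 * x for x in mixture]  # dummy computation

clip = [0.0] * 16_000  # e.g. 1 s of audio at a 16 kHz sampling rate

# Warm up once, then average over repeated runs for a stable estimate.
separate(clip)
runs = 10
start = time.perf_counter()
for _ in range(runs):
    separate(clip)
elapsed_ms = (time.perf_counter() - start) * 1000.0 / runs
print(f"mean inference time: {elapsed_ms:.3f} ms per clip")
```

Averaging over several runs after a warm-up pass reduces the influence of one-off startup costs such as memory allocation or caching.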
Comment5:
The proposed method seems to be static and optimized offline, lacking adaptive or predictive mechanisms to adjust to real-time variations in underwater environments. This limits its scalability and robustness in dynamic field conditions. The authors should expand the discussion of real-world deployment challenges by incorporating predictive and adaptive optimization methods that have proven effective in underwater acoustic (UWA) networks. While RCBNet achieves strong offline performance, its current architecture is static and lacks mechanisms for online adaptation or learning under changing noise conditions. Recent research has shown that approaches such as Reinforcement Learning (RL), Deep RL (DRL), and Multi-Armed Bandits (MAB) can dynamically optimize resource usage, signal quality, and energy efficiency in UWA and IoUT systems. These techniques can complement BSS systems like RCBNet by enhancing their robustness and scalability in real-time or embedded scenarios. I strongly recommend the authors include and discuss the following works to provide a broader context on adaptive performance optimization and resource-aware decision-making in underwater environments:
- Busacca, F., et al. "Adaptive versus predictive techniques in underwater acoustic communication networks." Computer Networks 252 (2024): 110679.
- Zhang, Y., et al. (2021). Underwater Acoustic Adaptive Modulation with Reinforcement Learning and Channel Prediction. Proc. of the 15th ACM WUWNet.
- Su, W., et al. (2019). Reinforcement learning-based adaptive modulation and coding for efficient underwater communications. IEEE Access, 7, 67539–67550.
Response5:
Thank you for the valuable suggestion. We acknowledge that this work focuses on static datasets, and that enhancing the model's adaptability to dynamic noise environments is important. Because our work mainly focuses on the blind source separation task, using simulated datasets to train RCBNet and improve its separation ability, underwater communication is not the primary focus of this research. However, it will be a key direction for future work, including exploring the integration of reinforcement learning techniques into the RCBNet framework.
We briefly explain this issue in Section 6, Line 527-528 of Page 18. We hope the current manuscript presents this more clearly!
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Congratulations, dear authors, on the work done.
From my point of view, the modifications made satisfy my concerns, and therefore, I consider the article ready for publication as it currently stands.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have addressed all of my concerns. The paper can now be accepted for publication.