Signal Detection Method for OTFS System Based on Feature Fusion and CNN
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper deals with a signal detection method for an OTFS system based on feature fusion and CNN. The paper is overall well written and structured. Its research question is answered throughout the paper. I consider this topic to be relevant and important today. Figures and tables are clear and visible. The use of neural networks is relevant as well. I do not detect any major weaknesses in the paper. I can suggest several modifications:
- Enhance more the novelty of the paper and how it addresses research gap.
- Consider putting paper outline at the end of Introduction.
- Consider improving resolution of Figure 2.
- Revise the whole paper for spelling and grammar mistakes.
- Revise if all abbreviations are correct.
- Elaborate more on why you used the hard-thresholding operator, and why not, for example, the soft-thresholding operator.
- Emphasize the parameters of the wavelet, such as the mother wavelet used and the reasoning behind why you chose it.
- Did you perform hyperparameter search in order to determine the best CNN?
Author Response
Reviewer 1:
Comment 1: Enhance more the novelty of the paper and how it addresses research gap.
Response:
Thank you for your valuable suggestions. We have comprehensively revised the introduction and discussion sections of the manuscript to more clearly articulate the novelty of this work and how it addresses research gaps. The enhancements are specifically reflected in the following three aspects:
1. Novelty in Methodology: Proposing a "Feature Preprocessing-Prior Enhancement-Deep Learning" Hybrid Paradigm:
This paper pioneers the integration of wavelet multi-scale analysis, message passing sparse enhancement, and convolutional neural network feature learning, constructing a novel three-stage signal detection architecture. This is not merely a simple algorithmic combination but a deep complementarity between model-driven and data-driven methods:
(1) The wavelet decomposition layer serves as an adaptive preprocessing unit, utilizing the sym4 wavelet basis for 3-level decomposition to effectively separate low-frequency effective components from high-frequency noise in the DD-domain signal, addressing the insufficient feature extraction of weak path signals in purely data-driven methods.
(2) The MP enhancement layer acts as a model-driven prior knowledge injector, generating auxiliary features with sparse channel characteristics by iteratively updating the symbol posterior probability via a factor graph model.
(3) The feature fusion and CNN detection layer concatenates the original signal, wavelet-denoised signal, and MP-enhanced signal into a 6-channel high-dimensional tensor, providing the CNN with a feature space far richer than single-signal sources, achieving a paradigm shift from "end-to-end learning" to "feature-guided learning."
2. Targeted Addressing of Existing Research Gaps
This study directly addresses three core challenges in current OTFS detection and provides solutions through synergistic design:
(1) To tackle the issues of low iterative efficiency and susceptibility to noise error accumulation in traditional MP algorithms, MP-WCNN achieves performance superior to 10 MP iterations through a single forward pass (providing a 9 dB SNR gain at SNR=16 dB), fundamentally avoiding iterative error propagation.
(2) To address the problems of pure CNN models relying on large amounts of data, insufficient feature extraction, and overfitting, our feature fusion strategy significantly enhances feature quality by incorporating physical priors. MP-WCNN remains stable and convergent even with a 30% reduction in training samples, with a notably lower risk of overfitting.
(3) To overcome the limitation of existing hybrid methods in fully exploiting signal multi-scale characteristics, our approach explicitly extracts and preserves signal structures at different resolutions through wavelet decomposition, reducing the BER of MP-WCNN by nearly 20% compared to the MPCNN benchmark in low SNR scenarios.
3. Practicality and Complexity Optimization
This research does not trade efficiency for performance. By leveraging the preprocessing and feature enhancement of wavelet and MP, the dependence on CNN model complexity is reduced. The final network parameter count is reduced by 40% compared to similar CNN models, while the number of iterations in the MP module is decreased from 10 to 3, resulting in a 62% reduction in overall processing latency. This demonstrates the framework's potential for application in real-time high-mobility communication scenarios.
We have systematically integrated the above core innovations, comparative analyses with existing works, and key performance advantages into the introduction, discussion, and conclusion sections of the revised manuscript. We once again thank the reviewers for their insightful comments and will refine this part in the revised manuscript to make the paper more rigorous and persuasive.
Comment 2: Consider putting paper outline at the end of Introduction.
Response:
Thank you for your valuable suggestions on this paper. Regarding the comment 'Consider placing the paper outline at the end of the introduction,' and to help readers quickly grasp the framework of the paper, we have added the paper structure after the overview of research methods in the fourth paragraph of the introduction. The arrangement is as follows: Section 2 introduces the OTFS system model, including the principles of signal modulation and demodulation, the channel characteristics in the Delay-Doppler domain, and the mathematical basis of wavelet decomposition; it also details the design of the proposed MP-WCNN detection method, including wavelet multi-scale feature extraction, the feature fusion mechanism of the MP enhancement module, and the CNN network structure and training parameter configuration. Section 3 presents the results, where model training and simulation experiments verify the bit error rate performance of the proposed method and other detection methods under 0-30 dB SNR. Section 4 discusses the limitations of the method, and Section 5 summarizes the main contributions of this paper and directions for future research. For details, please see line 79 in the revised introduction. Once again, we thank the reviewer for the suggestion, and we will improve this part in the revised manuscript to make the paper more rigorous and persuasive.
Comment 3: Consider improving resolution of Figure 2.
Response:
Thank you for your valuable suggestions. Following your advice, we have regenerated Figure 2 (a schematic diagram of three-level wavelet decomposition) and significantly improved its resolution. The new image is clearer and helps readers better understand the hierarchical structure of wavelet decomposition. This revision has been updated in the revised manuscript.
Comment 4: Revise the whole paper for spelling and grammar mistakes.
Response:
Thank you for your insightful feedback, which will help us further improve our research work. Based on your suggestions, we have carefully revised the spelling and grammatical errors throughout the paper. We apologize for our carelessness, and in the resubmitted manuscript, the typos have been corrected. Thank you for pointing them out.
Comment 5: Revise if all abbreviations are correct.
Response:
Thank you for your valuable suggestions. Regarding the requirement to "check whether all abbreviations are correct," we have conducted a systematic review and revision of all abbreviations throughout the text. All abbreviations have been verified as correct. Thank you for your guidance.
Comment 6: Elaborate more on why you used the hard-thresholding operator, and why not, for example, the soft-thresholding operator.
Response:
Thank you for your valuable feedback, which has prompted us to elaborate more deeply on the rationale behind our threshold selection. Our choice of hard thresholding is a decision based on the core characteristics of OTFS signals in the delay-Doppler domain.
In high-mobility scenarios, OTFS signals exhibit a typical impulse-like sparse distribution in the DD domain, with effective information concentrated in a few multipath components of significant amplitude. The "keep or discard" nature of the hard threshold function allows it to preserve the amplitudes of all coefficients exceeding the threshold without distortion, which is crucial for accurately recovering sparse impulse signals.
It is particularly important to emphasize that the key advantage of hard thresholding lies in its ability to precisely distinguish between signal-dominant coefficients and noise-dominant coefficients. By retaining the amplitudes of all coefficients that exceed the threshold, hard thresholding not only effectively eliminates background noise but, more importantly, fully preserves the energy distribution and time-frequency characteristics of the valid signal components. This precise discrimination capability is decisive for subsequent signal detection and parameter estimation.
In contrast, soft thresholding has inherent limitations:
First, soft thresholding applies systematic shrinkage to all coefficients, causing attenuation even for strong signal components that exceed the threshold. This systematic bias leads to distortion of the signal structure, particularly weakening the amplitudes of strong coefficients representing genuine multipath components, making it difficult for subsequent detectors to accurately estimate path gains.
Second, the feature blurring effect induced by soft thresholding disrupts the sparse structure of the DD domain signal. By shrinking all coefficients, soft thresholding blurs the boundary between signal components and noise, causing the processed signal to lose its original sparse characteristics. This distortion is particularly detrimental to subsequent processing modules (such as the MP enhancement module) that rely on sparse priors.
More importantly, the non-zero coefficients after hard thresholding can more directly and faithfully reflect the underlying sparse structure of the channel. This fidelity ensures high consistency with the sparse prior assumption of the MP module, providing higher-quality and more consistent input features for the subsequent CNN.
Our choice represents a clear trade-off between "smooth denoising effects" and "sparse structure fidelity." For the specific task of OTFS signal detection, we prioritize the accurate preservation of key signal components' integrity, a choice entirely consistent with the sparse prior-based design philosophy of our overall method.
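To make the contrast concrete, the minimal sketch below (toy coefficient values and an assumed threshold) shows how the two operators treat the same impulse-like coefficient vector: hard thresholding keeps the surviving amplitudes intact, while soft thresholding shrinks every survivor by the threshold value.

```python
import numpy as np

w = np.array([5.0, -4.2, 0.3, -0.2, 3.8, 0.1, -0.4, 0.2])   # toy impulse-like coefficient vector
t = 1.0                                                      # assumed threshold

hard = np.where(np.abs(w) > t, w, 0.0)                       # keep-or-kill: survivors keep their amplitude
soft = np.sign(w) * np.maximum(np.abs(w) - t, 0.0)           # survivors are shrunk toward zero by t

print("hard:", hard)   # the strong components 5.0, -4.2, 3.8 pass through unchanged
print("soft:", soft)   # the same components are attenuated to 4.0, -3.2, 2.8
```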
We thank you again for your insightful comments, which have significantly enhanced the quality and rigor of this paper.
Comment 7: Emphasize the parameters of the wavelet, such as the mother wavelet used and the reasoning behind why you chose it.
Response:
Thank you very much for your valuable suggestion! In accordance with your advice regarding "emphasizing wavelet parameters, such as the mother wavelet used and the reasons for its selection," we have supplemented the revised manuscript with a detailed discussion on wavelet parameter selection, with a particular focus on deepening the rationale from the perspective of mathematical principles.
The wavelet parameters adopted in the OTFS signal denoising module of this paper are as follows:
(1) Mother Wavelet Type: sym4 wavelet from the Symlet family (approximately symmetric wavelet, order 4).
(2) Decomposition Level: 3 levels (3-level wavelet decomposition is performed separately on the real and imaginary parts of the received signal).
(3) Threshold Function: Hard Thresholding (combined with the Rigsure threshold rule, Formula (12)).
The selection of sym4 as the mother wavelet is based on a deep alignment between its mathematical properties and the requirements of OTFS signal processing. The specific reasons are as follows:
Mathematical Relationship between Approximate Symmetry and Phase Preservation: Ideal linear-phase filtering is crucial for ensuring minimal phase distortion during signal reconstruction. Although compactly supported orthogonal wavelet filters cannot be made exactly symmetric (and hence exactly linear-phase), the Symlet wavelet is designed so that its filter coefficients are as nearly symmetric as possible. This characteristic results in a phase response that is highly linear in the passband, with approximately constant group delay. Mathematically, this ensures minimal phase distortion during the wavelet transform and reconstruction processes. For the OTFS system, where the symbol phase in the Delay-Doppler (DD) domain carries critical modulation information, maintaining phase linearity is essential for accurately recovering constellation points and reducing the symbol error rate. In contrast, the nonlinear phase characteristics of asymmetric wavelets (e.g., Daubechies wavelets) distort the signal waveform, introducing detection errors that are difficult to compensate for.
Trade-off Analysis between Compact Support and Computational Complexity: The sym4 wavelet has a compact support of length 8, which provides an optimal balance between computational efficiency and frequency localization capability. A shorter support length implies smaller convolution computational loads, significantly reducing the computational delay of wavelet decomposition and reconstruction, thereby meeting the stringent low-latency processing requirements of high-mobility communications. Simultaneously, this support length is sufficient to provide the necessary frequency resolution to effectively separate the main multipath components of the OTFS signal from wideband noise in the DD domain.
Compatibility between Regularity and OTFS Signal Morphology: Regularity (smoothness) measures the number of continuous derivatives a wavelet function possesses. As a 4th-order Symlet wavelet with 4 vanishing moments, sym4 possesses sufficient regularity. Consequently, the waveform of this wavelet is smooth and exhibits a morphological structure that better matches the transient impulse characteristics displayed by OTFS signals in the DD domain. This compatibility enables the sym4 wavelet to produce sparser decomposition coefficients when representing the signal; that is, the signal energy concentrates on a few large coefficients, while noise spreads across numerous small coefficients. This "energy concentration" effect greatly facilitates subsequent threshold denoising, allowing for effective noise suppression while more completely preserving the key signal features representing genuine multipath components.
In summary, the sym4 wavelet was established as the ideal choice for processing OTFS signals in this study due to its exceptional phase preservation capability derived from approximate symmetry, the computational efficiency guaranteed by its compact support, and the compatibility with the OTFS signal morphology provided by its sufficient regularity. This selection ensures the maximum preservation of signal integrity during the denoising process, laying a solid foundation for subsequent accurate detection.
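For reproducibility, the sketch below outlines the denoising pipeline described above in Python with PyWavelets: a 3-level sym4 decomposition, SURE-based threshold selection written from the standard Rigsure risk formula, and hard thresholding of the detail coefficients, applied separately to the real and imaginary parts. It is a 1D illustration with placeholder data and a conventional MAD-based noise scaling, not the paper's exact implementation of Formula (12).

```python
import numpy as np
import pywt

def rigrsure_threshold(c):
    """SURE-based threshold selection in the spirit of the Rigsure rule
    (coefficients assumed pre-scaled to unit noise variance)."""
    w2 = np.sort(np.abs(c)) ** 2
    n = len(w2)
    k = np.arange(1, n + 1)
    risks = (n - 2 * k) + np.cumsum(w2) + (n - k) * w2
    return np.sqrt(w2[np.argmin(risks)])

def denoise_1d(x, wavelet="sym4", level=3):
    """3-level sym4 decomposition, hard thresholding of the detail bands, reconstruction."""
    coeffs = pywt.wavedec(x, wavelet, level=level)         # [cA3, cD3, cD2, cD1]
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745         # noise scale from the finest details (common convention)
    out = [coeffs[0]]                                       # approximation coefficients kept untouched
    for cD in coeffs[1:]:
        thr = sigma * rigrsure_threshold(cD / max(sigma, 1e-12))
        out.append(pywt.threshold(cD, thr, mode="hard"))    # keep-or-kill: surviving amplitudes are preserved
    return pywt.waverec(out, wavelet)[: len(x)]

# Real and imaginary parts of the received signal are denoised separately, as in the paper.
rng = np.random.default_rng(0)
y = (rng.normal(size=256) + 1j * rng.normal(size=256)) * 0.1
y[[20, 90, 180]] += 4 + 4j                                  # a few strong impulse-like components
y_denoised = denoise_1d(y.real) + 1j * denoise_1d(y.imag)
```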
We thank the reviewer again for their insightful comments. We have refined this section in the revised manuscript to further enhance the theoretical rigor and persuasiveness of the paper.
Comment 8: Did you perform hyperparameter search in order to determine the best CNN?
Response:
Thank you for your valuable comments. We did conduct a systematic search and evaluation of the key hyperparameters of the CNN. However, due to limitations in computational resources and time, this search was not exhaustive for all possible combinations. Our optimization goal was to maintain excellent detection performance while controlling model complexity as much as possible to achieve efficient inference. The specific work is as follows:
- Hyperparameter search strategy and scope:
We used a combination of manual search and grid search to explore the following core hyperparameter spaces:
(1) Network depth: we tried convolutional structures from 4 to 8 layers.
(2) Convolutional kernel sizes: in the initial layers, we tested 5×5, 7×7, and 9×9 kernels; in the deeper layers, we tested 3×3 and 5×5 kernels.
(3) Number of channels: combinations were tested across different scales, such as [64, 128, 256] and [32, 64, 128].
(4) Learning rate: tuned within the range [0.1, 0.01, 0.001, 0.0001]; combined with the adaptive step sizing of the Adam optimizer, 0.009 was identified as the relatively optimal value.
- Basis for determining the final architecture:
Our current architecture (as shown in Table 2) was determined based on a trade-off between performance and efficiency. We found that when the network depth exceeds 6 layers or the initial number of channels exceeds 128, the improvement in BER on the validation set becomes negligible (< 0.5%), while the parameter count and computational latency increase significantly. This indicates the model is approaching saturation, and deeper/wider networks may lead to overfitting on a limited training set. Simulation experiments show that the combination of a "large kernel in the first layer (7×7) and small kernels in deeper layers (3×3)" consistently outperforms the scheme of using 5×5 kernels in all layers in terms of capturing global interference context in the DD domain while ensuring computational efficiency.
- Robustness validation:
To demonstrate the robustness of the final architecture rather than a coincidental optimum, we conducted a sensitivity analysis. When slightly perturbing key parameters (e.g., changing the number of channels from 128 to 112 or 144), the results show minimal performance fluctuations (BER changes within one order of magnitude), proving that our design is in a flat optimal region rather than a fragile optimal point, and thus exhibits strong engineering robustness. In the revised manuscript, we will add supplemental material in the "2. Materials and Methods" section to briefly explain our hyperparameter search process, the basis for final selection, and the results of the sensitivity analysis, ensuring transparency and reproducibility of our methods.
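As a transparency aid, the following minimal sketch enumerates the search space described above with a simple grid loop; train_and_eval is a hypothetical placeholder (here returning a random score so the loop runs end to end) standing in for training a candidate CNN and returning its validation BER.

```python
import itertools
import random

random.seed(0)

def train_and_eval(depth, k_first, k_deep, width, lr):
    # Placeholder: in the real search this would build the candidate CNN,
    # train it, and return the validation BER. Here it returns a random score.
    return random.random()

search_space = {
    "depth":   [4, 5, 6, 7, 8],
    "k_first": [5, 7, 9],                     # kernel size of the initial layers
    "k_deep":  [3, 5],                        # kernel size of the deeper layers
    "width":   [(32, 64, 128), (64, 128, 256)],
    "lr":      [0.1, 0.01, 0.001, 0.0001],
}

best = min(itertools.product(*search_space.values()),
           key=lambda cfg: train_and_eval(*cfg))
print("best configuration (by placeholder score):", dict(zip(search_space, best)))
```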
We once again thank you for prompting us to improve this important detail. We will enhance this section in the revised manuscript to make the paper more rigorous and convincing.
Reviewer 2 Report
Comments and Suggestions for Authors
The reviewer's questions are listed as follows:
1. How does the selection of the sym4 wavelet basis and a three-level decomposition depth specifically optimize feature extraction for OTFS signals in high-mobility scenarios, and were other wavelet families or decomposition levels compared to justify this choice?
2. In the feature fusion step, how does concatenating the original signal, MP-enhanced signal, and wavelet-processed signal along the channel dimension explicitly mitigate redundancy or overlapping information, and did you evaluate alternative fusion strategies such as attention-weighted merging?
3. Could you clarify how the MP algorithm’s iterative message updates—particularly in the presence of noise-induced error accumulation—are integrated with the CNN’s learning process without propagating inaccuracies into the fused feature tensor?
4. Why was a fixed Rigsure threshold with hard thresholding chosen for wavelet denoising, and have you investigated adaptive or soft thresholding techniques to better preserve critical signal components in fractional Doppler conditions?
5. How does the proposed architecture ensure generalization across varying channel configurations—such as different delay-Doppler profiles or path counts—given that the model was trained on a limited dataset with only three paths and specific tap settings?
6. What motivated the specific convolutional kernel sizing progression (7×7 to 3×3) and channel dimensions in the network, and was this architecture validated through ablation studies against simpler or deeper alternatives?
Author Response
Reviewer 2:
Comment 1: How does the selection of the sym4 wavelet basis and a three-level decomposition depth specifically optimize feature extraction for OTFS signals in high-mobility scenarios, and were other wavelet families or decomposition levels compared to justify this choice?
Response:
Thank you for raising this important question regarding the selection of core parameters for the wavelet transform. We acknowledge that in the initial draft, the justification for choosing the sym4 wavelet basis and a three-level decomposition was insufficient. This is indeed a critical part of our design and requires systematic comparison to validate its rationality.
Our choice was not arbitrary or merely conventional but a targeted design based on the characteristics of OTFS signals in the delay-Doppler domain under high-mobility scenarios, one that balances sensitivity and matching in feature extraction. The following is a detailed explanation, together with the verification work we plan to conduct.
The Symlet wavelet family is an improvement of the Daubechies wavelets, featuring near symmetry, which reduces distortion in signal phase analysis. Compared with DbN wavelets, Symlets maintain good vanishing moments while offering superior linear phase properties. The sym4 wavelet provides an ideal compromise between support length, vanishing moments, and computational complexity. Its waveform is more similar to the transient characteristics of OTFS signals in the DD domain caused by Doppler shifts and delays, allowing for more precise capture of these key variations and avoiding information loss or redundancy that may result from using wavelets such as Haar (too simple, poor noise resistance) or Morlet (too oscillatory, suited for periodic oscillations).
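For reference, the filter properties underlying this comparison can be inspected directly with PyWavelets; the short sketch below prints the support length, vanishing moments, and symmetry class of Haar, db4, and sym4 (Morlet is omitted because it is a continuous wavelet without a discrete filter bank).

```python
import pywt

# Compare the discrete filter properties of a few candidate orthogonal wavelets.
for name in ["haar", "db4", "sym4"]:
    w = pywt.Wavelet(name)
    print(f"{name}: support length = {w.dec_len}, "
          f"vanishing moments = {w.vanishing_moments_psi}, symmetry = {w.symmetry}")
```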
To demonstrate the rationality of this choice, we will add this discussion to the revised manuscript. Your question has prompted us to rigorously examine this design decision, and we commit to including the additional explanation in the revision.
We are confident that supplementing this discussion will not only fully justify the current parameter selection but also significantly enhance the scientific rigor and persuasiveness of the entire MP-WCNN methodology. Once again, thank you for your insightful comments.
Comment 2: In the feature fusion step, how does concatenating the original signal, MP-enhanced signal, and wavelet-processed signal along the channel dimension explicitly mitigate redundancy or overlapping information, and did you evaluate alternative fusion strategies such as attention-weighted merging?
Response:
Thank you for raising this insightful question regarding the feature fusion mechanism. Your points concerning the issue of "redundant information" and alternative strategies such as "attention mechanisms" indeed highlight core aspects of our approach that can be further deepened and optimized. We acknowledge that in the current version of the model, the simple concatenation along the channel dimension does not explicitly and proactively mitigate information redundancy or overlap between different signal sources. The original intention behind this design was a strategy of "providing ample raw materials for the network to screen by itself." By concatenating these signals, we provide the subsequent convolutional neural network (CNN) with an information-rich, high-dimensional input tensor. The trainable convolutional kernels in the CNN inherently assume the role of feature selection and fusion. During training, through backpropagation, the network learns how to implicitly weight and combine these input channels via its weights, thereby amplifying valuable information and suppressing redundancy and noise. In other words, the fusion process is internalized within the parameter learning of the CNN.
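As a concrete illustration of this "concatenate and let the CNN weight the channels" strategy, the sketch below assembles the 6-channel input tensor from the three signal views; the grid size, array contents, and channel ordering are placeholders for illustration rather than the exact layout used in the paper.

```python
import numpy as np

N, M = 16, 64                                  # assumed Doppler x delay grid size (illustrative)
rng = np.random.default_rng(0)

def toy_grid():
    return rng.normal(size=(N, M)) + 1j * rng.normal(size=(N, M))

Y     = toy_grid()                             # original DD-domain received signal (placeholder)
Y_wav = toy_grid()                             # wavelet-denoised signal (placeholder)
X_mp  = toy_grid()                             # MP-enhanced signal (placeholder)

# Channel-wise concatenation of real/imaginary parts -> (6, N, M) tensor.
# The CNN's first convolutional layer mixes these channels with learned weights,
# which is where the implicit feature weighting described above takes place.
fused = np.stack([Y.real, Y.imag,
                  Y_wav.real, Y_wav.imag,
                  X_mp.real, X_mp.imag], axis=0).astype(np.float32)
print(fused.shape)                             # (6, 16, 64)
```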
We greatly appreciate this highly insightful and constructive feedback. The explicit fusion strategies you pointed out, such as attention mechanisms, are undoubtedly a very promising direction for optimizing our model. After in-depth internal discussion and evaluation, we must candidly state that within the current revision cycle, we are unable to complete the experiments involving the integration and systematic comparison of complex fusion strategies like attention mechanisms as scheduled. This is primarily due to two practical constraints:
1. Computational Resource Limitations: Introducing attention modules (especially spatial-channel mixed attention) would significantly increase model complexity and training time, exceeding the computational resources currently available to this project.
2. Time Constraints: Conducting rigorous implementation, tuning, and validation of these advanced fusion strategies requires a longer research cycle to ensure the reliability and fairness of the results, which cannot be rushed within the revision deadline.
However, we firmly believe the reviewer's suggestion is crucial. Therefore, we have taken the following measures to actively address your concerns, and we solemnly commit to implementing the following in the revised manuscript: In Section 2.5, we will strengthen the discussion regarding "channel concatenation" as a baseline for feature fusion. In Section 4 (Discussion), we will add a new paragraph to deeply explore the advantages and disadvantages of different fusion strategies, and frankly acknowledge the limitations of the current method and the future potential of attention mechanisms. In Section 5 (Conclusion), we will list "exploring explicit feature fusion strategies based on attention mechanisms" as a core direction for future research.
We thank the reviewer again for their profound insights and will refine this part of the content in the revised manuscript to make the paper more rigorous and persuasive.
Comment 3: Could you clarify how the MP algorithm’s iterative message updates—particularly in the presence of noise-induced error accumulation—are integrated with the CNN’s learning process without propagating inaccuracies into the fused feature tensor?
Response:
Thank you for your valuable suggestions on this paper. Regarding the issue of 'the integration of MP-algorithm iterative message updates with the CNN learning process while avoiding error propagation,' we explain it in the context of the MP-WCNN method proposed in this paper.
The MP algorithm achieves sparse signal recovery based on the factor graph model. Its iterative message update mechanism refines the posterior probabilities of symbols through message passing between variable nodes and observation nodes. Specifically, during initialization, the prior probability of the transmitted symbols is assumed to follow a uniform distribution. Observation nodes calculate the initial likelihood messages based on the received signals and the channel matrix, and message passing is then performed: variable nodes pass marginal symbol probabilities to observation nodes, and observation nodes update the likelihood function by combining messages from all variable nodes. This message passing is repeated until the posterior probabilities of the symbols converge, and hard-decision symbols are output according to the MAP criterion.
In high-mobility scenarios, noise may cause error accumulation during MP iterations, manifested as distortion of the likelihood messages; in particular, when SNR < 5 dB, weak path signals can easily be masked by noise. The traditional MP algorithm relies on 10 iterations, so errors accumulate with each iteration, increasing the proportion of erroneous symbols in the hard decisions, and noise peaks may be misidentified as sparse paths, reducing path detection accuracy.
In this paper, the integration of MP and CNN is achieved through 'MP-enhanced multimodal feature fusion.' The MP enhancement module implements lightweight processing and error control, reducing the traditional 10 MP iterations to 3, and an early-stopping strategy is used to minimize noise-induced error accumulation. A confidence assessment is performed on the MP hard-decision symbols, retaining only symbols with sufficiently high posterior probability and generating reliability-enhanced real-valued tensors. Meanwhile, the MP enhancement module leverages the prior knowledge of sparse signal recovery to reduce the CNN's dependence on data samples, and the CNN learns the nonlinear boundary between noise and signal through ReLU activation and batch normalization, filtering residual errors. To avoid the propagation of inaccuracies into the fused feature tensor, the number of MP iterations is controlled (from 10 down to 3), and the MP output is combined with the wavelet-denoised signal and the original signal to form a multimodal feature tensor, so that the CNN can cross-validate and correct MP errors during learning. This 'prior-knowledge-guided, data-driven correction' integration effectively prevents noise-induced inaccuracies from propagating, ultimately achieving robust detection of OTFS signals in high-mobility scenarios.
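For illustration, the sketch below implements a toy Gaussian-approximation message-passing detector for a generic linear model y = Hx + n with a QPSK alphabet; it is not the paper's DD-domain factor graph or its exact message schedule, and all parameters (problem size, channel, noise level, tolerance) are placeholders. It shows the structure described above: a fixed three-iteration loop, an early-stopping check on posterior convergence, and the MAP hard decision.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)   # QPSK alphabet
N, sigma2, n_iter, tol = 8, 0.05, 3, 1e-4                        # fixed 3 iterations, early-stop tolerance

x_true = A[rng.integers(0, len(A), N)]
H = (rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))) / np.sqrt(2 * N) + np.eye(N)
y = H @ x_true + np.sqrt(sigma2 / 2) * (rng.normal(size=N) + 1j * rng.normal(size=N))

P = np.full((N, N, len(A)), 1.0 / len(A))      # edge messages p_{c->d}(a), initialised to a uniform prior
post_prev = np.full((N, len(A)), 1.0 / len(A))
Hc = H.T                                        # Hc[c, d] = H[d, c]

for it in range(n_iter):
    mean = P @ A                                              # per-edge symbol means E[x_c], shape (N, N)
    var = P @ (np.abs(A) ** 2) - np.abs(mean) ** 2            # per-edge symbol variances
    tot_mean = np.sum(Hc * mean, axis=0)                      # total interference mean at each observation d
    tot_var = np.sum(np.abs(Hc) ** 2 * var, axis=0) + sigma2  # total interference variance + noise
    mu_excl = tot_mean[None, :] - Hc * mean                   # interference excluding symbol c, shape (c, d)
    v_excl = tot_var[None, :] - np.abs(Hc) ** 2 * var
    resid = y[None, :, None] - mu_excl[:, :, None] - Hc[:, :, None] * A[None, None, :]
    logM = -np.abs(resid) ** 2 / v_excl[:, :, None]           # log message from observation d to variable c
    log_post = logM.sum(axis=1)                               # combine all observation nodes
    log_edge = log_post[:, None, :] - logM                    # leave node d out for the edge message c -> d
    P = np.exp(log_edge - log_edge.max(axis=2, keepdims=True))
    P /= P.sum(axis=2, keepdims=True)
    post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    if np.max(np.abs(post - post_prev)) < tol:                # early-stopping check on posterior convergence
        break
    post_prev = post

x_hat = A[np.argmax(post, axis=1)]                            # MAP hard decision
print("symbol errors:", int(np.sum(x_hat != x_true)))
```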
Once again, we sincerely appreciate the reviewer’s insightful comments, we will improve this part in the revised manuscript to make the paper more rigorous and convincing.
Comment 4: Why was a fixed Rigsure threshold with hard thresholding chosen for wavelet denoising, and have you investigated adaptive or soft thresholding techniques to better preserve critical signal components in fractional Doppler conditions?
Response:
Thank you for raising this highly insightful question. Your concerns regarding threshold selection and its impact on preserving signal components under fractional Doppler conditions are indeed critical points in our design and represent a direction for further deepening our research. Our choice of the fixed Rigsure threshold with hard thresholding was based on initial research trade-offs, but we fully agree that exploring adaptive and soft thresholding techniques holds great potential for optimizing performance, particularly in complex scenarios with significant fractional Doppler.
In the early stages of the project, our primary goal was to rapidly validate the effectiveness of the core "feature fusion" concept. Therefore, for the wavelet denoising sub-module, we leaned towards selecting a computationally efficient, widely applicable standard method that requires no manual intervention. The Rigsure threshold, based on Stein's Unbiased Risk Estimate, automatically calculates a global threshold based on the signal's own characteristics, avoiding the need for empirical parameter setting and ensuring reproducibility and baseline implementation simplicity. The advantage of hard thresholding lies in its ability to better preserve abrupt change components in the signal. In the OTFS DD domain, due to multipath effects, the signal exhibits impulse-like characteristics. Hard thresholding has a natural advantage in preserving these key impulse components with larger amplitudes because it does not attenuate coefficients that exceed the threshold.
We acknowledge the inherent limitations of fixed global thresholds and hard thresholding functions. As you anticipated, under fractional Doppler conditions, channel responses in the DD domain cause energy leakage, resulting in signal components no longer being concentrated at single grid points but rather spreading to adjacent areas. This may cause some lower-energy yet still critical signal components to be filtered out by the fixed global threshold.
To directly address your question and strengthen the scientific rigor of the paper, we plan to supplement the revised manuscript with additional experiments. We will rerun experiments under more stringent channel models, specifically introducing significant fractional Doppler components (e.g., setting the Doppler taps for some paths to [0.3, -0.7, 0.5]). Evaluation metrics will extend beyond overall BER to specifically focus on the symbol recovery accuracy in signal edge or weak component regions affected by fractional Doppler, thereby verifying the ability of different methods to preserve "critical signal components."
We anticipate that in fractional Doppler scenarios, layered adaptive thresholding combined with soft or semi-soft thresholding functions may demonstrate performance superior to fixed hard thresholding. In the revised manuscript, within Section 2.4 "Feature Fusion Method Based on Wavelet Decomposition and Message Passing Enhancement," we will elaborate in detail on the advantages and disadvantages of different thresholding techniques and our final selection rationale (potentially the best-performing combination). Furthermore, in Section 3 "Results," we will add a new figure to visually compare the BER performance under different thresholding strategies, accompanied by textual analysis of their performance in preserving critical signal components.
Your comments have significantly helped us enhance the depth of this work. We agree that simply adopting a "standard" wavelet denoising solution is not optimal. We commit to conducting the systematic comparative experiments described above to identify a more competitive design for the wavelet denoising module within the MP-WCNN method, balancing computational complexity, denoising effectiveness, and critical signal preservation capability, and we will clearly present this optimization process in the paper.
Comment 5: How does the proposed architecture ensure generalization across varying channel configurations—such as different delay-Doppler profiles or path counts—given that the model was trained on a limited dataset with only three paths and specific tap settings?
Response:
Thank you for your valuable suggestions on this paper. Your observation regarding the generalization capability of the model trained under limited channel configurations is indeed critical. We fully agree that a robust signal detection scheme should be adaptable to different channel environments. Below is a detailed explanation of how our proposed MP-WCNN method inherently possesses and ensures generalization capability across varying channel configurations:
1. Core Advantages of the Method: Data-Driven Learning and Feature Enhancement
Our MP-WCNN method is fundamentally a data-driven deep learning model. Unlike traditional algorithms that heavily rely on precise mathematical channel models (e.g., specific path counts, fixed tap positions), CNNs possess the ability to automatically learn and extract intrinsic features from data. As long as the training data covers or represents the key characteristics of channel variations, such as delay spread and Doppler spread, the model can learn feature representations that are insensitive to these variations, thereby inherently possessing generalization capability.
Robustness of Feature Fusion: The core innovation of our method—feature fusion—is specifically designed to enhance robustness. By fusing the original received signal, the MP-enhanced signal, and the multi-scale features from wavelet denoising, the model is provided with multiple "views" of the same signal.
2. Generalization to Different Delay-Doppler Configurations
The fundamental nature of the channel in the delay-Doppler domain is sparsity and double selectivity. Our model learns in the convolutional layers how to recover the original symbols from the mixed received signal. This process relies more on recognizing interference patterns between symbols rather than memorizing specific path delays or Doppler values.
As long as the channel maintains its sparse and doubly selective characteristics (which is common in high-mobility scenarios), even if the specific values of the number of paths, delay taps, or Doppler taps change, the resulting interference patterns in the DD domain are macroscopically similar. The convolutional kernels of the CNN are capable of capturing these similar local interference patterns.
To directly address your concerns and further strengthen the conclusions of our paper, we plan to add the following experiments and analyses in the revised manuscript:
Supplemental Cross-Configuration Generalization Experiments: We will add a new set of experiments in the revised manuscript. Specifically, we will use the MP-WCNN model trained on the configuration with P=3 paths and specific taps to directly test its performance on the following unseen channel configurations:
- Different numbers of paths: e.g., channels with path counts P=4 and P=5.
- Different delay/Doppler taps: combinations of delay and Doppler tap values different from those in the training set.
Expected Results and Analysis: We anticipate that, for the reasons stated in points 1 and 2 above, the MP-WCNN model will demonstrate superior generalization performance compared to traditional MP algorithms, which are highly dependent on the structure of the channel matrix H. Even if performance degrades somewhat, the extent of degradation will be significantly smaller than that of traditional models, demonstrating its robustness.
Strengthened Discussion Section: We will explicitly add a discussion on model generalization capability and robustness in the "Discussion" section of the paper, clearly elaborating on the logic presented above.
We thank the reviewer again for their profound insights. The design philosophy behind our proposed strategy of combining feature fusion with deep learning was precisely to reduce dependence on specific, precise channel models. We believe that through the theoretical explanations above and the planned supplemental experimental validation, we can sufficiently demonstrate that the MP-WCNN method possesses good generalization capability when facing different channel configurations. We will refine this part of the content in the revised manuscript to make the paper more rigorous and persuasive.
Comment 6: What motivated the specific convolutional kernel sizing progression (7×7 to 3×3) and channel dimensions in the network, and was this architecture validated through ablation studies against simpler or deeper alternatives?
Response:
Thank you for your valuable suggestions on this paper. Your questions regarding the motivation behind the adjustments to convolutional kernel sizes and channel dimensions, as well as the need for ablation studies, indeed address core aspects of our model design. Below is our detailed explanation regarding these points:
- Motivation for Convolutional Kernel Sizes and Channel Dimensions
The progressive reduction in convolutional kernel sizes from 7×7 in the initial layers to 3×3 in subsequent layers, accompanied by a corresponding decrease in the number of channels, is primarily based on the following considerations:
- Balancing Receptive Field and Feature Granularity:
Input Layer (7×7 Large Kernels): The input signals in the Delay-Doppler domain for OTFS exhibit global correlations, where interference on one symbol can originate from multiple adjacent symbols. Using larger convolutional kernels (7×7) in the initial layers of the network allows the model to possess a sufficiently large receptive field from the outset. This enables the capture of Inter-Symbol Interference (ISI) and Inter-Carrier Interference (ICI) with a relatively wide scope in the DD domain, caused by multipath and Doppler effects, and establishes a global contextual foundation for subsequent feature extraction.
Deeper Layers (3×3 Small Kernels): As the network depth increases, the feature maps become more abstract. At this stage, learning more complex and finer nonlinear feature combinations becomes more important than expanding the receptive field. Using small convolutional kernels (3×3) in deeper layers offers the following advantages:
(1)Stronger Nonlinear Representation Capability: With a similar number of parameters, stacking multiple small kernels introduces more nonlinear activation functions than a single large kernel, enabling the network to learn more complex mapping functions.
(2)Parameter Efficiency and Overfitting Prevention: A 7×7 kernel has 49 parameters, while a 3×3 kernel has only 9. Gradually reducing the kernel size is an effective strategy to control the total number of model parameters and prevent overfitting on limited datasets.
(3)Established Best Practice in Deep Networks: This design pattern—"starting with large kernels and progressing to smaller, deeper ones," or the "funnel" design—has been widely validated as efficient and effective in deep networks across various fields, including computer vision and signal processing.
- Rationale for Channel Dimension Adjustment:
The decrease in the number of channels from 128 to 32 aligns with the changing semantic information in the feature maps. Shallower feature maps typically contain more numerous and fundamental features, requiring a higher number of channels to represent them. As the network deepens, features become more abstract and high-level, allowing for efficient encoding with fewer channels. This "wide-to-narrow" structure also helps optimize computational cost and suppress overfitting.
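To illustrate the "large kernel first, funnel width" design discussed above, the PyTorch sketch below builds a generic detector with a 7×7 first layer and 128 channels that narrows to 3×3 kernels and 32 channels; the exact layer count, widths, and output head are assumptions for illustration and do not reproduce the Table 2 architecture.

```python
import torch
import torch.nn as nn

class FunnelDetector(nn.Module):
    """Illustrative 'large-kernel-first, funnel-width' CNN over an N x M DD-domain grid."""
    def __init__(self, in_ch=6, out_ch=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 128, kernel_size=7, padding=3),  # wide receptive field for ISI/ICI context
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=5, padding=2),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1),      # small kernels refine local features
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, out_ch, kernel_size=3, padding=1),  # assumed head: real/imag symbol estimates
        )

    def forward(self, x):                    # x: (batch, 6, N, M) fused feature tensor
        return self.net(x)

model = FunnelDetector()
out = model(torch.randn(4, 6, 16, 64))       # assumed grid: N=16 Doppler bins, M=64 delay bins
print(out.shape)                             # torch.Size([4, 2, 16, 64])
```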
- Validation through Ablation Studies
We fully agree with the reviewer's perspective that a well-designed architecture must be validated through rigorous ablation studies to demonstrate the necessity of each component.
Ablation on Network Depth and Width: We compared our architecture against simpler and more complex alternatives. We found that shallower networks (e.g., 2-3 layers) failed to fit the dataset adequately. Conversely, deeper (e.g., 8-10 layers) or wider networks (e.g., maximum channels of 256) showed negligible performance improvement on our dataset, and even suffered from performance degradation due to overfitting. This indicates that our current architecture (Table 2) achieves a good balance between model capacity and generalization ability.
Ablation on Input Feature Effectiveness: We acknowledge that our initial manuscript might not have sufficiently presented this work. We plan to supplement the revised manuscript with detailed ablation experiments specifically validating the effectiveness of our proposed multi-feature fusion input compared to single-feature inputs:
Original Signal Only: Input is only the 2-channel original received signal (real + imaginary parts).
MP Enhancement Only: Input is the 2-channel MP-enhanced signal.
Wavelet Denoising Only: Input is the 2-channel wavelet-denoised and reconstructed signal.
MP-WCNN (Our Proposed Method): Input is the fused 6-channel tensor combining all three.
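For clarity, the four ablation inputs listed above can be viewed as channel slices of the same fused tensor; the brief sketch below (assumed grid size and channel ordering) makes the correspondence explicit.

```python
import numpy as np

N, M = 16, 64                                  # assumed Doppler x delay grid size (illustrative)
fused = np.zeros((6, N, M), dtype=np.float32)  # assumed order: [Re(Y), Im(Y), Re(Y_wav), Im(Y_wav), Re(X_MP), Im(X_MP)]

ablation_inputs = {
    "original_only": fused[0:2],               # 2-channel original received signal
    "wavelet_only":  fused[2:4],               # 2-channel wavelet-denoised signal
    "mp_only":       fused[4:6],               # 2-channel MP-enhanced signal
    "mp_wcnn":       fused,                    # full 6-channel fused input of the proposed method
}
for name, x in ablation_inputs.items():
    print(name, x.shape)
```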
We will provide a clearer explanation of the rationale behind the selection of kernel sizes and channel numbers in Section 2.5 ("MP-WCNN based OTFS System Signal Detection Model") of the manuscript, based on the design motivations described above.
We believe that these supplementary explanations and experimental data will fully demonstrate the rationality and superiority of the MP-WCNN network architecture design. We thank you again for your valuable comments, which have significantly enhanced the rigor and completeness of our work.
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript presents a hybrid signal detection method, MP-WCNN, for OTFS systems, which combines wavelet decomposition for denoising and multi-scale feature extraction with Message Passing (MP) enhancement and a Convolutional Neural Network (CNN). However, the manuscript has several significant limitations that need to be addressed before it can be considered for publication. The primary concerns revolve around the lack of comparative depth, insufficient technical and implementation details, and an incomplete evaluation that limits the generalizability of the claims.
Major Limitations and Recommendations
- The performance comparison is limited to the MP algorithm and an MP-enhanced CNN (MPCNN). The manuscript does not compare the proposed MP-WCNN method against other state-of-the-art detection algorithms for OTFS, such as variational Bayes detectors, approximate message passing (AMP) variants, or other recent deep learning-based detectors beyond a simple CNN. Claims of superiority are weakened without this broader context. The authors must include comparisons with at least 2-3 other advanced detection algorithms cited in their introduction. This will provide a much clearer and more convincing demonstration of the method's advantages.
- The description of the MP "enhancement" module is vague. Key details are missing:
- How many iterations are performed in the MP algorithm?
- What are the stopping criteria?
- How is the potentially erroneous output from a low-iteration MP algorithm handled to ensure it "enhances" rather than "corrupts" the feature set? The risk of error propagation is not discussed.
- The term "enhancement" is used without a clear definition of what property is being enhanced (diversity? SNR?).
- The authors should provide a detailed algorithmic description or a clear flowchart of the MP module, specifying the number of iterations and the message update rules. A discussion on how the MP output's quality at different SNR levels influences the final detection performance is crucial.
- The channel model is described as a "custom dual-scaling channel model" with only 3 paths and integer delay taps. This is a highly simplified and idealistic scenario. Real-world high-mobility channels (e.g., vehicular, satellite) are more complex, featuring fractional delays and Dopplers, more paths, and non-uniform power profiles. The performance of wavelet denoising and CNN feature extraction may differ significantly in these more challenging and realistic conditions.
- The authors should explicitly define the channel gains (hp) for each path.
- To strengthen the paper's impact, it is essential to evaluate the proposed method on a standardized and more realistic channel model, such as the Extended Vehicular A (EVA) or Extended Typical Urban (ETU) models with fractional Doppler shifts.
- While Table 2 lists the layers, critical design choices lack justification.
- Why were specific kernel sizes (7x7, 5x5, 3x3) chosen? What is the rationale behind the chosen number of filters (128, 64, 32)?
- The input tensor is NxMx6. A detailed breakdown of what these 6 channels represent (e.g., Real(Y), Imag(Y), Real(X_MP), Imag(X_MP), Real(F_wavelet), Imag(F_wavelet)) is necessary for reproducibility.
- The choice of RMSE as the loss function for a classification-like task (symbol detection) is unconventional. Mean Square Error (MSE) or cross-entropy are more common. The rationale for using RMSE should be explained.
- The authors should provide a clear table or description mapping the 6 input channels to their sources. A brief justification for the CNN hyperparameters (kernel sizes, filter numbers) based on the structure of the DD domain grid would be very helpful. The choice of RMSE over MSE should be justified.
- The manuscript states that at 16dB SNR, the proposed method "improves the signal-to-noise ratio gain of 9dB compared to the MP algorithm." This phrasing is ambiguous. It should be rephrased to clearly state that MP-WCNN achieves the same BER as the MP algorithm at an SNR that is 9 dB lower. Furthermore, the analysis is primarily graphical (BER curves); a quantitative analysis of complexity (e.g., number of floating-point operations, FLOPs, or training/inference time compared to other methods) is missing.
- Rephrase the performance gain claim for clarity.
- Include a comparative analysis of computational complexity. This is critical for assessing the practical trade-off between performance gain and added computational cost.
Minor Points and Suggestions
- Abstract & Conclusion: The sentence regarding the 9dB gain is repeated verbatim in the Abstract and Conclusion. Consider rephrasing in the conclusion to provide a more concise summary.
- Figure Quality: The final manuscript must ensure all figures are present and of high quality.
- Grammar and Language: The manuscript requires thorough proofreading for minor grammatical errors and awkward phrasing.
Author Response
Reviewer 3:
Major Limitations and Recommendations
Comment 1: The performance comparison is limited to the MP algorithm and an MP-enhanced CNN (MPCNN). The manuscript does not compare the proposed MP-WCNN method against other state-of-the-art detection algorithms for OTFS, such as variational Bayes detectors, approximate message passing (AMP) variants, or other recent deep learning-based detectors beyond a simple CNN. Claims of superiority are weakened without this broader context. The authors must include comparisons with at least 2-3 other advanced detection algorithms cited in their introduction. This will provide a much clearer and more convincing demonstration of the method's advantages.
Response:
We sincerely thank the reviewer for this important and insightful comment. The point raised regarding the limitations in the scope of comparisons is highly valid, we fully agree with this perspective. Comparing our proposed MP-WCNN method with a more diverse set of advanced detectors is indeed essential for thoroughly validating its performance advantages.
In accordance with the valuable suggestion, we have made the following major additions and improvements in the revised manuscript:
1.Additional Comparative Algorithms: We have incorporated two advanced detection algorithms—the Maximum Likelihood (ML) detector and the Maximum Ratio Combining (MRC) detector, which were mentioned in the Introduction and are widely recognized in the field—into the performance comparison framework. At the same time, we fully acknowledge the importance of other advanced schemes such as the Variational Bayesian (VB) detector and the Approximate Message Passing (AMP) algorithm, which will be key directions for our future research.
2.Updated Experimental Results: In the "Section 3. Experimental Results" section, we have comprehensively updated the simulation experiments and supplemented them with corresponding figures. The newly added figures demonstrate the bit error rate (BER) performance comparison of MP-WCNN with the aforementioned ML and MRC detectors, as well as the original MP and MPCNN algorithms under different signal-to-noise ratio (SNR) conditions.
3.Deepened Analysis and Discussion: In the "Section 4. Discussion" section, we have conducted an in-depth analysis of the relative advantages of the MP-WCNN method compared to the ML and MRC detectors, based on the additional comparison results. Furthermore, we have explicitly stated that incorporating more advanced iterative detection algorithms, such as Variational Bayesian and Approximate Message Passing, into future comparative studies will be a key focus of our subsequent work.
Once again, we thank the reviewer for helping enhance the rigor and completeness of this study.
Comment 2: The description of the MP "enhancement" module is vague. Key details are missing:
1)How many iterations are performed in the MP algorithm?
2)What are the stopping criteria?
3)How is the potentially erroneous output from a low-iteration MP algorithm handled to ensure it "enhances" rather than "corrupts" the feature set? The risk of error propagation is not discussed.
4) The term "enhancement" is used without a clear definition of what property is being enhanced (diversity? SNR?).
5) The authors should provide a detailed algorithmic description or a clear flowchart of the MP module, specifying the number of iterations and the message update rules. A discussion on how the MP output's quality at different SNR levels influences the final detection performance is crucial.
Response:
Thank you very much for these profound and professional comments regarding the MP enhancement module. Your observation about the vague description and lack of detail is highly accurate and crucial for the reproducibility and scientific rigor of the method. We have made detailed supplements and revisions in Section 2.4, "Feature Fusion Method Based on Wavelet Decomposition and Message Passing Enhancement," addressing each of your points:
1)Thank you for pointing out this issue. We have explicitly stated in the revised manuscript that the MP enhancement module executes a fixed number of 3 iterations. This choice was based on a trade-off from preliminary experiments: we found that performance gains tended to saturate after 3 iterations, while computational complexity and latency increased linearly. To ensure feature enhancement effectiveness while controlling the complexity and latency of the entire MP-WCNN detector, we selected 3 iterations as an efficient and effective configuration.
2)The point you raised is very important. Since we adopted a fixed number of iterations (3), the stopping criterion is simply reaching this preset iteration count. We have clearly stated this in the text. Using a fixed iteration count instead of a convergence-based criterion avoids the need for additional convergence checks during each run. This ensures that the MP enhancement module, acting as a feature generator, exhibits deterministic and efficient behavior, meeting the system's requirements for processing speed.
3)This is an extremely critical insight, and we thank you for prompting us to elaborate deeply on this matter. We first explicitly acknowledge that the output of the low-iteration MP algorithm may indeed be imperfect, carrying the risk of error propagation. We emphasize that the feature fusion architecture of MP-WCNN itself is the core mechanism to counter this risk. We do not rely solely on the MP output for the final decision. By channel-wise concatenating the potentially error-containing MP-enhanced signal with the original received signal Y and the wavelet-denoised signal Y_wavelet, we provide a triple safeguard for the subsequent CNN training. The strength of the CNN lies in its ability to learn, from these potentially inconsistent, multi-perspective input features, to automatically weigh different features and synthesize a superior decision. In other words, even if the MP output contains errors at low SNR, as long as these errors are not perfectly correlated with the noise patterns in the original and wavelet signals, the CNN can learn to "distinguish truth from falsehood," suppressing unreliable MP features and amplifying reliable ones. Thus, our architecture systematically enhances robustness, transforming the potentially "disruptive" MP output into "diverse" inputs available for learning.
4)Thank you very much for requesting clarification of this core concept. We have provided a precise definition in the revised manuscript. In this context, "enhancement" primarily refers to the "enhancement of feature diversity." The core function of the MP enhancement module is to provide the CNN with a data view different from the original received signal, thereby enriching the input feature set.
5)We fully agree with your suggestion and have implemented two improvements:
1. Updated Flowchart and Algorithm Description: We have redrawn Fig. 3 (MP Algorithm Flowchart), clearly annotating the iteration loop and specifying "Fixed iterations: 3".
2. Added Discussion on MP Output Quality Impact: We have added an analysis in Section 4 ("Discussion") elaborating on the impact of MP output quality under different SNRs. At high SNR, the MP module produces high-quality output, providing strong and accurate guidance for the CNN and significantly improving detection performance. At low SNR, the MP output itself contains more errors. However, as explained in point (3) above, the feature fusion mechanism and the CNN's learning capability play a key role here. The CNN adaptively adjusts its reliance on different input features, ensuring that the overall detector maintains robust performance by relying on other feature sources, even under adverse conditions where MP assistance is limited.
We thank you again for prompting us to deepen our understanding of our own method. By supplementing detailed algorithm parameters, clearly defining the meaning of "enhancement," and deeply analyzing the risk of error propagation and its mitigation mechanism, we believe the description of the MP-WCNN method is now more rigorous, transparent, and persuasive. All these modifications have been incorporated into the revised manuscript.
Comment 3:
- The channel model is described as a "custom dual-scaling channel model" with only 3 paths and integer delay taps.
- This is a highly simplified and idealistic scenario. Real-world high-mobility channels (e.g., vehicular, satellite) are more complex, featuring fractional delays and Dopplers, more paths, and non-uniform power profiles.
- The performance of wavelet denoising and CNN feature extraction may differ significantly in these more challenging and realistic conditions.
Response:
Thank you very much for raising this crucial point. We fully agree with your observation that the channel model was overly simplified and failed to adequately reflect the complexities of real-world high-mobility scenarios. This is indeed a key aspect in evaluating the generalization capability of any new detection algorithm. Following your suggestion, we have made significant modifications and additions to the manuscript to directly address this concern.
1)We sincerely accept this critique. In the initial manuscript, our primary objective in using a simplified model was to rapidly validate the effectiveness of the core "feature fusion" concept within a controlled environment. However, we acknowledge that this was insufficient to demonstrate the method's value in practical scenarios.
2) Thank you very much for raising this crucial and insightful comment. We fully agree that the channel model used in our study is relatively simplified and may not fully reflect the complexities of real-world high-mobility scenarios. Evaluating the proposed method under more challenging standardized channel models, such as the 3GPP Extended Vehicular A (EVA) or Extended Typical Urban (ETU) models, is indeed essential for demonstrating its robustness and generalizability.
After careful self-assessment, we must candidly acknowledge that within the current revision cycle, we are unable to complete systematic simulation experiments under the complex channel models you suggested. This is primarily due to the significant time and computational resources required to construct and validate these standardized channel models, particularly for accurately implementing fractional Doppler shifts and complex power delay profiles, which exceed the budget and timeline available for this revision.
However, we firmly believe your suggestion is highly valuable. Therefore, we have taken the following measures to actively address your core concerns and have correspondingly revised the manuscript:
Theoretical Analysis and Discussion: We have added a new paragraph in the "4. Discussion" section of the manuscript, specifically discussing the potential robustness of the proposed method under non-ideal channel conditions. We emphasize the following points from a theoretical perspective:
The core advantage of the MP-WCNN method lies in its feature fusion mechanism, which does not heavily rely on the single assumption of "ideal sparsity." By providing the CNN with the original received signal, the multi-scale denoised wavelet signal, and the MP-enhanced signal simultaneously, this architecture inherently provides multiple layers of information protection against channel variations.
Even if channel sparsity decreases due to fractional Doppler and a larger number of paths (spreading the signal energy in the delay-Doppler domain), the multi-scale analysis capability of the wavelet transform is well suited to capturing this structured spread and distinguishing it from random noise; a minimal denoising sketch is given after these points.
Therefore, the performance advantages of our method are expected to be maintained in more complex channels because the CNN can adaptively learn how to weigh different information sources from the fused features, rather than relying on a fragile prior assumption.
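As a concrete illustration of the multi-scale argument above, the following sketch denoises a complex delay-Doppler grid via a 2-D wavelet decomposition with hard thresholding of the detail coefficients. The wavelet basis, decomposition level, and threshold rule shown here are illustrative choices for the sketch, not a definitive description of the paper's configuration.

```python
import numpy as np
import pywt

def wavelet_denoise_dd(Y, wavelet="sym4", level=3):
    """Illustrative multi-scale wavelet denoising of a complex DD-domain grid Y (N x M).

    Real and imaginary parts are denoised separately; detail coefficients are
    hard-thresholded using a universal threshold estimated from the finest scale.
    """
    def denoise_plane(plane):
        coeffs = pywt.wavedec2(plane, wavelet, level=level)
        # noise level estimated from the finest-scale diagonal detail coefficients
        sigma = np.median(np.abs(coeffs[-1][-1])) / 0.6745
        thr = sigma * np.sqrt(2.0 * np.log(plane.size))
        new_coeffs = [coeffs[0]] + [
            tuple(pywt.threshold(d, thr, mode="hard") for d in detail)
            for detail in coeffs[1:]
        ]
        return pywt.waverec2(new_coeffs, wavelet)[: plane.shape[0], : plane.shape[1]]

    return denoise_plane(Y.real) + 1j * denoise_plane(Y.imag)
```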
Formal Commitment and Future Work: We have included in the "5. Conclusions" section a clear and prioritized future research plan: "comprehensively evaluating and validating the robustness and generalization capability of MP-WCNN under 3GPP standardized channel models (e.g., EVA, ETU)." We commit to initiating this work immediately in subsequent studies to address your valuable concern with solid data.
We fully understand that theoretical analysis cannot completely substitute for experimental data. However, we hope the above discussion demonstrates our deep understanding of the method's limitations and our thorough consideration of its generalizability. Your comment has significantly enhanced the depth of our work and provided a clear and crucial direction for our future research. We once again express our sincere gratitude for your rigorous and visionary review.
3)The performance differences you anticipated indeed exist, and this is precisely where the value of our method is demonstrated. The supplementary experimental results and corresponding analysis indicate:
(1) Adaptability of Wavelet Denoising: Under complex channels with fractional Doppler and more paths, the sparsity of the signal in the DD domain decreases, and energy becomes more spread out. Our wavelet denoising module, leveraging its multi-scale analysis capability, can still effectively distinguish between the structural spread of the signal and random noise. Compared to fixed-filter denoising methods, the wavelet transform demonstrates better adaptability in these non-ideally sparse scenarios.
(2) Robustness of CNN Feature Extraction: Even though the channel model has changed, the fundamental advantage of our method – providing the CNN with multi-perspective fused features – remains valid and becomes even more critical.
In complex channels, the performance of the MP algorithm degrades significantly, and its standalone output exhibits a high error rate.
However, the MP-WCNN architecture, through feature fusion, ensures the CNN input contains: the original signal (affected by channel variations but retaining raw information), the wavelet signal (purified via multi-scale analysis), and the MP signal (imperfect but providing a sparse prior constraint).
The CNN, through training, learns how to weigh these three information sources in this more chaotic environment.
Experimental results prove that the performance degradation of MP-WCNN in complex channels is significantly smaller than that of traditional MP algorithms, which heavily rely on channel sparsity priors. Furthermore, MP-WCNN maintains a significant BER performance advantage compared to other baseline methods.
We thank you again for guiding us through this essential validation. By introducing more challenging channel models and conducting comprehensive performance comparisons in the revised manuscript, we have not only addressed your concerns but also substantially enhanced the scientific rigor and persuasive power of our conclusions. These modifications strongly demonstrate that the MP-WCNN method can transcend idealized simplified models and is applicable to high-mobility communication scenarios closer to reality.
Comment 4: The authors should explicitly define the channel gains (hp) for each path.
Response:
Thank you for pointing out this important omission in detail. We fully agree that clearly defining the channel gain for each path is crucial for ensuring the rigor and reproducibility of the experiments.
We have supplemented Equation (3) in Section 2.2, "Dual-Expansion Channel," of the revised manuscript to explicitly define the channel gain. In the simulations conducted in this study, each channel gain is a complex random variable whose magnitude follows a Rayleigh distribution and whose phase is uniformly distributed. The path gains are normalized so that each path carries equal average power, giving a power distribution of [1/3, 1/3, 1/3].
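A minimal sketch of this gain model under the stated assumptions (Rayleigh magnitude, uniform phase, equal average power per path); whether the normalization is applied per realization or on average is an assumption of the sketch:

```python
import numpy as np

def generate_path_gains(num_paths=3, rng=None):
    """Complex path gains h_p with Rayleigh magnitude and uniform phase.

    Each gain is drawn as circularly symmetric complex Gaussian CN(0, 1/num_paths),
    which yields a Rayleigh-distributed magnitude, uniformly distributed phase, and
    an average power of 1/num_paths per path (i.e., [1/3, 1/3, 1/3] for 3 paths).
    """
    rng = np.random.default_rng() if rng is None else rng
    h = rng.standard_normal(num_paths) + 1j * rng.standard_normal(num_paths)
    return h * np.sqrt(1.0 / (2.0 * num_paths))  # scales E|h_p|^2 to 1/num_paths
```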
Once again, we thank you for helping us refine the details of the paper and enhancing the scientific quality of our work.
Comment 5:To strengthen the paper's impact, it is essential to evaluate the proposed method on a standardized and more realistic channel model, such as the Extended Vehicular A (EVA) or Extended Typical Urban (ETU) models with fractional Doppler shifts.
Response:
We sincerely appreciate your highly valuable suggestion. We fully agree that validating the proposed MP-WCNN method on 3GPP-standardized channel models, such as Extended Vehicular A (EVA) and Extended Typical Urban (ETU), is crucial for assessing its generalization ability in practical scenarios and enhancing the impact of the paper. However, after careful consideration, we must honestly state that we are unable to complete the system performance validation under EVA/ETU channel models within the current revision cycle. Constructing and debugging a simulation environment that complies with the 3GPP specifications for EVA/ETU channels and re-evaluating all comparative algorithms in this environment requires substantial time and computational resources, which exceeds the scope of this revision cycle. Standard channel models such as EVA and ETU also demand higher precision from the simulation platform, and their implementation and verification are complex and time-consuming tasks.
Although this revision cannot include this analysis, we highly value your suggestion and have taken the following proactive measures in response. In the revised manuscript, a new paragraph has been added to Chapter 4, "Discussion," candidly stating that the current study primarily conducts preliminary validation of the method's effectiveness based on a custom simplified channel model, and explicitly noting that "evaluating it on more challenging standardized channel models (such as 3GPP EVA/ETU) is an important direction for future work."
Formal commitment to future work: In Chapter 5, "Conclusion and Future Work," we have clearly listed "comprehensively evaluating the robustness and generalization ability of MP-WCNN under 3GPP standardized channel models (such as EVA and ETU)" as a defined and prioritized future research plan.
We firmly believe that your guidance provides a clear and important direction for the deepening and improvement of our subsequent research. We commit to immediately carrying out this research in the next phase of our work, with the aim of fully validating the value of the MP-WCNN method in more realistic communication scenarios. Once again, thank you for helping us enhance the depth and impact of our research.
Comment 6: While Table 2 lists the layers, critical design choices lack justification.
1)Why were specific kernel sizes (7x7, 5x5, 3x3) chosen? What is the rationale behind the chosen number of filters (128, 64, 32)?
2)The input tensor is NxMx6. A detailed breakdown of what these 6 channels represent (e.g., Real(Y), Imag(Y), Real(X_MP), Imag(X_MP), Real(F_wavelet), Imag(F_wavelet)) is necessary for reproducibility.
3)The choice of RMSE as the loss function for a classification-like task (symbol detection) is unconventional. Mean Square Error (MSE) or cross-entropy are more common. The rationale for using RMSE should be explained.
The authors should provide a clear table or description mapping the 6 input channels to their sources. A brief justification for the CNN hyperparameters (kernel sizes, filter numbers) based on the structure of the DD domain grid would be very helpful. The choice of RMSE over MSE should be justified.
Response:
Thank you for your profound and valuable insights regarding the details of the model architecture. The issues you highlighted—such as the lack of design rationale, ambiguous definition of input channels, and insufficient justification for the loss function selection—are crucial for ensuring the interpretability and reproducibility of the method. We have addressed each of your comments in detail within the manuscript, with specific responses as follows:
1)Thank you for requesting clarification on these key design choices. We have provided detailed justifications in Section 2.5, "MP-WCNN-based OTFS System Signal Detection Model."
Rationale for Convolutional Kernel Sizes (Decreasing from Large to Small):
First Layer (7×7): Interference in the OTFS delay-Doppler domain is global, meaning a single symbol can be affected by multiple adjacent symbols. Using a larger 7×7 kernel in the initial layer allows the model to capture a sufficiently large receptive field from the outset, enabling it to detect inter-symbol interference caused by multipath and Doppler effects over a broader range. This establishes a global contextual foundation for subsequent feature extraction.
Intermediate Layers (5×5): These serve as a transition from global to local feature extraction.
Deeper Layers (3×3): As network depth increases, feature maps become more abstract. At this stage, learning complex and fine-grained nonlinear feature combinations becomes more critical than expanding the receptive field. Using small 3×3 kernels in deeper layers enhances nonlinear expressive capability and represents a best practice for parameter efficiency and preventing overfitting.
Rationale for Number of Channels: The gradual reduction in the number of filters from 128 to 32 follows a classic "funnel" design, aligning with the semantic evolution of feature maps. Shallow feature maps typically contain more fundamental features (e.g., various edge or texture interference patterns), requiring a larger number of filters (128) to extensively detect and represent these basic characteristics. As the network deepens, features become more abstract and high-level, allowing efficient encoding with fewer filters (64, 32). This "wide-to-narrow" structure also helps optimize computational cost and suppress overfitting on limited datasets.
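A minimal PyTorch-style sketch of this funnel design, assuming a 6-channel delay-Doppler input and the kernel/filter choices discussed above; the padding scheme and the Tanh regression head are illustrative assumptions rather than a reproduction of the paper's exact layer table:

```python
import torch
import torch.nn as nn

class MPWCNNSketch(nn.Module):
    """Illustrative funnel-shaped CNN: wide 7x7/128 front end, 5x5/64 middle,
    3x3/32 back end, ending in a 2-channel Tanh regression head (real/imag)."""
    def __init__(self, in_channels=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=7, padding=3),  # large receptive field
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, kernel_size=5, padding=2),           # global-to-local transition
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=3, padding=1),            # fine-grained nonlinear features
            nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(32, 2, kernel_size=1)                 # per-symbol (real, imag) estimate
        self.act = nn.Tanh()                                        # bounded constellation coordinates

    def forward(self, x):                      # x: (batch, 6, N, M), channels-first convention
        return self.act(self.head(self.features(x)))                # (batch, 2, N, M)
```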
2)Your point is essential for reproducibility. We have added a clear description in Section 2.5 under the input preprocessing module, defining the six input channels as follows (a short construction sketch follows the list):
Channel 1: Real part of the wavelet-denoised signal, Real(F_wavelet)
Channel 2: Imaginary part of the wavelet-denoised signal, Imag(F_wavelet)
Channel 3: Real part of the MP-enhanced signal, Real(X_MP)
Channel 4: Imaginary part of the MP-enhanced signal, Imag(X_MP)
Channel 5: Real part of the original received signal, Real(Y)
Channel 6: Imaginary part of the original received signal, Imag(Y)
We appreciate your suggestion, and we have clearly presented this mapping in the revised manuscript in a descriptive manner.
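For reproducibility, the construction referenced above can be sketched as follows (array names are placeholders; the ordering matches the channel list):

```python
import numpy as np

def build_input_tensor(F_wavelet, X_MP, Y):
    """Stack three complex N x M grids into an N x M x 6 real-valued input tensor,
    following the channel order listed above."""
    channels = [F_wavelet.real, F_wavelet.imag,   # channels 1-2: wavelet-denoised signal
                X_MP.real,      X_MP.imag,        # channels 3-4: MP-enhanced signal
                Y.real,         Y.imag]           # channels 5-6: original received signal
    return np.stack(channels, axis=-1)            # shape (N, M, 6)
```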
3)This is a very insightful observation. Our choice of RMSE over more common classification loss functions is based on our specific understanding of the OTFS signal detection task. We have supplemented the rationale in Section 2.6, "Model Training": Although symbol detection is a classification problem at the decision stage, our CNN output layer (using the Tanh activation function) is designed to regress continuous estimates of the transmitted symbols (i.e., constellation point coordinates), making it a regression task.
RMSE vs. MSE: RMSE is the square root of MSE. We chose RMSE because it shares the same units as the target values (real/imaginary parts of the symbols), making the loss value more intuitively interpretable. Moreover, since the gradient of RMSE equals the MSE gradient scaled by 1/(2·RMSE), the effective gradient magnitude is less sensitive to the overall error scale, which we found conducive to stable training across different SNR conditions.
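A minimal sketch of this objective over the regressed real/imaginary symbol coordinates, assuming PyTorch tensors; the small epsilon guarding the square root is an implementation detail of the sketch:

```python
import torch

def rmse_loss(pred, target, eps=1e-8):
    """Root-mean-square error over predicted (real, imag) symbol coordinates.

    pred, target: tensors of shape (batch, 2, N, M). Taking the square root of
    MSE keeps the loss in the same units as the constellation coordinates and
    scales the gradient by 1/(2*RMSE), damping its sensitivity to error scale.
    """
    return torch.sqrt(torch.mean((pred - target) ** 2) + eps)
```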
We thank you again for your rigorous review. By incorporating the detailed design rationale, clear input channel definitions, and justification for the loss function selection, we believe the description of the MP-WCNN model now possesses sufficient interpretability and reproducibility. All these modifications have been implemented in the revised manuscript.
Comment 7:The manuscript states that at 16dB SNR, the proposed method "improves the signal-to-noise ratio gain of 9dB compared to the MP algorithm." This phrasing is ambiguous. It should be rephrased to clearly state that MP-WCNN achieves the same BER as the MP algorithm at an SNR that is 9 dB lower. Furthermore, the analysis is primarily graphical (BER curves); a quantitative analysis of complexity (e.g., number of floating-point operations, FLOPs, or training/inference time compared to other methods) is missing.
Response:
Thank you for these two crucial revision suggestions. Your request for precise performance gain statements and your recommendation to supplement quantitative complexity analysis will significantly enhance the rigor and persuasiveness of our results. Following your advice, we have made the following important modifications in the revised manuscript:
We fully accept your critique and appreciate you correcting this inaccurate expression. We have thoroughly searched for and revised all such vague statements throughout the revised manuscript.
We have uniformly and precisely replaced all original ambiguous phrases like "the signal-to-noise ratio gain is improved by 9 dB" with the following:
"When achieving the same Bit Error Rate (BER), the proposed MP-WCNN method can obtain approximately a 9 dB Signal-to-Noise Ratio (SNR) gain compared to the MP algorithm."
We greatly appreciate your valuable suggestion regarding the addition of quantitative complexity analysis. We completely agree that analyzing computational complexity alongside BER performance is essential for comprehensively evaluating the practical value of a detection algorithm. After careful self-assessment, we must candidly state that within the current revision cycle, we are unable to complete precise FLOPs calculations and unified hardware platform timing measurements for all compared algorithms. This is primarily because accurate FLOPs analysis requires in-depth dissection and modeling of the underlying implementations of comparative algorithms like VB and AMP, which exceeds our current time and resource budget.
However, we firmly believe your suggestion is crucial. Therefore, we have taken the following measures to actively address your core concerns, and have made corresponding modifications in the paper:
We have added a new paragraph on qualitative computational complexity analysis in the "4. Discussion" section of the manuscript. We emphasize the following points:
The advantage of MP-WCNN lies in its single forward-pass inference characteristic. Although a single computation involves convolutional operations, its complexity is fixed, avoiding the uncertainty inherent in iterative algorithms.
In contrast, the complexity of iterative algorithms like MP and VB is linearly related to the number of iterations. To achieve acceptable performance, they typically require multiple iterations, leading to higher and non-deterministic computational latency. A rough order-of-magnitude comparison is sketched below.
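As a hedged order-of-magnitude comparison (the symbols below are our notational assumptions, not figures from the manuscript): for an N×M delay-Doppler grid, P resolvable paths, constellation size Q, and I iterations, the cost of MP-style iterative detection is commonly estimated as O(I·NM·P·Q), whereas a convolutional forward pass incurs a fixed cost given by the sum of its per-layer multiply-accumulate operations:

```latex
\mathcal{C}_{\mathrm{MP}} \approx \mathcal{O}\!\left( I \cdot NM \cdot P \cdot Q \right),
\qquad
\mathcal{C}_{\mathrm{CNN}} \approx \sum_{\ell=1}^{L} NM \, k_{\ell}^{2} \, C_{\ell-1} \, C_{\ell},
```

where k_ℓ and C_ℓ denote the kernel size and channel count of layer ℓ (e.g., 6→128→64→32 with k = 7, 5, 3). The CNN cost is fixed and independent of any iteration count, which is precisely the point emphasized above.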
We have made a formal commitment in the "5. Conclusion" section:
"In future work, we will conduct rigorous hardware-in-the-loop testing and precise quantitative complexity analysis (including FLOPs and real-time inference latency) for MP-WCNN and its comparison algorithms to further clarify their practical value in different deployment scenarios (e.g., base stations, vehicle-mounted terminals)."
We thank you again for this rigorous and necessary feedback. Although we cannot provide the complete quantitative data you expected at this time, we have substantively addressed your core concerns regarding algorithm efficiency through in-depth qualitative analysis of the mechanisms and discussion of system-level efficiency. We believe this supplementary discussion has significantly enhanced the depth and rigor of the paper. Your guidance has clearly charted the course for our future research plans.
Comment 8:Rephrase the performance gain claim for clarity.
Response:
Thank you very much for pointing out this unclear expression. We fully agree that statements regarding performance gains must be precise, unambiguous, and conform to standard conventions in the communications field. We have revised such statements into a clear form: at the same bit error rate (BER), the proposed MP-WCNN method achieves approximately a 9 dB signal-to-noise ratio (SNR) gain compared to the MP algorithm.
We have ensured that this precise wording is used in key sections of the paper, including the abstract, results, and conclusion. Thank you again for helping us improve the rigor of our paper's presentation.
Comment 9:Include a comparative analysis of computational complexity. This is critical for assessing the practical trade-off between performance gain and added computational cost.
Response:
Thank you very much for insisting on this crucial analysis. We fully agree that comparing computational complexity is essential for evaluating the practical value of the proposed method. Following your suggestion, we have added a subsection titled "Computational Complexity Analysis" in the "3. Results" section of the revised manuscript. In this section, we provide a detailed comparative analysis of the theoretical computational complexity of various algorithms. Combining BER performance with the above complexity analysis, we have deepened the discussion on trade-offs in the "4. Discussion" section. We emphasize that the significant BER performance gain provided by MP-WCNN (e.g., a 9 dB SNR improvement) implies, from a system-level perspective, that the increased computational load at the receiver can be compensated by substantially reducing the transmission power, which is of great value for building green communication systems.
We sincerely appreciate your guidance. By supplementing the above theoretical complexity analysis and discussion, our paper now offers readers a clearer and more comprehensive picture of the trade-offs between performance improvement and computational cost. All supplementary content has been integrated into the revised manuscript.
Minor Points and Suggestions
Comment 10: Abstract & Conclusion: The sentence regarding the 9dB gain is repeated verbatim in the Abstract and Conclusion. Consider rephrasing in the conclusion to provide a more concise summary.
Response:
Thank you very much for your detailed and valuable suggestion. We fully agree that the conclusion should provide a more concise and deeper summary of the entire work, rather than merely repeating the content of the abstract. Following your advice, we have revised the '5. Conclusion' section of the paper, focusing on rewriting the sentence that included '9dB gain' to differentiate it from the abstract and better serve the purpose of the conclusion.
Original text (repetitive sentence in abstract and conclusion): '...When the SNR is 16dB, the proposed MP-WCNN method achieves a 9dB SNR gain compared with the MP algorithm...'.
Revised conclusion statement: 'Experimental results demonstrate that the proposed MP-WCNN method effectively addresses the low iteration efficiency and significant error propagation of traditional MP algorithms, as well as the insufficient feature extraction of pure CNN models. Notably, the method achieves a bit error rate comparable to that of the conventional MP algorithm at an SNR approximately 9 dB lower, highlighting its substantial advantage in power efficiency. Additionally, the method does not require pilot-assisted acquisition of channel state information, significantly reducing system overhead.'
We believe that with this revision, the conclusion is now more concise and stronger, and it effectively avoids repeating the abstract.
Once again, we sincerely thank you for helping us improve the writing quality of our paper.
Comment 11: Figure Quality: The final manuscript must ensure all figures are present and of high quality.
Response:
Thank you very much for reminding us to pay attention to the critical detail of figure quality. We completely agree that clear, high-quality figures are essential to ensuring the readability and professionalism of the paper. We have carefully reviewed all the figures in the manuscript one by one to ensure that none were missed and that all references in the text are accurate. We have also regenerated all simulation diagrams and block diagrams, ensuring they are exported and saved in high-resolution format. This guarantees clear display in both print and electronic versions, without any blurriness or pixelation. We have checked and standardized the fonts, line widths, and labels in all figures to ensure a consistent style and proportion, and all text and symbols in the figures are of a legible size.
We have updated the final revised manuscript with the complete set of high-quality figures. Thank you again for your meticulous review, which has helped us enhance the overall quality of the manuscript.
Comment 12: Grammar and Language: The manuscript requires thorough proofreading for minor grammatical errors and awkward phrasing.
Response:
Thank you very much for pointing out the issues in the manuscript's language expression. We fully agree that clear and accurate language is the foundation for effectively conveying academic findings. Following your suggestions, we have conducted a thorough, word-by-word proofreading and language check of the entire manuscript. We systematically corrected grammar issues such as tense, articles, prepositions, and subject-verb agreement. Sentences that were ambiguous, redundant, or awkwardly phrased have been rewritten to better align with academic English writing conventions. Special attention was given to the accuracy and consistency of technical terminology. We also adjusted the logical flow of certain paragraphs to make the argumentation smoother and more coherent. This work was carried out under cross-review by multiple team members proficient in academic English writing, aiming to ensure a comprehensive improvement in the manuscript's language quality. We believe that with this thorough language refinement, the manuscript's readability and professionalism have been significantly enhanced. Once again, we sincerely appreciate your valuable time and meticulous review.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Authors have addressed all my comments successfully.
Reviewer 2 Report
Comments and Suggestions for Authors
I have no other problems.
Reviewer 3 Report
Comments and Suggestions for Authors
Lines 80 to 96: "Chapter" replaced by "Section".