4.1. Experimental Setup and Evaluation Metrics
In all experiments, the random seed was fixed at 0 to ensure reproducibility. During training, each batch processed 30 images, and stochastic gradient descent (SGD) was used as the optimizer to update the network parameters. The maximum number of training epochs was 2000. To prevent overfitting and improve training efficiency, an early stopping strategy was implemented: if model performance showed no significant improvement over 50 consecutive epochs, training was stopped automatically. All experiments were conducted on a computer equipped with three NVIDIA RTX 3090 GPUs (Dell Inc., Xiamen, China), each with 24 GB of VRAM. This hardware configuration ensured efficient data processing and computational speed during training, while also enabling highly parallelized model training and validation. The dataset consisted of time–frequency maps with SNRs ranging from −8 dB to 6 dB, with 500 images per SNR level, for a total of 4000 images. At each SNR level, 20% of the images were used as the validation set and 80% as the training set.
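As a concrete illustration, the training configuration described above can be summarized in a short script. All identifiers here are illustrative and not taken from the authors' code:

```python
import random

# Sketch of the experimental configuration; field names are illustrative.
CONFIG = {
    "seed": 0,                  # fixed random seed for reproducibility
    "batch_size": 30,           # images per batch
    "optimizer": "SGD",
    "max_epochs": 2000,
    "early_stop_patience": 50,  # stop after 50 epochs without improvement
    "val_split": 0.2,           # 20% of each SNR level held out for validation
}

def set_seed(seed: int) -> None:
    random.seed(seed)

def split_dataset(images_per_snr: int, snr_levels: int, val_split: float):
    """Return (train, val) image counts for the per-SNR-level 80/20 split."""
    val = int(images_per_snr * val_split) * snr_levels
    train = images_per_snr * snr_levels - val
    return train, val

set_seed(CONFIG["seed"])
# 8 SNR levels: -8, -6, ..., +6 dB in 2 dB steps, 500 images each
train_n, val_n = split_dataset(500, 8, CONFIG["val_split"])
print(train_n, val_n)  # 3200 800
```

With 500 images at each of the eight SNR levels, the split yields 3200 training and 800 validation images, matching the 4000-image total stated above.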
In comparing signal detection results, the mean average precision (mAP), a commonly used metric in machine learning-based object detection, was adopted. mAP takes into account true positives (TPs), false positives (FPs), and false negatives (FNs) to provide a comprehensive evaluation. The metric was calculated as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

$$\mathrm{mAP} = \frac{1}{M}\sum_{m=1}^{M}\frac{1}{N}\sum_{n=1}^{N}\int_{0}^{1} P_{m,n}(R)\,\mathrm{d}R$$

where TP is the number of correctly identified signal samples, FN is the number of incorrectly identified or unrecognized signal samples, and FP is the number of incorrectly recognized signal target samples. M represents the number of signal sample categories, and N represents the number of IoU threshold values. Precision and recall are denoted as P and R, respectively.
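The precision, recall, and average precision computations described above can be sketched in a few lines (an illustration, not the authors' code; the AP integral is approximated by the trapezoidal rule over a sampled precision–recall curve):

```python
def precision(tp: int, fp: int) -> float:
    # P = TP / (TP + FP)
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    # R = TP / (TP + FN)
    return tp / (tp + fn) if tp + fn else 0.0

def average_precision(pr_curve):
    """Trapezoidal approximation of AP = integral of P(R) dR.

    `pr_curve` is a list of (recall, precision) points sorted by recall.
    """
    ap = 0.0
    for (r0, p0), (r1, p1) in zip(pr_curve, pr_curve[1:]):
        ap += (r1 - r0) * (p0 + p1) / 2.0
    return ap

# Example: a perfect detector has P = 1 at every recall level, so AP = 1.
print(average_precision([(0.0, 1.0), (1.0, 1.0)]))  # 1.0
```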
mAP50 measured the average precision when the intersection over union (IoU) threshold between the predicted box and the ground-truth box was 0.5. mAP50-95 measured the average precision across IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05; a higher mAP50-95 indicated more accurate localization and greater robustness to prediction errors. These metrics, commonly used in object detection, were adapted here to evaluate the model's ability to accurately localize signals in the time–frequency domain.
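The IoU computation and the mAP50-95 threshold sweep can be sketched as follows (a minimal illustration with assumed corner-style box coordinates, not the evaluation code used in the paper):

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# mAP50-95 averages AP over the ten IoU thresholds 0.50, 0.55, ..., 0.95.
THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]

def map50_95(ap_at_threshold):
    """`ap_at_threshold` maps an IoU threshold to the AP measured there."""
    return sum(ap_at_threshold(t) for t in THRESHOLDS) / len(THRESHOLDS)
```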
4.2. Ablation Study
This section first explores the effectiveness of improving the C2f module to DCFFN and enhancing the loss function. The baseline model used is YOLOv8n, with the experimental results shown in
Table 2 and
Figure 7.
As shown in
Table 2, using DCFFN not only improved the signal detection mAP50-95 from 0.807 to 0.845 but also resulted in a less than 2% increase in the number of parameters. The Focal_SIoU mechanism increased the gradient adjustment weight for difficult samples under low-SNR conditions, thereby improving the accuracy under low-SNR conditions, as well as the overall accuracy. The performance of the network reached its optimal point when both DCFFN and Focal_SIoU were introduced, with mAP50-95 reaching 0.85, proving that both components effectively enhanced the detection accuracy.
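As a rough sketch of how a focal weighting can modulate an IoU-type regression loss, the following follows the generic IoU**gamma focal factor; the exact Focal_SIoU formulation (including the SIoU angle, distance, and shape terms) used in the paper may differ:

```python
def focal_iou_loss(iou_value: float, base_loss: float, gamma: float = 0.5) -> float:
    """Focal re-weighting of an IoU-based regression loss.

    The base loss (standing in here for the SIoU loss) is scaled by a focal
    factor IoU**gamma, so each sample's contribution to the gradient is
    modulated by its localization quality. `gamma` controls how strongly the
    weighting discriminates between samples; 0.5 is an assumed value.
    """
    return (iou_value ** gamma) * base_loss

# With gamma = 0.5, a box at IoU 0.25 keeps half of its base loss weight:
weighted = focal_iou_loss(0.25, base_loss=1.0)  # 0.5
```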
Subsequently, based on DCFFN, this section investigates the impact of replacing the C2f module with various SOTA methods on the performance of the baseline model. Specifically, in this ablation experiment, other modular improvements were considered: OREPA, ContextGuided, DLKA, DCNv2-Dynamic, and SCConv. These improvements targeted the baseline model’s C2f structure and aimed to enhance the model’s spatial sensitivity and expressive power by introducing different attention mechanisms or convolution variants. The experimental results are shown in
Table 3. The results demonstrate that DCFFN significantly improved model performance while maintaining a relatively low computational cost, highlighting its effectiveness in adjusting the model’s focus to capture richer features.
OREPA [
21] used more parameters while reducing the computational load, and achieved slight improvements in both mAP50 and mAP50-95. ContextGuided [
25] reduced both the parameter count and the computational cost, but its mAP50-95 was only 0.658, far below the baseline's 0.807. DCNv2-Dynamic [
20,
23], with parameters and computational load similar to the baseline, increased mAP50 to 0.93 but decreased mAP50-95, indicating that localization accuracy under stricter IoU thresholds was sacrificed for better coarse detection. SCConv [
24] and DLKA [
26] used more parameters and computational resources and improved mAP50 over the baseline, indicating that they detected signals better when the localization requirements were relaxed. However, their mAP50-95 scores were only 0.781 and 0.668, respectively, suggesting that their signal parameter estimates were not sufficiently accurate. In contrast, DCFFN, with relatively low FLOPs and a moderate parameter count, significantly improved both mAP50 and mAP50-95, excelling especially on the more challenging mAP50-95 metric, where it reached 0.845.
Next, this section applies DCFFN at different stages in place of the original C2f modules, and the results are shown in
Table 4. Experiments 1 and 2 show that applying DCFFN at earlier stages yielded smaller performance gains, and the subsequent experiments confirmed that using more DCFFN modules at earlier stages further reduced the detection accuracy. The highest gain was therefore achieved when only the C2f at Stage 4 was replaced. Furthermore, using multiple DCFFN modules significantly increased the number of parameters and the computational cost; at the same batch size, more GPU memory was also required during training, raising the hardware requirements.
To further investigate the impact of critical hyperparameters in the DCFFN module, we conducted experiments varying the number of attention heads, the offset range factor, and whether positional encoding (PE) was used. The results are shown in
Table 5.
The experimental results demonstrate that increasing the number of attention heads from four to eight improved detection performance. Specifically, mAP50-95 increased from 0.838 to 0.845, a gain of 0.7 percentage points. This improvement came at the cost of a moderate parameter increase from 3.02 M to 3.08 M and a FLOP increase of only 0.1 G, which was considered acceptable.
Regarding the offset range factor, increasing it from 2 to 4 while disabling PE caused a performance drop, with mAP50-95 decreasing from 0.845 to 0.827 (a relative decline of 2.1%), indicating that excessive offset range may harm spatial focus when not properly guided by position information.
Using positional encoding also showed notable benefits. When PE was disabled (with eight attention heads and an offset range factor of 4), mAP50-95 dropped to 0.827, while enabling PE under the same attention configuration yielded 0.845. This 2.2% relative improvement attributable to PE highlights its effectiveness in modeling spatial dependencies under deformation.
Overall, the configuration of eight attention heads, offset_range_factor = 2, and use_pe = True achieved the best trade-off between accuracy and computational cost. In scenarios with limited computational resources, reducing the number of heads to four still maintained relatively high performance (mAP50-95 = 0.838) while reducing the parameter count by approximately 2%.
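The best-performing and low-resource DCFFN configurations from the hyperparameter study can be captured as simple settings dictionaries (field names are illustrative, not the authors' actual code):

```python
# Best trade-off found in the hyperparameter study (illustrative names).
BEST_DCFFN = {
    "num_heads": 8,            # 8 heads gave mAP50-95 = 0.845 vs. 0.838 with 4
    "offset_range_factor": 2,  # 4 degraded mAP50-95 when PE was disabled
    "use_pe": True,            # positional encoding helps under deformation
}

# Low-resource variant: ~2% fewer parameters, mAP50-95 = 0.838.
LOW_RESOURCE_DCFFN = {**BEST_DCFFN, "num_heads": 4}
```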
4.4. Performance Comparisons
In this section, DFN-YOLO is compared with SOTA methods, including the YOLOv5 model used in the signal detection method mentioned in Section 1 [
16], SigdetNet [
43], YOLOv9 [
51], YOLOv10 [
52], and YOLOv11 [
53]. To further evaluate the effectiveness of DFN-YOLO, the ED method [
5] is also implemented as a traditional baseline, which first filters the original signal into narrower frequency bands to reduce the impact of broadband noise in the time domain.
The experimental results for the SNR range of −8 dB to 6 dB are shown in
Table 7, and the performance curve of signal detection as a function of SNR is illustrated in
Figure 8. As can be seen from
Table 7, DFN-YOLO achieves an mAP50-95 of 0.85, outperforming all compared models. Its parameter size (3.08 M) and FLOPs (8.2 G) are slightly higher than YOLOv8n and moderately larger than recent versions such as YOLOv10n and YOLOv11. Compared to SigdetNet and YOLOv5n, DFN-YOLO shows significant improvements in accuracy while keeping the computational burden within an acceptable range.
In terms of inference speed, DFN-YOLO achieves a latency of 1.89 ms, which is only 0.19 ms slower than YOLOv8n (1.7 ms) and faster than YOLOv9-T (2.56 ms), YOLOv10n (1.81 ms), and YOLOv11 (1.75 ms). This indicates that despite the added complexity from DCFFN and deformable attention, the model maintains competitive runtime performance.
In terms of accuracy gain, DFN-YOLO outperforms YOLOv5n by 9.2 percentage points in mAP50-95 (0.850 vs. 0.758), YOLOv8n by 4.3 points (0.850 vs. 0.807), YOLOv9-T by 1.9 points (0.850 vs. 0.831), YOLOv10n by 1.2 points (0.850 vs. 0.838), and YOLOv11 by 0.7 points (0.850 vs. 0.843). This demonstrates that DFN-YOLO consistently achieves the best accuracy among all YOLO variants, especially under low-SNR conditions.
The cost of these improvements is modest: compared with YOLOv8n, DFN-YOLO has only a 2.2% increase in parameters (3.08 M vs. 3.01 M) and comparable FLOPs (8.2 G). Compared with YOLOv10n and YOLOv11, its parameter size is slightly higher (by 0.46 M and 0.45 M, respectively), but the detection performance improves by 1.2 and 0.7 percentage points.
From a practical deployment perspective, DFN-YOLO’s latency of 1.89 ms per inference corresponds to a throughput of approximately 529 FPS, exceeding the requirement for real-time signal monitoring applications.
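The latency-to-throughput conversion used above is simply the reciprocal of the per-image latency:

```python
def throughput_fps(latency_ms: float) -> float:
    """Convert per-image inference latency (ms) to throughput (frames/s)."""
    return 1000.0 / latency_ms

print(round(throughput_fps(1.89)))  # 529
```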
Therefore, despite the slight increase in model size and latency, the consistent performance improvement across all metrics indicates that DFN-YOLO achieves a favorable trade-off between detection accuracy and computational cost, supporting its real-world applicability in low-SNR signal detection tasks.
In
Figure 8, it can be observed that when the SNR reaches 0 dB or higher, the mAP50-95 exceeds 0.9. As the SNR increases to 6 dB, the mAP50-95 improvement is only 0.03. At this point, the performance of DFN-YOLO is close to those of YOLOv9 to YOLOv11, but it still maintains optimal performance. When the SNR is below 0 dB, using the Focal_SIoU function in the proposed method allows it to focus more effectively on handling low-quality and more challenging data under low-SNR conditions, maintaining a high detection accuracy. As a result, even at −8 dB, the mAP50-95 still reaches 0.6, and at −4 dB and −2 dB, the proposed method significantly outperforms the other models. While the ED method achieves an mAP50-95 of only 0.133 at −2 dB and almost zero at SNRs below −2 dB, DFN-YOLO maintains robust performance, with an mAP50-95 score reaching 0.6 at −8 dB. Moreover, although the ED method performs better at high SNRs (above 4 dB), its inference time is much longer (6.57 s per signal image), making it less suitable for real-time or large-scale signal detection tasks. In contrast, DFN-YOLO demonstrates a favorable balance between accuracy and computational efficiency, especially under challenging conditions such as low-SNR and broadband scenarios, which are the focus of this study.
When the model achieves optimal performance, evaluating its accuracy in terms of start and stop time estimation, as well as center frequency estimation, becomes crucial. This not only helps to validate the overall performance of the model but also provides the essential foundation for subsequent signal separation and processing tasks. In this section, we conduct a detailed analysis of the estimation errors for signal parameters, including start–stop time errors and center frequency errors. These analysis results will help us to gain a deeper understanding of the model’s precision and provide valuable data to support further optimization and improvement. The data in
Table 8 report the parameter estimation errors under the condition that the signal detection IoU is greater than 0.5.
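Assuming normalized detection-box coordinates with time along the horizontal axis (spanning 0.04 s) and frequency along the vertical axis (spanning 20 MHz), the start–stop times and center frequency can be recovered from a box as follows. The axis convention and function names are assumptions for illustration:

```python
def box_to_signal_params(box, t_span=0.04, f_span=20e6):
    """Map a normalized detection box to signal parameters.

    `box` is (x1, y1, x2, y2) with coordinates in [0, 1]; time is assumed to
    run along x over `t_span` seconds and frequency along y over `f_span` Hz.
    Returns (t_start, t_stop, f_center).
    """
    x1, y1, x2, y2 = box
    t_start, t_stop = x1 * t_span, x2 * t_span
    f_center = (y1 + y2) / 2.0 * f_span  # box vertical midpoint -> center freq
    return t_start, t_stop, f_center

# A box covering the middle half of the time axis, centered in frequency:
t0, t1, fc = box_to_signal_params((0.25, 0.4, 0.75, 0.6))
```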
According to the analysis in
Table 8, there is no significant difference in center frequency estimation errors across models. Relative to the 20 MHz frequency range of the time–frequency plot, the maximum average relative error is 0.14%. As the SNR increases, all models estimate the signal's start time, end time, and center frequency more accurately. A horizontal comparison reveals that DFN-YOLO already achieves high start–stop time estimation accuracy at −8 dB, with a markedly lower average error than the other models. Considering that the time range of the time–frequency plot is 0.04 s, the start and end time errors of the proposed method at −8 dB account for only 0.1% of the entire time–frequency plot.
Combining the data from
Table 7 and
Table 8, DFN-YOLO performs the best, followed by YOLOv11n, with YOLOv10n and YOLOv9-T slightly lagging behind, while YOLOv5 shows a significant performance gap compared to the other models. This indicates strong consistency between the signal parameter estimation errors and the signal detection performance of each model.
Notably, although the ED method achieves correct detections under certain SNR conditions, its parameter estimation accuracy is, overall, still inferior to that of DFN-YOLO. However, the ED method performs slightly better in center frequency estimation in the cases where it detects successfully. This is because it first applies narrowband filtering, which inherently restricts the possible estimation error range in the frequency domain: the detected signals are constrained within a narrower band, leading to smaller estimation deviation. At −8 dB, ED fails to detect any signals, so it is excluded from comparison under this condition.
Since the signals in the dataset have a wide bandwidth, slight shifts in the detection box may lead to large errors in center frequency estimation. However, methods such as frequency shifting, low-pass filtering, and down-sampling can be used to make more accurate frequency estimates for narrowband signals. DFN-YOLO can accurately estimate the time, crop the signal in the time domain, and provide an initial estimate of the signal’s center frequency, sufficient to support subsequent signal separation and precise frequency estimation tasks.
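The frequency-shifting and down-sampling steps mentioned above can be sketched with plain-Python complex mixing and naive decimation (illustrative only; a practical chain would insert a proper low-pass filter before decimating):

```python
import cmath
import math

def shift_to_baseband(samples, f_center, fs):
    """Mix a signal down so `f_center` lands at 0 Hz (frequency shifting).

    `samples` is a sequence of real or complex samples at sample rate `fs`.
    """
    return [s * cmath.exp(-2j * math.pi * f_center * n / fs)
            for n, s in enumerate(samples)]

def downsample(samples, factor):
    """Naive decimation by `factor` (a real chain low-pass filters first)."""
    return samples[::factor]

# A complex tone at f_center becomes a constant (DC) after mixing:
fs, f0 = 1000.0, 100.0
tone = [cmath.exp(2j * math.pi * f0 * n / fs) for n in range(8)]
baseband = shift_to_baseband(tone, f0, fs)
```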
To provide a quantitative perspective, we further examine the detection performance of DFN-YOLO in the four representative examples of
Figure 9. To better illustrate the impact of decreasing SNR on detection performance, four representative SNR levels are chosen: +6 dB as a high-quality baseline, and 0 dB, −4 dB, and −8 dB to demonstrate the degradation under progressively noisier conditions. At +6 dB, all six signals are correctly detected with no false alarms or missed detections. At 0 dB, six signals are correctly detected, again with no false alarms or missed detections. At −4 dB, four signals are correctly detected, two are missed, and no false alarms occur. At −8 dB, three signals are correctly detected, while three are missed and one false alarm is observed. These results clearly reflect the impact of the SNR on detection performance and are consistent with the statistical results in
Table 8.
To evaluate the performance of DFN-YOLO under extremely low-SNR conditions not included in the training range, we tested the model trained with SNRs from −8 dB to +6 dB on two additional test sets at −10 dB and −12 dB. As shown in
Table 9, the model still achieved acceptable detection performance, with mAP50 scores of 0.402 at −10 dB and 0.445 at −12 dB, demonstrating strong generalization capability. Moreover, after retraining the model with additional −10 dB and −12 dB data, only marginal improvements were observed (e.g., mAP50 increased to 0.497 at −10 dB and 0.525 at −12 dB), further confirming the robustness of the proposed method.
To further validate robustness, a new model trained using −12 dB to +6 dB data was also evaluated on the same test sets. The results demonstrate similar performance, with only marginal improvement, showing that DFN-YOLO is robust to variations in noise levels. Compared with other YOLO variants and the traditional ED method, DFN-YOLO maintains significantly better detection accuracy and lower false alarm rates at extremely low SNRs.
These results highlight the practical potential of the proposed method in challenging wireless sensing scenarios.