1. Introduction
Maritime information technology plays an important role in marine resource management and environmental protection, supporting environmental monitoring, early warning, resource exploitation, and disaster prevention and response [1,2]. Modern maritime observation systems make extensive use of unmanned mobile vehicles equipped with various sensors to collect oceanographic data comprehensively [3]. These platforms are essential for marine data collection, environmental monitoring, resource development, and disaster forecasting [4,5]. However, the number and variety of devices carried by unmanned surface vehicles (USVs) produce multi-band self-noise during operation, which can significantly degrade the performance of onboard acoustic systems [6]. The problem is most pronounced when the platform carries underwater acoustic transceivers, since the self-noise may interfere with or entirely mask the target signals. A thorough examination of the spatiotemporal characteristics of USV self-noise, together with dedicated noise identification and suppression techniques, is therefore of both theoretical and practical value for improving marine target detection and the accuracy of underwater acoustic signal recognition.
Studies of the spatiotemporal characteristics of USV self-noise identify three principal noise generation mechanisms: propeller cavitation noise [7], motor vibration noise [8], and structural resonance noise [9]. These sources vary markedly in the frequency domain and are subject to considerable time delays and multipath propagation effects [10]. During USV operation, propeller rotation generates significant mechanical vibration along the propulsion shaft, which manifests as two distinct types of noise: broadband noise and discrete tonal noise [11]. Broadband noise is aperiodic, and its generation is closely linked to turbulent boundary layers and cavitation [12,13,14]. Discrete tonal noise, by contrast, appears as pronounced peaks at particular frequencies and is especially evident in USV self-noise. It originates mainly from periodic excitation, such as rotor–stator interaction caused by impeller rotation and periodic flow separation. Typical components include the rotation frequency (RF), the blade passing frequency (BPF), and their harmonics. USV self-noise therefore has a complex multi-source character, with time-delay and multipath features in near-field transmission [15]. These spatiotemporal characteristics provide the theoretical basis for identifying noise sources accurately and designing effective noise reduction strategies.
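As a small worked example of the tonal components named above, the rotation frequency and the blade passing frequency follow directly from shaft speed and blade count (RF = RPM / 60, BPF = blade count × RF). The sketch below is illustrative only: the function name and the example values (a 3-blade propeller at 1800 RPM) are assumptions and are not parameters reported in this study.

```python
def tonal_frequencies(rpm, n_blades, n_harmonics=3):
    """Return (RF, BPF, list of BPF harmonics) in Hz.

    rpm         : shaft speed in revolutions per minute (hypothetical input)
    n_blades    : number of propeller blades (hypothetical input)
    n_harmonics : how many BPF multiples to list, starting at the fundamental
    """
    rf = rpm / 60.0                      # rotation frequency in Hz
    bpf = n_blades * rf                  # blade passing frequency in Hz
    harmonics = [bpf * k for k in range(1, n_harmonics + 1)]
    return rf, bpf, harmonics


# Hypothetical example: 3-blade propeller turning at 1800 RPM.
rf, bpf, harmonics = tonal_frequencies(rpm=1800, n_blades=3)
```

In a measured spectrum, narrow peaks near these frequencies would be the first candidates for the discrete tonal components, while the remaining spectral floor corresponds to the broadband part.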
In target feature recognition, conventional machine learning techniques rely mainly on handcrafted feature extraction: explicit features such as edges, textures, colors, and shapes are derived from images and passed to classifiers such as support vector machines (SVM), decision trees, or random forests [16,17]. However, handcrafted-feature approaches show clear limitations on complex high-dimensional data and small objects, often yielding lower recognition accuracy and higher false positive rates. Deep learning methods, by contrast, learn feature representations directly from data and use multi-layer neural networks to model intricate patterns, which gives them considerable advantages in target recognition. Single-stage detectors have attracted particular interest for their efficiency, most notably the YOLO series [18,19,20]. Within the YOLO family, YOLOv5 achieved strong feature-processing performance through techniques such as mosaic data augmentation, generalization algorithms, and feature pyramid networks (FPN). YOLOv11, a more recent iteration in the same family, combines an advanced feature extraction architecture with multi-scale detection, yielding substantial gains in detection accuracy and overall performance, and has been widely adopted across application domains [21,22]. YOLO-series detectors are usually applied to acoustic data through time-frequency images; however, underwater acoustic signals have distinct attributes, including multi-source composition, multipath propagation, and Doppler shifts. These attributes limit the effectiveness of traditional spectral analysis and convolutional neural networks on noisy data with significant overlap and delay spread. Features in the delay-Doppler (DD) domain, by contrast, capture both the time-delay and the frequency-shift information in a signal, which markedly improves recognition reliability [23,24,25,26,27,28,29].
Signal separation and denoising methods for noise suppression have advanced substantially in recent years. Conventional techniques, such as spectral subtraction and band-pass filtering, mainly address noise within designated frequency ranges, and their effectiveness drops when the noise overlaps the frequency range of the desired signal [30]. Empirical mode decomposition (EMD) and its improved variants have been applied to source separation and denoising in marine soundscapes [31], but they still suffer from mode mixing and end effects [32].
Methods based on the wavelet packet transform (WPT) enable band-selective denoising, but their performance depends strongly on the manually chosen wavelet basis functions [33,34,35,36]. Principal component analysis (PCA), a data-driven method, reduces noise by projecting the acoustic spectrum onto an orthogonal subspace and has shown particular benefits in the analysis of maritime acoustic environments [37,38,39]. However, for complex non-stationary noise distributions, especially those with significant spectral overlap or no clear structural features, the denoising performance of PCA is markedly limited.
This study innovatively applies the YOLOv11 object detection framework to learn and identify features in the DD domain, aiming to mitigate the substantial interference from complex self-noise generated during USV operations on underwater acoustic perception systems. Furthermore, taking into account the structurally sparse and delay-spread attributes of USV self-noise in the time-frequency domain, a novel time-frequency regularized denoising model is introduced. The primary contributions of this study are as follows:
The generation mechanisms of self-noise are analyzed and a propagation model under spatiotemporal fluctuation is constructed, giving a comprehensive characterization of USV self-noise and revealing its three-dimensional coupling in the time-frequency-space domain in near-field environments.
A YOLOv11-based detection framework operating on DD feature maps is introduced to recognize self-noise characterized by delay spread and spectral overlap. The framework uses transfer learning to adapt a visual recognition network to acoustic feature detection, enabling rapid identification of USV self-noise.
A refined time-frequency domain denoising model based on overlapping group shrinkage is established, combining structural sparsity constraints with non-convex regularization. The model achieves effective and robust self-noise suppression and signal reconstruction in complex noise environments.
3. Model Formulation
To meet the real-time and computational requirements of on-board USV deployment, we use Ultralytics’ lightweight object detection framework YOLOv11 as the base implementation, which offers a small parameter footprint and low inference latency [42]. YOLOv11, however, serves only as a tool. The primary contributions of this section are: (i) explicitly mapping USV self-noise into the DD domain, for the first time, to improve separability under multipath, delay spread, and Doppler effects; (ii) developing a DD-YOLO framework that uses transfer learning to adapt a visual detection network to underwater acoustic self-noise recognition; and (iii) introducing the optimized time-frequency regularized overlapping group shrinkage (OTFROGS) denoising model, which combines structural sparsity with non-convex regularization to improve noise suppression in complex environments.
This study analyzes USV self-noise mechanisms and constructs simulation environments, revealing the distribution patterns and acoustic characteristics of self-noise in the spatiotemporal domain and thereby providing the data needed for accurate feature recognition and effective noise reduction. Characteristic analysis alone cannot meet the recognition and suppression demands of complex noise settings, so recognition and denoising models with time-frequency modeling capability are required. This paper proposes a YOLOv11 recognition model that uses DD-domain features to extract the key acoustic characteristics of self-noise, which greatly improves recognition accuracy. A denoising model based on overlapping group shrinkage is then designed, incorporating structural sparsity and time-frequency regularization; it suppresses noise according to the recognition results and reduces the impact of noise on system performance. Used together, the two models provide technical support for improving the stability and stealth of USVs in complex acoustic environments, with both theoretical and practical significance for engineering applications.
3.1. YOLOv11-Based Recognition Model Incorporating DD-Domain Features
Because USVs are highly autonomous and tightly integrated, they remain limited in system interoperability, real-time data processing capability, and data storage. On the one hand, current processors face performance constraints on complex or time-sensitive tasks, which leads to computational delays. On the other hand, the high computational load of complex algorithms exacerbates the problem. To address this, this study selects the YOLOv11 neural network, which is lightweight, accurate, and easy to integrate, as the target recognition model to improve the platform’s real-time processing capability.
Furthermore, at high speeds the USV self-noise signal generally exhibits multipath propagation, multi-source interference, and Doppler effects. This paper uses a DD transformation to extract the relevant information by mapping the signal onto a two-dimensional DD domain, which captures the propagation delay and the relative velocity of the noise sources simultaneously. The resulting representation is then used as the input to the YOLOv11 network, improving the model’s ability to assess the propagation paths and motion properties of the noise sources and thus to identify the individual noise source components.
3.1.1. Feature Extraction in the DD Domain
The DD domain integrates the time-domain and frequency-domain attributes of the signal by forming a two-dimensional grid along the delay and Doppler axes, giving a planar, physically meaningful representation of the signal. This two-dimensional mapping improves the stability and discernibility of signals in dynamic conditions [24]. The noise produced during USV operation is affected by the relative velocity between the platform and the receiver, which usually results in considerable Doppler frequency shifts. Analyzing signals in the DD domain captures these frequency shifts and reveals how different noise sources respond in delay and Doppler. Propeller noise and voice-type interference have distinct distributions in the DD map because of their different physical origins, and thus provide independent feature inputs and classification criteria for the subsequent deep learning recognition models.
Self-noise signals can be mapped to the DD domain through the transformation relationship between the time-frequency domain and the DD domain. Specifically, the self-noise can be converted between its time-frequency domain representation and its DD domain representation using the symplectic finite Fourier transform (SFFT) and the inverse symplectic finite Fourier transform (ISFFT), given as Formulas (2) and (3). In these formulas, the time-frequency term denotes the time-domain signal after Fourier transformation and sampling, and m and n denote the sizes of the delay and Doppler dimensions, respectively.
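For reference, the SFFT/ISFFT pair is commonly written in the delay-Doppler literature in the form below. This is a sketch in assumed notation and is not a reproduction of Formulas (2) and (3): the symbols X[n,m] (time-frequency samples), x[k,l] (DD samples), the index names, and the 1/√(NM) normalization are all assumptions, and sign and scaling conventions differ between references.

```latex
\begin{align}
x[k,l] &= \frac{1}{\sqrt{NM}} \sum_{n=0}^{N-1} \sum_{m=0}^{M-1}
          X[n,m]\, e^{-j2\pi\left(\frac{nk}{N} - \frac{ml}{M}\right)}
          && \text{(SFFT: time-frequency to DD)} \\
X[n,m] &= \frac{1}{\sqrt{NM}} \sum_{k=0}^{N-1} \sum_{l=0}^{M-1}
          x[k,l]\, e^{\,j2\pi\left(\frac{nk}{N} - \frac{ml}{M}\right)}
          && \text{(ISFFT: DD to time-frequency)}
\end{align}
```

Here n and m are assumed to index time and frequency, k and l to index Doppler and delay, and N and M to be the numbers of Doppler and delay bins. The two exponents differ only in sign, so applying one transform after the other returns the original grid.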
The implementation process is shown in Figure 3. According to Formulas (2) and (3), the time-frequency domain matrix, sampled at fixed time and frequency intervals, can be transformed into the DD domain, which is sampled at corresponding delay and Doppler intervals. Figure 4 presents the amplitude distribution of the signal in the DD domain: notable energy peaks are visible, and the signal exhibits typical multipath delay and frequency-shift characteristics.
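The mapping from a time-frequency matrix to a DD map described in this subsection can be sketched with two one-dimensional FFTs. The code below is a minimal illustration under stated assumptions, not the processing chain used in this study: the function names `sfft` and `isfft`, the axis layout (rows are time slots, columns are frequency bins), the orthonormal scaling, and the toy grid size are all hypothetical choices.

```python
import numpy as np


def sfft(X_tf):
    """Time-frequency grid (N time slots x M frequency bins) -> DD grid
    (N Doppler bins x M delay bins), assumed orthonormal convention."""
    # Inverse FFT across frequency yields delay; FFT across time yields Doppler.
    delay = np.fft.ifft(X_tf, axis=1, norm="ortho")
    return np.fft.fft(delay, axis=0, norm="ortho")


def isfft(x_dd):
    """Inverse of sfft: DD grid -> time-frequency grid."""
    freq = np.fft.fft(x_dd, axis=1, norm="ortho")
    return np.fft.ifft(freq, axis=0, norm="ortho")


# Toy check: a single path with Doppler index k0 and delay index l0 is a
# 2-D complex exponential in the time-frequency plane.
N, M = 16, 32
k0, l0 = 3, 5
n = np.arange(N)[:, None]
m = np.arange(M)[None, :]
X_tf = np.exp(2j * np.pi * (n * k0 / N - m * l0 / M))

x_dd = sfft(X_tf)
peak = np.unravel_index(np.argmax(np.abs(x_dd)), x_dd.shape)  # (Doppler, delay)
```

In this toy case all of the energy collapses onto one DD cell, which is the behavior that makes separate propagation paths appear as discrete peaks in a DD energy map.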
3.1.2. Introduction to the YOLOv11 Model
YOLOv11, developed by Ultralytics, is a recent iteration of the YOLO series and an efficient real-time object detection algorithm with a streamlined model structure and an optimized backbone. Compared with the Darknet-53 backbone of earlier YOLO versions, YOLOv11 adopts a CSPDarknet-style design that provides higher efficiency and more thorough feature extraction. This study uses the lightweight YOLOv11 model because small platforms such as the USV require low model complexity. The YOLOv11n architecture comprises four essential components: input, backbone, neck, and decoupled head.
As Figure 5 illustrates, the backbone network is a crucial element of YOLOv11, responsible for extracting multi-scale information from the input images. It uses C3k2 as its basic building block and strengthens feature extraction by stacking convolutional layers and modules to produce feature maps at several resolutions. YOLOv11 also retains the SPPF module used in YOLOv8 and adds a C2PSA module after the SPPF to strengthen the model’s spatial attention.
The neck, situated between the backbone and the output layers, is crucial for feature fusion and enhancement. This section employs the SPPF module to capture objects of varying sizes more effectively, which improves the detection of small objects. These architectural changes allow YOLOv11 to improve performance without sacrificing speed.
During training, we used the conventional YOLO-family losses for bounding-box regression, objectness (confidence), and classification, and selected mAP as the primary evaluation metric [43,44].
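Because mAP is computed by matching predicted and annotated boxes at one or more intersection-over-union (IoU) thresholds, the core quantity behind the metric can be sketched in a few lines. The function below is a generic illustration; the `(x1, y1, x2, y2)` box format and the function name are assumptions and do not describe the evaluation code used in this study.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A prediction counts as a true positive at AP@0.5 when its IoU with an
# unmatched ground-truth box is at least 0.5.
is_match = iou((0, 0, 2, 2), (0, 0, 2, 3)) >= 0.5
```

AP@[.5:.95] repeats the same matching at IoU thresholds from 0.5 to 0.95 and averages the resulting AP values, which is why it is the stricter of the two figures.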
3.2. Optimized Time-Frequency Regularized Overlapping Group Shrinkage Denoising Model
When a USV serves as the platform for underwater acoustic communication and for collecting marine information in complex marine environments, self-noise must be reduced effectively whenever significant self-noise components are identified in the received signal, since this is essential for recovering the target signal. Building on the analysis and identification of USV self-noise characteristics, this paper presents an optimized time-frequency regularized overlapping group shrinkage (OTFROGS) model. The approach incorporates sparse representation optimization into the TFROGS framework to address the spatiotemporal non-stationarity of self-noise. In contrast to the traditional normalized least-mean-squares (NLMS) filter, wavelet threshold denoising (WTD), and the original TFROGS, OTFROGS integrates structural sparsity and non-convex regularization into its model, making it better suited to the time-frequency-space non-stationarity of USV self-noise.
The algorithm is delineated as follows:
The noisy signal is composed of the original signal and the self-noise. The essence of the model is to recover, from the noisy signal, a denoised signal that is as close as possible to the original signal. To this end, the short-time Fourier transform (STFT) is applied to convert the noisy signal into a time-frequency domain signal, as shown in Formula (9).
where
represents the window function,
denotes the time shift, and
represents the frequency.
Furthermore, the sparse matrix
of the signal
can be represented as:
where
represents the dictionary matrix, and
is the sparse matrix, representing the coefficients of the signal in the time-frequency basis.
Most entries of the sparse matrix are close to zero, and only a few elements are significant in the time-frequency domain. A non-convex overlapping group regularization is used to promote structured sparsity and to improve the suppression of correlated self-noise while preserving prominent components. The sparse matrix is estimated by minimizing:
where
denotes the set of overlapping time–frequency groups,
is the sub-vector of coefficients belonging to group
,
is a small positive constant ensuring stability, and
is employed in the experiments. The non-convex ℓp-type group penalty enforces stronger sparsity while retaining local structure.
The Frobenius norm component is given as:
while the group penalty is explicitly expressed as:
To solve Equation (6), a reweighted iterative scheme is applied. At iteration
, each group is assigned a weight
The update of
is then formulated as:
where
denotes the total weights of all overlapping groups encompassing coefficient
. This represents the overlapping group shrinkage (OGS) operator, which adaptively mitigates redundant noise while maintaining structured signal components.
A normalized weight map is then defined as:
This weighting assigns larger penalties to coefficients located in strong noise clusters and smaller penalties to coefficients associated with major signal structures. After a sufficient number of iterations, the denoised sparse matrix S is obtained. The denoised signal is then reconstructed using the inverse short-time Fourier transform (ISTFT).
In summary, the detailed procedure of OTFROGS is presented in Table 2.
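The reweighted shrinkage step at the heart of the procedure can be sketched as follows. This is a generic majorization-minimization form of overlapping group shrinkage and is not the OTFROGS implementation: the objective written in the docstring, the rectangular 3 × 3 groups, and the values of `lam`, `p`, `eps`, and `n_iter` are all assumptions, and the STFT/ISTFT stages and the normalized weight map are omitted.

```python
import numpy as np


def window_sum(a, k1, k2):
    """Sum of `a` over every k1 x k2 window (all fully contained positions)."""
    c = np.cumsum(np.cumsum(np.pad(a, ((1, 0), (1, 0))), axis=0), axis=1)
    return c[k1:, k2:] - c[:-k1, k2:] - c[k1:, :-k2] + c[:-k1, :-k2]


def ogs_shrink(Y, lam=0.3, p=1.0, group=(3, 3), eps=1e-6, n_iter=30):
    """Reweighted overlapping group shrinkage of a coefficient matrix Y.

    Assumed objective (hypothetical form):
        0.5 * ||Y - S||_F^2 + lam * sum_g (||S_g||_2^2 + eps)^(p/2)
    where g runs over all overlapping k1 x k2 groups of coefficients.
    """
    k1, k2 = group
    pad = ((k1 - 1, k1 - 1), (k2 - 1, k2 - 1))
    S = Y.copy()
    for _ in range(n_iter):
        # Energy of every group that covers at least one coefficient.
        energy = np.maximum(window_sum(np.pad(np.abs(S) ** 2, pad), k1, k2), 0.0)
        w = (energy + eps) ** (p / 2.0 - 1.0)   # weight of each group
        r = window_sum(w, k1, k2)               # summed weights of the groups covering each coefficient
        S = Y / (1.0 + lam * p * r)             # closed-form minimizer of the quadratic majorizer
    return S


# Toy example: a compact high-energy patch (structured component) in weak noise.
rng = np.random.default_rng(0)
Y = 0.3 * rng.standard_normal((32, 32))
Y[10:14, 10:14] += 5.0
S = ogs_shrink(Y)
```

In this toy setting, low-energy groups receive large weights and are driven towards zero, while coefficients inside the energetic patch share groups with strong neighbors and are attenuated far less, which is the structured-sparsity behavior the text describes.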
4. Experiment Results
Two datasets were used, serving different purposes. Dataset-A was recorded with a historical 5 × 32 hydrophone array with 5 m element spacing and was used to examine the spatial distribution of USV self-noise across depths, positions, and operating states. It provides a sound basis for the theoretical analysis of three-dimensional coupling in the time-frequency-space domain, but its limited size makes it inadequate for training and validating deep models. Dataset-B was therefore acquired in near-shore waters off Qingdao with a self-contained hydrophone. For practical deployment and operational flexibility, a single receiver was used, and recordings were made at several depths to partly compensate for the lack of array data. The two datasets were not collected in the same experimental campaign, and the vessels were not identical. However, the two platforms had similar hull dimensions and propeller diameters, and both sets of measurements were taken in calm near-shore waters. Dataset-A thus supports the spatial pattern analysis in the theoretical part, while Dataset-B supplies sufficient data for systematic training and evaluation of the DD-YOLO recognition model and the OTFROGS denoising algorithm. The comparability of vessel scale, propeller configuration, and sea conditions supports relating Dataset-B to the spatial characteristics established with Dataset-A. Furthermore, because the self-noise signals are represented in the DD domain, where multipath and Doppler components are inherently sparse and separable, single-channel hydrophone data remain suitable. The present recognition and denoising experiments do not fully exploit spatial-domain diversity; future work will extend the dataset with multi-hydrophone recordings to improve the generality of the proposed models.
Furthermore, each continuous recording (defined as a session) was taken as the smallest partitioning unit, with each session collected under fixed ship speed and engine RPM conditions to ensure consistent operating states. During preprocessing, the raw audio was segmented into 3-second samples for subsequent training and evaluation. To avoid data leakage, all samples were first grouped by session and then split at the session level into training, validation, and test sets in a 10:3:1 ratio. This ensures that all 3-second samples from the same session are contained within a single subset, thereby enabling fair and independent performance evaluation.
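The session-level partitioning described above can be sketched as follows. The helper names, the fixed random seed, and the toy session identifiers are assumptions for illustration; only the 10:3:1 ratio and the rule that a session never spans two subsets come from the text.

```python
import random


def split_sessions(session_ids, ratio=(10, 3, 1), seed=0):
    """Assign whole sessions to train/val/test in the given ratio."""
    sessions = sorted(set(session_ids))
    random.Random(seed).shuffle(sessions)      # reproducible shuffle
    total = sum(ratio)
    n_train = round(len(sessions) * ratio[0] / total)
    n_val = round(len(sessions) * ratio[1] / total)
    return {
        "train": set(sessions[:n_train]),
        "val": set(sessions[n_train:n_train + n_val]),
        "test": set(sessions[n_train + n_val:]),
    }


def assign_samples(samples, split):
    """samples: iterable of (sample_id, session_id) pairs."""
    subsets = {name: [] for name in split}
    for sample_id, session_id in samples:
        for name, members in split.items():
            if session_id in members:
                subsets[name].append(sample_id)
    return subsets


# Hypothetical data: 28 sessions, each cut into five 3-second samples.
sessions = [f"s{i:02d}" for i in range(28)]
samples = [(f"{s}_{j}", s) for s in sessions for j in range(5)]
split = split_sessions(sessions)
subsets = assign_samples(samples, split)
```

Because the ratio is applied to sessions rather than to individual samples, the sample-level proportions match 10:3:1 only approximately when sessions differ in length.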
Propeller noise, a significant component of USV self-noise, was recorded using the experimental setup shown in Figure 6. Figure 6a shows the deployment of the USV in near-shore waters off Qingdao, where its near-field self-noise was captured with an integrated hydrophone. The experimental parameters are given in Figure 6b, which marks the hydrophone depth and the distance from the vessel’s keel to the bottom. For ease of deployment and operational flexibility, the receiving system used a single self-contained hydrophone.
4.1. Evaluation of Recognition Model Performance
This section assesses the performance of the YOLOv11 recognition model, trained on DD-domain features, in detecting USV self-noise. Figure 7 outlines the preprocessing workflow for the self-noise recordings: routine electrical interference is first removed from the raw acoustic data to suppress background clutter, after which Equations (2) and (3) are applied to transform the self-noise from the time-frequency (TF) domain into the delay-Doppler domain. This conversion enables extraction of the embedded propagation-delay and Doppler-shift signatures, as illustrated in Figure 8.
To match practical experimental conditions, the detection task uses a single category, propeller self-noise. The category is not subdivided into cavitation and mechanical noise, because these components frequently co-occur and are closely related under small-USV operating conditions, and in the DD domain both typically appear as concentrated energy peaks or stripe-like structures. Joint modeling is therefore more coherent and more practical. The raw recordings containing self-noise were split into 3-second samples, giving 1735 samples in total, and the dataset was divided into training, validation, and test sets in a 10:3:1 ratio. Signal processing followed an SFFT/ISFFT pipeline parameterized by window length, hop size, and number of FFT points (NFFT); each segment produced a DD energy map with a fixed grid resolution of M × N = 374 × 273. Annotation was carried out by trained experimenters following a standardized procedure: the raw acoustic data were first transformed into DD energy maps, after which annotators visually identified regions of concentrated energy and drew axis-aligned bounding boxes, uniformly labeled “propeller-noise”. To ensure annotation quality and consistency, each sample was labeled independently by at least two experimenters, followed by cross-verification and consensus resolution of any discrepancies.
As Figure 8a shows, the time-frequency (TF) spectrogram displays the temporal-spectral evolution of signal energy; however, because of the combined effects of multiple noise sources and multipath propagation, the features are densely clustered with indistinct contours, and the signal components overlap considerably, which complicates feature separation and identification. In contrast, the DD-domain energy maps in Figure 8b,c show several discrete energy peaks associated with different propagation paths and Doppler components. These distinct structural and spatial properties give YOLOv11 a considerable advantage in target detection and multi-source recognition. Following the annotation procedure described above, bounding boxes are drawn on the DD map to enclose the peak or stripe-like regions associated with propeller self-noise in the training images.
To assess the DD representation objectively, we performed an ablation experiment comparing the detection performance of identical YOLO models with standard TF spectrograms and with DD energy maps as inputs. Table 3 shows that DD inputs consistently yield higher mAP values, with YOLOv11 increasing from 0.67 (TF) to 0.87 (DD) and YOLOv8 from 0.64 (TF) to 0.82 (DD). Comparable gains are observed in AP@[.5:.95], recall, and F1-score. For example, YOLOv11 achieves an F1-score of 0.86 with DD input against 0.63 with TF input, and YOLOv8 improves from 0.60 (TF) to 0.81 (DD). The results indicate that the clearer structural cues of DD features improve visual separability and lead to significant quantitative gains, corroborating the observations in Figure 8.
A systematic performance comparison is then carried out among YOLOv11, its predecessor YOLOv8, and the traditional SSD network. Figure 9 and Figure 10 compare the training and validation dynamics of YOLOv11 and YOLOv8 in terms of loss components and detection metrics (mAP50 and mAP50-95). Overall, YOLOv11 converges better. Both box_loss and cls_loss decrease faster during training, and the corresponding validation losses converge faster, stabilize at lower values, and fluctuate much less. These results indicate that YOLOv11 has better training stability and generalization. YOLOv8, by contrast, shows larger oscillations and higher residual loss, indicating poorer convergence and lower robustness.
Figure 11 shows that the PR curve of YOLOv11 lies consistently above that of YOLOv8 and maintains high precision even when recall exceeds 0.7, indicating greater robustness. The precision of YOLOv8 drops more rapidly as recall increases. The AUC-PR results further confirm that YOLOv11 achieves a better precision/recall trade-off, making it more suitable for applications that require both high confidence and high recall.
As the detection performance comparison in Table 4 shows, YOLOv11 attains the best result on every metric except precision, where SSD is marginally higher (0.90 versus 0.89). It reaches an AP@0.5 of 0.87, an AP@[.5:.95] of 0.59, a precision of 0.89, a recall of 0.84, and an F1-score of 0.86, a well-balanced trade-off between precision and recall with good robustness across IoU thresholds. YOLOv8 achieves an AP@0.5 of 0.82, an AP@[.5:.95] of 0.50, a precision of 0.83, a recall of 0.80, and an F1-score of 0.81, ranking second overall while retaining high detection accuracy. SSD achieves an AP@0.5 of 0.78 with a high precision of 0.90 but a recall of only 0.37, giving an F1-score of just 0.52. This reflects overly conservative detection: the model produces few predictions, most of them correct, at the cost of recall. Overall, YOLOv11 clearly outperforms YOLOv8 and SSD in detection accuracy and robustness, making it the more dependable option for practical applications that require high-confidence target identification.
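The F1-scores quoted above follow from the reported precision and recall through F1 = 2PR / (P + R), which can be checked with a short calculation. The dictionary layout and function name below are illustrative; the precision and recall values are those quoted in the text for Table 4.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall) as quoted in the text for Table 4.
reported = {
    "YOLOv11": (0.89, 0.84),
    "YOLOv8": (0.83, 0.80),
    "SSD": (0.90, 0.37),
}
f1 = {name: round(f1_score(p, r), 2) for name, (p, r) in reported.items()}
```

The SSD case shows why F1 is the more informative summary here: its precision is the highest of the three, but the low recall pulls the harmonic mean well below that of either YOLO model.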
4.2. Denoising Model
This research presents the OTFROGS denoising model. The model uses a coefficient-matrix representation tailored to the characteristics of self-noise signals; through sparse representation, it retains the principal frequency components while removing the USV self-noise, particularly its low-energy components, giving efficient and accurate denoising. The proposed method is validated on experimentally recorded signals containing significant levels of USV self-noise.
4.2.1. Performance Comparison Under Simulated Conditions
This section applies the proposed OTFROGS approach and several traditional denoising techniques to signals containing USV self-noise. As shown in Figure 12, OTFROGS achieves the best denoising, removing most of the noise while accurately reconstructing the target signal, whereas the conventional methods (NLMS, WTD, and TFROGS) achieve only partial noise reduction and leave visible residual noise. As illustrated in Figure 13, all methods yield negative SNR values under the −15 dB and −25 dB noise conditions. For clarity and ease of comparison, the absolute value of the SNR is therefore reported, so a smaller value indicates better denoising. OTFROGS achieves the lowest |SNR| values, 0.99 and 7.85 in the respective noise conditions, clearly outperforming NLMS, WTD, and TFROGS. These results underscore the noise suppression capability of OTFROGS in challenging acoustic environments.
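To make the |SNR| convention concrete, the sketch below computes an output SNR in dB against a known clean reference and then takes its absolute value. The definition used (reference power over residual power) is a common one but is an assumption here, since the exact SNR formula behind Figure 13 is not stated; the test signals are synthetic.

```python
import numpy as np


def snr_db(reference, estimate):
    """SNR in dB of `estimate` relative to the clean `reference` signal."""
    reference = np.asarray(reference, dtype=float)
    residual = np.asarray(estimate, dtype=float) - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(residual ** 2))


# Synthetic example: a 50 Hz tone sampled at 1 kHz for one second.
t = np.arange(1000) / 1000.0
clean = np.sin(2 * np.pi * 50 * t)

good = clean + 0.1 * clean      # residual 20 dB below the signal
poor = clean + 10.0 * clean     # residual 20 dB above the signal

snr_good = snr_db(clean, good)  # positive SNR
snr_poor = snr_db(clean, poor)  # negative SNR
```

When every method ends with a negative SNR, as in the conditions discussed above, a smaller absolute value means the output is closer to 0 dB and therefore cleaner, which is the sense in which lower |SNR| is better.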
4.2.2. Denoising Performance on Real Self-Noise Data
This section illustrates the denoising performance of OTFROGS on acoustic signals containing USV self-noise; the results are shown in Figure 14 and Figure 15. Figure 14 represents the high-SNR scenario, in which the denoised signal is clearly distinguishable from the noisy input and both the high- and low-frequency components of the self-noise are suppressed. In the low-SNR conditions shown in Figure 15, some residual low-frequency noise remains in the denoised output, but the overall noise reduction is still significant.
In addition to the denoising performance shown in Figure 14 and Figure 15, the computational cost of the two modules was assessed. On a standard desktop CPU, DD-YOLOv11 requires 43.67 ms per DD feature map with a memory use of 1.64 GB, whereas the OTFROGS module processes 1 s of acoustic data in approximately 0.03 s. These figures suggest that near-real-time operation is feasible. Regarding feasibility on compact USVs, previous studies report that lightweight YOLO models typically achieve 20–30 FPS with less than 1 GB of memory on the Jetson Xavier NX and 8–15 FPS on the Jetson Nano [18,45]. Furthermore, GPU-based FFT/IFFT implementations typically provide 1.5–3× acceleration, suggesting further speed gains for OTFROGS on embedded GPUs [46,47]. Together, these results indicate compatibility with the edge-class CPU/GPU devices commonly installed on small USVs.
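The timing figures above can be combined into a rough real-time budget. The calculation below is a back-of-the-envelope sketch under stated assumptions: one DD feature map per 3-second segment (as in the dataset preparation), the two modules running one after the other, and the cost of the DD transform itself left out because it is not reported. The function name and the returned keys are hypothetical.

```python
def realtime_budget(detect_ms_per_map=43.67, denoise_s_per_audio_s=0.03, segment_s=3.0):
    """Processing time per audio segment from the reported desktop-CPU figures."""
    detect_s = detect_ms_per_map / 1000.0          # recognition time for one DD map
    denoise_s = denoise_s_per_audio_s * segment_s  # denoising time for one segment
    total_s = detect_s + denoise_s                 # assumes sequential execution
    return {
        "maps_per_second": 1.0 / detect_s,
        "processing_s": total_s,
        "realtime_factor": total_s / segment_s,    # below 1 means faster than real time
    }


budget = realtime_budget()
```

Under these assumptions the two modules together need only a small fraction of each segment's duration on the desktop CPU, which leaves margin for the slower embedded platforms discussed above.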