1. Introduction
Lost and abandoned crab pots, or “ghost pots,” are a common form of derelict fishing gear in shallow coastal ecosystems where crabbing activity is high. They pose ecological and economic concerns by harming organisms such as the diamondback terrapin [1,2,3,4], degrading habitat [5,6], creating navigation hazards [7], and reducing harvest efficiency through competition with actively fished pots [6].
Targeted removal of ghost gear requires precise location data, yet detection is difficult in turbid waters where visual surveys are limited. Side-scan sonar (SSS) provides effective imaging in such environments, and its accessibility has expanded with the advent of low-cost consumer fishfinders [8,9]. Open-source processing tools have further democratized acoustic mapping by enabling efficient, cost-effective workflows [10,11], but interpretation remains challenging due to noise, low contrast, and environmental distortions [12]. Large-scale data collection compounds these issues, often overwhelming manual review pipelines and limiting timely management response. In Delaware’s Inland Bays (IB), manual mapping of approximately 1500 acres revealed derelict pot densities exceeding 1.6 pots per acre, among the highest documented nationally [13,14]. The magnitude of this problem highlights the need for automated detection and mapping pipelines capable of supporting rapid, reproducible, and operationally scalable assessments.
Existing recovery programs often rely on commercial watermen or resource professionals who use side-imaging sonar only during on-water retrieval operations, where it helps locate individual pots for grappling-hook recovery following NOAA Marine Debris Program best practices [15,16,17]. Critically, these surveys are not designed for systematic mapping, and derelict pot locations are rarely archived. As a result, managers lack a comprehensive understanding of pot abundance, spatial distribution, or the cumulative impact of removal efforts. Recent work shows that recreational crabbing and boating also contribute substantially to derelict pot prevalence, particularly in the IB, where only recreational harvest is authorized (7 Del. C. §2304). Community members have expressed strong interest in participating in removal efforts, motivated by concerns over ghost-fishing impacts and vessel damage. The Delaware Department of Natural Resources and Environmental Control (DNREC) likewise identified high-density areas in the Delaware Bay (DB) between 2017 and 2019, underscoring the need for scalable, cost-effective mapping and detection approaches that extend beyond opportunistic retrieval and support both professional and community-based recovery efforts.
This study evaluates whether low-cost sonar paired with modern object-detection models can achieve operationally reliable detection of derelict crab pots at management-relevant spatial scales. To address this need, we introduce GhostVision, an open-source framework that leverages consumer-grade sonar and artificial intelligence to automate derelict gear detection and mapping. Building on prior NOAA Marine Debris Program projects, GhostVision aims to reduce technical barriers, engage citizen scientists, and provide reproducible tools for large-scale stewardship.
Despite the increasing availability of low-cost side-scan sonar, no prior study has systematically evaluated whether consumer-grade systems paired with state-of-the-art object-detection models can achieve reliable, management-ready performance in shallow, cluttered estuarine environments. This gap is critical: most existing AI–sonar studies rely on high-resolution AUV platforms, controlled acquisition geometries, or curated datasets that do not reflect the acoustic variability encountered in community-led surveys.
Accordingly, this study addresses three questions: (i) can low-cost sonar provide imagery of sufficient quality for modern object-detection models; (ii) how do leading architectures differ in dataset-centric versus operational performance; and (iii) can an open-source pipeline produce georeferenced detections with accuracy suitable for on-water recovery? The following sections describe related work, system design, model development, validation, and community-based applications.
2. Related Work
Recent advances in deep learning have substantially improved object-detection performance in sonar imagery. Convolutional neural network (CNN)-based architectures, including YOLO variants and transfer-learning frameworks, have demonstrated strong results in underwater robotics and autonomous vehicle applications [12,18,19,20]. Transformer-enhanced and hybrid detection frameworks have further improved robustness to acoustic clutter and target variability [21,22]. Comprehensive reviews of sonar-based deep learning systems highlight rapid progress in detection accuracy, explainability, and semi-supervised learning strategies [23,24,25].
Despite these advances, much of the existing literature relies on high-resolution forward-looking sonar or AUV-mounted systems operating in relatively controlled acquisition environments. Many studies employ curated datasets, synthetic noise augmentation, or simulated acoustic conditions to compensate for limited labeled training data [21,26]. Others focus on deep-water or robotics-oriented deployments where navigation stability and sensor calibration reduce environmental variability [18,27]. While these approaches advance algorithmic performance, they often do not reflect the acoustic complexity of shallow estuarine systems characterized by strong bottom reverberation, heterogeneous substrates, dense anthropogenic debris, and variable tow geometries.
In addition, most prior work emphasizes methodological optimization (improving network architectures, feature extraction, or noise suppression) rather than evaluating operational scalability in resource-limited management contexts [12,23]. Few studies explicitly test whether low-cost, commercially available sonar platforms can achieve reliable detection performance at spatial scales relevant to conservation and fisheries management.
GhostVision addresses these gaps by evaluating modern object-detection architectures within a full end-to-end processing pipeline using consumer-grade sonar deployed in shallow, cluttered estuarine environments. Rather than optimizing performance within controlled datasets, this study assesses whether affordable hardware paired with state-of-the-art detection models can produce operationally actionable results—i.e., detections that remain stable across transects, maintain manageable false-positive rates, and support accurate geolocation for on-water recovery. By integrating open-source workflows, field validation, and large-area deployment, GhostVision shifts the focus from algorithmic advancement to scalable, management-oriented implementation.
3. Materials and Methods
3.1. Study Area and Context
Delaware’s Inland Bays are heavily used recreational crabbing areas. Rehoboth Bay, a shallow (<5 m), tidally influenced bay located at the southern extent of Delaware, is especially popular because of ready access from public and private boat ramps, public lands, and adjacent housing developments. Potential sonar survey sites were identified by observing crabbing activities and through local knowledge. Sonar surveys were carried out in three primary areas of the Inland Bays: (a) northern Rehoboth Bay near Dewey Beach; (b) western Indian River Bay; and (c) southern Indian River Bay near White Creek (Figure 1). Sonar surveys and associated crab pot events at these locations took place between 2020 and 2022. These surveys mapped approximately 1500 acres, averaging 1.6 pots per acre, indicating that the Inland Bays could hold as many as 20,000–30,000 derelict crab pots.
3.2. Hardware System and Field Setup
Side-scan sonar is a technology that enables acoustic swath imaging (i.e., port and starboard) of aquatic environments from survey vessels while underway (Figure 2b). This technology, traditionally found on sophisticated and costly instrumentation, has been adapted to low-cost, off-the-shelf consumer-grade systems (i.e., fishfinders) targeted to the angling community to aid fish, habitat, and depth identification. The aquatic scientific community has also taken notice, incorporating fishfinders into fisheries, habitat, and ecology research. Manufactured objects in shallow natural environments, such as metal crab pots (Figure 2a), produce exceptionally strong acoustic returns because their rigid, reflective surfaces scatter far more energy than adjacent substrates. When resting on the seafloor, the pot also blocks incident sound, creating a pronounced acoustic shadow, especially in shallow water like Rehoboth Bay, that further distinguishes it from the surrounding environment (Figure 2c). This study leverages low-cost sonar systems equipped with GPS to identify and map derelict crab pots.
The mobile mapping unit (MMU) was designed as a highly adaptable and portable system to turn vessels-of-opportunity into crab pot survey vessels (Figure 2d). The MMU features a Humminbird Solix 12 CHIRP MEGA SI+ G3 (Johnson Outdoors, Racine, WI, USA) with several operational frequencies (50/83/200/455/800 & 1200 kHz), beam orientations (traditional nadir, down imaging, side imaging, and live imaging), a minimal learning curve, and a relatively low price point (USD 3199.99). A Pelican 1510 case (Pelican Products, Inc., Torrance, CA, USA) stows survey equipment during travel. The case is modified to support powering and mounting sonar equipment during surveys. The case is fitted with waterproof power adapters to power the fishfinder with an internal Norsk 14.8 V 32 AH Lithium Ion battery (Norsk Lithium, Inc., Minneapolis, MN, USA). An INOVATIV QR VESA (INOVATIV, Azusa, CA, USA) quick-release mount fastened to the case lid, combined with a Bass Pro Shops LockDown Marine Electronics Mount (BPS Direct, LLC., Springfield, MO, USA) and a Humminbird control head mount, secures the control head to a stable base. The sonar transducer is fixed to one end of a galvanized conduit pole with hose clamps, and the Humminbird GPS puck with heading sensor is connected to the other end with PVC to buffer magnetic interference to the compass. A Minn Kota trolling motor (Johnson Outdoors, Racine, WI, USA) transom mount combined with short lengths of wood plank and C-clamps enables variable mounting solutions.
3.3. Data Acquisition and Annotation
Side-scan sonar surveys with the MMUs were carried out in areas of interest through a series of straight, parallel transects to ensure coverage and minimal image distortion. Transects were spaced based on average depth and desired overlap following a 1:5–10× depth:range rule of thumb [10]. An additional pass following the shoreline edge ensured coverage in very shallow (<1 m) areas. A sonar recording was initiated at the start of each survey transect to log the sonar data to an SD card loaded in the fishfinder. Smooth navigation at a constant speed (7–8 km/h) ensured high-quality imagery with minimal distortion. The recording was terminated at the end of the transect, resulting in one recording per transect. Surveys were carried out in multiple locations in Rehoboth Bay, DE (Figure 1).
Sonar recordings collected with fishfinders are stored in a structured proprietary format that requires specialized software for post-processing. PINGMapper v5.0 [10] and Sar Hawk were used to produce 8-bit (0–255) grayscale and colorized (i.e., pseudo-color) imagery, spatially cropped to areas ranging from ∼30 to 200 m². Imagery was uploaded to Roboflow [28], a web-based interface for annotating and training artificial intelligence models. A total of 3110 images were manually annotated by several individuals. Pots were visually identified in the imagery and delineated with bounding boxes in the web utility to create image-label pairs for model training. Annotation quality was ensured in two stages. First, two annotators generated bounding-box annotations under supervision of experienced sonar analysts. In the second stage, all co-authors convened during a one-week session where annotation batches were independently reviewed and refined. Questionable annotations were brought to the larger group for consensus, resulting in the final model dataset. Image-label pairs were archived from Roboflow for reproducibility [29].
3.4. AI Model Development
For this study, we evaluate three modern object-detection architectures (YOLOv12 [30], YOLOv26 [31], and RF-DETR [32]) for their ability to localize and classify crab pots in side-scan sonar imagery. YOLOv12 introduces an attention-centric design that improves feature aggregation while maintaining fast, CNN-like inference speeds, whereas YOLOv26 emphasizes efficiency, accuracy, and deployment readiness on edge and low-power hardware. RF-DETR extends the DETR family with large-scale pre-training, a scheduler-free training strategy, and an end-to-end weight-sharing neural architecture search framework, enabling strong accuracy–latency tradeoffs and improved generalization to small or out-of-distribution datasets. Together, these architectures represent complementary approaches to real-time detection in challenging acoustic imaging conditions.
Grayscale and colorized (i.e., pseudo-color) annotated sonar imagery were split into training, validation, and test subsets (∼70/15/15) such that all images originating from the same sonar recording (i.e., survey transect) were assigned to a single subset, preventing cross-subset data leakage at the transect level. The splits were not explicitly stratified by survey date, site, or environmental conditions because this metadata was dissociated during conversion from raw sonar logs to model-ready image datasets. This resulted in 2154 training (69%), 555 validation (18%), and 399 test (13%) images with 3931, 1275, and 567 annotations, respectively.
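The transect-level split described above can be illustrated with a short sketch. The function name `split_by_recording`, the greedy fill rule, and the toy file names are assumptions made for illustration only; the splitting code actually used for the dataset is not described in the text.

```python
import random
from collections import defaultdict


def split_by_recording(image_to_recording, fractions=(0.70, 0.15, 0.15), seed=0):
    """Assign whole recordings to train/val/test so that no recording spans subsets."""
    by_recording = defaultdict(list)
    for image, recording in image_to_recording.items():
        by_recording[recording].append(image)

    # Shuffle recordings (not images) so the grouping is preserved.
    recordings = sorted(by_recording)
    random.Random(seed).shuffle(recordings)

    names = ["train", "val", "test"]
    targets = [f * len(image_to_recording) for f in fractions]
    subsets = {name: [] for name in names}

    # Greedy fill: each recording goes to the subset furthest below its target size.
    for recording in recordings:
        deficits = [targets[i] - len(subsets[names[i]]) for i in range(3)]
        chosen = names[deficits.index(max(deficits))]
        subsets[chosen].extend(by_recording[recording])
    return subsets


# Hypothetical toy dataset: 10 recordings with 5 images each.
toy = {f"rec{r}_img{i}.png": f"rec{r}" for r in range(10) for i in range(5)}
subsets = split_by_recording(toy)
recordings_in = {k: {name.split("_")[0] for name in v} for k, v in subsets.items()}
```

Because recordings differ in length in practice, the realized proportions only approximate the targets, which is consistent with the reported split deviating from an exact 70/15/15.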
Image augmentation was applied to the training set, including horizontal/vertical flipping, rotation, shear, Gaussian blur, and noise to aid model generalization. These augmentations increase geometric and photometric variability and are appropriate for side-scan sonar imagery. Flipping mimics observing the target on the opposite sonar channel or at a different point in time; small rotation increases invariance to vessel heading relative to the target orientation; blur approximates target warping caused by vessel steering and reduced clarity in the far-field; and noise approximates sonar speckle. Up to three augmented copies were created for each training image, resulting in 5721 images with 7469 annotations. No augmented copies were generated for the validation or test subsets. All images were resized to 640 × 640 pixels. The same dataset, including subset splits and augmentation, was used to train all three models to ensure comparable results.
All three models were trained using Google Colab Pro with an NVIDIA Tesla T4 GPU (Nvidia Corporation, Santa Clara, CA, USA) with 16 GB of GDDR6 memory. Table 1 shows the relevant parameters used during training. Batch size for each model was maximized to fully utilize available GPU memory and reduce training time. The maximum number of epochs was set to 200 (YOLO variants) and 100 (RF-DETR). Early-stopping patience was set to 30 epochs to avoid unnecessary computation. Effective training durations, defined as the epoch at which validation loss was minimized, were 51 epochs for YOLOv12, 15 for YOLOv26, and 7 for RF-DETR, indicating that the maximum epoch limits substantially exceeded the required convergence time. RF-DETR was trained using the rfdetr v1.3.0 and supervision v0.27.0 Python packages published on the open-source Roboflow (Roboflow, Inc., Des Moines, IA, USA) GitHub repositories. YOLOv12 was trained with the GitHub repository implementation due to inefficiencies with the Ultralytics version and to ensure compatibility with Roboflow utilities. YOLOv26 was trained with Ultralytics v8.4.4.
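The interaction between the epoch limit and early-stopping patience can be sketched as follows. This is a generic patience rule assumed for illustration (stop once the number of epochs since the best validation loss reaches the patience value); the training frameworks used here may define patience slightly differently, and the loss values are made up.

```python
def best_epoch_with_patience(val_losses, patience=30):
    """Return (best_epoch, stop_epoch), both 1-indexed, for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New minimum: remember it and restart the patience window.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training here.
            return best_epoch, epoch
    # Epoch limit reached before patience ran out.
    return best_epoch, len(val_losses)


# Hypothetical loss curve: improves for 7 epochs, then plateaus at a higher value.
toy_losses = [1.0 - 0.1 * i for i in range(7)] + [0.5] * 93
result = best_epoch_with_patience(toy_losses, patience=30)
```

Under this assumed rule, a run whose best epoch is 7 would end at epoch 37, well short of a 100-epoch limit.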
3.5. Rapid Post-Processing, Inference, & Mapping
Sonar survey systems, including fishfinders, can log the sonar datastream from each sonar beam and associated attributes (e.g., northing, easting, depth at nadir, and vessel speed) to file(s) [10]. Post-processing utilities specifically designed to decode the often proprietary data structure can be used to generate georeferenced sonar datasets, including mosaics, for further examination in a geographic information system (GIS). Analyzing and mapping features in sonar imagery is a tedious process requiring extensive training and time to accurately map and classify features of interest. To reduce this processing and analysis burden and ensure widespread adoption of our approach to identifying and mapping derelict crab pots in sonar logs, we present GhostVision, an open-source Python utility for rapid post-processing, detection, and mapping of derelict fishing gear. GhostVision features a reproducible pipeline for (a) decoding sonar logs; (b) exporting sonar videos; (c) detecting and tracking crab pots; (d) calculating prediction scores; (e) georeferencing predictions; and (f) exporting detected locations to GPX (GPS Exchange Format) for removal of the gear. Once sonar surveys are collected, the data are downloaded from the fishfinder memory card to a laptop and are post-processed in a fraction of the survey time, facilitating same-day analysis and removal operations.
The software is installed in a conda [33] environment, ensuring all dependencies are properly installed. Software users select a folder containing sonar recordings to process. Sonar recordings can be filtered by range, heading deviation, speed, area of interest (AOI), or survey time as necessary. Additionally, offsets between the sonar transducer and GPS can be applied. These parameters are used to decode and process the sonar recordings using the underlying PINGVerter v2.1 and PINGMapper [10] processing engines. These two dependencies decode the supported sonar logs (DAT-Humminbird; sl2/sl3-Lowrance; RSD-Garmin; svlog-Cerulean), apply filters, and ensure accurate positioning of detected crab pots.
Following sonar log decoding and filtering, non-georeferenced sonograms are exported. Each sonogram is 500 pings wide (i.e., chunk size) by range (or crop length) tall without overlap. PINGTile [34] then applies a moving window over the exported images to generate overlapping images with a stride specified by the user as a proportion of the chunk size. The individual sonograms are saved to a video file, mimicking the waterfall view observed during the survey. Exporting overlapping sonograms to video serves three purposes. First, it ensures that a crab pot that might be obscured at the edge of a sonogram frame has several opportunities for detection as the window steps through the sonar imagery. Second, multiple detections enable calculating an average detection accuracy and confidence interval, providing an enhanced measure of the model performance. Finally, predicting on a video enables leveraging optimized inference pipelines available in the Roboflow Python API. Tracking algorithms can optionally be enabled to ensure that a single crab pot detected in multiple frames is counted as a single pot. Object tracking also lowers the false-positive rate by discarding one-off detections. The results of object tracking are saved as a CSV file along with corresponding indexes indicating the frame number and centroid coordinates of the detection box within the frame.
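The effect of the moving window can be sketched in a few lines. This is not the PINGTile implementation; the function names and the default stride fraction of 0.2 are hypothetical, chosen only to show how a stride expressed as a proportion of the 500-ping chunk size gives each ping several chances to be seen.

```python
def window_starts(total_pings, chunk_size=500, stride_fraction=0.2):
    """Start indices (in pings) of overlapping windows that are chunk_size pings wide."""
    if not 0 < stride_fraction <= 1:
        raise ValueError("stride_fraction must be in (0, 1]")
    stride = max(1, round(chunk_size * stride_fraction))
    last_start = max(total_pings - chunk_size, 0)
    return list(range(0, last_start + 1, stride))


def windows_covering(ping_index, starts, chunk_size=500):
    """Number of windows (video frames) that contain a given ping."""
    return sum(start <= ping_index < start + chunk_size for start in starts)


# A hypothetical 2000-ping recording with a stride of 20% of the chunk size.
starts = window_starts(2000)
frames_seeing_ping_1000 = windows_covering(1000, starts)
```

With these assumed settings the stride is 100 pings, so a target in the middle of the recording appears in five consecutive frames rather than one.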
The final step of the processing pipeline is georeferencing the detected crab pots. The frame number and target centroid coordinates are used to join the detection with its associated ping in the sonar log. Each ping in a sonar log stores so-called ping attributes [10] that include vessel position, heading, speed, and depth. The pixel offset of the target within the ping is converted to a slant range based on known characteristics of the ping’s across-track range and resolution. The depth and slant range are used to calculate the horizontal range to the target with the Pythagorean theorem. The final geographic coordinates of the target are calculated from the coordinates of the ping origin, the vessel heading, and the range (distance) to the target. The coordinates of each target are exported to GPX for use on the fishfinder or other GPS equipment for subsequent removal activities.
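The georeferencing steps above can be sketched as follows. This is an illustrative reconstruction, not the GhostVision code: it assumes a flat seabed, a target lying perpendicular to the vessel track (starboard at heading + 90°, port at heading − 90°), a spherical Earth, and hypothetical parameter names such as `metres_per_pixel`.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (spherical approximation, assumed)


def horizontal_range(slant_range_m, depth_m):
    """Ground range from slant range and depth via the Pythagorean theorem."""
    if slant_range_m <= depth_m:
        return 0.0  # return falls within the water column; treated as directly below
    return math.sqrt(slant_range_m**2 - depth_m**2)


def georeference(lat_deg, lon_deg, heading_deg, pixel_offset, metres_per_pixel, depth_m, side):
    """Project a target from the ping origin along the across-track bearing."""
    if side not in ("port", "starboard"):
        raise ValueError("side must be 'port' or 'starboard'")
    slant = pixel_offset * metres_per_pixel
    ground = horizontal_range(slant, depth_m)
    offset = 90.0 if side == "starboard" else -90.0
    bearing = math.radians((heading_deg + offset) % 360.0)

    # Standard great-circle destination-point formula.
    lat1, lon1 = math.radians(lat_deg), math.radians(lon_deg)
    d = ground / EARTH_RADIUS_M
    lat2 = math.asin(
        math.sin(lat1) * math.cos(d) + math.cos(lat1) * math.sin(d) * math.cos(bearing)
    )
    lon2 = lon1 + math.atan2(
        math.sin(bearing) * math.sin(d) * math.cos(lat1),
        math.cos(d) - math.sin(lat1) * math.sin(lat2),
    )
    return math.degrees(lat2), math.degrees(lon2)


# Hypothetical ping: vessel heading north, target 50 pixels out on the starboard side.
starboard_fix = georeference(38.7, -75.1, 0.0, 50, 0.1, 3.0, "starboard")
port_fix = georeference(38.7, -75.1, 0.0, 50, 0.1, 3.0, "port")
```

In this example a 5 m slant range in 3 m of water gives a 4 m horizontal range, and the two fixes fall east and west of the vessel, respectively. Because the bearing depends directly on heading, any heading error displaces the fix by an amount that grows with range.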
4. Results
Evaluation of the three trained models (RF-DETR, YOLOv12, and YOLOv26) was performed in two phases to separate dataset-level model performance from operational performance within the GhostVision pipeline.
Model-level Evaluation—In the first phase, we report conventional object-detection metrics derived from the validation set during training, providing a controlled, dataset-centric comparison of all three models. Metrics include precision, recall, mAP@50, and mAP@50–95. Observations on loss-curve behavior and convergence are also provided. Model inference performance on a hold-out test subset is also reported.
GhostVision Implementation Evaluation—In the second phase, we implement each model in the GhostVision sonar-processing pipeline to assess real-world operational performance, highlighting differences not made apparent from training-set metrics alone. Manually annotated and georeferenced crab pots are compared to outputs from the processing pipeline. Metrics include inference stability across survey transects, false-positive behavior, spatial error distributions, and consistency of pot-level georeferenced outputs.
4.1. Model Evaluation
In the first phase, we evaluated the performance of the three trained models using conventional metrics derived from (i) the validation subset (555 images with 1275 annotated crab pots) and (ii) the hold-out test subset (399 images with 567 annotated crab pots). These metrics enable interrogation of model-centric behavior prior to GhostVision implementation.
4.1.1. Training Behavior and Convergence
All models were trained with an early-stopping patience of 30 epochs to promote convergence and avoid unnecessary computation. Despite differing maximum epoch limits, each architecture reached its best validation performance early in training: RF-DETR at epoch 7, YOLOv26 at epoch 15, and YOLOv12 at epoch 51 (Figure 3). YOLOv12 showed substantial oscillation in precision and recall during the early stages, but eventually stabilized. YOLOv26 converged more smoothly and rapidly, though with slightly lower accuracy across all metrics. RF-DETR exhibited the most stable training dynamics, with a smooth loss curve and consistent metric progression, reaching its peak performance early and showing only modest improvements thereafter. Notably, RF-DETR achieved the highest recall of all models, indicating strong sensitivity to crab-pot candidates.
4.1.2. Validation Performance
Validation results reveal a trade-off between sensitivity and discriminative sharpness. RF-DETR achieved the highest recall and the highest mAP@50, indicating strong detection sensitivity and accurate localization at the 0.5 IoU threshold. YOLOv12, however, achieved the highest mAP@50–95, suggesting better overall performance across stricter IoU thresholds and thus stronger discriminative behavior. YOLOv26 performed similarly to YOLOv12 but with consistently lower precision, recall, and mAP. Full best-epoch results are presented in Table 2.
4.1.3. Test Set Performance
A separate test subset containing sonar surveys not present in the training or validation data was used to evaluate model generalization. All test-set metrics were computed using Supervision v0.27.0 at IoU = 0.5 to ensure consistent evaluation across architectures (Table 3). Substantial performance degradation was observed for all three models relative to validation, indicating a domain shift between curated training imagery and the more heterogeneous test surveys.
At first glance, RF-DETR appears to be the strongest model on the test set, achieving the highest mAP@50–95 (0.148) and an almost perfect recall@50 (0.979). However, its precision@50 collapses to 0.006, meaning nearly every prediction is a false positive. This behavior inflates mAP because COCO-style mAP integrates performance across the entire confidence range, including extremely low thresholds where RF-DETR produces large numbers of low-confidence detections. As a result, RF-DETR’s high mAP reflects sensitivity rather than discriminative reliability, and does not translate into usable performance at practical operating points.
The YOLO models exhibit more balanced precision–recall behavior. YOLOv12 achieves the highest F1@50 (0.348) and substantially higher recall (0.263) than YOLOv26 (0.085). YOLOv26 attains the highest precision (0.667) but misses most true crab pots, resulting in a low F1. Both YOLO models show generalization loss relative to validation but maintain responsive thresholds that can be tuned for operational use. In contrast, despite detecting nearly every pot, RF-DETR’s extreme false-positive rate renders it functionally unusable without aggressive post-processing.
4.2. Implementation Evaluation
The implementation-level evaluation assessed the end-to-end performance of each model architecture in the GhostVision sonar-processing pipeline. This enabled determining operational behavior under real survey conditions, geolocation uncertainty, and moving-window inference. This revealed performance patterns not fully reflected in the validation or test-set metrics, particularly with respect to false-positive behavior, temporal stability, and threshold sensitivity. Each evaluation utilized a 3 m radius to match a detection with a manually identified crab pot. The 3 m radius corresponds to the typical horizontal accuracy of the consumer-grade GNSS commonly found in fishfinders. This is a conservative lower bound on total georeferencing uncertainty; heading uncertainty and slant-range conversion introduce additional error that scales with detection range. Reported performance metrics are therefore conservative estimates of operational accuracy. The findings are reported in Table 4 and expanded upon in the following subsections.
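The radius-based matching can be sketched as below. The greedy nearest-first, one-to-one assignment is an assumption made for illustration; the text specifies only the 3 m radius, not how ties or multiple detections near one pot are resolved. Coordinates are taken to be projected easting/northing in metres.

```python
import math


def match_within_radius(detections, references, radius_m=3.0):
    """Greedy one-to-one matching of (x, y) points in metres; returns (tp, fp, fn)."""
    # All candidate pairs within the radius, closest first.
    pairs = sorted(
        (math.dist(det, ref), i, j)
        for i, det in enumerate(detections)
        for j, ref in enumerate(references)
        if math.dist(det, ref) <= radius_m
    )
    used_detections, used_references = set(), set()
    for _, i, j in pairs:
        # Each detection and each reference pot may be matched at most once.
        if i not in used_detections and j not in used_references:
            used_detections.add(i)
            used_references.add(j)
    tp = len(used_detections)
    fp = len(detections) - tp   # detections with no pot within the radius
    fn = len(references) - tp   # pots with no matched detection
    return tp, fp, fn


# Hypothetical coordinates: three mapped pots and four pipeline detections.
reference_pots = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
pipeline_detections = [(1.0, 0.0), (2.0, 0.0), (10.0, 2.5), (50.0, 50.0)]
counts = match_within_radius(pipeline_detections, reference_pots)
```

Here the second detection near the first pot counts as a false positive because that pot is already matched, which is one reason duplicate detections of the same target need to be merged by tracking before scoring.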
4.2.1. Unoptimized Performance
YOLOv12 achieved the strongest overall operational performance (F1 = 0.512), followed closely by YOLOv26 (F1 = 0.497). Both YOLO variants exhibited high recall (>0.8), with YOLOv12 reaching the higher of the two (0.922). However, each model produced nearly twice as many false positives (300 and 266) as true positives (165 and 147), resulting in modest precision (0.355 and 0.356). RF-DETR achieved the highest recall of all models (0.939) but generated more than 1500 false positives, yielding the lowest precision (0.100) and overall performance (F1 = 0.181). Bootstrap resampling across transects confirmed these patterns: YOLOv12 exhibited the highest mean F1 (0.61; 95% CI: 0.52–0.72), YOLOv26 showed slightly lower but comparable stability (0.59; 95% CI: 0.50–0.70), and RF-DETR remained substantially weaker (0.26; 95% CI: 0.20–0.35).
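The reported precision, recall, and F1 values follow directly from the detection counts, as the short check below shows. The reference total of 179 pots is an assumption inferred from the sum of true positives and false negatives reported for the combined score later in the Results; the function name is illustrative.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from pot-level counts, rounded to three decimals."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return round(precision, 3), round(recall, 3), round(f1, 3)


TOTAL_POTS = 179  # assumption: TP + FN from the combined-score breakdown

yolov12 = detection_metrics(tp=165, fp=300, fn=TOTAL_POTS - 165)
yolov26 = detection_metrics(tp=147, fp=266, fn=TOTAL_POTS - 147)
```

With these counts the sketch reproduces the values quoted above for both YOLO variants, which illustrates how a recall above 0.8 can coexist with an F1 near 0.5 when false positives outnumber true positives roughly two to one.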
4.2.2. Confidence Optimized Performance
Confidence sweeps revealed that all three models produced usable score gradients, with distinct operational tradeoffs (Figure 4). YOLOv12 achieved the highest peak F1 (0.72) at its optimal confidence threshold, driven by balanced precision (0.69) and recall (0.74). YOLOv26 performed similarly (F1 = 0.70), with slightly lower precision. RF-DETR reached a comparable peak F1 (0.71), but did so through a different confidence structure: precision increased sharply with threshold (0.86 at optimum) while recall declined to 0.60. This reflects a conservative scoring distribution rather than instability; RF-DETR’s F1 curve was smooth and exhibited a broad plateau, indicating consistent behavior across a range of thresholds. In contrast, the YOLO models maintained higher recall but at the cost of substantially more false positives, producing more balanced but less selective confidence gradients.
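A confidence sweep of this kind can be sketched as follows. The data structure (one confidence score and one matched/unmatched flag per pot-level detection), the function name, and the toy values are assumptions for illustration, not the evaluation code used in the study.

```python
def sweep_confidence(scored_detections, n_reference, thresholds):
    """Return (best_row, all_rows); each row is (threshold, tp, fp, fn, f1).

    scored_detections: list of (confidence, matched) pairs, where `matched`
    is True if the detection was paired with a manually identified pot.
    """
    rows = []
    for threshold in thresholds:
        kept = [matched for conf, matched in scored_detections if conf >= threshold]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = n_reference - tp  # pots missed outright or filtered out by the threshold
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        rows.append((threshold, tp, fp, fn, f1))
    # Pick the threshold with the highest F1 (first one wins on ties).
    return max(rows, key=lambda row: row[4]), rows


# Hypothetical detections for a survey containing four reference pots.
toy_detections = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.3, False), (0.2, False)]
best, rows = sweep_confidence(toy_detections, n_reference=4, thresholds=[0.1, 0.5, 0.75])
```

In the toy data, raising the threshold from 0.1 to 0.5 removes two false positives without losing a pot, while raising it further trades a true positive for one more false positive removed; the F1 peak marks the point where that trade stops paying off.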
4.2.3. Persistence Optimized Performance
All models showed improved precision and F1 with increasing persistence thresholds, confirming that multi-frame agreement through object-tracking is a useful discriminator (Figure 5). RF-DETR achieved the highest overall performance (F1 = 0.67 at persistence = 18), driven by a favorable balance of true positives and substantially fewer false positives than the YOLO models. Its F1 curve was relatively shallow, indicating stable performance across a broad range of thresholds rather than sensitivity to a narrow optimum. In contrast, YOLOv12 and YOLOv26 exhibited lower peak F1 scores (0.57 and 0.60, respectively) and higher false-positive rates, with modestly sharper peaks that reflect greater sensitivity to threshold selection rather than stronger discriminative separation.
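Persistence filtering can be sketched as a count of frames per tracked object. The definition used here (persistence equals the number of distinct video frames in which a track appears) and the row layout are assumptions for illustration; the text does not spell out how persistence is computed from the tracking output.

```python
from collections import Counter


def persistent_tracks(rows, min_frames):
    """Keep tracks seen in at least min_frames distinct frames.

    rows: iterable of (track_id, frame_number) pairs, e.g. read from a tracking CSV.
    Returns a dict mapping each retained track_id to its frame count.
    """
    frames_per_track = Counter()
    for track_id, _frame in set(rows):  # de-duplicate repeated (track, frame) rows
        frames_per_track[track_id] += 1
    return {track: n for track, n in frames_per_track.items() if n >= min_frames}


# Hypothetical tracks: one long-lived target, one single-frame blip, one short track.
toy_rows = (
    [(1, frame) for frame in range(20)]
    + [(2, 5)]
    + [(3, frame) for frame in range(4)]
)
kept = persistent_tracks(toy_rows, min_frames=18)
```

At a threshold of 18 frames only the long-lived track survives, which shows how a high persistence threshold suppresses one-off detections at the risk of dropping real pots that were only briefly visible.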
4.2.4. Combined Confidence and Persistence Performance
To assess whether confidence and temporal persistence provide complementary discrimination, we defined a combined score in which a single weighting parameter controls the relative contribution of confidence and normalized persistence. At one extreme of the weighting the score reduces to pure confidence-based scoring, while at the other it uses persistence alone. Sweeping the weighting parameter from 0 to 1 and thresholding the resulting score surface reveals each model’s sensitivity to score composition. Figure 6 shows the mapped outputs, with example insets illustrating local variation in detections.
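One plausible form for such a combined score is a linear blend of confidence and normalized persistence. Everything in the sketch below is an assumption made for illustration: the linear form, the symbol `alpha`, the convention that `alpha = 1` means confidence only, and normalization by a maximum persistence value. It should not be read as the scoring function used in GhostVision.

```python
def combined_score(confidence, persistence, max_persistence, alpha):
    """Assumed linear blend: alpha = 1 uses confidence only, alpha = 0 persistence only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if max_persistence <= 0:
        raise ValueError("max_persistence must be positive")
    # Scale persistence to [0, 1] so both terms are on a comparable range.
    normalized_persistence = min(persistence / max_persistence, 1.0)
    return alpha * confidence + (1.0 - alpha) * normalized_persistence


# Hypothetical detection: confidence 0.8, tracked for 10 of a possible 20 frames.
confidence_only = combined_score(0.8, 10, 20, alpha=1.0)
persistence_only = combined_score(0.8, 10, 20, alpha=0.0)
balanced = combined_score(0.8, 10, 20, alpha=0.5)
```

Under this assumed form, sweeping the weight traces a straight line between the two single-signal scores for each detection, so the weight matters most for detections whose confidence and persistence disagree.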
Sweeping the combined confidence–persistence weighting revealed that all models benefited from integrating the two signals, but with distinct sensitivity profiles (Figure 7). YOLOv12 achieved its highest F1 (0.72) at a weighting where confidence remained the dominant discriminative factor, with persistence providing modest reinforcement. YOLOv26 peaked at a weighting that further emphasized its reliance on confidence-based scoring. RF-DETR reached its maximum near a balanced weighting, but its F1 curve was comparatively flat, suggesting limited responsiveness to weighting and weaker coupling between confidence and persistence. This apparent proximity to a “balanced” weighting did not translate to stronger discrimination, as both signals contributed modestly to separation between true and false detections.
Detection breakdowns at each model’s optimal combined score supported these trends. RF-DETR and YOLOv12 produced the fewest false positives (FP = 47 for both models), while YOLOv26 yielded more false positives (FP = 59). False negatives were similar across models, with RF-DETR, YOLOv12, and YOLOv26 producing FN = 50, 53, and 49, respectively. True-positive counts were likewise comparable (TP = 129, 126, and 130), indicating that all three models achieved similar overall detection coverage under the combined score. Overall, combining confidence and persistence stabilized threshold selection across models, with only modest differences in error balance among architectures.
4.3. Processing Time
GhostVision processing times were extracted from system log files for all three models across 11 complete test recordings. Processing was performed on a laptop running Windows Subsystem for Linux 2 (WSL 2) with an Intel® (Intel Corporation, Santa Clara, CA, USA) Core™ 7 CPU, 64 GB RAM, and an NVIDIA RTX 5000 Ada Generation GPU. The GPU was used for model inference, with all other workflows processed on the CPU. Total processing times (hh:mm:ss) were YOLOv12 00:11:49, YOLOv26 00:11:23, and RF-DETR 00:12:28, for a combined dataset representing approximately 02:07:00 of sonar survey time. The mean processing-to-survey-time ratio was 8.7% for YOLOv12, 8.8% for YOLOv26, and 9.0% for RF-DETR (approximately 11× faster than real time for all models). Inference-only times were substantially smaller: YOLOv12 139.0 s (00:02:19.0), YOLOv26 121.1 s (00:02:01.1), and RF-DETR 182.7 s (00:03:02.7) across the same 11 recordings. Mean inference-to-survey ratios were 1.74% (YOLOv12), 1.48% (YOLOv26), and 2.25% (RF-DETR), and mean inference-to-processing ratios were 18.99%, 17.14%, and 23.64%, respectively. Individual recording processing times ranged from 36 to 123 s for survey durations of 260 to 1601 s.
5. Discussion
This work demonstrates that low-cost consumer sonar, paired with modern object-detection models, can support reliable automated mapping of derelict crab pots in shallow coastal systems. The three models (YOLOv12, YOLOv26, RF-DETR) diverge substantially under default pipeline settings but converge when post-processing is applied. At default settings, YOLOv12 achieves the strongest practical performance (F1 = 0.512), followed closely by YOLOv26 (F1 = 0.497). RF-DETR, despite its high recall (0.939), produced 1514 false positives at default confidence, which is operationally impractical without further configuration. This also shows why threshold-averaged metrics such as mAP and AUPRC can overstate default operational utility when low-confidence detections are abundant. However, this pattern does not persist: after confidence-threshold optimization, all three models achieve F1 ≈ 0.70–0.72 (Table 4). After persistence filtering, RF-DETR outperforms the YOLO models (F1 = 0.667 vs. 0.574–0.596). This convergence suggests that the choice of model is less important than the choice of post-processing strategy, and that practitioners willing to perform threshold calibration can achieve equivalent performance from any of the three architectures.
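The threshold calibration referred to here can be illustrated with a simple sweep that keeps detections at or above a confidence cut-off and retains the cut-off with the highest F1. This is a sketch under stated assumptions rather than the GhostVision implementation: the function name `sweep_confidence`, the toy detections, and the simplification that each detection's match status is fixed before filtering are all hypothetical.

```python
from typing import Iterable, List, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 expressed directly in terms of detection counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_confidence(
    detections: List[Tuple[float, bool]],  # (confidence, matched a true pot?)
    n_truth: int,                          # number of ground-truth pots
    thresholds: Iterable[float],
) -> Tuple[float, float]:
    """Return (best_threshold, best_f1) over the candidate thresholds."""
    best_t, best_f1 = 0.0, -1.0
    for t in thresholds:
        kept = [is_match for conf, is_match in detections if conf >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = n_truth - tp  # pots no longer covered by any kept detection
        f1 = f1_from_counts(tp, fp, fn)
        if f1 > best_f1:  # strict '>' keeps the lowest threshold on ties
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy data (hypothetical): three matched and four unmatched detections, four pots.
toy = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
       (0.3, False), (0.2, False), (0.1, False)]
best_t, best_f1 = sweep_confidence(toy, n_truth=4, thresholds=[0.0, 0.25, 0.5, 0.75])
print(best_t, best_f1)
```

In the toy data, raising the threshold first removes low-confidence false positives and improves F1, then starts discarding true positives, which is the same trade-off that the confidence sweeps expose for each model.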
The implementation-level evaluation highlights the importance of assessing models within a full processing pipeline. Moving-window inference and temporal persistence filtering substantially improved precision for all architectures, confirming that multi-frame agreement is a valuable discriminator in noisy acoustic imagery. YOLOv12, in particular, maintained strong performance across a broad range of thresholds, making it the most robust choice for heterogeneous survey conditions.
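To make the idea of multi-frame agreement concrete, the sketch below groups georeferenced detections that fall close together and keeps only groups seen a minimum number of times. It is a hedged illustration: the greedy clustering, the 3 m grouping radius (borrowed here from the spatial matching radius used for evaluation), the function names, and the dictionary fields are assumptions, not the pipeline's actual logic; only the `pred_cnt` name comes from the persistence analysis.

```python
import math
from typing import Dict, List, Tuple


def cluster_detections(
    points: List[Tuple[float, float, float]],  # (x_m, y_m, confidence)
    radius_m: float = 3.0,                     # assumed grouping radius
) -> List[Dict[str, float]]:
    """Greedily merge detections that lie within radius_m of a running centroid."""
    clusters: List[Dict[str, float]] = []
    for x, y, conf in points:
        for c in clusters:
            if math.hypot(x - c["x"], y - c["y"]) <= radius_m:
                n = c["pred_cnt"]
                c["x"] = (c["x"] * n + x) / (n + 1)  # update centroid
                c["y"] = (c["y"] * n + y) / (n + 1)
                c["conf"] = max(c["conf"], conf)     # keep the strongest score
                c["pred_cnt"] = n + 1
                break
        else:
            # No nearby cluster: start a new one with a single detection.
            clusters.append({"x": x, "y": y, "conf": conf, "pred_cnt": 1})
    return clusters


def persistence_filter(clusters: List[Dict[str, float]], min_pred_cnt: int):
    """Keep clusters detected in at least min_pred_cnt windows."""
    return [c for c in clusters if c["pred_cnt"] >= min_pred_cnt]


# Toy example (hypothetical coordinates in metres): one target seen three
# times in overlapping windows, plus one isolated detection far away.
toy_points = [(0.0, 0.0, 0.4), (0.5, 0.2, 0.6), (0.2, -0.3, 0.5), (50.0, 50.0, 0.9)]
kept = persistence_filter(cluster_detections(toy_points), min_pred_cnt=2)
print(kept)
```

Note that the isolated detection is dropped despite having the highest confidence, which is why persistence and confidence behave as complementary signals.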
The models were trained and evaluated exclusively on crab pots imaged in shallow Delaware estuaries. Transferability to other gear types, substrates, water depths, sonar frequencies, or vessel configurations has not been tested. Sonar imagery characteristics (backscatter intensity, speckled texture, shadow length) vary substantially with platform, frequency, gain settings, and environmental conditions. We expect that performance on out-of-domain data would differ significantly from that reported here, and that fine-tuning on locally collected annotations would be required for deployment in new contexts. GhostVision’s modular architecture and the publicly available training code are designed to facilitate this adaptation.
GhostVision operates as a rapid post-processing pipeline, not a live detection system. During the implementation evaluation, all three models processed each sonar recording in approximately 7–15% of its survey duration, achieving a processing-to-survey-time ratio of ≈8.9–9.8% (roughly 10–11× faster than real time). Inference alone accounted for approximately 1.5–2.3% of survey duration and roughly 17–24% of end-to-end processing time, indicating that most runtime is spent in upstream and downstream pipeline components (tile generation, georectification, and export). In practice, this means that a 27 min sonar transect can be fully processed—including geocoding and export to GIS and GPS formats—in under three minutes, enabling same-day analysis in routine field deployments. Future work will examine moving the processing to field-ready equipment including edge devices.
GhostVision lowers long-standing barriers to derelict gear monitoring by combining accessible hardware, open-source software, and reproducible and rapid processing steps. This enables community members, small research groups, and resource managers to conduct surveys at spatial scales that previously required specialized equipment and expertise. Limitations remain, including environmental variability, class generalization beyond crab pots, and geolocation uncertainty tied to vessel-specific offsets. Nonetheless, the results show that automated detection using consumer sonar is now feasible for operational stewardship.
6. Conclusions
GhostVision provides an open-source, end-to-end framework for automated detection and geolocation of derelict crab pots using low-cost consumer sonar. All three models achieved comparable operational performance after post-processing (F1 ≈ 0.72 at respective optimized thresholds). YOLOv12 was the most stable model under default operational settings, producing the strongest untuned F1 (0.512) and a far lower false-positive burden than RF-DETR (300 vs. 1514 false positives). RF-DETR demonstrated the highest recall and became competitive after confidence or persistence filtering; its high area under the precision-recall curve (0.770) indicates strong discriminative capacity when an appropriate threshold is applied. The choice among models for operational deployment will depend on the user’s tolerance for default false positives and willingness to perform threshold calibration. These results demonstrate that modern object-detection models, when paired with affordable sonar hardware, can support management-scale derelict gear surveys without the need for specialized instrumentation or expert interpretation. By enabling rapid post-processing and mapping at spatial scales relevant to both professional and community-led removal efforts, GhostVision lowers long-standing barriers to acoustic monitoring and expands access to effective survey tools. The framework establishes a reproducible foundation for future extensions, including multi-class detection, integration with autonomous platforms, and adaptive learning from community-validated recoveries. Together, these capabilities position GhostVision as a scalable pathway toward sustained stewardship of coastal environments.
Author Contributions
Conceptualization, A.T., K.B. and C.S.B.; methodology, C.S.B., K.B., N.A., O.B. and A.T.; software, C.S.B.; validation, C.S.B., K.B. and A.T.; formal analysis, C.S.B.; investigation, C.S.B., K.B., N.A., J.W., O.C., C.H., O.B., O.H., J.G. and A.T.; resources, A.T.; data curation, J.W., O.H., K.B. and C.S.B.; writing—original draft preparation, C.S.B.; writing—review and editing, C.S.B., K.B., N.A., J.W., O.C., C.H., O.B., O.H., J.G. and A.T.; visualization, C.S.B.; supervision, A.T. and K.B.; project administration, A.T.; funding acquisition, A.T. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by funding from NOAA’s Project ABLE via award number NA22OAR4690620-T1-01 and NOAA VIMS TRAP award number 725316-712683.
Data Availability Statement
GhostVision software (Version 1.0.0), developed for this manuscript, is licensed under the GNU General Public License and is available in the PINGEcosystem GitHub organization at https://github.com/PINGEcosystem/GhostVision (accessed on 27 January 2026). All training datasets and fine-tuned object-detection models are licensed under CC BY-SA 4.0 and archived on Zenodo [29].
Acknowledgments
We want to recognize the role of the 2024 Autonomous Systems Bootcamp hosted by the University of Delaware, which facilitated collaborative marine-focused AI projects including identifying wild oysters [35] and derelict crab pots (this work). Special thanks to B. Haywood and Delaware Sea Grant for this collaboration and for organizing impactful Crab Pot Roundups. Finally, thanks to the Coastal Sediments Hydrodynamics and Engineering Laboratory (CSHEL) team, especially G. Otto, for their commitment to mapping and cleaning Delaware’s Inland Bays.
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Abbreviations
The following abbreviations are used in this manuscript:
| SSS | Side-scan sonar |
| IB | Inland Bays |
| DB | Delaware Bay |
| NOAA | National Oceanic and Atmospheric Administration |
| DNREC | Delaware Department of Natural Resources and Environmental Control |
| MMU | Mobile mapping unit |
| GPS | Global positioning system |
| PVC | Polyvinyl chloride |
| CNN | Convolutional neural network |
| GPU | Graphics processing unit |
| RAM | Random access memory |
| GIS | Geographic information system |
| GPX | GPS exchange format |
| AOI | Area of interest |
| mAP | Mean average precision |
| IoU | Intersection over union |
References
- Davis, C.C. A study of the crab pot as a fishing gear. Chesap. Biol. Lab. Contrib. Ser. 1942, 53, 1–20.
- Bishop, J.M. Incidental Capture of Diamondback Terrapin by Crab Pots. Estuaries 1983, 6, 426–430.
- Hart, K.M.; Crowder, L.B. Mitigating by-catch of diamondback terrapins in crab pots. J. Wildl. Manag. 2011, 75, 264–272.
- Bilkovic, D.M.; Slacum, H.W.; Havens, K.J.; Zaveta, D.; Jeffrey, C.F.; Scheld, A.M.; Stanhope, D.; Angstadt, K.; Evans, J.D. Ecological and Economic Effects of Derelict Fishing Gear in the Chesapeake Bay 2015/2016 Final Assessment Report; Technical Report; Virginia Institute of Marine Science, College of William and Mary: Gloucester Point, VA, USA, 2016.
- Arthur, C.; Sutton-Grier, A.E.; Murphy, P.; Bamford, H. Out of sight but not out of mind: Harmful effects of derelict traps in selected U.S. coastal waters. Mar. Pollut. Bull. 2014, 86, 19–28.
- DelBene, J.A.; Bilkovic, D.M.; Scheld, A.M. Examining derelict pot impacts on harvest in a commercial blue crab Callinectes sapidus fishery. Mar. Pollut. Bull. 2019, 139, 150–156.
- Jeffrey, C.F.G.; Havens, K.J.; Slacum, H.W.; Bilkovic, D.M.; Zaveta, D.; Scheld, A.M.; Willard, S.; Evans, J.D. Assessing Ecological and Economic Effects of Derelict Fishing Gear: A Guiding Framework; Technical Report, Prepared for the Marine Debris Program Office of Response and Restoration; NOAA: Washington, DC, USA, 2016.
- Kaeser, A.J.; Litts, T.L. A Novel Technique for Mapping Habitat in Navigable Streams Using Low-cost Side Scan Sonar. Fisheries 2010, 35, 163–174.
- Buscombe, D. Shallow water benthic imaging and substrate characterization using recreational-grade sidescan-sonar. Environ. Model. Softw. 2017, 89, 1–18.
- Bodine, C.S.; Buscombe, D.; Best, R.J.; Redner, J.A.; Kaeser, A.J. PING-Mapper: Open-Source Software for Automated Benthic Imaging and Mapping Using Recreation-Grade Sonar. Earth Space Sci. 2022, 9, e2022EA002469.
- Bodine, C.S.; Buscombe, D.; Hocking, T.D. Automated River Substrate Mapping From Sonar Imagery with Machine Learning. J. Geophys. Res. Mach. Learn. Comput. 2024, 1, e2024JH000135.
- Karimanzira, D.; Renkewitz, H.; Shea, D.; Albiez, J. Object Detection in Sonar Images. Electronics 2020, 9, 1180.
- Bilkovic, D.M.; Havens, K.; Stanhope, D.; Angstadt, K. Derelict fishing gear in Chesapeake Bay, Virginia: Spatial patterns and implications for marine fauna. Mar. Pollut. Bull. 2014, 80, 114–123.
- Fleming, K. Recreational Crab Pot Abandonment and BRD Compliance in Delaware Inland Bays: 2020 and 2021 Summaries; Technical Report; Delaware Sea Grant: Newark, DE, USA, 2021.
- Havens, K.J.; Bilkovic, D.M.; Stanhope, D.; Angstadt, K.; Hershner, C. The Effects of Derelict Blue Crab Traps on Marine Organisms in the Lower York River, Virginia. N. Am. J. Fish. Manag. 2008, 28, 1194–1200.
- Sullivan, M.; Evert, S.; Straub, P.; Reding, M.; Robinson, N.; Zimmermann, E.; Ambrose, D. Identification, recovery, and impact of ghost fishing gear in the Mullica River-Great Bay Estuary (New Jersey, USA): Stakeholder-driven restoration for smaller-scale systems. Mar. Pollut. Bull. 2019, 138, 37–48.
- Bamford, H.A. Programmatic Environmental Assessment for the NOAA Marine Debris Program; Technical Report; NOAA: Washington, DC, USA, 2013.
- Fuchs, L.R.; Gallstrom, A.; Folkesson, J. Object Recognition in Forward Looking Sonar Images using Transfer Learning. In Proceedings of the 2018 IEEE/OES Autonomous Underwater Vehicle Workshop (AUV), Porto, Portugal, 6–9 November 2018; IEEE: New York, NY, USA, 2018; pp. 1–6.
- Wang, F.; Li, H.; Wang, K.; Su, L.; Li, J.; Zhang, L. An Improved Object Detection Method for Underwater Sonar Image Based on PP-YOLOv2. J. Sens. 2022, 2022, 5827499.
- Zhang, H.; Yang, X.; Zhang, R.; Gao, N.; Wang, N.; Zhang, Z.; Zhang, C. Detection of dense fish schools in sonar imagery with a novel YOLOv11-SAS model. Ecol. Inform. 2026, 94, 103646.
- Yang, C.; Li, Y.; Jiang, L.; Huang, J. Foreground enhancement network for object detection in sonar images. Mach. Vis. Appl. 2023, 34, 56.
- El-Mihoub, T.A.; El Gadi, A.; Nolle, L.; Stahl, F. On Object Detection and Explainability with Sonar Imagery. In Proceedings of the 2024 IEEE 4th International Maghreb Meeting of the Conference on Sciences and Techniques of Automatic Control and Computer Engineering (MI-STA), Tripoli, Libya, 19–21 May 2024; IEEE: New York, NY, USA, 2024; pp. 779–786.
- Aubard, M.; Madureira, A.; Teixeira, L.; Pinto, J. Sonar-based Deep Learning in Underwater Robotics: Overview, Robustness and Challenges. arXiv 2024.
- Jiang, L.; Cai, T.; Ma, Q.; Xu, F.; Wang, S. Active Object Detection in Sonar Images. IEEE Access 2020, 8, 102540–102553.
- Yang, C.; Xi, W.; Jiang, L.; Huang, J. Sonar Image Segmentation Framework Based on Semi-Supervised Learning. 2023. Available online: https://www.researchsquare.com/article/rs-3177039/v1 (accessed on 2 April 2026).
- Xi, J.; Ye, X. Sonar Image Target Detection Based on Simulated Stain-like Noise and Shadow Enhancement in Optical Images under Zero-Shot Learning. J. Mar. Sci. Eng. 2024, 12, 352.
- Wang, Z.; Guo, J.; Zhang, S.; Zhang, Y. Sonar-based object detection for autonomous underwater vehicles in marine environments. Front. Mar. Sci. 2025, 12, 1539371.
- Dwyer, B.; Nelson, J.; Hansen, T. Roboflow (version 1.0); Roboflow, Inc.: Des Moines, IA, USA, 2026.
- Bodine, C.; Baxevani, K.; Abbasi, N.; Wierzbicki, J.; Christoph, O.; Hughes, C.; Bagoren, O.; Hines, O.; Greco, J.; Trembanis, A. Derelict Crab Pot Object Detection Dataset and Models for GhostVision 1.0.0. 2026. Available online: https://zenodo.org/records/20056679 (accessed on 1 April 2026).
- Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025.
- Sapkota, R.; Cheppally, R.H.; Sharda, A.; Karkee, M. YOLO26: Key Architectural Enhancements and Performance Benchmarking for Real-Time Object Detection. arXiv 2026.
- Robinson, I.; Robicheaux, P.; Popov, M.; Ramanan, D.; Peri, N. RF-DETR: Neural Architecture Search for Real-Time Detection Transformers. arXiv 2025.
- Conda Contributors. Conda: A System-Level, Binary Package and Environment Manager. 2025. Available online: https://zenodo.org/records/14680664 (accessed on 27 January 2026).
- Bodine, C.S. PINGTile; v1.0.0; PING Ecosystem: Lewes, DE, USA, 2026.
- Campbell, B.; Williams, A.; Baxevani, K.; Campbell, A.; Dhoke, R.; Hudock, R.E.; Lin, X.; Mange, V.; Neuberger, B.; Suresh, A.; et al. Is AI currently capable of identifying wild oysters? A comparison of human annotators against the AI model, ODYSSEE. Front. Robot. AI 2025, 12, 1587033.
Figure 1.
Side-scan sonar survey locations in Rehoboth Bay, Delaware, with derelict crab pots (red points) identified through visual review and manual annotation. Surveys were conducted during the off-season for blue crab harvest between 2020 and 2022. The red box in the inset map shows the location of the large map.
Figure 2.
Overview of crab pot structure, acoustic appearance, and sonar equipment. Panels: (a) power washing a recovered derelict crab pot, illustrating the wire-frame geometry that produces strong acoustic highlights and shadows; (b) schematic of side-scan sonar data collection, showing the grazing-angle geometry that governs target visibility; (c) examples of crab pots imaged with side-scan sonar, highlighting the characteristic bright returns and elongated acoustic shadows used for manual and automated detection; and (d) components of the Mobile Mapping Unit (MMU), which integrates sonar, GPS, and onboard computing for consistent survey acquisition.
Figure 3.
Training history for YOLOv12, YOLOv26, and RF-DETR across six core metrics. Training and validation losses (top row) show rapid convergence for both YOLO models, while RF-DETR maintains higher loss but exhibits smoother optimization. Precision and recall (middle row) illustrate early variability followed by stabilization, with RF-DETR achieving consistently high recall and competitive precision. Mean Average Precision (mAP) at IoU 0.50 and IoU 0.50–0.95 (bottom row) highlight RF-DETR’s superior detection performance across training, despite its slower convergence.
Figure 4.
Confidence–threshold sweeps for RF-DETR, YOLOv12, and YOLOv26, showing precision, recall, and F1 across the full confidence range (a–c), with vertical dashed lines marking the optimal thresholds for each model. RF-DETR achieves the highest precision at its optimal threshold, whereas YOLO models maintain higher recall, producing slightly higher F1 scores. Detection breakdowns at each model’s optimal threshold (d–f) illustrate the resulting trade-offs: RF-DETR yields fewer false positives but more false negatives, while YOLO models detect more true positives at the cost of increased false positives. Together, these plots highlight how confidence threshold selection shapes operational performance in derelict pot detection.
Figure 5.
Persistence–threshold sweeps for RF-DETR, YOLOv12, and YOLOv26, showing how precision, recall, and F1 vary as the minimum required number of consecutive detections (pred_cnt) increases (a–c). Vertical dashed lines mark each model’s optimal persistence threshold, with corresponding optimal metrics summarized beneath each plot. RF-DETR achieves the most balanced performance at its optimal threshold, while YOLO models maintain higher recall but at the cost of reduced precision. Detection breakdowns at the optimal persistence threshold (d–f) highlight these trade-offs: RF-DETR produces fewer false positives, whereas YOLO models detect slightly more true positives but with substantially higher false-positive counts. These results illustrate how temporal persistence filtering shapes model reliability in cluttered acoustic environments.
Figure 6.
Spatial comparison of combined detection scores for RF-DETR, YOLOv12, and YOLOv26 across survey sites. Each panel illustrates true positives (green), false positives (red), and false negatives (purple) overlaid on sonar mosaics, highlighting differences in spatial precision and recall between models. RF-DETR demonstrates broader coverage and higher recall, while YOLO models exhibit more conservative detections with fewer false positives. Insets show representative zoom regions used for manual validation of detection confidence and persistence.
Figure 7.
Combined-score sweeps for RF-DETR, YOLOv12, and YOLOv26, showing how precision, recall, and F1 vary as the unified confidence–persistence score threshold increases (a–c). Vertical dashed lines denote each model’s optimal combined-score threshold, with corresponding optimal metrics summarized beneath each plot. Across models, the combined score produces more balanced operating points than confidence or persistence alone, yielding F1 values between 0.71 and 0.73. Detection breakdowns at the optimal combined score (d–f) show that all models achieve similar true-positive counts, with RF-DETR and YOLOv12 producing fewer false positives than YOLOv26. These results demonstrate that integrating confidence and temporal persistence stabilizes threshold selection and improves overall detection reliability.
Table 1.
Training parameters for each model architecture.
| | YOLOv12 | YOLOv26 | RF-DETR |
|---|---|---|---|
| Checkpoint | YOLO12s | YOLO26s | RF-DETR-S |
| Parameters (M) | 9.3 | 9.5 | 32.1 |
| Batch Size | 40 | 48 | 8 |
| Total Epochs | 200 | 200 | 100 |
| Patience | 30 | 30 | 30 |
| Learning Rate | | | |
Table 2.
Best epoch and corresponding validation metrics for each model, sorted by mAP@50–95. Asterisks (*) indicate the best performance for each metric across models.
| Model | Epoch | Precision | Recall | mAP@50 | mAP@50–95 |
|---|---|---|---|---|---|
| YOLOv12 | 51 | 0.597 | 0.496 | 0.515 | 0.180 * |
| YOLOv26 | 15 | 0.464 | 0.431 | 0.418 | 0.158 |
| RF-DETR | 7 | 0.631 * | 0.580 * | 0.555 * | 0.174 |
Table 3.
Test-set performance for all models computed using Supervision v0.27.0. Asterisks (*) indicate the best performance for each metric across models.
| Model | Precision@50 | Recall@50 | F1@50 | mAP@50 | mAP@50–95 |
|---|---|---|---|---|---|
| YOLOv12 | 0.516 | 0.263 | 0.348 * | 0.157 | 0.060 |
| YOLOv26 | 0.667 * | 0.085 | 0.150 | 0.074 | 0.030 |
| RF-DETR | 0.006 | 0.979 * | 0.011 | 0.379 * | 0.148 * |
Table 4.
Performance summary at four operating points for all three models evaluated against manually georeferenced crab pot locations using a 3 m spatial matching radius. Operating points are: default pipeline (no post-processing threshold); confidence-score threshold optimized to maximize F1; temporal persistence threshold optimized to maximize F1; and combined confidence–persistence score with optimal weighting.
| Model | Operating Point | / | TP | FP | FN | Prec | Rec | F1 |
|---|---|---|---|---|---|---|---|---|
| YOLOv12 | Unoptimized | - | 165 | 300 | 14 | 0.355 | 0.922 | 0.512 |
| | Conf. thresh. | | 133 | 60 | 46 | 0.689 | 0.743 | 0.715 |
| | Pers. thresh. | | 124 | 129 | 55 | 0.490 | 0.693 | 0.574 |
| | Combined score | | 126 | 47 | 53 | 0.728 | 0.704 | 0.716 |
| YOLOv26 | Unoptimized | - | 147 | 266 | 32 | 0.356 | 0.821 | 0.497 |
| | Conf. thresh. | | 133 | 67 | 46 | 0.665 | 0.743 | 0.702 |
| | Pers. thresh. | | 121 | 106 | 58 | 0.533 | 0.676 | 0.596 |
| | Combined score | | 130 | 59 | 49 | 0.688 | 0.726 | 0.707 |
| RF-DETR | Unoptimized | - | 168 | 1514 | 11 | 0.100 | 0.939 | 0.181 |
| | Conf. thresh. | | 107 | 17 | 72 | 0.863 | 0.598 | 0.706 |
| | Pers. thresh. | | 123 | 67 | 56 | 0.647 | 0.687 | 0.667 |
| | Combined score | | 129 | 47 | 50 | 0.733 | 0.721 | 0.727 |