1. Introduction
Effective search and rescue (SAR) in the expansive and dynamic marine environment hinges on the swift and precise detection of objects. Marine object detection, a critical field within computer vision and remote sensing, utilizes various imaging technologies, including satellite imagery, synthetic aperture radar, sonar, and underwater cameras, to identify, classify, and locate items in oceanic settings [
1]. Sophisticated algorithms, particularly convolutional neural networks (CNNs), have demonstrated an exceptional ability to automatically learn complex features from image data. The integration of deep learning into unmanned aerial vehicles (UAVs) and other autonomous systems is revolutionizing maritime SAR operations, offering rapid deployment, broad area coverage, and immediate visual data analysis [
2]. This move towards AI-driven SAR represents a major paradigm shift, overcoming the limitations of human-centric approaches and leading to more efficient and effective emergency responses at sea. Beyond SAR, this technology plays a vital role in diverse applications, such as locating shipwrecks and mines, as well as mapping marine habitats, particularly in underwater environments with limited visibility and complex acoustic conditions [
3]. Marine object detection is increasingly important for enhancing autonomous underwater vehicle (AUV) safety and the efficiency of underwater infrastructure inspection, significantly improving upon the limited human observation possible from ships or aircraft across vast maritime search areas.
The expansion of maritime activities such as shipping, offshore energy, and fishing underscores the critical need for automated detection systems in ensuring maritime security, preserving the environment, and advancing oceanographic studies. The capacity to detect and analyze marine objects yields substantial benefits for global trade, ecological balance, and defense strategies. For instance, satellite SAR-based ship detection helps to control illegal fishing, piracy, and unauthorized vessel presence [
4]. Identifying marine debris, particularly plastics, supports pollution reduction initiatives in line with the United Nations Sustainable Development Goal 14 [
5]. In marine biology, automated species detection, such as for whales and coral reefs, facilitates more effective biodiversity research [
6]. Additionally, developments in deep learning and AUVs have enhanced real-time monitoring capabilities, decreasing the need for manual inspections.
The ability to quickly locate individuals, vessels, or debris is critical for efficient rescue operations and for improving the survival rates of those in distress. The natural complexities of the marine environment, including erratic weather, reduced visibility due to fog or darkness, and the constant movement of both search platforms and the objects being sought, frequently render conventional SAR approaches inadequate. Consider a simple example of detecting a small, high-impact object in the marine SAR risk assessment process; the corresponding risk matrix can be developed as described in Table 1, in which the numbers in brackets show representative values of likelihood, impact, and detectability used to calculate the risk score of the hazard (a small target in this case). The risk score is calculated as follows:

Risk Score = (Likelihood × Impact) / Detectability

The equation suggests that, for marine SAR, higher detectability reduces the risk score, and this can only be enabled by increased technological investment, such as a deep learning-based detection system together with the fusion of thermal cameras with sonar and optical cameras.
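A minimal sketch of this calculation is given below; the numeric values are hypothetical stand-ins for the bracketed entries in Table 1, not the table's actual numbers.

```python
# Minimal sketch of the risk score calculation described above.
# The values below are hypothetical placeholders, not Table 1's entries.

def risk_score(likelihood: float, impact: float, detectability: float) -> float:
    """Risk Score = (Likelihood x Impact) / Detectability."""
    return likelihood * impact / detectability

# A small, hard-to-detect target: high impact, low detectability.
baseline = risk_score(likelihood=3, impact=5, detectability=1)   # 15.0
# Fused thermal/sonar/optical sensing raises detectability.
improved = risk_score(likelihood=3, impact=5, detectability=4)   # 3.75
print(f"risk without fusion: {baseline}, with fusion: {improved}")
```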
Progress in marine object detection is significantly hampered by limited and imbalanced data. Unlike abundant terrestrial datasets (e.g., COCO, ImageNet), marine datasets (e.g., SeaShips [
7], MODD, URPC) are often small, inadequately annotated, or environment-specific [
8]. Satellite data (e.g., Sentinel-1 SAR) suffer from low resolution and noise, while underwater data are affected by turbidity and occlusion [
9]. The lack of consistent benchmarks further complicates the comparison of different solutions [
10,
11].
Several interconnected challenges impede progress on multiple fronts. Environmental factors such as turbidity, waves, and lighting degrade visual data [
12], while sensor limitations (e.g., light refraction in water and color absorption) further complicate optical detection [
Speckle noise in SAR imagery, high acquisition costs, and sparse annotations worsen matters further [
4,
14]. Algorithmic development struggles with real-time processing on resource-limited autonomous devices and generalizing deep learning models due to training data biases [
15,
16]. Additionally, regulatory and ethical concerns, such as privacy in vessel tracking and data sharing restrictions, limit access to crucial annotated datasets [
10].
Despite recent advancements in attention mechanisms and context-aware architectures, further research is crucial to enhance model scalability and generalization across different datasets. Global research continues to tackle these limitations by pursuing several key objectives: developing algorithms, trained on well-annotated data, that generalize effectively across diverse marine environments by mitigating training data biases; enabling real-time processing on resource-limited platforms such as AUVs; addressing regulatory and ethical considerations concerning vessel tracking privacy and data sharing restrictions; and establishing consistent benchmarks that allow fair comparison of model performance across different solutions. Ultimately, the goal is a more accurate, reliable, efficient, and ethically responsible marine object detection system capable of operating effectively in the complex and varied oceanic domain.
This research aims to achieve the following:
experimentally evaluate how well various You Only Look Once (YOLO) models perform in identifying marine objects in aerial images captured under diverse weather conditions for SAR operations;
analyze YOLO models in terms of their generalization potential, computational requirements, and robustness;
identify regulatory efforts that enable robust deep learning-based marine object detection for SAR.
The paper is organized as follows:
Section 2 reviews the recent literature on marine object detection and classification;
Section 3 details the different datasets, technical challenges, and evaluation metrics used in models;
Section 4 presents the benchmarking approach to evaluate YOLO models for marine SAR; in
Section 5, the experimental results from training and testing these models are presented along with measured benchmark values. In
Section 6, the analysis of the results accumulated in
Section 5 is presented. Regulatory efforts are also discussed in this section. Finally, the concluding summary of the key advancements and their potential impact on maritime SAR operations is presented in
Section 7.
2. Literature Review
The use of YOLO models in marine SAR is underpinned by their fundamental theoretical strengths, which make them particularly well-suited for real-time object detection in complex and unpredictable environments. The key theoretical foundations include:
Single-stage object detection: Unlike traditional two-stage detectors, YOLO adopts a single-stage architecture. It processes the entire image in a single forward pass, enabling significantly faster detection.
Grid-based prediction mechanism: YOLO divides the input image into a grid, with each cell responsible for predicting a set number of bounding boxes, object confidence scores, and class probabilities if the object’s center falls within the cell. This approach facilitates simultaneous and distributed multiple object detection across a single image frame.
End-to-end learning: YOLO models are trained end-to-end to map raw pixel data directly to object locations and classes. By processing the entire image during training and inference, YOLO leverages global context, essential for distinguishing small or partially occluded targets from a noisy background such as waves or glare.
CNN-based feature extraction: YOLO employs CNNs as feature extractors to learn high-level representations of objects. These features enable reliable detection of maritime targets such as people, life rafts, and vessels, even under variable lighting, motion blur, or partial occlusions.
Non-maximum suppression (NMS): YOLO incorporates NMS to eliminate overlapping or redundant bounding boxes, retaining only the most confident detections (a minimal code sketch of this step follows this list).
Anchor boxes: Later versions of YOLO utilize anchor boxes derived from clustering ground-truth annotations. The model predicts offsets relative to these anchors, enhancing its ability to detect objects of varying sizes and aspect ratios. This is especially relevant for identifying small or irregularly shaped marine targets.
Multi-sensor fusion capability: YOLO can be integrated with data from thermal/infrared cameras or radar, enabling robust performance in low-visibility or nighttime scenarios. This multimodal fusion enhances detection reliability in challenging conditions typical of real-world SAR missions.
Continuous evolution and specialization: The YOLO family has seen rapid iteration with ongoing enhancements in accuracy, speed, and robustness. These improvements target challenges such as tiny object detection, complex backgrounds, and adverse environmental conditions such as fog, rain, or sea glare that affect marine SAR operations.
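As referenced in the NMS item above, the following minimal, pure-Python sketch illustrates the suppression step. Boxes are assumed to be (x1, y1, x2, y2) corner coordinates with associated confidence scores; production YOLO implementations apply an equivalent vectorized routine per class.

```python
# Minimal sketch of non-maximum suppression (NMS) as used in YOLO
# post-processing; boxes are (x1, y1, x2, y2) with confidence scores.

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop overlapping ones, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```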
Recent literature showcases a growing body of review papers addressing critical aspects of marine environments and safety. Specifically, ref. [
17] provides an intensive review of deep learning-based object recognition for both surface and underwater targets, establishing a unified framework of the key concepts and architectures, compiling benchmark datasets, and offering a comparative analysis of deep learning methodologies. Complementing this, ref. [
18] surveys state-of-the-art deep neural network approaches for marine object detection, a capability deemed crucial for the advancement of autonomous ship navigation, maritime surveillance, and intelligent transportation systems, with a particular focus on YOLO models and the necessity of large-scale, standardized datasets.
In the context of maritime safety, ref. [
19] delves into the automated detection and tracking of small objects during person overboard (POB) incidents, conceptualizing the involved technologies as an interconnected system. It introduces a novel three-phase POB model—detection, search and track, and rescue—detailing the initial two phases and their associated responsibilities. The urgency of rapid response in maritime search and rescue is further highlighted by the advancements in technology, such as UAVs equipped with sophisticated sensors, which have spurred the development of automated person detection systems using aerial imagery. In [
20], both traditional and advanced machine learning/neural network-based techniques are analyzed, and the role of synthetic data in overcoming data limitations is also considered, ultimately guiding readers in selecting the most suitable methodologies and future trends.
Beyond safety, marine pollution, such as oil spills and litter, poses significant threats to ecosystems and industries, demanding advanced monitoring. A review of 53 recent studies [
21] highlights AI’s role in detecting this pollution, showcasing high prediction rates through various model architectures, sensing technologies, and preprocessing methods. However, challenges persist, including limited training data, sensor inconsistencies, and real-time monitoring constraints.
Underwater marine object detection is a fundamental area in marine science and engineering with significant potential for ocean exploration, ecosystem monitoring, natural resource exploration, and fisheries management. Recognizing deep learning’s impact, a recent review [
22] categorized challenges in vision-based underwater object detection, including image quality degradation, small object detection, poor generalization, and real-time detection. This article also assessed datasets, compared findings with the previous AI reviews, and discussed future trends in this dynamic field.
UAV-based object detection in maritime environments faces challenges due to limited annotated training data and complex backgrounds [
23]. To address this, researchers developed the Maritime Search and Rescue Target Dataset (MSRTD) and proposed MSR-YOLO, an efficient detection model. Furthermore, detecting submerged individuals from UAVs is difficult, especially with sunlight reflection [
24]. This led to the creation of ABT-YOLOv7, which integrates an asymptotic feature pyramid network (AFPN), a BiFormer module for small object detection, and a task-specific context decoupling (TSCODE) mechanism. These advancements significantly improve detection accuracy and robustness in challenging lighting conditions.
Beyond optical UAV imagery, deep learning is crucial for processing multimodal ocean sensor data to enable intelligent perception and maritime target detection. In [
25], these technologies were explored, emphasizing the mathematical foundations of deep learning architectures such as SSD, R-CNN, and YOLO. It also highlighted the value of combining deep learning with image enhancement, data augmentation, and transfer learning to combat issues such as underwater image degradation and nonlinear noise. For detailed spectral analysis, a framework [
26] using hyperspectral imaging and machine learning models showed that CNNs (EfficientNet B0, Inception V3) achieved a significantly higher accuracy than traditional classifiers, establishing hyperspectral imaging as a valuable asset for advanced SAR.
The challenge of detecting small maritime vessels in cluttered aerial imagery has led to several innovative solutions. Maritime Background Suppression Network (MBSDet) [
27] tackles this by combining a background suppression module with a multidimensional feature enrichment (MFE) module, demonstrating superior performance on HRSC2016 and DOTA v1.0 datasets. For time-sensitive SAR operations, SG-Det [
28] offers a lightweight, real-time detector based on Shuffle-GhostNet that prioritizes speed without sacrificing accuracy. Similarly, YOLO-BEV [
29] incorporates a PAN+ with an extra-small-object detection head, a C2fSESA attention module for feature aggregation, and an RGSPP structure to reduce computational overhead. Evaluated on the MOBDrone dataset, YOLO-BEV achieved high accuracy with real-time frame rates.
Fog presents a significant challenge for maritime object detection, leading to the development of SRC-YOLO [
30], an improved YOLOv4-tiny model. SRC-YOLO utilizes a single-scale retinex for visual distortion mitigation, a modified receptive field block to expand the receptive field, and a convolutional block attention module for enhanced feature focus, significantly improving detection in foggy maritime scenes. For underwater applications relying on a side-scan sonar, the BES-YOLO [
31] network is designed to improve detection accuracy for multi-scale seafloor targets in noisy, complex environments. By incorporating an efficient multi-scale attention mechanism and a BiFPN for feature fusion, BES-YOLO achieves gains in detection and efficiency.
YOLO-SONAR [
32] is a new model designed for marine object detection in forward-looking sonar images, addressing challenges such as low resolution and seabed interference. It incorporates a competitive coordinate attention mechanism for noise reduction, a context feature extraction module to improve small object detection, and Wise-IoU v3 loss to address class imbalance. YOLO-SONAR outperforms the existing methods, achieving mAP scores of 81.96% on MDFLS and 82.30% on the new WHFLS datasets. However, it faces computational cost and data dependency limitations.
Underwater optical imaging faces a significant hurdle in marine object detection due to color disparities caused by how light is absorbed and scattered in water. These distortions obscure object boundaries, making it difficult for both human operators and automated systems to identify crucial elements such as people, vessels, or debris, especially in lifesaving SAR scenarios. Addressing these color issues is fundamental for building effective automated detection systems. For instance, restoring a drowning victim’s thermal signature in green-tinted water can significantly aid UAV-based detection. Recent research [
33] has leveraged principles of human visual perception to dynamically adjust color balance and contrast, mimicking human adaptability in turbid conditions and producing visually enhanced images. Another approach [
34] achieves underwater image enhancement through color correction using such techniques as the gray world assumption, employing type-II fuzzy sets for visibility recovery, and contrast enhancement using curve transformations. These methods are crucial because deep learning models such as YOLO rely on high-quality input data to extract meaningful features and make accurate predictions, ultimately supporting more reliable automated analysis in challenging marine environments.
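As an illustration of the color-correction step mentioned above, the following is a minimal sketch of the gray world assumption; it assumes an RGB image held as a floating-point NumPy array in [0, 1], and is not the cited papers' full enhancement pipeline.

```python
# Minimal sketch of gray world color correction: each channel is scaled
# so its mean matches the global mean, countering the blue/green cast
# of underwater imagery. Assumes an H x W x 3 float image in [0, 1].
import numpy as np

def gray_world(image: np.ndarray) -> np.ndarray:
    channel_means = image.reshape(-1, 3).mean(axis=0)
    gain = channel_means.mean() / (channel_means + 1e-9)
    return np.clip(image * gain, 0.0, 1.0)
```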
Table 2 summarizes recent notable research articles related to marine SAR, which reflect the growing sophistication and diversity of deep learning applications in maritime object detection. Based on the literature review, several research gaps in marine object detection for search and rescue emerge for further investigation:
(a) Data scarcity and imbalance: there is a lack of annotated datasets specifically for marine SAR.
(b) Generalization: the existing models struggle to perform consistently across different marine data types.
(c) Detection of small and partially occluded objects: detecting small and partially hidden objects in complex marine environments is still a major challenge.
(d) Real-time processing challenges: achieving real-time detection and analysis on platforms with limited computational resources remains a significant technical barrier.
(e) Benchmarking and standardization: the absence of consistent benchmarks and evaluation methods makes it difficult to compare detection models across different studies and datasets.
(f) Regulatory and ethical issues: there are unaddressed concerns regarding privacy, data sharing, and ethical AI use in maritime SAR operations.
The main contributions of this research are outlined below:
Large-sized datasets were employed to examine YOLO models for robustness (research gaps “a,” “b,” and “c”).
Consistent benchmarks were used to evaluate YOLO models and recent studies (research gap “e”).
Computational load analysis of YOLO models was investigated for SAR operations (research gap “d”).
Recent benchmarking efforts were discussed for real-world utility (research gap “f”).
3. Datasets, Evaluation Metrics, and Technical Challenges
As maritime surveillance and SAR operations are becoming increasingly vital, the development and evaluation of models rely heavily on robust and diverse datasets. These datasets form the backbone for tasks such as detection, classification, and tracking of maritime objects, including vessels, buoys, humans, debris, etc. To enable UAV-based YOLO models to reliably detect marine objects in real-world settings, their training and evaluation must be grounded in datasets that represent the following operational complexities:
The dataset composition should incorporate temporal and geographic diversity and varied environmental conditions, reflecting differences in lighting, sea state, time of day, season, types of water bodies, and UAV perspectives (altitudes and angles). It should also include a diverse set of annotated marine objects, especially for applications such as SAR and maritime surveillance.
Each object should be labeled using standardized bounding box formats and consistent class definitions. A balanced class distribution is critical to avoid model bias. In the case of video datasets, every frame should be annotated individually to support object detection and multi-frame tracking.
UAV imagery must be high-resolution to capture small, overlapping, and distant targets. Both imagery and annotations should conform to widely accepted standards to ensure compatibility with YOLO training pipelines. Data (real and synthetic) must avoid extreme class imbalance to prevent bias in model predictions and enhance generalization.
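As one concrete check of the class-balance requirement above, the following minimal sketch counts instances per class across YOLO-format label files; the directory path is a placeholder.

```python
# Minimal sketch of a class-balance check over YOLO-format label files
# (one "class cx cy w h" row per object); the path is a placeholder.
from collections import Counter
from pathlib import Path

def class_distribution(label_dir: str) -> Counter:
    counts = Counter()
    for txt in Path(label_dir).glob("*.txt"):
        for line in txt.read_text().splitlines():
            if line.strip():
                counts[int(line.split()[0])] += 1
    return counts

print(class_distribution("labels/train"))  # e.g., Counter({1: 5400, 0: 310})
```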
A well-curated dataset ensures that a YOLO-based UAV model performs reliably in marine environments. For strong model performance, the key dataset qualities include the following:
A large number of labeled images for deep learning robustness.
Low occlusion and clutter to minimize obstructions or augment data to handle them.
Multi-scale objects to ensure objects appear at varying scales.
Normalized and augmented data, possibly filtered for noise.
Geospatial metadata may help in contextualizing detection scenarios.
Below, publicly available datasets are reviewed and summarized in
Table 3.
3.1. Dataset Descriptions
MODDv2: Used for object detection, classification, and tracking, with a focus on detecting marine vessels and debris. It lacks multiple frequency bands and horizontal view orientation for aerial view applications [
35].
Singapore Maritime Dataset (SMD): Designed for detection, classification, and tracking using RGB and NIR frequency bands. It consists of three video streams captured from various altitudes and angles [
36].
OpenSARShip: The OpenSARShip dataset is a satellite dataset including SAR images in VV and VH polarizations, with bounding box annotations. It is good for radar-based detection, especially in low-visibility conditions [
37].
S2Ships: This dataset includes satellite imagery with RGB and multispectral bands, supporting ship detection with bounding box annotations. Its multispectral bands provide flexibility across various conditions [
38].
AFO (Aerial Dataset of Floating Objects): The AFO dataset contains aerial drone imagery focused on floating objects, such as kayaks, buoys, people, and boats. It is annotated with bounding boxes for object detection [
39].
xView3 SAR: This is a large-scale dataset focused on maritime object detection using SAR imagery, with various ship types and floating objects. This dataset is valuable in low-visibility conditions [
40].
LaRs: This dataset provides top-down RGB images, specifically designed for obstacle detection in maritime environments. This dataset is useful for identifying obstacles and mapping hazards [
41].
SeaDronesSee: This dataset is used for object detection and tracking with drone footage over maritime environments. It includes various object classes such as people, boats, and floating objects, supporting SAR [
42].
Seagull: This aerial dataset is designed for maritime surveillance, including RGB and thermal images, with bounding box annotations for various objects, making it suitable for both day and night detection [
43].
Multi-Category Large-Scale Dataset for Maritime Object Detection (MCMOD): This larger dataset contains images with annotated maritime objects, all captured by three onshore high-resolution video cameras in Hainan, China [
44].
While no single “Olympics” exists for marine object detection, the field is actively advanced by dedicated workshops and challenges. These initiatives utilize standardized datasets (satellite, aerial, drone imagery of marine, coastal, and port areas) and well-defined evaluation methodologies to foster progress in marine robotics, environmental monitoring, and underwater exploration. Prominent examples that serve as evolving benchmarks for marine object detection, particularly in SAR scenarios, include the following:
The SeaDronesSee Challenge, organized within the IEEE Global Vision Challenges framework, and often associated with CVPR/IROS workshops, for detecting ships, swimmers, buoys, and other marine objects;
Maritime Object Detection Challenge (MCMOT), which focuses on multi-camera object detection and tracking in varied conditions with day/night footage and adverse weather conditions;
Maritime Computer Vision (MaCVi) Challenge, as part of the IEEE/CVF conferences, with a focus on detection and classification of ships, buoys, and other maritime objects.
3.2. Evaluation Metrics
The object detection performance on marine datasets commonly involves several metrics. These include:
Intersection over union (IoU), a measure of the spatial overlap between predicted and ground-truth bounding boxes, often with a 0.5 threshold for a positive detection (see the computation sketch at the end of this subsection).
Average precision (AP), which integrates precision across varying recall levels.
Mean average precision (mAP), the arithmetic mean of AP values across all object classes.
Precision and recall, assessing the rates of correct and complete detections, respectively.
F1-score, representing the harmonic mean of precision and recall.
Confusion matrix, a visualization of classification performance across different object categories, highlighting potential misclassifications (e.g., between rafts and speedboats).
Confidence score, indicating the model’s prediction certainty.
These metrics offer a comprehensive assessment of an object detection model’s performance and robustness in maritime environments, considering such factors as object scale, environmental conditions, and inter-class similarities. We will use these in our benchmarking experiments.
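A minimal computation sketch of these metrics is given below. It assumes detections have already been matched to ground truth at IoU ≥ 0.5, and the AP routine uses simple rectangular integration rather than the interpolated schemes of COCO-style evaluators.

```python
# Minimal sketch of the detection metrics above; tp/fp/fn are the
# counts for one class after IoU-based matching to ground truth.

def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_precision(precisions, recalls):
    """Simplified AP: rectangular area under the (recall, precision) curve."""
    ap, prev_recall = 0.0, 0.0
    for r, p in sorted(zip(recalls, precisions)):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# Example counts (hypothetical): 81 TP, 9 FP, 15 FN for one class.
print(precision_recall_f1(81, 9, 15))
# mAP is then the mean of the per-class AP values.
```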
3.3. Technical Challenges
As discussed above, marine object detection for SAR faces substantial technical hurdles. Ensuring robustness and generalization across various scenarios and integrating multi-sensor data further complicate development. These technical challenges and their underlying causes are presented in
Table 4. The existing datasets, discussed in
Section 3.1, do not account for all these challenges. Achieving generalized model performance across different datasets thus becomes crucial for an effective marine SAR system.
4. Proposed Approach
The increasing importance of maritime surveillance and SAR demands reliable and varied data to train and assess computer vision models effectively. Based on the data presented in
Table 3 and
Table 4, the AFO and SeaDronesSee datasets were chosen for benchmarking YOLO-based marine SAR models, as both were collected from aerial platforms under variable weather conditions. The SeaDronesSee dataset provides a rich and dynamic collection of high-resolution RGB images, supporting diverse tasks, tracking sequences, supplementary synthetic data, and specialized subsets. The AFO dataset, in turn, presents several key advantages for creating and assessing detection and classification models in maritime contexts: its variety of object types in real-world scenarios makes it highly effective for training resilient deep learning models for aerial surveillance, especially for SAR purposes.
A small marine object in YOLO models is generally characterized by its bounding box size relative to the overall image. Often, this means the object’s bounding box occupies less than 1% of the total image area or has dimensions smaller than 32 × 32 pixels within a 640 × 640 input image. This category includes such objects as small marine species, buoys, small boats, and floating debris.
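As a minimal sketch, this size criterion can be written as a simple helper; the 1% and 32-pixel thresholds are taken directly from the definition above, and the example boxes are hypothetical.

```python
# Minimal sketch of the small-object criterion described above: a target
# is "small" if its box covers less than 1% of the image area or is
# smaller than 32 x 32 pixels at a 640 x 640 input resolution.

def is_small_object(box_w: float, box_h: float,
                    img_w: int = 640, img_h: int = 640) -> bool:
    area_fraction = (box_w * box_h) / (img_w * img_h)
    return area_fraction < 0.01 or (box_w < 32 and box_h < 32)

print(is_small_object(20, 25))    # True: a buoy-sized target
print(is_small_object(120, 90))   # False: a boat-sized target
```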
YOLOv8’s anchor-free architecture, combined with an improved loss function, enables more precise bounding box predictions compared to YOLOv5. This enhancement is particularly beneficial for detecting small, irregular, or partially occluded objects. The model performs well in identifying ships, marine mammals, and floating debris, making it suitable for general maritime monitoring. However, YOLOv8 struggles with accurately detecting very small objects, especially under conditions of wave interference and low resolution [
45]. This limitation is particularly critical in SAR operations, where missing survivors, small debris, or life rafts due to detection failures can have severe consequences [
46]. While YOLOv8 is lightweight and optimized for real-time processing, even on edge devices, its effectiveness diminishes in complex marine scenes with cluttered backgrounds, small targets, or overlapping objects. Despite its speed and efficiency, these challenges highlight the need for further improvements in small-object detection accuracy for high-stakes marine applications.
While YOLOv9 excels in general object detection, its ability to detect very small marine objects (less than 50 pixels) in search and rescue (SAR) operations is limited [
47]. It often misses these tiny targets in complex marine scenes, especially when they are occluded or blend with such features as waves and floating debris, resulting in inaccurate bounding box localization. In contrast, such architectures as LFN-YOLO and CFSD-UAVNet improve small-object detection by integrating SPD-Conv to maintain spatial details and GFPN for multi-scale feature fusion, capabilities not natively present in YOLOv9 [
48,
49]. Furthermore, modified versions, such as MAR-YOLOv9, address these shortcomings through the implementation of enhanced loss functions and attention mechanisms, which are optimizations that YOLOv9 lacks.
YOLOv10 enhances object detection in marine environments by integrating anchor-free detection with task-specific decoupled heads, significantly improving classification, particularly for small marine targets [
50]. Its upgraded architecture captures multi-scale spatial details more effectively, boosting detection accuracy in challenging conditions such as sea clutter, sun glint, foam, and occlusions. Optimized for real-time performance, YOLOv10 achieves higher FPS and mAP than YOLOv8 across most scenarios while maintaining efficient memory usage, making it ideal for onboard UAV deployment. Despite its compact size and low parameter count, the model retains high accuracy, proving especially effective in resource-constrained environments such as UAVs. These advancements position YOLOv10 as a leading choice for UAV-based SAR missions [
50], where speed and accuracy are critical for detecting small objects in dynamic marine settings.
YOLOv11 represents a significant leap forward in maritime object detection, refining YOLOv10’s architecture through advanced Neural Architecture Search and an optimized backbone-head design [
51]. The introduction of multi-scale feature interaction modules combined with an attention-enhanced FPN significantly boosts detection capabilities, particularly for distant, occluded, or partially submerged objects in challenging marine environments. Engineered to excel in low-visibility conditions, YOLOv11 demonstrates exceptional resilience to motion blur and camera shake, critical for SAR operations, and automated port surveillance. Its parallel multi-task processing enhances efficiency in detecting small or overlapping marine targets, even in cluttered high-resolution imagery [
51]. Despite these advancements, YOLOv11 maintains real-time inference speeds with higher accuracy than its predecessors, making it ideal for multi-object detection.
Addressing the intricacies of marine object detection, particularly small targets and fluctuating maritime environments, the YOLO family has evolved with each iteration. Finally, YOLOv7's versatility shows in its ability to handle varying sea conditions and dim lighting while identifying small boats, swimmers, and other crucial objects [
52]. A comparative table (
Table 5) presents a summary of the suitability of YOLO models for detecting very small objects in a complex marine environment.
Based on the discussion, this study is restricted to the training and testing results of three YOLO models (YOLOv7, YOLOv10, and YOLOv11) on the SeaDronesSee and AFO datasets. The experimental workflow model is illustrated in
Figure 1.
Figure 1 outlines the deep learning methodology adopted in this study. The process begins by loading image data from the first dataset. This step is followed by preprocessing, which includes label assignment. The preprocessed data are then saved and subsequently divided into training and testing sets. Here, it is ensured that class distribution is balanced for better model performance. For each model (YOLOv7, YOLOv10, and YOLOv11), the learning process involves cross-validation using optimization of hyperparameters. Once the training is completed, the evaluation metrics (discussed in
Section 3.2) are measured to judge training performance. If a predefined performance criterion is met, the model is evaluated on the testing set, and the resulting evaluation metrics are recorded as testing results. This procedure is completed for each model, and the entire training and testing process is then repeated for the second dataset. Finally, the performance of all three models is analyzed on each dataset to assess generalization, computational efficiency, and robustness using the evaluation metrics and the confusion matrix.
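A minimal sketch of this per-model training and evaluation loop is shown below. It assumes the ultralytics Python package, which provides YOLOv10 and YOLO11 weights (YOLOv7 is trained from its original repository); the dataset YAML files, epochs, and batch size are placeholders rather than the settings used in this study.

```python
# Minimal sketch of the Figure 1 workflow for two of the three models;
# dataset YAMLs and hyperparameters are illustrative placeholders.
from ultralytics import YOLO

for weights in ["yolov10m.pt", "yolo11m.pt"]:
    for data_yaml in ["seadronessee.yaml", "afo.yaml"]:
        model = YOLO(weights)                                   # load pretrained weights
        model.train(data=data_yaml, epochs=100, imgsz=640, batch=16)
        metrics = model.val(data=data_yaml, split="test")       # mAP, precision, recall
        print(weights, data_yaml, metrics.box.map50)
```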
5. Experimental Methodology
This section presents the experimental setup and the results obtained by training and testing the YOLO models on the selected datasets.
5.1. Dataset Preprocessing
The SeaDronesSee and AFO datasets primarily contain RGB images and video data with associated annotations. The SeaDronesSee dataset contains a large collection of still images and frame sequences used for single-object and multi-object detection and tracking. The current version contains 14,227 images (8930 for training, 1547 for validation, and 3750 for testing) across six classes: swimmers, boats, jet skis, lifesaving appliances, buoys, and “ignored” regions. Its continuous updates ensure its relevance to real-world situations, making it a valuable resource for developing autonomous UAV-based SAR technologies. The AFO dataset contains a large collection of images featuring a broad spectrum of floating objects, including boats, debris, and natural clutter, captured from stationary and moving ground-based sensors at varied resolutions, as well as video clips captured by drone-mounted cameras. Annotations are typically provided in a structured JSON format along with metadata, which include object locations, dimensions (height and width), classes, GPS coordinates, altitude, camera angles, and environmental conditions (e.g., waves, glare). Since YOLO algorithms require a specific normalized text-based annotation format, consisting of class indices and bounding box coordinates normalized to image dimensions (values between 0 and 1), the JSON annotations were converted to meet these requirements. The datasets were then divided into training and validation sets, and separate test sets were created by reserving 10% of the original training data.
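A minimal sketch of this conversion step is given below, assuming COCO-style JSON entries with absolute-pixel [x, y, w, h] boxes; the field names are illustrative of the datasets' metadata layout rather than their exact schema.

```python
# Minimal sketch of JSON-to-YOLO annotation conversion; assumes
# COCO-style entries with absolute-pixel [x, y, w, h] boxes.
import json
from pathlib import Path

def coco_to_yolo(json_path: str, out_dir: str) -> None:
    data = json.loads(Path(json_path).read_text())
    images = {img["id"]: img for img in data["images"]}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for ann in data["annotations"]:
        img = images[ann["image_id"]]
        x, y, w, h = ann["bbox"]
        # YOLO format: class x_center y_center width height, all in [0, 1].
        cx, cy = (x + w / 2) / img["width"], (y + h / 2) / img["height"]
        nw, nh = w / img["width"], h / img["height"]
        # category_id may need remapping to contiguous 0-based indices.
        label = out / (Path(img["file_name"]).stem + ".txt")
        with label.open("a") as f:
            f.write(f'{ann["category_id"]} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}\n')
```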
5.2. Results
To analyze performance, YOLOv7, YOLOv10, and YOLOv11 were trained using the training set of both datasets. The benchmark metric results for YOLOv7 are graphed in
Figure 2 and
Figure 3 for the SeaDronesSee and AFO datasets, respectively. Likewise, the training results of YOLOv10 and YOLOv11 are displayed in
Figure 4 and
Figure 5, and in
Figure 6 and
Figure 7, respectively.
The results displayed in
Figure 2 reveal strong performance for most classes (e.g., ignored at 0.95 TP, jet ski at 0.94 TP) but highlight critical weaknesses in “lifesaving appliances” (only 0.52 TP with 0.48 FP), indicating frequent misclassification as background or other objects. Moreover, as shown in
Figure 3, the model shows high precision and recall for the majority of classes, with “ignored” and “jet ski” performing particularly well. However, lifesaving appliances remain a problem class, showing a low true-positive rate and high misclassification, which suggests the model is often confusing this class with the background or other objects. This highlights a clear need for either more representative training data or better class balancing, as this class is not being learned as well as the others.
The training results for YOLOv10 (in
Figure 4) show solid learning for “ignored” and “jet ski” (≈0.85 true-positive each with <10% background spill), moderate performance on “boat” (0.61 TP, with 15% mis-routed to lifesaving gear and many misses), weak recall for swimmers (0.46 TP, 53% FN), and an almost blind spot for “lifesaving appliances” (0.08 TP, 77% FN); the overall background false-positives stayed below 10%, but a sizable share of real boats and swimmers still vanish into the background, signaling those two classes need targeted augmentation or class-balancing.
Figure 5 demonstrates that YOLOv10 on AFO effectively suppresses background noise, maintaining false positives largely below 10%. The “ignored” and “jet ski” classes continue to show strong performance. However, boat detection is only moderate, with a significant number of boats either being misclassified as lifesaving gear or missed entirely. Swimmer detection remains weak, with true positives below 0.5 and a very high false negative rate. This indicates a particular need for more data augmentation or improved class distribution for swimmers. The model also almost entirely misses lifesaving appliances, suggesting significant difficulty with this class on the current dataset.
The training results in
Figure 6 show strong performance for most classes, with “ignored” (0.97 TP) and “jet ski” (0.91 TP) demonstrating excellent detection, while “boat” (0.89 TP) and “swimmer” (0.82 TP) show good but slightly noisier performance with moderate false positives (8–10% FP). The main weakness remains “lifesaving appliances” (0.63 TP with 28% FN), indicating persistent detection challenges, likely due to complex features or data imbalance. Background suppression is effective (<10% FP for most classes). YOLOv11 yields positive results (shown in
Figure 7) on the AFO dataset. As shown in
Figure 7, the model maintains strong precision and recall for the “ignored” and “jet ski” classes. Detection for “boat” and “swimmer” also improved compared to earlier versions, although there are still some moderate false positives for these categories. The primary challenge continues to be the “lifesaving appliances” class. While its performance slightly improved, both recall and precision remain lower compared to other classes, likely due to class imbalance or difficult features. Background suppression is effective across the board, with consistently low false positives for the background. Overall, YOLOv11 demonstrates balanced results, but improvements for “lifesaving appliances” are still necessary.
After training, the YOLO models were tested using the testing set of the SeaDronesSee dataset. The testing results are shown in
Table 6. The results indicate that YOLOv7 performs slightly better than YOLOv10 and YOLOv11. To illustrate this clearly, the confusion matrix of the YOLOv7 model for the SeaDronesSee dataset is shown in
Figure 8.
The YOLOv7 testing results (
Figure 8) show excellent performance for most classes, with near-perfect detection of “ignored” (0.99 TP), “swimmer” (0.96 TP), and “jet ski” (0.94 TP), demonstrating robust generalization. However, “boat” has moderate false positives (9% FP, likely confused with the background), while lifesaving appliances show improved but still suboptimal detection (0.81 TP with 15% FN). Background noise remains well-suppressed (<10% FP overall). For the AFO dataset, the testing results are shown in
Table 7, and the confusion matrix is displayed in
Figure 9.
The results of YOLOv7 on the AFO dataset exhibit strong generalization. The “ignored”, “swimmer”, and “jet ski” classes all achieve high true positive rates with minimal misclassifications. Boat detection is solid, though not perfect, with some moderate false positives likely confused with the background. While improved compared to training, “lifesaving appliances” detection still lags behind other classes, showing a higher false negative rate. This indicates that the model continues to struggle with reliably identifying this class in real test data. Notably, background false positives are well-controlled, suggesting effective noise suppression.
6. Analysis and Discussion
The YOLO series of models achieves a strong balance between speed and accuracy. Yet, when tested on specialized datasets such as SeaDronesSee and AFO, which feature demanding maritime conditions and drone-captured imagery, certain shortcomings emerge across the YOLOv7, YOLOv10, and YOLOv11 models. These challenges underscore the necessity for further model refinement and dataset-specific enhancements to ensure reliable performance in maritime and aerial search and rescue operations.
Table 8 outlines the key performance limitations observed for YOLOv7, YOLOv10, and YOLOv11 models on these datasets.
Based on experimental results, the YOLO models and their variants developed for marine environments can be compared based on generalization, performance, and computational requirements.
Table 9 shows the comparison for each dataset. Standard YOLOv7 yields slightly better generalized performance than the YOLOv10 and YOLOv11 models across all classes. The YOLOv7 variants perform slightly better still, but their performance cannot be generalized, as they were developed specifically for relatively large objects such as ships. A similar argument holds for the YOLOv10 and YOLOv11 variants, as they were developed and tested on custom datasets with better results.
The performance of YOLO models can also be compared in the context of real-time or near-real-time SAR operations. In deep learning models such as YOLOv7, YOLOv10, and YOLOv11, the floating-point operations (FLOPs) in the forward pass primarily determine the computational complexity. For a standard convolutional layer, the FLOPs can be computed as:

FLOPs = 2 × Cin × Cout × Kh × Kw × Hout × Wout (1)

where Cin and Cout represent the input and output channels, Kh and Kw represent the kernel height and width, and Hout and Wout represent the output feature map height and width. The multiplier ‘2’ accounts for both the multiplication and the addition in each multiply-accumulate operation. The backbone and head of YOLOv7 primarily utilize small 3 × 3 convolutional kernels across most feature extraction layers, striking a balance between model capacity, computational efficiency, and receptive field effectiveness; 1 × 1 kernels are occasionally employed for channel reduction. YOLOv10 and YOLOv11 build on this foundation, maintaining the same design approach established by earlier versions.
For an intermediate layer, this can be calculated as follows. Assume a convolutional layer with kernel size Kh × Kw = 3 × 3, input channels Cin = 512, output channels Cout = 1024, and an input feature map resolution of Hin = 20, Win = 20. Without padding, the output dimensions are Hout = Hin − Kh + 1 = 18 and Wout = Win − Kw + 1 = 18. Thus, the number of FLOPs for this intermediate layer can be calculated using Equation (1) as:

FLOPs for this intermediate layer = 2 × 512 × 1024 × 3 × 3 × 18 × 18 ≈ 3.06 GFLOPs
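The per-layer formula and this worked example can be verified with a short script; this is a minimal sketch, and the layer list is illustrative rather than an actual YOLO configuration.

```python
# Minimal sketch of Equation (1); the layer list below is illustrative,
# not an actual YOLO architecture specification.

def conv_flops(c_in: int, c_out: int, k_h: int, k_w: int,
               h_out: int, w_out: int) -> int:
    """FLOPs = 2 * C_in * C_out * K_h * K_w * H_out * W_out."""
    return 2 * c_in * c_out * k_h * k_w * h_out * w_out

# The intermediate layer from the worked example above:
print(conv_flops(512, 1024, 3, 3, 18, 18) / 1e9)  # ~3.06 GFLOPs

# Total model complexity: sum the per-layer counts over all layers.
layers = [(3, 32, 3, 3, 320, 320), (32, 64, 3, 3, 160, 160)]
total_flops = sum(conv_flops(*layer) for layer in layers)

# Model size at 32-bit precision, as used for Table 10:
# size_bytes = num_parameters * 4
```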
The total complexity of a YOLO model is thus determined largely by its core stacked convolutional layers, although other operations, such as activations and upsampling, also contribute:

Total FLOPs = Σ_{l=1}^{L} FLOPs_l (2)

where L is the number of layers and FLOPs_l is the per-layer count from Equation (1). The difference in FLOPs across the YOLO models is driven by the number of parameters involved in each convolution. This comparison is detailed in Table 10, where model size is calculated by multiplying the number of parameters by four, assuming 32-bit (4-byte) storage for each parameter.
Table 10 shows that YOLOv10 (medium) involves the lowest FLOPs of the three models, making it a good candidate for deploying SAR applications on edge devices. Improved lightweight variants of YOLOv7 reduce computation and memory access time using techniques such as partial convolutions [
59]. Architectural innovations in YOLOv10 variants optimize the total FLOPs by removing redundancy and overhead [
60]. Likewise, YOLOv11 variants employ optimizations such as pruning and the removal of blocks for certain object sizes to reduce the total number of operations [
61].
Table 10 also shows the inference time computed on a dedicated NVIDIA T4 GPU platform, along with the runtime measured in this experimental work. The difference between inference time [
62,
63] and runtime in this work is due to the difference in computational power of the machines employed.
Benchmarking deep learning models for marine SAR typically encompasses several key components:
Developing specialized datasets: Due to the distinct challenges of marine environments, generic datasets often fall short. Researchers have thus focused on curating and enhancing datasets tailored specifically for SAR applications.
SARDet-100K represents a major step forward by aggregating and standardizing 10 existing SAR detection datasets into a large-scale, multi-class benchmark.
SeaDronesSee is widely used in UAV-based maritime SAR research, particularly for detecting individuals in the water.
The VTSaR dataset supports aerial person detection by incorporating diverse scenes, activities, and viewpoints. It includes both visible and infrared images, as well as synthetic data.
Enhancing existing datasets, such as the Singapore Maritime Dataset (SMD), through relabeling or augmentation efforts to better align with deep learning-based marine object detection tasks.
Practical metrics: Beyond traditional measures such as mAP@0.5, benchmarks now increasingly consider hardware-specific inference speed, FLOPs, resource efficiency, false positive rates, resilience to environmental conditions such as fog and waves, and generalization.
Multimodal data: Benchmarking frameworks need to incorporate both optical (RGB and IR) and Synthetic Aperture Radar (SAR) imagery to improve model robustness and detection performance across diverse marine conditions.
These benchmarking initiatives are supported by a growing number of workshops and challenges, such as the SeaDronesSee and MaCVi challenges discussed in Section 3.1. Such events provide platforms for standardized evaluation, promote collaboration, and accelerate the translation of research into operational AI solutions for SAR missions. Overall, current benchmarking efforts in marine SAR emphasize dataset specialization, robust evaluation metrics, and model generalizability. They serve as a bridge between experimental success and the deployment of dependable AI systems that enhance real-world rescue operations.