1. Introduction
Unmanned aerial vehicles (UAVs), or drones, have rapidly become an essential tool for maritime safety, security, and search-and-rescue (SAR) operations because they can cover large sea areas quickly and at relatively low cost. Yet, from a computer vision perspective, maritime UAV scenes are among the most difficult environments: targets are tiny, the background is dynamic (waves, wakes, and reflections), and weather and lighting conditions can change dramatically within minutes. The SeaDronesSee benchmark was created precisely to address this gap. It provides over 54,000 annotated frames of humans and vessels in open water, with dedicated tracks for detection and tracking, and shows that even state-of-the-art detectors currently reach only around 36% mAP on its detection track, highlighting how far we still are from reliable automated SAR at sea [1,2].
Before deep learning, maritime video analytics relied heavily on classical motion-based methods such as background subtraction and optical flow. A comprehensive survey on video detection and tracking of maritime vessels documents how these traditional approaches struggle with sea clutter, camera motion, and changing illumination and how they often require careful tuning for each scenario. A more focused experimental study on object detection in a maritime environment systematically evaluated more than twenty background subtraction algorithms and found that none could cope robustly with the combined challenges of waves, wakes, and platform motion. Together, these works make a strong case that motion information alone is not sufficient for dependable maritime anomaly detection; robust appearance modeling is needed as well [3,4].
At the other end of the spatial scale, spaceborne optical imagery plays a crucial role in maritime domain awareness. A detailed survey on vessel detection and classification from spaceborne optical images reviews decades of research on detecting ships from satellites, across multispectral, SAR (Synthetic Aperture Radar), and high-resolution optical sensors. It shows a clear evolution from hand-crafted features to convolutional neural networks, but it also emphasizes persistent difficulties in detecting small vessels, dealing with cloud cover, and discriminating ships from look-alike coastal structures. These satellite-scale challenges mirror, at a different resolution, the small-object and clutter problems encountered in UAV imagery [5].
In parallel, the broader maritime awareness community has been integrating AI methods into operational systems. A review of AI methods for maritime awareness systems discusses how machine learning has been used for detection, classification, anomaly detection, and route prediction across heterogeneous sensors such as the AIS (Automatic Identification System), radar, and imagery. That review points out not only the promise of deep learning but also unresolved issues: data scarcity in rare but critical events, the need for multi-sensor fusion, and the challenge of making AI-based decisions interpretable to human operators. This broader context underlines that object detection from images or video is only one component of a complete maritime anomaly-detection pipeline, but it is a foundational one [6].
With the rise of deep learning, YOLO-style one-stage detectors have become a de facto standard for real-time vision in constrained platforms. In the context of UAV maritime surveillance, Cheng et al. proposed YOLOv5-ODConvNeXt, a tailored architecture for “Deep learning based efficient ship detection from drone-captured images for maritime surveillance”, which augments YOLOv5s with omni-dimensional convolution and ConvNeXt-like modules to better capture multi-scale ship features while remaining lightweight enough for drone deployment [7]. Their experiments on a dedicated drone-captured ship dataset demonstrate that such optimized architectures can significantly improve both Accuracy and inference speed over vanilla YOLOv5, showing the potential of UAV-borne detectors for operational maritime surveillance.
Similarly, YOLOv7-Sea adapts YOLOv7 specifically to maritime UAV images by introducing microscale object heads, attention mechanisms, and data-augmentation strategies tuned to sea scenes [8]. On SeaDronesSee and related datasets, YOLOv7-Sea achieves higher detection performance than the original YOLOv7, especially on small targets such as distant swimmers or small boats. This provides further evidence that domain-specific architectural enhancements can partially close the performance gap on challenging maritime UAV benchmarks.
Beyond individual platforms, there is an increasing interest in using YOLO models for large-scale traffic mapping. In “Mapping recreational marine traffic from Sentinel-2 imagery using YOLO object detection models”, Mäyrä et al. construct a ship-detection pipeline for Sentinel-2 satellite data and apply it to quantify recreational marine traffic patterns across wide coastal regions [9]. Their work illustrates how YOLO-based detectors can be deployed at scale for strategic maritime monitoring, even at moderate resolution, and provides an example of how detection outputs can feed into higher-level analyses and policy questions.
Within the search-and-rescue domain, researchers have also started to design YOLO variants specifically for SAR imagery. MES-YOLO, an “efficient lightweight maritime search and rescue object detection algorithm with improved feature fusion pyramid network,” builds on a YOLOv8 backbone but introduces a Multi-Asymptotic Feature Pyramid Network and other architectural changes tailored to small SAR targets and complex sea states [10]. Evaluations on SAR-oriented datasets show that MES-YOLO achieves higher Accuracy than standard YOLO baselines while remaining lightweight enough for UAV or edge deployment, reinforcing the trend toward task- and domain-specific YOLO variants in maritime applications.
Taken together, the existing literature indicates two complementary conclusions. First, classical motion-centric approaches highlight that temporal information is valuable but fragile in isolation under sea clutter and platform motion. Second, appearance-based deep detectors—even when optimized—still struggle with small targets and complex maritime backgrounds, and they may produce intermittent detections that degrade tracking continuity in video. This combination motivates approaches that explicitly integrate appearance cues from deep detectors with temporal cues derived from video dynamics.
In this work, we investigate a fusion-based, deployment-oriented pipeline for UAV maritime anomaly detection that combines YOLO-based appearance detection with motion/stillness assistance modules to improve video-level stability. Rather than proposing a new detector architecture, our focus is on the system-level hypothesis that complementary temporal cues can (i) reduce wave-induced false alarms, (ii) mitigate short-term detection dropouts, and (iii) support more consistent tracking in challenging UAV maritime videos. By framing the problem around video reliability and operational robustness, our study aims to contribute evidence and analysis toward more dependable maritime UAV surveillance and SAR-support systems.
2. Literature Review
Several related studies and projects on maritime surveillance and detection are summarized in Table 1.
Prior work on YOLO for maritime monitoring can be grouped into three closely related lines. First, sensor-specific studies demonstrate that deep detectors can recover small vessels under challenging conditions: radar/time–frequency representations enable detection when echoes are embedded in sea clutter, while infrared pipelines combine carefully designed appearance cues to suppress sun-glint and background noise [11,13,16,19]. Second, a large body of work focuses on engineering lighter and more robust YOLO variants for deployment on constrained maritime or edge platforms, improving multi-scale detection and small-target sensitivity through backbone redesigns and loss-function refinements [12,15,18,20,24]. Third, UAV- and aerial-platform studies show that drone-borne imagery has become a central modality for maritime observation, with many reports achieving strong frame-level detection Accuracy across diverse scenes [17,21,23].
Despite this progress, most existing methods remain detection-centric: they largely operate on single frames (or short local fluctuation measures in infrared) and output bounding boxes and class labels, with limited use of temporal reasoning. Even recent efforts addressing data scarcity (e.g., semi-supervised learning) and benchmarking across sea states typically frame the task as “ship detection” rather than anomaly-oriented monitoring [21,22]. As a result, there is a clear gap in explicitly fusing appearance detections with motion and trajectory patterns over time to infer anomaly-like behaviors—such as small agile objects, atypical approach paths, or non-cooperative craft moving inconsistently with normal traffic.
Our work addresses this gap by shifting from frame-level detection to track-level reasoning. We build on YOLO’s proven appearance modeling and introduce a lightweight fusion layer that integrates (i) YOLO detections, (ii) motion cues from background subtraction and frame-differencing, and (iii) appearance-based reidentification against stored templates. This design supports anomaly-centric decision-making in drone video, enabling the system to distinguish ambiguous small movers from benign vessels and bridging the gap between object detection outputs and higher-level maritime situation awareness.
4. Methodology
4.1. YOLOv12
In this study, we employ YOLOv12 as the primary state-of-the-art detector to mitigate false positives arising from background dynamics (e.g., wave or cloud motion). YOLO is a single-stage architecture that performs direct regression from images to bounding boxes and class probabilities in a single forward pass—without an intermediate region-proposal stage. A backbone extracts features, and a detection head predicts box coordinates, an objectness score, and class probabilities at each spatial location/anchor. This design enables real-time inference and helps maintain focus on target objects while suppressing environmental noise [27]. Specifically, YOLOv12—the latest iteration in the series, released on 18 February 2025 [28]—retains the canonical backbone → neck → head layout, emphasizes attention with compute-efficient refinements, and stabilizes deep feature aggregation via R-ELAN [29]. The overall architecture is depicted in Figure 3a,b, adapted from [30]. The key innovations of YOLOv12 can be summarized as follows:
1. Area Attention (A²) with FlashAttention [29,31,32]
YOLOv12 reshapes the feature map into spatial “areas”, applies attention locally per area, and then fuses the results, as shown in Equation (1):

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{\top}}{\sqrt{d_k}}\right)V$ (1)

$Q, K, V$: Query, key, and value matrices (tokens projected from feature maps), respectively.
$d_k$: Key dimensionality used for the scaling in attention.
$\mathrm{Attention}(\cdot)$: Scaled dot-product attention.
Instead of global attention over all $H \times W$ (spatial height and width) tokens, the feature map is partitioned into $l$ areas (number of spatial areas/windows); attention is computed per area and then fused. This reduces the dominant cost from $\mathcal{O}\!\left((HW)^{2}d\right)$ to $\mathcal{O}\!\left((HW)^{2}d/l\right)$ while preserving a large receptive field. FlashAttention further accelerates computation by tiling to minimize GPU High-Bandwidth Memory (HBM) reads/writes (IO-aware exact attention), resulting in attention usable at real-time resolutions (e.g., 640).
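For concreteness, a minimal PyTorch sketch of per-area attention is given below. It assumes single-head attention on pre-projected $Q$, $K$, $V$ token tensors and a contiguous split into num_areas areas; the FlashAttention kernels and fusion layers of the actual YOLOv12 module are omitted.

```python
import torch

def area_attention(q, k, v, num_areas=4):
    """Single-head area attention: Equation (1) applied independently per area.

    q, k, v: (B, N, d) token tensors flattened from an HxW feature map.
    """
    b, n, d = q.shape
    assert n % num_areas == 0, "token count must divide evenly into areas"
    m = n // num_areas  # tokens per area
    # Partition the N = H*W tokens into `num_areas` contiguous areas.
    q = q.reshape(b * num_areas, m, d)
    k = k.reshape(b * num_areas, m, d)
    v = v.reshape(b * num_areas, m, d)
    # Scaled dot-product attention within each area (Equation (1)).
    attn = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)
    out = attn @ v
    # Fuse the per-area outputs back into the full token sequence.
    return out.reshape(b, n, d)

# Each area attends over m = N / num_areas tokens, so the quadratic cost
# shrinks by roughly a factor of num_areas relative to global attention.
x = torch.randn(2, 64 * 64, 128)           # e.g., a 64x64 map with d = 128
y = area_attention(x, x, x, num_areas=4)   # self-attention per area
```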
2. R-ELAN: Residual Efficient Layer Aggregation Networks [29,33,34]
Because the original ELAN/GELAN stacks deep aggregation paths and can create gradient bottlenecks in attention-centric, larger backbones, R-ELAN adds block-level residuals with scaling and retools the aggregation to stabilize training and improve fusion, as shown in Equation (2) [35]:

$y = x + \alpha \cdot f(x)$ (2)

$x$: Input tensor to an R-ELAN block.
$f(\cdot)$: ELAN-style internal transform (multi-branch conv/attention + concatenation/merge inside the block).
$\alpha$: Residual scaling factor (small constant, e.g., 0.01) that stabilizes training.
$y$: Scaled residual connection output of the block.
With this design, the model is easier to optimize (especially at larger scales), has fewer parameters than naively deep stacks, and achieves better feature reuse/fusion in the backbone.
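As an illustration, a minimal PyTorch sketch of the block-level scaled residual in Equation (2) follows; the internal transform is an illustrative stand-in, not YOLOv12's exact multi-branch aggregation.

```python
import torch
import torch.nn as nn

class RELANBlock(nn.Module):
    """Block-level scaled residual of Equation (2): y = x + alpha * f(x)."""

    def __init__(self, channels: int, alpha: float = 0.01):
        super().__init__()
        self.alpha = alpha  # residual scaling factor (Equation (2))
        # Illustrative stand-in for the ELAN-style internal transform f(x);
        # the real block aggregates multiple conv/attention branches.
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The small alpha keeps the residual branch from destabilizing
        # training in deep, attention-centric backbones.
        return x + self.alpha * self.transform(x)
```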
4.2. Motion Assistance
Following YOLOv12-based detection, we maintain confirmed anomaly tracks during temporary detector dropouts by incorporating motion cues. Specifically, we apply background subtraction and frame-differencing to extract motion blobs and then associate these blobs with existing tracks using distance, IoU, and path-tortuosity gating. This fusion allows tracks to persist through brief appearance changes, pose/shape deformations, or partial occlusions that may otherwise cause missed detections and premature track termination.
1. Background subtraction (GMM-based)
This module builds on the pixel-wise Gaussian Mixture Model (GMM) for background subtraction introduced by [36], which models each pixel's recent history as a mixture of Gaussians and classifies a pixel as foreground when the current sample does not match the dominant “background” components, as shown in Equation (3). Subsequently, refs. [37,38] proposed adaptive, recursive updates that (i) automatically select the number of Gaussians per pixel and (ii) update the mixture parameters online, improving robustness to illumination changes and scene dynamics; these works form the basis of OpenCV's MOG2 implementation [39,40].

$P(X_t) = \sum_{k=1}^{K} w_{k,t}\,\mathcal{N}\!\left(X_t;\, \mu_{k,t},\, \Sigma_{k,t}\right)$ (3)

$X_t$: Pixel value at location $(x, y)$ in frame $t$ (intensity or 3-vector).
$K$: Number of Gaussian components in the pixel's mixture model.
$w_{k,t}$: Weight of mixture component $k$ at time $t$.
$\mu_{k,t}$: Mean (expected pixel value) of component $k$ at time $t$.
$\Sigma_{k,t}$: Covariance (often diagonal/scalar variance) of component $k$ at time $t$.
$\mathcal{N}(\cdot)$: Gaussian density.
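A minimal sketch of this cue using OpenCV's MOG2 implementation is given below; the history and variance-threshold values and the input path are illustrative rather than the tuned settings of our deployed pipeline.

```python
import cv2

# MOG2 maintains the per-pixel mixture of Equation (3) and updates it online;
# history and varThreshold are illustrative defaults, not our tuned settings.
subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500, varThreshold=16, detectShadows=True)

cap = cv2.VideoCapture("maritime_clip.mp4")  # hypothetical input clip
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)     # per-pixel foreground/background test
    fg_mask = cv2.medianBlur(fg_mask, 5)  # suppress wave/speckle noise
    # Connected components of fg_mask become candidate motion blobs, which
    # are then gated against existing tracks (distance, IoU, tortuosity).
cap.release()
```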
2. Frame differencing [41,42]
We compute a lightweight motion mask by thresholding the absolute inter-frame grayscale difference, as shown in Equation (4):

$M_t(x, y) = \mathbb{1}\!\left(\left|I_t(x, y) - I_{t-1}(x, y)\right| > \tau\right)$ (4)

$M_t(x, y)$: Binary motion mask for frame $t$ (1 = moving, 0 = static).
$I_t(x, y)$: Grayscale intensity at pixel $(x, y)$ of frame $t$.
$\tau$: Gray-level threshold for differencing (pixels with change above this are motion).
$\mathbb{1}(\cdot)$: An indicator function.
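The corresponding OpenCV sketch is equally short; the threshold value here is illustrative.

```python
import cv2

def motion_mask(prev_gray, cur_gray, tau=25):
    """Binary motion mask of Equation (4); tau = 25 is an illustrative value."""
    diff = cv2.absdiff(cur_gray, prev_gray)                  # |I_t - I_{t-1}|
    _, mask = cv2.threshold(diff, tau, 255, cv2.THRESH_BINARY)
    return mask  # 255 = moving, 0 = static (scaled form of M_t)
```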
Figure 3. (a) Architectural structure of YOLOv12. (b) R-ELAN architecture. The illustrations were inspired by [30].
4.3. Stillness (Appearance) Assistance
In cases where detected anomalies lose association with YOLO and become sufficiently slow or stationary such that motion-based support is ineffective, we revert to appearance-based reidentification against a stored template. We evaluate up to three appearance cues and consider the object present if any cue satisfies the acceptance criterion.
1. Template matching via normalized cross-correlation (NCC) [43,44]
This can capture structural similarity under linear brightness changes:

$\mathrm{NCC}(T, R) = \dfrac{\sum_{x,y}\left(T(x,y) - \bar{T}\right)\left(R(x,y) - \bar{R}\right)}{\sqrt{\sum_{x,y}\left(T(x,y) - \bar{T}\right)^{2}\,\sum_{x,y}\left(R(x,y) - \bar{R}\right)^{2}}}$

$T$: Stored grayscale template patch of the target (fixed size).
$R$: Current grayscale ROI resized to the same size as $T$.
$\bar{T}, \bar{R}$: Mean intensities of template and ROI, respectively.
$\mathrm{NCC}(T, R)$: Normalized cross-correlation score.
This cue is accepted if $\mathrm{NCC}(T, R) \geq \tau_{\mathrm{NCC}}$, where $\tau_{\mathrm{NCC}}$ is the acceptance threshold for NCC (e.g., 0.72).
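A minimal OpenCV sketch of this cue follows; cv2.TM_CCOEFF_NORMED evaluates the mean-subtracted normalized correlation above, and the acceptance threshold follows the example value in the text.

```python
import cv2

def ncc_accept(template_gray, roi_gray, tau_ncc=0.72):
    """Accept if the NCC score between template and (resized) ROI >= tau_ncc."""
    roi = cv2.resize(roi_gray, (template_gray.shape[1], template_gray.shape[0]))
    # TM_CCOEFF_NORMED on equal-size inputs yields a single NCC score.
    score = float(cv2.matchTemplate(roi, template_gray,
                                    cv2.TM_CCOEFF_NORMED)[0, 0])
    return score >= tau_ncc, score
```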
2. Oriented FAST and Rotated BRIEF (ORB) keypoint matching (binary features + Hamming distance) [45,46]
This is rotation-aware, robust to modest viewpoint/illumination changes, and very fast. The cue is accepted if $r \geq \tau_{\mathrm{ORB}}$, where
ORB keypoints: Salient points detected in $T$ and $R$.
Binary descriptors: Bitstrings describing patches around keypoints.
Hamming distance: Number of differing bits between two descriptors.
$r$: Match quality (the fraction of keypoints with a good, low-Hamming-distance match).
$\tau_{\mathrm{ORB}}$: Minimal ratio to accept (e.g., 0.20).
This is a practical match-coverage heuristic akin to the inlier ratio used in RANSAC-style geometric verification [47,48].
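A minimal OpenCV sketch of this cue is given below; the keypoint budget and the 0.75 Lowe-style ratio test are illustrative choices, while the acceptance threshold follows the example value above.

```python
import cv2

def orb_accept(template_gray, roi_gray, tau_orb=0.20):
    """Accept if the match-coverage ratio r >= tau_orb."""
    orb = cv2.ORB_create(nfeatures=500)  # nfeatures is illustrative
    kp_t, des_t = orb.detectAndCompute(template_gray, None)
    kp_r, des_r = orb.detectAndCompute(roi_gray, None)
    if des_t is None or des_r is None:
        return False, 0.0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)  # Hamming distance on bitstrings
    pairs = matcher.knnMatch(des_t, des_r, k=2)
    # Lowe-style ratio test (0.75 is a common, illustrative value).
    good = [m for m, n in (p for p in pairs if len(p) == 2)
            if m.distance < 0.75 * n.distance]
    r = len(good) / max(1, min(len(kp_t), len(kp_r)))  # match coverage
    return r >= tau_orb, r
```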
3. Hue, saturation, and value (HSV) histogram correlation (color cue) [49,50]
We compute 2D HSV histograms, $H_T$ and $H_R$, for the template and ROI. This is helpful, as it retains targets with a consistent color distribution despite small pose/scale changes when texture is weak. We accept the correlation metric if $\rho(H_T, H_R) \geq \tau_{\mathrm{hist}}$, where

$\rho(H_T, H_R) = \dfrac{\sum_{i}\left(H_T(i) - \bar{H}_T\right)\left(H_R(i) - \bar{H}_R\right)}{\sqrt{\sum_{i}\left(H_T(i) - \bar{H}_T\right)^{2}\,\sum_{i}\left(H_R(i) - \bar{H}_R\right)^{2}}}$

$H_T, H_R$: 2D color histograms for template and ROI.
$\bar{H}_T, \bar{H}_R$: Mean bin values of each histogram.
$\rho(H_T, H_R)$: Pearson correlation between histograms (−1 to 1).
$\tau_{\mathrm{hist}}$: Minimal histogram correlation to accept (e.g., 0.90).
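A minimal OpenCV sketch of this color cue follows; the H–S bin layout is illustrative, and cv2.HISTCMP_CORREL computes the Pearson correlation defined above.

```python
import cv2

def hsv_hist(bgr):
    """2D hue-saturation histogram; the 30x32 bin layout is illustrative."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [30, 32], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def color_accept(template_bgr, roi_bgr, tau_hist=0.90):
    """Accept if the Pearson correlation between histograms >= tau_hist."""
    rho = cv2.compareHist(hsv_hist(template_bgr), hsv_hist(roi_bgr),
                          cv2.HISTCMP_CORREL)
    return rho >= tau_hist, rho
```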
We next integrate the foregoing components into a unified maritime-anomaly surveillance workflow (Figure 4). Input video frames of the maritime scene are processed by our trained YOLOv12 detector, which classifies objects (e.g., sky, ships, towers) and flags instances of the anomaly class. Because operational data on genuine maritime anomalies (e.g., unlawful vessels, remotely operated craft) are restricted, we designate windsurf targets as a proxy anomaly: they are small, visually distinct from transport ships, and exhibit irregular trajectories suitable for stress-testing detection and tracking. Objects classified as non-anomalies are tracked until exit, and their metadata are logged by class. For anomalies, we maintain track continuity using two complementary modules: (i) motion assistance (M), which exploits background subtraction and frame-differencing to associate motion blobs when YOLO momentarily misses the target, and (ii) stillness assistance (S), which applies appearance-based reidentification against stored templates when the target is slow or stationary and motion cues are unreliable. The system switches between M and S as object dynamics change, while YOLO re-acquires detections whenever possible. The pipeline runs to completion over each video, producing an annotated output video and CSV logs for metrics and audit trails.
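In sketch form, the per-frame control flow can be summarized as follows; the injected callables (detector, associate, motion_assist, stillness_assist) are hypothetical stand-ins for the modules described in Sections 4.1, 4.2 and 4.3.

```python
def process_frame(frame, tracks, detector, associate,
                  motion_assist, stillness_assist):
    """One step of the fusion loop: YOLO first, then M, then S."""
    detections = detector(frame)                        # YOLOv12 appearance cues
    matched, unmatched = associate(tracks, detections)  # distance/IoU gating
    for track in unmatched:                             # YOLO dropout this frame
        if not track.is_anomaly:
            track.miss_count += 1                       # non-anomalies just age out
        elif motion_assist(frame, track):               # M: motion-blob association
            track.supported_by = "motion"
        elif stillness_assist(frame, track):            # S: template re-identification
            track.supported_by = "stillness"
        else:
            track.miss_count += 1                       # terminate after patience
    return matched + unmatched
```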
4.4. Performance Metrics
When training YOLOv12 to recognize target classes (e.g., ships, windsurfs, land; see Table 2, Section 3), rigorous evaluation criteria are required to determine deployment readiness. For object detection, three core metrics—Precision (P), Recall (R), and mean Average Precision (mAP), with particular emphasis on mAP50—are standard and form the basis of our model assessment [51,52]. We first define the requisite terms as follows: let a binary detector produce predicted labels $\hat{y}_i \in \{0, 1\}$ for ground truth $y_i \in \{0, 1\}$. Then,
TP (true positives): Predicted $\hat{y}_i = 1$ and $y_i = 1$.
FP (false positives): Predicted $\hat{y}_i = 1$ and $y_i = 0$.
TN (true negatives): Predicted $\hat{y}_i = 0$ and $y_i = 0$.
FN (false negatives): Predicted $\hat{y}_i = 0$ and $y_i = 1$.
1. Precision (P)
Fraction of predicted positives that are actually positive:

$P = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$

High Precision means few false alarms (low FP).
2. Recall (R)
Fraction of actual positives that are detected:

$R = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$

High Recall means few misses (low FN).
3. Average Precision (AP) and Mean Average Precision (mAP) [53,54,55]
We first introduce Intersection over Union (IoU), as it underpins the subsequent detection metrics.
Intersection over Union (IoU)
For object detection, a predicted box, $B_p$, matches a ground-truth box, $B_g$, if

$\mathrm{IoU}(B_p, B_g) = \dfrac{\left|B_p \cap B_g\right|}{\left|B_p \cup B_g\right|} \geq \tau,$

where $\tau$ is an IoU threshold (e.g., 0.50).
Average Precision (AP) is the area under the Precision–Recall curve (with standard interpolation) [53,56,57]:

$\mathrm{AP} = \sum_{k}\left(r_k - r_{k-1}\right) p_{\mathrm{interp}}(r_k), \qquad p_{\mathrm{interp}}(r) = \max_{r' \geq r} p(r')$

$r_k$: Recall level on the Precision–Recall (PR) curve; $r_0 = 0$.
$p(r_k)$: Precision measured at Recall level $r_k$, after sorting detections by confidence.
$p_{\mathrm{interp}}(r)$: Interpolated Precision (ensures a non-increasing PR curve for integration).
$k$: An index over the Recall steps.
Mean Average Precision (mAP) is the mean of AP over all classes. In modern practice, mAP@0.50 (mAP50) is AP computed at a single IoU threshold, $\tau = 0.50$, and then averaged across classes (VOC-style).
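A minimal NumPy sketch of VOC-style interpolated AP is given below, assuming the recall/precision pairs have already been accumulated in descending confidence order.

```python
import numpy as np

def average_precision(recalls, precisions):
    """VOC-style interpolated AP from recall/precision arrays."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Enforce p_interp(r) = max_{r' >= r} p(r') by a backward running max.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum (r_k - r_{k-1}) * p_interp(r_k) over the recall steps.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP50 = mean of average_precision over classes at IoU threshold 0.50.
```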
Following offline evaluation using Precision, Recall, and mAP, we assess deployed performance on full-length videos using Accuracy (in place of mAP). Specifically, we compute object-level Accuracy against human-annotated ground-truth anomalies to quantify the proportion of correctly identified anomalies relative to all annotated instances, providing an overall measure of end-to-end detection effectiveness in operational conditions.
4. Accuracy
Overall proportion of correct predictions, useful as a general score:

$\mathrm{Accuracy} = \dfrac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FP} + \mathrm{TN} + \mathrm{FN}}$
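For completeness, these scalar metrics reduce to a few lines (a minimal sketch with zero-division guards):

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0
```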
5. Experimental Results and Discussion
We trained our model, which is based on the YOLOv12 structure, with the datasets described in Section 3. The training of the model was performed on an NVIDIA GeForce RTX 3050 Laptop GPU with 32 GB of DDR4 memory. We selected YOLOv12n based on a performance–efficiency trade-off observed in our preliminary benchmarking. Specifically, we trained and evaluated several YOLO-family detectors available in Ultralytics [58] for 500 epochs on our dataset. Among the tested variants, YOLOv12 consistently achieved the strongest overall performance while maintaining a moderate parameter count, which kept GPU memory usage and training time within our computational constraints (Table 3). In contrast, transformer-based or two-stage detectors such as RT-DETR, DETR, and Faster R-CNN [23] typically contain >30 million parameters, resulting in substantially higher training cost under the same hardware conditions. For instance, in our RT-DETR trial (≈33 M parameters), a single training epoch required approximately 7 min, implying roughly 58 h for a full 500-epoch run. This is markedly longer than the training time of the lightweight YOLO models evaluated in our study, which required approximately 1 h per model for the complete training schedule. Given its competitive detection Accuracy and substantially lower computational burden, YOLOv12n was, therefore, chosen as the base detector for integration into our maritime anomaly-detection fusion pipeline.
In Table 4, regarding the YOLOv12 test set's performance, the model attains Precision = 98.27%, Recall = 96.76%, and mAP50 = 98.77% for overall detection among all classes, indicating strong detection quality. For the designated anomaly class (windsurfs), Precision = 100% and Recall = 91.20%. This reflects the typical Precision–Recall trade-off: while false positives for windsurfs are effectively eliminated (100% Precision), a small fraction of true windsurf instances are missed (its Recall < overall Recall). A Recall of 91.20% implies that, on average, approximately 9 of 10 windsurf instances are detected.
To evaluate the robustness and stability of the trained YOLOv12 model, we generated five randomized train/validation/test partitions of our dataset (327 images) using a 70:20:10 split ratio, consistent with the protocol in Table 4. As summarized in Table 5, the mean test set Precision, Recall, and mAP50 across the five split versions all exceed 90%, and the results follow the same overall trend reported in Table 4. Moreover, the variability across splits is low, with standard deviations below 1%. Collectively, these findings suggest that, despite the limited dataset size, the model exhibits stable generalization performance and shows no clear evidence of overfitting or underfitting under the evaluated split conditions.
We then deployed and evaluated the trained model on 19 maritime videos captured near Bayshore MRT Station (TE29), Singapore, by a DJI Mini 2 drone, using the parameter settings listed in Table 6. Across videos, the mean Precision is 92.89% with a standard deviation (SD) of 13.31%, as shown in Figure 5a, and the mean Recall is 90.44% (SD 15.24%), as shown in Figure 5b, evidencing variability consistent with changing environmental conditions (e.g., sea state, lighting, and scale). Despite this variance, the end-to-end Accuracy averages 98.50% (SD 2.00%), as shown in Figure 5c, suggesting robust operational performance.
When the trained model was deployed on the drone-based maritime footage, YOLOv12 served as the primary detector for identifying anomalies (windsurfs). Using its learned weight parameters, YOLOv12 first localized objects in the scene and assigned class labels; instances classified as windsurfs were treated as anomalies for the purpose of this study, as illustrated in Figure 6, which shows three detected anomalies (windsurfs). When a YOLOv12-detected anomaly continued to move but the detector temporarily lost track of it, the system activated the motion assistance module. This module relied on pixel- and frame-differencing to estimate positional changes and maintain the track until YOLOv12 re-acquired the target, as shown in Figure 7, where two anomalies are detected by YOLOv12 and one additional anomaly is maintained solely by motion assistance as a continuation of a detection from Figure 6.
In scenarios where detected anomalies gradually slowed down and became nearly stationary—leading YOLOv12 to lose track—the system first attempted to rely on motion assistance. If motion cues were insufficient, the stillness (appearance) assistance module was engaged. This module preserved the track by comparing the visual similarity of candidate regions to the last YOLOv12 detection, thereby bridging periods without reliable motion or YOLOv12 outputs.
Figure 8 demonstrates this behavior, with three anomalies detected by YOLOv12 and one maintained by stillness (appearance) assistance. Finally, all three detection and tracking components—YOLOv12, motion assistance, and stillness (appearance) assistance—operate jointly to minimize missed detections and improve overall tracking Accuracy. This combined behavior is illustrated in Figure 9, where three anomalies are detected by YOLOv12 (including one that was previously maintained only by stillness assistance in Figure 8 and has since been re-acquired by YOLOv12), one anomaly is maintained by motion assistance (previously detected by YOLOv12 in Figure 8), and one additional anomaly is tracked exclusively by stillness (appearance) assistance, which was not detected in the earlier frame shown in Figure 8.
As summarized in Table 7, both auxiliary modules—motion assistance and stillness (appearance) assistance—provide measurable support to the baseline YOLOv12 detector during video processing. In particular, among the 19 evaluated videos, 14 sequences contained at least one interval in which the motion assistance module was activated to sustain continuity for a YOLOv12-detected real anomaly. When counted at the event level, this corresponds to 20 of 60 true-positive anomaly instances (33.33%) benefiting from motion assistance at least once, primarily by reducing short-term tracking interruptions and helping the system recover from transient detection dropouts.
Similarly, the stillness (appearance) assistance module was triggered less frequently but still contributed to detection stability. Specifically, 4 of 19 videos (or 6 of 60 true-positive instances, 10.00%) relied on stillness assistance to preserve track continuity. This pattern is consistent with the intended design: appearance-based support is most beneficial under conditions where motion cues are weak or ambiguous (e.g., low relative motion, brief stationary behavior, or subtle object displacement), whereas motion assistance is more broadly applicable in dynamic scenes.
Although the absolute proportions are not dominant, these results are important for two reasons. First, they indicate that the proposed fusion pipeline provides robustness gains rather than headline improvements in detector Accuracy. In practical maritime surveillance, operational failures can arise not only from persistent misdetections but also from intermittent instability—for example, momentary misses due to waves, glare, compression artifacts, partial occlusion, or rapid viewpoint changes. Even occasional activation of assistance modules can, therefore, reduce fragmented tracks, stabilize temporal reasoning, and improve downstream analytics (e.g., duration, trajectory, and event-level reporting).
Second, the observed contribution supports the role of our fusion approach as a supplementary reliability layer rather than a competing detection architecture. The assistance modules are designed to “bridge” short detection gaps and prevent track loss, enabling YOLOv12 to operate more consistently when deployed in real-world maritime environments where appearance and motion conditions vary substantially across time and locations. Consequently, the fusion model primarily enhances deployment readiness—improving continuity and operational stability—rather than replacing the underlying detector. Additional details of the ablation study are provided in Table A1 in Appendix A.
Taken together, the results support that the integrated pipeline—YOLOv12 with motion assistance and stillness (appearance) assistance—performs reliably on real maritime footage, with high overall Accuracy and acceptable variability across scenes.
Generalization
Given the relatively limited size of our in-house dataset, we augmented it by integrating the drone-based maritime AFO dataset [59], which was also adopted in [23]. This dataset includes a “wind/sup-board” category that is closely aligned with our “windsurf” class, enabling a more consistent representation of the target anomaly category across sources. In addition, the AFO dataset provides finer-grained maritime object labels—such as humans, boats, buoys, sailboats, and kayaks—whereas our original annotations primarily consolidated non-target vessels into a single “ship” category, reflecting the generally larger object scale and reduced class diversity observed in our footage. We merged the two datasets and repartitioned the combined corpus into training, validation, and test sets using a 60:10:30 split, as shown in Table 8. The relatively large test proportion was selected intentionally to evaluate model robustness and generalization under a more diverse data distribution and a stricter held-out evaluation setting. The corresponding performance results on this combined dataset are reported in Table 9.
After training on the merged (our dataset + AFO) corpus, we evaluated the resulting model on an external set of drone-recorded maritime videos from a surveillance-and-rescue dataset [61], which was also used in [23]. This MOBDrone dataset was selected as a test case because its label taxonomy aligns well with the class definitions in our combined training set. Specifically, it includes person (corresponding to the human class), boat, wood, and life buoys, as well as surfboards, which serve as the closest counterpart to our windsurf anomaly category. The example results of this deployment study are presented in Figure 10.
On the tested device (our laptop), the system did not achieve full real-time operation on a 30 FPS input stream under the serialized multi-module pipeline. For example, on the 16.27 s clip (488 frames, 30.00 FPS) in Figure 10, the system achieved 5.44 FPS (0.18× real-time), with a mean latency of 179.93 ms/frame (p95: 203.03 ms). Runtime profiling shows that the overhead is distributed across modules, with YOLO inference at 71.05 ms/frame and motion assistance at 73.08 ms/frame, and the remaining 35.80 ms/frame attributed to tracking/visualization/I/O. Power telemetry (NVIDIA nvidia-smi) reported a mean run-phase power of 11.59 W (run-phase energy 0.2894 Wh), against an idle baseline of 10.18 W for this run. Based on the measured throughput, a practical real-time deployment on the same device can be achieved via frame skipping, with an empirically recommended processing rate of approximately 4.8 FPS (proc_fps ≈ 4–6 for 30 FPS inputs), which preserves continuous monitoring while keeping computation bounded. If full-rate processing closer to the input FPS is required, the same serialized architecture can be deployed on an edge platform with higher GPU compute and memory bandwidth (or hardware-accelerated decoding), which is expected to raise end-to-end throughput and better support continuous real-time operation in practical engineering settings.
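As an illustration, the frame-skipping mode amounts to choosing a stride from the source and target rates (a minimal sketch; the capture source is illustrative, and the 4.8 FPS target follows the recommendation above):

```python
import cv2

def run_skipped(path, proc_fps=4.8):
    """Process roughly `proc_fps` frames per second of the input stream."""
    cap = cv2.VideoCapture(path)                # illustrative capture source
    src_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    stride = max(1, round(src_fps / proc_fps))  # 30 / 4.8 -> every 6th frame
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            pass  # run the YOLOv12 + assistance pipeline on this frame
        idx += 1
    cap.release()
```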
Overall, the cross-dataset evaluation across training, testing, and deployment suggests that the proposed YOLO-fusion model is well suited as a supplementary module rather than a replacement detector. In particular, it can be integrated with larger and more diverse datasets and combined with existing maritime object-detection pipelines that primarily optimize per-frame Accuracy, providing additional benefits for video-based operation by improving tracking continuity and reducing intermittent detection loss during deployment.
6. Conclusions
Maritime surveillance and security are critical for maintaining safe and efficient sea logistics, port operations, and coastal activities in the presence of anomalies such as unlawful maritime activities, security-related incidents, and anomalous events (e.g., extreme waves or aggressive marine wildlife). In this study, we proposed a hybrid drone-based maritime monitoring framework that integrates YOLOv12 with a motion assistance module for moving targets and a stillness (appearance) assistance module for slow or stationary targets. The system was trained on a custom dataset collected using a DJI Mini 2 drone around the port area near Bayshore MRT Station (TE29), Singapore, where windsurfers were used as proxy (dummy) anomalies due to the unavailability of real anomaly footage, which is restricted for security reasons.
The experimental results show that the trained model achieves more than 90% Precision, Recall, and mAP50 across all classes on the detection test set. When deployed on real maritime video sequences, the pipeline attains a mean Precision of 92.89% (SD 13.31%), a mean Recall of 90.44% (SD 15.24%), and a mean Accuracy of 98.50% (SD 2.00%). These results suggest that the proposed framework is robust and shows strong potential for real-world maritime anomaly detection, despite some variation in performance across scenarios.
However, this work also has several limitations that open avenues for further research. First, the use of windsurfers as proxy anomalies should be replaced or complemented with genuine anomaly footage obtained from relevant authorities to better reflect operational conditions and anomaly behavior. Second, the current dataset is limited to a single geographic area and sensor platform; future studies should evaluate the model across multiple ports, environmental conditions (e.g., adverse weather, nighttime), and different UAV platforms to assess generalization. Third, the framework could be extended to incorporate additional sensors (such as thermal cameras or radar), more diverse anomaly classes, and advanced temporal or multi-object tracking algorithms to further enhance robustness under heavy clutter and occlusion. Finally, real-time deployment tests, human-in-the-loop evaluation, and integration with existing maritime command-and-control systems would be valuable steps toward transitioning this proof-of-concept into an operational maritime and coastal security tool.