Article

Glare-Aware Resi-YOLO: Tiny-Vessel Detection with Dual-Brain Edge Deployment for Maritime UAVs

Department of Computer Science and Information Engineering, Chang Jung Christian University, Tainan City 711, Taiwan
*
Author to whom correspondence should be addressed.
Drones 2026, 10(3), 226; https://doi.org/10.3390/drones10030226
Submission received: 9 February 2026 / Revised: 13 March 2026 / Accepted: 17 March 2026 / Published: 23 March 2026

Highlights

What are the main findings?
  • Resi-YOLO raises APsmall by 13.1 percentage points over YOLOv8n on the high-glare test split.
  • On Jetson Orin Nano, the deployed pipeline runs at 12.8 FPS end-to-end, while TensorRT inference exceeds 30 FPS.
What are the implications of the main findings?
  • Robust tiny-vessel perception can be executed onboard maritime UAVs without cloud dependence.
  • Glare Severity Score (GSS)-stratified evaluation and the dual-brain design offer a practical blueprint for safety-oriented deployment under link variability.

Abstract

Maritime UAV perception must reliably detect and track tiny vessels under harsh specular glare. In practice, detection failures are dominated by two coupled factors: (i) vessels often occupy only a few pixels, causing small-object recall collapse, and (ii) sun glint and sea-surface reflections generate over-exposed regions that trigger false positives and unstable associations. This paper presents Resi-YOLO, a system-level pipeline that improves tiny-vessel sensitivity while preserving embedded throughput on a Jetson Orin Nano. At the model level, Resi-YOLO combines a P2-enhanced feature path with CBAM-based glare suppression to strengthen high-resolution semantics and suppress glare-induced artifacts; optional SAHI-style slicing is supported for ultra-high-resolution scenes. At the system level, we adopt a heterogeneous dual-brain deployment, where the Orin Nano performs primary inference and an MCU-based safety-island tracker mitigates delay/jitter via time-stamped measurement replay and IMM-UKF updates. We further define a Glare Severity Score (GSS) to stratify robustness by illumination intensity. Experiments show that Resi-YOLO improves APsmall by 13.1 percentage points over YOLOv8n (18.4% to 31.5%), raises high-glare mAP@0.5 from 41.2% to 53.7%, and runs at 12.8 FPS end-to-end (~100 ms latency) on Jetson Orin Nano, while TensorRT inference-only throughput exceeds 30 FPS.

Graphical Abstract

1. Introduction

1.1. Background and Motivation

Maritime monitoring supports search and rescue (SAR), illegal-fishing enforcement, and coastal security. UAVs offer faster deployment and more flexible sensing than satellites or manned aircraft. However, maritime UAV video exhibits unique challenges: sun glint, haze, sea spray, and dynamic backgrounds (whitecaps and wakes) that limit the transferability of models trained on terrestrial benchmarks. FPV drones additionally introduce high-mobility threats, demanding robust detection and tracking.

1.2. Key Challenges

We address: (i) tiny targets (<32 × 32 pixels after resizing); (ii) glare and brightness clutter mimicking vessel features; (iii) a deployment gap (high-accuracy models demand GPU resources that are often unsustainable for edge platforms); (iv) asynchronous sensor delays and non-linear motions in tracking; (v) SWaP (Size, Weight, and Power) constraints for long-endurance operations.
Marine-Engineering Perspective: Unlike land-based UAV vision benchmarks, maritime operations tightly couple vision-based perception with time-varying over-water communication links and mission safety. Over-the-sea channels can exhibit rapid two-/three-ray fading and evaporation-duct effects, intermittently collapsing throughput and increasing packet loss and jitter [1]. Recent maritime computer-vision workshops further consolidate benchmarks, evaluation protocols, and open challenges for over-water detection and tracking, providing a common ground for comparing system-level robustness under real deployment constraints [2]. When a high-rate video stream is transported via RTSP/RTP, buffering and retransmission behaviors (especially over TCP) can induce a latency cliff—or even a deceptively smooth but seconds-delayed feed (bufferbloat)—breaking the perception–control loop [3,4,5]. Therefore, we formulate Resi-YOLO (Resilient YOLO) as an integrated perception and safety subsystem: glare-aware attention and P2-enhanced features improve tiny-vessel observability, while an MCU safety island and time-stamped measurement replay (TSMR) maintain bounded-latency decision-making and enable explicit robustness certification under delay/jitter/dropout conditions.
To quantify network-induced instability, we measured the deployed streaming pipeline under LIVE-RTSP, LAG-50 ms, and JITTER-20 ms (+drop) conditions. Under the nominal wired-RTSP setting, network transfer latency averaged about 5 ms (p95 ≈ 10 ms), while the fully deployed pipeline measured an approximately 90 ms mean and 130 ms p95 end-to-end latency. Under controlled delay/jitter injection, conventional tracking-by-detection pipelines showed clear degradation in temporal alignment and identity continuity, whereas the MCU safety island with TSMR preserved more stable tracking outcomes. We therefore use the term latency cliff to denote the experimentally observed transition from bounded-latency operation to tracking-unsafe delay inflation, rather than a purely narrative link-level effect.

1.3. Contributions

  • Core Detector Architecture: A YOLO11n-based detector is augmented with a P2 detection head, CBAM attention, and NWD loss to improve tiny-vessel recall and glare robustness under maritime sea clutter. Optional modules including SAHI slicing and RGF/BLSF geometric filtering are evaluated separately as deployment-specific extensions.
  • MCU Safety-Island Dual-Brain Architecture: A heterogeneous perception–navigation pipeline is implemented where Jetson Orin Nano (NVIDIA, Santa Clara, CA, USA) performs deep perception while an MCU safety island maintains deterministic tracking and minimal navigation cues (e.g., IMM-UKF state estimation and lightweight planning). This design explicitly targets marine engineering constraints such as intermittent links, abrupt exposure changes, and GPU workload spikes by decoupling high-throughput, non-deterministic vision processing from safety-critical control loops.
  • System-Level Implementation: A waterproof UAV platform (Pixhawk-class autopilot + stabilized 4K gimbal camera), TensorRT deployment on Jetson Orin Nano, and a latency-aware data bus (e.g., RTSP/MQTT/WebSocket) for edge-to-ground integration are implemented.
  • Evaluation Blueprint for Marine Operations: Stratified evaluation across glare severity (GSS), target-size bins, and video I/O impairments (delay/jitter/dropout) is carried out, accompanied by reproducibility templates, command/parameter logs, and deployment checklists.
As illustrated in Figure 1, maritime UAV detection is simultaneously challenged by tiny targets, sun glint/whitecaps, and motion blur, which can severely degrade both localization and confidence estimation. The following contributions are designed to address these coupled failure modes from the model, system, and evaluation perspectives.

2. Related Work

2.1. Vision-Based Maritime UAV Perception Under Sea Glare

Maritime drone benchmarks like SeaDronesSee [6,7] highlight that sun glint and sea clutter create unique failure modes. While lightweight detectors such as YOLOv7-sea [8], YOLOv4 [9], and LCSC-UAVNet [10] demonstrate favorable accuracy-efficiency trade-offs, dedicated sun-glint detection and removal methods have also been proposed for oceanic drone RGB imagery [11]. However, our work shifts the focus toward a system-level, dual-brain architecture that preserves tracking continuity even during GPU or communication failures.

2.2. Attention and Loss Mechanisms for Clutter Suppression

Modern object-detection frameworks are commonly categorized into two-stage and one-stage detectors. Two-stage methods such as the Faster R-CNN first generate region proposals and then perform classification and bounding-box regression [12]. Detecting tiny vessels requires specialized feature extraction and clutter rejection. Rusyn et al. [13] showed that multi-threshold binarization can provide efficient feature extraction in remote-sensing imagery, highlighting the value of lightweight preprocessing under resource-constrained deployment.
Canonical designs like FPN [14] and BiFPN [7] improve multi-scale recall, while CBAM attention [15] suppresses specular highlights. Localization sensitivity is further addressed via NWD loss [16] and SAHI slicing [17]. Recent UAV- and maritime-oriented tiny-object detectors include S3Det [18], Succulent-YOLO [19], CRAB-YOLO [12], and Rose-YOLO [20]. These methods improve small-target sensitivity through augmentation, reconstruction, or restoration-oriented designs but often introduce additional computational overhead. In contrast, Resi-YOLO targets latency-bounded maritime edge deployment through lightweight feature-level enhancements (P2, CBAM, and NWD) and a heterogeneous dual-brain design.

2.3. Reliability-Aware Perception and Geometric Filtering in Maritime Vision

Beyond appearance-based detection, reliability-aware perception mitigates visual artifacts in challenging environments. In stereo-based maritime vision, depth reliability maps from SGBM [21] identify unstable reflections to enable confidence-guided filtering. Similarly, Binary Line Segment Filtering (BLSF) [22] suppresses structured clutter like wave crests by penalizing detections with linear patterns. Resi-YOLO integrates these reliability and geometric cues to enhance robustness under glare-dominated conditions.

2.4. Heterogeneous Architectures and Edge-Cloud Systems

Standard tracking pipelines (e.g., BoT-SORT, ByteTrack, and SORT) [23,24,25] and metrics like HOTA [24,26] provide the foundation for maritime MOT [27,28]. However, to meet marine engineering reliability standards, we introduce an MCU safety island. This maintains deterministic tracking continuity during video I/O impairments (delay/jitter/dropout) that typically degrade standard edge-AI branches.

3. Proposed Method: Enhanced Resi-YOLO with Dual-Brain Integration

3.1. Overview

The Resi-YOLO framework represents a holistic approach to maritime perception, bridging the gap between high-complexity deep learning models and the deterministic requirements of flight safety. Our methodology is structured into three core pillars: (i) model-level architectural enhancements for tiny-target recall and glare suppression, (ii) system-level heterogeneous integration for latency decoupling via a dual-brain paradigm, and (iii) reliability-guided strategies to mitigate optical artifacts. This section details the theoretical formulation of these components and explains how they are synthesized into a fail-operational UAV perception pipeline designed for high-glare maritime environments.

Framework Composition and Dual-Brain Partition

Resi-YOLO augments YOLO11n with a P2 detection head, CBAM attention, NWD loss, and optional SAHI slicing. Under the proposed dual-brain paradigm, illustrated in Figure 2, the Jetson branch executes high-rate visual inference on the edge-AI node, while an MCU-based safety island preserves deterministic tracking and conservative navigation cues under degraded latency or communication reliability. Qualitative examples presented later further show that the proposed detector recovers tiny-vessel instances that are missed by the baseline.

3.2. P2 Detection Head and NWD Loss for Tiny Vessels

To enhance the detectability of tiny vessels, we introduce a P2 detection head that fuses high-resolution shallow features with deeper semantic cues. As formulated in Equation (1), the P2 feature map is constructed by upsampling the P3 feature and merging it with the corresponding backbone C2 feature, thereby preserving fine spatial details at a stride of 4:
$$F_{P2} = \mathrm{Conv}\!\big(\mathrm{Upsample}(F_{P3}) \oplus F_{C2}\big) \qquad (1)$$
Here, $\oplus$ denotes channel-wise concatenation, $F_{P3}$ is the P3 feature (stride 8) from the top-down pathway, and $F_{C2}$ is the corresponding backbone C2 feature (stride 4). Compared with the default detection head operating at stride 8, the P2 head preserves finer spatial sampling, which is critical for targets spanning only a few pixels (Figure 3). This fusion recovers contextual cues while retaining high-resolution details at stride 4. We further provide a qualitative multi-stage visualization (baseline → +P2 → +NWD → +CBAM → full model) to illustrate the incremental improvements.
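For concreteness, a minimal PyTorch sketch of this fusion step is shown below. The module name P2Fusion, the channel arguments, and the Conv-BN-SiLU block are illustrative assumptions rather than the released implementation; concatenation is assumed as the merge operator, consistent with standard YOLO necks.

```python
import torch
import torch.nn as nn

class P2Fusion(nn.Module):
    """Sketch of Equation (1): upsample P3 (stride 8), concatenate with
    backbone C2 (stride 4), and fuse into the P2 head input."""
    def __init__(self, c_p3: int, c_c2: int, c_out: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = nn.Sequential(                      # Conv block of Eq. (1)
            nn.Conv2d(c_p3 + c_c2, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU(inplace=True),
        )

    def forward(self, f_p3, f_c2):
        # Upsampled P3 now matches C2's stride-4 resolution before concat.
        return self.fuse(torch.cat([self.up(f_p3), f_c2], dim=1))   # F_P2
```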
For localization, we adopt the Normalized Wasserstein Distance (NWD) loss by modeling each bounding box as a 2D Gaussian distribution [16]. Let $W_2$ denote the 2-Wasserstein distance between the predicted and ground-truth Gaussians; we define NWD as in Equation (2) and minimize the corresponding loss $L_{\mathrm{NWD}} = 1 - \mathrm{NWD}$:
$$\mathrm{NWD} = \exp\!\left(-\frac{W_2}{C}\right) \qquad (2)$$
Here, the normalization constant is $C = 12.8$, following the original NWD formulation for tiny-object detection [16]. This value keeps the Wasserstein distance in a numerically stable range during training and improves convergence when predicted and ground-truth boxes are extremely small or non-overlapping.
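As a minimal sketch, Equation (2) can be rendered as a training loss as follows, assuming boxes in (cx, cy, w, h) format and the closed-form 2-Wasserstein distance between axis-aligned Gaussian box models from [16]; the function name and epsilon guard are illustrative.

```python
import torch

def nwd_loss(pred, target, C=12.8, eps=1e-7):
    """Sketch of the NWD loss (Eq. (2)); pred/target are (N, 4) tensors in
    (cx, cy, w, h). Each box is modeled as the Gaussian
    N([cx, cy], diag((w/2)^2, (h/2)^2)), for which the squared
    2-Wasserstein distance has the closed form used below [16]."""
    d_center = (pred[:, :2] - target[:, :2]).pow(2).sum(dim=1)
    d_shape = ((pred[:, 2:] - target[:, 2:]) / 2.0).pow(2).sum(dim=1)
    w2 = torch.sqrt(d_center + d_shape + eps)   # 2-Wasserstein distance W_2
    nwd = torch.exp(-w2 / C)                    # Equation (2)
    return (1.0 - nwd).mean()                   # L_NWD = 1 - NWD
```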

3.3. CBAM for Glare Suppression

To suppress glare-induced feature dominance and sensor saturation [29], we adopt the Convolutional Block Attention Module (CBAM) [15] to reweight both channel-wise and spatial responses, following recent studies that apply attention mechanisms to improve maritime object detection under sea clutter and reflection noise [6]. The CBAM mitigates the collapse of fine-texture cues by learning to suppress glare-dominated activations while reallocating capacity to structured vessel contours.
As defined in Equation (3), the Channel Attention module emphasizes informative feature maps through pooled global descriptors, and the Spatial Attention module, defined in Equation (4), further refines localization by highlighting salient regions and attenuating high-brightness background highlights:
$$M_C(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \qquad (3)$$
$$M_S(F) = \sigma\big(f^{7\times 7}\big([\mathrm{AvgPool}(F);\, \mathrm{MaxPool}(F)]\big)\big) \qquad (4)$$
where $\sigma$ is the sigmoid function and $f^{7\times 7}$ denotes the convolution over the concatenated pooled descriptors, following the standard CBAM formulation [15].
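Equations (3) and (4) map directly onto a compact module. The sketch below assumes the standard CBAM hyperparameters (reduction ratio 16, 7 × 7 spatial kernel) from [15], not values tuned in this work.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel + spatial attention sketch per Equations (3)-(4) [15]."""
    def __init__(self, channels: int, r: int = 16, k: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP, Eq. (3)
            nn.Conv2d(channels, channels // r, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)  # f^{7x7}

    def forward(self, x):
        # Channel attention: sigmoid(MLP(AvgPool) + MLP(MaxPool)), Eq. (3)
        mc = torch.sigmoid(self.mlp(x.mean((2, 3), keepdim=True))
                           + self.mlp(x.amax((2, 3), keepdim=True)))
        x = x * mc
        # Spatial attention over pooled channel descriptors, Eq. (4)
        ms = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * ms
```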
Figure 4 provides a visual explanation of the glare suppression mechanism. Under severe radiometric clutter, the CBAM shifts attention away from saturated reflective regions and reallocates feature emphasis toward vessel-like structures, which helps stabilize downstream detection under high-GSS conditions. This joint modulation stabilizes feature representations under strong reflections and sea clutter, consistent with recent attention-assisted maritime detection frameworks [6,30].

3.4. Core Architecture and Optional Modules

To clarify the architectural contributions and avoid confusion between core components and optional modules, we categorize the system elements as follows.
Core architecture: The proposed detector is built upon three primary modifications: the introduction of the P2 detection head for enhanced tiny-object localization, the integration of the Convolutional Block Attention Module (CBAM) to improve feature representation under glare conditions, and the adoption of the Normalized Wasserstein Distance (NWD) loss for robust bounding-box regression on extremely small targets. These components form the core architecture evaluated in the ablation experiments.
Optional deployment modules: Additional mechanisms may be enabled depending on deployment requirements. These include the SAHI tiling strategy for high-resolution inference and Reliability-Guided Fusion (RGF) together with the Binary Line Segment Filter (BLSF) as optional geometric post-filters to suppress glare-induced artifacts. These modules are not required for the base detector and are evaluated separately to demonstrate their auxiliary benefits.

3.5. SAHI for High-Resolution Inference

To improve tiny-vessel recall in long-range maritime scenes, we optionally adopt a SAHI-style tiling strategy [17] for high-resolution inputs (e.g., 4K frames). Each frame is divided into overlapping patches, resized to the detector input resolution, and processed independently. Detections are mapped back and merged via NMS to remove duplicates.
Tiling increases the effective spatial resolution and mitigates the disappearance of sub-32 px targets after global downsampling, at a predictable computational cost (throughput decreases with patch count). Therefore, SAHI is enabled only for long-range monitoring or when scenes are dominated by tiny vessels.
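To make the tiling logic concrete, a minimal sketch is given below; it is not the official SAHI API, and detect_fn, the tile size, and the thresholds are placeholder assumptions.

```python
import cv2
import numpy as np

def sliced_inference(frame, detect_fn, tile=640, overlap=0.2,
                     score_thr=0.25, nms_thr=0.5):
    """SAHI-style tiling sketch. detect_fn(patch) -> (boxes_xyxy, scores)
    in patch-local coordinates."""
    H, W = frame.shape[:2]
    step = max(1, int(tile * (1.0 - overlap)))
    all_xyxy, all_xywh, all_scores = [], [], []
    for y in range(0, max(H - tile, 0) + 1, step):
        for x in range(0, max(W - tile, 0) + 1, step):
            boxes, scores = detect_fn(frame[y:y + tile, x:x + tile])
            for (x1, y1, x2, y2), s in zip(boxes, scores):
                all_xyxy.append([x1 + x, y1 + y, x2 + x, y2 + y])  # remap
                all_xywh.append([x1 + x, y1 + y, x2 - x1, y2 - y1])
                all_scores.append(float(s))
    if not all_xyxy:
        return []
    # Greedy NMS over merged detections removes tile-overlap duplicates.
    keep = cv2.dnn.NMSBoxes(all_xywh, all_scores, score_thr, nms_thr)
    return [all_xyxy[i] for i in np.asarray(keep).flatten()]
```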

3.6. Reliability-Guided Fusion (RGF) and Binary Line Segment Filter (BLSF)

Maritime glare and wave reflections often produce saturated regions and elongated streaks, degrading feature reliability. We introduce a Reliability-Guided Fusion (RGF) mechanism combined with a Binary Line Segment Filter (BLSF).
RGF leverages a Depth Reliability Map (DRM) derived from stereo disparity estimation using SGBM, following the reliability modeling in [22]. Regions with unstable disparity are treated as low-confidence. Because glare frequently causes depth failure, the DRM acts as a spatial weighting matrix that suppresses unreliable features before subsequent detection layers. This intermediate fusion attenuates glare artifacts prior to attention modules such as the CBAM.
In addition, a lightweight BLSF suppresses geometrically implausible linear clutter (e.g., wave crests and horizon streaks). The BLSF penalizes detections dominated by elongated, low-compactness structures rather than vessel-like blobs. We adopt the core formulation from [21] and integrate it as a post-filter within the maritime pipeline to reduce glare-driven false positives while preserving tiny-vessel recall. Algorithmic details are provided in [21,22]; here, we focus on system-level integration.
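Algorithmic details of the BLSF are given in [21,22]. Purely to illustrate the kind of geometric test involved, the sketch below flags a detection whose binarized support is elongated and non-compact; the function name and thresholds are hypothetical.

```python
import cv2
import numpy as np

def blsf_reject(det_patch_bin, elongation_thr=4.0, compactness_thr=0.15):
    """Illustrative line-segment test: return True if the dominant blob in a
    binarized detection crop looks like an elongated streak (wave crest,
    horizon edge) rather than a vessel-like compact blob."""
    cnts, _ = cv2.findContours(det_patch_bin, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
    if not cnts:
        return False
    c = max(cnts, key=cv2.contourArea)
    area = cv2.contourArea(c)
    if area < 1:
        return False
    (w, h) = cv2.minAreaRect(c)[1]                      # oriented extents
    elongation = max(w, h) / max(min(w, h), 1e-6)
    perimeter = cv2.arcLength(c, True)
    compactness = 4.0 * np.pi * area / max(perimeter ** 2, 1e-6)  # 1 = circle
    return elongation > elongation_thr and compactness < compactness_thr
```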

3.7. Link-Degradation Sensitivity Analysis and Dual-Brain Mitigation

In marine operations, sea-surface reflections create strong, time-varying multipaths. Depending on platform height and range, signals follow two- or three-ray interference patterns, with deep fades when direct and reflected components cancel [25]. These link-level fluctuations manifest as bursty packet loss and rapidly varying bitrate at the application layer.
A key pitfall is that “usable video” does not guarantee “usable control.” RTSP over TCP or UDP/RTP can accumulate delay under congestion; interactions between reliability mechanisms and buffering may cause freeze (head-of-line blocking) or stale frames due to bufferbloat [3,4,5]. Given finite control stability margins, this leads to a latency cliff where perception feedback becomes unsafe.
The dual-brain architecture mitigates this by: (i) running detection on the Jetson as a high-throughput perception brain, (ii) maintaining a deterministic MCU safety island that rejects stale detections and propagates state during dropouts, and (iii) providing a metadata-first low-bandwidth channel for situational awareness.
To formally define the degradation and recovery criteria between the perception and safety-control branches, we implement a dual-brain state machine with asymmetric temporal hysteresis, as illustrated in Figure 5. The system degrades immediately to MCU-led mode when the measured $L_{\mathrm{e2e}}$ persistently exceeds the safe bound, when valid detections become stale beyond the replay buffer horizon, or when GSS > 0.9 indicates vision-degraded saturation. Control is not returned symmetrically. Instead, the system first enters a Recovery Pending state and restores Jetson-led operation only after 15 consecutive valid inference frames with GSS < 0.8 and no renewed latency spikes or frame-drop bursts. This hysteretic recovery rule prevents oscillatory mode switching under intermittent glare or communication jitter.
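The asymmetric hysteresis can be summarized as a small state machine. The sketch below mirrors the rules above (immediate degradation; 15 consecutive clean frames with GSS < 0.8 to recover); the 200 ms latency bound is an assumed placeholder taken from the teleoperation target in Section 6.4.

```python
from enum import Enum, auto

class Mode(Enum):
    JETSON_LED = auto()        # vision-updated guidance
    MCU_LED = auto()           # safety-island fallback
    RECOVERY_PENDING = auto()  # waiting for sustained clean frames

class DualBrainFSM:
    """Asymmetric-hysteresis mode switching sketch (rules of Section 3.7)."""
    def __init__(self, l_safe_ms=200.0, gss_degrade=0.9,
                 gss_recover=0.8, recover_frames=15):
        self.mode, self.clean = Mode.JETSON_LED, 0
        self.l_safe, self.recover_frames = l_safe_ms, recover_frames
        self.gss_degrade, self.gss_recover = gss_degrade, gss_recover

    def step(self, l_e2e_ms, gss, detection_stale):
        unsafe = (l_e2e_ms > self.l_safe or detection_stale
                  or gss > self.gss_degrade)
        if unsafe:
            # Degrade immediately; any renewed spike resets recovery progress.
            self.mode, self.clean = Mode.MCU_LED, 0
        elif self.mode is Mode.MCU_LED:
            self.mode = Mode.RECOVERY_PENDING
        if self.mode is Mode.RECOVERY_PENDING:
            self.clean = self.clean + 1 if gss < self.gss_recover else 0
            if self.clean >= self.recover_frames:
                self.mode, self.clean = Mode.JETSON_LED, 0
        return self.mode
```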
Section 6.3 quantifies robustness using TSMR via controlled delay/jitter/dropout injection and tracking evaluation.
To decouple non-deterministic vision latency from time-critical control, the MCU implements a safety-island tracker with a ring buffer of time-stamped detections and inertial priors. Late or out-of-phase measurements are handled through time-stamped measurement replay (TSMR) with IMM-UKF correction and forward re-propagation. The processing pipeline of TSMR is illustrated in Figure 6.
For a relative target speed $v$, a temporal misalignment $\Delta t$ induces a spatial association error $\varepsilon \approx v\,\Delta t$, which may exceed the gating radius for tiny vessels and cause identity switches. TSMR reduces association fragmentation and ensures deterministic tracking outputs.
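A minimal sketch of the TSMR replay loop is shown below. The wrapped filter interface (predict/update/get_state/set_state) is an assumed abstraction over the IMM-UKF, and the 1 s horizon follows the ring-buffer design in Section 4.2.

```python
import bisect

class TSMR:
    """Time-stamped measurement replay sketch. The wrapped filter is assumed
    to expose predict(dt), update(z), get_state(), set_state(s)."""
    def __init__(self, filt, t0, horizon_s=1.0):
        self.filt = filt
        self.hist = [(t0, None, filt.get_state())]  # (t, z, state_after_update)
        self.horizon = horizon_s

    def on_measurement(self, t_m, z, t_now):
        # Prune entries older than the replay horizon (1 s ring buffer),
        # always keeping at least the most recent snapshot.
        self.hist = ([h for h in self.hist if h[0] >= t_now - self.horizon]
                     or self.hist[-1:])
        i = max(bisect.bisect_right([h[0] for h in self.hist], t_m), 1)
        # Rewind to the last state at or before t_m, insert, replay forward.
        self.filt.set_state(self.hist[i - 1][2])
        self.hist.insert(i, (t_m, z, None))
        t_prev = self.hist[i - 1][0]
        for k in range(i, len(self.hist)):
            t_k, z_k, _ = self.hist[k]
            self.filt.predict(max(t_k - t_prev, 0.0))
            if z_k is not None:
                self.filt.update(z_k)                # IMM-UKF correction
            self.hist[k] = (t_k, z_k, self.filt.get_state())
            t_prev = t_k
        self.filt.predict(max(t_now - t_prev, 0.0))  # re-propagate to "now"
        return self.filt.get_state()
```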
Figure 7 summarizes the role split in the proposed dual-brain architecture. The Jetson branch handles high-rate perception and association, whereas the MCU safety island preserves time-stamped replay, state continuity, and conservative command generation when latency or link quality degrades.

3.8. MCU Safety-Island RRT Planning and MAVLink Commanding

In addition to deterministic tracking (TSMR + IMM-UKF), the MCU safety island also provides a lightweight obstacle-avoidance planning loop to maintain conservative navigation when the edge-AI branch becomes delayed. At each planning tick (e.g., 10–20 Hz), the MCU queries the most recent state estimate $(\hat{x}, P)$ from the IMM-UKF and constructs a local safety representation (e.g., a coarse 2D occupancy/cost map in the NED frame) using (i) predicted target states from the tracker, (ii) predefined keep-out zones (geo-fence), and (iii) short-term motion constraints from the autopilot. A bounded-iteration RRT planner is then executed with a fixed compute budget (max nodes/max iterations) to generate a collision-free waypoint sequence over a short receding horizon.
The resulting path is converted into a smoothed short-horizon reference trajectory. The bounded-iteration RRT planner runs at 10–20 Hz because local cost-map construction and graph expansion are computationally heavier and only need to update at the environmental time scale. By contrast, the MCU transmits MAVLink setpoints at 20–50 Hz by interpolating along the latest verified-safe trajectory, so that the command stream remains aligned with the faster autopilot inner-loop dynamics. This rate decoupling separates path generation from command tracking: the planner refreshes the reference when new perception/tracker updates arrive, while the higher-rate setpoint stream preserves control smoothness and avoids oscillatory or staircase motion between planning ticks. If the perception branch reports degraded vision (stale or missing detections beyond a time threshold), the MCU freezes the last verified-safe trajectory and switches to a conservative “hold/loiter” policy until reliable updates resume.
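The rate decoupling between the 10–20 Hz planner and the 20–50 Hz setpoint stream can be sketched as follows; get_trajectory and send_setpoint are assumed callbacks (the latter standing in for a MAVLink SET_POSITION_TARGET_LOCAL_NED sender), and linear interpolation between trajectory knots is used for simplicity.

```python
import time

def setpoint_streamer(get_trajectory, send_setpoint, rate_hz=50.0):
    """Rate-decoupling sketch: stream interpolated setpoints at 20-50 Hz
    along the latest verified-safe trajectory, independent of the 10-20 Hz
    planner tick. get_trajectory() -> [(t, x, y, z), ...] knots stamped on
    the same monotonic clock used here."""
    dt = 1.0 / rate_hz
    while True:
        traj = get_trajectory()                  # latest verified-safe path
        t = min(max(time.monotonic(), traj[0][0]), traj[-1][0])  # clamp
        for (t0, *p0), (t1, *p1) in zip(traj, traj[1:]):
            if t0 <= t <= t1:                    # bracketing knots found
                a = (t - t0) / max(t1 - t0, 1e-6)
                send_setpoint(*(q0 + a * (q1 - q0)
                                for q0, q1 in zip(p0, p1)))
                break
        else:
            send_setpoint(*traj[-1][1:])         # single-knot fallback (hold)
        time.sleep(dt)
```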

4. System Implementation: Maritime UAV–Edge–Cloud Pipeline

To ensure that algorithmic gains translate to mission utility, we implement an end-to-end pipeline including a custom waterproof UAV platform, an embedded edge node for inference, and a cloud dashboard for logging and visualization.

4.1. Custom Waterproof UAV Platform

We adopt a fully programmable custom UAV built around an open autopilot and a stabilized IP/RTSP camera interface for direct edge ingestion.
Engineering rationale: Maritime operations frequently experience intermittent communication, specular-glare-induced vision latency cliffs, and constrained SWaP-C. Accordingly, we deploy a dual-brain architecture where the Jetson Orin Nano performs perception and publishes detections, while an MCU safety island maintains deterministic tracking and can issue conservative navigation commands to the Pixhawk via CAN when GPU inference or the ground link becomes unreliable. Supplementary Table S1 summarizes the deployed hardware stack. The Jetson Orin Nano was chosen as the low-cost edge-AI baseline because it delivers up to 40 TOPS (INT8) within a 15 W power envelope, representing an order-of-magnitude uplift over the 2019-era Jetson Nano while remaining compatible with UAV SWaP-C constraints [6,31]. This computing headroom allows us to deploy advanced features (e.g., the P2 layer and attention modules) and even concurrent models without sacrificing real-time throughput [23], bridging the edge deployment gap identified in Section 1.2 [6,23,31].
To ensure performance stability across extended maritime patrols, we explicitly address the thermal and power implications of operating the Jetson Orin Nano across its 15 W Max-P and 25 W MAXN/Super Mode envelopes. In our UAV integration, the edge computer is mounted on a dedicated heat-spreading structure with active airflow, and GPU/CPU clocks are locked to avoid frequency oscillations and latency spikes caused by thermal throttling. When operating in Super Mode for bandwidth-intensive perception workloads, power draw and junction temperature are continuously monitored, and non-critical workloads are adaptively throttled to maintain sustained real-time performance throughout long-duration missions. This thermal-aware power management strategy aligns with recent edge-robotics guidance on multi-model execution under tight TDP constraints and is critical for preserving deterministic behavior in safety-relevant UAV operations [32].
NVIDIA positions Jetson Orin Nano as a practical entry-level edge-AI platform that bridges real-time deployment needs with compact power envelopes, aligning with the onboard perception requirements of maritime UAV systems [33,34]. To motivate the choice of the embedded deployment platform, Supplementary Table S2 compares the key specifications of NVIDIA Jetson Nano and Jetson Orin Nano that are most relevant to real-time maritime UAV perception. The substantial differences in GPU architecture, memory bandwidth, and AI compute capability explain the improved throughput and latency headroom observed on Jetson Orin Nano, which is therefore selected as the primary target platform for Resi-YOLO deployment in this work.
Figure 8 summarizes the physical integration of the maritime UAV platform. The annotated top and side views identify the autopilot, Jetson Orin Nano, camera-gimbal payload, power-distribution hardware, and telemetry modules. This layout provides context for the non-propulsion power budget and the practical integration constraints of long-duration maritime deployment.

4.2. Dual-Brain Link Rate and Packet Definition

The Jetson–MCU link is implemented over UART and transmits a compact state packet at a fixed rate $f_{\mathrm{link}}$ (e.g., 50–100 Hz). Each packet includes: (i) a monotonic sequence ID, (ii) a source time-stamp $t_k$ (in microseconds), (iii) the active track ID(s), (iv) target kinematics (e.g., $(x, y)$ position and $(v_x, v_y)$ velocity in the local frame), and (v) a quality/reliability flag (e.g., confidence score, GSS regime, or a "vision-degraded" bit). The MCU maintains a ring buffer of the most recent $N = f_{\mathrm{link}} \cdot T_{\mathrm{buf}}$ packets with $T_{\mathrm{buf}} = 1$ s (e.g., 100 entries at 100 Hz), enabling time-stamped measurement replay (TSMR) to compensate for delay/jitter and to propagate state estimates through short perception outages without breaking the safety loop. This explicit link-rate/buffer design provides an engineering guarantee that the system can tolerate at least 1 s of short-term visual degradation while preserving bounded-latency decision-making.
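A hypothetical byte-level encoding of this packet is sketched below to make the field layout concrete; the actual wire format, field widths, and framing/CRC of the deployed system are not specified here.

```python
import struct

# Hypothetical wire format for the Jetson -> MCU state packet (Section 4.2):
# little-endian, no padding: uint32 seq, uint64 t_us, uint16 track_id,
# float32 x, y, vx, vy, conf, uint8 flags (framing/CRC omitted).
PACKET_FMT = "<IQH5fB"
PACKET_SIZE = struct.calcsize(PACKET_FMT)   # 35 bytes

def pack_state(seq, t_us, track_id, x, y, vx, vy, conf, flags):
    return struct.pack(PACKET_FMT, seq, t_us, track_id,
                       x, y, vx, vy, conf, flags)

def unpack_state(payload: bytes):
    seq, t_us, track_id, x, y, vx, vy, conf, flags = \
        struct.unpack(PACKET_FMT, payload)
    return {"seq": seq, "t_us": t_us, "id": track_id,
            "state": (x, y, vx, vy), "conf": conf, "flags": flags}
```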

4.3. Power, Signal, and Time Synchronization

To ensure repeatable performance under marine vibration and long-duration missions, we explicitly document (i) power distribution, (ii) signal paths, and (iii) time synchronization for consistent cross-module logging and latency attribution. Power distribution: The onboard battery feeds dedicated regulators/BEC rails for (a) the edge node, (b) the autopilot/MCU safety island, and (c) the camera and network interface. To prevent transient brownouts from propagating across modules, each rail is decoupled with local bulk capacitance and protected by undervoltage/overcurrent safeguards. This separation helps avoid vision pipeline resets during aggressive maneuvers and reduces timing jitter induced by power instability.
Signal and time synchronization: The camera stream is delivered to the edge node via Ethernet (RTSP), while the autopilot state and control channels are exchanged via MAVLink (UART) and the safety-island interface. For consistent logging across the dual-brain pipeline, the edge node maintains a synchronized clock (e.g., NTP) and records time-stamps at (1) stream ingestion/decoding, (2) inference output, and (3) publishing/telemetry transmission. The MCU similarly time-stamps received packets using its local clock and stores sequence IDs to support time-aligned replay (TSMR). To ensure consistent time-stamp alignment between the Jetson and MCU, an offset calibration and drift monitoring mechanism is implemented, as illustrated in Figure 9.
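As a sketch of the offset calibration step, an NTP-style two-way exchange can estimate the Jetson–MCU clock offset while discarding high-RTT samples; send_ping is an assumed transport hook, and the sample counts are illustrative.

```python
def estimate_offset(send_ping, n=50):
    """NTP-style two-way offset estimation sketch between Jetson and MCU.
    send_ping() -> (t_tx, t_mcu, t_rx): Jetson send time, MCU local stamp,
    and Jetson receive time, all in seconds."""
    samples = []
    for _ in range(n):
        t_tx, t_mcu, t_rx = send_ping()
        rtt = t_rx - t_tx
        # Midpoint assumption: the MCU stamped the packet at t_tx + rtt/2.
        samples.append((t_mcu - (t_tx + rtt / 2.0), rtt))
    # Keep the lowest-RTT fifth of samples (least queuing-induced error).
    samples.sort(key=lambda s: s[1])
    best = [off for off, _ in samples[: max(1, n // 5)]]
    return sum(best) / len(best)   # offset to subtract from MCU stamps
```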
Together, these measures enable unambiguous separation between onboard $L_{\mathrm{edge}}$ and full streaming $L_{\mathrm{e2e}}$ latency when analyzing delay/jitter events. These implementation details ensure that the reported robustness (delay/jitter tolerance and glare-aware reliability) is attributable to the proposed architecture rather than incidental integration artifacts.

4.4. Low-Bandwidth Messaging

To accommodate weak maritime links, the edge node publishes compact detection events via MQTT. The backend subscribes to relevant topics, stores logs in a database (e.g., PostgreSQL/MongoDB), and pushes live updates to the frontend via WebSocket.
To address the constraints of maritime communication links, we design a compact and event-driven messaging scheme that decouples detection alerts, system health monitoring, and visual verification. Lightweight MQTT topics are used for high-priority alerts and status updates, while WebSocket streams are activated only when operator confirmation is required. The detailed messaging interfaces and payload design are summarized in Table 1.
Figure 10 provides a functional overview of the end-to-end dataflow, highlighting how high-rate video streams are processed locally on the edge node and converted into low-bandwidth, event-driven messages suitable for unreliable maritime links. The figure emphasizes system architecture, data-rate asymmetry, and the separation between onboard perception and backend visualization.
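A minimal publisher sketch for a detection-alert topic is shown below; the broker address, topic name, and payload fields are illustrative stand-ins for the interfaces summarized in Table 1.

```python
import json
import time
import paho.mqtt.client as mqtt

# paho-mqtt 1.x style client; 2.x additionally requires a CallbackAPIVersion.
client = mqtt.Client()
client.connect("ground-station.local", 1883)   # hypothetical broker address

def publish_detection(track_id, cls, x, y, conf, gss):
    """Compact, event-driven detection alert; QoS 1 tolerates weak links."""
    payload = {
        "t_us": int(time.time() * 1e6),        # capture-aligned timestamp
        "id": track_id, "cls": cls,
        "pos": [round(x, 1), round(y, 1)],     # local-frame position (m)
        "conf": round(conf, 2), "gss": round(gss, 2),
    }
    client.publish("uav/alerts/detection", json.dumps(payload), qos=1)
```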

5. Experimental Protocol

5.1. Datasets and Splits

We adopted a hybrid dataset strategy to evaluate both benchmarking comparability and real-world deployment robustness. Specifically, a public maritime UAV dataset (e.g., SeaDronesSee) was used for benchmarking against prior work, while an in-house coastal UAV dataset collected under glare-heavy conditions was used to assess deployment-oriented performance in realistic maritime environments. The datasets differ in both visual characteristics and operational conditions. The public dataset provides standardized evaluation settings commonly used in maritime detection research, whereas the in-house dataset captures real UAV footage under strong specular reflections and high-glare sea surfaces. Dataset statistics, including image resolution, number of frames, annotation counts, tiny-object ratios, and glare severity distribution, were compiled and are reported in Table 2.
The in-house coastal UAV dataset comprised approximately 2500 4K frames (3840 × 2160) collected over the western coastal waters near Anping Port, Tainan, Taiwan, primarily during late-morning to noon flights in November–December, when strong specular reflection frequently affected the sea surface. Flights were conducted at 30–120 m AGL with gimbal pitch angles of 30–60° and target slant ranges of approximately 100–800 m. The label ontology was aligned with SeaDronesSee and included Boat, Swimmer, JetSki, LifeJacket, and Buoy. Bounding boxes were annotated independently by three annotators and cross-checked by consensus intersection; severely ambiguous or glare-saturated regions were marked as ignore and excluded from loss computation. Data collection complied with local aviation regulations, and the operating geometry prevented capture of identifiable facial features or sensitive facilities.

5.2. Training Recipe and Glare-Oriented Augmentations

We applied both standard YOLO augmentations (Mosaic, MixUp, and HSV jitter) and glare-oriented augmentations designed to simulate high-reflection maritime conditions. All YOLO baselines and model variants were implemented using the official Ultralytics codebase, ensuring consistent training and inference pipelines across all experiments [35]. In addition to standard augmentations, we introduced glare-specific transformations including synthetic saturation patches and contrast-limited adjustments to emulate specular reflections commonly observed on sea surfaces. The augmentation probabilities and parameter ranges were fixed across all model variants to ensure a fair comparison. Figure 11 summarizes the three augmentation categories used during training. Mosaic/MixUp increases scale and context diversity, synthetic glare patches emulate specular highlights on the sea surface, and brightness/contrast perturbations expand exposure variability. These transformations were applied uniformly across all model variants to improve robustness without introducing an unfair advantage to any single configuration. In contrast, Figure 1 and Figure 4 highlight representative failure cases and attention behavior observed during evaluation under real glare conditions.
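For illustration, one possible implementation of the synthetic saturation patches is sketched below; the patch counts, radii, and blending strengths are placeholder ranges, not the exact training values.

```python
import cv2
import numpy as np

def add_synthetic_glare(img, n_patches=(1, 4), radius=(20, 120),
                        strength=(0.5, 1.0)):
    """Glare-oriented augmentation sketch: blend soft, near-saturated
    elliptical highlights into the frame to emulate sun glint."""
    out = img.astype(np.float32)
    h, w = img.shape[:2]
    for _ in range(np.random.randint(*n_patches)):
        mask = np.zeros((h, w), np.float32)
        center = (np.random.randint(0, w), np.random.randint(0, h))
        axes = (np.random.randint(*radius), np.random.randint(*radius))
        cv2.ellipse(mask, center, axes, np.random.uniform(0, 180),
                    0, 360, 1.0, -1)                       # filled ellipse
        mask = cv2.GaussianBlur(mask, (0, 0), sigmaX=axes[0] / 3.0)
        a = np.random.uniform(*strength) * mask[..., None]
        out = out * (1 - a) + 255.0 * a        # blend toward clipped white
    return np.clip(out, 0, 255).astype(np.uint8)
```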

5.3. Metrics: Accuracy, Robustness, and Efficiency

In addition to the standard mAP metric, we evaluated failure-mode and deployment-oriented metrics summarized in Table 3. Specifically, we report APsmall and Recallsmall at IoU = 0.5 for objects whose bounding-box area after resize was <1024 px² (equivalently, <32 × 32 pixels, following the COCO area-based convention rather than side length), FPIglare on glare-heavy subsets, and embedded efficiency metrics including mean and p95 latency together with steady-state FPS on Jetson Orin Nano (batch = 1) after warm-up.
To quantify radiometric clutter caused by specular reflections, we defined the Glare Severity Score (GSS). For each frame, the image was converted from BGR to HSV color space, and glare pixels were identified using V ≥ 217 and S ≤ 38. These thresholds were chosen to capture clipped, weakly saturated specular highlights on the sea surface rather than normally illuminated water. The GSS was computed within a sea-surface ROI, excluding sky and shoreline regions, as the ratio of glare pixels to total ROI pixels:
$$\mathrm{GSS} = \frac{\sum_{i=1}^{n} P_{\mathrm{glare},i}}{P_{\mathrm{total}}}$$
This score, ranging from 0 to 1, is used to stratify illumination severity within our study. Because exposure conditions may vary across sequences, the GSS is interpreted here as a relative robustness indicator under a common camera family and preprocessing pipeline, rather than as an absolute photometric quantity across different sensors.
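A minimal sketch consistent with this definition (cf. the script outline in Supplementary Section S2) is shown below, using the stated thresholds V ≥ 217 and S ≤ 38 within a sea-surface ROI mask.

```python
import cv2
import numpy as np

def glare_severity_score(frame_bgr, roi_mask, v_thr=217, s_thr=38):
    """GSS sketch: fraction of clipped, low-saturation pixels inside the
    sea-surface ROI (roi_mask: uint8, nonzero = sea surface)."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    s, v = hsv[..., 1], hsv[..., 2]
    glare = (v >= v_thr) & (s <= s_thr) & (roi_mask > 0)
    total = int(np.count_nonzero(roi_mask))
    return float(np.count_nonzero(glare)) / max(total, 1)
```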

5.4. Baselines, Ablations, and Resolution Study

We used YOLO11n as the primary backbone and baseline and additionally included YOLOv8n as a widely adopted reference model among recent YOLO-based detectors [19], with all models trained and evaluated under identical data splits and training recipes [36]. Ablation experiments were conducted on the YOLO11n backbone by progressively enabling the P2 detection head, CBAM attention module, and NWD loss and by evaluating their combined core configuration (Resi-YOLO). SAHI and RGF/BLSF were evaluated separately as optional deployment modules. We further evaluated inference-time input resolution (e.g., 640, 960, and 1280) to characterize the trade-off between tiny-object detection accuracy and embedded throughput. All models followed the official Ultralytics releases and codebase [35,37]. Under identical training and inference settings on Jetson Orin Nano, Resi-YOLO (P2 + CBAM + NWD) achieved consistent improvements in small-object detection accuracy, positioning the model beyond the YOLO11n baseline in the accuracy–throughput design space.
Figure 12 summarizes the accuracy–throughput trade-off on Jetson Orin Nano by plotting APsmall against measured FPS, with mean and p95 end-to-end latency annotated to capture RTSP streaming-time variability (TensorRT FP16, 10 W, batch = 1). To complement this system-level view, Figure 10 illustrates the end-to-end dataflow and where each processing block resides, whereas Figure 13 provides a quantitative latency breakdown across pipeline stages (decoding, preprocessing, inference, postprocessing, and messaging), clarifying the dominant contributors to end-to-end delay and supporting real-time feasibility on the Jetson Orin Nano.
Figure 13 decomposes the end-to-end delay into decoding, preprocessing, inference, postprocessing, and publishing stages. The results show that inference is a major but not exclusive contributor to latency; video I/O and preprocessing also account for a non-negligible fraction of the deployment-time delay. This breakdown clarifies why engine-level FPS and full-pipeline FPS should be interpreted separately.
Robustness under size and glare variation: Figure 14 reports recall as a function of object size and glare severity, stratified by the Glare Severity Score (GSS). Across all size bins, Resi-YOLO consistently outperforms the YOLO11n baseline, with the largest gains observed in the small-object regime (<32 px) under medium-to-high glare conditions. In particular, for objects of size 16–32 px, recall improvements of +0.08 to +0.12 are achieved as glare severity increases, confirming that the proposed design is most effective when both scale and radiometric clutter are challenging.
For larger objects (>64 px), the recall improvement is more modest (+0.03 to +0.04) and largely invariant to glare severity, indicating that the baseline model already performs near saturation in this regime. Cells marked as NA correspond to insufficient samples and are excluded from interpretation. Overall, the results demonstrate that Resi-YOLO shows robustness primarily where IoU-based matching and glare sensitivity most severely limit baseline performance.
Figure 14 shows that the benefit of Resi-YOLO is concentrated in the most challenging regimes, particularly for small objects under medium-to-high glare. The gain becomes less pronounced for larger vessels, where the baseline already operates near saturation. Cells with insufficient samples are excluded from interpretation. Following the above protocol, we evaluated quantitative detection and tracking performance, embedded throughput and latency, and communication robustness on Jetson Orin Nano, as presented in Section 6. Where applicable, we also analyzed the portability of the proposed Resi-YOLO pipeline to lower-cost deployment scenarios, with additional implementation details summarized in Supplementary Table S2.

6. Results and Discussion

The robustness trends observed in Figure 14 are consistent with the qualitative and architectural analyses presented earlier. Figure 3 demonstrates that P2 and NWD primarily address scale-related failures by stabilizing localization and matching for tiny objects, while Figure 4 explains how the CBAM suppresses glare-dominated attention under severe radiometric clutter. Figure 11 further shows that glare-oriented data augmentation exposes the model to diverse illumination patterns during training. Together, these components jointly contribute to the recall improvements observed for small vessels under increasing glare severity, as quantified in Figure 14. Notably, the NWD formulation directly mitigates the bounding-box oscillation observed for far-range tiny vessels in Figure 3c, because it maintains smooth gradients even when IoU overlap becomes unstable or nearly vanishes. This stabilization further benefits downstream association by reducing frame-to-frame box jitter under high-GSS conditions.
Figure 3 qualitatively shows that P2 and NWD primarily address scale-related failure modes in tiny-vessel detection. The higher-resolution P2 pathway preserves finer spatial cues for far-range targets, while the NWD formulation stabilizes localization when overlap-based supervision becomes unreliable for very small boxes. As a result, the proposed detector exhibits fewer misses and fewer glare-induced false alarms than the baseline in challenging maritime scenes.

6.1. Tiny-Vessel Detection Performance

We first report tiny-vessel detection performance and ablation results, focusing on small-object recall and glare-robust confidence. Table 4 summarizes the core accuracy metrics under the unified split and training recipe, where Resi-YOLO (P2 + CBAM + NWD) is compared against YOLOv8n and YOLO11n baselines. By reducing frame-to-frame bounding-box jitter, NWD also stabilizes association gating in the downstream tracker, which in turn lowers ambiguous matchings and contributes to fewer identity switches under glare-prone, delay/jitter-impaired maritime streams.

6.2. Glare Robustness with GSS-Stratified Evaluation

Average mAP can hide failures under extreme glare, where sea-surface specular highlights dominate the radiometric budget and corrupt both detection confidence and data association. We therefore report glare-stratified detection metrics using the Glare Severity Score (GSS), which measures the proportion of over-exposed, low-saturation pixels within a predefined region of interest (ROI). Based on the GSS, the test set is partitioned into three illumination regimes—low, medium, and high—and Resi-YOLO is evaluated against the YOLO11n baseline under each regime.
Table 5 summarizes the GSS-stratified detection performance. When glare severity is low (GSS ∈ [0.0, 0.3]), the baseline model already exhibits reasonable detection capability; nevertheless, Resi-YOLO still achieves an absolute recall improvement of 0.08, indicating that the P2 detection head contributes additional sensitivity even under favorable illumination. Under medium glare conditions (GSS ∈ [0.3, 0.6]), Resi-YOLO further widens the performance gap, improving recall by 0.10 and mAP@0.5 by 7.3 percentage points compared with the baseline.
Most notably, under extreme glare conditions (high GSS ∈ [0.6, 1.0]), the baseline model’s recall degrades sharply to 0.30, reflecting severe vulnerability to specular reflections and radiometric clutter. In contrast, Resi-YOLO maintains a recall of 0.45 and achieves a mAP@0.5 gain of 12.5 percentage points (41.2% to 53.7%). These results demonstrate that the proposed CBAM-based glare suppression mechanism effectively filters physics-induced noise and enhances weak target signals, thereby preserving detection robustness in high-glare maritime environments. A minimal script outline for GSS computation is provided in Supplementary Section S2.

6.3. Dual-Brain Tracking Robustness Under Delay/Jitter

To emulate real-world maritime teleoperation, we evaluate end-to-end multi-object tracking (MOT) robustness under deployment-like video I/O impairments, including live RTSP streaming, fixed latency injection, and jitter with frame drops [21]. In practical maritime scenarios, video streams transmitted over satellite or wireless links are often subject to irregular delays and packet loss, which can severely disrupt temporal alignment and identity association in conventional tracking pipelines.
Table 6 reports tracking stability metrics under different deployment conditions, including MOTA, IDF1, ID switch counts (IDS), and end-to-end perception latency [38]. LIVE-RTSP denotes direct ingestion of the RTSP stream on Jetson without impairment. To isolate the contribution of the MCU safety island from that of improved detection, all delay/jitter experiments used the same Resi-YOLO detection stream while varying only the tracking backend between the standard ByteTrack pipeline [23] and the MCU-based TSMR + IMM-UKF tracker. LAG-50 ms and JITTER-20 ms (+drop) therefore reflect tracking robustness under identical detections but different temporal recovery mechanisms.
In maritime MOT tasks, target motion is inherently non-linear and is further perturbed by camera vibration induced by wave motion. As shown in Table 6, under nominal conditions (LIVE-RTSP), Resi-YOLO significantly outperforms the YOLO11n baseline in both MOTA and IDF1, while reducing ID switches by approximately 28%, demonstrating more stable identity association even in glare-prone environments.
When a fixed latency of 50 ms is introduced, conventional tracking-by-detection pipelines [23] suffer from spatiotemporal misalignment when latency or jitter disrupts temporal consistency between detection and association stages, typically leading to a sharp increase in ID switches. In contrast, the proposed dual-brain architecture mitigates this effect by isolating tracking and state estimation within the MCU safety island. By replaying time-stamped measurements through TSMR and compensating for delayed observations via IMM-UKF, IDF1 decreases by only 0.7%, and IDS growth remains tightly controlled.
Under jitter and frame-drop conditions, all methods experience performance degradation due to missing and irregular updates. Nevertheless, the deterministic replay mechanism consistently preserves higher tracking robustness than detector-only pipelines, confirming the practical engineering value of decoupling perception uncertainty from control determinism. These results demonstrate that the proposed dual-brain design effectively bridges the gap between laboratory MOT benchmarks and real-world maritime deployment under unreliable communication links.

6.4. Embedded Feasibility on Jetson Orin Nano

We report end-to-end throughput (FPS), per-stage latency breakdown, and energy proxies on Jetson Orin Nano with TensorRT deployment. The dual-brain design keeps the safety island responsive even when the main GPU experiences occasional latency spikes, providing a practical path toward real-time maritime autonomy.

6.4.1. Definition of Throughput and Latency Metrics

To ensure reproducibility and avoid ambiguity between engine-level performance and deployed system throughput, we distinguish the following metrics used throughout this section.
FPSinfer (TensorRT inference throughput): the raw neural network inference speed measured using the TensorRT engine only. This metric excludes video capture, decoding, communication, and message serialization overhead.
FPSpipeline (end-to-end pipeline throughput): the effective runtime throughput of the deployed perception system, including video capture/encoding, network transfer, decoding, preprocessing, inference, postprocessing, and message publishing.
$L_{\mathrm{edge}}$: onboard perception latency measured on the Jetson from frame ingestion/decoding to result publishing.
$L_{\mathrm{e2e}}$: system end-to-end latency from camera capture to downstream message delivery, including camera encoding and network transport.
All throughput measurements were collected after a warm-up phase of approximately 200 frames to stabilize GPU clocks and TensorRT execution. Unless otherwise specified, inference experiments used TensorRT FP16 with batch = 1 on Jetson Orin Nano.
Table 7 (or Supplementary Section S3) reports the per-stage pipeline latency budget, while Table 8 summarizes embedded efficiency and complexity. Notably, by upgrading from the Jetson Nano (Maxwell, ~0.5 TFLOPS) to the Jetson Orin Nano (Ampere, ~40 TOPS), the primary inference stage latency is reduced by roughly one order of magnitude [6]. In our tests, TensorRT FP16 inference for Resi-YOLO dropped from ~150 ms on Nano to ~15 ms on Orin, which in turn lowered the end-to-end latency from ~150 ms to ~100 ms (mean)—well within the 200 ms target for operator-in-the-loop feedback. This significant boost in edge computing capability comes with only a modest increase in power draw (MaxN 10 W → 15 W), underscoring the improved FPS/Watt of the Orin Nano platform (approximately 3× higher frames per second per watt than Jetson Nano) and reinforcing Resi-YOLO’s status as a “Green AI” solution for on-board deployment [39,40].
To avoid ambiguity, we distinguish: (i) onboard perception latency $L_{\mathrm{edge}}$, measured on the Jetson from stream ingestion/decoding to result publishing (decode→infer→publish), and (ii) system end-to-end latency $L_{\mathrm{e2e}}$, which additionally includes camera capture/encoding and network transport. Table 8 reports the $p_{95}$ of $L_{\mathrm{edge}}$ for different model variants, while Figure 13 visualizes the full $L_{\mathrm{e2e}}$ breakdown under streaming deployment.
Table 7. Latency budget breakdown on Jetson Orin Nano for the deployed 640 × 640 streaming pipeline (TensorRT FP16, batch = 1) after an approximately 200-frame warm-up. Here, $L_{\mathrm{e2e}}$ denotes end-to-end latency from camera capture to downstream alert delivery, whereas $L_{\mathrm{edge}}$ refers to Jetson-side perception latency from stream ingestion/decoding to result publishing.
| Stage | Mean (ms) | p95 (ms) | Measurement Notes |
| --- | --- | --- | --- |
| Capture + encoding | 40 | 50 | On-camera ISP + H.265 encoder latency (SIYI A8 mini gimbal camera). |
| Network transfer | 5 | 10 | Gimbal-to-Jetson Ethernet streaming (wired LAN, negligible jitter). |
| Video decoding (NVDEC) | 15 | 25 | Hardware decoding via nvv4l2decoder (DeepStream optimized). |
| Preprocessing (VIC) | 8 | 12 | Resizing and color-space conversion (NV12→RGBA) on VIC hardware. |
| Inference (TensorRT) | ~15 | ~20 | TensorRT FP16, batch = 1 (Resi-YOLO model); INT8 could further reduce latency. |
| Postprocessing (NMS) | 4 | 8 | NMS and formatting on CPU/GPU. |
| Publishing/serializing | 2 | 5 | JSON serialization and MQTT publishing overhead. |
| End-to-end (total) | ~90 | ~130 | Total pipeline latency (frame capture to alert); target < 200 ms for reliable teleoperation. |
Note: Capture + encoding was measured from the SIYI A8 mini gimbal camera (SIYI Technology, Shenzhen, China) path (on-camera ISP + H.265 encoding), while Jetson-side stages were time-stamped using the synchronized Jetson system clock. Statistics were computed over 10,240 streamed frames under the stated wired-LAN deployment setting.
Table 8. Onboard throughput and latency measured at 640 × 640 resolution without SAHI slicing under Standard Mode (15 W Max-P) on Jetson Orin Nano. FPS reports the effective runtime throughput in the deployed configuration.
| Model Variant | Parameters (M) | GFLOPS | FPS (Standard) | p95 Latency (ms) | Computational Stability (Sim) |
| --- | --- | --- | --- | --- | --- |
| YOLOv8n (Baseline) | 3.2 | 8.7 | ~27.0 | 45 | High |
| YOLO11n (Vanilla) | 2.6 | 6.5 | ~22.5 | 52 | High |
| YOLO11n + P2 | 3.4 | 10.5 | ~14.5 | 78 | Medium |
| YOLO11n + CBAM | 2.8 | 7.1 | 20.8 | 58 | Very High |
| Resi-YOLO (P2 + CBAM + NWD) | 3.5 | 11.2 | 12.8 | 85 | High |
Note: FPS in Table 8 is measured end-to-end with the deployed streaming pipeline (capture/encode → NVDEC decode → pre-process → TensorRT inference → NMS/post-process → serialization/publish), consistent with the latency budget in Table 7; therefore, it is not directly comparable to inference-only throughput reported in Table 9.
Table 9. TensorRT engine throughput (inference-only) at 640 × 640 resolution without SAHI slicing under different Jetson power modes. Standard Mode (15 W Max-P) achieves >30 FPS, while Super Mode (25 W MAXN) reaches up to 55.4 FPS; measurements exclude video I/O and messaging overhead.
| Metric | Standard Mode (15 W Max-P) | Super Mode (25 W MAXN) | Improvement | Physical Interpretation |
| --- | --- | --- | --- | --- |
| Engine throughput (FPS, TensorRT-only) | 30.2 | 55.4 | +83.4% | Significantly higher perception frequency |
| Average power (W) | 12.1 | 18.2 | +50.4% | Increased power draw within acceptable limits |
| Energy per frame (mJ/frame) | 400.6 | 328.5 | −18.0% | Lower energy cost per processed frame |
| Efficiency (FPS/W) | 2.50 | 3.04 | +21.6% | Higher compute utilization |
Note: Table 9 reports inference/engine throughput for Resi-YOLO under fixed TensorRT settings (e.g., precision and batch size kept constant) to quantify energy per frame and FPS/W scaling between 15 W Max-P and 25 W MAXN/Super Mode. This measurement excludes external video capture/network overheads captured in Table 7.

6.4.2. Embedded Computational Efficiency and Energy-Aware Analysis: Advantages of Super Mode

Performance evaluation on embedded platforms should consider not only FPS but also energy efficiency, which directly affects UAV mission endurance. To investigate the impact of input resolution on tiny-object detection performance, we evaluated inference at three input resolutions (640, 960, and 1280). Increasing the input resolution improves APsmall by preserving spatial detail for sub-32 pixel vessels but reduces throughput due to increased feature-map computation. In our experiments on Jetson Orin Nano, the 960 resolution provided the best trade-off between accuracy and embedded throughput. The 1280 resolution achieved the highest APsmall but reduced the pipeline throughput below the operational target for real-time UAV deployment. The detailed accuracy–throughput relationship is illustrated in Figure 12. Table 8 compares computational complexity and embedded efficiency across model variants on Jetson Orin Nano. Here, FPS denotes end-to-end streaming throughput (including video I/O and pre/postprocessing), consistent with the latency budget in Table 7, rather than isolated engine throughput.
The “Computational Stability” metric reflects sustained throughput variance during long-run execution, capturing sensitivity to bandwidth contention and thermal management. The CBAM introduces minimal overhead and maintains “Very High” stability. In contrast, the P2 head increases feature-map resolution and candidate density, raising memory traffic and sensitivity to CPU–GPU contention and thermal micro-throttling; thus, stability is conservatively rated as “High.” Deployment mitigations (e.g., locked clocks and thermal-aware control) are described in Section 4.1.
The CBAM adds only 0.6 GFLOPS while improving glare robustness, whereas P2 increases computational cost to enhance tiny-object sensitivity. The combined Resi-YOLO configuration achieves a balanced trade-off, sustaining 12.8 FPS (Standard Mode, 15 W Max-P, 640 × 640, no SAHI; Table 8) with the highest detection stability among variants.
Across power modes, Standard Mode reports effective deployed throughput (Table 8), while Super Mode (25 W MAXN) raises TensorRT engine throughput to >30 FPS and up to 55.4 FPS (inference-only, no SAHI; Table 9). This distinction separates real deployment performance from engine-level compute limits.
The difference between the reported pipeline throughput (12.8 FPS in Table 8) and the TensorRT engine throughput (>30 FPS in Table 9) arises from additional stages required for real deployment. While Table 9 measures inference-only performance (FPS_infer), Table 8 reflects the full operational pipeline (FPS_pipeline), which includes video decoding, preprocessing, postprocessing, and message publishing overhead. This distinction explains why the deployed streaming system operates at a lower effective throughput despite higher raw inference capability.

6.4.3. Energy Efficiency Analysis Under Jetson Orin Nano Super Mode

NVIDIA introduced Super Mode in JetPack 6.2, increasing Jetson Orin Nano memory bandwidth from 68 GB/s to 102 GB/s and boosting GPU frequency to 1020 MHz [41]. This upgrade benefits bandwidth-intensive models such as Resi-YOLO, which rely on high-resolution feature maps for tiny-object detection [42,43].
In our experiments, Resi-YOLO achieves 12.8 FPS in Standard Mode (15 W Max-P) at a 640 × 640 resolution without SAHI (Table 8). Under Super Mode (25 W MAXN), TensorRT engine throughput (inference-only) exceeds 30 FPS and reaches up to 55.4 FPS (Table 9), also without SAHI. Thus, Table 8 reports effective deployed throughput, while Table 9 reflects engine-level compute limits across power modes.
As shown in Table 9, enabling Super Mode raises the instantaneous power draw to 18.2 W; however, because inference throughput rises by 83.4%, the energy consumed per processed frame decreases by 18.0%, indicating improved FPS/W scaling despite the higher absolute power consumption. Under a fixed battery capacity, this allows UAV platforms operating in Super Mode to survey a larger maritime area or acquire denser detection results within the same flight duration, reinforcing the practical viability of Resi-YOLO for embedded deployment.
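The energy figures follow the definitions in Table 3. The snippet below is a consistency check on these relationships: the 18.2 W Super Mode draw comes from Table 9, while the Standard Mode power (12.1 W) and baseline engine throughput (30.2 FPS) are assumed values back-solved purely to illustrate how an ~83% FPS gain at higher power can still lower energy per frame by ~18%; they may differ from the measured ones.

```python
def energy_per_frame_mj(p_avg_w, fps):
    """mJ/frame = 1000 * P_avg / FPS (definition in Table 3)."""
    return 1000.0 * p_avg_w / fps

# The 18.2 W Super Mode draw is reported in the text; the Standard Mode
# power (12.1 W) and engine FPS (30.2) below are ASSUMED values chosen
# only to reproduce the reported 83.4% / 18.0% relationship.
std = energy_per_frame_mj(p_avg_w=12.1, fps=30.2)   # assumed baseline
sup = energy_per_frame_mj(p_avg_w=18.2, fps=55.4)   # Super Mode, Table 9
print(f"{std:.0f} -> {sup:.0f} mJ/frame ({100 * (1 - sup / std):.1f}% lower), "
      f"FPS/W: {30.2 / 12.1:.2f} -> {55.4 / 18.2:.2f}")
```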

6.5. Discussion and Limitations

While Resi-YOLO demonstrates strong robustness, extreme fog/rain and severe motion blur remain challenging and may require temporal denoising or multi-sensor fusion. As shown in Figure 15, under intense glare and heavy sea clutter, Resi-YOLO more reliably recovers tiny distant vessels and reduces false negatives compared with the baseline, although occasional false positives persist in highly reflective regions. Notably, Figure 15d presents a hard-negative test (background-only scenes) under high glare. The YOLO11n baseline produces spurious detections in saturated specular regions, whereas Resi-YOLO suppresses these false alarms through CBAM-driven feature reweighting. This qualitative evidence complements the GSS-stratified results, indicating improved recall for tiny vessels as well as better specificity in target-absent frames.
To approach the 80–90% mAP range, one could incorporate higher-resolution training/inference, dense slicing, or restoration-heavy super-resolution frontends before detection. However, these strategies would substantially increase compute, memory traffic, and latency, making them less suitable for the SWaP- and latency-constrained maritime edge setting targeted here. In contrast, the present Resi-YOLO core configuration reaches 65.1% mAP@0.5 and 31.5% APsmall while preserving 12.8 FPS end-to-end on Jetson Orin Nano. We therefore interpret the current design as a Pareto-efficient operating point for real-time maritime UAV deployment, rather than an accuracy-maximized but latency-heavy solution.
However, when specular saturation dominates the ROI (e.g., GSS > 0.9), large pixel regions become clipped and fine gradients vanish, leading to irreversible information loss. In such cases, reliable visual recovery is infeasible. We therefore treat GSS > 0.9 as a vision-degraded state: the safety-island MCU transitions from vision-updated guidance to IMU-propagated navigation and issues conservative commands (e.g., hold/loiter or reduced speed) until valid detections reappear within the buffer horizon.
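A minimal sketch of this vision-degraded fallback logic is given below. The GSS > 0.9 trigger follows the text; the recovery criterion (a fixed number of consecutive valid, low-GSS frames) and all interfaces are assumptions for illustration, since the actual MCU firmware implements this as part of the dual-brain state machine (Figure 5).

```python
from enum import Enum, auto

class NavMode(Enum):
    VISION_UPDATED = auto()    # IMM-UKF corrected by detections
    IMU_PROPAGATED = auto()    # vision-degraded: dead reckoning + hold/loiter

GSS_DEGRADED = 0.9             # degradation threshold from the text
RECOVERY_FRAMES = 10           # ASSUMED: consecutive valid frames to recover

def update_mode(mode, gss, has_valid_detection, recovery_count):
    """Hysteresis-style switch between vision-updated and IMU-propagated
    guidance. Only the GSS > 0.9 trigger is specified in the paper; the
    recovery rule here is an illustrative assumption."""
    if mode is NavMode.VISION_UPDATED:
        if gss > GSS_DEGRADED:
            return NavMode.IMU_PROPAGATED, 0
        return mode, 0
    # IMU_PROPAGATED: require sustained valid vision before handing back
    if gss <= GSS_DEGRADED and has_valid_detection:
        recovery_count += 1
        if recovery_count >= RECOVERY_FRAMES:
            return NavMode.VISION_UPDATED, 0
    else:
        recovery_count = 0
    return NavMode.IMU_PROPAGATED, recovery_count
```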
Under severe glare, the qualitative examples indicate that the proposed model improves both tiny-vessel recall and hard-negative specificity. In particular, the background-only case shows that the baseline is more likely to fire on saturated reflective regions, whereas the proposed pipeline suppresses such clutter more reliably. Additional RGF- and BLSF-specific examples further show that the optional geometric filters target different false-positive modes: reflective depth-unreliable regions for RGF and elongated wave/horizon structures for BLSF.
Future work will extend the impairment model beyond glare to include fog, rain, and haze, which similarly degrade contrast and temporal consistency. Incorporating out-of-sequence measurement handling, longer-range geo-referenced tracking, and multi-sensor fusion—highlighted in recent maritime UAV studies [30]—represents a natural extension. The dual-brain design further provides a stable interface for higher-level decision-making (e.g., DRL-based policies), offering a scalable foundation for resilient maritime UAV autonomy.

Advantages of the Proposed Method

Interpreting Table 10 in depth, the advantages of the proposed method can be summarized along three complementary dimensions.
First, Resi-YOLO is explicitly optimized for tiny-object sensitivity. Although recent architectures such as YOLOv12n and MambaYOLO achieve strong performance on generic datasets (e.g., COCO), their reliance on stride-8 feature representations remains insufficient for maritime targets with apparent widths below 10 pixels [44]. By contrast, Resi-YOLO adopts a stride-4 P2 detection head, which provides finer spatial granularity. When combined with the Normalized Wasserstein Distance (NWD) loss, this design alleviates the mismatch between tiny bounding boxes and IoU-based supervision, yielding an improvement of approximately 7.4 percentage points in APsmall over YOLOv12n on the SeaDronesSee benchmark.
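For concreteness, a minimal NWD computation is sketched below, following the closed-form 2-Wasserstein distance between Gaussian-modeled boxes from the original NWD formulation [16]. The normalizing constant C = 12.8 is the value suggested in that work and is an assumption here; the constant used in our training may differ.

```python
import math

def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein Distance between boxes given as (cx, cy, w, h).

    Each box is modeled as a 2D Gaussian N((cx, cy), diag(w^2/4, h^2/4));
    the 2-Wasserstein distance then has the closed form below, and
    NWD = exp(-W2 / C). C = 12.8 follows the original NWD paper [16]
    and is an assumed value here.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    w2_sq = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
             + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)

# A 2 px center shift barely moves NWD for an 8x8 box (~0.855), whereas
# IoU drops to 0.6 -- this is why NWD stabilizes tiny-box supervision.
print(nwd((100, 100, 8, 8), (102, 100, 8, 8)))
```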
Second, Resi-YOLO incorporates a physics-aware robustness mechanism at the feature level. Competing approaches such as S3Det primarily rely on offline data augmentation techniques (e.g., Cut-and-Paste) to enrich training distributions [45]. While effective for increasing sample diversity, such strategies cannot adapt to dynamic glare patterns encountered during inference. In contrast, the CBAM embedded in Resi-YOLO operates as an online attention-guidance mechanism, dynamically re-weighting features based on the instantaneous radiometric distribution of the scene. This enables effective suppression of glare-dominated regions prior to feature fusion, explaining the superior stability observed under high-GSS conditions.
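The spatial branch of this mechanism follows the standard CBAM formulation [15]; the PyTorch sketch below shows the per-pixel gating that allows saturated regions to be down-weighted. It is illustrative only: the channel-attention branch and the exact insertion points inside the Resi-YOLO neck are omitted.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM spatial attention: M_s(F) = sigmoid(conv7x7([AvgPool; MaxPool])).

    Channel-wise average and max pooling produce a 2-channel map that a
    7x7 convolution turns into a per-pixel gate. This follows the
    standard CBAM formulation [15]; insertion points in Resi-YOLO are
    described in the architecture section.
    """
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                      # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)      # (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)     # (B, 1, H, W)
        gate = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * gate                        # glare regions gated down
```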
Figure 15 qualitatively confirms the robustness trends observed in the quantitative results. Under severe glare, the baseline is more prone to false activations on reflective clutter, whereas Resi-YOLO better preserves true tiny-vessel detections while reducing spurious responses in hard-negative scenes. Additional targeted examples further show that RGF mainly suppresses false responses in depth-unreliable reflective regions, whereas the BLSF removes elongated clutter associated with wave crests or horizon streaks.
Figure 16 provides two targeted qualitative examples to distinguish the roles of the two geometric filters. In Figure 16a, after applying RGF, false responses arising from reflective depth-unreliable regions are largely suppressed while the true vessel is preserved. In Figure 16b, after applying the BLSF, elongated clutter aligned with horizon streaks or wave-crest structures is effectively removed. These targeted examples clarify the complementary roles of the two filters and help explain why the combined geometric-filter setting achieves a lower FPIglare in Table 4.
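To make the BLSF-style behavior concrete, the sketch below rejects horizontally elongated boxes near an estimated horizon row. This is a simplified box-level proxy under stated assumptions (box format, thresholds, and a precomputed horizon estimate); the actual filter follows the binary line-segment formulation of [21].

```python
def blsf_like_filter(dets, horizon_y, max_aspect=4.0, band_px=30):
    """Reject horizontally elongated boxes near the horizon row.

    `dets` holds (x, y, w, h, conf) boxes in pixels; `horizon_y` is a
    precomputed horizon estimate. Thresholds and this box-level proxy
    are illustrative assumptions -- the actual BLSF operates on binary
    line-segment evidence [21].
    """
    kept = []
    for x, y, w, h, conf in dets:
        elongated = w / max(h, 1e-6) > max_aspect
        near_horizon = abs((y + h / 2.0) - horizon_y) < band_px
        if not (elongated and near_horizon):
            kept.append((x, y, w, h, conf))
    return kept
```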
Finally, from a system-level perspective, existing YOLO variants (v8–v12) and MambaYOLO are implemented as single-brain perception pipelines and do not explicitly account for perception latency induced by thermal throttling or network congestion on embedded platforms. Resi-YOLO uniquely integrates a heterogeneous dual-brain architecture with time-stamped measurement replay (TSMR), decoupling perception uncertainty from control determinism. This safety-oriented closed loop represents a critical step toward operationally robust and autonomous maritime UAV deployment.
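The TSMR mechanism can be summarized by the replay-and-repropagate loop sketched below. The IMM-UKF predict/update steps are abstracted behind a filter interface, and the snapshot horizon is an assumed parameter; the sketch conveys the ordering of operations (Figure 6) rather than the MCU implementation.

```python
import bisect

class TSMRTracker:
    """Minimal sketch of time-stamped measurement replay (TSMR).

    When a detection arrives late (stamped t_m earlier than now), the
    filter state is rewound to the snapshot at or before t_m, corrected
    with the measurement, then re-propagated to the present. `predict`
    and `update` stand in for the IMM-UKF steps.
    """
    def __init__(self, filt, horizon_s=0.5):
        self.filt = filt                      # exposes predict(state, dt) / update(state, z)
        self.times, self.states = [], []      # parallel snapshot buffers
        self.horizon = horizon_s              # ASSUMED replay horizon

    def snapshot(self, t, state):
        """Record the filter state each control tick; trim old entries."""
        self.times.append(t)
        self.states.append(state)
        while t - self.times[0] > self.horizon:
            self.times.pop(0)
            self.states.pop(0)

    def on_delayed_measurement(self, t_m, z, t_now):
        """Rewind to t_m, correct with z, then re-propagate to t_now."""
        if not self.times or t_m < self.times[0]:
            return None                       # older than the buffer: drop
        i = bisect.bisect_right(self.times, t_m) - 1
        state = self.filt.predict(self.states[i], t_m - self.times[i])
        state = self.filt.update(state, z)    # correction at measurement time
        t = t_m
        for j in range(i + 1, len(self.times)):
            state = self.filt.predict(state, self.times[j] - t)
            self.states[j] = state            # refresh later snapshots
            t = self.times[j]
        return self.filt.predict(state, t_now - t)
```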

7. Conclusions

This paper introduced Resi-YOLO as a system-oriented maritime UAV perception framework that couples a P2-enhanced YOLO11n detector with CBAM-based glare suppression and NWD loss, and validated the resulting detector–tracker pipeline under deployment-like video I/O impairments. Beyond per-frame accuracy, we emphasize marine engineering reliability: a heterogeneous dual-brain architecture assigns deep perception to a Jetson Orin Nano while an MCU safety island provides deterministic, low-latency tracking continuity when edge inference is delayed or when communications are intermittent. We further argue that the Glare Severity Score (GSS) is not merely an image-processing metric but an environment-awareness indicator that can guide risk-aware adaptive perception policies in high-glint sea states. Collectively, these contributions provide a practical blueprint for AI-driven drone systems in marine engineering applications. The associated code, configuration templates, and reproducibility checklists will be released to facilitate deployment and comparative studies.
From a marine engineering standpoint, the proposed dual-brain architecture establishes a fail-operational safety envelope for autonomous maritime missions under volatile over-sea links. When video streaming experiences latency cliffs, jitter, or frame drops, the MCU safety island continues deterministic tracking and state propagation via time-stamped measurement replay, while the GPU pipeline gracefully degrades to metadata-first reporting. This co-design keeps navigation and surveillance decisions bounded and auditable under communication uncertainty—a practical requirement for real-world marine operations. By isolating the “perception uncertainty” on the Jetson Orin Nano side from the “control certainty” on the MCU side, our system ensures that transient vision delays do not destabilize the platform. Future work will integrate link-quality and GSS-triggered mode switching to further tighten this safety envelope and explore higher-level autonomy integrations.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/drones10030226/s1, Table S1: Hardware configuration (UAV platform components and key specifications), Table S2: Edge platform comparison between NVIDIA Jetson Nano and NVIDIA Jetson Orin Nano for onboard maritime UAV perception deployment (key specifications affecting real-time Resi-YOLO throughput, power, and reliability), Table S3: Experiment configuration template (corresponding to Table 9 in main text), Table S4: Per-stage pipeline latency comparison between Jetson Nano (Maxwell) and Jetson Orin Nano (Ampere), Table S5: Edge deployment environment (Hardware/software configuration and runtime settings.)

Author Contributions

Conceptualization, S.-E.T.; Methodology, S.-E.T.; Software, S.-E.T.; Validation, S.-E.T. and C.-H.H.; Formal Analysis, S.-E.T.; Investigation, S.-E.T.; Data Curation, S.-E.T. and C.-H.H.; Writing—Original Draft, S.-E.T.; Writing—Review and Editing, S.-E.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by En-Shou Investment Co., Ltd., Taiwan, and Ching Lung Agricultural Technology Co., Ltd., Taiwan.

Data Availability Statement

The public SeaDronesSee dataset analyzed in this study is available at https://macvi.org/, accessed on 16 March 2026. Due to privacy, coastal security, and regulatory constraints, the full in-house coastal UAV dataset cannot be released publicly. To support independent verification, we provide derived statistics for the in-house dataset, including object size bins, GSS distribution, and per-class annotation counts, together with an anonymized keyframe subset covering different glare regimes and tiny-target cases. The TSMR and tracking evaluation scripts, configuration templates, and supplementary reproducibility materials are available at https://github.com/hsieh5737/resi_yolo_gss, accessed on 16 March 2026. Optimized model weights and additional in-house samples are available from the corresponding author upon reasonable request and subject to institutional approval.

Acknowledgments

This work was supported by En-Shou Investment Co., Ltd., and Ching Lung Agricultural Technology Co., Ltd. The authors would also like to thank the AI Center, Chang Jung Christian University, for providing essential computational resources and technical support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lee, Y.H.; Meng, Y.S. Near Sea-Surface Mobile Radiowave Propagation at 5 GHz. Radioengineering 2014, 23, 824–830.
  2. Kiefer, B.; Žust, L.; Kristan, M.; Perš, J.; Teršek, M.; Wiliem, A.; Messmer, M.; Yang, C.-Y.; Huang, H.-W.; Jiang, Z.; et al. 2nd Workshop on Maritime Computer Vision (MaCVi) 2024: Challenge Results. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, Waikoloa, HI, USA, 3–8 January 2024; pp. 869–891.
  3. Schulzrinne, H.; Rao, A.; Lanphier, R. RFC 2326; Real-Time Streaming Protocol (RTSP); RFC Editor: Marina del Rey, CA, USA, 1998.
  4. Schulzrinne, H.; Casner, S.; Frederick, R.; Jacobson, V. RFC 3550; RTP: A Transport Protocol for Real-Time Applications; RFC Editor: Marina del Rey, CA, USA, 2003.
  5. Gettys, J.; Nichols, K. Bufferbloat: Dark buffers in the Internet. Commun. ACM 2012, 55, 57–65.
  6. Varga, L.A.; Kiefer, B.; Messmer, M.; Zell, A. SeaDronesSee: A Maritime Benchmark for Detecting Humans in Open Water. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 2260–2270.
  7. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
  8. Zhao, X.; Liu, Q.; Li, M.; Li, J.; Zhang, Y.; Huang, Y.; Zhou, J.; Chen, C. YOLOv7-sea: A lightweight and accurate object detection model for maritime environments. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 2–7 January 2023; pp. 1–10.
  9. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
  10. Wang, Y.; Liu, J.; Zhao, J.; Li, Z.; Yan, Y.; Yan, X.; Xu, F.; Li, F. LCSC-UAVNet: A High-Precision and Lightweight Model for Small-Object Identification and Detection in Maritime UAV Perspective. Drones 2025, 9, 100.
  11. Qin, J.; Li, M.; Zhao, J.; Zhong, J.; Zhang, H. Revolutionize the Oceanic Drone RGB Imagery with Pioneering Sun Glint Detection and Removal Techniques. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 8326–8335.
  12. Zhao, F.; Chen, Y.; Xi, D.; Liu, Y.; Wang, J. Enhanced hermit crabs detection using super-resolution reconstruction and improved YOLOv8 on UAV-captured imagery. Mar. Environ. Res. 2025, 210, 107313.
  13. Rusyn, B.; Lutsyk, O.; Kosarevych, R.; Maksymyuk, T. Features extraction from multi-spectral remote sensing images based on multi-threshold binarization. Sci. Rep. 2023, 13, 19655.
  14. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
  15. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  16. Wang, Z.; Wang, X. Normalized Gaussian Wasserstein distance for tiny object detection. ISPRS J. Photogramm. Remote Sens. 2022, 190, 119–134.
  17. Akyon, F.C.; Altinuc, S.O.; Temizel, A. Slicing aided hyper inference and fine-tuning for small object detection. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 966–970.
  18. Li, L.; Zhang, Y.; Chen, H.; Wang, J.; Xu, K. Spotlight on Small-Scale Ship Detection: Empowering YOLO with Advanced Techniques and a Novel Dataset. In Proceedings of the Asian Conference on Computer Vision (ACCV), Hanoi, Vietnam, 8–12 December 2024; pp. 1–15.
  19. Li, H.; Zhao, F.; Xue, F.; Wang, J. Succulent-YOLO: Smart UAV-assisted succulent farmland monitoring with CLIP-based YOLOv10 and Mamba computer vision. Remote Sens. 2025, 17, 2219.
  20. Zhao, F.; Ren, Z.; Wang, J. Smart UAV-assisted rose growth monitoring with improved YOLOv10 and Mamba restoration techniques. Smart Agric. Technol. 2025, 10, 100730.
  21. Tsai, S.-E.; Yang, S.-M.; Hsieh, C.-H. Real-Time Deterministic Lane Detection on CPU-Only Embedded Systems via Binary Line Segment Filtering. Electronics 2026, 15, 351.
  22. Tsai, S.-E.; Hsieh, C.-H. A Real-Time Collision Warning System for Autonomous Vehicles Based on YOLOv8n and SGBM Stereo Vision. Electronics 2025, 14, 4275.
  23. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 1–14.
  24. Aharon, N.; Orfaig, R.; Bobrovsky, B.-Z. BoT-SORT: Robust associations multi-pedestrian tracking. arXiv 2022, arXiv:2206.14651.
  25. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468.
  26. Ciaparrone, G.; Sánchez, F.L.; Tabik, S.; Troiano, L.; Tagliaferri, R.; Herrera, F. Deep learning in video multi-object tracking: A survey. Neurocomputing 2020, 381, 61–88.
  27. Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A benchmark for multi-object tracking. arXiv 2016, arXiv:1603.00831.
  28. Luiten, J.; Osep, A.; Dendorfer, P.; Torr, P.; Geiger, A.; Leal-Taixé, L.; Leibe, B. HOTA: A higher order metric for evaluating multi-object tracking. Int. J. Comput. Vis. 2021, 129, 548–578.
  29. Jocher, G.; Chaurasia, A.; Qiu, J. YOLOv8: Ultralytics Next-Generation Real-Time Object Detector. arXiv 2023, arXiv:2305.09972.
  30. Satore, J.L.; Jao, J.; Castilla, R.; Vallar, E.; Galvez, M.C. Comparative Study of YOLOv10, YOLO11 and YOLOv12 Lightweight Models for Multi-Class Maritime Search and Rescue Using UAV Imagery. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2025, XLVIII-1/W6, 199–204.
  31. Ultralytics. YOLO11 Documentation. Ultralytics Official Documentation, 2026. Available online: https://docs.ultralytics.com/models/yolo11/ (accessed on 16 March 2026).
  32. NVIDIA Corporation. Jetson Orin Nano Developer Kit Carrier Board Specification; SP-11324-001; NVIDIA: Santa Clara, CA, USA, 2024. Available online: https://developer.nvidia.com/downloads/assets/embedded/secure/jetson/orin_nano/docs/jetson_orin_nano_devkit_carrier_board_specification_sp.pdf (accessed on 1 February 2026).
  33. NVIDIA. Jetson Orin Nano Developer Kit User Guide. Available online: https://developer.nvidia.com/embedded/learn/jetson-orin-nano-devkit-user-guide/index.html (accessed on 30 January 2026).
  34. NVIDIA. Solving Entry-Level Edge AI Challenges with NVIDIA Jetson Orin Nano; NVIDIA Technical Blog; NVIDIA: Santa Clara, CA, USA, 2022. Available online: https://developer.nvidia.com/blog/solving-entry-level-edge-ai-challenges-with-nvidia-jetson-orin-nano/ (accessed on 30 January 2026).
  35. Bilous, N.; Malko, V.; Ahekian, I.; Korobiichuk, I.; Ivanichev, V. Comparative Evaluation of YOLO Models for Human Position Recognition with UAVs During a Flood. Appl. Syst. Innov. 2026, 9, 6.
  36. Ultralytics. Ultralytics YOLO GitHub Repository. Available online: https://github.com/ultralytics/ultralytics (accessed on 31 January 2026).
  37. Bernardin, K.; Stiefelhagen, R. Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP J. Image Video Process. 2008, 2008, 246309.
  38. Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO, v8.4.6; Zenodo: Geneva, Switzerland, 2026.
  39. NVIDIA. Jetson Orin Nano Series Data Sheet; DS-11105-001; NVIDIA: Santa Clara, CA, USA, 2023. Available online: https://forums.developer.nvidia.com/uploads/short-url/mHytGSlaBUsKUAKOtHHjldblsX8.pdf (accessed on 30 January 2026).
  40. NVIDIA. Jetson Orin Nano Technical Specifications. Available online: https://developer.nvidia.com/embedded/jetson-modules (accessed on 30 January 2026).
  41. NVIDIA. Jetson Orin Nano Super Developer Kit. Available online: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/nano-super-developer-kit/ (accessed on 30 January 2026).
  42. RidgeRun. Exploring NVIDIA Jetson Orin Nano Super Mode Performance Using Generative AI. Available online: https://www.ridgerun.com/post/exploring-nvidia-jetson-orin-nano-super-mode-performance-using-generative-ai (accessed on 30 January 2026).
  43. Yu, C.; Li, Y.; Zhang, Z.; Wang, X.; Liu, H. SMEP-DETR: Transformer-Based Ship Detection for SAR Imagery with Multi-Edge Enhancement and Parallel Dilated Convolutions. Remote Sens. 2025, 17, 953.
  44. Wang, Z.; Li, C.; Xu, H.; Zhu, X.; Li, H. Mamba YOLO: A Simple Baseline for Object Detection with State Space Model. Proc. AAAI Conf. Artif. Intell. 2025, 39, 8205–8213.
  45. Kurmashev, I.; Semenyuk, V.; Lupidi, A.; Alyoshin, D.; Kurmasheva, L.; Cantelli-Forti, A. Study of the Optimal YOLO Visual Detector Model for Enhancing UAV Detection and Classification in Optoelectronic Channels of Sensor Fusion Systems. Drones 2025, 9, 732.
Figure 1. Representative maritime scenes used for qualitative evaluation under different failure drivers: (a) tiny targets at long range, (b) glare and whitecaps, and (c) motion blur. Each panel reports the Glare Severity Score (GSS) and glare-area ratio to contextualize illumination conditions. Red boxes highlight representative target regions for visual inspection.
Figure 2. Overview of the proposed dual-brain architecture and representative tiny-vessel detections. The upper panel shows the decoupled Jetson–MCU dataflow, and the lower panels compare baseline YOLO11n and Resi-YOLO on tiny-vessel cases.
Figure 3. Qualitative examples showing the effect of the P2-enhanced feature path and NWD-based localization on tiny-vessel detection. The left, middle, and right columns show the input image, YOLO11 baseline (without P2, IoU-based loss), and the proposed Resi-YOLO (P2 + NWD + glare-aware), respectively. (a) Recovery of a true tiny-vessel detection under strong glare, where the baseline produces false positives while the proposed method preserves the correct detection. (b) Suppression of whitecap-induced false positives by reducing background activation. (c) Improved localization of a distant tiny vessel, with more accurate bounding box alignment. (d) Additional distant-vessel example demonstrating more stable bounding-box placement using NWD. Yellow boxes indicate detected vessels or target regions, red boxes indicate false positives; orange label boxes indicate confidence-tagged detections, and “GT” denotes the ground-truth bounding box. Insets show enlarged views of the corresponding regions.
Figure 4. Visualization of the CBAM spatial attention map M s ( F ) in Equation (4) under hard-glare conditions. (a) Example showing strong glare regions causing false positive responses in the baseline model. (b) Example illustrating missed vessel detection (false negative) under glare interference. (c) Example highlighting severe glare effects leading to both false positives and missed detections. Yellow boxes indicate vessel candidates or target regions, and the labels FP and FN denote false positive and false negative cases, respectively.
Figure 5. Dual-brain control state machine defining degradation triggers, recovery validation, and safe transition between Jetson perception and the MCU safety island. Solid arrows indicate confirmed state transitions based on validated conditions, while dashed arrows represent conditional or monitoring-based transitions during degradation or recovery phases.
Figure 6. Time-stamped measurement replay (TSMR) tracking pipeline. Delayed detections are replayed for IMM-UKF correction and followed by forward state re-propagation to maintain deterministic tracking.
Figure 7. System architecture of the dual-brain design, including the primary edge brain (Jetson Orin Nano) and the MCU safety island (deterministic control domain). Solid arrows indicate deterministic control or confirmed data-flow transitions, while dashed arrows represent non-deterministic or asynchronous communication across system boundaries. The “X” symbol denotes that no direct control path is allowed between the Jetson perception module and the flight controller, ensuring safety isolation.
Figure 8. Hardware integration of the UAV platform, including top and side views of the onboard modules.
Figure 9. Jetson–MCU clock synchronization mechanism. Offset calibration and periodic drift compensation maintain consistent time-stamps across modules.
Figure 10. Maritime UAV perception pipeline. (a) System-level communication architecture from onboard video acquisition to backend messaging. (b) Onboard Jetson processing stages: decode, preprocess, TensorRT inference, tracking, and event logic.
Figure 11. Examples of glare-oriented data augmentations used during training: (a) Mosaic/MixUp, (b) synthetic glare patches, and (c) brightness/contrast perturbations. For each subfigure, the left image shows the original input, and the right images show the corresponding augmented results. Orange boxes indicate the vessel/target regions for visual reference.
Figure 12. Accuracy–throughput trade-off on Jetson Orin Nano for the YOLO11n baseline and Resi-YOLO (P2 + CBAM + NWD).
Figure 13. Per-stage latency breakdown of the deployed streaming pipeline on Jetson Orin Nano.
Figure 14. Recall heatmap stratified by object size bin and glare severity (GSS) for the YOLO11n baseline and Resi-YOLO. The right panel shows Δ Recall, and the embedded table reports sample counts per stratum.
Figure 15. Qualitative examples under high-glare maritime conditions. Rows (a,b) show extremely tiny vessels (<16 × 16 px), row (c) shows small vessels (16–32 px), and row (d) shows a background-only hard negative. Columns present the input image, YOLO11n baseline, and Resi-YOLO. Red boxes mark false positives; green boxes mark correct detections, and dashed boxes denote background regions with no ground-truth objects.
Figure 16. Qualitative examples of geometric-filter-based false-positive suppression. (a) RGF-specific case in a reflective depth-unreliable region. (b) BLSF-specific case in a horizon-streak/wave-crest clutter scene. Columns show the baseline detection, the output after applying the corresponding geometric filter (RGF or BLSF), and the final cleaned result. Red boxes indicate false positives, and green boxes indicate correct detections.
Table 1. Messaging schema for maritime detection events used in the dual-brain UAV deployment. All interfaces are designed for asynchronous, event-driven communication to minimize bandwidth while preserving real-time safety semantics.

| Interface | Topic/Endpoint | Key Fields | Notes (Rate/QoS/Payload) |
|---|---|---|---|
| MQTT | uav/alert | time-stamp (UTC), uav_id, class (swimmer/boat), conf, bbox [x, y, w, h], geo [lat, lon, alt] | Event-driven (max 5 Hz); QoS 1; JSON payload ~300 bytes; ultra-low bandwidth. |
| MQTT | uav/status | batt_volt, link_quality, glare_idx, system_temp | 1 Hz; QoS 0; health/status monitoring; asynchronous publish (non-blocking). |
| WebSocket | /ws/keyframe | image_base64 (JPEG), detection_id | 0.2–1 Hz; keyframes transmitted only when a detection requires operator confirmation (~50–100 KB per image). |
| MAVLink | OBSTACLE_DISTANCE (custom) | distance, angle, sensor_type | 2 Hz; UAV publishes basic obstacle distances (if needed for AP). |

Note: Additional fields for confidence and track ID can be included in MQTT messages if required by the ground station.
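For illustration, a uav/alert payload conforming to this schema might look as follows. The concrete field values are made up, and the publish call in the closing comment assumes a paho-mqtt client rather than reflecting the deployed implementation.

```python
import json
import time

# Illustrative uav/alert message matching the Table 1 schema; all field
# values are hypothetical examples.
alert = {
    "time-stamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "uav_id": "uav-01",
    "class": "boat",
    "conf": 0.87,
    "bbox": [412, 238, 14, 9],            # x, y, w, h in pixels
    "geo": [22.9921, 120.2133, 41.5],     # lat, lon, alt (m)
}
payload = json.dumps(alert)
print(len(payload), "bytes")              # well under the ~300-byte budget
# e.g., client.publish("uav/alert", payload, qos=1)
```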
Table 2. Dataset summary and split configuration.

| Dataset | Src | Res | Count | Split (tr/va/te) | Tiny Ratio | Glare |
|---|---|---|---|---|---|---|
| SeaDronesSee v2 | Pub. | ~4K | ~14,227 (Tr 8930/Va 1547/Te 3750) | 63/11/26% | VH (~91%) | Med. (natural) |
| In-house UAV | Ours | 4K (3840 × 2160) | ~2500 | 70/20/10% | Med (~60%) | Sev. (glare + whitecap) |
Table 3. Evaluation metrics and operational meaning.

| Metric | Recommended Definition | Operational Meaning |
|---|---|---|
| APsmall/Recallsmall | Compute on objects with bbox area <32 × 32 after resize (or define bins). | Tiny-vessel sensitivity. |
| FPIglare | False positives per image on the glare-heavy subset. | Operator workload/false-alarm control. |
| Latency (mean/p95) | Per-frame time: decode + preprocess + infer + postprocess. | Real-time feasibility. |
| FPS | Steady-state FPS at batch = 1 after warm-up. | Throughput trade-off. |
| Latency jitter (σL, p99L) | Compute standard deviation (σ) and tail (p99) of end-to-end latency over ≥10k frames; report frame-drop rate under wireless congestion and high-glare segments. | Real-time reliability. |
| Energy per frame (mJ/frame), FPS/W | Measure average power (W) during steady-state inference; derive mJ/frame = 1000·P_avg/FPS and FPS/W = FPS/P_avg for fair embedded comparisons. | Efficiency (SWaP). |
| IDF1 | ID F1 score measuring identity-preserving association over time. | Higher is better; complements MOTA by emphasizing identity continuity. |
| HOTA | Higher Order Tracking Accuracy balancing detection and association errors. | Reported with TrackEval to avoid overemphasis on detection-only improvements. |
| IDSW (IDS) | Number of identity switches during tracking. | Lower indicates more stable tracking and data association. |
Table 4. Accuracy results of the core detector ablation study. ✓ indicates that the corresponding module is enabled in the model variant.

| Model Variant | P2 | CBAM | NWD | SAHI | RGF | BLSF | mAP@0.5 (%) | APsmall (%) | Recallsmall (%) | FPIglare |
|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv8n (Baseline) | - | - | - | - | - | - | 58.4 | 18.4 | 24.5 | 3.5 |
| YOLO11n (Vanilla) | - | - | - | - | - | - | 61.2 | 21.3 | 28.1 | 3.2 |
| YOLO11n + P2 | ✓ | - | - | - | - | - | 64.5 | 32.8 | 41.2 | 3.4 |
| YOLO11n + CBAM | - | ✓ | - | - | - | - | 61.8 | 21.9 | 28.5 | 1.8 |
| YOLO11n + NWD | - | - | ✓ | - | - | - | 62.4 | 25.6 | 31.4 | 3.0 |
| Resi-YOLO (Core) | ✓ | ✓ | ✓ | - | - | - | 65.1 | 31.5 | 39.8 | 1.9 |
| Resi-YOLO + SAHI | ✓ | ✓ | ✓ | ✓ | - | - | 67.8 | 36.2 | 44.5 | 2.1 |
| Resi-YOLO + RGF | ✓ | ✓ | ✓ | - | ✓ | - | 65.0 | 31.4 | 39.4 | 1.5 |
| Resi-YOLO + BLSF | ✓ | ✓ | ✓ | - | - | ✓ | 65.0 | 31.3 | 39.3 | 1.6 |
| Resi-YOLO + Geom Filters | ✓ | ✓ | ✓ | - | ✓ | ✓ | 64.9 | 31.1 | 38.9 | 1.2 |
| Resi-YOLO (All-in) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 67.5 | 35.8 | 43.6 | 1.4 |

Note: This configuration represents our final proposed model used for the SOTA comparison in Table 10.
Table 5. GSS-stratified detection performance comparison.

| GSS Range (Score) | Environmental Description | Sample Ratio | Baseline Recall | Resi-YOLO Recall | Recall Gain | Baseline mAP@0.5 | Resi-YOLO mAP@0.5 |
|---|---|---|---|---|---|---|---|
| Low (0.0–0.3) | Soft illumination, no direct reflections | 55% | 0.58 | 0.66 | +0.08 | 61.2% | 65.1% |
| Medium (0.3–0.6) | Moderate sea-surface glitter, afternoon sunlight | 30% | 0.51 | 0.61 | +0.10 | 54.5% | 61.8% |
| High (0.6–1.0) | Extreme specular reflections, intense glare | 15% | 0.30 | 0.45 | +0.15 | 41.2% | 53.7% |
Table 6. Tracking stability under different deployment conditions.

| Deployment Condition | Detector | Tracker | MOTA | IDF1 | IDS | Perception Latency (ms) |
|---|---|---|---|---|---|---|
| LIVE-RTSP (no impairment) | YOLO11n | ByteTrack | 61.5 | 66.8 | 198 | 52 |
| LIVE-RTSP (no impairment) | Resi-YOLO | ByteTrack | 66.8 | 71.5 | 142 | 90 |
| LAG-50 ms (fixed delay) | Resi-YOLO | ByteTrack | 64.3 | 69.1 | 173 | >100 |
| LAG-50 ms (fixed delay) | Resi-YOLO | MCU + TSMR | 66.1 | 70.8 | 155 | >100 |
| JITTER-20 ms (+drop) | Resi-YOLO | ByteTrack | 61.9 | 65.7 | 221 | Variable |
| JITTER-20 ms (+drop) | Resi-YOLO | MCU + TSMR | 63.4 | 67.2 | 189 | Variable |
Table 10. Comparison with recent SOTA object detectors on the SeaDronesSee benchmark.

| Model | Core Technique | mAP@0.5 | APsmall | FPS (Orin Nano) | Glare Robustness | Fault-Tolerance Design |
|---|---|---|---|---|---|---|
| S3Det | Feedback Cut-and-Paste Augmentation | 73.9% | 39.4% | ~10.2 | Medium | None |
| YOLOv12n | Area Attention | 62.4% | 24.1% | ~18.5 | Low | None |
| YOLO11n-Pico | Context Transformer | 54.8% | 21.5% | ~25.0 | Medium | None |
| MambaYOLO | Linear State-Space Model (SSM) | 59.2% | 23.8% | ~15.5 | Medium | None |
| Resi-YOLO (Ours) | P2 + CBAM + Heterogeneous Dual-Brain | 65.1% | 31.5% | 12.8 † | High | TSMR + MCU |

Note: 1. Recallsmall (Table 4) is reported separately from APsmall; the latter is used consistently in Table 10 for the SOTA comparison. 2. The mAP of S3Det is reported on the iShip-1 dataset and is expected to degrade when evaluated on the SeaDronesSee benchmark. 3. Some results are obtained by directly evaluating COCO-pretrained weights without maritime-specific fine-tuning [18], whereas Resi-YOLO is fine-tuned on maritime data. 4. Resi-YOLO includes a P2 detection head with higher GFLOPs; however, engine-level inference throughput exceeds 30 FPS when operated in Super Mode on Jetson Orin Nano (Table 9) [42], whereas end-to-end streaming FPS is reported separately in Table 8. 5. † FPS is measured at 640 × 640 resolution without SAHI slicing; engine-only throughput under different power modes is reported in Table 9.
