Article

MCB-RT-DETR: A Real-Time Vessel Detection Method for UAV Maritime Operations

Naval University of Engineering, Wuhan 430030, China
* Author to whom correspondence should be addressed.
Drones 2026, 10(1), 13; https://doi.org/10.3390/drones10010013
Submission received: 20 November 2025 / Revised: 22 December 2025 / Accepted: 25 December 2025 / Published: 27 December 2025

Highlights

What are the main findings?
  • MCB-RT-DETR achieves 82.9% mAP@0.5 and 49.7% mAP@0.5:0.95 on the SeaDronesSee dataset, surpassing the baseline RT-DETR by 4.5% and 3.4%, respectively.
  • The method maintains real-time inference speed of 50 FPS while significantly improving detection accuracy under complex maritime conditions (e.g., wave interference, scale variations, and small targets).
What are the implications of the main findings?
  • Presents a resilient visual perception framework to support autonomous UAV maritime operations, enabling reliable ship detection in challenging environments.
  • Demonstrates strong generalization across diverse datasets (DIOR and VisDrone2019), indicating broad applicability to aerial and remote sensing scenarios.

Abstract

Maritime UAV operations face challenges in real-time ship detection: complex ocean backgrounds, drastic scale variations, and prevalent distant small targets. We propose MCB-RT-DETR, a real-time detection transformer enhanced by multi-component boosting. Building upon the RT-DETR architecture, the method significantly improves detection under wave interference, lighting changes, and scale differences. Key innovations address these challenges: an Orthogonal Channel Attention (Ortho) mechanism preserves high-frequency edge details in the backbone network; Receptive Field Attention Convolution (RFAConv) enhances robustness against background clutter; and a Small Object Detail Enhancement Pyramid (SOD-EPN), which combines SPDConv with multi-scale CSP-OmniKernel transformations, strengthens small-target representation. The neck network integrates ultra-lightweight DySample upsampling, enabling content-aware sampling for precise multi-scale localization while maintaining high computational efficiency. Experiments on the SeaDronesSee dataset show significant improvements: MCB-RT-DETR achieves 82.9% mAP@0.5 and 49.7% mAP@0.5:0.95, corresponding to improvements of 4.5% and 3.4% over the baseline model, while inference speed remains at 50 FPS for real-time processing. Strong performance in cross-dataset tests on DIOR remote sensing images and VisDrone2019 aerial scenes further validates the algorithm’s generalization capability. The method provides a reliable visual perception solution for autonomous maritime UAV operations.

1. Introduction

Global maritime activities have increased significantly. Traditional replenishment methods, such as ship transport and aerial lifting, show clear limitations in special scenarios such as severe sea conditions, where they struggle to meet demands for high efficiency and flexibility. This creates an urgent need for autonomous and intelligent solutions [1].
Against this background, UAV-based autonomous maritime replenishment technology emerges as an innovative solution. It demonstrates significant potential [2]. Achieving precise “last mile” operations in autonomous maritime replenishment relies critically on building a stable and reliable visual guidance system. Within this system, real-time and accurate target detection is especially crucial. Its performance directly determines the success or failure of the entire mission.
However, the complex marine environment presents severe challenges that expose inherent limitations of existing detectors: Complex background clutter caused by waves and cloud shadows challenges the local receptive fields of CNN-based detectors, which struggle to distinguish ship targets from semantically similar wave textures [3]; Significant illumination variations degrade the performance of anchor-based mechanisms in YOLO-like detectors, leading to inconsistent feature representation [4]; Drastic scale variations and prevalent distant small targets exacerbate the information loss in progressive downsampling architectures, resulting in missed detections for small vessels [5].
Deep learning-based object detectors have become mainstream in computer vision. Among them, single-stage detectors like the YOLO series [6] are favored. They achieve a good balance between speed and accuracy. Continuous architectural evolution has led YOLO models to excellent performance on general object detection benchmarks. However, they show inherent limitations when applied directly to maritime ship detection.
Anchor-based mechanisms and progressive downsampling in CNNs often cause insufficient feature representation for small targets [7]. This leads to missed detections. Furthermore, their local receptive fields struggle to effectively distinguish ship targets from semantically similar wave textures and shadows. This increases the false alarm rate [8].
Recently, Transformers have been introduced into object detection. Detection Transformer (DETR) [9] is a representative and promising alternative. DETR utilizes the self-attention mechanism. It achieves global context modeling. It eliminates the need for hand-crafted anchors and Non-Maximum Suppression (NMS). This simplifies the detection pipeline. Subsequent real-time variants, like Real-Time Detection Transformer (RT-DETR) [10], have substantially improved inference speed.
Nevertheless, applying pure Transformer architectures like DETR and RT-DETR to maritime scenarios remains challenging. They often suffer from slow training convergence. Their computational complexity is high. Meeting real-time application demands on embedded systems is difficult. Additionally, without specific architectural enhancements, their performance for small target detection is often suboptimal.
Therefore, there is an urgent need for an object detection framework. This framework must address the unique challenges of the marine environment while meeting the real-time requirements of edge devices. An ideal solution should possess stronger feature discrimination capabilities. This is essential to suppress complex background interference. It also needs enhanced feature representation for multi-scale targets, especially small ones. Computational overhead must be maintained or further optimized.
To address these limitations, this study proposes MCB-RT-DETR (Multi-Component Boosting RT-DETR), a novel maritime detection framework that integrates four coordinated enhancements: (a) Orthogonal attention mechanism for preserving high-frequency edge details against complex backgrounds; (b) Receptive Field convolution for adaptive feature extraction across varying scales; (c) Small object detection boost through a specialized pyramid network; and (d) Dynamic upsampling for precise multi-scale localization. Each component addresses specific maritime challenges while collectively maintaining real-time performance essential for UAV operations. The key contributions of this work include:
(1)
A multi-module collaborative enhancement framework for RT-DETR is proposed. Specifically, an Orthogonal Channel Attention mechanism (Ortho) is introduced into the shallow layers of the backbone network. This preserves high-frequency details of ship edges. It also enhances feature discriminative power. In the deep layers, standard convolution is replaced with Receptive Field Attention Convolution (RFAConv). This allows convolution kernel parameters to adapt based on local content. It significantly enhances robustness to background clutter.
(2)
A Small Object Detail Enhancement Pyramid Network (SOD-EPN) is designed. This neck module integrates Spatial-to-Depth convolution (SPDConv). SPDConv compresses high-resolution feature maps without losing detail. It also includes a novel CSP-OmniKernel module. This module strengthens feature fusion by employing multi-scale kernel transformation combined with attention. Its goal is to significantly improve detection capability for small ship targets.
(3)
An ultra-lightweight Dynamic Upsampler (DySample) is introduced into the neck network. This module replaces traditional interpolation methods. It generates content-aware sampling points. This enables more precise feature map reconstruction. Consequently, it improves localization accuracy for ships of different scales. Its design ensures minimal computational overhead.
Systematic experiments demonstrate the effectiveness of the proposed method. Comprehensive qualitative and quantitative analyses show that MCB-RT-DETR achieves a superior trade-off between detection performance and processing speed. Extensive cross-dataset testing further validates its strong generalization capability. It performs significantly better than various mainstream detectors across different maritime and aerial scenarios.
The structure of this paper is as follows: Section 2 reviews related work on traditional, CNN-based, and Transformer-based object detection methods. Section 3 details the architecture design of our proposed MCB-RT-DETR algorithm. Section 4 details the experimental configuration, outcomes, and analysis. This includes component ablation tests, comparative experiments, and generalization validation. Section 5 discusses the results, analyzing working principles, application scenarios, and limitations. Finally, Section 6 summarizes the paper and outlines future work.

2. Related Work

As a fundamental computer vision task, object detection focuses on recognizing and localizing target instances belonging to specific categories in images or videos [11]. Its typical application in maritime UAV operations, such as ship target detection, is illustrated in Figure 1. This section systematically reviews key technical approaches relevant to this study. It focuses on analyzing their strengths and weaknesses when addressing complex maritime detection challenges.

2.1. Traditional Object Detection Methods and Their Limitations

In the pre-deep learning era, object detection was predominantly based on manually crafted features and heuristic search methods. Representative methods include Histogram of Oriented Gradients (HOG) [12], template matching [13], and edge feature-based approaches [14]. These methods typically employed a sliding window mechanism. This mechanism traverses the image. Features are extracted within each window. The extracted features are then fed to a classifier, such as a Support Vector Machine (SVM), to perform classification. While effective in controlled settings, these approaches demonstrate intrinsic drawbacks. They struggle to cope with complex and dynamic marine environments.
(1)
Limited Feature Representation: Handcrafted features (e.g., HOG, SIFT) possess weak representational power for targets. They struggle to capture the diverse appearance of ships under varying illumination, viewing angles, and scales. This results in poor generalization capability.
(2)
Low Computational Efficiency: The sliding window mechanism generates massive redundant computations. Detection speeds are slow. They fail to meet real-time requirements.
(3)
Insufficient Environmental Robustness: Suppression capability against complex background interference is weak. This interference includes sea waves, cloud shadows, and fog. Consequently, false positives and missed detections occur frequently.
Therefore, traditional methods cannot satisfy the demand for high-precision, high-real-time object detection in UAV maritime autonomous operations.

2.2. Deep Learning-Based Object Detection Methods

The emergence of deep learning technologies, especially the advanced feature extraction ability of CNNs, transformed the landscape of object detection. Deep learning-based object detection is primarily divided into two-stage and single-stage paradigms.

2.2.1. Two-Stage Detectors

Two-stage detectors approach the detection task through a two-step process. First, they generate region proposals. Subsequently, the system executes classification tasks and refines the bounding box coordinates for every candidate region. The representative R-CNN series [15,16,17] continuously optimized this pipeline. The Region Proposal Network (RPN) was proposed in the Faster R-CNN framework [17]. This achieved end-to-end training. It significantly improved both speed and accuracy. Subsequent research, such as DetectoRS [18], further enhanced multi-scale feature fusion capabilities. It used recursive feature pyramids and switchable atrous convolutions. This boosted performance for small objects and occluded scenes.
Two-stage methods typically achieve higher detection accuracy. However, their pipeline is relatively complex. Inference speeds are slower. Achieving real-time detection on UAV platforms with limited computing resources remains challenging.

2.2.2. Single-Stage Detectors

Single-stage detectors omit the region proposal generation step. They perform dense prediction directly on the image. Object detection is treated as a regression problem. The Single Shot MultiBox Detector (SSD) [19] is an early representative. Currently, the YOLO series [20,21,22,23,24] is the most prominent. From YOLOv1 [6] proposing a unified architecture, subsequent versions continuously improved. Enhancements were made to the backbone network (e.g., Darknet, CSPNet), neck network (e.g., FPN, PANet), detection head, and training strategies (e.g., label assignment, loss functions). YOLO-based models stand out as advanced solutions for balancing accuracy and speed.
Single-stage detectors offer clear speed advantages. However, their “single-step” design also presents challenges. For dense small targets or targets similar to the background in maritime scenes, insufficient feature resolution and semantic information easily result in detection failures or erroneous identifications. Despite continuous upgrades, the YOLO series, due to the local receptive field characteristics inherent in CNNs [8], still faces bottlenecks. This is especially true in scenarios requiring global contextual information for discrimination, such as distinguishing ship targets from semantically similar wave textures or near sea-sky lines [7,8].

2.3. Application of Transformers in Object Detection

With a powerful self-attention mechanism capable of capturing global dependencies, the Transformer architecture made significant breakthroughs in NLP before being rapidly applied to computer vision. Carion et al. [9] pioneered DETR (DEtection TRansformer) by conceptualizing object detection as a set prediction problem. This model utilizes a CNN backbone for feature extraction and an encoder–decoder Transformer structure to directly output the final set of predictions. This eliminates the need for Non-Maximum Suppression (NMS) post-processing. This end-to-end paradigm simplifies the detection pipeline. It gained significant attention for its excellent global reasoning capability.
However, the original DETR is hampered by sluggish training convergence and limited performance on small objects. In response, researchers have proposed various improvements. For example, Xu et al. [25] enhanced DETR by integrating SwinTransformer [26]. This improved small object detection performance but increased computational load. Deformable DETR [27] reduced computational complexity by introducing a deformable attention mechanism. DN-DETR [28] accelerated training convergence through query denoising. As a real-time variant, RT-DETR [10] boosted inference speed substantially with no compromise in accuracy. It achieved this by designing an efficient hybrid encoder (e.g., AIFI) and incorporating a speed-oriented backbone network. This laid the foundation for applying Transformers in real-time scenarios.
The Transformer architecture provides powerful global feature modeling through self-attention. This aids in distinguishing ship targets from macro backgrounds, such as the sea-sky boundary. However, its computational cost is usually high. Early convergence issues and suboptimal small target detection performance remain key areas for ongoing optimization. Successfully applying it to embedded platforms hinges on combining its global advantages with the demands for lightweight design and high efficiency.

2.4. Specialized Research on Maritime Object Detection

Given the unique characteristics of the marine environment, researchers have also conducted numerous specialized studies. These works provide important references for this research.
(1)
YOLO-based Improved Methods: Li et al. [29] proposed GGT-YOLO. It integrates Transformer modules to enhance small-object detection capability. It also employs GhostNet to reduce computational costs. This model is specifically designed for UAV maritime patrols, achieving 6.2 M parameters and 15.1 G FLOPs. Li et al. [30] introduced oriented bounding boxes, the CK_DCNv4 module for enhanced geometric feature extraction, and the SGKLD loss function for elongated targets. This led to a notable enhancement in both the precision and resilience of detection systems operating within intricate maritime environments.
(2)
Generalization Capability for Complex Scenes: Cheng et al. [31] proposed YOLO-World, which explores open-vocabulary object detection to enhance adaptability to unknown and complex scenes. On the LVIS dataset, it achieves 52.0 FPS. Chen et al. [32] designed a novel feature pyramid network specifically for complex maritime environments. It suppresses complex background features during detection and highlights features of small targets. This achieves more efficient small object detection performance.
(3)
Lightweight Deployment: Tang et al. [33] focused on the limited computing resources of UAVs. They designed a channel pruning strategy based on YOLOv8s. This achieved lightweight model deployment, with their method achieving 7.93 M parameters, 28.7 GFLOPs, and 227.3 FPS on the SeaDronesSee dataset. It reflects the urgent need for efficiency in practical applications.
(4)
Recent Advances of Transformers in Maritime Detection: In recent years, researchers have begun to explore the application of Transformer architectures in maritime detection. Xing et al. [34] proposed the S-DETR model, which enhances the detection performance for multi-scale ships at sea through a scale attention module and a dense query decoder design. However, its model parameter count remains high, and its real-time performance—with an FPS of only 27.98 on the Singapore Maritime Dataset (SMD)—and robustness in occluded scenarios still need improvement. Wang et al. [35] proposed Ship-DETR, a model designed to enhance the recognition performance of vessels across varying scales under challenging maritime conditions. This is achieved through the integration of three key components: a HiLo attention mechanism, a bidirectional feature pyramid network (BiFPN), and a downsampling module based on Haar wavelets. With Ship-DETR’s FLOPs reaching 43.5 G, the model structure is relatively complex, and it still carries a risk of missed detections in scenarios involving extremely small or densely occluded targets. Jiao et al. [36] adopted an FCDS-DETR model (based on the DETR architecture) as the baseline. By integrating a two-dimensional Gaussian probability density distribution into the attention mechanism alongside a denoising training strategy, their approach markedly enhanced the recognition capability for small-scale objects within low-altitude unmanned aerial vehicle (UAV) imagery captured over maritime environments. However, the model’s high complexity—reaching 190 G FLOPs—and low inference speed—only 5 FPS on the AFO dataset—limit its practical application, and its adaptability to densely occluded scenes remains insufficient.
These specialized studies have effectively addressed specific maritime detection challenges through targeted optimizations. However, three critical limitations persist in current approaches: (a) Most remain confined to CNN-based architectures, lacking the comprehensive global context modeling capabilities essential for distinguishing ships from complex marine backgrounds; (b) Existing Transformer-based maritime detectors, while demonstrating improved accuracy, typically exhibit high computational complexity and insufficient inference speed for real-time UAV applications; (c) Few frameworks systematically coordinate multiple enhancements to address the interconnected challenges of background interference, small target detection, and scale variations within a unified architecture.
Consequently, a significant research gap exists: the need for a detection framework that can effectively leverage Transformer’s global modeling capabilities while meeting the strict computational constraints of UAV platforms. This study aims to bridge this gap by proposing MCB-RT-DETR, which integrates orthogonal attention for edge preservation, receptive field convolution for adaptive feature extraction, small object detection boosting through specialized pyramid networks, and dynamic upsampling for precise localization—all within RT-DETR’s efficient Transformer architecture.
In summary, the field of object detection has progressed from early techniques reliant on manually designed features to the contemporary paradigm dominated by deep learning and Transformer-based architectures. Although single-stage detectors like YOLO offer speed advantages, their local receptive field characteristics are insufficient under complex sea conditions. Transformer models like DETR/RT-DETR provide powerful global context understanding. Yet, their computational efficiency and real-time deployment remain challenging. Existing maritime object detection research, while somewhat targeted, still requires deeper exploration. Constructing a lightweight detection framework is necessary. This framework must effectively utilize global information to suppress background interference. It must also enhance the feature representation of small targets. Furthermore, it must meet strict real-time requirements. This study aims to address this challenge. It seeks to fuse the advantages of the aforementioned technical routes. The goal is to propose a solution better suited for UAV-based maritime ship detection.

3. MCB-RT-DETR Algorithm Design

Attaining accurate and rapid ship target identification is essential for the success of autonomous replenishment operations conducted by unmanned aerial vehicles (UAVs) in maritime settings. Although the RT-DETR baseline model performs well on generic object detection tasks, its original design was not optimized for the unique challenges of complex sea conditions. This chapter details our proposed improved real-time detection Transformer algorithm, MCB-RT-DETR. This algorithm aims to systematically address four core challenges in maritime ship detection: strong background interference, missed detection of small targets, significant target scale variations, and limited computational resources on airborne platforms.

3.1. Baseline Model Analysis

To achieve a good balance between accuracy and speed, and to lay a foundation for subsequent deployment on embedded platforms, we selected RT-DETR-R18, which has relatively lower computational costs, as the baseline model. Compared with RT-DETR-R50 (42.9 M parameters, 134.8 GFLOPs), the R18 version reduces the number of parameters by approximately 53% (20.1 M) and computational cost by about 57% (58.3 GFLOPs), making it more suitable for resource-constrained UAV platforms. On the SeaDronesSee validation set, R18 achieves 78.4% mAP@0.5 while maintaining an inference speed of 50.8 FPS, meeting real-time detection requirements and retaining sufficient computational headroom for subsequent module enhancements. Furthermore, compared to more lightweight variants such as custom minimalist backbone networks, the ResNet-18-based architecture provides a well-recognized and stable trade-off between accuracy and speed, which helps ensure fair and effective evaluation of module contributions. Its core lies in simplifying the detection pipeline using an end-to-end Transformer architecture. It simultaneously balances global attention modeling and computational efficiency through a hybrid encoder. Figure 2 illustrates the complete structural design of the proposed model.
However, through in-depth analysis of its structure and application to maritime ship detection tasks, we identified several key bottlenecks when facing specific challenges:
(1)
Insufficient Feature Discrimination and Anti-Interference Capability: The baseline model’s feature extraction and fusion mechanisms have limited adaptability to the maritime environment. In the ResNet-18 backbone, standard convolution and basic attention layers often filter out vital high-frequency details, like ship edge textures, during feature downsampling. Concurrently, the encoder’s global attention mechanism struggles to consistently focus on real targets against the homogeneous sea background. This leads to inadequate suppression of interference like waves and cloud shadows.
(2)
Limited Feature Representation for Small Targets: The existing feature pyramid structure (e.g., Cross-Scale Feature Fusion Module, CCFM) primarily serves targets of conventional scales. For distant ships occupying extremely few pixels, high-level features, while semantically rich, suffer severe loss of spatial detail. Conversely, low-level high-resolution features are not effectively utilized due to computational complexity and channel number limitations. This results in degraded small target detection performance.
(3)
Inadequate Adaptation to Multi-Scale Differences: The original model employs regular upsampling operations (e.g., nearest-neighbor interpolation) in its neck network. These methods calculate pixel values using fixed rules, ignoring the semantic content of the feature maps. When facing ship targets exhibiting drastic scale changes from close-range to long-range, this content-insensitive sampling approach easily causes feature blurring. This consequently leads to inaccurate localization or missed detections.
(4)
Mismatch Between Computational Efficiency and Deployment Requirements: Although RT-DETR-R18 is a relatively lightweight version, certain operations in its neck network (e.g., regular upsampling) present a contradiction between detail reconstruction capability and computational overhead. Further lightweight improvements are still necessary to achieve efficient deployment on resource-constrained platforms like UAVs.
To address these bottlenecks, this chapter will introduce targeted improvements to three core components: the backbone network, the feature pyramid, and the neck network. The enhanced framework is depicted in Figure 3. The following sections elaborate on the design rationale and practical implementation of every modified component.

3.2. Backbone Network Improvements

The backbone network extracts multi-level feature representations from input images. Its performance directly impacts the accuracy of subsequent detection heads. To enhance the ship target feature extraction capability of the RT-DETR-R18 backbone network, we introduced distinct enhancement mechanisms for its shallow and deep layers.

3.2.1. Introducing the Ortho Attention Mechanism

In UAV aerial maritime scenes, most sea surface areas exhibit low-frequency characteristics (e.g., smooth textures). Conversely, ships, their edges, and wave splashes represent critical high-frequency details. Traditional channel attention methods, such as Global Average Pooling (GAP) in SENet, suffer from information loss during feature compression. This hinders the preservation of crucial high-frequency ship details.
Therefore, we integrated the Orthogonal Channel Attention Network (Ortho-Nets) [37] into the BasicBlocks of the first two stages (Stage 1, Stage 2) of the backbone network. Its core idea is that the effectiveness of channel attention primarily relies on the orthogonality of the compression filters, rather than specific frequency selection. Ortho-Nets generates a set of random orthogonal bases as compression filters via the Gram-Schmidt orthogonalization process. This maximizes the preservation of the complete projection of the input feature subspace. Consequently, diverse feature patterns, including high-frequency details, are retained more effectively. Figure 4 presents the orthogonal channel attention module and its placement in the enhanced backbone.
The orthogonal channel attention mechanism processes an input feature map $X \in \mathbb{R}^{C \times H \times W}$ through two sequential phases:
(1)
A random orthogonal filter bank is generated to compress spatial information into a compact channel descriptor. This filter bank is constructed via Gram-Schmidt orthogonalization and remains fixed (non-learnable), ensuring an unbiased projection of input features.
(2)
Attention Calculation Stage:
First, the input features $X \in \mathbb{R}^{C \times H \times W}$ are compressed along the spatial dimension. The compression process is described by Equation (1):
$Z_c = \sum_{h=1}^{H} \sum_{w=1}^{W} K_{c,h,w}\, X_{c,h,w}$,  (1)
where $Z \in \mathbb{R}^{C}$ is the compressed channel descriptor, $K_{c,h,w}$ is the weight of the c-th channel's compression filter at spatial position $(h, w)$, and $X_{c,h,w}$ is the pixel value at position $(h, w)$ of the c-th channel in the input features.
Next, we adopt the MLP structure of SENet (including ReLU and Sigmoid activation functions) to obtain attention weights through activation, as shown in Equation (2):
$A = \sigma\left(W_2\, \delta(W_1 Z)\right), \quad A \in (0,1)^{C}$,  (2)
where $A$ is the channel attention weight vector (each element in $(0,1)$), $W_1 \in \mathbb{R}^{C/r \times C}$ and $W_2 \in \mathbb{R}^{C \times C/r}$ are the MLP weight matrices, $r$ denotes the compression ratio, $\delta(\cdot)$ is the ReLU activation, and $\sigma(\cdot)$ is the Sigmoid function.
Finally, we apply attention weighting and perform element-wise multiplication with input features to obtain the weighted feature output, as shown in Equation (3):
$X_{\mathrm{att}} = A \odot X$,  (3)
where $\odot$ denotes channel-wise multiplication broadcast over the spatial dimensions.
The orthogonal kernel K in our design is static and non-learnable. This is a deliberate choice: it deterministically enforces orthogonality among the compression filters, ensuring an unbiased, complete subspace projection of the input features. Making K learnable would not only make it difficult to maintain the orthogonality constraint during training—undermining the mechanism’s theoretical foundation—but also introduce significant extra parameters, increasing model complexity and training instability. This contradicts our core objectives of efficiency and robustness.
Based on this design, we insert the Ortho module after the second convolution within the shallow BasicBlock, employing a residual connection to combine its output with the original input. This design achieves a balance between attention refinement and lossless information transmission. It allows the model to concentrate on vessel outlines and textural details present in shallow features, thus strengthening its ability to distinguish targets from complex backgrounds.
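To make the two-stage computation in Equations (1)–(3) concrete, the following PyTorch sketch illustrates one plausible implementation of the orthogonal channel attention idea. The class name, the use of a QR decomposition to obtain the orthogonal filter bank, and the assumption of a fixed feature-map size are our own simplifications rather than the released Ortho-Nets code.

```python
# Minimal sketch of orthogonal channel attention (Eqs. (1)-(3)); an assumption-based
# illustration, not the authors' implementation.
import torch
import torch.nn as nn


class OrthoChannelAttention(nn.Module):
    def __init__(self, channels: int, height: int, width: int, reduction: int = 16):
        super().__init__()
        # Random orthogonal compression filters K (one H x W filter per channel),
        # built once via QR (equivalent to Gram-Schmidt) and kept fixed, as in Eq. (1).
        # Assumes H * W >= C, which holds in the shallow stages where Ortho is applied.
        rand = torch.randn(channels, height * width)
        q, _ = torch.linalg.qr(rand.t())              # columns of q are orthonormal
        k = q.t()[:channels].reshape(channels, height, width)
        self.register_buffer("K", k)                  # buffer => non-learnable
        # SE-style MLP producing channel weights A in (0, 1)^C, as in Eq. (2).
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); spatial compression Z_c = sum_{h,w} K_{c,h,w} * X_{c,h,w}
        z = (self.K.unsqueeze(0) * x).sum(dim=(2, 3))      # (B, C)
        a = self.mlp(z).unsqueeze(-1).unsqueeze(-1)        # (B, C, 1, 1)
        return a * x                                       # Eq. (3): X_att = A ⊙ X
```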

3.2.2. Introducing RFAConv Convolution

In the deep layers of the backbone network (Stage 4, Stage 5), the feature map size shrinks and spatial information becomes highly abstract. At this point, the model relies more on extracting inter-channel relationships and deep semantic information. Traditional convolution uses parameter sharing, which struggles to adapt to semantic differences across image regions.
To address this issue, we replace standard convolution with Receptive Field Attention Convolution (RFAConv) [38] in the deep BasicBlock. RFAConv dynamically generates attention weights for each position within the receptive field, enabling convolution kernel parameters to adaptively adjust to local content. This allows more flexible modeling of local semantic changes. Figure 5 displays its architecture and the location where it is integrated.
For attention weight training, RFAConv employs a hierarchical feature interaction protocol: (a) k × k non-overlapping receptive fields are extracted from the input feature map $X \in \mathbb{R}^{C \times H \times W}$ using PyTorch’s Unfold operation; (b) Average pooling aggregates global contextual information from each receptive field to reduce computational complexity; (c) 1 × 1 grouped convolutions model cross-channel interactions within receptive fields; (d) Softmax activation generates normalized attention weights $A_{\mathrm{rf}}$. These weights are jointly optimized with network parameters through standard backpropagation, enabling adaptive focus on vessel-relevant regions while maintaining computational efficiency comparable to standard convolution.
The operation of RFAConv includes four steps (taking k × k convolution as an example):
(1)
Receptive field feature extraction: Using group convolution (number of groups equals input channel count C), we expand each k × k receptive field of the input feature $X \in \mathbb{R}^{C \times H \times W}$ into a vector. This yields the receptive field feature $F_{\mathrm{rf}} \in \mathbb{R}^{C \times k^2 \times H \times W}$.
(2)
Attention weight generation: The attention weight map $A_{\mathrm{rf}} \in \mathbb{R}^{k^2 \times H \times W}$ is generated through the hierarchical feature interaction protocol described above, where each weight reflects the learned importance of the corresponding spatial position within its receptive field context.
(3)
Weighted feature fusion: Apply the attention weights $A_{\mathrm{rf}}$ to the receptive field features $F_{\mathrm{rf}}$ to obtain the weighted features $F_{\mathrm{weighted}}$.
(4)
Feature reorganization and convolution: Reshape the weighted features back to spatial dimensions. Next, a standard convolution layer with a kernel and stride size of k performs downsampling and feature fusion, producing the final feature map.
This enables the convolution to dynamically concentrate on crucial areas associated with vessel targets inside its receptive field. It is particularly effective when handling local ship structures or partial occlusions. This strengthens the model’s resilience against complex backgrounds.
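The four-step workflow above can be summarized in the following simplified PyTorch sketch. The channel grouping, layer names, and the assumption that the feature-map size is divisible by k are illustrative simplifications; the reference design is given in [38].

```python
# Compact sketch of the RFAConv workflow, steps (1)-(4); assumes H and W divisible by k.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RFAConvSketch(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.k = k
        # (1) expand each non-overlapping k x k receptive field into k^2 values per channel
        self.unfold = nn.Unfold(kernel_size=k, stride=k)
        # (2)-(4) attention branch: per-field average context -> grouped 1x1 conv -> softmax
        self.attn = nn.Sequential(
            nn.AvgPool2d(kernel_size=k, stride=k),
            nn.Conv2d(in_ch, in_ch * k * k, kernel_size=1, groups=in_ch),
        )
        # final conv with kernel = stride = k fuses the reweighted receptive fields
        self.fuse = nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        hk, wk = h // self.k, w // self.k
        # receptive-field features F_rf: (B, C, k^2, Hk, Wk)
        f_rf = self.unfold(x).view(b, c, self.k * self.k, hk, wk)
        # attention weights A_rf, softmax-normalized over the k^2 positions
        a_rf = F.softmax(self.attn(x).view(b, c, self.k * self.k, hk, wk), dim=2)
        # weighted fusion, then fold back to a (k*Hk, k*Wk) spatial layout
        weighted = (f_rf * a_rf).view(b, c * self.k * self.k, hk * wk)
        folded = F.fold(weighted, output_size=(hk * self.k, wk * self.k),
                        kernel_size=self.k, stride=self.k)
        return self.fuse(folded)
```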

3.3. Design of Small Object Detail-Enhancement Pyramid (SOD-EPN)

To improve the detection of distant small vessels at sea, we propose the Small Object Detail-Enhancement Pyramid Network (SOD-EPN), which substitutes the cross-scale fusion module in the original RT-DETR. Its structure is shown in Figure 6.

3.3.1. Spatial Compression with SPDConv

Directly introducing the high-resolution P2 feature layer (from Stage 2 of the backbone network) provides rich details but significantly increases computational cost. To solve this, we introduce Spatial-to-Depth Convolution (SPDConv) [39] as a pre-compressor.
Compared to alternative spatial compression techniques such as pooling operations or strided convolutions, SPDConv offers several distinct advantages for maritime vessel detection. First, traditional pooling methods (max-pooling, average-pooling) inevitably discard fine-grained spatial information during downsampling, which is detrimental for detecting small vessels in UAV imagery. Second, strided convolutions create information bottlenecks that may obscure critical vessel features. In contrast, SPDConv preserves all spatial information by transforming it into additional channels through the space-to-depth transformation, ensuring no feature loss during downsampling. This information-preserving characteristic is particularly valuable for maintaining vessel structural details and boundary information in complex maritime environments. Furthermore, SPDConv’s parameter-free design enhances computational efficiency while avoiding the learning biases that may arise in parameterized downsampling methods. This characteristic renders it particularly suitable for deployment in real-time UAV systems, where maintaining high precision and operational speed is critical. Its structure is shown in Figure 7.
For an input feature map $X \in \mathbb{R}^{C \times H \times W}$, SPDConv first performs spatial sub-sampling with a stride of 2 to generate four sub-maps, as shown in Equation (4):
$X_1 = X[:, :, ::2, ::2],\quad X_2 = X[:, :, ::2, 1::2],\quad X_3 = X[:, :, 1::2, ::2],\quad X_4 = X[:, :, 1::2, 1::2]$,  (4)
Each sub-map has a size of $C \times \tfrac{H}{2} \times \tfrac{W}{2}$. These sub-features are then merged along the channel axis to produce a restructured high-dimensional representation, as expressed in Equation (5):
$X' = \mathrm{Concat}(X_1, X_2, X_3, X_4) \in \mathbb{R}^{4C \times \frac{H}{2} \times \frac{W}{2}}$,  (5)
Finally, feature fusion and dimensionality reduction are performed via a convolution layer, as shown in Equation (6):
$F = \mathrm{Conv}_{k=3}(X') \in \mathbb{R}^{O \times \frac{H}{2} \times \frac{W}{2}}$,  (6)
Here, O denotes the number of output channels. This operation preserves spatial information integrity while converting high-resolution P2 features (e.g., ship edge details) into a resolution compatible with the P3 layer. It provides rich contextual information for small targets for subsequent fusion.
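A minimal PyTorch sketch of this space-to-depth compression (Equations (4)–(6)) is given below; the output channel count O is passed as an assumed constructor argument.

```python
# Sketch of SPDConv: stride-2 sub-sampling, channel concatenation, 3x3 fusion.
import torch
import torch.nn as nn


class SPDConvSketch(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # fuses the 4C stacked channels down to the desired width (Eq. (6))
        self.conv = nn.Conv2d(4 * in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Eq. (4): four interleaved sub-maps, each C x H/2 x W/2
        x1 = x[:, :, ::2, ::2]
        x2 = x[:, :, ::2, 1::2]
        x3 = x[:, :, 1::2, ::2]
        x4 = x[:, :, 1::2, 1::2]
        # Eq. (5): concatenate along channels -> 4C x H/2 x W/2 (no information discarded)
        x_cat = torch.cat([x1, x2, x3, x4], dim=1)
        # Eq. (6): convolutional fusion and dimensionality reduction
        return self.conv(x_cat)
```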

3.3.2. Multi-Scale Fusion with CSP-OmniKernel

To improve the representational capacity of merged features, we designed the CSP-OmniKernel (CSP-O) module as a post-enhancer. This module combines the gradient path separation idea from CSPNet [40] and the multi-scale kernel transformation from Omni-Kernel Network [41]. Its structure is shown in Figure 8.
CSPNet Idea: Divide the input feature along the channel axis into two components, $X_1$ and $X_2$. $X_1$ undergoes a complex feature transformation, while $X_2$ acts as an identity mapping. Finally, the two parts are concatenated. This design alleviates gradient conflicts and reduces computational redundancy. Its simplified formula is:
$Y = \mathrm{concat}\left(T(X_1),\ X_2\right)$,  (7)
where $T(\cdot)$ is a feature transformation function. It reduces computational complexity while preserving the complete gradient flow. “Concat” denotes concatenation along the channel dimension.
OmniKernel Module: We build a multi-branch feature extractor as the complex transformation branch in CSP. The large-kernel pathway employs depthwise separable convolutions (such as 31 × 31, 1 × 31, 31 × 1) to model extensive dependencies and oriented contextual cues. The global pathway integrates frequency-domain channel attention with spatial attention for improved overall context awareness. Meanwhile, the local pathway utilizes 1 × 1 depthwise convolution to restore high-frequency details.
CSP-O Module: Embed the OmniKernel module into the CSP framework. Its output is expressed as:
$Y = \mathrm{concat}\left(\mathrm{OmniKernel}(X_1),\ X_2\right)$,  (8)
where $\mathrm{OmniKernel}(X_1)$ is the output of the OmniKernel module.
SOD-EPN Workflow: First, the P2 feature is compressed by SPDConv. Then it is concatenated with the upsampled P4 feature and the original P3 feature. Next, the concatenated feature is fed into the CSP-O module for multi-scale feature enhancement and fusion. This architecture markedly enhances the representational capacity for small vessel targets while maintaining manageable computational overhead.
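The following simplified PyTorch sketch illustrates the CSP-O wrapper of Equation (8): one channel half passes through a reduced OmniKernel-style transform (large-kernel depthwise, global-context, and local branches), while the other half is kept as an identity path. The branch composition shown here is an illustrative approximation of the modules in [40,41], not the exact implementation.

```python
# Sketch of the CSP-OmniKernel (CSP-O) idea; branch details are simplified assumptions.
import torch
import torch.nn as nn


class OmniKernelBranch(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.large = nn.Sequential(                      # large-kernel depthwise path (1x31, 31x1)
            nn.Conv2d(ch, ch, (1, 31), padding=(0, 15), groups=ch),
            nn.Conv2d(ch, ch, (31, 1), padding=(15, 0), groups=ch),
        )
        self.local = nn.Conv2d(ch, ch, 1, groups=ch)     # local 1x1 depthwise path
        self.global_ctx = nn.Sequential(                 # simplified global-context gate
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (self.large(x) + self.local(x)) * self.global_ctx(x)


class CSPOmniKernelSketch(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        # assumes an even channel count so the input can be split in half
        self.transform = OmniKernelBranch(ch // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.chunk(x, 2, dim=1)                    # channel split
        return torch.cat([self.transform(x1), x2], dim=1)    # Eq. (8)
```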

3.4. Lightweight Neck Network

The neck module integrates multi-scale backbone features and upscales them to dimensions suitable for detection. Regular upsampling operations in the original RT-DETR neck (e.g., bilinear interpolation, nearest-neighbor interpolation) are one reason for insufficient localization accuracy of multi-scale targets.
We replace the original nearest-neighbor upsampling with the ultra-lightweight dynamic upsampler DySample [42], which reformulates upsampling as dynamic position resampling. Compared to parameter-free interpolation, DySample introduces only 0.05 M parameters and 0.3 GFLOPs of computational overhead. This minimal cost is justified by its content-aware sampling capability, which significantly outperforms simple interpolation methods. Moreover, DySample achieves performance comparable to more complex upsamplers like CARAFE [43] with merely 3% of the parameters and 20% of the computational cost, making it an optimal choice for efficient feature enhancement. Its structure is shown in Figure 9.
DySample consists of three steps:
(1)
Continuous signal modeling: Apply bilinear interpolation to the input feature map $X \in \mathbb{R}^{C \times H \times W}$ to obtain the continuous representation $X_{\mathrm{cont}}$, as shown in Equation (9):
$X_{\mathrm{cont}} = \mathrm{BilinearInterpolate}(X, s)$,  (9)
where $\mathrm{BilinearInterpolate}(\cdot)$ denotes bilinear interpolation and $s$ is the target upsampling rate.
(2)
Offset prediction: Predict position offsets $O \in \mathbb{R}^{2 \times sH \times sW}$ for each target pixel using a lightweight network (composed of depth-wise separable convolution and point-wise convolution), as shown in Equation (10):
$O = \mathrm{Conv}_{1 \times 1}\left(\mathrm{GELU}\left(\mathrm{Conv}_{3 \times 3}(X)\right)\right)$,  (10)
Here, $\mathrm{Conv}_{3 \times 3}$ is a 3 × 3 depth-wise separable convolution that extracts local context, GELU refers to the Gaussian Error Linear Unit activation function, and $\mathrm{Conv}_{1 \times 1}$ maps features to the offset space.
(3)
Dynamic sampling execution: Combine the continuous signal $X_{\mathrm{cont}}$ and the offset field $O$, and resample at the dynamically generated sampling points to obtain the final high-resolution output $X'$, as shown in Equation (11):
$X' = \mathrm{GridSample}\left(X_{\mathrm{cont}},\ G + \alpha O\right)$,  (11)
Here, $G \in \mathbb{R}^{2 \times sH \times sW}$ is the standard upsampling grid, initialized with bilinear interpolation to ensure that the $s^2$ sampling points are evenly distributed within each 1 × 1 region. $\alpha$ is a range factor (set to 0.25), and $G + \alpha O$ forms the dynamic sampling point set. The range factor $\alpha$ governs the maximum offset of the sampling points; its empirical value of $\alpha = 0.25$ strikes a balance between the freedom for feature reconstruction and the avoidance of excessive distortion of the original spatial structure. $\mathrm{GridSample}(\cdot)$ is the resampling function, outputting $X' \in \mathbb{R}^{C \times sH \times sW}$.
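A lightweight PyTorch sketch of the three steps above is shown below. The offset-head widths and the normalization of pixel offsets to grid coordinates are our own simplifications; the continuous signal of Equation (9) is realized implicitly by the bilinear mode of grid_sample.

```python
# Sketch of dynamic upsampling (Eqs. (9)-(11)); a simplified, assumption-based example.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DySampleSketch(nn.Module):
    def __init__(self, ch: int, scale: int = 2, alpha: float = 0.25):
        super().__init__()
        self.scale, self.alpha = scale, alpha
        # offset predictor (Eq. (10)): depthwise 3x3 context + pointwise projection
        self.dw = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.pw = nn.Conv2d(ch, 2 * scale * scale, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        sh, sw = h * self.scale, w * self.scale
        # Eq. (10): predict per-pixel offsets, then lift them to the upsampled grid
        offsets = self.pw(F.gelu(self.dw(x)))                       # (B, 2*s^2, H, W)
        offsets = F.pixel_shuffle(offsets, self.scale)              # (B, 2, sH, sW)
        # standard grid G, normalized to [-1, 1] as required by grid_sample
        ys = torch.linspace(-1, 1, sh, device=x.device)
        xs = torch.linspace(-1, 1, sw, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([gx, gy], dim=-1).expand(b, sh, sw, 2)   # (B, sH, sW, 2)
        # Eq. (11): dynamic sampling points G + alpha * O
        # (pixel offsets roughly rescaled to normalized grid units; a simplification)
        dyn = grid + self.alpha * offsets.permute(0, 2, 3, 1) * 2.0 / max(sh, sw)
        # bilinear grid_sample plays the role of Eq. (9)'s continuous signal modeling
        return F.grid_sample(x, dyn, mode="bilinear", align_corners=True)
```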
The DySample module is implemented at critical upsampling stages within the neck: following the AIFI output and prior to merging with S4, and after generating Y4 and before fusion with S3. This content-aware mechanism allows the upsampling process to reconstruct ship target edges and contours more accurately. Especially when target scales change drastically, it effectively improves localization accuracy. Meanwhile, its lightweight nature ensures real-time performance.

4. Results

This section systematically assesses the performance of the proposed MCB-RT-DETR method. We begin by describing the experimental setup, including datasets, evaluation criteria, and implementation specifics. Subsequently, ablation studies are conducted to validate the contribution of each enhanced component. Following this, the approach is compared against leading object detection models to highlight its overall competitive performance. Finally, cross-domain dataset evaluations are performed to examine its generalization capability.

4.1. Datasets and Evaluation Metrics

4.1.1. Datasets

To fully assess the algorithm’s detection capability across different scenarios, we selected three representative public datasets in our experiments. The statistics of their target count, size, and position distribution are presented in Figure 10.
  • SeaDronesSee [44]: A large-scale dataset specifically constructed for UAV maritime visual tasks. We follow the official split: 8930 training images, 1547 validation images, and 3750 test images. The image resolutions range from 1080p to 4K, and we uniformly resize them to 640 × 640 for training. The dataset contains 5 categories, with the ‘boat’ category constituting the majority of instances. The target scales exhibit a long-tailed distribution characterized by “small when far, large when near,” with complex backgrounds, making it ideal for validating performance in real maritime scenarios.
  • DIOR [45]: A large-scale optical remote sensing image object detection dataset. Following the official split of 11,725 training images and 11,738 test images, we train on the entire training set and specifically evaluate the ‘ship’ category on the test set (comprising 35,186 instances in total). All images have a uniform resolution of 800 × 800 pixels.
  • VisDrone2019 [46]: One of the most challenging UAV aerial object detection datasets currently available. We use its training set (6471 images) for training and evaluate generalization performance across all 10 categories on the validation set (548 images). The images have varying resolutions and are similarly resized to 640 × 640.

4.1.2. Evaluation Metrics

We use general evaluation metrics in the object detection field:
  • Mean Average Precision (mAP): We use mAP@0.5 (mAP at IoU threshold of 0.5) as the main metric to measure the model’s comprehensive detection ability. Additionally, we report mAP@0.5:0.95 (average mAP over IoU thresholds from 0.5 to 0.95 with a step of 0.05) to evaluate localization accuracy.
  • Precision and Recall: Precision (P) measures the reliability of detection results; Recall (R) measures the model’s ability to cover real targets.
  • Frames Per Second (FPS): We test the average time to process a single image on a single NVIDIA RTX3090 GPU and calculate FPS to evaluate the model’s inference speed.
  • Params (M): We report the number of model parameters in millions (M) to indicate model complexity and memory footprint.
  • FLOPs (G): We calculate the number of floating-point operations (in gigaFLOPs, G) for a standard input size to assess computational cost.

4.1.3. Implementation Details

The detailed experimental configurations are summarized in Table 1.

4.2. Ablation Experiments

We follow the widely adopted Microsoft COCO dataset evaluation standard to categorize targets based on pixel area into three scales: small objects (area < 32² pixels), medium-sized objects (32² ≤ area ≤ 96² pixels), and large objects (area > 96² pixels). In the SeaDronesSee dataset, the distribution of target scales is 21,339 small targets, 2713 medium targets, and 79 large targets. This composition results in small targets constituting 88.4% of all annotated objects, making the dataset inherently focused on the small-target detection challenge.
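For clarity, a minimal helper illustrating these COCO-style scale buckets (using the annotated bounding-box pixel area) is sketched below; the function name is illustrative.

```python
# COCO-style scale buckets with the 32^2 and 96^2 area thresholds cited above.
def scale_bucket(width: float, height: float) -> str:
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"
```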
Given this dataset characteristic, it is important to emphasize that the reported overall performance metrics (mAP@0.5, precision, recall) primarily reflect small-target detection capability. This evaluation focus aligns perfectly with the paper’s objective of enhancing detection performance for UAV maritime applications, where small and distant targets represent the predominant challenge in real-world scenarios.
For reproducibility, each experiment is repeated three times with different random seeds, and results are reported as averages. The standard deviation of mAP values across runs is consistently below 0.5%, confirming the stability of our findings.
To validate the contribution of every enhanced component, we performed structured ablation studies on the SeaDronesSee validation set with RT-DETR-R18 as the baseline. Specifically, we evaluate four key components of our MCB-RT-DETR framework: (1) the Orthogonal Channel Attention module (Ortho) integrated into the backbone; (2) the Receptive Field Attention Convolution (RFAConv) incorporated into the backbone; (3) the Small Object Detail-Enhancement Pyramid Network (SOD-EPN) replacing the original feature pyramid; and (4) the Dynamic Upsampler (DySample) module added to the neck. Additionally, we include an ablation study comparing the static versus learnable variants of the orthogonal kernel in the Ortho module. Experimental results are shown in Table 2.
(1)
Module Effectiveness: The addition of any single module improves the baseline model’s mAP@0.5, validating the effectiveness of each proposed component. Specifically: (a) SOD-EPN provides the largest recall improvement (+5.7%), significantly reducing small target missed detections; (b) DySample substantially boosts inference speed (+20.7% FPS) while maintaining competitive accuracy; (c) RFAConv achieves the best single-module mAP@0.5 improvement (+2.7%). We also evaluated a learnable variant of the orthogonal filter (Table 2, row “+Ortho (Learn)”), which underperforms the static version in both mAP@0.5 and recall, validating our design choice of using a static kernel to ensure unbiased feature preservation and avoid optimization bias.
(2)
Combination Effects: The integration of different modules reveals complementary strengths. On one hand, the Ortho + RFAConv combination achieves the highest precision (93.1%) and overall mAP@0.5:0.95 (49.4%), demonstrating that enhancements to the backbone effectively boost feature discriminability. On the other hand, the SOD-EPN + DySample combination attains the highest detection accuracy mAP@0.5 (83.6%) but at a reduced inference speed of 36.4 FPS, highlighting the inherent accuracy-speed trade-off in feature pyramid design. Specifically, while SOD-EPN preserves high-resolution detail features and DySample performs dynamic content-aware upsampling, their direct combination introduces additional computational overhead when processing multi-scale features, which is the primary cause of the speed reduction.
(3)
Final Model: The complete MCB-RT-DETR (integrating all modules with architectural optimizations including feature resolution alignment between SOD-EPN and DySample, selective module integration based on complementary analysis, and inference pipeline optimizations through layer fusion) achieves the optimal balance between accuracy and efficiency: mAP@0.5 increases by 4.5% (from 78.4% to 82.9%), mAP@0.5:0.95 improves by 3.4% (from 46.3% to 49.7%), while maintaining real-time processing at 50 FPS. With 22.11 M parameters and 65.2 G FLOPs, the model demonstrates reasonable computational complexity for the achieved performance level, verifying the superiority of our multi-module collaborative design.
(4)
Small-target Detection Focus: Given the SeaDronesSee dataset composition (88.4% small targets), the substantial recall improvement (from 75.0% to 81.0%) is particularly significant, demonstrating our method’s effectiveness in addressing the primary challenge of maritime UAV operations—detecting small and distant vessels.
(5)
Hyperparameter Sensitivity Analysis: We conducted a sensitivity analysis on the key hyperparameter in DySample, the offset range factor $\alpha$. Values of $\alpha \in \{0.1, 0.25, 0.5\}$ were tested on the SeaDronesSee validation set. The results show that with $\alpha = 0.1$, the model is too conservative, yielding a limited mAP@0.5 improvement (+1.3%). When $\alpha = 0.5$, the excessive offset range leads to local distortion in the feature space, causing the accuracy gain to drop (mAP@0.5 only +0.8%) and a slight decrease in FPS. In contrast, $\alpha = 0.25$ achieves a significant accuracy gain (mAP@0.5 +1.6%) while maintaining the highest FPS, validating this empirical value as a good trade-off between accuracy and efficiency.
Additionally, we visualized the loss curves of the baseline and final models during training (Figure 11 and Figure 12). MCB-RT-DETR’s losses converge faster and reach lower final values, indicating better optimization performance and generalization ability.

4.3. Comparative Experiments

For a thorough evaluation of MCB-RT-DETR, we compared it extensively against leading object detection approaches using the SeaDronesSee test set. The comparison includes: (1) YOLO series models (YOLOv10, YOLOv11, YOLOv12, YOLOv13) representing the latest advancements in CNN-based real-time detection; (2) Transformer-based detectors (DN-DETR) for fair architecture comparison; and (3) specialized maritime detection frameworks (MFEF-YOLO [47], MSO-DETR [48]) that are specifically designed for maritime scenarios.
For a fair comparison, we ensured consistent experimental conditions across all models: (1) All models were trained for 200 epochs using the official dataset splits as described in Section 4.1; (2) Input image resolution was uniformly set to 640 × 640 pixels for SeaDronesSee and VisDrone2019, and 800 × 800 for DIOR; (3) The same data augmentation pipeline (including Mosaic, random horizontal flipping, HSV color space perturbation, and multi-scale scaling) was applied to all models during training; (4) All experiments were conducted on identical hardware (NVIDIA RTX 3090 GPU) under the same software environment (PyTorch 2.0.0). While optimizer choices and learning rate schedules followed each model’s official recommendations to respect their architectural characteristics, this standardized approach ensures that performance differences primarily reflect architectural advantages rather than training condition variations.
As shown in Table 3, our MCB-RT-DETR achieves the best balance between accuracy and efficiency across all evaluated metrics:
MCB-RT-DETR achieves state-of-the-art accuracy (82.9% mAP@0.5, 81.0% recall) while maintaining real-time performance (50 FPS). The method demonstrates an optimal accuracy-speed trade-off: compared to high-speed YOLOv11, it gains +21.9% mAP@0.5; compared to accurate MSO-DETR, it achieves +4.7% higher accuracy. Our approach also outperforms maritime-specific detectors (MFEF-YOLO, MSO-DETR), validating its effectiveness for maritime scenarios. The high recall is particularly significant given SeaDronesSee’s 88.4% small targets, demonstrating robust small-target detection capability. With 22.11 M parameters, the model is computationally efficient while leveraging Transformer’s global context and specialized maritime modules to address complex sea conditions.
The heatmaps generated by Grad-CAM++ for YOLOv12, YOLOv13, and our method are shown in Figure 13.
In challenging scenarios such as wave reflection or dense small targets, our model exhibits a more concentrated heat response on ship structures with clearer boundaries, while showing reduced activation toward background interference (e.g., waves). This provides an intuitive visual demonstration of the model’s improved focus.
For an additional quantitative assessment of attention focus and precision, we adopt attention entropy as a measurement metric. For each heatmap region corresponding to a predicted bounding box, we compute the Shannon entropy of the normalized attention weights as $\mathrm{Entropy} = -\sum_i p_i \log(p_i)$, where $p_i$ represents the normalized attention value of the i-th pixel. A lower entropy value indicates more focused and certain model attention. Based on 1000 randomly selected ship-containing samples from the SeaDronesSee test set, our MCB-RT-DETR achieves an average attention entropy of 2.25, significantly lower than YOLOv12 (2.89) and YOLOv13 (3.06). This result quantitatively confirms that our model directs attention more precisely toward the ship targets, with less dispersion to background distractions.
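As an illustration, the attention-entropy computation can be sketched as follows; the function signature and the small epsilon guard are assumptions for the example.

```python
# Shannon entropy of the normalized heatmap values inside a predicted box
# (lower = more focused attention), computed in nats.
import numpy as np


def attention_entropy(heatmap: np.ndarray, box: tuple) -> float:
    """heatmap: 2-D Grad-CAM++ map; box: (x1, y1, x2, y2) in pixel coordinates."""
    x1, y1, x2, y2 = box
    region = heatmap[y1:y2, x1:x2].astype(np.float64)
    p = region / (region.sum() + 1e-12)        # normalize to a probability distribution
    p = p[p > 0]                               # ignore zero-probability pixels
    return float(-(p * np.log(p)).sum())
```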

4.4. Generalization Verification

To verify MCB-RT-DETR’s generalization across different scenarios, we performed cross-domain tests on the DIOR and VisDrone2019 datasets.

4.4.1. Ship Detection on DIOR Dataset

Our evaluation concentrated on the “ship” class within the DIOR dataset, benchmarking against YOLOv8, YOLOv11, and the RT-DETR-R18 baseline. The outcomes are presented in Table 4.
On the DIOR dataset, our method achieved a clear advantage in the “ship” category: precision reached 95.5%, and mAP@0.5 reached 95.7%. This demonstrates that the improved model has a strong ability to identify and locate multi-type, multi-scale ship targets in remote sensing images.

4.4.2. General Object Detection on VisDrone2019 Dataset

To further verify the improved algorithm’s generalization to UAV aerial scenarios, we conducted cross-domain validation on VisDrone2019. This dataset is characterized by low-altitude views, dense small targets (e.g., pedestrians, vehicles), and complex urban backgrounds. Its scene diversity and target distribution complement SeaDronesSee, enabling systematic evaluation of the algorithm’s adaptability to real UAV applications. Test results are shown in Table 5.
In completely different UAV urban scenarios, our method still achieved the best performance: recall and mAP metrics outperformed all comparison algorithms. This suggests that MCB-RT-DETR’s improved feature extraction and multi-scale processing extend beyond maritime vessels to general small objects in UAV imagery, highlighting its robust generalization capacity. Meanwhile, the lightweight design ensures real-time inference speed of 50 FPS, facilitating subsequent UAV edge deployment.
In conclusion, the series of experiments fully verify MCB-RT-DETR’s comprehensive advantages in precision, speed, and generalization.

5. Discussion

This section deeply analyzes and discusses the experimental results. We first dissect the collaborative working mechanism among the improved modules, then explain the performance advantages from the algorithm’s internal mechanism, and conclude with an impartial assessment of the method’s practical utility and existing constraints.

5.1. Module Collaboration Mechanism Analysis

Ablation experiments show that the improved modules are not simply stacked: they form a collaborative system in which each component reinforces the others.
This collaboration first appears in the complementarity of the feature extraction chain: Ortho attention and RFAConv in the backbone act on shallow and deep features, respectively. The former effectively preserves high-frequency details like ship edges. The latter suppresses background interference via content-adaptive convolution. Together, they ensure key information flows from pixel-level details to target-level semantics.
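To make the channel-attention squeeze step concrete, the sketch below shows one way an SE-style attention block can replace global average pooling with fixed orthogonal spatial filters, so that non-mean (edge-like) content also contributes to the channel weights. This is a simplified illustration under our own assumptions (class name `OrthoChannelAttention`, QR-based filter construction, SE-style excitation); it is not the paper's exact Ortho module, and RFAConv is not reproduced here.

```python
import torch
import torch.nn as nn

class OrthoChannelAttention(nn.Module):
    """SE-style channel attention whose squeeze uses fixed orthogonal
    spatial filters instead of plain global average pooling (sketch)."""

    def __init__(self, channels, height, width, reduction=16):
        super().__init__()
        assert height * width >= channels, "need H*W >= C for orthonormal filters"
        # One orthogonal spatial filter per channel: orthonormalize random
        # vectors of length H*W via QR, then reshape to (C, H, W).
        rand = torch.randn(height * width, channels)
        q, _ = torch.linalg.qr(rand)                    # orthonormal columns
        self.register_buffer("filters", q.t().reshape(channels, height, width))
        hidden = max(channels // reduction, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels), nn.Sigmoid(),
        )

    def forward(self, x):                                # x: (B, C, H, W)
        squeezed = (x * self.filters).sum(dim=(2, 3))    # per-channel projection
        weights = self.fc(squeezed)                      # (B, C) weights in (0, 1)
        return x * weights.unsqueeze(-1).unsqueeze(-1)
```

Because the squeeze projects each channel onto an orthogonal spatial pattern rather than onto the spatial mean alone, high-frequency structure can influence the channel reweighting, which is the intuition behind preserving edge details in the backbone.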
Furthermore, at the feature pyramid level, the SOD-EPN module uses SPDConv to downsample features without discarding small-target details, and then enhances multi-scale features via CSP-OmniKernel, forming a "fidelity first, enhancement second" fusion strategy that strengthens the representation of small objects.
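The "fidelity first" step can be illustrated with a minimal SPD-style block: a pixel unshuffle rearranges each 2×2 neighborhood into channels (no pixels are discarded), and a stride-1 convolution then mixes them. The class name `SPDConv` and the exact layer sizes below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth followed by a non-strided convolution: resolution is
    halved losslessly by moving 2x2 neighborhoods into channels, then a
    stride-1 conv mixes them (sketch of the SPD-Conv idea in SOD-EPN)."""

    def __init__(self, in_channels, out_channels, scale=2):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(scale)   # (C, H, W) -> (C*s*s, H/s, W/s)
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels * scale * scale, out_channels, 3, stride=1, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(self.unshuffle(x))

# e.g. downsample a P3 feature map without discarding pixels:
# out = SPDConv(128, 256)(torch.randn(1, 128, 80, 80))   # -> (1, 256, 40, 40)
```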
Moreover, within the neck module, the DySample upsampler uses ultra-lightweight content-aware sampling to accurately reconstruct target boundaries. It converts front-end feature advantages into final localization accuracy, balancing performance and computational efficiency.
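A simplified content-aware upsampler in the spirit of DySample is sketched below: a 1×1 convolution predicts per-output-pixel sampling offsets that perturb a regular grid before `grid_sample` reads the low-resolution feature map. The class name `DySampleLite`, the offset range, and the zero initialization (which starts as plain bilinear upsampling) are our assumptions; the published DySample includes grouping and offset-scoping details not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleLite(nn.Module):
    """Content-aware 2x upsampling via predicted sampling offsets (sketch)."""

    def __init__(self, channels, scale=2, offset_range=0.25):
        super().__init__()
        self.scale = scale
        self.offset_range = offset_range
        # Predict (dx, dy) per output location, laid out for pixel shuffle.
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)
        nn.init.zeros_(self.offset.weight)
        nn.init.zeros_(self.offset.bias)   # zero offsets -> bilinear upsampling

    def forward(self, x):                   # x: (B, C, H, W)
        b, _, h, w = x.shape
        H, W = h * self.scale, w * self.scale
        # Regular output grid in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, H, device=x.device)
        xs = torch.linspace(-1, 1, W, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).expand(b, H, W, 2)
        # Predicted offsets, brought to output resolution by pixel shuffle,
        # then converted from input-pixel units to normalized grid units.
        off = F.pixel_shuffle(self.offset(x), self.scale)          # (B, 2, H, W)
        off = off.permute(0, 2, 3, 1) * self.offset_range * 2.0 / torch.tensor(
            [w, h], device=x.device, dtype=x.dtype)
        return F.grid_sample(x, grid + off, mode="bilinear", align_corners=True)
```

Because the sampling positions depend on the feature content, boundaries can be reconstructed more sharply than with fixed bilinear interpolation, at the cost of only a 1×1 convolution.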
In summary, these modules are interlinked. They optimize core links like feature extraction, multi-scale fusion, and feature reconstruction. Together, they lay the foundation for MCB-RT-DETR’s efficient performance.

5.2. Internal Reasons for Performance Advantages

The significant performance advantages of MCB-RT-DETR stem from its end-to-end enhancement design, spanning feature perception through localization reconstruction.
On one hand, by integrating the global context modeling of the Transformer with the local feature extraction strength of the CNN backbone, the model can grasp the sea-sky background at a macroscopic level. At the same time, Ortho channel attention and the receptive-field attention of RFAConv strengthen its discrimination of key local ship structures, so it maintains high precision under complex texture interference such as waves.
On the other hand, to tackle small-object detection difficulties, the SOD-EPN module utilizes SPDConv to retain details without degradation. It then enhances multi-scale features via CSP-OmniKernel, fundamentally improving small-target representation. This is reflected in more concentrated response areas in the heatmap (Figure 13) and higher recall.
Moreover, in the localization stage, the DySample upsampler reconstructs feature maps more accurately than traditional interpolation via content-aware dynamic sampling. It improves bounding box accuracy (mAP@0.5:0.95) and provides reliable visual guidance for UAV autonomous operations.
Furthermore, the strong cross-domain generalization capability demonstrated by MCB-RT-DETR (evidenced by its excellent performance on DIOR and VisDrone2019) stems from its architectural design: (1) The Ortho and RFAConv modules focus on enhancing the model’s perception of essential target characteristics (e.g., edges, structures) while suppressing domain-specific background interference (e.g., wave textures, urban buildings), which facilitates learning more generalizable representations; (2) The SOD-EPN module improves the model’s robustness to the common cross-domain challenge of small object detection through information-lossless compression and multi-scale kernel-based feature enhancement; (3) The overall lightweight design prevents overfitting, enabling the model to concentrate its capacity on universal patterns.
This comprehensive enhancement strategy ensures that MCB-RT-DETR maintains high performance across diverse maritime and aerial scenarios while meeting real-time processing requirements.

5.3. Practical Application Prospects and Limitations

With high precision, fast inference speed, and strong generalization ability, MCB-RT-DETR shows broad prospects in UAV maritime operations. It can be used in autonomous replenishment, maritime search and rescue, maritime supervision, and marine environment monitoring. Its excellent performance on VisDrone2019 also indicates potential in low-altitude UAV tasks like urban security and traffic monitoring.
However, this method still has several limitations: its resilience in severe weather scenarios, such as dense fog or heavy rainfall, requires further validation; its detection performance for densely occluded targets in port scenarios needs improvement; and for efficient deployment on resource-constrained embedded UAV platforms, further lightweight processing (such as model quantization and pruning) and stability evaluation are required. These limitations clearly indicate promising directions for future research.
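As a rough illustration of the lightweighting directions mentioned above (not a deployment recipe), the snippet below combines L1 unstructured pruning of convolution weights with dynamic INT8 quantization of the linear layers that dominate the Transformer decoder, using standard PyTorch utilities. The function name `slim_for_edge` and the 30% pruning ratio are arbitrary assumptions; real edge deployment would typically go through the target runtime's toolchain and require re-validating detection accuracy.

```python
import torch
import torch.nn.utils.prune as prune

def slim_for_edge(model, prune_amount=0.3):
    """Prune conv weights (L1, unstructured) and dynamically quantize
    Linear layers to INT8. Illustrative sketch only."""
    for module in model.modules():
        if isinstance(module, torch.nn.Conv2d):
            prune.l1_unstructured(module, name="weight", amount=prune_amount)
            prune.remove(module, "weight")   # make the induced sparsity permanent
    return torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
```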

6. Conclusions

To address the urgent demand for high-precision, real-time ship detection in UAV maritime autonomous replenishment tasks, we analyzed the core challenges faced by existing detectors under complex sea conditions: background interference, missed detection of small targets, severe scale variations, and limited computing resources. Against this backdrop, we proposed MCB-RT-DETR, an improved RT-DETR-based ship detection algorithm. We introduced orthogonal channel attention and receptive field attention convolution to enhance feature discrimination, designed a small-target detail enhancement pyramid network to strengthen small-target representation, and adopted an ultra-lightweight dynamic upsampler to improve multi-scale ship localization accuracy while preserving inference speed, ultimately constructing an efficient collaborative detection framework. Experimental results show that MCB-RT-DETR achieves 82.9% mAP@0.5 and 49.7% mAP@0.5:0.95 on the SeaDronesSee dataset, 4.5% and 3.4% higher than the baseline, respectively, while maintaining a real-time processing speed of 50 FPS. Its strong performance on the DIOR dataset (ship category) and the VisDrone2019 dataset further verifies its generalization across different maritime scenarios and general UAV aerial tasks, laying a solid technical foundation for reliable visual guidance of UAV maritime autonomous systems.
Looking forward, several research directions merit further investigation: (1) extending the framework to multi-modal sensing scenarios, (2) developing adaptive mechanisms for dynamic maritime environments with a focus on weather robustness (e.g., fog, rain, glare), (3) optimizing the model for edge deployment on smaller UAVs, and (4) exploring cross-domain applications in related maritime surveillance tasks.

Author Contributions

Conceptualization, F.L. and Y.W.; methodology, F.L.; validation, A.Y. and T.C.; formal analysis, X.X.; investigation, Y.W.; resources, F.L.; data curation, A.Y.; writing—original draft preparation, Y.W.; writing—review and editing, F.L. and T.C.; visualization, X.X.; supervision, F.L.; project administration, F.L.; funding acquisition, F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was not funded by any parties.

Data Availability Statement

Data will be made available upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lau, Y.-Y.; Chen, Q.; Poo, M.C.-P.; Ng, A.K.Y.; Ying, C.C. Maritime transport resilience: A systematic literature review on the current state of the art, research agenda and future research directions. Ocean Coast. Manag. 2024, 251, 107086. [Google Scholar] [CrossRef]
  2. Pensado, E.A.; López, F.V.; Jorge, H.G.; Pinto, A.M. UAV Shore-to-Ship Parcel Delivery: Gust-Aware Trajectory Planning. IEEE Trans. Aerosp. Electron. Syst. 2024, 60, 6213–6223. [Google Scholar] [CrossRef]
  3. Kim, J.-H.; Kim, N.; Park, Y.W.; Won, C.S. Object Detection and Classification Based on YOLO-V5 with Improved Maritime Dataset. J. Mar. Sci. Eng. 2022, 10, 377. [Google Scholar] [CrossRef]
  4. Deng, H.; Wang, S.; Wang, X.; Zheng, W.; Xu, Y. YOLO-SEA: An Enhanced Detection Framework for Multi-Scale Maritime Targets in Complex Sea States and Adverse Weather. Entropy 2025, 27, 667. [Google Scholar] [CrossRef]
  5. Zeng, S.; Yang, W.; Jiao, Y.; Geng, L.; Chen, X. SCA-YOLO: A new small object detection model for UAV images. Vis. Comput. 2024, 40, 1787–1803. [Google Scholar] [CrossRef]
  6. Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  7. Zhang, L.; Huang, L. Ship Plate Detection Algorithm Based on Improved RT-DETR. J. Mar. Sci. Eng. 2025, 13, 1277. [Google Scholar] [CrossRef]
  8. Wu, W.; Fan, X.; Hu, Z.; Zhao, Y. CGDU-DETR: An End-to-End Detection Model for Ship Detection in Day–Night Transition Environments. J. Mar. Sci. Eng. 2025, 13, 1155. [Google Scholar] [CrossRef]
  9. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar] [CrossRef]
  10. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar] [CrossRef]
  11. Shao, F.F.; Chen, L.; Shao, J.; Ji, W.; Xiao, S.; Ye, L.; Zhuang, Y.; Xiao, J. Deep Learning for Weakly-Supervised Object Detection and Localization: A Survey. Neurocomputing 2022, 496, 192–207. [Google Scholar] [CrossRef]
  12. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; pp. 886–893. [Google Scholar] [CrossRef]
  13. Zhai, J.; Long, L.; Liu, N.; Wan, Q. Improved K-means Template Matching Target Recognition. In Proceedings of the IEEE 6th International Conference on Pattern Recognition and Artificial Intelligence, Xiamen, China, 18–20 August 2023; pp. 492–497. [Google Scholar] [CrossRef]
  14. Arshad, N.; Moon, K.-S.; Kim, J.-N. An adaptive moving ship detection and tracking based on edge information and morphological operations. In Proceedings of the International Conference on Graphic and Image Processing, Cairo, Egypt, 1–2 October 2011; SPIE: Bellingham, WA, USA, 2011; pp. 474–479. [Google Scholar] [CrossRef]
  15. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar] [CrossRef]
  16. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  17. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  18. Qiao, S.; Chen, L.-C.; Yuille, A. Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10213–10224. [Google Scholar] [CrossRef]
  19. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar] [CrossRef]
  20. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
  21. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the International Conference on Advances in Data Engineering and Intelligent Computing Systems, Chennai, India, 15–16 March 2024; pp. 1–6. [Google Scholar] [CrossRef]
  22. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar] [CrossRef]
  23. Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  24. Alif, M.A.R.; Hussain, M. Yolov12: A breakdown of the key architectural features. arXiv 2025, arXiv:2502.14740. [Google Scholar] [CrossRef]
  25. Xu, F.C.; Alfred, R.; Pailus, R.H.; Ge, L.; Shifeng, D.; Chew, J.V.L.; Guozhang, L.; Xinliang, W. DETR Novel Small Target Detection Algorithm Based on Swin Transformer. IEEE Access 2024, 12, 115838–115852. [Google Scholar] [CrossRef]
  26. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar] [CrossRef]
  27. Qi, Y.; Cao, H. SDFA-Net: Synergistic Dynamic Fusion Architecture with Deformable Attention for UAV Small Target Detection. IEEE Access 2025, 13, 110636–110647. [Google Scholar] [CrossRef]
  28. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 13619–13627. [Google Scholar] [CrossRef]
  29. Li, Y.; Yuan, H.; Wang, Y.; Xiao, C. GGT-YOLO: A novel object detection algorithm for drone-based maritime cruising. Drones 2022, 6, 335. [Google Scholar] [CrossRef]
  30. Li, Y.; Tian, Y.; Yuan, C.; Yu, K.; Yin, K.; Huang, H.; Yang, G.; Li, F.; Zhou, Z. YOLO-UAVShip: An Effective Method and Dataset for Multi-View Ship Detection in UAV Images. Remote Sens. 2025, 17, 3119. [Google Scholar] [CrossRef]
  31. Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16901–16911. [Google Scholar] [CrossRef]
  32. Chen, C.C.; Zeng, W.M.; Zhang, X.L. HFPNet: Super Feature Aggregation Pyramid Network for Maritime Remote Sensing Small-Object Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 5973–5989. [Google Scholar] [CrossRef]
  33. Tang, P.; Zhang, Y. LiteFlex-YOLO: A lightweight small target detection network for maritime unmanned aerial vehicles. Pervasive Mob. Comput. 2025, 111, 102064. [Google Scholar] [CrossRef]
  34. Xing, Z.; Ren, J.; Fan, X.; Zhang, Y. S-DETR: A Transformer Model for Real-Time Detection of Marine Ships. J. Mar. Sci. Eng. 2023, 11, 696. [Google Scholar] [CrossRef]
  35. Wang, Y.; Li, X. Ship-DETR: A Transformer-Based Model for Efficient Ship Detection in Complex Maritime Environments. IEEE Access 2025, 13, 66031–66039. [Google Scholar] [CrossRef]
  36. Jiao, Z.; Wang, M.; Qiao, S.; Zhang, Y.; Huang, Z. Transformer-Based Object Detection in Low-Altitude Maritime UAV Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4210413. [Google Scholar] [CrossRef]
  37. Salman, H.; Parks, C.; Swan, M.; Gauch, J. Orthonets: Orthogonal channel attention networks. In Proceedings of the IEEE International Conference on Big Data, Sorrento, Italy, 15–18 December 2023; pp. 829–837. [Google Scholar] [CrossRef]
  38. Zhang, X.; Liu, C.; Yang, D.; Song, T.; Ye, Y.; Li, K.; Song, Y. RFAConv: Innovating spatial attention and standard convolutional operation. arXiv 2023, arXiv:2304.03198. [Google Scholar] [CrossRef]
  39. Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Grenoble, France, 19–23 September 2022; Springer: Cham, Switzerland, 2022; pp. 443–459. [Google Scholar] [CrossRef]
  40. Wang, C.-Y.; Liao, H.-Y.M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar] [CrossRef]
  41. Cui, Y.; Ren, W.; Knoll, A. Omni-kernel network for image restoration. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 1426–1434. [Google Scholar] [CrossRef]
  42. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6027–6037. [Google Scholar] [CrossRef]
  43. Fu, R.; Hu, Q.; Dong, X.; Gao, Y.; Li, B.; Zhong, P. Lighten CARAFE: Dynamic lightweight upsampling with guided reassemble kernels. In Proceedings of the International Conference on Pattern Recognition, Montreal, QC, Canada, 21–25 August 2024; Springer: Cham, Switzerland, 2024; pp. 383–399. [Google Scholar] [CrossRef]
  44. Varga, L.A.; Kiefer, B.; Messmer, M.; Zell, A. Seadronessee: A maritime benchmark for detecting humans in open water. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2260–2270. [Google Scholar] [CrossRef]
  45. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  46. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 213–226. [Google Scholar] [CrossRef]
  47. Liu, Q.; Yu, H.; Zhang, P.; Geng, T.; Yuan, X.; Ji, B.; Zhu, S.; Ma, R. MFEF-YOLO: A Multi-Scale Feature Extraction and Fusion Network for Small Object Detection in Aerial Imagery over Open Water. Remote Sens. 2025, 17, 3996. [Google Scholar] [CrossRef]
  48. Li, J.; Hua, Y.; Xue, M. MSO-DETR: A Lightweight Detection Transformer Model for Small Object Detection in Maritime Search and Rescue. Electronics 2025, 14, 2327. [Google Scholar] [CrossRef]
Figure 1. Example of ship target detection.
Figure 2. RT-DETR-R18 architecture. Arrows indicate data flow between components, and colors distinguish different functional modules (e.g., light blue for ConvNormLayer, light green for Conv-BN).
Figure 3. Architecture of the improved RT-DETR. Dashed boxes indicate the positions where improvements are made. Arrows represent the direction of data flow between modules.
Figure 4. Structure of the cross-dimensional attention and its location in the backbone improvements.
Figure 5. Structure of RFAConv and its location in the backbone improvements.
Figure 6. Architecture of the Small Object Detail-Enhancement Pyramid Network. The left part of this diagram corresponds to the original module, while the right part represents the improved module; arrows indicate connection directions, and dashed boxes denote positions for component addition or module improvements.
Figure 7. Structure of SPDConv. Arrows indicate the direction of data flow between components.
Figure 8. CSP-OmniKernel module. The input is split along the channel dimension, with one part undergoing an identity mapping and the other processed by the OmniKernel module; the resulting outputs are then combined to generate the final feature representation. Arrows indicate data flow direction, and dashed boxes denote the replacement of the original module with the OmniKernel module.
Figure 9. DySample architecture. Arrows indicate the direction of data flow between components. The dashed box encloses the three core components of DySample, corresponding to its three operational steps: Offset Predictor (offset prediction), Continuous Signal Modeling Layer (continuous signal modeling), and Dynamic Sampling Executor (dynamic sampling execution).
Figure 10. Statistics of Target Count, Size, and Position Distribution Across Three Public Datasets. For each dataset, three sub-figures are presented: (a) class distribution histogram showing the number of instances per category; (b) target size distribution based on bounding box area; (c) spatial distribution of target centers relative to the image coordinate system.
Figure 11. Training curves of the baseline model.
Figure 12. Training curves of the improved model MCB-RT-DETR.
Figure 13. Heatmap comparison in typical scenarios. (a) Heatmap comparison for target detection in reflective backgrounds; (b) Heatmap comparison for dense small object detection; (c) Heatmap comparison for large-scale target detection.
Table 1. Experimental Configuration.
Parameter | Value
GPU | NVIDIA RTX 3090
Framework | PyTorch 2.0.0
Optimizer | AdamW
Initial learning rate | 1 × 10^−4
Weight decay | 1 × 10^−4
LR scheduler | Cosine annealing (lrf = 0.001)
Training epochs | 200
Input image size | 640 × 640
Batch size | 4
Table 2. Ablation Study Results of Different Modules.
Model | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | FPS | Params (M) | FLOPs (G)
Baseline | 91.1 | 75.0 | 78.4 | 46.3 | 50.8 | 20.10 | 58.3
+Ortho | 91.1 | 77.3 | 79.3 | 47.1 | 44.8 | 20.25 | 60.5
+Ortho (Learn) | 90.5 | 76.8 | 78.9 | 46.8 | 44.5 | 20.26 | 60.9
+RFAConv | 91.4 | 79.6 | 81.1 | 48.1 | 46.7 | 20.40 | 61.8
+SOD-EPN | 91.4 | 80.7 | 82.5 | 47.7 | 50.5 | 21.61 | 63.6
+DySample | 89.4 | 77.1 | 80.0 | 46.8 | 61.3 | 20.12 | 56.4
+Ortho + RFAConv | 93.1 | 78.1 | 82.1 | 49.4 | 42.6 | 20.65 | 65.3
+SOD-EPN + DySample | 91.0 | 80.7 | 83.6 | 47.3 | 36.4 | 21.62 | 67.7
MCB-RT-DETR (all) | 90.8 | 81.0 | 82.9 | 49.7 | 50.0 | 22.11 | 65.2
Table 3. Comparison of different object detection models on the SeaDronesSee dataset.
Model | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | FPS | Params (M) | FLOPs (G)
YOLOv10 | 73.8 | 55.0 | 61.0 | 36.0 | 114.9 | 2.71 | 8.2
YOLOv11 | 76.4 | 57.6 | 61.0 | 37.0 | 140.8 | 2.60 | 6.3
YOLOv12 | 80.3 | 63.4 | 68.7 | 41.7 | 77.5 | 2.55 | 6.5
YOLOv13 | 80.4 | 62.3 | 67.3 | 40.7 | 54.6 | 2.46 | 6.4
MFEF-YOLO | 75.8 | 68.4 | 71.2 | 41.0 | 99.1 | 2.3 | 11.7
MSO-DETR | 91.0 | 75.3 | 78.2 | 46.9 | 53.7 | 6.5 | 30.5
DN-DETR | 75.2 | 56.6 | 60.1 | 36.2 | 31.5 | 44.0 | 94.1
Ours | 90.8 | 81.0 | 82.9 | 49.7 | 50.0 | 22.11 | 65.2
Table 4. Comparison of different object detection models on the "ship" category in the DIOR dataset.
Model | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/%
YOLOv8 | 93.5 | 88.9 | 93.5 | 58.8
YOLOv11 | 94.5 | 90.1 | 94.6 | 59.1
RT-DETR-R18 | 93.0 | 93.1 | 94.8 | 60.8
Ours | 95.5 | 91.9 | 95.7 | 62.4
Table 5. Comparison of experimental results of different algorithms on the VisDrone2019 dataset.
Method | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/%
YOLOv11 | 42.0 | 32.8 | 32.5 | 18.9
RT-DETR-R18 | 61.7 | 44.3 | 46.4 | 28.2
Ours | 61.5 | 47.3 | 48.0 | 29.4