Article

Ship Plate Detection Algorithm Based on Improved RT-DETR

College of Fishery, Ocean University of China, Qingdao 266003, China
*
Authors to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(7), 1277; https://doi.org/10.3390/jmse13071277
Submission received: 19 May 2025 / Revised: 25 June 2025 / Accepted: 26 June 2025 / Published: 30 June 2025
(This article belongs to the Section Ocean Engineering)

Abstract

To address the challenges in ship plate detection under complex maritime scenarios—such as small target size, extreme aspect ratios, dense arrangements, and multi-angle rotations—this paper proposes a multi-module collaborative detection algorithm, RT-DETR-HPA, based on an enhanced RT-DETR framework. The proposed model integrates three core components: an improved High-Frequency Enhanced Residual Block (HFERB) embedded in the backbone to strengthen multi-scale high-frequency feature fusion, with deformable convolution added to handle occlusion and deformation; a Pinwheel-shaped Convolution (PConv) module employing multi-directional convolution kernels to achieve rotation-adaptive local detail extraction and accurately capture plate edges and character features; and an Adaptive Sparse Self-Attention (ASSA) mechanism incorporated into the encoder to automatically focus on key regions while suppressing complex background interference, thereby enhancing feature discriminability. Comparative experiments conducted on a self-constructed dataset of 20,000 ship plate images show that, compared to the original RT-DETR, RT-DETR-HPA achieves a 3.36% improvement in mAP@50 (up to 97.12%), a 3.23% increase in recall (reaching 94.88%), and maintains real-time detection speed at 40.1 FPS. Compared with mainstream object detection models such as the YOLO series and Faster R-CNN, RT-DETR-HPA demonstrates significant advantages in high-precision localization, adaptability to complex scenarios, and real-time performance. It effectively reduces missed and false detections caused by low resolution, poor lighting, and dense occlusion, providing a robust and high-accuracy solution for intelligent ship supervision. Future work will focus on lightweight model design and dynamic resolution adaptation to enhance its applicability on mobile maritime surveillance platforms.

1. Introduction

With the rapid development of intelligent shipping systems, automatic ship identification technology has become a core requirement for maritime traffic management, smuggling inspection, and emergency rescue scenarios [1]. As the legal identifier of vessels, the precise detection and recognition of ship license plates constitute a critical component of maritime regulatory systems. However, unlike text detection in natural scenes, ship license plate detection in practical scenarios faces unique challenges: In ports, densely berthed vessels not only cause mutual occlusion but also lead to irregular deformations and partial loss of license plates due to hull structures and imaging angles; during maritime navigation, glare from strong sunlight and character blurring caused by salt spray corrosion further exacerbate detection difficulties [2,3].
Recent years have witnessed significant advancements in object detection research. Convolutional Neural Network (CNN)-based models (e.g., YOLO series [4,5,6,7,8,9,10,11], Faster R-CNN [12], SSD [13]) have demonstrated outstanding performance in generic object detection tasks through multi-scale feature fusion and region proposal mechanisms. Nevertheless, these methods exhibit limitations when addressing the specificity of ship license plate detection. For instance, the Feature Pyramid Network (FPN) proposed by Lin et al. [14] enhances small object detection capability through multi-scale feature fusion, yet its detection accuracy significantly deteriorates in scenarios with extreme aspect ratios and densely arranged license plates [15]. Additionally, traditional methods show inadequate adaptability to rotated targets. While the Rotation Region Proposal Network (RRPN) introduced by Nabati et al. [16] can generate rotated bounding boxes, its high computational complexity hinders real-time performance requirements [17].
Transformer-based object detection models, such as DETR [18] and RT-DETR [19], have also attracted growing attention. RT-DETR (Real-Time DETR) combines the global attention mechanism of Transformers with the local feature extraction capability of convolutional neural networks, demonstrating strong potential in object detection. Compared to traditional convolutional neural networks, RT-DETR excels at capturing long-range dependencies, making it particularly suitable for detecting objects in complex backgrounds and at multiple scales. However, despite achieving a good balance between real-time performance and accuracy, RT-DETR still faces limitations in detecting small objects, especially ship identification plates.
In the international research community, several improvements have been proposed to address small object detection in complex scenes. For example, Du et al. [20] introduced a multi-scale feature extraction method based on attention mechanisms to enhance small object detection precision; Li et al. [21] optimized model robustness under complex backgrounds by incorporating sparse attention mechanisms. Nevertheless, the application of these methods to ship plate detection tasks remains underexplored.
Ship identification plates serve as crucial markers for recognizing vessel identities in modern shipping operations. Accurate detection of ship plates enables rapid verification of vessel identity, covering ownership information, registration details, and historical navigation records. This capability is essential for optimizing vessel scheduling, rational berth allocation, and efficient cargo handling at ports [22]. In the domain of waterway traffic safety, real-time monitoring of ship plates helps regulatory authorities precisely track vessel navigation status, effectively prevent unauthorized vessels from entering restricted waters, reduce navigation conflicts and accident risks, and quickly identify involved vessels when incidents occur, thus providing critical information for follow-up handling and rescue.
As the shipping industry advances towards automation and intelligence, ship plate detection forms a fundamental step closely linked to automated port logistics processes such as cargo transportation and handling. The efficiency and accuracy of plate detection directly impact the operational effectiveness of the entire maritime logistics chain and play a vital role in promoting technological upgrades within the industry.
Despite rapid advances in object detection technology, ship plate detection still confronts significant challenges. First, ship plates often appear as small-scale objects in images, especially when captured from long distances or wide-angle views. Their minimal proportion in the image hinders detection models from obtaining sufficient feature information, resulting in missed or false detections. Traditional detection algorithms struggle to extract features from small targets, further exacerbating this issue. Second, ship plates vary greatly in shape, installation position, size, and color across different vessels. Port scenes are complex and contain many structures and equipment resembling ship plates, such as ventilation openings and railings, which interfere with detection and increase false positive rates. Moreover, varying lighting conditions on water—influenced by weather, time, and seasons—including strong light, weak light, backlighting, and water reflections, degrade image quality and thereby reduce detection accuracy. Image resolution and clarity also directly limit detection performance, as low-quality images render ship plate details hard to discern, greatly challenging detection algorithms.
To address these challenges, this paper proposes a multi-module collaborative ship plate detection network, RT-DETR-HPA, which integrates an improved RT-DETR algorithm. The main objective is to develop an efficient and robust ship plate detection model capable of handling complex backgrounds, low resolution, long-distance views, and varying illumination and angles, meeting the shipping industry’s demand for high-precision detection. The primary contributions are summarized as follows:
  • Proposing an improved High-Frequency Enhancement Residual Block (HFERB) and fusing it with the backbone of RT-DETR. The HFERB module adopts a dual-branch structure to process high-frequency information and local features, respectively, followed by effective fusion. Deformable convolution is introduced into HFERB to dynamically adapt to the shape and position changes in license plates caused by occlusion, enabling efficient capture of partially visible license plate information. A spatially adaptive filter is employed in the local branch to monitor nonlinear deformations in real time, ensuring that text and numeric features on tilted or distorted license plates remain clear and distinguishable.
  • Introducing a Pinwheel-shaped Convolution (PConv) employing multi-directional convolution kernels to extract ship plate features from multiple angles, precisely capturing edge contours and local details such as characters and digits, enhancing small target detection accuracy.
  • Employing Adaptive Sparse Self-Attention (ASSA) to improve the AIFI module by automatically selecting important regions and filtering noise, thus improving computational and information utilization efficiency and enhancing the model’s capability to analyze correlations between overall and local ship plate features.
The remainder of this paper is organized as follows: Section 2 describes the methodology, Section 3 details the experimental design and procedures, Section 4 presents the results and analysis, and Section 5 concludes the study.

2. Materials and Methods

2.1. RT-DETR

The improvements proposed in this paper are based on the RT-DETR-ResNet18 model. RT-DETR (Real-Time Detection Transformer) is an advanced object detection framework that combines the global feature extraction capability of the Transformer with the local feature capturing advantages of convolutional neural networks (CNNs), enabling real-time object detection. ResNet18 [23] is a lightweight residual network characterized by a small number of parameters and high computational efficiency, making it well-suited as the backbone network for RT-DETR. Figure 1 illustrates the overall architecture of RT-DETR, which comprises a backbone network, a hybrid encoder, and a decoder integrated with an auxiliary detection head. In Figure 1, the black arrows show the overall data flow from the input image to the final output, the blue arrows indicate cross-scale feature fusion (features from different scales are combined), and the red arrows indicate intra-scale feature interaction (features within the same scale are processed to enhance their representation).
The basic architecture of RT-DETR-ResNet18 consists of the following components:
Backbone: ResNet18 is employed as the backbone network. Thanks to its residual block design, it effectively alleviates the vanishing gradient problem in deep networks and improves training performance. The backbone extracts multi-scale features from the input image. The output features from the last three stages (S3, S4, and S5) serve as the inputs to the encoder.
Efficient Hybrid Encoder: This module efficiently processes multi-scale features by decoupling intra-scale interactions and cross-scale fusion. Specifically, the encoder is composed of two submodules:
  • Attention-based Intra-scale Feature Interaction (AIFI) module: This module performs intra-scale interactions on high-level features (S5) to capture relationships between conceptual entities in the image.
  • CNN-based Cross-scale Feature Fusion Module (CCFM): This module fuses features across different scales to fully leverage multi-scale information.
Decoder: The decoder, equipped with an auxiliary prediction head, iteratively refines the initial object queries to generate the final bounding boxes and confidence scores. RT-DETR introduces an IoU-aware query selection mechanism that provides the decoder with higher-quality initial object queries by enforcing IoU constraints during training.
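To make the data flow described above concrete, the following minimal PyTorch sketch outlines the pipeline under stated simplifications: a ResNet18 backbone produces S3–S5, intra-scale self-attention is applied to the S5 tokens, a plain token concatenation stands in for the CCFM cross-scale fusion, and IoU-aware query selection is omitted. The module granularity and layer choices are illustrative assumptions, not the original RT-DETR implementation.

```python
# Schematic, simplified sketch of the RT-DETR data flow described above
# (backbone -> AIFI on S5 -> cross-scale fusion -> query-based decoder).
# Token concatenation replaces CCFM; IoU-aware query selection is omitted.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class RTDETRSketch(nn.Module):
    def __init__(self, hidden_dim=256, num_queries=300, num_classes=1):
        super().__init__()
        net = resnet18(weights=None)
        # Stages producing S3 (1/8), S4 (1/16), and S5 (1/32) feature maps.
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool, net.layer1)
        self.s3, self.s4, self.s5 = net.layer2, net.layer3, net.layer4
        # 1x1 projections to a common hidden dimension.
        self.proj = nn.ModuleList([nn.Conv2d(c, hidden_dim, 1) for c in (128, 256, 512)])
        # AIFI: intra-scale self-attention applied only to the S5 tokens.
        self.aifi = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
        # Decoder with learned object queries.
        self.queries = nn.Embedding(num_queries, hidden_dim)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(hidden_dim, nhead=8, batch_first=True), num_layers=3)
        self.cls_head = nn.Linear(hidden_dim, num_classes)
        self.box_head = nn.Linear(hidden_dim, 4)

    def forward(self, x):
        c3 = self.s3(self.stem(x))
        c4 = self.s4(c3)
        c5 = self.s5(c4)
        feats = [p(c) for p, c in zip(self.proj, (c3, c4, c5))]
        # Intra-scale interaction on the highest-level feature map (S5).
        s5_tokens = self.aifi(feats[2].flatten(2).transpose(1, 2))        # (B, HW, D)
        # Simplified stand-in for cross-scale fusion: concatenate all tokens.
        memory = torch.cat(
            [f.flatten(2).transpose(1, 2) for f in feats[:2]] + [s5_tokens], dim=1)
        q = self.queries.weight.unsqueeze(0).expand(x.shape[0], -1, -1)
        hs = self.decoder(q, memory)
        return self.cls_head(hs), self.box_head(hs).sigmoid()             # logits, boxes

logits, boxes = RTDETRSketch()(torch.randn(1, 3, 640, 640))
print(logits.shape, boxes.shape)   # torch.Size([1, 300, 1]) torch.Size([1, 300, 4])
```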

2.2. RT-DETR-HPA

To address the complex environmental challenges encountered in aerial ship plate detection—such as extreme aspect ratios, multi-angle rotations, and low resolution—this paper proposes a ship plate detection network called RT-DETR-HPA, based on an improved RT-DETR framework. As shown in Figure 1, RT-DETR-HPA uses RT-DETR-ResNet18 as the baseline. It integrates High-Frequency Enhanced Residual Blocks (HFERB) [24] into the backbone network to enhance multi-scale feature representation, while replacing certain conventional convolutional layers with Pinwheel-shaped Convolutions [25] to optimize local feature extraction capabilities. During the encoder stage, the Attention-based Intra-scale Feature Interaction (AIFI) module employs an Adaptive Sparse Self-Attention (ASSA) mechanism [26] to achieve efficient feature interaction and interference filtering. Finally, the decoder generates high-precision detection results based on the improved feature maps.

2.2.1. High-Frequency Enhancement Residual Block

Ship plates typically appear in images as small regions with high contrast. The High-Frequency Enhancement Residual Block (HFERB) enhances the high-frequency features of ship plates, making them more prominent against complex backgrounds. This enhancement helps improve the detection model’s recognition accuracy, especially in low-resolution or long-distance images, by better capturing critical details of ship plates and reducing false positives and missed detections caused by detail loss.
The HFERB module is designed based on a dual-branch residual network architecture. This dual-branch structure consists of a high-frequency enhancement branch and a local feature extraction branch. The high-frequency enhancement branch focuses on extracting and enhancing high-frequency information, such as edges and textures, from the image. The local feature extraction branch is responsible for capturing the local features of the license plate. Features extracted by both branches are concatenated and fused along the channel dimension, and subsequently combined with the input features via a residual connection. This design simultaneously strengthens the model’s ability to extract high-frequency information while preserving the overall structure and semantic information of the image, thereby enhancing the richness and accuracy of feature representation.
To enable the model to adapt more flexibly to changes in license plate shape and position under complex scenarios like occlusion and deformation, deformable convolution [27] replaces standard convolution operations within the local feature extraction branch. Deformable convolution extends regular convolution by introducing an additional offset learning mechanism. Specifically, for each element in the feature map, deformable convolution learns an offset to dynamically adjust the spatial position of the convolution kernel, thereby better accommodating local geometric variations in the license plate caused by occlusion or deformation.
In addition to deformable convolution, the local feature extraction branch employs spatially adaptive filters [28] to address nonlinear deformations of the license plate resulting from factors such as shooting angles and perspective changes. Spatially adaptive filters can dynamically adjust their filtering parameters based on the spatial location and deformation characteristics of the license plate. This enables real-time monitoring and correction of nonlinear deformations, ensuring the clarity and recognizability of text and digit features on tilted or distorted license plates. Specifically, the network learns a set of filter parameters for each spatial location, which are used to adapt the filtering operation. For instance, for a license plate with a certain tilt angle, the spatially adaptive filter can enhance features along the tilt direction while suppressing interfering information from other directions. This preserves the integrity and clarity of the text and digit features on the plate, improving the model’s recognition accuracy.
The features extracted by the high-frequency enhancement branch (containing enhanced high-frequency information) and the local feature extraction branch (containing local features with deformation and spatially adaptive information) are concatenated and fused along the channel dimension, forming a more comprehensive feature representation. Subsequently, the concatenated features are combined with the input features via a residual connection. This residual connection helps retain the original feature information, mitigating information loss during feature transformation. It also facilitates gradient backpropagation during training, alleviating the vanishing gradient problem in deep networks, thereby enhancing the stability and convergence speed of model training. The specific structure is illustrated in Figure 2.
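As a concrete illustration of this dual-branch design, a minimal PyTorch sketch is given below, assuming torchvision's DeformConv2d for the deformable convolution and a per-pixel sigmoid gate as a simple stand-in for the spatially adaptive filter; the layer sizes and branch widths are illustrative rather than the exact configuration used in this work.

```python
# Minimal sketch of the dual-branch HFERB variant described above.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class HFERB(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        # High-frequency branch: max pooling keeps sharp responses (edges,
        # character strokes); a 3x3 conv then re-encodes them.
        self.hf_branch = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(channels, half, 3, padding=1), nn.GELU())
        # Local branch: deformable conv adapts its sampling points to occluded
        # or deformed plates; offsets are predicted from the input features.
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)
        self.deform = DeformConv2d(channels, half, 3, padding=1)
        # "Spatially adaptive filter": a per-location gate modulating the branch.
        self.adaptive_gate = nn.Sequential(nn.Conv2d(channels, half, 1), nn.Sigmoid())
        # Fuse the concatenated branches and restore the channel count.
        self.fuse = nn.Conv2d(2 * half, channels, 1)

    def forward(self, x):
        hf = self.hf_branch(x)
        local = self.deform(x, self.offset(x)) * self.adaptive_gate(x)
        return x + self.fuse(torch.cat([hf, local], dim=1))   # residual connection

y = HFERB(64)(torch.randn(1, 64, 80, 80))
print(y.shape)   # torch.Size([1, 64, 80, 80])
```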
By integrating the improved HFERB module into the RT-DETR backbone, the model can more effectively capture and utilize high-frequency information at multiple scales, thereby enhancing its perception of target details. This integration not only preserves the Transformer architecture’s strength in global information capture but also augments the model’s understanding of local details, contributing to improved detection accuracy—particularly when processing images with rich textures and complex backgrounds.

2.2.2. Pinwheel-Shaped Convolution

In ship license plate detection tasks, the license plates are prone to significant tilting and rotation due to changes in hull positions and shooting angles. Additionally, as small targets in images, license plates require critical local details such as edge contours, characters, and numbers for accurate detection. The fixed receptive field of traditional convolution kernels fails to flexibly capture features in diverse directions, leading to notable degradation in detection performance under tilting or rotating conditions. To comprehensively and effectively capture local detail information across multiple directions, this study innovatively introduces the Pinwheel-shaped Convolution. Through a three-stage design of asymmetric padding → directional convolution → feature fusion (Figure 3), it enables rotation-adaptive local feature extraction.
The Pinwheel-shaped Convolution module simulates a windmill-like convolution pattern, extracting target features from multiple directions. The PConv module applies asymmetric padding to the input tensor X and performs four groups of parallel convolution operations in the first layer. Different padding parameters p_1, p_2, p_3, p_4 are set for different regions, corresponding to convolution kernels K_1, K_2, K_3, K_4, respectively, where each p_i specifies the number of pixels padded on the left, right, top, and bottom sides. Each convolution kernel K_i convolves with the padded input, producing four feature maps F_1, F_2, F_3, F_4. This process can be expressed as follows:
F_i = \mathrm{Conv}(X, K_i, p_i), \quad i = 1, 2, 3, 4,
where Conv denotes the convolution operation, K_i is the i-th convolution kernel, and p_i is the corresponding padding parameter.
The four feature maps F_1, F_2, F_3, F_4 obtained from the first-layer parallel convolutions are concatenated (Cat) along the channel dimension to form a new tensor F_cat. Through concatenation, features extracted from different directions are fused, enriching the feature information. The concatenation operation is defined as follows:
F_{cat} = \mathrm{Cat}(F_1, F_2, F_3, F_4),
where Cat represents concatenation along the channel axis.
The concatenated tensor F_cat then undergoes a convolution operation with a kernel K_norm without padding. This convolution normalizes the fused features and adjusts the output feature map's channel number to meet the input requirements of subsequent network layers, resulting in the final output feature map, F_out. The normalization convolution is expressed as follows:
F_{out} = \mathrm{Conv}(F_{cat}, K_{norm}, p = 0),
where K_norm denotes the convolution kernel used to fuse the concatenated features and adjust the number of channels, and p represents the padding size of the convolution operation; p = 0 indicates that no padding is applied.
To enhance training stability and speed, Batch Normalization (BN) and the Sigmoid Linear Unit (SiLU) activation function are applied after each convolutional operation. Batch Normalization accelerates convergence and mitigates gradient vanishing or exploding issues; SiLU introduces nonlinearity, enabling the model to learn more complex feature relationships. The operation is given by the following:
F_{bn} = \mathrm{BN}(F_{out}),
F_{final} = \mathrm{SiLU}(F_{bn}),
where BN denotes batch normalization and SiLU denotes the Sigmoid Linear Unit activation.
Through this structural design, the PConv module extracts ship plate features from multiple angles, especially accurately capturing edge contours and local details such as characters and digits. This multi-directional feature extraction not only enriches the feature representation but also improves the model’s adaptability to ship plates under varying viewpoints. Consequently, it helps achieve more accurate localization and recognition of ship plates within complex backgrounds. In practical applications, the PConv module effectively enhances the model’s sensitivity to ship plate details and reduces missed and false detections caused by viewpoint changes and small target sizes, thereby improving overall ship plate detection performance.
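A hedged sketch of the three-stage design (asymmetric padding → directional convolution → fusion with BN and SiLU) is shown below; the specific 1 × 3 and 3 × 1 kernel shapes and the padding tuples p_i are assumptions for illustration, not the exact parameters of the PConv layers used here.

```python
# Sketch of the pinwheel-shaped convolution (PConv): four asymmetrically
# padded directional convolutions, channel concatenation, and a padding-free
# fusion convolution followed by BN + SiLU, mirroring the equations above.
import torch
import torch.nn as nn

class PConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        c = out_ch // 4
        # Asymmetric padding (left, right, top, bottom) paired with directional
        # 1x3 / 3x1 kernels, giving four "pinwheel blade" receptive fields.
        self.branches = nn.ModuleList([
            nn.Sequential(nn.ZeroPad2d((2, 0, 0, 0)), nn.Conv2d(in_ch, c, (1, 3), stride=stride)),
            nn.Sequential(nn.ZeroPad2d((0, 2, 0, 0)), nn.Conv2d(in_ch, c, (1, 3), stride=stride)),
            nn.Sequential(nn.ZeroPad2d((0, 0, 2, 0)), nn.Conv2d(in_ch, c, (3, 1), stride=stride)),
            nn.Sequential(nn.ZeroPad2d((0, 0, 0, 2)), nn.Conv2d(in_ch, c, (3, 1), stride=stride)),
        ])
        # F_out = Conv(F_cat, K_norm, p = 0): fuse and set the output channels.
        self.norm_conv = nn.Conv2d(4 * c, out_ch, 1, padding=0)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        f_cat = torch.cat([b(x) for b in self.branches], dim=1)   # Cat(F1..F4)
        return self.act(self.bn(self.norm_conv(f_cat)))           # BN + SiLU

y = PConv(64, 128)(torch.randn(1, 64, 80, 80))
print(y.shape)   # torch.Size([1, 128, 80, 80])
```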

2.2.3. Adaptive Sparse Self-Attention

In complex ship plate detection scenarios, images often contain various sources of interference, such as port background facilities, water reflections, and other ship components. To help the model focus on regions containing ship plates and filter out these distractions, this paper introduces an Adaptive Sparse Self-Attention (ASSA) mechanism to improve the Attention-based Intra-scale Feature Interaction (AIFI) module. This enhancement strengthens the correlation analysis between global and local features of ship plates, further improving detection accuracy and robustness.
ASSA consists of two branches: Sparse Self-Attention (SSA) and Dense Self-Attention (DSA). As shown in Figure 4, SSA employs a ReLU-based sparse attention mechanism that filters out irrelevant interactions with low matching scores between queries and keys. This reduces the participation of ineffective features and helps concentrate on the most valuable information exchanges. The computation of SSA is defined as follows:
\mathrm{SSA} = \mathrm{ReLU}^{2}\left( QK^{T}/\sqrt{d} + B \right) V,
where Q, K, and V represent Query, Key, and Value matrices, respectively, d is the dimensionality of the keys, and B denotes learnable relative positional bias. In this context, the ReLU function sets elements in the attention weight matrix that are below zero to 0, thereby creating a sparse attention mechanism.
DSA utilizes the standard softmax-based dense attention mechanism to complement SSA, ensuring that critical information is not lost during the sparse filtering process. The DSA calculation is expressed as follows:
\mathrm{DSA} = \mathrm{SoftMax}\left( QK^{T}/\sqrt{d} + B \right) V,
In Equation (7), we adopt the standard Softmax-based Dense Self-Attention mechanism. By normalizing all attention weights into a probability distribution via the Softmax function, this approach enables the model to comprehensively consider the relationships between all Key elements and Query elements.
The ASSA module dynamically fuses SSA and DSA outputs via an adaptive weight coefficient α, which is learned during training. This mechanism enables the model to prioritize sparse attention for noise suppression (e.g., water reflections) and dense attention for critical feature retention (e.g., plate characters), achieving a balance between efficiency and accuracy. The fusion is calculated as follows:
\mathrm{ASSA}(Q, K, V) = \alpha \, \mathrm{SSA}(Q, K, V) + (1 - \alpha) \, \mathrm{DSA}(Q, K, V),
where α is an adaptive weight coefficient automatically learned by the model during training.
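The following single-head PyTorch sketch illustrates the SSA and DSA branches and their adaptive fusion as defined above; the dense relative bias B, the scalar mixing weight, and the sigmoid used to keep α within (0, 1) are simplifying assumptions for illustration.

```python
# Minimal single-head sketch of adaptive sparse self-attention (ASSA).
import torch
import torch.nn as nn

class ASSA(nn.Module):
    def __init__(self, dim, num_tokens):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.bias = nn.Parameter(torch.zeros(num_tokens, num_tokens))  # relative bias B
        self.alpha = nn.Parameter(torch.tensor(0.5))                   # adaptive weight
        self.scale = dim ** -0.5
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, N, D) token sequence
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) * self.scale + self.bias
        ssa = torch.relu(logits).pow(2) @ v    # sparse: ReLU^2 zeroes weak matches
        dsa = logits.softmax(dim=-1) @ v       # dense: keeps all interactions
        a = torch.sigmoid(self.alpha)          # keep the mixing weight in (0, 1)
        return self.proj(a * ssa + (1 - a) * dsa)

tokens = torch.randn(1, 400, 256)              # e.g. a flattened 20x20 S5 feature map
print(ASSA(256, 400)(tokens).shape)            # torch.Size([1, 400, 256])
```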
By incorporating the ASSA module, the model can more effectively focus on critical regions of ship plate targets and reduce background interference, enhancing detection precision. Especially under varying illumination conditions, different viewing angles, and partial occlusions of ship plates, ASSA helps maintain stable detection performance. Experimental results demonstrate that the ASSA module significantly improves robustness and accuracy in ship plate detection tasks under complex backgrounds.

3. Results

3.1. Dataset Construction and Preprocessing

This study utilizes a self-constructed fishing vessel image dataset, which encompasses fishing vessel images captured under various backgrounds and lighting conditions. The following describes the detailed process of dataset construction and preprocessing.

3.1.1. Data Acquisition Environment and Equipment Configuration

The data collection was conducted at seven representative fishing ports located in coastal cities of China. These include Shazikou Central Fishing Port in Qingdao, Shandong Province (with a capacity of 800 vessels), the first-class fishing port in Dongying (capacity of 400 vessels), Dongkou Fishing Port, Xikou Wharf, Datuan West Wharf, and Datuan East Wharf in Yantai City (each with an approximate capacity of 800 fishing vessels), as well as the Lianhuashan National Central Fishing Port in Panyu, Guangdong Province (capacity of 1000 vessels). These ports serve as typical fisheries hubs in coastal regions, featuring diverse types of fishing vessels and complex port entry and exit scenarios, as shown in Figure 5. In Figure 5a,b, the light blue shapes enclosed by red dashed lines are port areas. The circular icons indicate camera positions, and the yellow icons represent ship AIS (Automatic Identification System) locations. In Figure 5c, the circular icons denote camera positions, the red frame designates the fishing port area, and the yellow icons show the latitude-longitude positioning of the fishing port.
The imaging equipment includes a dual-spectrum camera (model: DH-TPC-PT8441B) and an infrared intelligent dome camera (model: DH-SD-6C3432XB-HNR), both providing a visible light resolution of 2560 pixels × 1440 pixels. The devices were installed on fixed brackets at a height of 3 m above the entry and exit channels of the fishing ports, covering panoramic views of the channels. During nighttime data acquisition, infrared supplementary lighting was used to ensure image clarity.

3.1.2. Data Collection Scheme Design

A systematic data collection scheme was designed to ensure diversity and representativeness of the samples. The camera operated continuously over extended periods, and images were gathered at regular intervals from the video feed. This approach allowed capturing scenes at different times of day (including daylight and night, using infrared) and under varying illumination conditions. The data collection covered multiple scenarios and viewpoints—for example, the camera recorded activity across different days and environmental conditions—to avoid bias and overfitting to any single scene. By distributing the capture times and conditions, the scheme ensured that the resulting dataset reflects a wide range of real-world situations for the subsequent model training.

3.1.3. Data Statistics and Sample Distribution

A total of approximately 10,000 frames of image data were collected for this study. Daytime samples accounted for 80% to simulate routine operational scenarios, while nighttime samples made up 20% to test the model’s robustness in low-light environments. In terms of background settings, 80% were simple backgrounds and 20% were complex backgrounds, including scenes with dense fishing vessels and occlusion.
Additionally, approximately 15% of the samples exhibited significant rotation (>15°) or perspective distortion to enhance the model’s adaptability to multi-angle license plates, as shown in Figure 6. This distribution design ensures that the dataset comprehensively covers core challenges in license plate detection, such as small target recognition, handling extreme aspect ratios, distinguishing densely arranged objects, and detecting multi-angle rotations, providing a robust data foundation for subsequent model training and testing.

3.1.4. Data Cleaning and Storage

LabelImg was selected as the unified image annotation tool for this data labeling task due to its ease of use and suitability for our specific annotation requirements.
For the collected images of ship license plates (vessel identification numbers), horizontal bounding boxes were used to mark the entire visible text region. Partially occluded plates with at least 50% of the text visible were included in the annotation scope, while plates with more than 50% occlusion or completely missing text regions were excluded. Additionally, images in which the fishing vessel was not prominent were cropped so that the vessel was clearly visible and centered, thus maintaining data quality.
The cleaned data was categorized and stored according to date and scene, following a naming convention of “type code + date-time” to facilitate efficient and accurate data retrieval. The JPG format was chosen for image storage, balancing reduced storage space with acceptable image quality. Annotation results were saved in the Pascal VOC standard XML format, which is conducive to subsequent model training and evaluation, laying a solid foundation for the successful construction and performance improvement of the model.

3.1.5. Data Augmentation Strategy

To enhance the model’s adaptability to different scenarios, lighting conditions, and angles, data augmentation was primarily performed using two methods: random rotation and brightness adjustment. Specifically, random rotation was applied within the range of [−15°, 15°] to simulate various tilting angles of ship names, effectively addressing the challenge of multi-angle rotations caused by changes in ship orientation or viewing perspective. Brightness adjustment involved randomly modifying the image brightness within the range of [0.7, 1.3] to simulate complex lighting conditions such as strong light, backlight, low light, and water surface reflection, thereby improving the model’s adaptability to different lighting environments.
To ensure the comprehensiveness of the augmented data while reasonably controlling the sample distribution and training scale, the following strategy was adopted: 50% of the sample data was randomly selected for random rotation augmentation, and the other 50% was used for brightness adjustment augmentation. After data augmentation, the total number of samples expanded from the original 10,000 frames to 20,000 frames, significantly increasing the data sample size and covering various scenarios and interference factors. Examples of the dataset are shown in Figure 7.
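A minimal sketch of this augmentation policy, assuming torchvision transforms, is given below; in the actual pipeline the bounding-box annotations must also be transformed under rotation, a step omitted here.

```python
# Sketch of the augmentation policy described above: half of the frames get a
# random rotation in [-15 deg, +15 deg], the other half a brightness scaling
# in [0.7, 1.3]; each augmented copy is added alongside the original frame.
import random
from PIL import Image
from torchvision import transforms

rotate = transforms.RandomRotation(degrees=15)             # angles in [-15, +15]
brighten = transforms.ColorJitter(brightness=(0.7, 1.3))   # brightness factor in [0.7, 1.3]

def augment(img: Image.Image) -> Image.Image:
    # 50/50 split between the two augmentation types.
    return rotate(img) if random.random() < 0.5 else brighten(img)

demo = Image.new("RGB", (256, 128), color=(30, 60, 90))    # placeholder frame
print(augment(demo).size)                                  # (256, 128)
```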

3.1.6. Dataset Partitioning Strategy

To verify the model’s effectiveness and generalization ability, the dataset was divided into a training set, a validation set, and a test set. As shown in Figure 8, the division ratio was 70%, 20%, and 10%, dedicated to model training, parameter tuning, and performance evaluation, respectively. The dataset originated from actual scene images captured by cameras at the entrances and exits of the fishing ports, comprising 20,000 images covering a variety of complex scenarios. After division, the training set contained 14,000 samples, the validation set 4000, and the test set 2000.
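For illustration, a reproducible 70/20/10 split of the 20,000-image list could be generated as follows; the file paths and random seed are hypothetical.

```python
# Reproducible 70/20/10 split of the 20,000-image list (hypothetical paths).
import random

paths = [f"data/images/{i:05d}.jpg" for i in range(20000)]
random.Random(42).shuffle(paths)                 # fixed seed for reproducibility
train, val, test = paths[:14000], paths[14000:18000], paths[18000:]
print(len(train), len(val), len(test))           # 14000 4000 2000
```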

3.2. Experimental Platform and Environment Configuration

The experiments were conducted using the PyTorch framework on a Linux operating system. An NVIDIA RTX 4090 GPU, together with CUDA 11.8, was employed to accelerate model training and inference. The detailed hardware and software configurations are summarized in Table 1.

3.3. Evaluation Metrics

The quantitative evaluation standards included four metrics: Precision (P), Recall (R), Average Precision (AP), and mean Average Precision (mAP). Their calculation formulas are as follows:
\mathrm{Precision} = \frac{TP}{TP + FP},
\mathrm{Recall} = \frac{TP}{TP + FN},
AP = \int_{0}^{1} \mathrm{Precision}(t) \, dt,
mAP = \frac{1}{N} \sum_{n=1}^{N} AP_{n},
where TP is the number of correctly detected targets, FP is the number of background regions misclassified as targets, FN is the number of undetected targets, n indexes the detection categories, and N is the total number of categories. AP measures the detection accuracy for a single category, and mAP is the average of all AP values, indicating the overall detection accuracy; a higher mAP indicates more robust detection. mAP can be calculated at different IoU thresholds; this study used mAP@50 and mAP@50–90. In practical object detection, model performance depends not only on accuracy but also on detection speed, measured in FPS, the number of images the model can process per second.
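As a small numeric illustration of these definitions, the snippet below computes precision and recall from made-up counts and approximates AP as the area under a discrete precision-recall curve; with a single ship-plate class, mAP reduces to AP.

```python
# Numeric illustration of the metrics above, using made-up detection counts;
# AP is approximated as the area under a discrete precision-recall curve.
tp, fp, fn = 90, 6, 10
precision = tp / (tp + fp)                       # 0.9375
recall = tp / (tp + fn)                          # 0.9

# Hypothetical precision values sampled at increasing recall levels.
recall_pts = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
precision_pts = [1.0, 0.98, 0.95, 0.93, 0.90, 0.75]
# Trapezoidal approximation of AP = integral of precision over recall.
ap = sum((r2 - r1) * (p1 + p2) / 2
         for r1, r2, p1, p2 in zip(recall_pts, recall_pts[1:],
                                   precision_pts, precision_pts[1:]))
map50 = ap / 1                                   # one class: mAP equals AP
print(round(precision, 4), round(recall, 4), round(map50, 4))
```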

3.4. Training Process

1. Learning Rate Strategy
During the training process, the initial learning rate was set to 0.001. A cosine annealing scheduler was employed to gradually reduce the learning rate. This strategy ensures stable model training and accelerates convergence. It allows for rapid updates to the model parameters in the early stages of training, while the gradual reduction in the learning rate in later stages helps refine the optimization process and prevents overfitting.
2. Batch Size
Each training batch consisted of 32 images, and the training was conducted for a total of 200 epochs. This configuration ensured that the model could adequately learn the data distribution characteristics.
3. Regularization Measures
To prevent overfitting, regularization methods were incorporated into the training process. Dropout was set to 0.1, which involves randomly dropping some neuron connections to enhance the model’s generalization ability. Additionally, weight decay was set to 1 × 10−4, applying L2 regularization to constrain the size of the weights and further improve the model’s stability.
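The optimization setup described above can be sketched as follows. The learning rate, cosine annealing schedule, weight decay, dropout rate, and epoch count follow the stated values, while the AdamW optimizer and the tiny dummy model are illustrative assumptions.

```python
# Optimization setup with the stated hyperparameters; AdamW and the dummy
# model are illustrative assumptions standing in for RT-DETR-HPA.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.Dropout(p=0.1), nn.Linear(32, 4))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):                          # 200 training epochs
    # Stand-in training step; the real loop iterates over 32-image batches.
    loss = model(torch.randn(32, 16)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                              # cosine decay once per epoch
print(round(scheduler.get_last_lr()[0], 8))       # learning rate decayed to ~0
```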
During training, the loss values of the training set and validation set were recorded and plotted to create the loss curves shown in Figure 9. In the early stages of training, the loss values were relatively high but gradually converged as training progressed. The validation set loss values remained stable, indicating that the model had achieved good generalization performance at this point.

4. Discussion

4.1. Ablation Study Analysis

To systematically evaluate the effectiveness of the proposed improvements, this study employed a progressive ablation experiment based on the controlled-variables method. Using the original RT-DETR model as a baseline, it sequentially incorporated the three modules: the High-Frequency Enhancement Residual Block (HFERB), the Pinwheel-shaped Convolution (PConv), and the Adaptive Sparse Self-Attention (ASSA) mechanism. By comparing performance differences across various module combinations, the study verified each component’s contribution, with results presented in Table 2.
Analysis of the single-module improvements shows the following: the HFERB module increases mAP50 by 0.77% (93.76% → 94.53%) but reduces FPS by 3.5% (37.5 → 36.2), indicating that multi-scale feature fusion enhances representational capacity at the cost of increased computational load. The PConv module achieves a 2.1% FPS gain (37.5 → 38.3) while improving accuracy, verifying its effectiveness in optimizing model efficiency through reduced redundant computation. The ASSA module exhibits the most significant improvement, boosting precision by 1.45% (92.44% → 93.89%) while substantially increasing FPS by 12.8% (37.5 → 42.3), indicating that the sparse attention mechanism effectively suppresses interference from complex aquatic backgrounds.
Investigation of the module combinations reveals the following: when HFERB and PConv work together, recall improves significantly by 2.83% (91.65% → 94.48%), demonstrating complementarity between local feature enhancement and receptive field expansion. The combination of HFERB and ASSA achieves a mAP50–90 of 61.78%, a 2.60 percentage point increase over the baseline, illustrating that the attention mechanism enhances the discriminability of multi-scale features. Meanwhile, the PConv + ASSA pair maintains high operational efficiency (40.9 FPS) while delivering the best accuracy among the two-module combinations (mAP50 of 96.60%), reflecting an excellent balance between computational efficiency and detection accuracy.
Validation of the full-module integration scheme confirms: The complete model incorporating HFERB, PConv, and ASSA achieves comprehensive breakthroughs in accuracy metrics: mAP50 reaches 97.12% (+3.36% over baseline), mAP50–95 attains 61.90% (+2.72%), while inference speed remains at 40.1 FPS—meeting real-time detection requirements. These results verify synergistic interactions between modules, enabling performance leaps even under computational constraints.
Experimental conclusions indicate:
  • All proposed modules significantly enhance ship detection performance.
  • HFERB and PConv exhibit positive synergy at the feature extraction level, while ASSA enhances feature discriminability through its sparse attention mechanism.
  • The module combinations achieve an optimal accuracy-speed trade-off, providing a reliable solution for practical engineering applications.

4.2. Comparative Experimental Analysis

To comprehensively evaluate the performance of the proposed RT-DETR-HPA model, this section conducts comparative experiments with representative two-stage detectors (e.g., Faster R-CNN) and single-stage detectors (e.g., YOLOv3m, YOLOv5m, YOLOv8m, YOLOv12m), as well as the original RT-DETR model. All experiments are performed under identical datasets and training conditions. The results are presented in Table 3.
From the data presented in Table 3, it is evident that the traditional two-stage detector, Faster R-CNN, performs poorly across all evaluation metrics. Its mAP@50 reaches only 52.41%, and mAP@50–90 is 39.86%, both significantly lower than those of the other models. Furthermore, it achieves only 20.5 FPS, the lowest among all compared models. These results indicate that the region proposal-based detection paradigm is limited not only in accuracy but also in inference speed, rendering it unsuitable for ship plate detection tasks that demand both high precision and real-time performance.
In contrast, the YOLO series of single-stage detectors (medium-scale models) demonstrates substantially better overall performance. As the series evolves (YOLOv3m → YOLOv5m → YOLOv8m → YOLOv12m), both mAP@50 and mAP@50–90 steadily improve. For instance, YOLOv12m achieves an mAP@50 of 91.65%, mAP@50–90 of 60.03%, and maintains a high inference speed of 52.6 FPS, striking a favorable balance between accuracy and efficiency. However, the recall values across the YOLO models increase relatively slowly. Even for YOLOv12m, the recall is 89.41%, indicating a persistent risk of missed detections in complex scenes with occlusions or dense backgrounds.
Notably, the RT-DETR model, based on a Transformer architecture, exhibits strong performance in both recall and precision. Compared with YOLOv12m, RT-DETR improves recall and mAP@50 by 2.24 and 2.11 percentage points, respectively. Although its FPS (37.5) is lower than that of the YOLO models, it remains within the acceptable range for real-time detection. These results verify the advantage of the Transformer-based architecture in modeling long-range dependencies, which is particularly beneficial for handling scale variations and occlusion in ship plate detection tasks.
Building on this, the proposed RT-DETR-HPA model achieves the best performance across all metrics. Compared to the original RT-DETR, it improves precision to 96.26%, recall to 94.88%, mAP@50 to 97.12%, and mAP@50–90 to 61.90%, while also boosting inference speed to 40.1 FPS. When compared to YOLOv12m, RT-DETR-HPA demonstrates superior accuracy, with mAP@50 and recall increased by 5.47 percentage points each, significantly reducing the risk of missed detections. These experimental results clearly confirm the effectiveness of the proposed modules: HFERB enhances multi-scale feature representation, PConv improves multi-angle local detail extraction, and ASSA strengthens spatial feature discrimination through background suppression. Additionally, the model’s high mAP@50–90 further reflects its robustness at higher IoU thresholds, which is crucial for precise localization in real-world ship plate recognition applications.
To assess the actual detection performance of each model, several images from real ship plate detection scenarios were selected for testing, covering good and poor lighting conditions, and simple and complex backgrounds. The visualization results in Figure 10 show that the improved model can more accurately locate ship plates in complex backgrounds, with clearer recognition of edges and text details, and fewer false and missed detections, further confirming the effectiveness of the proposed improvements.

5. Conclusions

This paper proposes RT-DETR-HPA, a ship plate detection model based on an enhanced RT-DETR framework. The model incorporates three key modules: the High-Frequency Enhanced Residual Block (HFERB) to strengthen edge feature extraction, the Pinwheel-shaped Convolution (PConv) to improve multi-angle adaptability, and the Adaptive Sparse Self-Attention (ASSA) mechanism to suppress complex background interference. Evaluated on a self-constructed dataset containing 20,000 ship plate images, RT-DETR-HPA achieves 97.12% mAP@50, a recall rate of 94.88%, and a real-time inference speed of 40.1 FPS. These results significantly outperform mainstream detection models and demonstrate the model’s effectiveness in mitigating missed and false detections caused by low resolution, poor lighting, and dense occlusions in practical maritime environments.
Despite its strong overall performance, RT-DETR-HPA still shows a degree of performance degradation under extremely low-light conditions and adverse weather (e.g., heavy rain or fog). This may be attributed to the limited representation of such scenarios in the training dataset, leading to reduced generalization capability. Additionally, in cases of heavy occlusion or extreme rotation of the ship plates, the model may still suffer from occasional false positives or missed detections.
To address these limitations, future research will focus on the following directions: (1) enriching the training dataset with a higher proportion of extreme environmental and viewpoint samples, such as drawing on the generative adversarial network framework proposed by Chen et al. [29] to synthesize a more diverse severe-weather dataset, and introducing more targeted data augmentation strategies to enhance model robustness under challenging conditions; (2) conducting systematic evaluations of detection performance under varying target sizes, image resolutions, and viewing distances, as well as performing fine-grained comparative experiments across different maritime background scenarios to further validate the model’s applicability and reliability in real-world deployments; and (3) exploring lightweight attention head pruning techniques and dynamic resolution adaptation mechanisms to enable deployment on resource-constrained mobile maritime surveillance devices, thereby promoting the practical integration of RT-DETR-HPA into intelligent shipping systems.

Author Contributions

Conceptualization, L.Z.; Methodology, L.Z.; Software, L.Z.; Validation, L.Z.; Formal analysis, L.Z.; Investigation, L.Z.; Resources, L.Z. and L.H.; Data curation, L.Z.; Writing—original draft, L.Z.; Writing—review & editing, L.Z.; Visualization, L.Z.; Supervision, L.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zou, Y.; Zhang, Y.; Wang, S.; Jiang, Z.; Wang, X. Ship regulatory method for maritime mixed traffic scenarios based on key risk ship identification. Ocean Eng. 2024, 298, 117105. [Google Scholar] [CrossRef]
  2. Xu, F.; Chen, C.; Shang, Z.; Peng, Y.; Li, X. A CRNN-based method for Chinese ship license plate recognition. IET Image Process. 2024, 18, 298–311. [Google Scholar] [CrossRef]
  3. Dan, W.; Yan, J. Ship collision risk analysis in port waters integrating GRA algorithm and BPNN. Transp. Saf. Environ. 2025, 7, tdaf012.1. [Google Scholar]
  4. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  5. Khanam, R.; Hussain, M. What is yolov5: A deep look into the internal features of the popular object detector. arXiv 2024, arXiv:2407.20892. [Google Scholar]
  6. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  7. Wang, C.; Bochkovskiy, A.; Liao, H. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2023, arXiv:2207.02696. [Google Scholar]
  8. Liu, Q.; Jiang, R.; Xu, Q.; Wang, D.; Sang, Z.; Jiang, X. Yolov8n_bt: Research on classroom learning behavior recognition algorithm based on improved yolov8n. IEEE Access 2024, 12, 36391–36403. [Google Scholar] [CrossRef]
  9. Wang, C.; Yeh, I.; Liao, H.M. Yolov9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 1–21. [Google Scholar]
  10. Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024. [Google Scholar]
  11. Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  12. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  13. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision 2016, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
  14. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  15. Zhang, R.; Zhang, L.; Su, Y.; Yu, Q.; Bai, G. Automatic vessel plate number recognition for surface unmanned vehicles with marine applications. Front. Neurorobotics 2023, 17, 1131392. [Google Scholar] [CrossRef]
  16. Nabati, R.; Qi, H. RRPN: Radar Region Proposal Network for Object Detection in Autonomous Vehicles. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019. [Google Scholar] [CrossRef]
  17. Zhou, C.; Liu, D.; Wang, T.; Tian, J. M3 ANet: Multi-Modal and Multi-Attention Fusion Network for Ship License Plate Recognition. IEEE Trans. Multimed. 2023, 26, 5976–5986. [Google Scholar] [CrossRef]
  18. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Computer Vision–ECCV 2020, Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020. [Google Scholar]
  19. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  20. Du, Z.; Liang, Y. Object detection of remote sensing image based on multi-scale feature fusion and attention mechanism. IEEE Access 2024, 12, 8619–8632. [Google Scholar] [CrossRef]
  21. Li, Y.; Wang, L.; Chen, S. Visual Attention Guided Sparse Reconstruction for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5005615. [Google Scholar] [CrossRef]
  22. Liu, B.; Wu, S.; Zhang, S.; Hong, Z.; Ye, X. Ship license numbers recognition using deep neural networks. J. Phys. Conf. Ser. 2018, 1060, 012064. [Google Scholar] [CrossRef]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
  24. Li, A.; Zhang, L.; Liu, Y.; Zhu, C. Feature modulation transformer: Cross-refinement of global representation via high-frequency prior for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 12514–12524. [Google Scholar]
  25. Yang, J.; Liu, S.; Wu, J.; Su, X.; Hai, N.; Huang, X. Pinwheel-shaped convolution and scale-based dynamic loss for infrared small target detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 9202–9210. [Google Scholar]
  26. Zhou, S.; Chen, D.; Pan, J.; Shi, J.; Yang, J. Adapt or Perish: Adaptive Sparse Transformer with Attentive Feature Refinement for Image Restoration. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024. [Google Scholar] [CrossRef]
  27. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef]
  28. Arthanari, S.; Moorthy, S.; Jeong, J.H.; Joo, Y.H. Adaptive spatially regularized target attribute-aware background suppressed deep correlation filter for object tracking. Signal Process. Image Commun. 2025, 136, 117305. [Google Scholar] [CrossRef]
  29. Chen, X.; Wei, C.; Xin, Z.; Zhao, J.; Xian, J. Ship detection under low-visibility weather interference via an ensemble generative adversarial network. J. Mar. Sci. Eng. 2023, 11, 2065. [Google Scholar] [CrossRef]
Figure 1. Architecture of the RT-DETR model. The black arrow shows the entire network data flow from the input image to the final output. The blue arrow indicates cross-scale feature interaction, where features from different scales are fused. The red arrow represents intra-scale feature interaction, where features within the same scale are processed to enhance their representation.
Figure 2. HFERB module with deformable convolution and spatially adaptive filter. The circle with a plus sign inside represents the element-wise addition of the residual connection, where the input is added to the output of the layer block. This is a common technique in residual networks that helps mitigate the vanishing gradient problem and improves training stability.
Figure 3. The structure of the PConv module. The arrows numbered 1–4 represent the sequential positions of the convolutional kernel covering the input feature map, illustrating the concept of the receptive field in a CNN. Each arrow corresponds to an input image region that affects the output of a specific neuron, and the receptive field expands with increasing network depth. This sequence demonstrates how successive layers aggregate spatial information to form a larger effective coverage area.
Figure 4. The structure of the ASSA module. The dotted line frames the entire module, and each colored block is labeled with the operation it performs. A dot (•) inside a circle indicates a dot product (a linear algebraic operation on vectors or matrices), a plus sign (⊕) inside a circle signifies element-wise addition, and a multiplication sign (⊗) inside a circle represents matrix multiplication. These notations clarify the specific operations applied to the data within the framework.
Figure 5. (a) Qingdao Shazikou Central Fishing Port; (b) Qingdao Dongying Level-1 Fishing Port; (c) Yantai Dongkou and Xikou Fishing Port. In (a,b), the light blue shapes enclosed by red dashed lines are port areas. The circular icons indicate camera positions, and the yellow icons represent ship AIS (Automatic Identification System) locations. In (c), the circular icons denote camera positions, the red frame designates the fishing port area, and the yellow icons show the latitude-longitude positioning of the fishing port.
Figure 6. Dataset Examples. (a) Good lighting (frontal view); (b) poor lighting; (c) simple background (without rotation); (d) complex background (with occlusion + perspective distortion).
Figure 7. Data augmentation. (a) Brightness adjustment; (b) random rotation.
Figure 8. Visualization of data division ratio.
Figure 9. Training and validation loss curves.
Figure 10. Examples of detection results for the models. (a) YOLOv3s method; (b) YOLOv5s method; (c) YOLOv8s method; (d) YOLOv12s method; (e) RT-DETR-HPA; (f) RT-DETR-HPA.
Table 1. Experimental platform environment configuration.
Hardware | Software
GPU: RTX 4090 (24 GB × 4) | OS: Ubuntu 20.04 LTS
CPU: Intel Xeon 8375C | Framework: PyTorch 2.0
Memory: 64 GB | CUDA Version: 11.8
Storage: 16 TB SSD | Libraries: NumPy 1.26.4, Matplotlib 3.9.4
Table 2. Ablation experimental data.
Methods | Precision (%) | Recall (%) | mAP50 (%) | mAP50–90 (%) | FPS
RT-DETR | 92.44 | 91.65 | 93.76 | 59.18 | 37.5
RT-DETR + HFERB | 93.25 | 92.70 | 94.53 | 60.39 | 36.2
RT-DETR + PConv | 93.56 | 92.48 | 94.37 | 60.12 | 38.3
RT-DETR + ASSA | 93.89 | 92.35 | 94.65 | 60.41 | 42.3
RT-DETR + HFERB + PConv | 94.65 | 94.48 | 96.41 | 61.72 | 38.7
RT-DETR + HFERB + ASSA | 95.34 | 94.23 | 96.28 | 61.78 | 40.7
RT-DETR + PConv + ASSA | 95.84 | 94.45 | 96.60 | 61.35 | 40.9
RT-DETR + HFERB + PConv + ASSA | 96.26 | 94.88 | 97.12 | 61.90 | 40.1
Table 3. Experimental data comparing multiple models.
Methods | Precision (%) | Recall (%) | mAP50 (%) | mAP50–90 (%) | FPS
Faster R-CNN | 82.35 | 78.68 | 52.41 | 39.86 | 20.5
YOLOv3m | 88.90 | 81.12 | 65.34 | 49.26 | 28.6
YOLOv5m | 91.80 | 84.25 | 82.75 | 56.83 | 39.2
YOLOv8m | 93.65 | 86.38 | 88.23 | 58.74 | 44.1
YOLOv12m | 95.28 | 89.41 | 91.65 | 60.03 | 52.6
RT-DETR | 92.44 | 91.65 | 93.76 | 59.18 | 37.5
RT-DETR-HPA | 96.26 | 94.88 | 97.12 | 61.90 | 40.1