Article

Ship-RT-DETR: An Improved Model for Ship Plate Detection and Identification

1 Guangxi Key Laboratory of Machine Vision and Intelligent Control, Wuzhou University, Wuzhou 543002, China
2 Engineering Research Center of Inland River Intelligent Shipping, University of Guangxi, Wuzhou 543002, China
3 Guang Xi Xi Jiang Development & Investment Group Co., Ltd., Wuzhou 543009, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(11), 2205; https://doi.org/10.3390/jmse13112205
Submission received: 12 September 2025 / Revised: 10 November 2025 / Accepted: 14 November 2025 / Published: 19 November 2025
(This article belongs to the Section Ocean Engineering)

Abstract

Ship License Plate Recognition (SLPR) technology serves as a fundamental technological foundation for maritime transportation management. Automated ship identification enhances both regulatory oversight and operational efficiency. However, current recognition models demonstrate significant limitations, including their inability to detect objects in complex environments and challenges in maintaining real-time performance while ensuring accuracy, thereby limiting their practical applicability. This study proposes a novel cascaded framework that integrates RT-DETR-based detection with OCR capabilities. The framework incorporates several key methodological innovations: optimizing the RT-DETR backbone through efficient partial convolutions during training to improve computational efficiency; implementing Conv3XC to modify the BasicBlock of the ResNet18 backbone using a triple convolutional layer configuration with an enhanced RepC3 kernel design for better feature extraction; and integrating learned position encoding (LPE) to improve the AIFI position encoding mechanism, thereby enhancing detection capabilities. After region detection, PP-OCRv3 is used for character recognition. Experimental results demonstrate the superior performance of our approach: Ship-RT-DETR achieves 96.2% detection accuracy with a 28.5% reduction in parameters and 67.3 FPS, while PP-OCRv3 achieves 91.6% recognition accuracy. Extensive environmental validation across diverse weather conditions (sunny, cloudy, rainy, and foggy) confirms the framework’s robustness, maintaining a detection accuracy above 90% even in challenging foggy conditions, with minimal performance degradation (a 7.7% decrease from optimal conditions). The system’s consistent performance across various environmental conditions (detection standard deviation: 2.84%, OCR confidence standard deviation: 0.0295) establishes a novel and robust methodology for practical SLPR applications.

1. Introduction

The Xijiang River, a critical inland waterway in China, has experienced unprecedented growth in maritime traffic, which presents escalating challenges in ship management and safety oversight. The safe and efficient management of maritime activities is globally challenged by complex and dynamic environmental conditions, a concern that extends from inland waterways to extreme scenarios like Arctic ice navigation, where decision-making must account for variable conditions and acceptable risk levels [1]. Although ship license plates are essential identifiers in maritime operations, current identification methods primarily rely on manual observation and basic electronic monitoring systems, which have proven inadequate in meeting the evolving demands of modern maritime management. In contrast to well-established fields such as Automatic License Plate Recognition (ALPR) [2,3,4,5] and Scene Text Recognition (STR) [6,7,8], SLPR [9,10,11,12,13] has garnered relatively limited academic attention. This research gap can be attributed to two fundamental challenges: the complex operational environments typical of maritime settings and the persistent lack of specialized training datasets necessary for developing robust models.
The proposed SLPR framework specifically addresses ship license plates in the Xijiang River system, which exhibit a dual identification format consisting of standardized plates and painted designations. While these formats typically coexist on individual ships and incorporate Chinese characters and numerals, they present distinct structural characteristics and recognition challenges. Multi-line, densely arranged text configurations characterize standardized plates, whereas painted designations feature narrow spacing, variable typography, frequent occlusion, and degraded visibility, with some instances even utilizing handwritten characters. Although ship license numbers demonstrate partial semantic correlation with provincial registration and ship classification attributes, these relationships generally lack consistent logical mapping patterns. Moreover, the extensive nature of the Chinese character set introduces significant complexity to the text recognition task [14], presenting unique challenges beyond those encountered in conventional Latin character-based systems.
Figure 1 presents a comprehensive overview of our proposed Chinese SLPR framework. The system employs a two-stage cascade architecture: initial license plate localization is performed using an RT-DETR-based detection algorithm, followed by text recognition utilizing PP-OCRv3 technology. This integrated approach processes input ship images through the detection pipeline to generate precise plate localization results, which are subsequently analyzed by the Optical Character Recognition (OCR) module to produce final ship identification information.
While the general SLPR framework shares similarities with vehicular license plate detection, fundamental differences exist in implementation and challenges. Figure 2 illustrates various Chinese ship license plates, displaying ship identification information against green backgrounds. The distinctive characteristics of SLPR, compared to traditional vehicle license plate recognition and STR, can be categorized into several key challenges: (1) Spatial and Structural Variability: ship license plates exhibit significant variability through non-standardized positioning and layouts, heterogeneous character attributes, including typography, orientation, scale, and background variations, and complex multi-line configurations that combine Chinese characters with numerals. (2) Environmental and Visual Challenges: Environmental challenges encompass background confusion with visually similar elements, frequent occlusion and contamination specific to Xijiang River environments, suboptimal illumination conditions, and variations introduced by manual painting methods. (3) Data Scarcity: limited availability of comprehensive ship license plate datasets and insufficient training data for robust model development. Consequently, existing advanced algorithms for scene text recognition [15,16] and automatic license plate recognition [2,3] cannot be directly applied to SLPR due to two primary constraints. First, environmental factors such as ship motion on water surfaces, variable lighting conditions, and adverse weather significantly impact image acquisition quality. Second, substantial variations in font specifications, mounting positions, and capture angles introduce additional recognition challenges. These limitations necessitate targeted modifications to existing algorithmic frameworks to enhance recognition accuracy in practical maritime applications.
SLPR initiates with plate detection, employing deep learning-based object detection algorithms to identify plate locations in preprocessed images. While these algorithms effectively detect license plates and output positional and dimensional information, several limitations exist in current approaches. Early methods [17] struggled with scale complexity due to their reliance on default anchor boxes for prediction frame calculations. Furthermore, conventional approaches utilizing rectangular and rotated rectangular bounding boxes introduce significant background noise in detection results. Although algorithms employed in studies [11,18] implemented axis-aligned rectangles to define Regions of Interest (RoI) bounding boxes, this representation method frequently incorporates excessive background information, compromising the purity of target regions. While studies [19,20,21,22,23,24,25,26] achieved real-time ship classification, their scope was limited to ship categorization without addressing license plate localization. After successful plate detection, OCR serves as a crucial processing stage, facilitating the transformation of detected textual elements into machine-interpretable data for robust information extraction.
Developing robust detection and recognition models requires extensively annotated datasets of ship license plates. Model robustness requires datasets incorporating a comprehensive spectrum of plate configurations, including heterogeneous typographic elements, diverse environmental contexts, and varying meteorological conditions. The detection model is trained to localize license plate regions, while the recognition model is optimized for character identification. Subsequently, test datasets are employed to evaluate system accuracy and efficiency, validating compliance with required specifications and practical operational capability. This framework can be integrated into broader maritime management systems, including intelligent recognition systems, dynamic monitoring platforms, and information management infrastructures combining detection and recognition components.
We present a novel cascade framework for SLPR developed using the Xijiang ship dataset to address these challenges. Our approach draws inspiration from partial convolution [27], Conv3XC from the Swift Parameter-free Attention Network [28], and Learned Positional Encoding [29]. The cascade architecture decomposes complex tasks into independent modules, facilitating flexible integration of new processing components to accommodate evolving requirements. The framework incorporates several key technical implementations to enhance overall system performance: efficient partial convolution enables computational optimization through selective channel operations for multi-scale feature extraction; the original RepC3 is enhanced through a triple convolutional layer configuration; and system performance is further optimized through the AIFI_LPE implementation. The primary contributions of this work are fourfold:
1. Development of a comprehensive Xijiang ship license plate dataset incorporating multiple viewpoints, backgrounds, and typography variations.
2. Implementation of an efficient partial convolution network as a backbone replacement for enhanced speed performance.
3. Modification of the ResNet18 backbone’s BasicBlock using Conv3XC to improve high-level feature extraction capabilities.
4. Enhancement of the AIFI position encoding mechanism through LPE for improved detection performance.
The specific structure of this paper is as follows: Section 1 introduces the background and significance of the project’s SLPR. Section 2 reviews recent developments in SLPR. Section 3 provides detailed descriptions of the ship dataset and proposed framework. Section 4 outlines model evaluation metrics. Section 5 presents experimental results, comparisons, and discussion. Section 6 concludes with a summary and future research directions.

2. Related Work

The two primary components of the SLPR system are ship license plate detection and text recognition. The following sections provide a comprehensive review of deep learning-based approaches for both detection and recognition methodologies.

2.1. Ship License Plate Detection

The authors in [18] proposed a coarse-to-fine localization methodology for ship license plate detection, employing a three-stage approach. In the first stage, the system utilizes the Maximally Stable Extremal Regions (MSER) algorithm in combination with geometric features, regional intensity, and color mean similarity for initial text region extraction. Following this, feature-based refinement is applied to the coarse localization results to accurately identify candidate text regions containing actual characters, thereby determining the precise locations of the license plates. Finally, the method incorporates compensation and optimization of partial localization results by analyzing color and brightness similarity features between characters within the same license plate. Experimental validation demonstrated a ship license plate localization accuracy of 70.38%.
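For concreteness, the coarse extraction stage can be approximated with OpenCV’s MSER implementation; the sketch below covers only the first stage described in [18], and the aspect-ratio filter is an illustrative stand-in for the cited method’s geometric, intensity, and color criteria.

```python
import cv2

# Detect maximally stable extremal regions as coarse text candidates.
gray = cv2.cvtColor(cv2.imread("ship.jpg"), cv2.COLOR_BGR2GRAY)
mser = cv2.MSER_create()
_, bboxes = mser.detectRegions(gray)  # bounding boxes of stable regions

# Keep boxes whose width/height ratio plausibly matches characters
# (illustrative thresholds, not the values used in [18]).
candidates = [(x, y, w, h) for (x, y, w, h) in bboxes if 0.2 < w / h < 5.0]
```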
Deep learning technologies have significantly enhanced accuracy and speed in object detection applications. Object detection algorithms can be broadly categorized into two approaches: (1) Two-Stage Detection: Represented by the R-CNN family, these methodologies employ a two-stage computational pipeline: a primary phase of region proposal generation using selective search or anchor-based mechanisms is followed by discriminative classification that evaluates object presence and category within the spatially constrained candidate areas. While achieving high detection precision, their computational complexity results in relatively slower execution speeds. (2) Single-Stage Detection: Exemplified by SSD [30] and the YOLO series [24,25,26], these algorithms maintain detection accuracy while significantly improving operational speed, demonstrating effectiveness in practical applications. However, both approaches rely on Non-Maximum Suppression (NMS) for redundant bounding box elimination, which introduces significant limitations in terms of computational efficiency, sequential processing requirements that impact inference speed, and sensitivity to threshold adjustments across different scenarios that may result in target loss. Liu et al. [11] developed a method for localizing multi-style Ship License Numbers (SLNs) in natural scenes, which combines a transfer learning-based deep convolutional neural network for character sequence detection with two key algorithms derived from three prior SLN features: an SLN region generation algorithm and a pseudo-SLN filtering algorithm based on low-level similarity, with their method achieving an F-measure of 0.614 when validated on the ZJUSHIPS950 dataset using 1374 annotated ship license numbers. However, the approaches in [11] and [18] frequently introduce background noise in predicted bounding boxes, potentially leading to erroneous text recognition results.
Beyond the challenges specific to license plate localization, the broader field of ship detection must also address significant variations in ship orientation. Rotation-aware detection frameworks have emerged to address this issue by generating rotated bounding boxes. For instance, the Rotation YOLO Model (RYM) incorporates a rotation-decoupled head and an attentional mechanism to accurately detect tilted ships in maritime images, achieving an average accuracy of 96.7% [31]. This paradigm highlights a technological pathway distinct from axis-aligned detection, focusing on precise geometric representation for the ship hull itself.
Reference [25] introduces a lightweight real-time detection approach named YOLOSeaShip, built upon the YOLOv7-tiny model. This method reduces the number of parameters and enhances computational efficiency by replacing the original 3 × 1 convolution with partial convolutions in the ELAN module.
Reference [26] presents a lightweight ship detection model named YOLOv7-Ship. The proposed framework uses a Coordinate Attention Mechanism embedded at critical nodes within the backbone network to enhance spatial feature discrimination. Meanwhile, the ODconv modules are integrated into the Efficient Layer Aggregation Network, enabling comprehensive feature extraction through multi-dimensional convolution kernels that capture information from various directions. The modified YOLOv7-Ship variant achieves a mean Average Precision improvement of 2.2 percentage points compared to the baseline YOLOv7-Tiny model.
Early research primarily focused on the detection of ship types. However, in recent years, end-to-end object detection technologies based on Transformer architectures have gained widespread attention in the research community due to their streamlined and efficient detection process. Carion et al. [32] proposed the Detection Transformer (DETR), an end-to-end object detector based on the Transformer framework. This work innovatively eliminates the need for manually defined anchor points and Non-Maximum Suppression (NMS) components in the traditional detection pipeline. Instead, it employs a bipartite matching mechanism to directly predict a set of corresponding objects, thereby simplifying the detection process and alleviating the performance bottleneck caused by the NMS step.
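The bipartite matching step can be illustrated with a toy Hungarian assignment; the sketch below uses a classification-only cost for brevity, whereas DETR’s actual matching cost also includes L1 and generalized IoU box terms.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Three predicted queries (class probabilities) vs. two ground-truth objects.
pred_probs = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.6, 0.4]])
gt_labels = [0, 1]

# Cost = negative probability of each object's true class.
cost = -pred_probs[:, gt_labels]
rows, cols = linear_sum_assignment(cost)  # one-to-one query-object matching
# rows -> matched query indices, cols -> ground-truth object indices
```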
Based on this foundation, the Baidu research team proposed RT-DETR (Real-Time DETR) [33]. By optimizing the network architecture and reducing computational complexity, RT-DETR achieves real-time performance while maintaining high detection accuracy. Experimental results demonstrate that the model significantly outperforms YOLO series detection models of similar scale regarding the speed-accuracy trade-off.
The study presented in [27] proposes an innovative partial convolution to eliminate unnecessary calculations and minimize memory access demands. Experimental validation confirms that the proposed framework maintains competitive accuracy in vision-related applications while achieving superior computational efficiency compared to conventional counterparts. Therefore, this paper draws inspiration from this concept to optimize the structure of RT-DETR.

2.2. Ship License Plate Recognition

Upon successful localization of vehicular identification zones, OCR is applied to recognize the characters on the plate. CRNN [34] represents a seminal architecture among text recognition methodologies, demonstrating widespread adoption across diverse applications. The architecture implements a convolutional recurrent design, wherein CNNs extract sequential image features, followed by a bidirectional Long Short-Term Memory network for frame-wise label distribution prediction. Connectionist Temporal Classification subsequently transforms these predictions into definitive label sequences. This methodology enables variable-length text recognition, requiring only line-level annotations rather than character-level supervision. Despite marginally lower performance compared to contemporary state-of-the-art approaches, CRNN’s computational efficiency and minimal resource requirements render it particularly suitable for deployment in practical applications.
A novel approach for textual recognition [12] enhances conventional CRNN frameworks through strategic integration of Spatial Transformer Networks (STNs). This architecture modification enables automated rectification of deformed text instances via a four-stage operational workflow: convolutional layers initially capture spatial features, followed by STN-executed geometric normalization compensating for perspective distortions and irregular layouts. Subsequent bidirectional LSTM modules perform contextual sequence modelling, with final transcription accomplished through Connectionist Temporal Classification decoding. The hierarchical design synergistically combines spatial transformation capabilities with temporal sequence analysis, effectively addressing challenges in multi-line text interpretation while maintaining computational efficiency.
Reference [35] introduces a novel scene text detector, TextField. This method’s primary contribution lies in the incorporation of a directional field feature for each text point, which points to the nearest text boundary. This directional field is learned through a fully convolutional neural network and is represented as a 2D vector image. Unlike traditional segmentation-based methods, TextField stores binary text mask information and includes directional information to differentiate adjacent text instances, thereby enhancing the performance of text detection.
The study in [36] introduces a unified training architecture that optimizes textual region localization and cross-modal feature alignment within an end-to-end learning paradigm. Specifically, the method first employs a text region proposal network to generate candidate regions from the input image, then optimizes and enhances these candidate regions using an improved validation network.
CRNN, as a significant model in the field of OCR, demonstrates exceptional performance in text recognition within complex scenarios due to its unique structure and advantages. However, existing ship license plate detection and recognition models still suffer from suboptimal detection speeds. To tackle this challenge and enhance computational efficiency in maritime object identification, this investigation introduces an optimized backbone architecture for the RT-DETR framework. The inference speed is significantly improved by reducing the model parameters while maintaining accuracy. After the license plate locations are obtained, the PP-OCR detection algorithm is applied to extract the text.

3. The Proposed Method

This section first describes the process of collecting the Xijiang ship dataset and then provides a detailed explanation of the proposed SLPR framework. The framework comprises ship plate acquisition, region extraction, preprocessing, and character recognition. Specifically, the system first employs the Ship-RT-DETR model to detect and localize the ship license plate regions in the input image. Then, the OCR algorithm is applied for character recognition.
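To make the cascade concrete, the following is a minimal sketch of the two-stage pipeline; `detector.predict` and `ocr.recognize` are hypothetical placeholder interfaces standing in for the Ship-RT-DETR and PP-OCRv3 components, and the confidence threshold is illustrative.

```python
import cv2

def recognize_ship_plate(image_path, detector, ocr, conf_thresh=0.5):
    """Stage 1: locate plates with the detector; Stage 2: OCR each crop."""
    image = cv2.imread(image_path)
    results = []
    for (x1, y1, x2, y2, score) in detector.predict(image):
        if score < conf_thresh:  # illustrative confidence threshold
            continue
        # Crop the detected plate region and pass it to the recognizer.
        crop = image[int(y1):int(y2), int(x1):int(x2)]
        text, confidence = ocr.recognize(crop)
        results.append((text, confidence))
    return results
```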

3.1. Dataset Construction

This section introduces the methodological framework for constructing and annotating our SLPR dataset.

3.1.1. Image Acquisition of Ships in Xijiang

Guang Xi Xi Jiang Development & Investment Group Co., Ltd. provided the ship dataset. For our dataset, we captured 2000 images of ship license plates in various sizes, fonts, plate colors, and orientations under different weather conditions along the Changzhou Hydro-Junction. Some examples of the collected license plate images are shown in Figure 3. The dataset includes pictures captured at different times and under various weather conditions.
In addition, several factors were carefully considered when selecting the images to ensure the dataset’s representativeness, diversity, and suitability for the intended purpose. These include incorporating pictures captured at different times and under various weather conditions to account for diversity.
Finally, the license plate data was accurately annotated. The dataset annotation process utilized the PaddleOCR framework for automated labelling and annotation generation. Subsequently, the complete dataset underwent systematic partitioning into three subsets: training, validation, and test cohorts, following a stratified 70-10-20 distribution ratio. Of this test set (400 images, 20% of the total), a balanced subset was meticulously curated, comprising 100 images for each of the four primary weather conditions: sunny, cloudy, rainy, and foggy. Among these, the ‘foggy’ subset presents the most significant challenge due to severely degraded visibility, markedly reduced image contrast, and blurred license plate contours, providing a rigorous test for generalization under adverse conditions. Figure 4 illustrates the dataset with annotated bounding boxes.
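For illustration, the 70-10-20 stratified partition could be reproduced along the following lines; the `(image_path, weather_label)` sample structure and the use of scikit-learn are assumptions for demonstration, not the tooling actually used.

```python
from sklearn.model_selection import train_test_split

# Assumed structure: one (image_path, weather_label) pair per image.
weathers = ("sunny", "cloudy", "rainy", "foggy")
samples = [(f"img_{i:04d}.jpg", weathers[i % 4]) for i in range(2000)]

labels = [w for _, w in samples]
# 70% train; the remaining 30% splits 1:2 into 10% val and 20% test.
train_set, rest = train_test_split(samples, test_size=0.30,
                                   stratify=labels, random_state=0)
val_set, test_set = train_test_split(rest, test_size=2 / 3,
                                     stratify=[w for _, w in rest],
                                     random_state=0)
```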

3.1.2. Annotation

This study employed the PaddleOCR annotation tool to label the dataset. The process is outlined as follows: First, the PPOCRLabel annotation tool was launched, and the label mode was selected, which supports both text detection box annotation and text content recognition annotation. The image data to be processed was then imported into the annotation tool using a batch folder import method to improve efficiency. During the annotation phase, operators performed two key steps for each image: first, they precisely located the text region by drawing a rectangular box with the mouse; second, they entered the corresponding text content for that region. During the annotation process, strict control was maintained over the compactness of the detection boxes to ensure high-quality annotations.
After annotation, the exported results include the following key files: (1) crop_img: stores the text images cropped according to the annotation boxes, which are used for training the text recognition model; (2) file state: records the annotation status information of each image in the dataset; (3) label: contains the annotated data used to train the text detection model; and (4) rec_gt: saves the text label information to train the text recognition model.
The annotated data provided high-quality supervisory information for subsequent model training. The standardized annotation process ensured the accuracy and consistency of the training data.

3.2. Ship Plate Detection Based on Ship-RT-DETR

3.2.1. Overview of RT-DETR

The RT-DETR architecture integrates three fundamental components: a backbone network, an Efficient Hybrid Encoder, and a Transformer Decoder. Conventional CNN architectures, including ResNet variants and Baidu’s proprietary HGNet, are employed as backbone networks. The Efficient Hybrid Encoder operates through two synergistic modules: an attention-enhanced single-scale feature interaction module (AIFI) and a CNN-driven cross-scale feature fusion module (CCFF). These modules collectively process multi-level features extracted from the final three backbone stages (S3, S4, S5). AIFI implements transformer-based encoding exclusively on the highest-level feature map (S5), achieving computational efficiency while preserving semantic representation capabilities. The CCFF architecture incorporates multiple fusion units, each containing dual 1 × 1 convolutional layers for channel dimension modulation and three RepConv layers optimized for multi-scale feature integration. Final feature representations from both pathways undergo element-wise summation to generate enhanced output features.
The decoder in the RT-DETR architecture employs an IoU-aware query mechanism to transform encoded feature sequences into detection predictions. To further enhance the quality of the queries and the accuracy of predictions, this study introduces an uncertainty-minimized query selection mechanism, quantifying the uncertainty between class and location predictions. The query with the lowest uncertainty is selected as the initial object query. Based on the optimized initial query, the decoder iteratively refines the detection boxes and their confidence scores to generate the final detection results.
Considering the complex and variable environmental conditions in ship license plate detection scenarios and the constraints on the model scale, this study selects RT-DETR-R18 as the base architecture for optimization. The RT-DETR series includes various specifications such as R18, R34, R50, R50m, R101, L, and X; among these, the RT-DETR-R18 architecture offers an optimal compromise between computational efficiency and detection performance. Its structural configuration is illustrated in Figure 5.
Consequently, the unmodified RT-DETR-R18 model is established as the baseline for all subsequent ablation studies and performance comparisons. This baseline configuration utilizes the original ResNet18 backbone, standard convolutional blocks, and fixed positional encoding in the AIFI module, without incorporating any of the proposed innovations. All improvements discussed in the following sections are evaluated by incrementally integrating them into this baseline model to quantify their individual and collective contributions.

3.2.2. Improvement of RT-DETR

This study systematically optimizes the RT-DETR-R18 network architecture to improve detection accuracy and computational efficiency. The main improvements include three aspects: First, the CSP_PMSFA structure is introduced into the backbone network to replace the original BasicBlock module, effectively reducing the parameters while maintaining feature extraction capabilities. Second, the RepC3 structure within the BasicBlock is optimized through a three-layer convolutional network, enhancing the network’s ability to capture fine-grained features. Finally, a learnable position encoding mechanism is incorporated into the AIFI (Attention-based Intra-scale Feature Interaction) module, improving the model’s ability to encode positional information and enabling efficient multi-scale feature fusion. The improved network architecture is shown in Figure 6.
Design of CSP_PMSFA Module
To address the parameter expansion issue caused by conventional convolutions in traditional backbone networks while maintaining spatial feature extraction capabilities, Ref. [27] introduces the partial convolution (PConv) technique from FasterNet. PConv, through its lightweight design, performs convolution operations only on a subset of channels in the input feature map, significantly reducing computational overhead. This approach enhances resource efficiency for datasets exhibiting continuity or periodic patterns by employing the initial or terminal contiguous channels of feature maps as computationally representative subsets during processing.
Building upon the concept above, this study introduces the CSP_PMSFA structure, as depicted in Figure 7. This module employs a partial convolution strategy for multi-scale feature extraction, fuses features from different scales using 1 × 1 convolution layers and incorporates a residual connection mechanism to preserve the original information. This design enhances the model’s feature representation capability while reducing computational latency.
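A PyTorch sketch of the ideas behind this module is given below: convolution over only a leading subset of channels, concatenation of multi-scale branches, 1 × 1 fusion, and a residual connection. The layer names, split ratio, and branch kernel sizes are illustrative assumptions; the paper’s exact CSP_PMSFA layout is not reproduced.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Convolve only the first 1/ratio of channels; pass the rest through."""
    def __init__(self, channels, kernel_size=3, ratio=4):
        super().__init__()
        self.conv_ch = channels // ratio
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):
        a, b = torch.split(x, [self.conv_ch, x.size(1) - self.conv_ch], dim=1)
        return torch.cat([self.conv(a), b], dim=1)

class CSPPMSFASketch(nn.Module):
    """Multi-scale partial convolutions, 1x1 fusion, residual connection."""
    def __init__(self, channels):
        super().__init__()
        self.branch3 = PartialConv(channels, kernel_size=3)
        self.branch5 = PartialConv(channels, kernel_size=5)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        multi_scale = torch.cat([self.branch3(x), self.branch5(x)], dim=1)
        return x + self.fuse(multi_scale)  # residual preserves original info
```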
Module Design of Conv3XC
The RepC3 module is a variant of the CSP (Cross Stage Partial) structure, built upon RepConv, and is primarily used in the bottleneck layers of neural networks. Let the input tensor be denoted as x, with its mathematical expression as follows:
$$y = \mathrm{cv}_3\big(m(\mathrm{cv}_1(x)) + \mathrm{cv}_2(x)\big)$$
where $\mathrm{cv}_1(x)$ denotes the output of the first convolutional layer applied to input $x$, $\mathrm{cv}_2(x)$ the output of the second convolutional layer applied to $x$, $m(\cdot)$ the sequential application of $n$ RepConv layers, and $\mathrm{cv}_3(\cdot)$ the final convolution applied to the summed output, which reduces to an identity transformation when $c = c_2$.
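Read as code, the equation corresponds to the following PyTorch sketch, with RepConv abbreviated to a plain 3 × 3 convolution for illustration.

```python
import torch.nn as nn

class RepC3Sketch(nn.Module):
    """y = cv3(m(cv1(x)) + cv2(x)), with m = n stacked RepConv layers."""
    def __init__(self, c1, c2, n=3):
        super().__init__()
        self.cv1 = nn.Conv2d(c1, c2, kernel_size=1)
        self.cv2 = nn.Conv2d(c1, c2, kernel_size=1)
        # RepConv simplified to a plain 3x3 convolution here.
        self.m = nn.Sequential(*[nn.Conv2d(c2, c2, 3, padding=1)
                                 for _ in range(n)])
        # Identity per the c = c2 case described in the text.
        self.cv3 = nn.Identity()

    def forward(self, x):
        return self.cv3(self.m(self.cv1(x)) + self.cv2(x))
```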
This study employs the Conv3XC structure to optimize the BasicBlock module in the ResNet18 backbone network to enhance the detection model’s performance. Conv3XC integrates multiple convolutional layers, batch normalization layers, and activation functions while incorporating a deployment mode optimization mechanism. This approach effectively transforms the multi-layer structure into a single convolutional layer, improving inference efficiency.
In training mode, the forward computation is as follows:
$$x_{\mathrm{pad}} = \mathrm{Pad}\big(x,\ \mathrm{pad}=(1,1,1,1)\big)$$
where $x_{\mathrm{pad}}$ denotes the padded input tensor, required to accommodate the 3 × 3 convolution operation.
The convolutional sequence is processed through the module’s conv branch, whose core computation can be expressed as:
$$h_1 = \mathrm{Conv2D}(x_{\mathrm{pad}},\ \text{kernel size} = 1)$$
$$h_2 = \mathrm{Conv2D}(h_1,\ \text{kernel size} = 3)$$
$$h_3 = \mathrm{Conv2D}(h_2,\ \text{kernel size} = 1)$$
The output of this convolutional sequence is:
$$y_{\mathrm{conv}} = h_3$$
Meanwhile, the model incorporates a skip connection mechanism, applying a 1 × 1 convolution operation to the original input through the $\mathrm{sk}(x)$ function, forming a shortcut path:
$$y_{\mathrm{sk}} = \mathrm{Conv2D}(x,\ \text{kernel size} = 1)$$
Subsequently, the model fuses the output of the convolution sequence with the result of the skip connection to combine the features:
$$y_{\mathrm{sum}} = y_{\mathrm{conv}} + y_{\mathrm{sk}}$$
To ensure the stability of feature distribution, the fused features are subjected to normalization:
$$y_{\mathrm{bn}} = \mathrm{BN}(y_{\mathrm{sum}})$$
Finally, the normalized features undergo a nonlinear transformation through the SiLU activation function, yielding the final output of the layer:
$$y_{\mathrm{act}} = \mathrm{SiLU}(y_{\mathrm{bn}})$$
Therefore, the deployment mode formula is expressed as follows:
$$y = \mathrm{SiLU}\big(\mathrm{Conv2D}_{\mathrm{eval}}(x)\big)$$
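Taken together, the training-mode equations above map onto the following PyTorch sketch: a padded 1-3-1 convolutional sequence plus a 1 × 1 skip path, followed by BatchNorm and SiLU. Channel widths and the expansion gain are illustrative assumptions; in deployment mode, the 1-3-1 sequence collapses into the single equivalent convolution $\mathrm{Conv2D}_{\mathrm{eval}}$.

```python
import torch.nn as nn
import torch.nn.functional as F

class Conv3XCSketch(nn.Module):
    def __init__(self, c_in, c_out, gain=2):
        super().__init__()
        self.conv = nn.Sequential(                                    # h1 -> h2 -> h3
            nn.Conv2d(c_in, c_in * gain, kernel_size=1),
            nn.Conv2d(c_in * gain, c_out * gain, kernel_size=3),      # consumes the padding
            nn.Conv2d(c_out * gain, c_out, kernel_size=1),
        )
        self.sk = nn.Conv2d(c_in, c_out, kernel_size=1)               # skip path sk(x)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        x_pad = F.pad(x, (1, 1, 1, 1))         # x_pad = Pad(x, (1,1,1,1))
        y_sum = self.conv(x_pad) + self.sk(x)  # y_sum = y_conv + y_sk
        return self.act(self.bn(y_sum))        # y_act = SiLU(BN(y_sum))
```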
Module Design of AIFI_LPE
This study introduces a Learnable Position Encoding (LPE) mechanism in the Attention-based Intra-scale Feature Interaction (AIFI) module to enhance the model’s ability to encode positional information. The computational formula for traditional Fixed Position Encoding (FPE) is as follows:
$$\mathrm{PE}(pos,\ 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$
$$\mathrm{PE}(pos,\ 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$
where $d_{\mathrm{model}}$ denotes the model’s dimensionality, $pos$ the position index, and $i$ the dimension index.
In contrast to static positional encoding methodologies, LPE implements an adaptive strategy for capturing sequential relationships: positional representations are optimized directly through gradient-based learning rather than computed by a deterministic algorithm. At network initialization, the position encoding matrix is filled with random parameters drawn from a uniform distribution, so spatial positions are not yet meaningfully associated with their vector representations. Through gradient-based optimization mechanisms such as error backpropagation and adaptive learning rate adjustment, the embedding matrix then undergoes continuous, differentiable deformation in parameter space; this process constitutes the learning of the position encodings. The following decomposes the forward propagation formula and the static method for constructing 2D sinusoidal position embeddings. The position embeddings are generated using the function LPE(x):
pos_embed = LPE(x)
where LPE calculates the learned positional encoding in the flattened spatial dimensions.
The final positional embedding for each spatial location is given by:
$$\mathrm{embedding}(i, j) = \big[\sin(\omega i),\ \cos(\omega i),\ \sin(\omega j),\ \cos(\omega j)\big]$$
where i and j are grid indices, and ω is the angular frequency vector.
LPE integrates positional embeddings as trainable parameters optimized via gradient descent, enabling task-specific adaptation and automated discovery of optimal positional features. This data-driven approach captures complex positional relationships in sequences, enhancing performance in irregular structures or tasks requiring precise positional analysis. Implementation involves refining embeddings to learn hierarchical features aligned with model architecture and objectives.
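The contrast between the two encodings can be summarized in a short sketch: the fixed variant is computed once from the sinusoidal formulas above, while the learned variant is an ordinary trainable parameter updated by backpropagation. Shapes and the initialization scale are illustrative assumptions; the exact AIFI_LPE integration is not reproduced here.

```python
import math
import torch
import torch.nn as nn

def fixed_pos_encoding(seq_len, d_model):
    """Sinusoidal FPE: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)).
    Assumes an even d_model for the interleaved sin/cos columns."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # deterministic, never updated during training

class LearnedPosEncoding(nn.Module):
    """LPE: positional embeddings as trainable parameters."""
    def __init__(self, seq_len, d_model):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.randn(1, seq_len, d_model) * 0.02)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        return x + self.pos_embed  # optimized jointly with the model
```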

3.2.3. Using OCR for SLPR

Following the localization of ship license plates by the detection algorithm, OCR is employed to extract the textual information from the cropped image regions. This study adopts PP-OCRv3 as the core recognition engine. The selection of this specific model was predicated on a comprehensive evaluation, with the decision guided by several critical criteria that align with the demanding requirements of the SLPR task:
1. Superior Efficacy in Chinese Text Recognition: PP-OCRv3 is explicitly engineered and optimized for recognizing complex Chinese characters, which are a fundamental and challenging component of the target ship plates. Its architectural advancements, including enhanced text detection and recognition networks, directly address the challenges posed by the large vocabulary and dense stroke structures of the Chinese writing system, thereby ensuring higher accuracy compared to generic OCR models.
2. Demonstrated State-of-the-Art Performance: The framework integrates contemporary advances in deep learning for document and scene text analysis. It has consistently demonstrated top-tier performance on multiple public benchmarks, providing a reliable foundation for achieving high recognition accuracy in our application.
3. Exceptional Environmental Robustness: PP-OCRv3 exhibits strong generalization capabilities under a wide spectrum of challenging conditions, such as fluctuating illumination, diverse viewing angles, and variable image resolutions. This robustness is paramount for maritime environments, where such variations are prevalent.
4. Computational Efficiency for Real-Time Processing: The model is designed with an emphasis on inference speed, achieving an optimal balance between accuracy and latency. This characteristic is crucial for integrating seamlessly into our cascaded framework and meeting the overarching goal of real-time ship identification.
Upon successful localization, the cropped license plate image is fed into the PP-OCRv3 pipeline for text extraction and sequence decoding.
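As a usage illustration, a cropped plate can be passed through the open-source PaddleOCR package roughly as follows; the argument values (Chinese language model, angle classification) are common settings for this task and should be read as assumptions rather than the exact configuration used in this work.

```python
from paddleocr import PaddleOCR

# Load the PP-OCRv3 Chinese detection/recognition pipeline.
ocr = PaddleOCR(ocr_version="PP-OCRv3", lang="ch", use_angle_cls=True)

result = ocr.ocr("plate_crop.jpg", cls=True)
for box, (text, confidence) in result[0]:
    print(text, confidence)  # e.g., a plate string and its score
```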
It is imperative to note that the recognition of Chinese ship license plates entails unique challenges that extend beyond those encountered in typical alphanumeric license plate systems. The task requires the accurate interpretation of a heterogeneous sequence often comprising both complex Chinese characters and numerals. Chinese characters, characterized by a vast glyph set, structurally intricate and spatially dense strokes, and the prevalence of visually similar glyphs (e.g., the distinction between ‘己’, ‘已’, and ‘巳’), demand a significantly higher level of discriminative capability from the recognition model. In contrast, alphanumeric recognition benefits from a constrained, small set of symbols with comparatively simpler topological features. The high recognition accuracy attained by our framework, as detailed in Section 5.4, substantiates the effectiveness of PP-OCRv3 and validates its particular suitability for overcoming the distinctive challenges inherent to Chinese SLPR.

4. Experiments

All experiments in this study were conducted on two NVIDIA RTX 3080 GPUs, using Python 3.8 and the PyTorch 1.11.0 deep learning framework. To evaluate the advantages of the proposed SLPR framework, we conducted a comprehensive assessment through an ablation experiment, a visual experiment, and a comparative experiment.

4.1. Evaluation Metrics of Ship License Plate Detection

In selecting performance evaluation metrics, the mean Average Precision (mAP) is used as the core evaluation metric. This metric provides a comprehensive reflection of the model’s detection accuracy and effectively evaluates its stability across different scenarios [37].
In the evaluation framework for object detection tasks, the following standard metrics are employed to quantify the model’s performance: True Positive (TP) represents the model’s ability to accurately identify target objects, explicitly referring to instances where objects are successfully detected and correctly classified within a given image. This metric directly reflects the detection accuracy of the model. False Positive (FP) measures the model’s false detection instances, which occur in two scenarios: first, when the background is incorrectly identified as a target object, i.e., detection boxes are generated in regions without targets; second, when an incorrect class label is assigned to an existing target object. This metric reflects the model’s robustness against interference. False Negative (FN) assesses the model’s missed detections, i.e., the number of target objects in the image that the model did not successfully detect. This metric is crucial for evaluating the model’s detection completeness. The performance assessment of object detection algorithms primarily employs precision-recall metrics as fundamental quantitative indicators. Precision quantifies the detector’s capacity to minimize false identifications. The following formula defines precision:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
The recall formula is as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
A robust object detection system demonstrates its effectiveness when precision and recall metrics are maintained at elevated levels across progressively increasing confidence thresholds while sustaining high precision across varying recall levels.
Finally, the mAP is the average of the AP values across all object detection categories.
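As a worked example of the formulas above, consider a model that produces 90 correct detections, 5 false alarms, and misses 10 plates; AP is then the area under the precision-recall curve traced as the confidence threshold varies, and mAP averages AP over all classes.

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

# 90 true positives, 5 false positives, 10 false negatives.
print(precision(90, 5))   # 0.947...
print(recall(90, 10))     # 0.9
```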

4.2. Evaluation Metrics of Ship License Plate Recognition

Character-level accuracy is a commonly used metric for evaluating the overall model accuracy [38]. This metric quantifies recognition performance as the percentage of characters in the dataset that are correctly recognized.
$$\mathrm{CLA} = \frac{\text{Number of correctly recognized characters}}{\text{Total number of characters}} \times 100$$
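The metric translates directly into code; the simple aligned comparison below assumes prediction and ground truth have equal length, a simplification of full edit-distance-based scoring.

```python
def char_level_accuracy(preds, gts):
    """CLA = correctly recognized characters / total characters * 100."""
    correct = sum(p == g for pred, gt in zip(preds, gts)
                  for p, g in zip(pred, gt))
    total = sum(len(gt) for gt in gts)
    return 100.0 * correct / total if total else 0.0

# Example: one character wrong out of six -> 83.33%.
print(char_level_accuracy(["桂平宏远", "AB"], ["桂平宏达", "AB"]))
```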

5. Results

This section presents the results of the ship plate recognition framework. Our framework was experimentally validated using a curated maritime vessel dataset, and comparative analysis was performed against existing methodologies under standardized evaluation metrics.

5.1. Ablation Experiment

This study implemented a comprehensive set of architectural enhancements in the RT-DETR framework, with systematic validation of the efficacy of the components through rigorous ablation analyses. As detailed in Table 1, the integration of CSP_PMSFA modules into the backbone network achieved substantial model compression while preserving detection performance. The optimized architecture exhibited a 28.5% reduction in parameter quantity relative to the baseline RT-DETR-R18 configuration, while maintaining competitive accuracy metrics.
Critically, to address potential generalization concerns, we validated the CSP_PMSFA module across diverse environmental conditions. The module demonstrated exceptional robustness with only 2.84% standard deviation in detection accuracy across varying weather scenarios, confirming its practical applicability in real-world settings.
Furthermore, employing Conv3XC from the Swift Parameter-free Attention Network to enhance RepC3 yielded an improvement of 1.0% in accuracy and 2.2% in mAP0.5:0.9. Notably, the Conv3XC optimization maintained outstanding performance stability across challenging conditions, with detection mAP sustained above 90.1% even in adverse foggy environments. This represents only a 7.7% performance degradation from optimal sunny conditions (97.8% mAP), validating the optimization’s generalization capability without sacrificing computational efficiency.
The introduction of LPE to improve the positional encoding mechanism in AIFI significantly enhanced the model’s ability to learn positional relationships. Compared to the original RT-DETR-R18 model, the enhanced architecture achieved a 1.3% accuracy increase, 1.1% mAP0.5 improvement, and 5.7% parameter reduction. The LPE mechanism proved particularly effective in complex environmental conditions, as evidenced by the system’s ability to maintain high OCR confidence scores (average 0.943) with minimal variation (std: 0.0295) across all weather scenarios.
In conclusion, these architectural modifications collectively optimized the model structure, improved detection accuracy, reduced parameter count, and demonstrated exceptional robustness to environmental variations, directly addressing potential concerns regarding real-world deployment.

5.2. Visual Experiment

The visual comparison of the experimental results (as shown in Figure 8) demonstrates that the proposed detection network exhibits significant advantages over the baseline network on the Xijiang ship dataset. Specifically, the improved network shows more substantial detection capabilities for irregular ship plates, such as those with hand-painted or blurred backgrounds. This performance enhancement can be primarily attributed to two key improvements: First, introducing the three-channel convolutional layer configuration in the RepC3 module enhances the network’s feature extraction ability, improving feature purity and detection performance. Second, the AIFI_LPE module, by dynamically adjusting sampling positions, significantly enhances the network’s adaptability to geometric variations of the target, including scale, posture, viewpoint changes, and local deformations.

5.3. Comparative Experiment

To validate the proposed algorithm’s superiority in object detection, the proposed method was compared to several state-of-the-art algorithms, including YOLOv8, YOLOv10, YOLOv11, and RT-DETR-R18, using the Xijiang ship dataset. The experimental results of the training set are shown in Figure 9. The improved ship-RT-DETR model demonstrates the best performance, confirming the proposed improvements’ effectiveness.

5.4. OCR

Traditional methods primarily rely on the PP-OCRv2 model to recognize ship plate information in ship images. In this study, the PP-OCRv3 model was employed to recognize ship plate information in a custom-built ship dataset. As shown in Table 2, the accuracy achieved by using the PP-OCRv3 model reached 91.6%, representing a 6.3% improvement in the recognition accuracy of ship plate text compared to the PP-OCRv2 model. Additionally, the PP-OCRv3 model demonstrated higher frames per second (FPS), indicating enhanced efficiency.

5.5. Environmental Robustness and Generalization Validation

To comprehensively address concerns regarding environmental robustness and generalization capability, we conducted systematic evaluation across four distinct weather conditions representing practical deployment scenarios. As detailed in Table 3, the proposed framework maintains exceptional performance stability despite significant environmental challenges.
The detection module demonstrated remarkable consistency, with an average mAP of 94.8% and minimal performance variation (standard deviation: 2.84%). Even in foggy conditions, the system achieved a 90.1% detection mAP, only 7.7% below its clear-weather accuracy. This stable performance under adverse conditions demonstrates that the proposed improvements are effective for real-world use.
Regarding the two-stage framework concern, our results demonstrate effective error management between the detection and recognition stages. The OCR module achieved a perfect 100% success rate in sunny, cloudy, and rainy conditions, with only a moderate reduction to 77.8% in foggy conditions. Importantly, the OCR confidence remained exceptionally high across all scenarios (average: 0.943), with foggy conditions actually achieving the highest confidence score (0.947), indicating reliable recognition when detection is successful.
The partial convolution operations exhibited outstanding stability, as evidenced by consistent OCR confidence scores (std: 0.0295) across weather variations. Similarly, Conv3XC optimization demonstrated balanced performance without compromising generalization, maintaining detection accuracy above 95.5% in normal conditions and degrading gracefully to 90.1% only in the most extreme foggy scenarios.
While current validation focuses on the Xijiang ship dataset, the system’s consistent performance across four distinct weather conditions, coupled with successful recognition of diverse Chinese license plate formats, including complex combinations such as “桂桂平货3099” and “横县谢圩968”, provides compelling evidence of generalization capability for maritime recognition applications.

6. Conclusions

This paper proposes a novel cascade framework for ship plate recognition, combining an enhanced RT-DETR-based detection algorithm with PP-OCRv3 recognition. Our approach introduces three key innovations: CSP_PMSFA modules for computational efficiency (28.5% parameter reduction), Conv3XC for enhanced feature extraction, and AIFI_LPE for improved positional encoding. Experimental results demonstrate superior performance, with Ship-RT-DETR achieving 96.2% detection accuracy at 67.3 FPS and PP-OCRv3 attaining 91.6% recognition accuracy.
The framework’s environmental robustness was validated through comprehensive testing across diverse weather conditions, maintaining stable performance with 94.8% average detection accuracy and minimal degradation even in challenging foggy conditions (90.1% accuracy). These results confirm the practical applicability of our approach for real-world maritime operations, while future work will focus on expanding dataset diversity and developing end-to-end recognition frameworks for further performance enhancement. The methodological innovations presented, while validated on the Xijiang River dataset, provide a potentially transferable framework for maritime intelligent transportation systems, with particular relevance for inland waterway management worldwide.

Author Contributions

Conceptualization, C.Q. and X.J.; methodology, C.Q.; software, C.Q. and Z.M.; validation, C.Q. and X.J.; formal analysis, C.Q.; investigation, C.Q. and J.M.; resources, C.Q.; data curation, C.Q.; writing—original draft preparation, C.Q.; writing—review and editing, C.Q., X.J., Z.M. and J.M.; visualization, C.Q.; supervision, C.Q.; project administration, C.Q.; funding acquisition, Z.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 62466051), 2023 Wuzhou City Science Research and Technology Development Plan Project (Grant 2023A02009), and 2023 Wuzhou University level scientific research project (Grant 2023B003).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Due to privacy concerns, the dataset has not been made publicly available.

Acknowledgments

We extend our sincere gratitude to the Editor and anonymous reviewers for their invaluable feedback and constructive suggestions that have significantly enhanced the quality of this manuscript.

Conflicts of Interest

Author Jinming Mo was employed by the company Guang Xi Xi Jiang Development & Investment Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Goncharov, V.K. The safety of arctic ice navigation on the basis of acceptable risk. Transp. Saf. Environ. 2025, 7, tdaf035. [Google Scholar] [CrossRef]
  2. Salma; Saeed, M.; ur Rahim, R.; Gufran Khan, M.; Zulfiqar, A.; Bhatti, M.T. Development of ANPR framework for Pakistani vehicle number plates using object detection and OCR. Complexity 2021, 2021, 5597337. [Google Scholar] [CrossRef]
  3. Moussaoui, H.; Akkad, N.E.; Benslimane, M.; El-Shafai, W.; Baihan, A.; Hewage, C.; Rathore, R.S. Enhancing automated vehicle identification by integrating YOLO v8 and OCR techniques for high-precision license plate detection and recognition. Sci. Rep. 2024, 14, 14389. [Google Scholar] [CrossRef]
  4. Rathi, R.; Sharma, A.; Baghel, N.; Channe, P.; Barve, S.; Jain, S. License plate detection using YOLO v4. Int. J. Health Sci. 2022, 6, 9456–9462. [Google Scholar] [CrossRef]
  5. Aljelawy, Q.M.; Salman, T.M. License plate recognition in slow motion vehicles. Bull. Electr. Eng. Inform. 2023, 12, 2236–2244. [Google Scholar] [CrossRef]
  6. Long, S.; He, X.; Yao, C. Scene text detection and recognition: The deep learning era. Int. J. Comput. Vis. 2021, 129, 161–184. [Google Scholar] [CrossRef]
  7. Lin, H.; Yang, P.; Zhang, F. Review of scene text detection and recognition. Arch. Comput. Methods Eng. 2020, 27, 433–454. [Google Scholar] [CrossRef]
  8. Gao, Y.; Chen, Y.; Wang, J.; Lu, H. Semi-supervised scene text recognition. IEEE Trans. Image Process. 2021, 30, 3005–3016. [Google Scholar] [CrossRef]
  9. Liu, D.; Cao, J.; Wang, T.; Wu, H.; Wang, J.; Tian, J.; Xu, F. SLPR: A deep learning based Chinese ship license plate recognition framework. IEEE Trans. Intell. Transp. Syst. 2022, 23, 23831–23843. [Google Scholar] [CrossRef]
  10. Zhou, C.; Liu, D.; Wang, T.; Tian, J.; Cao, J. M3ANet: Multi-modal and multi-attention fusion network for ship license plate recognition. IEEE Trans. Multimed. 2023, 26, 5976–5986. [Google Scholar] [CrossRef]
  11. Liu, B.; Lyu, X.; Li, C.; Zhang, S.; Hong, Z.; Ye, X. Using transferred deep model in combination with prior features to localize multi-style ship license numbers in nature scenes. In Proceedings of the 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), Boston, MA, USA, 6–8 November 2017; pp. 506–510. [Google Scholar]
  12. Liu, B.; Wu, S.; Zhang, S.; Hong, Z.; Ye, X. Ship license numbers recognition using deep neural networks. In Journal of Physics: Conference Series, Proceedings of the 2018 2nd International Conference on Data Mining, Communications and Information Technology (DMCIT 2018), Shanghai, China, 25–27 May 2018; IOP Publishing: Bristol, UK, 2018; Volume 1060, p. 012064. [Google Scholar]
  13. Abdulraheem, A.; Suleiman, J.T.; Jung, I.Y. Enhancing the Automatic Recognition Accuracy of Imprinted Ship Characters by Using Machine Learning. Sustainability 2023, 15, 14130. [Google Scholar] [CrossRef]
  14. Li, T.; Yang, F.; Song, Y. Visual attention adversarial networks for Chinese font translation. Electronics 2023, 12, 1388. [Google Scholar] [CrossRef]
Figure 1. Proposed SLPR Framework Diagram. The Chinese text “桂平宏远” is a sample vessel license plate, meaning “Guiping Hongyuan”, which serves as the recognition target in this study.
Figure 2. Collected Dataset with Ship License Plates. All Chinese characters are vessel license plates used as recognition targets. The translations are: “贵港天佑” (Guigang Tianyou), “平南润发” (Pingnan Runfa), “腾龙” (Tenglong), “桂平南货” (Guiping Nanhuo), “远丰” (Yuanfeng), “冠洋” (Guanyang), “顺泰” (Shuntai), and “鸿丰” (Hongfeng). All visual data in this study are derived from real-world surveillance footage, in which non-English text (such as timestamps and camera IDs) is inherent metadata of the authentic data source.
Figure 3. Collected Images of Ships in Xijiang. As with all data in this study, the Chinese text (the date format in the top-left and the camera ID “西球” in the bottom-right) is inherent metadata from the source surveillance video. The date follows the “Year-Month-Day” format, and “西球” identifies the specific camera location.
Figure 4. Ship Image Annotated with the Location of the Ship’s License Plate. As with all data in this study, the Chinese text (the date format in the top-left and the camera ID “西球” in the bottom-right) is inherent metadata from the source surveillance video. The date follows the “Year-Month-Day” format, and “西球” identifies the specific camera location.
Figure 5. RT-DETR-R18 Model Structure. The colors in the diagram differentiate the types of neural network modules: green represents Convolutional (Conv) layers, red denotes BasicBlock modules, orange signifies the MaxPool2d layer, yellow corresponds to RepC3 modules, and gray indicates Concatenation (Concat) operations.
Figure 6. Improved RT-DETR-R18 Model Structure. The color coding highlights the novel modules proposed in this work: dark orange represents the CSP_PMSFA module, light orange denotes the AIFI_LPE module, and yellow indicates the Conv3xcc3 module.
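For readers unfamiliar with learned position encoding (LPE), the idea underlying the AIFI_LPE module can be sketched in a few lines of PyTorch: a trainable per-token embedding is added to the flattened feature tokens in place of a fixed sinusoidal encoding. The sketch below is a minimal illustration of that general technique, not the authors’ implementation; the class name, token count, and embedding width are assumptions.

```python
import torch
import torch.nn as nn

class LearnedPositionEncoding(nn.Module):
    """Minimal learned position encoding (illustrative, not the paper's
    AIFI_LPE module): a trainable table of per-token offsets added to
    the flattened feature tokens, replacing a fixed sinusoidal encoding."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))
        nn.init.trunc_normal_(self.pos, std=0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), e.g. a flattened feature map
        return tokens + self.pos

# Example with assumed sizes: a 20x20 map flattened to 400 tokens of width 256.
pe = LearnedPositionEncoding(num_tokens=400, dim=256)
out = pe(torch.randn(2, 400, 256))
print(out.shape)  # torch.Size([2, 400, 256])
```

Because the offsets are ordinary parameters, they are optimized jointly with the detector rather than held fixed, which is what allows the encoding to adapt to the feature statistics of the task.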
Figure 7. CSP_PMSFA Module Structure Diagram. Green: cascaded convolutions (3 × 3 → 5 × 5 → 7 × 7); blue: preserved original features; the branches are fused by channel concatenation followed by a 1 × 1 convolution and a residual connection for efficient multi-scale feature fusion.
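The fusion pattern described in the Figure 7 caption can be made concrete with a short PyTorch sketch: one branch applies the cascaded 3 × 3 → 5 × 5 → 7 × 7 convolutions, the input features are preserved on a second branch, and the outputs are concatenated, fused by a 1 × 1 convolution, and added back through a residual connection. This is a minimal sketch of the pattern only; the class name, activation choice, and channel sizes are assumptions rather than the authors’ CSP_PMSFA implementation.

```python
import torch
import torch.nn as nn

class PMSFASketch(nn.Module):
    """Illustrative multi-scale fusion block following Figure 7:
    cascaded 3x3 -> 5x5 -> 7x7 convolutions on one branch, an identity
    branch preserving the input, then channel concatenation, a 1x1
    fusion convolution, and a residual connection."""

    def __init__(self, channels: int):
        super().__init__()
        c = channels
        # Cascaded convolutions with growing kernel sizes.
        self.conv3 = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(c, c, kernel_size=5, padding=2)
        self.conv7 = nn.Conv2d(c, c, kernel_size=7, padding=3)
        # 1x1 convolution fusing the concatenated multi-scale features.
        self.fuse = nn.Conv2d(4 * c, c, kernel_size=1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f3 = self.act(self.conv3(x))   # local 3x3 context
        f5 = self.act(self.conv5(f3))  # cascaded: wider effective field
        f7 = self.act(self.conv7(f5))  # cascaded: widest effective field
        fused = self.fuse(torch.cat([x, f3, f5, f7], dim=1))
        return x + fused               # residual connection

# Quick shape check on a dummy feature map.
block = PMSFASketch(64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```

Cascading the convolutions lets the 5 × 5 and 7 × 7 stages operate on already-aggregated context, enlarging the effective receptive field at lower cost than applying each large kernel directly to the raw input.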
Figure 8. Comparison of detection effects before and after improvement: (a) original input image, (b) RT-DETR, and (c) proposed model.
Figure 9. Comparison of mainstream algorithms: (a) mAP0.5, (b) mAP0.5:0.95, (c) loss, (d) precision, and (e) recall.
Table 1. Results of the ablation experiment.
CSP   Conv3   AIFI   P (%)   R (%)   AP@0.5 (%)   AP@0.5:0.95 (%)   Params (M)   FPS    F1 (%)
×     ×       ×      95.0    95.3    96.0         64.8              38.6         63.7   94.0
✓     ×       ×      95.3    94.6    95.6         63.5              27.6         66.3   92.0
×     ✓       ×      94.8    94.6    95.7         64.4              39.3         48.7   94.0
×     ×       ✓      94.7    95.3    95.9         64.9              38.8         67.4   94.0
✓     ✓       ×      96.0    95.3    95.9         66.2              28.3         58.3   95.0
×     ✓       ✓      96.0    95.3    95.9         66.2              39.5         62.4   93.0
✓     ✓       ✓      96.2    95.3    97.0         64.7              28.5         67.3   95.0
Note: ✓: module used; ×: module not used.
Table 2. Comparison of OCR models.
Model      Acc      Norm_edit_dis   FPS
PP-OCRv2   85.30%   0.953           402.7
PP-OCRv3   91.60%   0.965           1613
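For context, the Table 2 recognizers can be exercised through PaddleOCR’s Python interface. The snippet below is a minimal usage sketch assuming the paddleocr 2.x API, in which the ocr_version argument selects between PP-OCRv2 and PP-OCRv3; the image path is hypothetical and stands for a plate region already cropped by the detector.

```python
# Requires: pip install paddleocr paddlepaddle
from paddleocr import PaddleOCR

# ocr_version selects the model family compared in Table 2;
# lang="ch" covers the Chinese vessel names in this dataset.
ocr = PaddleOCR(ocr_version="PP-OCRv3", lang="ch", use_angle_cls=True)

# "plate_crop.jpg" is a hypothetical path to a detector-cropped plate region.
result = ocr.ocr("plate_crop.jpg", cls=True)
for line in result[0]:
    box, (text, confidence) = line
    print(text, confidence)
```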
Table 3. Environmental Robustness Evaluation Across Weather Conditions.
Condition   Detection mAP@0.5   OCR Success Rate   OCR Confidence
Sunny       97.8%               100.0%             0.903
Cloudy      95.7%               100.0%             0.986
Rainy       95.5%               100.0%             0.936
Foggy       90.1%               77.8%              0.947
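The consistency across conditions can be checked directly from Table 3. The snippet below is a minimal sketch that computes the population standard deviation of the detection and confidence columns as copied from the table.

```python
import statistics

# Per-condition values from Table 3 (sunny, cloudy, rainy, foggy).
detection_map = [97.8, 95.7, 95.5, 90.1]       # Detection mAP@0.5 (%)
ocr_confidence = [0.903, 0.986, 0.936, 0.947]  # OCR confidence

# Population standard deviation across the four weather conditions.
print(f"{statistics.pstdev(detection_map):.2f}")   # ~2.85 percentage points
print(f"{statistics.pstdev(ocr_confidence):.4f}")  # ~0.0296
```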