1. Introduction
With the rapid development of inland waterways and their important role in cargo transportation, water management, and other fields, ship target recognition technology has gradually become a key means of ensuring the safety of inland waterways and improving shipping efficiency. In complex inland environments, ship targets often present diverse appearance features and variable size distributions, and are affected by factors such as water reflection, occlusion, and illumination changes, which makes ship target recognition highly challenging.
Ship detection is a typical object detection task. Object detection methods can usually be divided into two categories: traditional machine vision methods [1] and deep learning methods [2]. However, traditional machine vision methods are not only sensitive to illumination and viewpoint changes but also show great limitations when dealing with small targets in complex backgrounds. In addition, earlier ship detection methods usually rely on manual feature extraction, which leads to slow detection and imprecise localization of the ship. Especially in inland river environments with dense traffic and cluttered nearshore backgrounds, traditional methods struggle to separate the ship from the complex background for detection. These shortcomings make it difficult for traditional methods to meet the high-precision requirements of ship target detection in practical applications. Deep learning technology, by contrast, has advanced considerably. Among these advancements, the multi-feature-layer extraction strategy of convolutional neural networks can better adapt to the recognition and localization of ship targets, and has become an effective way to improve the performance of ship target detection.
Object detection methods based on deep learning have been widely used in industry because of their excellent feature learning and multi-task processing abilities. Such methods are mainly divided into two types. The first is the two-stage detection algorithm, represented by Fast R-CNN (Fast Region-Based Convolutional Neural Network) [3] and Faster R-CNN [4]. These algorithms perform well in detection accuracy, but their detection speed is relatively slow because they complete candidate region generation and object classification in two steps. The second is the single-stage detection algorithm, with typical representatives including YOLO (You Only Look Once) [5], SSD (Single Shot MultiBox Detector) [6], and RetinaNet [7]. Single-stage methods output target coordinates and class probabilities simultaneously through a single network, which significantly improves detection speed compared with two-stage algorithms. In the field of ship target detection, single-stage algorithms show greater application potential by virtue of their efficiency and practicality. Gao et al. [8] introduced a dynamic denoising module into YOLOv4 to construct a new feature pyramid network and enhance its representation ability. Based on the RPDet algorithm, Xu et al. [9] introduced a path aggregation feature pyramid structure and used CIoU as the bounding box loss function to realize ship target detection. Woo et al. [10] proposed ConvNeXt V2, which can be applied to YOLO to significantly improve performance on various recognition benchmarks. Kim et al. [11] proposed an efficient channel attention pyramid (ECAP-YOLO) to avoid greater information loss.
To further improve model performance, Liu et al. [12] inserted CA attention into the backbone network of YOLOv7-tiny; the SE module generates weights for each channel through global average pooling and enhances important channel features through weighting. This method can effectively suppress irrelevant features and improve the model's ability to separate targets from complex backgrounds. At the same time, the SPP structure was modified and the original loss function was replaced by SIoU, which improved detection performance. Li et al. [13] inserted a CA module into YOLOv3, which enhanced the correlation between channel and spatial features by incorporating coordinate information. Compared with traditional attention mechanisms, the CA module can capture spatial location information and channel dependence simultaneously and performs well, especially in small target detection. Zhao et al. [14] inserted the CBAM module into the feature extraction stage of YOLOv5n; CBAM combines channel attention and spatial attention and can effectively highlight important features while ignoring background noise.
The above methods optimize the general object detection algorithm to different degrees to improve ship detection performance. However, they still struggle to effectively separate the ship from the complex onshore background in nearshore scenes. At the same time, their feature extraction and expression abilities are relatively weak for occluded and small-size ship targets, which leads to unsatisfactory detection results and limits their performance in actual complex environments. While YOLOv10n is lightweight and efficient, it struggles with detecting small objects in complex backgrounds and handling occlusions due to its limited feature extraction capabilities. YOLOv11, on the other hand, despite its improved accuracy and speed, is computationally expensive and less suitable for real-time applications on resource-constrained devices.
In response to the above problems, this paper proposes an inland river ship detection and identification method based on improved YOLOv10 [15]: CDS-YOLO. Firstly, the Context Anchor Attention mechanism [16] (CAA) is introduced and the C2f_CAA module is constructed to enhance the features of the central region and improve the understanding of complex scenes. Then, the Diverse Branch Block [17] (DBB) is used to replace the traditional convolution module in the Backbone network, which enhances the expression ability of a single convolution and enriches the feature space. Finally, in the Neck part, the traditional C2f module is replaced by the VoVGSCSP structure of Slim-Neck [18] and ordinary convolution is replaced by GSConv, which optimizes the network structure, reduces model complexity, and realizes more efficient multi-scale feature fusion. Experimental results show that the proposed CDS-YOLO has better detection ability than YOLOv10 under the interference of complex background information, and achieves a better detection effect for occluded and small ship targets.
2. Method of This Article
This section first introduces the overall framework of the proposed method, as illustrated in Figure 1. The first part provides an overview of the YOLOv10 baseline network. The second part focuses on the design of the feature extraction network, where the Context Anchor Attention (CAA) mechanism is introduced into the Backbone network. The CAA mechanism combines global semantics and local features through context anchors, effectively enhancing the network’s ability to focus on key regions. Additionally, CAA optimizes feature distribution through feature division and remodeling, further improving the network’s perception of multi-scale features. Based on this, we propose an improved module, C2f_CAA, which integrates CAA into the C2f module. This integration enables the Backbone network to more accurately extract salient target features in complex backgrounds while suppressing interference and noise.
To further enhance the diversity of feature representation and improve the model’s detection performance for small and occluded objects, we replace the traditional convolution module with a Diverse Branch Block (DBB). DBB expands the feature expression space through a multi-branch design and employs structural reparameterization techniques to enrich feature diversity and enhance the network’s representation capability, all while maintaining inference efficiency.
The third part describes the lightweight optimization applied to the Neck network by introducing the GSConv (Grouped and Shuffled Convolution) and VoVGSCSP (VoVNet Grouped and CSP) modules from the Slim-Neck architecture. GSConv combines the advantages of standard convolution and depthwise separable convolution, achieving efficient cross-channel feature fusion through a feature shuffling mechanism. This significantly reduces computational costs while enhancing the expression of multi-scale features. VoVGSCSP integrates the strengths of multi-branch feature extraction and cross-stage feature fusion from VoVNet, further improving feature fusion efficiency through branch extraction and cross-channel compression.
Through these designs, the Neck network not only significantly reduces the number of model parameters and computational complexity but also maintains high detection performance. Moreover, it effectively improves the detection capability for small and medium-sized targets in complex backgrounds, providing a reliable solution for real-time detection tasks on resource-constrained devices.
2.1. YOLOv10 Algorithm
As a recent version of the YOLO series, YOLOv10 further optimizes the network structure while maintaining high efficiency and accuracy. It has the advantages of few parameters, fast computation, and strong compatibility, which makes it well suited to scenarios with limited hardware resources. YOLOv10 is mainly composed of a Backbone feature extraction network (Backbone), a Neck network (Neck), and a detection Head (Head). The input image is first passed through the Backbone network, where features are extracted by a design that includes large-kernel convolutions, a Partial Self-Attention (PSA) module, and a spatial-channel decoupled downsampling module. The feature maps output by the Backbone include multiscale feature maps at 8×, 16×, and 32× downsampling. These feature maps are then fed into the Neck network. The Neck part of YOLOv10 uses enhanced feature fusion modules, such as the BiC and RepGFPN modules, to achieve efficient fusion of multi-scale features and significantly improve the detection of small objects. Finally, the fused feature map is passed to the detection Head, which outputs the classification result and the precise location of the object. The Head part of YOLOv10 is lightweight by design, further reducing computational cost and inference latency while maintaining detection accuracy. The YOLOv10n and CDS-YOLO network structures are shown in Figure 1.
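For reference, the YOLOv10n baseline can be loaded and trained through the Ultralytics API, which integrates the YOLOv10 release. The snippet below is a minimal sketch: the dataset YAML path and the hyperparameter values are placeholders, not the authors' actual configuration (their training parameters are given in Table 1).

```python
from ultralytics import YOLO

# Load the YOLOv10n baseline (weight name per the Ultralytics YOLOv10 release).
model = YOLO("yolov10n.pt")

# Train on a SeaShips-style dataset; the YAML path, epochs, image size, and
# batch size are placeholders, not the authors' configuration (see Table 1).
model.train(data="seaships.yaml", epochs=200, imgsz=640, batch=16)

# Run inference on a sample image and display the detections.
results = model("ship.jpg")
results[0].show()
```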
2.2. Improvements Based on CAA Module
The attention mechanism has become an essential component in deep learning models thanks to its flexible structural designs and powerful ability to enhance feature representation. Classical attention mechanisms such as Squeeze-and-Excitation [19] (SE) and the Convolutional Block Attention Module [20] (CBAM) have demonstrated remarkable performance in allocating attention weights across dimensions and integrating features effectively. These methods significantly enhance the model’s ability to capture critical features, which is crucial for tasks involving complex data. However, the computational efficiency and feature representation capabilities of these traditional attention mechanisms can still be improved, particularly in scenarios requiring the processing of intricate and diverse feature patterns.
To address these limitations and further improve the expressive capacity of features, this paper introduces the Context Anchor Attention (CAA) mechanism. CAA integrates global semantic information with localized feature details by leveraging the context anchor mechanism, which directs the model’s focus toward critical areas of the feature map. Unlike traditional attention modules, CAA adopts a unique approach by partitioning the channel dimension into multiple sub-feature groups. This division, followed by feature reshaping, optimizes the feature distribution and enhances the model’s perception of multi-scale features. Furthermore, CAA employs a cross-dimensional interaction mechanism to fuse feature outputs from various branches, effectively enriching feature representation and improving the multi-scale perception capability of the network.
A distinctive strength of CAA lies in its ability to balance computational efficiency with feature learning capacity. By focusing computational resources on key areas and integrating critical information across dimensions, CAA significantly enhances the model’s ability to learn complex feature patterns without introducing substantial computational overhead. The implementation of the CAA module consists of the following steps:
$$F_{\text{pool}} = \mathrm{Conv}_{1\times1}\left(P_{\text{avg}}(X)\right) \quad (1)$$

$$F_{h} = \mathrm{DWConv}_{1\times k_{b}}\left(F_{\text{pool}}\right) \quad (2)$$

$$F_{w} = \mathrm{DWConv}_{k_{b}\times1}\left(F_{h}\right) \quad (3)$$

$$A = \sigma\left(\mathrm{Conv}_{1\times1}\left(F_{w}\right)\right) \quad (4)$$

$$Y = A \odot X \quad (5)$$

where $X$ is the input feature map, $P_{\text{avg}}(\cdot)$ is the global average pooling operation, and $\mathrm{Conv}_{1\times1}$ represents the $1\times1$ convolution operation. Equations (2) and (3) represent the use of two 1D strip convolutions to extract long-distance features in the horizontal and vertical directions, respectively, where $k_{b}$ is the convolution kernel size. Equation (4) represents the fusion of the output of the direction-aware convolutions into the original features to generate the attention weights $A$, where $\sigma$ is the Sigmoid activation function. Equation (5) represents the enhancement of the original features by the weighting operation, where $\odot$ denotes element-wise multiplication.
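To make Equations (1)–(5) concrete, the following is a minimal PyTorch sketch of a CAA block. It is an illustrative reconstruction, not the authors' code: the default kernel size $k_{b} = 11$ and the local 7×7 pooling window follow the public PKINet description [16], and all class and argument names are our own.

```python
import torch
import torch.nn as nn

class CAA(nn.Module):
    """Context Anchor Attention sketch following Eqs. (1)-(5)."""
    def __init__(self, channels: int, kernel_size: int = 11):
        super().__init__()
        # Eq. (1): average pooling + 1x1 conv (PKINet's public code uses a
        # local 7x7 pooling window rather than a strictly global pool)
        self.pool = nn.AvgPool2d(7, stride=1, padding=3)
        self.conv1 = nn.Conv2d(channels, channels, 1)
        # Eqs. (2)-(3): 1-D strip (depthwise) convolutions, horizontal then vertical
        self.h_conv = nn.Conv2d(channels, channels, (1, kernel_size),
                                padding=(0, kernel_size // 2), groups=channels)
        self.v_conv = nn.Conv2d(channels, channels, (kernel_size, 1),
                                padding=(kernel_size // 2, 0), groups=channels)
        # Eq. (4): 1x1 conv + Sigmoid producing the attention map A
        self.conv2 = nn.Conv2d(channels, channels, 1)
        self.act = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.conv1(self.pool(x))        # Eq. (1)
        attn = self.v_conv(self.h_conv(attn))  # Eqs. (2)-(3)
        attn = self.act(self.conv2(attn))      # Eq. (4)
        return attn * x                        # Eq. (5): element-wise reweighting

print(CAA(64)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```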
In this paper, the CAA attention mechanism is introduced into the Backbone network of YOLOv10, and the CAA attention module is used to improve the C2f module. The CAA block is inserted into the output stage of the C2f module to construct the C2f_CAA module, which replaces the C2f module in the Backbone network and enhances the Backbone's ability to express the key regions of the input feature map. Through the context anchor mechanism, CAA can more accurately capture global semantics and key regional features, so as to effectively focus on the salient features of the ship target in complex backgrounds while suppressing background noise. This improvement significantly strengthens the Backbone's feature extraction for small and occluded objects and provides more efficient and accurate feature representations for the subsequent Neck and Head parts, thereby improving the overall detection performance of the model. The CAA attention module structure is shown in Figure 2.
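As a sketch of how CAA can be attached to the output stage of C2f, the snippet below subclasses the Ultralytics C2f implementation and reuses the CAA class from the sketch above. The class name C2f_CAA follows the paper; the exact wiring is our assumption.

```python
import torch
from ultralytics.nn.modules import C2f  # assumes the Ultralytics package

class C2f_CAA(C2f):
    """C2f with CAA applied at the output stage (wiring is illustrative)."""
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):
        super().__init__(c1, c2, n, shortcut, g, e)
        self.caa = CAA(c2)  # CAA class from the sketch above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reweight the fused C2f output with context anchor attention.
        return self.caa(super().forward(x))
```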
2.3. Improvement of Convolution Module
In YOLOv10, the output of the Backbone network is a three-dimensional tensor with height, width, and channel dimensions. In order to enhance the multi-scale perception ability, spatial semantic feature expression ability, and feature learning adaptability of the model, this paper replaces the standard convolution (Conv) in the original structure of YOLOv10 with the Diverse Branch Block (DBB). As a general convolution module, DBB enriches the feature expression space through a multi-branch design and introduces a nonlinear enhancement mechanism in the training phase, which effectively improves feature diversity and the representation ability of the network. In addition, DBB adopts the structural reparameterization technique to simplify the complex multi-branch structure into a single convolution in the inference stage, thus balancing model performance and inference efficiency. This design makes DBB not only suitable for single-stage object detection models but also able to show superior feature learning ability in a variety of tasks.
In the training phase, the multi-branch structure of the DBB module includes three typical paths: the original $K\times K$ convolution branch, which captures local features; the sequential convolution branch, a $1\times1$ convolution followed by a $K\times K$ convolution, used for feature compression and context modeling; and the average pooling branch, which smooths features via a combination of $1\times1$ convolution and average pooling.
The multi-branch structure formula of the DBB module is expressed as follows:

$$Y = \sum_{i=1}^{N}\left(X * K_{i} + b_{i}\right) \quad (6)$$

where $X$ represents the input feature map; $K_{i}$ and $b_{i}$ denote the kernel and bias of the $i$-th branch; $*$ denotes the convolution operation; $Y$ denotes the output feature map of the module.
Each branch is independently equipped with a Batch Normalization (BN) layer, which provides a nonlinear enhancement in the training phase to further improve the feature representation ability. The convolution operation with BN can be expressed as follows:

$$O_{j} = \gamma_{j}\,\frac{(X * K)_{j} - \mu_{j}}{\sigma_{j}} + \beta_{j} \quad (7)$$

where $\mu_{j}$ and $\sigma_{j}$ are the mean and standard deviation of the $j$-th channel, respectively; $\gamma_{j}$ and $\beta_{j}$ are learnable scaling and bias parameters.
In the inference stage, the DBB module merges the parameters of multiple branches into an equivalent single convolutional layer through the structural reparameterization technique. For the merged convolution kernel $K'$ and bias $b'$, the merging formula is as follows:

$$K' = \sum_{i=1}^{N}\frac{\gamma_{i}}{\sigma_{i}}K_{i}, \qquad b' = \sum_{i=1}^{N}\left(\beta_{i} - \frac{\gamma_{i}\mu_{i}}{\sigma_{i}}\right) \quad (8)$$
In this way, the complex multi-branch training structure is simplified into a standard convolutional layer, which greatly reduces the computational cost of the inference stage without affecting the performance of the model.
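To make the reparameterization in Equations (7) and (8) concrete, the following self-contained PyTorch check folds BN into two parallel branches (a 3×3 and a 1×1 convolution, a simplified stand-in for the full DBB branch set) and verifies that the merged single convolution reproduces the multi-branch output. It illustrates the principle only, not the full DBB implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Fold a BN layer into the preceding conv (per-branch terms of Eq. (8))."""
    std = (bn.running_var + bn.eps).sqrt()
    k = conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1)  # (gamma/sigma) * K_i
    b = bn.bias - bn.weight * bn.running_mean / std           # beta - gamma*mu/sigma
    return k, b

c = 8
branch3 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c)).eval()
branch1 = nn.Sequential(nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c)).eval()
for bn in (branch3[1], branch1[1]):  # give BN non-trivial statistics
    bn.weight.data.uniform_(0.5, 1.5)
    bn.bias.data.uniform_(-0.5, 0.5)
    bn.running_mean.uniform_(-0.2, 0.2)
    bn.running_var.uniform_(0.5, 1.5)

k3, b3 = fuse_conv_bn(*branch3)
k1, b1 = fuse_conv_bn(*branch1)
k1 = F.pad(k1, [1, 1, 1, 1])     # embed the 1x1 kernel in a 3x3 kernel
k, b = k3 + k1, b3 + b1          # Eq. (8): sum the folded branches

x = torch.randn(1, c, 16, 16)
with torch.no_grad():
    y_multi = branch3(x) + branch1(x)        # training-time multi-branch output
    y_single = F.conv2d(x, k, b, padding=1)  # reparameterized single convolution
print(torch.allclose(y_multi, y_single, atol=1e-5))  # True
```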
The introduction of DBB into YOLOv10 enhances the adaptation ability of the backbone network to complex spatial and semantic features. By adopting the multi-branch design of DBB, the network can flexibly capture diverse and complex feature patterns, which is crucial for detecting objects at different scales and contexts. The structure reparameterization technique ensures that this enhanced learning ability will not harm the efficiency of the model in the inference process, and the YOLOv10 model after replacing the original convolution with DBB convolution greatly improves the expression ability of a single convolution.
Figure 3 shows the DBB convolutional structure.
2.4. Improvement of Neck Network Based on Slim-Neck
Slim-Neck is an efficient neck architecture designed for a lightweight object detection network. The core of Slim-Neck is to greatly reduce the computational cost while ensuring the detection performance of the model by introducing an efficient GSConv module and an optimized feature fusion strategy. Compared with the traditional neck network, Slim-Neck significantly reduces the number of parameters and computational complexity while maintaining the detection accuracy, which makes it perform particularly well in real-time detection tasks and resource-constrained scenarios.
The efficiency of Slim-Neck is mainly due to the application of the Grouped and Shuffled Convolution (GSConv) module. GSConv combines the characteristics of Standard Convolution (SC) and Depthwise Separable Convolution (DSC), and achieves efficient fusion of cross-channel features through the feature Shuffle mechanism. Specifically, the computation of GSConv is divided into two branches.
Standard convolution is used in the main branch, which is expressed as follows:
$$F_{\mathrm{SC}} = \mathrm{Conv}_{k\times k}(X) \quad (9)$$

where $F_{\mathrm{SC}}$ is the output feature of the main branch, $X$ is the input feature map, and $k$ is the size of the convolution kernel.
The auxiliary branch uses depthwise separable convolution, which is expressed as follows:
$$F_{\mathrm{DSC}} = \mathrm{DepthwiseConv}\left(\mathrm{PointwiseConv}(X)\right) \quad (10)$$

where $\mathrm{PointwiseConv}$ is a $1\times1$ convolution for channel mapping, and $\mathrm{DepthwiseConv}$ is used to reduce the computational cost and focus on spatial features.
After concatenating the output features of the two branches in the channel dimension, information exchange is realized through feature shuffling. The final output feature is expressed as follows:

$$Y = \mathrm{Shuffle}\left(\mathrm{Concat}\left(F_{\mathrm{SC}}, F_{\mathrm{DSC}}\right)\right) \quad (11)$$

where $\mathrm{Concat}$ represents the channel concatenation operation and $\mathrm{Shuffle}$ is used to mix the channel features to enhance the expressive power. The GSConv structure is shown in Figure 4.
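The following is a minimal PyTorch sketch of GSConv as described by Equations (9)–(11). It follows the two-branch reading of the text above (the official Slim-Neck code instead feeds the standard-convolution output into the depthwise convolution); the channel split and kernel sizes are illustrative.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """GSConv sketch per Eqs. (9)-(11): SC branch + DSC branch, concat, channel shuffle."""
    def __init__(self, c1: int, c2: int, k: int = 3, s: int = 1):
        super().__init__()
        c_ = c2 // 2
        # Main branch: standard k x k convolution, Eq. (9)
        self.sc = nn.Sequential(
            nn.Conv2d(c1, c_, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())
        # Auxiliary branch: pointwise channel mapping then depthwise conv, Eq. (10)
        self.dsc = nn.Sequential(
            nn.Conv2d(c1, c_, 1, s, bias=False),                     # PointwiseConv
            nn.Conv2d(c_, c_, k, 1, k // 2, groups=c_, bias=False),  # DepthwiseConv
            nn.BatchNorm2d(c_), nn.SiLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.cat((self.sc(x), self.dsc(x)), dim=1)  # Eq. (11): Concat
        b, c, h, w = y.shape
        # Eq. (11): Shuffle -- interleave the channels of the two halves
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

print(GSConv(32, 64)(torch.randn(1, 32, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```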
On this basis, Slim-Neck further introduces an optimized VoVGSCSP (VoVNet Grouped and CSP) module to enhance the multi-scale feature fusion ability. VoVGSCSP combines the multi-branch feature extraction of VoVNet and the cross-stage feature fusion mechanism of CSP (Cross Stage Partial).
The input features are first divided into two parts, denoted as:

$$[X_{1}, X_{2}] = \mathrm{Split}(X) \quad (12)$$

$$F = \mathrm{Concat}\left(\mathrm{GSConv}_{1}(X_{1}), \ldots, \mathrm{GSConv}_{n}(X_{1})\right) \quad (13)$$

where $X_{1}$ is the main branch feature and $X_{2}$ is the bypass reserved feature. Equation (13) indicates that the main branch extracts multi-scale features through multiple GSConv branches, and $n$ is the number of branches.
After concatenating the two features in the channel dimension, they are compressed and enhanced by a GSConv layer, and the fused feature map is finally output:

$$Y = \mathrm{GSConv}\left(\mathrm{Concat}(F, X_{2})\right) \quad (14)$$

The GS bottleneck and VoVGSCSP structures are shown in Figure 5.
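A corresponding sketch of VoVGSCSP per Equations (12)–(14) is given below, reusing the GSConv class from the previous sketch. The split is implemented here as two 1×1 projections and the branch count n = 2 is illustrative; both are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class VoVGSCSP(nn.Module):
    """VoVGSCSP sketch per Eqs. (12)-(14): split, n GSConv branches, concat, compress."""
    def __init__(self, c1: int, c2: int, n: int = 2):
        super().__init__()
        c_ = c1 // 2
        self.to_main = nn.Conv2d(c1, c_, 1, bias=False)    # X1 in Eq. (12)
        self.to_bypass = nn.Conv2d(c1, c_, 1, bias=False)  # X2 in Eq. (12)
        self.branches = nn.ModuleList(GSConv(c_, c_, k=3) for _ in range(n))  # Eq. (13)
        self.fuse = GSConv(c_ * (n + 1), c2, k=1)          # Eq. (14): compress + enhance

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = self.to_main(x), self.to_bypass(x)       # Eq. (12), via 1x1 projections
        feats = [branch(x1) for branch in self.branches]  # Eq. (13): multi-branch features
        return self.fuse(torch.cat(feats + [x2], dim=1))  # Eq. (14)

print(VoVGSCSP(64, 64)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```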
In this work, the Slim-Neck architecture is incorporated into CDS-YOLO, a model designed for ship detection tasks. Slim-Neck not only reduces the model’s computational overhead but also enhances its adaptability to diverse detection scenarios, such as ships in cluttered or dynamic backgrounds.
To further boost detection performance, the model employs a dual-attention mechanism within Slim-Neck, augmenting the representation of ship target features. This dual-attention mechanism ensures that the model focuses on salient features while mitigating background interference. The architecture achieves a fine balance between performance and computational efficiency, making it particularly effective for detecting small or occluded targets.
3. Experiment and Analysis
3.1. Experiment Configuration and Parameters
The experiments in this paper were conducted on Windows 11 Home (64-bit, Chinese edition), with an Intel Core i5-14600KF CPU @ 3.5 GHz, an NVIDIA GeForce RTX 4070 Ti Super GPU, and 32 GB of RAM. CUDA version 12.4 and PyTorch version 2.4.0 were used, and pre-trained weights were used for transfer learning. The training parameter configuration is shown in Table 1.
3.2. Experimental Data Set
The dataset used in this paper is the SeaShips [21] inland ship dataset, which contains 7000 images of inland ships under different backgrounds, weather conditions, viewpoints, and times. The SeaShips dataset contains images of six classes of ships: ore carriers, general cargo ships, bulk cargo carriers, container ships, passenger ships, and fishing boats. Partial samples are shown in Figure 6. The dataset was first divided into a training-and-validation set of 6300 images and a test set of 700 images according to a ratio of 9:1. The training-and-validation set was then divided into a training set of 5670 images and a validation set of 630 images, again at a ratio of 9:1.
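The two-stage 9:1 split described above can be reproduced with a few lines of Python; the directory layout and the fixed random seed below are assumptions, not taken from the paper.

```python
import random
from pathlib import Path

random.seed(0)  # fixed seed for reproducibility (an assumption, not from the paper)
images = sorted(Path("SeaShips/JPEGImages").glob("*.jpg"))  # hypothetical layout
random.shuffle(images)

n_test = len(images) // 10              # 7000 images -> 700 test images
test, trainval = images[:n_test], images[n_test:]
n_val = len(trainval) // 10             # 6300 images -> 630 validation images
val, train = trainval[:n_val], trainval[n_val:]
print(len(train), len(val), len(test))  # 5670 630 700
```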
3.3. Evaluation Metrics
In order to evaluate the recognition accuracy of the model, Precision, Recall, and mean Average Precision (mAP) are used as evaluation indicators. Precision indicates the proportion of correctly predicted positive samples among all samples predicted as positive. Recall is the proportion of correctly predicted positive samples among all actual positive samples. AP represents the average precision of a single class, and mAP is the mean of the AP values over all classes; the higher the value, the better the detection effect. The expressions for precision, recall, average precision, and mean average precision are as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (15)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (16)$$

$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R \quad (17)$$

$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_{i} \quad (18)$$

where TP stands for True Positives, FP stands for False Positives, TN stands for True Negatives, and FN stands for False Negatives; $n$ is the number of ship classes, with $n = 6$ for the dataset in this paper.
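As a sketch of Equations (15)–(18): precision and recall from raw counts, AP as the area under the precision-recall curve (all-point interpolation), and mAP as the class average. All numeric values in the example are illustrative, not results from the paper.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Eqs. (15)-(16) from raw detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Eq. (17): area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]  # make precision monotonically decreasing
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

print(precision_recall(tp=95, fp=5, fn=3))  # (0.95, 0.9693877551020408)

# Eq. (18): mAP as the mean AP over the n = 6 ship classes (illustrative values)
aps = [0.993, 0.988, 0.995, 0.991, 0.987, 0.996]
print(sum(aps) / len(aps))
```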
3.4. Comparative Experiments
In order to further verify the superiority of the improved algorithm, CDS-YOLO is compared with mainstream object detection algorithms such as YOLOv5, YOLOv8 [22], Fast R-CNN, Faster R-CNN, Mask R-CNN, and SSD under the same experimental conditions; the results are shown in Table 2 and Figure 7.
As illustrated in Table 2, the CDS-YOLO model introduced in this study demonstrates varying degrees of improvement in detection precision, recall rate, and mAP@0.5 relative to other leading object detection algorithms. The experimental evidence corroborates the superior performance of CDS-YOLO for inshore vessel detection. Per image, the original YOLOv10n network requires 0.1 ms for preprocessing, 1.0 ms for inference, 0.0 ms for loss computation, and 0.1 ms for postprocessing, while CDS-YOLO requires 0.1 ms, 1.1 ms, 0.0 ms, and 0.1 ms, respectively.
Figure 8 shows visual examples of the ship target detection results of YOLOv10n and CDS-YOLO. In the first set of examples, where two types of ships appear in the detection range and the ships occlude each other, CDS-YOLO raises the detection confidence for the bulk carrier from 0.63 to 0.87 compared with YOLOv10n, a significant improvement. In the second set of examples, where YOLOv10n misses detections, CDS-YOLO correctly detects all ships in the image with higher accuracy. CDS-YOLO is therefore better than YOLOv10n at detecting ship targets against complex backgrounds. In the third set of examples, CDS-YOLO significantly improves the detection confidence for small target ships.
In summary, the CDS-YOLO algorithm model proposed in this paper maintains the lightweight nature of the original algorithm. Additionally, it demonstrates enhanced feature perception and extraction capabilities. These improvements lead to higher detection accuracy for ship targets in complex inland river backgrounds. Furthermore, the model performs better in scenarios involving mutual occlusion of ships and the detection of small target ships. As a result, CDS-YOLO offers significant advantages for inland river ship target detection.
3.5. Ablation Experiment
In order to verify the effectiveness of the improvements to the YOLOv10n algorithm, an ablation experiment is designed based on YOLOv10n under the same experimental parameters, and the influence of each improvement on model performance is analyzed. The results of the ablation experiment are shown in Table 3, where “√” means that the method is used in the experiment and “×” means that it is not.
As shown in Table 3, the experimental results are analyzed as follows:
In Group A, the CAA attention mechanism was integrated into the Backbone network of YOLOv10n, and the C2f_CAA module was constructed simultaneously. After this improvement, the precision increased by 2.9%, the recall rate increased by 1.9%, and the mAP@0.5 increased by 0.5%. These results demonstrate the superiority of the CAA attention mechanism in feature extraction, enabling better capture of image features in inland ship images.
Group B and Group C experiments represent further improvements to YOLOv10n. In Group B, the traditional Conv in the Backbone network was replaced by DBB convolution. In Group C, the GSConv and VoVGSCSP modules from the Slim-Neck architecture were introduced. Both improvements enhanced the model’s performance to varying degrees. Notably, when the Slim-Neck structure was introduced, the number of parameters and floating-point operations decreased from 2.7 M parameters and 8.4 GFLOPS to 2.66 M parameters and 8.0 GFLOPS, respectively. These results verify the effectiveness of DBB convolution in enhancing feature representation and the lightweight nature of the Slim-Neck structure.
Group D and Group E experiments present the results of pairwise combinations of two improved methods applied to YOLOv10n after integrating the CAA attention mechanism into the Backbone network. The experimental results of these pairwise combinations show slight improvements compared to the first three groups of experiments.
The final group presents the experimental results of the CDS-YOLO algorithm proposed in this paper, trained on the dataset. Multiple sets of comparative experiments demonstrate that CDS-YOLO achieves the best overall performance, benefiting from its efficient multi-scale feature learning and cross-regional feature fusion capabilities. Compared to YOLOv10n, CDS-YOLO improves precision, recall rate, and mAP@0.5 by 3.7%, 2%, and 0.9%, respectively, reaching 98.4%, 97.4%, and 99.2%. These results indicate that CDS-YOLO can adapt to various scene changes and exhibits strong generalization performance. Additionally, while the computational cost does not increase significantly, CDS-YOLO meets the requirements for deployment on devices with limited computational resources.
Furthermore, Figure 9 compares the precision, recall, and mAP@0.5 curves of CDS-YOLO and YOLOv10n during training, further illustrating the performance advantages of the proposed method.
Precision is the proportion of predicted positive samples that are actually positive, and Recall is the proportion of actual positive samples that are correctly predicted. The larger the area under the PR curve, the better the recognition effect. The PR curves are shown in Figure 10 and Figure 11. Comparing the two PR curves, it can be intuitively seen that the improved model significantly improves the accuracy, which proves the effectiveness of CDS-YOLO in improving inshore ship target detection performance.
4. Discussion
The proposed CDS-YOLO method addresses the challenges of poor recognition accuracy in small target ship detection and the interference caused by complex backgrounds, such as shoreline buildings, trees, and ship occlusion. By introducing the CAA attention module into the Backbone network and constructing the C2f_CAA module, the model enhances the features of the central region, thereby improving its ability to understand complex scenes. Additionally, replacing Conv with DBB enriches the feature space and enhances the expression ability of a single convolution. The integration of GSConv and VoVGSCSP in the Neck network further optimizes the architecture, reducing model complexity while improving performance.
The experimental results on the Seaships dataset demonstrate the effectiveness of the proposed method, with detection accuracy, recall rate, and mAP@0.5 reaching 98.4%, 97.4%, and 99.2%, respectively. These results indicate that CDS-YOLO not only enhances the extraction and integration of multi-scale information but also improves the model’s detection performance in complex backgrounds. The method is particularly effective in capturing small target detection objects, showcasing its accuracy and robustness in the detection and classification of inshore ships.
However, the study has certain limitations. The proposed method has primarily been validated on the Seaships dataset, and its generalizability to other datasets or real-world scenarios remains to be investigated. Furthermore, the computational efficiency of the model in real-time applications needs thorough evaluation to ensure its practicality in operational environments.
5. Conclusions
In conclusion, the CDS-YOLO method presents a significant improvement in ship target detection, particularly in complex environments with small targets and background interference. The integration of attention mechanisms, enhanced convolution operations, and optimized network architecture contributes to the model’s high accuracy and robustness. Despite its promising performance, further research is needed to validate the method’s generalizability and computational efficiency in real-time applications.
Future research directions could explore the integration of CDS-YOLO with newer versions of the YOLO series, such as YOLOv11, to leverage the latest architectural advancements. Investigating how CDS-YOLO performs on these advanced models could provide valuable insights for further enhancing ship target detection in complex environments. Additionally, expanding the validation to diverse datasets and real-world scenarios would strengthen the method’s applicability and reliability.