1. Introduction
In recent years, with increasingly frequent maritime political, trade, and cultural exchanges, ships have played an important role as carriers of material and cultural communication. To ensure safe marine environments, smooth traffic, and efficient operation, real-time detection of ship targets has long attracted wide attention. The marine environment is complex and changeable: adverse natural conditions, such as rain, snow, and thunderstorms, make the interpretation of ship targets difficult. In addition, complex island backgrounds, such as intricately distributed islands and reefs, ports and terminals, and maritime facilities and equipment, cause occlusion and overlap interference around ships and reduce image detection accuracy. Moreover, ships at sea vary widely in type and size; small, multiple dynamic targets in particular are both important and difficult targets for sea-surface monitoring, early warning, and tracking.
There are many traditional methods for detecting maritime vessel targets, such as the constant false alarm rate (CFAR) method for SAR images [
1], edge detection in ship infrared images based on an adaptive Canny operator [
2], and wavelet-transform-based detection methods for ordinary optical ship images. SAR-based ship detection exploits the scattering mechanism, using the different beam returns of the ship versus the background or other targets to detect ships. Infrared ship detection has good penetration, is unaffected by ambient brightness, and can interpret ship target information via fuzzy mathematics and spectral classification. Detection based on common optical images can rapidly detect multiple targets in a small sea area with high accuracy, using fuzzy analysis, wavelet transforms, gray-histogram segmentation, and so on. However, traditional sea-surface ship detection methods mostly require target features to be set by hand, which entails a large workload, makes feature design difficult, and leaves detection vulnerable to weather conditions, the marine environment, sea clutter, complex electromagnetic interference, and other factors. SAR images are highly sensitive to polarization mode, weather, wind, etc.; infrared detection is easily disturbed by non-uniform quantization noise and suffers from poor image clarity; and visible-light detection is sensitive to target shape, weather conditions, and whether the target's appearance is fully visible.
The complex marine environment involves not only harsh natural conditions but also a complex electromagnetic environment, formed by interweaving, overlapping radiation sources across frequency bands (navigation signals, communication signals, radio broadcasting, and television signals), which limits target detection methods based on infrared and radar images and makes accurate detection difficult. Occlusion, overlap, and complex island backgrounds all strongly interfere with target detection. Widely distributed long-range small targets occupy few pixels in the image, which further increases processing difficulty. The 21st century has brought rapid development of related technologies in artificial intelligence: deep learning-based target detection networks have gradually matured, offer high-performance detection useful for accurate detection and identification, have been continuously improved since their inception, and have reached detection accuracy comparable to the human eye. Deep learning-based target detection algorithms are divided into two-stage and one-stage algorithms; two-stage algorithms achieve higher detection accuracy, but their detection time is long, making real-time operation difficult.
One-stage object detection networks simplify detection into an end-to-end processing problem, which greatly improves detection efficiency, though their accuracy is slightly lower than that of two-stage networks. Scholars in many fields have carried out long-term research on improving one-stage detection networks. Wang et al. [
3] introduced the CFE (Comprehensive Feature Enhancement) module in CFENet [
4] in the YOLOv3 network and improved the detection performance of small and medium-sized ship objects in visible images; Yang et al. [
5] proposed an anchor-free ship detector based on rotated bounding boxes; Gu Jiaojiao et al. [
6] introduced a multi-scale feature fusion module into a Faster R-CNN network to improve its detection performance; Ma Z. F. [
7] proposed a multi-channel SAR image fusion processing method, which improves the detection accuracy of the YOLOv4 network.
The YOLO series of one-stage target detection algorithms offers high detection accuracy, fast speed, and robustness to interference, especially for small targets. Subsequent improved networks have made the model more streamlined, greatly improving training and detection speed [
8].
The YOLOv11 target detection network, proposed in 2024, better balances detection efficiency and accuracy; both have improved by leaps and bounds compared with previous target detection methods, with outstanding performance on a number of visual processing tasks. In addition, being based on visible images, the YOLOv11 target detection algorithm is unaffected by complex electromagnetic interference, can more accurately interpret occluded targets and small targets, and can more accurately distinguish multiple types of ships. The YOLOv11 algorithm is also more lightweight and suitable for hardware-constrained scenarios; in the future, the network can be migrated to embedded platforms and carried on mobile platforms, such as drones, to realize real-time detection of marine ship information. To further improve accurate detection of multiple types of dynamic targets in complex marine environments, especially under occlusion, overlap, small-target, and dynamic conditions; to improve detection efficiency; and to respect the equipment limitations of UAVs and similar platforms, this paper proposes an improved YOLOv11 ship target detection network and carries out a study based on visible-light images, avoiding the impact of complex electromagnetic and environmental interference on infrared/radar image acquisition equipment at sea. The main work is as follows:
The original YOLOv11 backbone network is replaced with the improved EfficientNetv2 network, and the coordinate attention (CA) mechanism is introduced to improve ship feature learning under complex sea conditions.
To reduce missed and false detections of densely distributed ship targets interfered with by moving objects at sea, the algorithm incorporates the ConvNeXt Block idea into the neck feature pyramid fusion process, allowing targeted segmentation of target feature regions in 2D space. This reduces the influence of high-level semantic noise during context semantic fusion and helps the model extract features more conducive to ship classification.
Since the traditional IoU distance metric is affected by the limited number of pixels and may adversely affect the detection of small ship targets, the WIoU loss function is introduced into the model. This loss function compensates for the small pixel count of small targets during regression loss calculation and thus improves small-target detection performance.
A self-built visible-light ship image dataset was created, covering a variety of complex backgrounds, occlusion and overlap, small-target scenes, ship targets at multiple scales, and images from different angles, to enhance network adaptability.
3. Improved YOLOv11 Dynamic Multi-Type Ship Target Detection Algorithm
To improve accurate and efficient detection of diverse ship targets in complex scenes, this paper focuses on strengthening the backbone network, improving the neck network, and optimizing the loss function to enhance detection performance based on visible-light images.
3.1. EfficientNetv2 + CA-ECNet
Influenced by complex sea conditions, severe weather, and the performance of image acquisition equipment, collected images of ships sailing at sea are prone to uneven lighting and uneven imaging, so target-feature extraction by the YOLOv11 backbone network needs further strengthening. The EfficientNetv2 model integrates a series of new network architecture designs and training strategies that improve both detection performance and training efficiency. In addition, the ECNet network is constructed on top of EfficientNetv2 to further improve the model's ability to learn the characteristics of multiple types of target ships in complex environments.
3.1.1. EfficientNetv2
EfficientNetv2 [
12] is a deep learning model released by Mingxing Tan and Quoc V. Le in 2021 and is the latest version of the EfficientNet series. While maintaining model performance, it trains faster, runs more efficiently, and has a smaller model size. EfficientNetv2 is a significant advance in deep learning with important contributions to real-world applications. The network performance comparison is shown in
Figure 11.
EfficientNetv2 uses a new search method, training-aware NAS, which accounts for the model's training speed during the search rather than evaluating performance only at the end, as other networks do, thereby balancing detection accuracy, training speed, and parameter count to achieve higher accuracy and faster training. The network also tightens the search space by removing unnecessary search items, such as pooling skips, and by focusing on channel sizes.
EfficientNetv2 introduces the Fused-MBConv structure in the shallow layers. Optimized from MBConv, it fuses the expansion convolution and the depthwise convolution, replacing them with a single 3 × 3 standard convolutional layer; this simplifies the network architecture and reduces computing cost, especially in the training phase, which matters for deep learning applications in resource-constrained scenarios. The module structure is shown in
Figure 12.
To better exploit the performance of Fused-MBConv, EfficientNetv2 optimizes the scaling strategy based on the idea of compound scaling, allowing the model to better balance speed and performance. EfficientNetv2 also employs progressive learning, a strategy that accelerates convergence, avoids premature saturation, and improves model performance.
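As a concrete illustration, the Fused-MBConv block described above can be sketched in PyTorch. This is a minimal sketch under stated assumptions (the SE sub-module and stochastic depth of the full EfficientNetv2 block are omitted, and the expansion ratio is a tunable assumption), not the exact implementation used in this paper:

```python
import torch
import torch.nn as nn

class FusedMBConv(nn.Module):
    """Sketch of EfficientNetv2's Fused-MBConv block: the 1x1 expansion
    conv and 3x3 depthwise conv of MBConv are fused into a single 3x3
    standard conv, followed by a 1x1 projection conv."""
    def __init__(self, in_ch, out_ch, expand_ratio=4, stride=1):
        super().__init__()
        mid = in_ch * expand_ratio
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.fused = nn.Sequential(
            nn.Conv2d(in_ch, mid, 3, stride, 1, bias=False),  # fused 3x3 conv
            nn.BatchNorm2d(mid),
            nn.SiLU(),
        )
        self.project = nn.Sequential(
            nn.Conv2d(mid, out_ch, 1, bias=False),  # 1x1 projection conv
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        y = self.project(self.fused(x))
        # shortcut only when spatial size and channel count are preserved
        return x + y if self.use_residual else y
```

Because the fused 3 × 3 convolution is a single dense operation, it maps better onto modern accelerators than the expand-then-depthwise pair, which is where the training-speed gain comes from.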
3.1.2. CA Mechanism
In this paper, the EfficientNetv2 network is used as the backbone; it loses some target feature information while generating feature maps at different resolutions. When the network processes feature maps of different receptive fields through convolution operations, it obtains effective target features but also produces interference information and accuracy loss. To improve detection accuracy under limited resources, more resources, namely weight in the neural network, must be allocated to the important target features. An attention mechanism enables the convolutional neural network to extract effective features and discard useless ones, greatly improving the efficiency and accuracy of image processing. Commonly used attention mechanisms include SE, CBAM, and ECA. When computing channel attention, these mechanisms mostly use average pooling or global max pooling, which loses part of the target's spatial information. For ship target detection in complex marine environments, spatial information is particularly important for accurately judging target position and size.
Coordinate attention (CA) [
13] addresses the limited ability of traditional mechanisms to process spatial information without increasing model complexity. It integrates position information into channel attention and computes attention along both the height and width dimensions, accurately capturing the relationships between channels and the spatial distribution of features and more effectively modeling the long-range dependencies on features important to the visual task, which improves the accuracy of feature representation. The mechanism can be flexibly inserted into typical mobile networks, such as MobileNetv2, MobileNet, and EfficientNet, with essentially no additional computational overhead.
The performance of different attention mechanisms is compared in the same backbone network (MobileNetV2), as shown in
Figure 13.
The structure of the CA mechanism is shown in
Figure 14.
The input feature map has dimensions C × H × W, where C is the number of channels and H and W are the height and width. The mechanism applies spatial pooling to encode direction-aware information along the height and width, followed by feature fusion and transformation, batch normalization and nonlinear activation, splitting and 2D convolution, and finally sigmoid activation to recalibrate the weights of the original features; it then outputs a C × H × W feature map that contains both channel information and spatial location information.
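The sequence of steps above can be sketched in PyTorch roughly as follows. This is a minimal sketch based on the published CA design; the reduction ratio `r` and the Hardswish activation are assumptions, not values taken from this paper:

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Sketch of coordinate attention: pool along each spatial axis,
    fuse and transform, then split into height- and width-wise weights."""
    def __init__(self, channels, r=16):
        super().__init__()
        mid = max(8, channels // r)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool over width  -> (B,C,H,1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool over height -> (B,C,1,W)
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        xh = self.pool_h(x)                        # (B,C,H,1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)    # (B,C,W,1)
        # feature fusion + transform + BN + nonlinear activation
        y = self.act(self.bn(self.conv1(torch.cat([xh, xw], dim=2))))
        yh, yw = torch.split(y, [h, w], dim=2)     # split back into two axes
        ah = torch.sigmoid(self.conv_h(yh))                      # (B,C,H,1)
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))  # (B,C,1,W)
        return x * ah * aw  # recalibrate with positional channel weights
```

The two 1D pooled vectors preserve position along one axis each, which is why the resulting weights encode spatial location as well as channel importance.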
3.2. Improving the Neck Network (ConvNeXt Block Idea)
The marine environment is complex: widely distributed islands and reefs, marine structures, and smoke screens and clouds all partially occlude ship targets. In densely packed areas, such as ports and terminals, mutual occlusion between ships further complicates detection. The YOLOv11 network is susceptible to occlusion interference when extracting ship image features, causing the model to focus only on local pixel positions. Effectively exploiting context to capture target information would require stacking convolutional layers several times, but directly stacking these layers repeatedly leads to computational inefficiency and makes the model difficult to optimize. In this paper, a new CCB (C3k2 ConvNeXt Block) module is constructed using spanning information connections to avoid this problem. The CCB module consists of the C3k2 module and the ConvNeXt Block module, which together significantly improve the model's ability to capture contextual information.
ConvNeXt [
14] is a computer vision model released by Meta AI researchers in 2022. It explores the potential of CNNs (convolutional neural networks) for image recognition, particularly in comparison with mainstream models such as the Vision Transformer (ViT). The ConvNeXt structure draws on design concepts from the Transformer and can outperform the Transformer while preserving model efficiency.
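For reference, the basic ConvNeXt block borrowed by the CCB module can be sketched in PyTorch as follows. This is a minimal sketch of the published design, not this paper's full CCB module; the layer-scale initial value is an assumption:

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Sketch of a ConvNeXt block: 7x7 depthwise conv -> LayerNorm ->
    pointwise expansion (4x) -> GELU -> pointwise projection, with a
    layer-scale parameter and a residual shortcut."""
    def __init__(self, dim, ls_init=1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)      # applied channels-last
        self.pw1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()
        self.pw2 = nn.Linear(4 * dim, dim)
        self.gamma = nn.Parameter(ls_init * torch.ones(dim))

    def forward(self, x):
        y = self.dwconv(x).permute(0, 2, 3, 1)   # (B,H,W,C) channels-last
        y = self.pw2(self.act(self.pw1(self.norm(y))))
        y = (self.gamma * y).permute(0, 3, 1, 2)
        return x + y                             # residual connection
```

The large 7 × 7 depthwise kernel gives each position a wide receptive field cheaply, and the residual shortcut is the "spanning information connection" that lets context propagate without naively stacking convolutional layers.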
3.3. WIoU Loss Function
YOLOv11 uses the EIoU loss function, which replaces the aspect ratio with direct regression on the width and height values. This has several drawbacks. First, the real box and the predicted box may have the same overlap area but different widths and heights, leading to different loss values. Second, it does not effectively balance hard and easy samples, so optimization may overemphasize simple samples and neglect the learning of hard ones. Third, it does not consider the angle of the detection box: if the angle differs from the real box, localization accuracy still suffers even when the overlap area is the same.
WIOU (Wise-IoU) [
15] is a new loss function proposed to improve the bounding-box regression (BBR) loss. It evaluates anchor-box quality using a dynamic non-monotonic focusing mechanism (FM) with gradient gain, which improves the overall performance of the ship detection model by reducing the effect of harmful gradients while preserving the contribution of high-quality anchor boxes.
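A minimal PyTorch sketch of the WIoU v1 formulation is given below for illustration. Boxes are assumed to be in (x1, y1, x2, y2) form, and the dynamic non-monotonic focusing mechanism of the full WIoU v3 is omitted; this is not the exact implementation used in the experiments:

```python
import torch

def wiou_v1_loss(pred, target, eps=1e-7):
    """WIoU v1 sketch: the plain IoU loss (1 - IoU) is re-weighted by a
    distance-based term R_WIoU computed over the smallest enclosing box;
    the denominator is detached so it contributes no gradient itself."""
    # intersection area
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)
    # smallest enclosing box width/height
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    # distance between box centers
    dx = (pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) / 2
    dy = (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) / 2
    r_wiou = torch.exp((dx ** 2 + dy ** 2) / (cw ** 2 + ch ** 2 + eps).detach())
    return r_wiou * (1 - iou)
```

For small targets the enclosing-box term normalizes the center distance, so a few pixels of misalignment are penalized in proportion to target size rather than in absolute pixels.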
In
Figure 16, the overlapping area is as follows:
The improved overall structure diagram of YOLOv11 is shown in
Figure 17.
The algorithm pseudocode is shown in Algorithm 1.
Algorithm 1. Improved YOLOv11 model algorithm.
begin
    // Define the input image size and number of categories
    input_size = (640, 640)
    num_classes = 6
    // Define the sizes of the anchor boxes
    anchors = []
    num_anchors = len(anchors)
    function YOLOv11_improve_detector(input):
        // backbone
        x = Conv(input, 64, 3, 2)
        x = Conv(x, 128, 3, 2)
        x = MBConvBlock_CA(x, 256, 3, 1, 6)
        x = Conv(x, 256, 3, 2)
        x = MBConvBlock_CA(x, 512, 3, 1, 6)
        out1 = x
        x = Conv(x, 512, 3, 2)
        x = C3k2(x, 512, 3, 2)
        out2 = x
        x = Conv(x, 1024, 3, 2)
        x = C3k2(x, 1024, 3, 2)
        x = SPPF(x, 1024, 5)
        x = C2PSA(x, 1024)
        out3 = x
        // neck/head
        x = Upsample(None, 2)
        x = Concat(1)
        x = CCB(512)
        x = PANetFPN(x)
        // output
        output1 = Conv(out1, num_anchors × (num_classes + 5), 1)
        output2 = Conv(out2, num_anchors × (num_classes + 5), 1)
        output3 = Conv(out3, num_anchors × (num_classes + 5), 1)
        return output1, output2, output3
    // Traverse each frame; T is the mp4 file timestamp array
    for t in T do:
        img = frame at timestamp t
        if Judgment_score(YOLOv11_improve_detector(img)) > 0.25 then draw box
    end for
end
4. Experimental Results and Comparative Analysis
4.1. Dataset Construction
To better match the actual marine environment and the viewing angle of image acquisition equipment, and to verify the dynamic detection performance of the improved YOLOv11 algorithm on ship targets against complex island backgrounds, lateral-view visible-light ship images containing multi-angle and multi-scale information were collected through existing dataset selection and web search. The self-built visible-light ship image dataset contains many types of ships and includes occlusion-overlap scenes, complex nearshore island and reef backgrounds, and small-target images, used to train and verify the network's detection performance in the marine environment.
In the dataset, ship types were divided into engineering ships, cargo ships, passenger ships, sailboats, and fishing boats, with a total of 6846 images. Sample images from the dataset are shown in
Figure 18, and the structure is shown in
Table 2.
The dataset was constructed in YOLOv11 format; all images are in .jpg format and are named sequentially from 0000001. The targets in each picture were marked with an annotation tool, using the rectangular box that best fits the outer contour of the ship. Targets with incomplete contours caused by occlusion, overlap, or other reasons were also annotated in detail, ensuring that the network fully learns and trains on targets of different types, shapes, and sizes.
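For illustration, converting a pixel-space annotation box into the normalized label-line format used by YOLO-style datasets can be done as follows. `to_yolo_label` is a hypothetical helper, not part of the paper's annotation toolchain:

```python
def to_yolo_label(cls_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel bounding box (x1, y1, x2, y2) into a YOLO label
    line: class index followed by normalized box center and size."""
    cx = (x1 + x2) / 2 / img_w   # normalized center x
    cy = (y1 + y2) / 2 / img_h   # normalized center y
    w = (x2 - x1) / img_w        # normalized width
    h = (y2 - y1) / img_h        # normalized height
    return f"{cls_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```

Because all coordinates are normalized by image size, the same label remains valid when images are rescaled to the 640 × 640 network input.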
4.2. Experimental Environment
This experiment was completed on the Ubuntu 18.04 operating system using the PyTorch 1.10 deep learning framework in a CUDA 11.1 + cuDNN 8.0.4 + OpenCV 4.5.8 environment, implemented in the Python language. Hardware environment: Intel(R) Xeon(R) Platinum i9-13900K CPU and Nvidia GeForce RTX 4090 GPU. See
Table 3 for details.
4.3. Setting the Experimental Parameters
The input image size in this experiment was 640 × 640, the initial learning rate was 0.01, the learning rate was updated using stochastic gradient descent with a momentum of 0.937, and the weight decay was 0.0005. During training, Mosaic data augmentation read four pictures at a time and then flipped and zoomed them to enrich the detection backgrounds. Label smoothing was set to 0.01 to prevent model overfitting and increase generalization. All models were trained for 500 epochs, with the batch size set to 32 and the number of worker threads to 16.
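The Mosaic step described above can be illustrated with a minimal NumPy sketch that stitches four images into one canvas. `mosaic4` is a hypothetical helper, and the random flip/zoom jitter of full Mosaic augmentation is omitted:

```python
import numpy as np

def mosaic4(imgs, out_size=640):
    """Stitch four HxWx3 uint8 images into a 2x2 mosaic canvas.
    Each image is resized to a quadrant by nearest-neighbour sampling."""
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]  # (row, col)
    for img, (y, x) in zip(imgs, corners):
        h, w = img.shape[:2]
        ys = np.arange(half) * h // half   # row indices for resize
        xs = np.arange(half) * w // half   # column indices for resize
        canvas[y:y + half, x:x + half] = img[ys][:, xs]
    return canvas
```

Combining four images per sample effectively quadruples the background variety seen per batch, which is the stated motivation for using Mosaic here.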
4.4. Evaluation Indicators
(1) Precision
Precision is the accuracy rate, that is, the proportion of samples predicted as a given class that truly belong to that class.
(2) Recall
Recall, also known as the detection rate, is the proportion of ground-truth targets that are correctly detected as positive, out of the total number of ground-truth targets.
(3) AP (Average precision)
AP is the area under the precision–recall (PR) curve.
(4) mAP (Mean Average Precision)
The mAP is the average of the AP values across all target classes.
(5) Identification speed (FPS)
The FPS is the number of images identified per second, in units of frames/s.
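Under these definitions, the metrics can be sketched as simple helper functions. These are illustrative only (the paper's reported results come from the standard YOLO evaluation tooling), and the trapezoidal AP approximation is an assumption:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives, and
    false negatives, matching the definitions above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(precisions, recalls):
    """AP as the area under the PR curve, approximated by the trapezoidal
    rule over (recall, precision) points sorted by recall. mAP is then
    the mean of the per-class APs."""
    pts = sorted(zip(recalls, precisions))
    ap = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        ap += (r1 - r0) * (p0 + p1) / 2
    return ap
```

For example, 8 correct detections with 2 false alarms and 2 missed targets give precision and recall of 0.8 each.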
4.5. Analysis of the Experimental Results
4.5.1. Analysis of the Results of Improved Network Experiments
The improved dynamic ship target detection network was trained on the self-built visible-light ship target dataset, with a training/test split of 8:2. Mosaic data augmentation was used during training, and after 500 epochs, the network gradually converged and stabilized. The network loss curves are shown in
Figure 19.
As the curves show, the loss value is large at the beginning of training, then decreases sharply, and then levels off. The box_loss and dfl_loss curves for the test set gradually flatten after about 50 epochs, and the cls_loss curve after 75 epochs. In general, the test-set loss curves level off earlier than the training-set curves, while fluctuating slightly, but not significantly, more than the training set. This is because the test-set pictures do not appear in the training set and may differ somewhat, which is normal. After 500 epochs of training, the loss curves decline and flatten, and the loss values are small. The decline shows no sharp fluctuations and no overfitting or underfitting; the model's depth, size, and dataset are reasonable, training proceeds well, and the algorithm converges properly.
The PR curve of the network is shown in
Figure 20.
The PR curve plots precision against recall. Higher precision means that most of the samples the model predicts as positive are truly positive; higher recall means the model correctly detects more of the real positive samples. The PR curves of the five ship target classes in the dataset are shown in
Figure 20, with the IoU set to 0.5. The squarer the curve, the more accurate and comprehensive the detection results. The engineering ship and passenger ship classes perform best: their detection accuracy reaches 0.999, and their curves are closer to a square than those of the other classes. The cargo ship's detection accuracy is 0.798, the lowest of the five classes. The mAP of the network reaches 0.897. The blue PR curve shows the algorithm's high average detection accuracy: although precision decreases as recall increases, the model maintains high precision even at high recall. The detection accuracy of the different target types is shown in
Table 4.
As can be seen from the per-class statistics, the network performs excellently on multiple types of ship targets. The average detection accuracy reaches 89.7%, and the detection time for a single picture is about 1.5 ms, which meets real-time requirements. Detection of engineering ships and passenger ships reaches 0.999, mainly because these two target types occupy a large pixel proportion in the image and have obvious image characteristics. The detection accuracy of cargo ships is slightly lower than the other categories, possibly because the cargo ship targets collected in the dataset vary widely in their characteristics. The sailboat class reaches 0.811, reflecting the network's good performance on small-target detection.
The two plots in
Figure 21 show the mean average precision at an IoU threshold of 50% and averaged over IoU thresholds from 50% to 95%. The latter imposes stricter requirements, so its mAP is slightly lower than in the left figure; the mAP50 index is relatively loose. The final average precision stabilizes at about 0.92, showing the model's overall performance under the looser condition, and reaches about 0.8 under the 50–95% condition, indicating that the model can detect and localize targets accurately even under strict IoU conditions.
The actual detection effect of the network is shown in
Figure 22.
In
Figure 22, the actual detection results are shown. In complex environments, the network effectively overcomes interference from the nearshore island background, accurately distinguishes multiple types of ship targets, and performs well on small targets, accurately detecting small-target information in the image. The method also accurately detects ship type and location in scenes with overlapping occlusion.
4.5.2. Comparison Experiments
For comparative testing, typical network algorithms were used to assess the advantages and effectiveness of the proposed network, considering the practical requirements on detection accuracy, speed, and hardware capacity. In the same experimental environment and with the same parameter settings, the YOLOv8n, YOLOv10n, and YOLOv11n networks were trained for 500 epochs, and the network performance was tested; the basic network performance statistics are shown in
Table 5.
Compared with the other network models, the improved algorithm proposed in this paper achieves a higher precision of 93.9% and an average detection accuracy of 89.7%, and its recall of 77.1% is also significantly improved over YOLOv11n. The model has fewer parameters than YOLOv8n and YOLOv10n and higher detection accuracy than YOLOv11n with only a small increase in parameters; the optimized algorithm detects a single image in 1.5 ms, which is faster than Mask R-CNN, CenterNet, and YOLOv10n. Because the improvements deepen the network to strengthen its feature extraction ability, more parameter computation is required, slightly reducing inference speed: the detection speed differs only slightly from YOLOv8n and YOLOv11n, while the detection accuracy is significantly improved, and the detection time still meets real-time requirements.
In summary, the improved YOLOv11 target detection network is optimized in parameter count and processing time, reflecting the speed advantage of a lightweight model, and achieves high-accuracy detection of various ship targets, including occluded and overlapping ships in complex environments and small targets. The model meets the need for accurate target interpretation in the complex marine environment and provides feasible technical support for real-time ship monitoring at sea.
4.5.3. Ablation Experiments
Ablation experiments were conducted on the three proposed improvements to YOLOv11 to verify the function of each module in the network. Considering the computing performance and platform limitations of practical image-processing equipment, the module-improved networks were comprehensively compared in terms of detection accuracy, parameter count, and detection time, with the experimental environment and parameters unchanged and the self-built visible-light ship target dataset as the experimental dataset. The ablation results are shown in
Table 6, where '×' means the module is not used in the network and '√' means it is used.
From
Table 6, it can be seen that, under the same experimental environment, the detection accuracy of the improved YOLOv11 network is 89.7%, an improvement of 5.6% over the original YOLOv11's 84.1%, proving that the improvements efficiently raise the network's detection accuracy. Introducing the EfficientNetv2 module into the original YOLOv11 network optimizes the training mode, strengthens the training and learning process, better balances recognition performance against model efficiency, and improves detection accuracy by 1.8% while adding only a small number of parameters. Introducing the CA mechanism raises detection accuracy by a further 0.7%. By aggregating the input features along the horizontal and vertical directions, the CA mechanism accurately captures the relationships between channels and the spatial distribution of features, effectively improving the accuracy of feature expression; for ship target detection in the complex marine environment, it extracts the spatial information of the ship target more accurately, improving the interpretation of target position and size. Moreover, the CA mechanism assigns specific weights to the feature values of different channels, enhancing attention to useful features and suppressing useless ones, which reduces wasted resources, lowers computational complexity, and reduces the number of network parameters.
The ConvNeXt Block module uses spanning information connections to simplify the network model and reduce the model's overall parameter count, while significantly improving the model's ability to capture contextual information and to handle occluded and interfered targets; detection accuracy improves by another 0.7%. The WIoU loss function uses a dynamic non-monotonic focusing mechanism that preserves the effect of high-quality anchor boxes while reducing harmful gradients, improving the overall performance of the ship detection model and raising detection accuracy by 2.4% without adding parameters. The per-image detection time is 1.5 ms, higher than that of the original network, mainly because introducing the EfficientNetv2 module and the CA mechanism increases the complexity of the algorithm model; the detection speed improves again after the ConvNeXt Block module is introduced (
Figure 23).
The first to the fifth picture correspond to the detection results of the improved network in
Table 6, respectively. The comparison of detection result images shows that, after network optimization, target detection accuracy is significantly improved, target categories are classified more accurately, and small-target detection is better: more small ship targets in the picture are detected, with a lower missed-detection rate. The improved YOLOv11 network detects multiple types of ship targets better in complex environments, is more suitable for complex marine environments, and accurately identifies ships and small targets under occlusion interference. Its detection speed satisfies real-time requirements, and its parameter count is in line with the characteristics of a lightweight network, satisfying application scenarios with limited computational resources, such as unmanned aerial platforms.
In conclusion, comparing the YOLOv11 network with the improved and optimized network, the detection speed meets real-time requirements, and the overall detection accuracy improves by 5.9% with only a small number of added parameters, making the network more adaptable to detecting multiple types of ship targets in complex scenarios. It meets the demand for accurate and efficient target detection in real-time scenarios and provides feasible technical and theoretical support for marine monitoring, ship monitoring, and other tasks.