Article

EGM-YOLOv8: A Lightweight Ship Detection Model with Efficient Feature Fusion and Attention Mechanisms

Navigation College, Dalian Maritime University, Dalian 116026, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(4), 757; https://doi.org/10.3390/jmse13040757
Submission received: 10 March 2025 / Revised: 31 March 2025 / Accepted: 8 April 2025 / Published: 10 April 2025

Abstract

Accurate and real-time ship detection is crucial for intelligent waterborne transportation systems. However, detecting ships across various scales remains challenging due to category diversity, shape similarity, and complex environmental interference. In this work, we propose EGM-YOLOv8, a lightweight and enhanced model for real-time ship detection. We integrate the Efficient Channel Attention (ECA) module to improve feature extraction and employ a lightweight Generalized Efficient Layer Aggregation Network (GELAN) combined with Path Aggregation Network (PANet) for efficient multi-scale feature fusion. Additionally, we introduce MPDIoU, a minimum-distance-based loss function, to enhance localization accuracy. Compared to YOLOv8, EGM-YOLOv8 reduces the number of parameters by 13.57%, reduces the computational complexity by 11.05%, and improves the recall rate by 1.13%, demonstrating its effectiveness in maritime environments. The model is well-suited for deployment on resource-constrained devices, balancing precision and efficiency for real-time applications.

1. Introduction

Maritime navigation safety is one of the most critical concerns for maritime authorities. Proactive maritime traffic monitoring is not only a significant feature of autonomous transportation models but also an indispensable functional component of Intelligent Waterborne Transportation Systems (IWTS) [1]. To guarantee the security and effectiveness of maritime transportation, accurate identification of ship targets is essential for numerous maritime surveillance and situational awareness tasks. In congested and complex urban inland waterways or ports, the variety of ship types and the substantial congestion in maritime routes, coupled with the accelerated progress of autonomous ships, have exacerbated numerous safety issues. Events like vessel collisions and other marine mishaps occur frequently [2,3]. Safe navigation at sea relies on effective environmental perception and decision-making. Current methods such as the Automatic Identification System (AIS) and radar have their limitations [4]. Visual perception serves as the “eyes” of a ship, capable of identifying various obstacles and navigational aids in the environment. Ship detection is a crucial component in enhancing waterborne traffic security. Early, accurate, and rapid monitoring of ships near coastal areas can notably diminish the likelihood of maritime incidents, thereby improving navigation safety and the efficiency of port management [5]. However, owing to the variety of vessel sizes and categories, along with the intricate and dynamic marine environment, existing detection methods still struggle to balance model accuracy, speed, and parameter efficiency. Moreover, when ships of varying scales appear in the same scene or when ships are occluded by one another, current models predominantly prioritize larger vessel targets while neglecting smaller ones, often culminating in missed detections and false alarms [6]. Therefore, researching a more lightweight, efficient, and accurate ship detection method holds significant importance.
Ship target detection, as an important branch of the object detection field, has seen many researchers develop advanced algorithms for real-time monitoring. Zwemer et al. [7] utilized the SSD detector to identify ship targets. Hu et al. [8] improved upon this approach by replacing the backbone network with a more efficient one and introducing the CBAM module to upgrade the feature extraction procedure, consequently improving the accuracy of ship recognition. Shao et al. [9] proposed a ship perception framework based on YOLOv2, which first performs object recognition using a CNN and then applies saliency detection for ship monitoring in complex environments. Li et al. [10] leveraged DenseNet and YOLOv3 to detect unmanned surface vehicles (USV) in maritime scenarios, increasing the reliability of small target identification. Zhou et al. [11] optimized YOLOv5 by replacing standard convolution operations with depth-wise convolutions and introducing the Coordinate Attention (CA) mechanism to concentrate on critical features, consequently boosting the model’s detection capability. Chen et al. [12] enhanced YOLOv7 by combining Spatial Pyramid Pooling (SPP) with the Shuffle Attention (SA) mechanism to form a new neck structure, improving the model’s capability to perceive multi-scale ships. Nevertheless, these techniques still struggle to recognize a high volume of ships with irregular aspect ratios in complex marine environments. Additionally, the large number of model parameters results in high computational resource requirements, making deployment difficult, with persistent problems such as erroneous detections, overlooked targets, and limited accuracy.
In intricate maritime conditions, ships exhibit significant variations in categories and aspect ratios. Due to the limitations of imaging perspectives in near-shore areas, ships are often occluded by one another, leading to suboptimal detection performance. Accurate identification requires the acquisition of multi-scale discriminative details from ships. However, existing detection methods still struggle to establish a trade-off between detection reliability and algorithmic complexity, making it difficult to fulfill the operational requirements of practical scenarios. As a result, this work proposes a lightweight and efficient model named EGM-YOLOv8 for precise detection of visible light vessel images, which is an enhanced version of the YOLOv8l algorithm. Firstly, we embed the Efficient Channel Attention (ECA) module within the backbone network used for extracting comprehensive detail features, enhancing the feature extraction capability. Secondly, we combine the lightweight Generalized Efficient Layer Aggregation Network (GELAN) with Path Aggregation Network (PANet), constructing a more lightweight and efficient neck network, which can better integrate ship information at different levels. Additionally, we refine the loss function by introducing an accurate minimum distance loss function based on ship geometric features, MPDIoU. This allows for optimizing the gradient gain allocation strategy, improving the efficiency of the model during training. We performed comprehensive experiments using the publicly available Seaships and Mcships datasets, evaluating EGM-YOLOv8 against CNN-based detectors and other ship detection methods in the maritime domain to investigate the effectiveness of EGM-YOLOv8 and its individual components. The results highlight the superior performance of EGM-YOLOv8 in detecting inshore vessels.
The major contributions of this work are summarized as follows.
  • We propose a ship detection method termed EGM-YOLOv8, which integrates the efficient ECA module into the backbone network of YOLOv8, enhancing the network’s capability to extract vessel features.
  • We combine the lightweight GELAN and PANet to construct a more lightweight and efficient neck network, which better integrates vessel information from different levels, addressing the trade-off between model precision, speed, and parameters.
  • We use MPDIoU instead of the original bounding box loss function, thereby enhancing the convergence speed of the detection model and improving regression accuracy in predicting results.
  • Comprehensive experiments are performed on the publicly accessible Seaships and Mcships datasets to validate the contribution of the individual improvement modules and of different attention mechanisms to ship detection performance. Additionally, the reliability of inshore vessel detection is demonstrated through comparisons with domain-specific and general CNN-based detectors.
The structure of the remaining sections of this paper is outlined as follows. Section 2 provides an overview of related work. Section 3 details the design of the proposed model. Section 4 presents the experimental setup and results. Finally, Section 5 discusses the conclusions and implications of this study.

2. Related Work

2.1. Object Detection

Object detection is a critical component of visual computing systems. Traditional ship detection algorithms typically involve three main steps: region proposal generation; manual feature extraction based on ship dimensions and form; and the classification and regression of proposals [13]. In region proposal generation, an exhaustive search is commonly used to traverse the entire image, which leads to long processing times and high space complexity. Manual feature extraction based on ship scale and shape often suffers from low accuracy and weak generalization. With the progression of artificial intelligence and deep learning technologies, utilizing image recognition and neural network techniques for ship detection has become a significant research direction. Approaches leveraging Convolutional Neural Networks (CNNs) have swiftly attained notable success in object detection and are broadly divided into two categories: two-stage and one-stage methods. Two-stage detection frameworks initially produce region proposals to pinpoint all potential areas that might include targets. These regions are subsequently fed into a classifier and regressor for object classification and location prediction. Representative methods include R-CNN [14], Fast R-CNN [15], and Faster R-CNN [16]. In two-stage detectors, backbone networks such as VGG [17] and ResNet [18] are first used to extract input features, followed by the generation of numerous candidate regions using the Region Proposal Network (RPN) [19], which are then utilized for target classification and regression prediction. Two-stage detectors were initially preferred due to their higher detection accuracy. However, the generation of candidate regions is time-consuming and computationally intensive, often leading to missed detections of small targets [20]. Conversely, one-stage object detection methods directly localize and classify objects using a backbone feature extraction network. Typical frameworks include YOLO [21], SSD [22], and RetinaNet [23]. Even though these methods offer faster detection speeds, they also exhibit certain shortcomings in terms of accuracy.
The YOLO series of models has been rapidly evolving due to their exceptional detection speed. In 2023, Ultralytics released YOLOv8 [24], which shows considerable improvements in both accuracy and speed of detection tasks. Zhu et al. [25] proposed a transformer prediction head to replace the original detection head in YOLOv5 for accurate target recognition in scenarios with high traffic density. Zhao et al. [26] made improvements based on this and designed a cross-layer asymmetric Transformer (CA-Trans) to effectively reduce the number of model parameters. Wang et al. [27] proposed a drone target detection model based on YOLOv8. This model replaced the original convolution operations with RepVGG re-parameterization modules, enabling the model to capture rich features during the training phase. They also introduced the ParNet to boost the model’s perception of features at multiple spatial locations. The combination of the ParNet attention mechanism and RepVGG re-parameterization modules improved both detection accuracy and inference speed. Chen et al. [28] introduced the Mixed Local Channel Attention (MLCA) mechanism into the C2f module, constructing the mAtt-C2f module as a replacement for the original C2f. They incorporated the Large Separable Kernel Attention (LSKA) mechanism into the SPPF module, enabling the network to automatically ignore irrelevant background information and mitigate feature redundancy. Additionally, they enhanced the detection head by employing the Dynamic Detection Head (DyHead), allowing the network to capture feature relationships across different scales and shapes, thereby improving the expressiveness of the detection head. Huang et al. [29] proposed an enhanced SAR ship detection model based on YOLOv8, replacing the C2f module with an expanded residual module, which improves the model’s ability to recognize multi-scale targets and enrich feature representations. Chen et al. [30] proposed an enhanced adaptive ship detection network based on an improved YOLOv8 for foggy sea environments. Li et al. [31] proposed a ship detection network based on a multi-input attention feature fusion module and edge feature enhancement module.
In the context of near-shore ship detection research, Liu et al. [32] proposed an Anchor-guided Attention Refinement Network (AARN), which incorporates an Attention Feature Filtering Module (AFFM) and an Anchor-guided Alignment Detection Module (AADM) to adapt to the diversity of near-shore ship poses. The AFFM leverages high-level semantic features to construct a feature pyramid network, overcoming background interference from surrounding structures through multi-level feature extraction. The AADM utilizes anchor-correlated features to reliably detect possible near-shore ships, mitigating the misalignment between refined anchors and pyramid features. Zhou et al. [33] replaced the neck network in YOLOv5 with the BiFPN, achieving effective feature integration by adding feature fusion paths. They also integrated the CA module to facilitate the model in more precisely localizing and identifying target regions. Additionally, a transformer module was introduced during the feature fusion stage to extract long-range feature dependencies, maximizing the retention of both global and local information across multi-scale feature layers. Liu et al. [34] combined incremental learning techniques with YOLOv5, employing an improved BiFPN and CA module to enhance the network’s feature combination capabilities and improve the accuracy of discriminative feature extraction. Although current ship detection methods have made significant progress, challenges remain when ships of varying sizes and scales are present, which can lead to missed detections or false alarms. From theoretical perspectives such as feature extraction and feature fusion, we have improved the YOLOv8 algorithm to adapt it for ship detection tasks, thereby enhancing the accuracy of near-shore ship detection.

2.2. The YOLOv8 Model

As a relatively stable model in the YOLO series, YOLOv8 demonstrates significant performance improvements. The backbone network of YOLOv8 is comparable to YOLOv5 but incorporates modifications in the CSP layers, drawing inspiration from YOLOv7, and utilizes advanced loss functions and detection head networks. YOLOv8 consists of three main components: the backbone network, the neck, and the detection head, as illustrated in Figure 1. YOLOv8 constructs a more lightweight C2f module to replace the C3 module in YOLOv5. This module is an improvement based on CSP and adopts the design principles of the Efficient Layer Aggregation Network (ELAN) from YOLOv7 [35]. It replaces three consecutive convolutions with two convolutions, and enables the fusion of higher-layer features with contextual information while reducing the number of convolutions, thereby maintaining network efficiency. The SPPF module is an improvement based on spatial pyramid pooling. It performs pooling operations on feature maps of different scales, boosting the model’s adaptability and robustness to targets of varying sizes. The backbone structure derives multi-scale feature layers from the input data through multiple convolutional operations, providing a foundation for subsequent feature fusion stages. The neck structure integrates FPN [36] and PANet [37]. While YOLOv5 employs PANet as its neck network, achieving feature merging using top-down and bottom-up connections, YOLOv8 improves upon this by reducing the number of upsampling and downsampling operations, thereby lowering computational complexity. By integrating the C2f module, YOLOv8 enhances cross-stage connections, enabling the enhanced integration of feature maps across multiple scales and further improving feature integration effectiveness. The detection head is a critical part of the object detection network, which generates the final detection results of the image, including class labels, bounding boxes, and confidence scores. It adopts the decoupled structure, using a classification branch to predict class probabilities for each anchor box and a regression branch to predict bounding box coordinates and confidence scores. For the classification branch, binary cross-entropy (BCE) loss is employed, derived from the actual and predicted class labels of each prior box, addressing class imbalance issues. For the regression branch, a synthesis of Distribution Focal Loss (DFL) [38] and Complete Intersection over Union (CIoU) [39] is used. This enhances the model’s perceptiveness to hard-to-classify instances, directing more attention to challenging predictions during training while reducing focus on easy-to-predict targets, thereby improving detection accuracy.
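To make the C2f structure described above more concrete, the following minimal PyTorch sketch shows a C2f-style block: one input convolution whose output is split, a chain of bottleneck sub-blocks whose intermediate outputs are all retained, and a concatenation followed by an output convolution. It is a simplified illustration consistent with this description, not the Ultralytics source code; layer names and channel choices are ours.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Conv + BatchNorm + SiLU, the basic CBS block used throughout YOLOv8."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 convolutions with an optional residual connection."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, 3)
        self.cv2 = ConvBNSiLU(c, c, 3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f(nn.Module):
    """C2f-style block: split the input, keep every bottleneck output,
    then concatenate all branches and fuse with a 1x1 convolution."""
    def __init__(self, c_in, c_out, n=2, shortcut=True):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, 1)
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, 1)
        self.blocks = nn.ModuleList(Bottleneck(self.c, shortcut) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # split into two halves
        for block in self.blocks:
            y.append(block(y[-1]))               # retain every intermediate output
        return self.cv2(torch.cat(y, dim=1))

if __name__ == "__main__":
    print(C2f(64, 128)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 128, 80, 80])
```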

3. Proposed Method

The YOLOv8 project categorizes the network into five different sizes (n, s, m, l, x) depending on various combinations of depth, width, and maximum channels. In this paper, we chose YOLOv8l as the baseline. Building upon YOLOv8l, we propose an enhanced and lightweight ship detection framework termed EGM-YOLOv8, aiming to improve the positioning and recognition performance of ships under various backgrounds. Firstly, we introduce the ultra-lightweight ECA mechanism into the backbone network to strengthen the network's ability to differentiate between targets and backgrounds, significantly improving ship recognition accuracy and model generalization ability while incurring negligible computational overhead. Secondly, we combine GELAN and PANet to construct a more lightweight and efficient neck network, capturing multi-level ship features and enhancing feature diversity in the network fusion process. This effectively reduces false and missed detections in ship recognition. Lastly, we incorporate the MPDIoU loss function to optimize the gradient gain allocation strategy, enhancing the adaptability of the detector to multi-scale target variations, improving the generalization ability of the model, accelerating convergence, and yielding more precise regression outcomes. The structure of EGM-YOLOv8 is displayed in Figure 2.

3.1. Improvement of the Backbone Module

The attention mechanism assigns varying weights to different features, enabling the model to concentrate on important information and thereby attend to object-relevant cues. In computer vision, attention mechanisms are widely applied as they can adaptively adjust feature representations and dynamically allocate computational resources based on the importance of input data, optimizing information flow. This not only improves the adaptability of the model to complex scenes but also enhances its robustness to variations in scale, shape, and background. Channel attention mechanisms have exhibited considerable promise in boosting network performance. In the backbone network, we construct a reliable feature extraction module, C2f_ECA, whose primary architecture is depicted in Figure 2. The backbone network follows the design of YOLOv8 and incorporates the superior feature extraction performance of the ECA module; it consists of the CBS, C2f_ECA, and SPPF modules. An efficient channel attention module is integrated into the backbone so that the model attends to ship targets during feature extraction, enabling cross-channel information interaction. This suppresses unimportant background information and effectively identifies ship areas, strengthening the network's capability to extract ship characteristics with a negligible number of additional parameters.
ECA mainly comprises cross-channel interaction and a one-dimensional convolution operation. Cross-channel interaction represents a new feature combination method that enhances the expression of specific semantics. The architecture of the ECA module is shown in Figure 3. The input feature map $F_{in} \in \mathbb{R}^{C \times H \times W}$ undergoes global average pooling (GAP) followed by a cross-channel interaction operation to obtain the aggregate feature map $F_g$, where $C(\cdot)$ denotes the cross-channel interaction operation, as shown in Equation (1).

$$F_g = C\left(\mathrm{GAP}\left(F_{in}\right)\right). \tag{1}$$
For a given aggregate feature $F_g \in \mathbb{R}^{C}$, if dimensionality is not reduced, channel attention can be learned through the following Equation (2),

$$\omega = \sigma\left(W F_g\right), \tag{2}$$

where $W$ is either a diagonal matrix with $C$ parameters or a full matrix with $C \times C$ parameters, as shown in Equation (3) below.

$$W = \begin{bmatrix} w_{1,1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & w_{C,C} \end{bmatrix} \quad \mathrm{or} \quad \begin{bmatrix} w_{1,1} & \cdots & w_{1,C} \\ \vdots & \ddots & \vdots \\ w_{C,1} & \cdots & w_{C,C} \end{bmatrix}. \tag{3}$$
However, such full cross-channel interaction requires a large number of parameters, resulting in high model complexity. In ECA, the weight matrix $W_k$ is used to learn the relationships between channels: ECA obtains local cross-channel interaction by measuring the interaction between each channel's feature and its $k$ adjacent channels. $W_k$ is given by Equation (4) as follows.

$$W_k = \begin{bmatrix} w_{1,1} & \cdots & w_{1,k} & 0 & \cdots & 0 \\ 0 & w_{2,2} & \cdots & w_{2,k+1} & \cdots & 0 \\ \vdots & & \ddots & & \ddots & \vdots \\ 0 & \cdots & 0 & w_{C,C-k+1} & \cdots & w_{C,C} \end{bmatrix}. \tag{4}$$
For each aggregate feature $F_g^{i}$, only its interactions with its $k$ neighbors are considered, as shown in Equation (5),

$$\omega_i = \sigma\left(\sum_{j=1}^{k} w_i^{j} F_g^{i,j}\right), \quad F_g^{i,j} \in \Omega_i^{k}, \tag{5}$$

where $\sigma$ is the sigmoid function and $\Omega_i^{k}$ denotes the set of $k$ adjacent channels of $F_g^{i}$. A common strategy is to let all channels share the same set of weight parameters, which can be implemented efficiently with a 1D convolution of kernel size $k$ while avoiding dimensionality reduction, thereby achieving effective cross-channel interaction. The weight of feature $F_g^{i}$ can then be computed as shown in Equation (6).

$$\omega_i = \sigma\left(\sum_{j=1}^{k} w^{j} F_g^{i,j}\right), \quad F_g^{i,j} \in \Omega_i^{k}. \tag{6}$$
From the above equation, it can be observed that all channels share the same set of weight parameters, improving the performance of the model. The primary function of the ECA module is to extract appropriate cross-channel interactions between local features, enhancing the efficiency of feature extraction and fusion in the model. A key consideration is the scope of these interactions, which is determined by the kernel size $k$ of the convolution operation. In various convolutional structures, the interaction range can be optimized by adjusting the convolution blocks for different channels; however, this requires extensive cross-validation for fine-tuning, leading to high computational costs. The emergence of group convolution improves the original convolution structure and greatly reduces the number of computational parameters [40]: high-dimensional or low-dimensional channel features are convolved within a fixed number of groups. Following a similar principle, ECA determines the size of $k$ through an adaptive parameter selection method. Specifically, it is reasonable to assume that the coverage range of interactions is positively correlated with the channel dimension $C$, i.e., there may exist a mapping $\phi$ between $k$ and $C$, as shown in Equation (7).

$$C = \phi(k). \tag{7}$$
The simplest choice is a linear relationship, i.e., $\phi(k) = \gamma k - b$. Nevertheless, relationships described by linear functions are overly constrained. Moreover, it is widely recognized that the number of filters in a convolution operation is typically set to a power of 2. Thus, the linear relationship is extended to a nonlinear one as a potential solution, as shown in Equation (8).

$$C = \phi(k) = 2^{\gamma k - b}. \tag{8}$$
Meanwhile, given the channel dimension $C$, the kernel size $k$ can be determined adaptively, as shown in Equation (9),

$$k = \psi(C) = \left|\frac{\log_2 C}{\gamma} + \frac{b}{\gamma}\right|_{odd}, \tag{9}$$

where $\left|x\right|_{odd}$ denotes the nearest odd number to $x$. According to the experimental results of ECA [41], $\gamma$ and $b$ are set to 2 and 1, respectively. Finally, the output feature map $F_{out}$ of the ECA module is computed as delineated by Equation (10), where $\otimes$ denotes channel-wise multiplication.

$$F_{out} = F_{in} \otimes \omega, \quad F_{out} \in \mathbb{R}^{C \times H \times W}. \tag{10}$$
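To make this procedure concrete, the following minimal PyTorch sketch implements the ECA operation of Equations (1)–(10): global average pooling, a 1D convolution whose kernel size is chosen adaptively by Equation (9), a sigmoid, and channel-wise rescaling. It is a simplified illustration rather than the exact module used in EGM-YOLOv8.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: GAP -> 1D conv over channels -> sigmoid -> rescale.
    The kernel size k is chosen adaptively from the channel count C (Equation (9))."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1                      # nearest odd number
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                              # x: (B, C, H, W)
        y = self.gap(x)                                # (B, C, 1, 1), Equation (1)
        y = y.squeeze(-1).transpose(1, 2)              # (B, 1, C): channels as a sequence
        y = self.conv(y)                               # local cross-channel interaction, Equation (6)
        w = self.sigmoid(y).transpose(1, 2).unsqueeze(-1)  # (B, C, 1, 1)
        return x * w                                   # channel-wise rescaling, Equation (10)

if __name__ == "__main__":
    x = torch.randn(2, 256, 40, 40)
    print(ECA(256)(x).shape)                           # torch.Size([2, 256, 40, 40])
```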
EGM-YOLOv8 integrates the ECA into the C2f module, increasing the weight of ship-dense areas in ship images during feature extraction. This makes the network more inclined to capture ship-related features, providing more valuable feature representations for subsequent feature fusion stages. While negligibly increasing the parameter count, it augments the network’s ability to extract and fuse feature layers of different scales, thereby enhancing ship recognizability.

3.2. Improvement of the Neck Module

The feature pyramid network has been proven to effectively improve a detection model's ability to handle targets of different scales. FPN and PANet are utilized in the necks of YOLOv5, YOLOv7, and YOLOv8 for the integration of multi-scale features. YOLOv5 utilizes the C3 module as the main structure of the backbone to improve efficiency. YOLOv7, on the other hand, employs the ELAN block to replace the BottleNeck as the main gradient flow branch. The C2f module used in YOLOv8 is designed based on the ideas of both the C3 module and ELAN, allowing YOLOv8 to obtain richer gradient flow information while remaining lightweight. During the feature fusion stage, because the input image is progressively feature-extracted and spatially transformed layer by layer, a significant amount of information is lost, a phenomenon known as the information bottleneck. We apply the Generalized Efficient Layer Aggregation Network (GELAN), which is derived by combining the features of CSPNet [42] and ELAN [43]. This design maximizes accuracy while minimizing parameters and failure instances. GELAN is designed by taking into account the number of parameters, computational burden, accuracy, and runtime efficiency. Compared to the latest methods based on depth-wise convolution, GELAN achieves better parameter utilization using only conventional convolution operators. This design allows users to freely choose suitable computing blocks for different inference devices.
In CSPNet, the base feature layer is split into two parts through a conversion layer. One part passes through dense blocks and transition layers, while the other part is combined with the propagated feature maps at the subsequent stage. The two branches are merged via a concatenation operation and again pass through a conversion layer, as shown in Figure 4a. Compared to CSPNet, ELAN employs a series of stacked convolutional layers, where the output of each convolutional block serves as the input to the subsequent layer, and the outputs from all layers are aggregated and subsequently processed through convolution operations, as illustrated in Figure 4b. ELAN mainly combines the ideas of VoVNet [44] and CSPNet, enhancing the gradient flow across the entire network through a stacked architecture within its computing blocks. GELAN integrates the designs of CSPNet and ELAN: it adopts the split-and-recombine concept of CSPNet and introduces the hierarchical convolution operations of ELAN in each part, as shown in Figure 4c. The difference is that GELAN not only uses hierarchical convolution operations but can also use any computing block, making the network more adaptable and customizable for different application requirements.
In this work, GELAN employs two primary building blocks stacked together, namely the efficient layer aggregation block (RepNCSP) and the computational block (CBS). These aggregated transformations across multiple network branches effectively capture the multi-scale features of different ships, as depicted in Figure 2. The design of GELAN considers lightweightness, inference speed, and accuracy to boost the overall effectiveness of the model. The optionality of modules and partitions shown in Figure 4 further strengthens the dynamic adaptability and modifiability of the network. This structure allows GELAN to accommodate multiple types of computational blocks, enabling it to adapt more seamlessly to diverse computational needs and hardware restrictions. Overall, the architecture of GELAN aims to provide a more versatile and efficient network capable of addressing tasks ranging from lightweight to complex deep learning tasks while maintaining or strengthening processing effectiveness and computational performance. In this way, GELAN is intended to mitigate the shortcomings of current architectures, providing a scalable solution that accommodates future developments in deep learning.
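The sketch below illustrates a GELAN-style block in PyTorch under the structure just described: a CSP-style split, ELAN-style stacking in which every intermediate output is retained, and a final fusion convolution. The sub-blocks here are plain convolution stand-ins for the RepNCSP units, so the code is schematic rather than the YOLOv9 or EGM-YOLOv8 implementation; names and channel widths are illustrative.

```python
import torch
import torch.nn as nn

def cbs(c_in, c_out, k=1):
    """Conv + BatchNorm + SiLU computational block (CBS)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class GELANBlock(nn.Module):
    """GELAN-style block: CSP split + ELAN-style stacking of arbitrary sub-blocks.
    `block_fn` can be swapped for any computational block (e.g., a RepNCSP unit)."""
    def __init__(self, c_in, c_out, c_hidden, n=2, block_fn=None):
        super().__init__()
        block_fn = block_fn or (lambda c: nn.Sequential(cbs(c, c, 3), cbs(c, c, 3)))
        self.cv_in = cbs(c_in, 2 * c_hidden, 1)                # transition before the split
        self.blocks = nn.ModuleList(block_fn(c_hidden) for _ in range(n))
        self.cv_out = cbs((2 + n) * c_hidden, c_out, 1)        # fuse all retained branches

    def forward(self, x):
        a, b = self.cv_in(x).chunk(2, dim=1)                   # CSP-style split
        outs = [a, b]
        for block in self.blocks:
            outs.append(block(outs[-1]))                        # ELAN-style: keep every stage output
        return self.cv_out(torch.cat(outs, dim=1))

if __name__ == "__main__":
    x = torch.randn(1, 128, 40, 40)
    print(GELANBlock(128, 256, c_hidden=64)(x).shape)          # torch.Size([1, 256, 40, 40])
```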

3.3. Loss Function

In the detection head of YOLOv8, CIoU is used for bounding box regression (BBR). It comprises three components: the overlapping area of the predicted and ground-truth bounding boxes, the distance between their center points, and the consistency of their aspect ratios, as shown in Equation (11),

$$L_{CIoU} = 1 - \frac{\left|R \cap R^{gt}\right|}{\left|R \cup R^{gt}\right|} + \frac{\rho^{2}\left(R, R^{gt}\right)}{l^{2}} + \alpha W, \tag{11}$$

where $R$ and $R^{gt}$ denote the predicted and ground-truth bounding boxes, respectively, $\rho\left(R, R^{gt}\right)$ denotes the Euclidean distance between their center points, and $l$ represents the diagonal length of the minimal enclosing box containing both boxes. $\alpha$ is a weighting function, and $W$ quantifies aspect ratio consistency. The definitions of $\alpha$ and $W$ are given by the following Equations (12) and (13).

$$\alpha = \frac{W}{\left(1 - \frac{\left|R \cap R^{gt}\right|}{\left|R \cup R^{gt}\right|}\right) + W}. \tag{12}$$

$$W = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}. \tag{13}$$
Among them, $w^{gt}$ and $h^{gt}$ denote the width and height of the ground-truth bounding box, and $w$ and $h$ represent the width and height of the predicted bounding box. From Equation (13), it can be observed that when the aspect ratio of the predicted bounding box matches that of the ground truth, $W = 0$. However, when the detection data contain numerous low-quality samples, the reinforcement of bounding boxes for low-quality samples by the CIoU loss may impair model detection accuracy, and the penalty term may degrade and become ineffective, hindering the convergence of the model. Addressing this issue benefits gradient smoothness and gradient-descent-based optimization. Moreover, most existing loss functions, such as CIoU, do not incorporate the image dimensions, rendering them incapable of distinguishing cases where the predicted and ground-truth bounding boxes share an identical aspect ratio yet have entirely different width and height values.
In this work, we employ the Intersection over Union with Minimum Points Distance (MPDIoU) [45] as the similarity metric for bounding boxes, replacing the CIoU loss. By minimizing the distances between the upper-left and lower-right corners of the predicted and annotated bounding boxes, MPDIoU reduces computational complexity and attains precise and effective bounding box regression. The calculation steps of MPDIoU are described in detail below. $A^{pred}$ and $A^{gt}$ are the predicted and ground-truth bounding boxes of a target, $A^{pred} = \left(x_1^{pred}, y_1^{pred}, x_2^{pred}, y_2^{pred}\right)$ and $A^{gt} = \left(x_1^{gt}, y_1^{gt}, x_2^{gt}, y_2^{gt}\right)$, where $\left(x_1^{pred}, y_1^{pred}\right)$ and $\left(x_2^{pred}, y_2^{pred}\right)$ are the coordinates of the upper-left and lower-right corners of the predicted bounding box, and $\left(x_1^{gt}, y_1^{gt}\right)$ and $\left(x_2^{gt}, y_2^{gt}\right)$ are those of the ground-truth bounding box. $w$ and $h$ represent the width and height of the input image. The squared distances between corresponding corners are given by Equations (14) and (15).

$$d_1^{2} = \left(x_1^{pred} - x_1^{gt}\right)^{2} + \left(y_1^{pred} - y_1^{gt}\right)^{2}. \tag{14}$$

$$d_2^{2} = \left(x_2^{pred} - x_2^{gt}\right)^{2} + \left(y_2^{pred} - y_2^{gt}\right)^{2}. \tag{15}$$
To calculate the area of $A^{gt}$, the following Equation (16) is utilized.

$$A^{gt} = \left(x_2^{gt} - x_1^{gt}\right) \times \left(y_2^{gt} - y_1^{gt}\right). \tag{16}$$

Similarly, the area of $A^{pred}$ is computed using the following Equation (17).

$$A^{pred} = \left(x_2^{pred} - x_1^{pred}\right) \times \left(y_2^{pred} - y_1^{pred}\right). \tag{17}$$
The overlapping area $I$ between $A^{pred}$ and $A^{gt}$ is calculated using the following Equation (18),

$$I = \begin{cases} \left(x_2^{I} - x_1^{I}\right) \times \left(y_2^{I} - y_1^{I}\right), & \mathrm{if}\ x_2^{I} > x_1^{I},\ y_2^{I} > y_1^{I} \\ 0, & \mathrm{otherwise}, \end{cases} \tag{18}$$

where $x_1^{I} = \max\left(x_1^{pred}, x_1^{gt}\right)$, $x_2^{I} = \min\left(x_2^{pred}, x_2^{gt}\right)$, $y_1^{I} = \max\left(y_1^{pred}, y_1^{gt}\right)$, and $y_2^{I} = \min\left(y_2^{pred}, y_2^{gt}\right)$. The calculation formula for IoU is shown in Equation (19), while the calculation of MPDIoU is presented in Equation (20).
$$IoU = \frac{I}{A^{gt} + A^{pred} - I}. \tag{19}$$

$$MPDIoU = IoU - \frac{d_1^{2}}{w^{2} + h^{2}} - \frac{d_2^{2}}{w^{2} + h^{2}}. \tag{20}$$
Finally, the loss function $L_{MPDIoU}$ based on MPDIoU is defined in Equation (21), and Figure 5 illustrates the factors of the MPDIoU loss function.

$$L_{MPDIoU} = 1 - MPDIoU. \tag{21}$$
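For illustration, the following PyTorch sketch evaluates the MPDIoU loss of Equations (14)–(21) for corner-format boxes; tensor names and the batch-mean reduction are illustrative choices, not the exact training implementation.

```python
import torch

def mpdiou_loss(pred, gt, img_w, img_h, eps=1e-7):
    """MPDIoU loss (Equations (14)-(21)).
    pred, gt: (N, 4) boxes in (x1, y1, x2, y2) corner format.
    img_w, img_h: width and height of the input image."""
    # Squared distances between matching corners (Equations (14)-(15)).
    d1 = (pred[:, 0] - gt[:, 0]) ** 2 + (pred[:, 1] - gt[:, 1]) ** 2
    d2 = (pred[:, 2] - gt[:, 2]) ** 2 + (pred[:, 3] - gt[:, 3]) ** 2

    # Box areas (Equations (16)-(17)).
    area_pred = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_gt = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])

    # Intersection area (Equation (18)).
    x1 = torch.maximum(pred[:, 0], gt[:, 0])
    y1 = torch.maximum(pred[:, 1], gt[:, 1])
    x2 = torch.minimum(pred[:, 2], gt[:, 2])
    y2 = torch.minimum(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    # IoU and MPDIoU (Equations (19)-(20)).
    iou = inter / (area_pred + area_gt - inter + eps)
    diag = img_w ** 2 + img_h ** 2
    mpdiou = iou - d1 / diag - d2 / diag

    return (1.0 - mpdiou).mean()                      # Equation (21), averaged over boxes

if __name__ == "__main__":
    pred = torch.tensor([[10.0, 10.0, 100.0, 60.0]])
    gt = torch.tensor([[12.0, 14.0, 96.0, 64.0]])
    print(mpdiou_loss(pred, gt, img_w=640, img_h=640))
```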

4. Experimental Design and Results Analysis

4.1. Dataset

The experimental data used for training and inference come from the publicly available near-shore ship dataset Seaships [46], which was captured by visual cameras in the Hengqin New District of Zhuhai and includes various weather conditions, low-light environments, and instances of ship occlusion. It comprises 7000 images in total, covering six ship classes: ore carrier (OC), bulk cargo carrier (BCC), general cargo ship (GCS), container ship (CS), fishing boat (FB), and passenger ship (PS). The distribution of bounding boxes and the number of ship instances of different classes are shown in Figure 6. During the experiments, the dataset was randomly divided into training, validation, and test sets at a ratio of 8:1:1, containing 5600, 700, and 700 images, respectively. Figure 7 shows the number of instances per ship category contained in the dataset.
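A minimal Python sketch of such a random 8:1:1 split is shown below; the directory layout, file extension, and random seed are hypothetical.

```python
import random
from pathlib import Path

def split_dataset(image_dir="seaships/images", seed=0, ratios=(0.8, 0.1, 0.1)):
    """Randomly split image files into train/val/test lists at an 8:1:1 ratio."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)          # deterministic shuffle
    n = len(images)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return images[:n_train], images[n_train:n_train + n_val], images[n_train + n_val:]

if __name__ == "__main__":
    train, val, test = split_dataset()
    print(len(train), len(val), len(test))       # e.g., 5600 700 700 for 7000 images
```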

4.2. Implementation Details

The experiments were run on an Ubuntu 20.04 system with an NVIDIA RTX 4090 (24 GB) GPU and an Intel(R) Xeon(R) Silver 4214R CPU @ 2.40 GHz, using the PyTorch 1.12 framework. YOLOv8 provides models of different sizes: n, s, m, l, and x. Considering the limitations of computing resources, this study implements EGM-YOLOv8 based on YOLOv8l. In the following experiments, the input image size is 640 × 640 and the batch size is set to 16. During training, we found that the model tends to stabilize around the 120th epoch. In the training phase, we use mosaic data augmentation to splice and transform the input images, increasing sample diversity, and disable it for the last 10 epochs. To minimize computational resource consumption, we set the number of training epochs for all models to 200. We employ Stochastic Gradient Descent (SGD) as the optimizer, with an initial learning rate of 0.01, a final learning rate of 0.001, a momentum of 0.937, and a weight decay of 0.0005.
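Assuming the standard Ultralytics YOLOv8 training interface, the configuration above would look roughly like the sketch below; the dataset YAML path is hypothetical, and the EGM-YOLOv8 modifications (C2f_ECA, the GELAN neck, and MPDIoU) would require a custom model definition and loss patch on top of the stock package.

```python
from ultralytics import YOLO

# Baseline YOLOv8l; the EGM-YOLOv8 modules are not part of the stock package.
model = YOLO("yolov8l.pt")

model.train(
    data="seaships.yaml",      # hypothetical dataset config (paths + 6 ship classes)
    epochs=200,
    imgsz=640,
    batch=16,
    optimizer="SGD",
    lr0=0.01,                  # initial learning rate
    lrf=0.1,                   # final LR = lr0 * lrf = 0.001
    momentum=0.937,
    weight_decay=0.0005,
    close_mosaic=10,           # disable mosaic augmentation for the last 10 epochs
)
```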

4.3. Evaluation Metrics

In this section, precision (P), recall (R), and mean average precision (mAP) are employed to assess the detection performance of the model, and the number of parameters (in M) and the inference time (in ms) are used to evaluate its detection efficiency. P is the ratio of the number of objects correctly detected by the algorithm to the total number of objects detected by the algorithm, as defined in Equation (22). R is the ratio of the number of objects correctly detected by the algorithm to the number of objects that actually exist, as defined in Equation (23).

$$P = \frac{TP}{TP + FP}. \tag{22}$$

$$R = \frac{TP}{TP + FN}. \tag{23}$$
Here, $TP$ denotes the positive samples correctly classified by the algorithm, $TN$ the negative samples correctly classified, $FP$ the negative samples misclassified as positive, and $FN$ the positive samples misclassified as negative. The P-R curve of the algorithm can be plotted from precision and recall, and $AP$ is defined as the area enclosed by the P-R curve and the coordinate axes, as shown in Equation (24).

$$AP = \int_{0}^{1} P\left(R\right)\, \mathrm{d}R. \tag{24}$$
For multiple categories, mAP takes into account the performance measures of P and R simultaneously and is the most widely adopted metric in target detection. A higher mAP value indicates greater detection accuracy, as shown in Equation (25),

$$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i, \tag{25}$$
where $AP_i$ represents the AP value of category $i$, and $n$ is the number of categories. mAP@0.5 and mAP@0.5:0.95 are selected as the key metrics for measuring detection performance: mAP@0.5 denotes the average precision when the IoU threshold is set to 0.50, and mAP@0.5:0.95 denotes the average mAP over IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05. A higher mAP@0.5 reflects a stronger capability of the algorithm in detecting object locations, whereas a higher mAP@0.5:0.95 implies that the algorithm can achieve better detection accuracy across various application scenarios and requirements.
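As a simple illustration of Equations (22)–(25), the Python sketch below computes precision, recall, AP as the area under a P-R curve, and mAP over classes. Real COCO-style evaluators use per-threshold interpolation of the P-R curve, so this is a simplified approximation with made-up numbers.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Equations (22)-(23): precision and recall from detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall, precision):
    """Equation (24): area under the P-R curve via trapezoidal integration
    (real evaluators usually use 101-point or all-point interpolation)."""
    order = np.argsort(recall)
    return float(np.trapz(np.asarray(precision)[order], np.asarray(recall)[order]))

def mean_average_precision(ap_per_class):
    """Equation (25): mAP as the mean of per-class AP values."""
    return float(np.mean(ap_per_class))

if __name__ == "__main__":
    # Toy P-R points for one class at IoU 0.5 (illustrative numbers only).
    recall = [0.0, 0.5, 0.8, 0.95, 1.0]
    precision = [1.0, 0.98, 0.95, 0.90, 0.70]
    ap = average_precision(recall, precision)
    print(ap, mean_average_precision([ap, 0.93, 0.97]))
```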
Additionally, the detection efficiency of the model is measured using the number of parameters and inference time. The number of parameters depends on the complexity of architecture, the number of layers, the number of neural nodes in each layer, etc. More parameters in a model usually means a larger model size, which also means more data and computing resources are needed for training. In practical applications, it is essential to optimize the trade-off between model complexity and computational expense. Inference time represents the period necessary to evaluate a test image, which reflects the processing efficiency of the algorithm. The longer the inference time, the lower the network speed, but different hardware conditions may also cause variability in inference time.

4.4. Experimental Results and Discussion

4.4.1. Ablation Study on Overall Architecture

Firstly, we evaluate the EGM-YOLOv8 model on the collected ship dataset, which includes six common types found in inland waterways. The training loss mainly reflects how well the training configuration fits the data: a lower loss value indicates better detection performance of the trained model. The training loss curves of EGM-YOLOv8 are depicted in Figure 8. As shown, after 200 epochs of training, the loss value steadily decreased and converged, while precision, recall, and mAP exhibited significant improvements, rendering the model more adept at detection. Next, the effectiveness and reliability of the model are verified via component analysis experiments, exploring the individual contribution of each improvement strategy to the detection model. Here, we implement EGM-YOLOv8 based on YOLOv8l and the other baseline models, all trained with the same hyperparameters. Additionally, we assess six incomplete EGM-YOLOv8 models by removing each module separately. Among them, YOLOv8+ECA denotes exchanging the original C2f module in the YOLOv8 backbone for C2f_ECA; YOLOv8+GELAN denotes replacing the PANet in the original YOLOv8 neck with GELAN; and YOLOv8+MPDIoU denotes replacing the bounding box regression loss with MPDIoU. “w/o ECA” means the backbone of EGM-YOLOv8 uses the original C2f module from YOLOv8 without an attention module; “w/o GELAN” means that EGM-YOLOv8 uses the neck of YOLOv8; and “w/o MPDIoU” means that EGM-YOLOv8 uses the original CIoU loss function from YOLOv8.
Table 1 displays the results of a series of ablation studies of EGM-YOLOv8, where √ and × indicate that the corresponding module is enabled or disabled in the compared methods, respectively. From Table 1, it can be observed that integrating the ECA, GELAN, and MPDIoU modules into YOLOv8 led to varying levels of improvement, enhancing both the speed and accuracy of ship detection. EGM-YOLOv8 integrates convolution and the ECA attention module in the backbone, enhancing the extraction of ship features in dense ship areas and improving the precision of nearshore ship target detection. Combining GELAN and PANet in the neck for feature fusion effectively captures multi-scale ship features, facilitates cross-layer information propagation, reduces missed detections, and maximizes the balance between model parameters and accuracy. The results indicate that, while enhancing feature extraction capability, the improved neck network reduces model parameters by 5.21 × 10^6, increases recall by 1.34%, and improves mAP@0.5 by 0.3%. Therefore, the combination of convolution and ECA in the backbone achieves enhanced feature extraction, while GELAN in the neck boosts feature fusion and effectively cuts down on model parameters. Furthermore, introducing the MPDIoU loss based on minimum point distance into YOLOv8 effectively improves the detection precision mAP@0.5:0.95 by 0.24%.
Finally, relative to YOLOv8, EGM-YOLOv8 demonstrated enhancements in P, R, mAP@0.5, and mAP@0.5:0.95. In particular, R increased by 1.13%, mAP@0.5 improved by 0.41%, and mAP@0.5:0.95 increased by 0.47%. Additionally, EGM-YOLOv8 further reduces the algorithmic complexity and parameter count of YOLOv8, cutting parameters by 13.57%. This lightweight model is noteworthy as it enhances vessel detection accuracy, fulfills the criteria for lightweight design, is compatible with hardware, and supports the utilization of subsequent detection outcomes. Despite the overall inference time increasing from 15.2 ms to 22.3 ms, the model retains its real-time processing ability. Consequently, it can be asserted that EGM-YOLOv8 is proficient in detecting ship targets, capturing more detailed ship features, reducing background interference, extracting relevant distinctive characteristics, and reducing occurrences of erroneous and omitted detections, thus improving the mAP of the network.
In order to more clearly assess the influence of EGM-YOLOv8 on vessel detection, the curves of different models for various evaluation metrics during training are visualized in Figure 9, including the (a) P curve, (b) R curve, (c) mAP@0.5 curve, and (d) mAP@0.5:0.95 curve. It is noticeable that EGM-YOLOv8 (red curve) exceeds YOLOv8 on every evaluation indicator, exhibiting faster convergence and better detection precision. It surpasses YOLOv8 with regard to mAP@0.5 and mAP@0.5:0.95, indicating that EGM-YOLOv8 can provide precise category and positional information for ships in complex scenarios, demonstrating better stability and reliability in ship detection tasks. Figure 10 shows the P-R curves of YOLOv8 and EGM-YOLOv8; it is evident that EGM-YOLOv8 performs quite well across most classes. The EGM-YOLOv8 model achieves an average precision of 0.991, which is better than YOLOv8. The larger the area between the P-R curve and the two coordinate axes, the better the model's performance. The average precision of EGM-YOLOv8 in detecting passenger ships is 0.991, an improvement of 2.16% compared to the baseline. Even with the smallest amount of training data for passenger ships, EGM-YOLOv8 mitigates the data imbalance issue and improves detection on limited-sample data while ensuring real-time processing.
Figure 11 shows a visual representation of the detection outcomes of EGM-YOLOv8 and YOLOv8 for multiple classes of ships. It is evident that both models exhibit satisfactory performance in detecting targets in simple scenes with complete ship bodies, with the EGM-YOLOv8 model achieving relatively higher detection scores. Both models, before and after improvement, are capable of identifying all types of ships and providing relatively accurate position information. For OC with distinct hull features, they obtain high category scores. Similarly, for GCS in low-light conditions, both models are able to correctly identify the number, category, and position information of ships. For the relatively smaller FB and PS, both models also provide relatively high confidence scores. Therefore, both models are capable of detecting all categories under normal conditions, but EGM-YOLOv8 obtains higher confidence scores.
To offer a more thorough evaluation of the efficacy and reliability of EGM-YOLOv8, Figure 12 illustrates the real-time detection results of EGM-YOLOv8 and YOLOv8 under different scenarios, displaying the classes and confidence levels of the bounding boxes. In scenario (a), where two vessels of distinct categories with similar appearances are partially occluded, both EGM-YOLOv8 and YOLOv8 accurately recognize the two vessels. However, the occlusion of one ship causes interference for the YOLOv8 model, resulting in duplicate detections. In scenario (b), where there are multiple ships of different types in a hazy environment, both YOLOv8 and EGM-YOLOv8 can correctly detect multiple ships. However, EGM-YOLOv8 obtains higher confidence in ship detection compared to YOLOv8, especially for the overlapping fishing boats, where EGM-YOLOv8 accurately identifies both ships and provides more precise detection box positions. In scenario (c), involving the detection of a tiny and partially visible target, both models can recognize the incomplete fishing boat. However, YOLOv8 misclassifies the fishing boat as a passenger ship, an erroneous detection. In contrast, EGM-YOLOv8 successfully identifies the ship and provides more accurate type and position information. Similarly, in scenario (d), due to the interference of the coastline in the background, the original YOLOv8 model generates false positives, mistaking the coastline for a bulk cargo carrier. EGM-YOLOv8 avoids this issue and is capable of precisely detecting passenger ships and fishing boats with limited training data and smaller size, with increased confidence in ship identification. In summary, EGM-YOLOv8 demonstrates the ability to correctly identify the quantity, category, and position of numerous vessels under various circumstances, displaying higher detection accuracy and significantly reducing ship omission and false alarm rates.

4.4.2. Comparative Experiment on the Embedding of Different Attention Modules

In order to examine the effect of different attention mechanisms on the efficacy of object detection models, this experiment embedded three other attention mechanisms into the backbone of YOLOv8: channel attention (SE [47] and CA [48]) and spatial attention (CBAM [49]), substituting the C2f_ECA module with corresponding C2f_SE, C2f_CA, and C2f_CBAM modules without modifying any other components. These models were then tested and compared on the Seaships dataset, and their efficacy was assessed using the evaluation metrics to observe the influence of the various attention mechanisms on detection capability. The experimental outcomes are presented in Table 2.
The experimental results indicate that integrating convolution with attention mechanisms effectively enhances the detection capability of the model. Despite a decrease in inference speed, the requirements for real-time detection are still fulfilled, and the model parameters are reduced by approximately 5 M. From Table 2, it can be observed that models integrating the C2f_SE and C2f_CA modules achieve the highest precision in ship detection, along with the highest mAP@0.5. Relative to these, the model integrating the C2f_CBAM module exhibits a minor rise in parameter count of 0.1 M and an improvement of 0.1% in mAP@0.5:0.95. EGM-YOLOv8 incorporates the ECA mechanism, a more streamlined and efficient block, to construct the C2f_ECA module, achieving the best accuracy in mAP@0.5:0.95. The analysis shows that models integrating channel attention mechanisms reduce parameters and improve accuracy. Therefore, considering all evaluation metrics and parameter counts comprehensively, we ultimately choose to integrate the C2f_ECA module, which effectively reduces model parameters, enhances detection capability, and outperforms the original YOLOv8 model across all evaluation metrics.

4.4.3. Compared with Common Detection Methods

To thoroughly evaluate EGM-YOLOv8, a set of carefully designed comparative experiments was conducted against generic detection methods, including Faster R-CNN, YOLOv6, YOLOv7, YOLOv8, TPH-YOLOv5 [25], TPH-YOLOv5++ [26], RT-DETR [50], DAMO-YOLO [51], and YOLO-MS [52]. All comparative models were trained on the Seaships dataset, and the comparative outcomes are displayed in Table 3. From the quantitative analysis of Table 3, Faster R-CNN has a recall rate of 97.2%, demonstrating good detection capability; however, its inference time far exceeds 40 ms, failing to meet real-time detection requirements. Although YOLOv6, YOLO-MS, and TPH-YOLOv5 essentially meet real-time detection standards, their precision in identifying vessels is much lower than that of EGM-YOLOv8. Compared with YOLOv8 and TPH-YOLOv5++, although EGM-YOLOv8 is slower in inference speed with comparable parameter counts, it excels in the other assessment criteria. EGM-YOLOv8 achieves a 1.13% increase in R, a 0.40% improvement in mAP@0.5, and a 0.47% improvement in mAP@0.5:0.95 relative to YOLOv8. Moreover, in terms of mAP@0.5:0.95, EGM-YOLOv8 outperforms TPH-YOLOv5++ by 5.86%. EGM-YOLOv8 demonstrates a significant advantage in recall rate. Regarding model complexity, YOLOv7 displays the fastest inference speed among all methods, but its average precision is middling. EGM-YOLOv8, however, attains a strong balance between detection precision and inference speed, achieving a recall rate of 98.3% and an average precision of 84.8% in mAP@0.5:0.95, yielding the best results. Additionally, EGM-YOLOv8 reduces the parameter count by 5.21 M relative to YOLOv8. To expand the scope of analysis, the public Mcships dataset [53] is added, which includes 7996 ship images captured from different observation positions, under varying weather conditions, and with occlusions at different scales. Table 4 presents the experimental results of the various comparative models on the Mcships dataset. The results demonstrate that EGM-YOLOv8 significantly outperforms the other comparative models in terms of detection accuracy on this dataset, showcasing its applicability.
Figure 13 illustrates the P-R curves for general comparative models in detecting different types of ships. It is also apparent that as the recall rate rises, the curves of EGM-YOLOv8 are higher than the others in most ship categories, and competitive precision rates are achieved for most ship categories. Particularly for visually similar categories like BCC and GCS, our proposed model achieves mAP@0.5:0.95 of 85.3% and 87.1%, respectively, demonstrating competitive performance. Even for smaller targets such as FB and PS, the average precision consistently improves. Notably, for the category with the fewest instances, PS, EGM-YOLOv8 also secures the optimal identification results.
The specific identification outcomes of various models are illustrated in Figure 14. Qualitative comparisons of some representative challenging cases among these seven detectors were conducted. EGM-YOLOv8 excels in all these intricate situations and is closer to ground-truth. Nevertheless, the other models frequently experience difficulties in particular scenes. For instance, in scenarios where incomplete object detection occurs (Scene a), EGM-YOLOv8 achieves higher scores in correctly detecting categories compared to other comparative models, while Faster R-CNN encounters duplicate detection problems. In cases of severe occlusion and obstruction in ship detection (Scene b), YOLOv7 and YOLOv6 exhibit missed detection issues, failing to detect the ore carrier hidden behind the bulk cargo carrier. Additionally, both TPH-YOLOv5 and Faster R-CNN detect the ore carrier, but also encounter duplicate detection problems for the bulk cargo carrier. EGM-YOLOv8 can overcome interference from complex backgrounds, correctly detect ships obscured by other ships, and achieve relatively high confidence scores. Subsequently, for distant and small-sized passenger ships (Scene c), the detection performance of TPH-YOLOv5 is relatively poor, and although Faster R-CNN obtains higher confidence scores, it also experiences duplicate detection issues. Lastly, for small and unclear fishing boat targets (Scene d), YOLOv6, TPH-YOLOv5, TPH-YOLOv5++, and Faster R-CNN all encounter missed detection problems. Given the unclear and incomplete nature of small targets, which may lack distinct features, EGM-YOLOv8 correctly detects the targets, reducing the occurrence of missed detections. In summary, EGM-YOLOv8 is better suited for vessel identification in complicated and multi-scale backgrounds compared to other comparative models. It provides correct categories and accurate position information for visually similar and partially occluded ships, as well as accurate detection of the quantity, type, and position information of incomplete ships in images, avoiding complex background interference, incorrect detections, and duplicate detections, with higher confidence in vessel identification. Therefore, EGM-YOLOv8 demonstrates superior system performance on ship datasets.

4.4.4. Comparisons with Domain-Specific Models

In this part, EGM-YOLOv8 is evaluated against several other ship detection methods, including AARN [32], YOLOv5ship [33], IL-YOLOv5 [34], and ALF-YOLO [54]. To guarantee uniformity in the outcomes, mAP@0.5 and mAP@0.5:0.95 are employed to assess the detection capability of the models, while the Frames Per Second (FPS) metric proposed in [32] is employed to measure detection efficiency. The definition of FPS is given by Equation (26).
$$FPS = \frac{1}{t_{img}}, \tag{26}$$

where $t_{img}$ indicates the time needed for the model to test one image.
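The sketch below shows one way to estimate $t_{img}$ and FPS in PyTorch for a generic detector callable; the warm-up count, input size, and use of random input are illustrative assumptions, and GPU timing requires explicit synchronization.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, img_size=640, n_warmup=10, n_runs=100, device="cuda"):
    """Estimate t_img (ms per image) and FPS = 1 / t_img (Equation (26))."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, img_size, img_size, device=device)
    for _ in range(n_warmup):                 # warm-up to stabilize clocks and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    t_img = (time.perf_counter() - start) / n_runs
    return t_img * 1000.0, 1.0 / t_img        # (ms per image, FPS)
```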
Table 5 presents the detection results of EGM-YOLOv8 in comparison to other ship detection methods across different ship categories. It is apparent from Table 5 that EGM-YOLOv8 achieves the same result as ALF-YOLO in terms of the mAP@0.5 detection accuracy metric. Although the EGM-YOLOv8 model slightly underperforms ALF-YOLO in mAP@0.5:0.95, it reduces the parameter total by 10.7% compared to ALF-YOLO and achieves a higher FPS score. Therefore, EGM-YOLOv8 achieves a better balance between detection accuracy and speed than ALF-YOLO. EGM-YOLOv8 improves recognition precision for OC, BCC, GCS, and CS by 0.51%, 0.10%, 0.3%, and 0.41% relative to IL-YOLOv5, respectively. While it shows a slight decrease of 0.02% for FB detection, the overall mAP@0.5 is improved by an average of 0.20%. The advancement in the mAP@0.5:0.95 metric is even more substantial, with an average increase of 7.34%. AARN extracts ship features through a feature refinement module, uses an attention mechanism to capture deep features for constructing a feature pyramid network, and employs the AADM module to correct bounding boxes, overcoming background interference. However, due to the weaker ability of its backbone to capture important features, its recognition performance is inferior to EGM-YOLOv8 in vessel detection. Both YOLOv5ship and IL-YOLOv5 are based on the YOLOv5 architecture and incorporate the CA module to highlight ship feature information and identify target regions. YOLOv5ship optimizes multi-scale feature layers by adding a transformer module in the feature combination phase, while IL-YOLOv5 boosts feature fusion capabilities using an improved BiFPN. ALF-YOLO is a marine vessel recognition model founded on YOLOv8 that employs a progressive feature integration technique to enhance feature combination across non-adjacent levels and adds a small-target recognition head to strengthen the recognition of small targets. While ALF-YOLO attains the highest detection precision among the compared models, it also has the largest parameter volume and lower detection efficiency. In contrast, EGM-YOLOv8 adopts a more lightweight attention mechanism and neck network. By integrating convolution and the lightweight yet efficient ECA into the backbone blocks, EGM-YOLOv8 can suppress irrelevant information, enhance feature extraction capabilities, and enrich ship detail features, enabling more efficient feature fusion. Additionally, the fusion of GELAN and PANet in the neck network allows for better integration of multi-scale ship features, reducing the overall network parameters and making the network more flexible. Furthermore, the deployment of the efficient and accurate MPDIoU loss function, which is better suited to ship datasets, improves target detection accuracy. Overall, EGM-YOLOv8 ensures a well-balanced trade-off between detection performance and inference speed compared to these benchmark models.

5. Conclusions

This work proposes a lightweight and efficient EGM-YOLOv8 model for accurate ship detection. First, the compact and effective ECA module is incorporated into the C2f module for multi-level feature extraction, strengthening the feature extraction capability. Second, in the neck, the lightweight GELAN is combined with PANet to build a lighter and more efficient neck network that better integrates ship features at different levels, retains global ship feature information, and further improves detection accuracy. In addition, we adopt MPDIoU, a minimum-distance loss function based on bounding-box geometry, which optimizes the gradient-gain allocation strategy, enables the model to balance targets of various scales during training, improves generalization, accelerates convergence, and delivers better regression accuracy. Finally, on the Seaships dataset, EGM-YOLOv8 is compared with Faster R-CNN, YOLOv6, YOLOv7, TPH-YOLOv5, and TPH-YOLOv5++ to evaluate its recognition accuracy and overall effectiveness. The experimental results confirm that EGM-YOLOv8 outperforms these methods in ship detection tasks. Compared with YOLOv8, it improves recall by 1.13% and mAP@0.5:0.95 by 0.47%, while cutting the parameter count by 13.57% and the computational cost by 11.05%, saving computational resources and hardware consumption. Directions for further improvement, including more efficient backbones, lighter detection heads, and model-compression techniques such as quantization, pruning, and knowledge distillation, are discussed below.
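As a concrete reference for the regression objective summarized above, below is a minimal PyTorch sketch of an MPDIoU-style loss following the formulation in [45]: the IoU term is penalized by the squared distances between the corresponding top-left and bottom-right corners of the predicted and ground-truth boxes, normalized by the squared input-image diagonal. Corner-format boxes (x1, y1, x2, y2) and a known input resolution are assumed, and the function name and arguments are illustrative rather than the exact training code used in this work.

```python
import torch

def mpdiou_loss(pred, target, img_w, img_h, eps=1e-7):
    """L_MPDIoU = 1 - MPDIoU for boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    # Standard IoU term.
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(min=0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Squared distances between corresponding corners, normalized by w^2 + h^2 of the image.
    d1 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2  # top-left corners
    d2 = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2  # bottom-right corners
    mpdiou = iou - (d1 + d2) / (img_w ** 2 + img_h ** 2)
    return (1.0 - mpdiou).mean()
```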
Despite these contributions, this study has several limitations: (1) Limited Dataset Scope: the experiments were conducted on the Seaships and Mcships datasets, which focus on near-shore vessel detection; the model’s performance in open-sea environments, adverse weather conditions, and high-density maritime traffic needs further validation. (2) Inference Speed Trade-off: although the model reduces the parameter count and computational complexity, its inference speed is slightly slower than that of the original YOLOv8; this trade-off needs further optimization, especially for real-time edge deployments. (3) Generalization to Other Maritime Tasks: the study primarily addresses visible-light ship detection; extending the approach to multimodal data (e.g., infrared or SAR imagery) could enhance robustness in diverse maritime conditions.
To further advance lightweight ship detection, future work will focus on: (1) Optimizing Inference Speed: Exploring efficient backbone networks and lighter detection heads to improve real-time performance on edge devices. (2) Model Compression Techniques: Applying quantization, pruning, and knowledge distillation to reduce model size while preserving accuracy. (3) Expanding Benchmarking Datasets: Evaluating the model on multiple maritime datasets to ensure robustness across different environmental conditions and vessel types. (4) Multimodal Fusion Strategies: Integrating radar, infrared, and visible-light data to improve detection accuracy in adverse weather and low-visibility conditions.

Author Contributions

Y.L.: Conceptualization, Methodology, Software, Experiments, Validation, Formal analysis, Writing—original draft, Writing—review & editing. S.W.: Supervision, Resources, Funding acquisition, Experiments, Writing—review & editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key R&D Program of China (2024YFB3908800), the Cultivation Program for the Excellent Doctoral Dissertation of Dalian Maritime University (0034012401), the Fundamental Research Funds for the Central Universities (3132023507), and the Dalian High-Level Talent Innovation Program (2022RG02).

Data Availability Statement

No new data were created in this study. The data analyzed in this study are from the SeaShips dataset (accessed on 1 January 2025), which can be downloaded from https://github.com/jiaming-wang/SeaShips.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lehtola, V.; Montewka, J.; Goerlandt, F.; Guinness, R.; Lensu, M. Finding safe and efficient shipping routes in ice-covered waters: A framework and a model. Cold Reg. Sci. Technol. 2019, 165, 102795.1–102795.14. [Google Scholar]
  2. Namgung, H.; Kim, J.S. Collision risk inference system for maritime autonomous surface ships using COLREGs rules compliant collision avoidance. IEEE Access 2021, 9, 7823–7835. [Google Scholar]
  3. Vagale, A.; Oucheikh, R.; Bye, R.T.; Osen, O.L.; Fossen, T.I. Path planning and collision avoidance for autonomous surface vehicles I: A review. J. Mar. Sci. Technol. 2021, 26, 1292–1306. [Google Scholar]
  4. Qian, L.; Zheng, Y.; Li, L.; Ma, Y.; Zhou, C.; Zhang, D. A new method of inland water ship trajectory prediction based on long short-term memory network optimized by genetic algorithm. Appl. Sci. 2022, 12, 4073–4088. [Google Scholar] [CrossRef]
  5. Namgung, H. Local route planning for collision avoidance of maritime autonomous surface ships in compliance with COLREGs rules. Sustainability 2021, 14, 198. [Google Scholar] [CrossRef]
  6. Vagale, A.; Bye, R.T.; Oucheikh, R.; Osen, O.L.; Fossen, T.I. Path planning and collision avoidance for autonomous surface vehicles II: A comparative study of algorithms. J. Mar. Sci. Technol. 2021, 26, 1307–1323. [Google Scholar]
  7. Zwemer, M.H.; Wijnhoven, R.G.; de With Peter, H.N. Ship Detection in Harbour Surveillance based on Large-Scale Data and CNNs. In Proceedings of the VISIGRAPP, Funchal-Madeira, Portugal, 27–29 January 2018; Volume 5, pp. 153–160. [Google Scholar]
  8. Hu, C.; Zhu, Z.; Yu, Z. Ship Identification Based on Improved SSD. In Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering, Xiamen, China, 21–23 October 2022; pp. 476–482. [Google Scholar]
  9. Shao, Z.; Wang, L.; Wang, Z.; Du, W.; Wu, W. Saliency-aware convolution neural network for ship detection in surveillance video. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 781–794. [Google Scholar]
  10. Li, H.; Deng, L.; Yang, C.; Liu, J.; Gu, Z. Enhanced YOLO v3 tiny network for real-time ship detection from visual image. IEEE Access 2021, 9, 16692–16706. [Google Scholar]
  11. Zhou, S.; Yin, J. YOLO-Ship: A Visible Light Ship Detection Method. In Proceedings of the 2022 2nd International Conference on Consumer Electronics and Computer Engineering (ICCECE), Guangzhou, China, 14–16 January 2022; pp. 113–118. [Google Scholar]
  12. Chen, Z.; Liu, C.; Filaretov, V.F.; Yukhimets, D.A. Multi-Scale Ship Detection Algorithm Based on YOLOv7 for Complex Scene SAR Images. Remote Sens. 2023, 15, 2071. [Google Scholar] [CrossRef]
  13. Corbane, C.; Najman, L.; Pecoul, E.; Demagistri, L.; Petit, M. A complete processing chain for ship detection using optical satellite imagery. Int. J. Remote Sens. 2010, 31, 5837–5854. [Google Scholar]
  14. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 142–158. [Google Scholar] [CrossRef] [PubMed]
  15. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  16. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 28, Montreal, QU, Canada, 7–12 December 2015; Volume 28, pp. 1–14. [Google Scholar]
  17. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NA, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  19. Zhu, L.; Xie, Z.; Liu, L.; Tao, B.; Tao, W. Iou-uniform r-cnn: Breaking through the limitations of rpn. Pattern Recognit. 2021, 112, 107816. [Google Scholar] [CrossRef]
  20. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NA, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  22. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  23. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  24. Kim, J.H.; Kim, N.; Won, C.S. High-Speed Drone Detection Based On Yolo-V8. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–2. [Google Scholar]
  25. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual Conference, 11–17 October 2021; pp. 2778–2788. [Google Scholar]
  26. Zhao, Q.; Liu, B.; Lyu, S.; Wang, C.; Zhang, H. TPH-YOLOv5++: Boosting Object Detection on Drone-Captured Scenarios with Cross-Layer Asymmetric Transformer. Remote Sens. 2023, 15, 1687. [Google Scholar] [CrossRef]
  27. Wang, F.; Wang, H.; Qin, Z.; Tang, J. UAV target detection algorithm based on improved YOLOv8. IEEE Access 2023, 11, 116534–116544. [Google Scholar] [CrossRef]
  28. Shen, L.; Lang, B.; Song, Z. DS-YOLOv8-Based Object Detection Method for Remote Sensing Images. IEEE Access 2023, 11, 125122–125137. [Google Scholar] [CrossRef]
  29. Huang, Y.; Han, D.; Han, B.; Wu, Z. ADV-YOLO: Improved SAR ship detection model based on YOLOv8. J. Supercomput. 2025, 81, 34. [Google Scholar] [CrossRef]
  30. Chen, Y.; Ren, J.; Li, J.; Shi, Y. Enhanced Adaptive Detection of Nearby and Distant Ships in Fog: A Real-Time Multi-Scale Target Detection Strategy. Digit. Signal Process. 2024, 158, 104961. [Google Scholar] [CrossRef]
  31. Li, Z.; Ma, H.; Guo, Z. MAEE-Net: SAR ship target detection network based on multi-input attention and edge feature enhancement. Digit. Signal Process. 2025, 156, 104810. [Google Scholar] [CrossRef]
  32. Liu, D.; Zhang, Y.; Zhao, Y.; Shi, Z.; Zhang, J.; Zhang, Y.; Zhang, Y. AARN: Anchor-guided attention refinement network for inshore ship detection. IET Image Process. 2023, 17, 2225–2237. [Google Scholar] [CrossRef]
  33. Zhou, W.; Peng, Y. Ship detection based on multi-scale weighted fusion. Displays 2023, 78, 102448. [Google Scholar] [CrossRef]
  34. Liu, W.; Chen, Y. IL-YOLOv5: A Ship Detection Method Based on Incremental Learning. In Proceedings of the International Conference on Intelligent Computing, Chennai, India, 28–29 April 2023; pp. 588–600. [Google Scholar]
  35. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  36. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  37. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  38. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012. [Google Scholar]
  39. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000. [Google Scholar] [CrossRef]
  40. Zhang, T.; Qi, G.J.; Xiao, B.; Wang, J. Interleaved group convolutions. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4373–4382. [Google Scholar]
  41. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WT, USA, 14–19 June 2020; pp. 11534–11542. [Google Scholar]
  42. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WT, USA, 14–19 June 2020; pp. 390–391. [Google Scholar]
  43. Wang, C.Y.; Liao, H.Y.M.; Yeh, I.H. Designing network design strategies through gradient path analysis. arXiv 2022, arXiv:2211.04800. [Google Scholar]
  44. Lee, Y.; Hwang, J.W.; Lee, S.; Bae, Y.; Park, J. An energy and GPU-computation efficient backbone network for real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019; pp. 1–11. [Google Scholar]
  45. Siliang, M.; Yong, X. Mpdiou: A loss for efficient and accurate bounding box regression. arXiv 2023, arXiv:2307.07662. [Google Scholar]
  46. Shao, Z.; Wu, W.; Wang, Z.; Du, W.; Li, C. Seaships: A large-scale precisely annotated dataset for ship detection. IEEE Trans. Multimedia 2018, 20, 2593–2604. [Google Scholar] [CrossRef]
  47. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  48. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Conference, 19–25 June 2021; pp. 13713–13722. [Google Scholar]
  49. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  50. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  51. Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. Damo-yolo: A report on real-time object detection design. arXiv 2022, arXiv:2211.15444. [Google Scholar]
  52. Chen, Y.; Yuan, X.; Wang, J.; Wu, R.; Li, X.; Hou, Q.; Cheng, M.M. YOLO-MS: Rethinking multi-scale representation learning for real-time object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 1–14. [Google Scholar] [CrossRef]
  53. Zheng, Y.; Zhang, S. Mcships: A large-scale ship dataset for detection and fine-grained categorization in the wild. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK, 6–10 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6. [Google Scholar]
  54. Wang, S.; Li, Y.; Qiao, S. ALF-YOLO: Enhanced YOLOv8 based on multiscale attention feature fusion for ship detection. Ocean. Eng. 2024, 308, 118233. [Google Scholar] [CrossRef]
Figure 1. The overall structure of YOLOv8.
Figure 2. The comprehensive structure of EGM-YOLOv8.
Figure 3. Architectural illustration of the Efficient Channel Attention (ECA) module.
Figure 4. The structure of different feature fusion networks: (a) CSPNet, (b) ELAN, and (c) GELAN.
Figure 5. Factors of $L_{\mathrm{MPDIoU}}$.
Figure 6. The classes and quantity of ship instances in the Seaships dataset.
Figure 7. The distribution of ship instances in the data.
Figure 8. Loss curve of EGM-YOLOv8.
Figure 9. Curves of various evaluation indicators of different models in the ablation experiments: (a) Precision, (b) Recall, (c) mAP@0.5, and (d) mAP@0.5:0.95.
Figure 10. The P-R curves of YOLOv8 and EGM-YOLOv8.
Figure 11. The detection effect of EGM-YOLOv8 and YOLOv8 for various classes of ships.
Figure 12. (a–d) Results of ship detection under different scenarios between EGM-YOLOv8 and YOLOv8.
Figure 13. (a–f) P-R curves of various models for identifying different ship types.
Figure 14. Instances of detection outcomes from the comparison models.
Table 1. Experimental results of EGM-YOLOv8 and different variant models of YOLOv8.

| Model | ECA | GELAN | MPDIoU | P | R | mAP@0.5 | mAP@0.5:0.95 | Parameters (M) | Inference Time (ms) |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv8 | × | × | × | 0.978 | 0.972 | 0.987 | 0.844 | 43.61 | 15.2 |
| YOLOv8+ECA | ✓ | × | × | 0.988 | 0.963 | 0.989 | 0.843 | 43.61 | 14.9 |
| YOLOv8+GELAN | × | ✓ | × | 0.969 | 0.985 | 0.990 | 0.840 | 38.40 | 18.7 |
| YOLOv8+MPDIoU | × | × | ✓ | 0.974 | 0.981 | 0.989 | 0.846 | 43.61 | 15.7 |
| w/o ECA | × | ✓ | ✓ | 0.975 | 0.986 | 0.990 | 0.840 | 38.40 | 18.2 |
| w/o GELAN | ✓ | × | ✓ | 0.971 | 0.981 | 0.990 | 0.843 | 43.61 | 17.7 |
| w/o MPDIoU | ✓ | ✓ | × | 0.979 | 0.984 | 0.990 | 0.843 | 38.40 | 20.0 |
| Ours | ✓ | ✓ | ✓ | 0.978 | 0.983 | 0.991 | 0.848 | 38.40 | 22.3 |
Table 2. Comparative analysis of backbone modules with diverse attention components.

| Type of Attention | P | R | mAP@0.5 | mAP@0.5:0.95 | Parameters (M) | Inference Time (ms) |
|---|---|---|---|---|---|---|
| C2f_SE | 0.986 | 0.978 | 0.991 | 0.842 | 38.49 | 20.5 |
| C2f_CBAM | 0.976 | 0.981 | 0.989 | 0.843 | 38.60 | 28.1 |
| C2f_CA | 0.985 | 0.981 | 0.992 | 0.842 | 38.49 | 21.1 |
| Ours | 0.978 | 0.983 | 0.991 | 0.848 | 38.40 | 22.3 |
Table 3. Evaluation of detection efficacy between EGM-YOLOv8 and current common detection models on the Seaships dataset.

| Model | P | R | mAP@0.5 | mAP@0.5:0.95 | Parameters (M) | Inference Time (ms) |
|---|---|---|---|---|---|---|
| Faster R-CNN | 0.718 | 0.972 | 0.962 | 0.601 | 51.75 | 70.0 |
| YOLOv6 | 0.978 | 0.976 | 0.989 | 0.827 | 110.87 | 33.6 |
| YOLOv7 | 0.980 | 0.980 | 0.993 | 0.816 | 36.51 | 12.3 |
| YOLOv8 | 0.978 | 0.972 | 0.987 | 0.844 | 43.61 | 15.2 |
| TPH-YOLOv5 | 0.967 | 0.969 | 0.986 | 0.781 | 45.40 | 33.2 |
| TPH-YOLOv5++ | 0.977 | 0.967 | 0.987 | 0.801 | 41.52 | 19.2 |
| DAMO-YOLO | 0.986 | 0.975 | 0.988 | 0.842 | 51.97 | 18.6 |
| YOLO-MS | 0.983 | 0.969 | 0.989 | 0.831 | 50.36 | 15.6 |
| RT-DETR | 0.966 | 0.964 | 0.988 | 0.797 | 31.99 | 36.4 |
| Ours | 0.978 | 0.983 | 0.991 | 0.848 | 38.40 | 22.3 |
Table 4. Experimental results of different comparison models on the Mcships dataset.

| Model | P | R | mAP@0.5 | mAP@0.5:0.95 | Parameters (M) | Inference Time (ms) |
|---|---|---|---|---|---|---|
| Faster R-CNN | 0.531 | 0.906 | 0.858 | 0.477 | 51.75 | 80.9 |
| YOLOv6 | 0.909 | 0.848 | 0.919 | 0.668 | 110.87 | 16.5 |
| YOLOv7 | 0.914 | 0.868 | 0.925 | 0.630 | 36.51 | 12.3 |
| YOLOv8 | 0.928 | 0.866 | 0.933 | 0.687 | 43.61 | 20.4 |
| TPH-YOLOv5 | 0.876 | 0.786 | 0.867 | 0.588 | 45.40 | 35.4 |
| TPH-YOLOv5++ | 0.904 | 0.850 | 0.910 | 0.632 | 41.52 | 21.7 |
| DAMO-YOLO | 0.929 | 0.833 | 0.908 | 0.666 | 51.97 | 23.2 |
| YOLO-MS | 0.886 | 0.847 | 0.905 | 0.648 | 50.36 | 15.7 |
| RT-DETR | 0.910 | 0.839 | 0.889 | 0.638 | 31.99 | 26.2 |
| Ours | 0.932 | 0.866 | 0.934 | 0.690 | 38.40 | 20.3 |
Table 5. Comparative analysis of detection precision across ship-specific models.

| Model | Metrics | All | OC | BCC | GCS | CS | FB | PS | Parameters (M) | FPS |
|---|---|---|---|---|---|---|---|---|---|---|
| AARN | mAP@0.5 | 0.947 | 0.948 | 0.947 | 0.958 | 0.980 | 0.927 | 0.923 | 35.82 | 45 |
| AARN | mAP@0.5:0.95 | 0.702 | 0.677 | 0.708 | 0.718 | 0.786 | 0.659 | 0.666 | | |
| YOLOv5ship | mAP@0.5 | 0.976 | 0.984 | 0.963 | 0.975 | 0.983 | 0.972 | 0.980 | 40.30 | 60 |
| YOLOv5ship | mAP@0.5:0.95 | 0.710 | 0.644 | 0.678 | 0.741 | 0.794 | 0.656 | 0.744 | | |
| IL-YOLOv5 | mAP@0.5 | 0.989 | 0.990 | 0.992 | 0.987 | 0.981 | 0.992 | 0.991 | 29.80 | 94 |
| IL-YOLOv5 | mAP@0.5:0.95 | 0.790 | 0.759 | 0.792 | 0.831 | 0.834 | 0.750 | 0.777 | | |
| ALF-YOLO | mAP@0.5 | 0.991 | 0.995 | 0.994 | 0.986 | 0.985 | 0.991 | 0.995 | 42.51 | 38 |
| ALF-YOLO | mAP@0.5:0.95 | 0.850 | 0.850 | 0.859 | 0.870 | 0.866 | 0.796 | 0.857 | | |
| Ours | mAP@0.5 | 0.991 | 0.995 | 0.993 | 0.990 | 0.985 | 0.990 | 0.991 | 38.40 | 45 |
| Ours | mAP@0.5:0.95 | 0.848 | 0.845 | 0.853 | 0.871 | 0.866 | 0.798 | 0.854 | | |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
