1. Introduction
As environmental pollution becomes increasingly severe, floating garbage in water bodies is causing growing damage to ecosystems [1]: it not only affects the stability of aquatic ecosystems but also poses potential threats to shipping safety, water-based operations, and tourism activities. Currently, the cleanup of floating debris on water surfaces relies primarily on manual operations, which suffer from low efficiency, high costs, and poor real-time performance. To enable efficient and cost-effective cleanup, image-processing-based automatic detection techniques have emerged as a key technology. These methods analyze collected images to automatically identify the location and category of targets, significantly enhancing the real-time performance and accuracy of detection.
In recent years, the rapid development of deep learning has driven the widespread application of object detection technology, giving rise to a variety of detection frameworks, including Faster R-CNN [2], SSD [3], and the YOLO series [4]. These methods have been widely applied in practical scenarios such as industrial quality inspection and traffic monitoring, demonstrating excellent performance and application value. Yang Huapeng et al. [5] used the Faster R-CNN method to effectively identify six common objects on the water surface, enhancing the target recognition capabilities of unmanned boats in complex water environments. However, this method has limited detection capability for small objects and involves high computational complexity. To address computational resource constraints and detection accuracy requirements in practical deployments, Henar et al. [6] designed an edge temperature monitoring system based on a multi-task convolutional neural network, significantly improving the system's operational efficiency and energy consumption on embedded platforms such as PCs and Jetson devices.
Among these, the YOLO series [7], as a representative single-stage detector, has performed exceptionally well in edge scenarios owing to its end-to-end design, high speed, and flexible deployment. In recent years, it has also been gradually applied to water surface target detection for identifying floating debris, obstacles, and vessels. However, due to strong reflections, water wave disturbances, dense targets, and irregular occlusions in water surface scenarios, traditional YOLO models still face challenges in accuracy and robustness [8].
To address these issues, researchers have conducted extensive studies on model accuracy, structural lightweighting, and environmental adaptability. For example, Liu Tao et al. [9] improved the anchor box settings and feature fusion structure of YOLOv3, significantly enhancing the accuracy of sea surface target detection; however, limitations remain in anchor box dependency and adaptability to target scale. Bochkovskiy et al. [10] proposed YOLOv4, a real-time object detection system that integrates numerous detection technology innovations. It achieved outstanding performance on the MS COCO dataset and became the new mainstream detection framework, but it still produces a certain false positive rate against complex water surface backgrounds and exhibits unstable recognition of unstructured targets such as floating debris. To balance detection accuracy and model size, the MSA-YOLOv5 model proposed by Park Jong-chan et al. [11] achieved an average accuracy of 98.3% on the VHR-10 dataset with only 1.795 million parameters, balancing accuracy and light weight in small object detection tasks and suiting real-time applications in resource-constrained environments. Huang Chengwen et al. [12] replaced the C3 structure in YOLOv5 with a Transformer encoder and introduced a small object detection layer and a CBAM module, achieving higher accuracy in water surface object detection, but at the cost of greater model complexity and reduced real-time performance. YOLOv8 [13], one of the important updated versions in this series, introduces an anchor-free design and a modular structure, balancing detection performance and engineering deployment flexibility. Based on YOLOv8, Zhang Haozhi et al. [14] proposed an improved lightweight water obstacle detection algorithm, achieving real-time and accurate detection of water obstacles. Wang Jie et al. [15] proposed an improved algorithm, YOLOv8-MSS; experiments on the public FloW-Img dataset showed that it improved detection accuracy by 5% and 2.6% over the original model, further enhancing accuracy and practicality while maintaining real-time performance.
Therefore, this paper proposes an improved object detection algorithm based on YOLOv8, aiming to enhance its detection accuracy and stability in complex water surface environments. Our contributions are as follows:
- (1)
Adding the RFAConv (Receptive-Field Attention Convolution) spatial attention mechanism to the backbone network enables the model to better handle details and complex patterns in detection images, reducing missed and false detections.
- (2)
To improve detection accuracy for small objects, the original three-layer detection head is replaced with a four-layer ASFF module to achieve multi-scale feature fusion. Combined with the DFL strategy, this improves the model’s accuracy and stability in complex water surface scenes.
- (3)
To avoid the drop in inference speed and growth in model size caused by the added modules, this study adopts the Slim-neck architecture in the neck, which consists of the GSConv (Ghost-Shuffle Convolution) and VoV-GSCSP (VoVNet-based Ghost-Shuffle Cross Stage Partial) modules.
- (4)
Applying the MPDIoU (Minimum Point Distance Intersection over Union) loss function to the bounding box regression task replaces the traditional CIoU method, improving localization accuracy and convergence effects with a more stable geometric metric.
2. Method
2.1. YOLOv8 Algorithm
YOLOv8 [16] is the eighth-generation core version of the YOLO series, released by Ultralytics in early 2023. As shown in Figure 1, the overall architecture of YOLOv8 consists of three parts: the backbone network, the neck network, and the detection head. Building on the architectural advantages of YOLOv5 [17], this version has undergone systematic optimization and upgrades in model structure, training strategy, and inference efficiency. Although newer versions of the series, such as YOLOv12, represent the latest research progress, YOLOv8 continues to be widely applied in object detection thanks to its lightweight design, high-precision detection performance, and excellent engineering scalability, and it has become one of the key foundational architectures for secondary development. To adapt to different platform conditions and application requirements, YOLOv8 offers five scaled variants: YOLOv8n (nano), YOLOv8s (small), YOLOv8m (medium), YOLOv8l (large), and YOLOv8x (extra-large). These variants incrementally increase in network depth and width, covering scenarios from lightweight rapid deployment to high-precision, high-computational-power applications, achieving a balance between model complexity and detection performance. Among them, YOLOv8n is better suited to resource-constrained real-time deployment, while YOLOv8x excels in accuracy and is ideal for high-performance applications with ample computational resources.
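As a concrete illustration of how a scaled variant is selected in practice, the following minimal sketch uses the Ultralytics Python API; the weight file names follow the official naming convention, while the image path and dataset configuration file are illustrative placeholders only.

```python
# Minimal sketch: selecting and running a YOLOv8 variant with the Ultralytics API.
from ultralytics import YOLO

# Scaled variants trade accuracy against speed and size: n < s < m < l < x.
model = YOLO("yolov8n.pt")            # nano: resource-constrained, real-time deployment
# model = YOLO("yolov8x.pt")          # extra-large: highest accuracy, highest compute cost

results = model.predict("water_surface_sample.jpg", conf=0.25)    # single-image inference
# model.train(data="floating_debris.yaml", epochs=100, imgsz=640) # fine-tuning on a custom dataset
```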
Figure 2 illustrates the key structural details of each module. The Conv (Convolutional Block) is the most basic building block in YOLOv8, typically consisting of a standard convolution (Conv2d), batch normalization, and an activation function, used to extract local spatial features from images and perform channel mapping; the C2f (Cross Stage Partial with two convolution layers and feature fusion) module is structurally simpler than the CSP (Cross Stage Partial Network) in YOLOv5, using a dual-convolution branch to achieve cross-stage feature fusion, thereby enhancing semantic modeling capabilities while reducing parameter counts and computational overhead; the SPPF (Spatial Pyramid Pooling-Fast) module is a lightweight design based on SPP (Spatial Pyramid Pooling), which extracts multi-scale features by concatenating multiple max pooling operations. This not only enhances the receptive field and context awareness but also avoids redundant computations, thereby improving inference efficiency.
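For readers less familiar with these building blocks, the following simplified PyTorch sketch mirrors the Conv block (Conv2d, batch normalization, SiLU) and the SPPF module described above; it is written for illustration and is not the Ultralytics source code.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Standard convolution -> batch normalization -> SiLU activation."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """Spatial Pyramid Pooling-Fast: three cascaded max-poolings whose outputs are concatenated."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hid = c_in // 2
        self.cv1 = ConvBlock(c_in, c_hid, k=1)
        self.cv2 = ConvBlock(c_hid * 4, c_out, k=1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

# x = torch.randn(1, 256, 20, 20); SPPF(256, 256)(x).shape -> torch.Size([1, 256, 20, 20])
```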
2.2. Overview of the Improved Algorithm
Given the dual requirements of real-time performance and lightweight architecture for water surface object detection tasks, this paper selects YOLOv8n as the base architecture for improvement. This model has a small parameter count and fast inference speed, allowing it to maintain detection accuracy while effectively reducing computational overhead; it is suitable for resource-constrained practical applications and provides a solid performance foundation for the subsequent structural optimizations. Within this framework, we propose an improved YOLOv8n model, whose structure is shown in Figure 3. In the backbone network, the C2f module in the first layer is replaced with C2f_RFAConv to enhance feature modeling of reflective regions and lay the foundation for small object detection. The detection head is upgraded to a four-branch structure: an extra small object detection layer P2 is added and combined with the Four-Detect-ASFF multi-scale feature fusion module to achieve more precise small object detection. In the neck network, the C2f module is replaced with the VoV-GSCSP module and GSConv is used instead of traditional convolutions, improving feature fusion efficiency while controlling model size and avoiding a significant increase in parameter count and computational complexity from the additional modules.
2.3. RFAConv Module
Zhang et al. [18] proposed RFAConv (Receptive-Field Attention Convolution), a novel spatial attention mechanism whose core idea is to deeply integrate spatial attention with the convolution operation, thereby enhancing the ability of convolutional neural networks (CNNs) to model spatial features. The mechanism uses attention maps guided by the receptive field size to adaptively weight the importance of different spatial positions. When processing high-brightness regions, this allows the network to suppress the response shifts caused by mirror reflections in overexposed areas and to strengthen edge semantics, thereby reducing missed and false detections.
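The following simplified PyTorch sketch illustrates the receptive-field attention idea described above: every pixel's k × k neighbourhood is gathered, weighted by a softmax-normalised attention map, and aggregated before a 1 × 1 projection. It is an approximation for illustration only, not the authors' exact RFAConv implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RFAttnConv(nn.Module):
    """Weights each position inside a k x k receptive field with a learned
    attention map before aggregating it (simplified receptive-field attention)."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.k = k
        # attention logits: one weight per receptive-field position, per pixel
        self.attn = nn.Sequential(
            nn.AvgPool2d(kernel_size=k, stride=1, padding=k // 2),
            nn.Conv2d(c_in, k * k, kernel_size=1),
        )
        self.proj = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU(),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        k = self.k
        # gather the k*k receptive-field neighbours of every pixel: (B, C, k*k, H, W)
        patches = F.unfold(x, kernel_size=k, padding=k // 2).view(b, c, k * k, h, w)
        # softmax-normalised importance of each neighbour position: (B, 1, k*k, H, W)
        weights = self.attn(x).softmax(dim=1).unsqueeze(1)
        # attention-weighted aggregation over the receptive field, then 1x1 projection
        agg = (patches * weights).sum(dim=2)
        return self.proj(agg)

# x = torch.randn(1, 64, 80, 80); RFAttnConv(64, 64)(x).shape -> torch.Size([1, 64, 80, 80])
```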
In water surface waste detection scenarios, small objects account for a large proportion, and under conditions of strong reflections and complex backgrounds, their fine-grained features are often weakened during deep feature propagation, leading to missed detections or inaccurate localization. To enhance the characterization of small objects at the early stage of feature extraction and to mitigate the impact of reflection interference, the original C2f module in the first layer of the backbone network is replaced with a C2f_RFAConv module. While retaining the efficient feature splitting and fusion advantages of C2f, this module incorporates the spatial feature enhancement mechanism of RFAConv, enabling shallow features to integrate both local details and global contextual information. As a result, it improves small object detection accuracy and robustness against reflection interference prior to multi-scale feature fusion.
The block diagram of the C2f_RFAConv module is shown in Figure 4. Its operation can be divided into three stages: channel compression and division, multi-layer residual enhancement, and feature fusion and output. The structural process can be expressed as follows:
Given an input feature map $X \in \mathbb{R}^{B \times C_{in} \times H \times W}$, it is first passed through a $1 \times 1$ convolution to reduce the channel dimension to $C_{mid}$, i.e.:

$$F = \mathrm{Conv}_{1 \times 1}(X), \quad F \in \mathbb{R}^{B \times C_{mid} \times H \times W} \tag{1}$$

In the above equation, $F$ represents the feature map after channel compression, $\mathrm{Conv}_{1 \times 1}$ represents the convolution operation, $\mathbb{R}$ is the set of real numbers, $B$ represents the batch size, $C_{mid}$ represents the number of intermediate channels, and $H$ and $W$ represent the spatial dimensions (height and width) of the feature map.

Subsequently, $F$ is split along the channel dimension into two parts, namely the first branch feature $F_1$ and the second branch feature $F_2$, each with $C_{mid}/2$ channels, which are used for the main-path connection and the subsequent residual computation:

$$F_1, F_2 = \mathrm{Split}(F), \quad F_1, F_2 \in \mathbb{R}^{B \times (C_{mid}/2) \times H \times W} \tag{2}$$

Taking $F_2$ as the initial input, $n$ successive Bottleneck_RFAConv operations are performed, with each iteration defined as follows:

$$Y_i = \mathcal{B}_i\left(Y_{i-1}\right), \quad Y_0 = F_2, \quad i = 1, 2, \dots, n \tag{3}$$

where $\mathcal{B}_i$ denotes the $i$-th Bottleneck_RFAConv module, which consists of a $3 \times 3$ convolution followed by an RFAConv module, with residual connections introduced when the specified conditions are satisfied.

The initial branch and the residual-enhanced features are concatenated to form the fused feature $F_{cat}$, i.e.:

$$F_{cat} = \mathrm{Concat}\left(F_1, Y_1, Y_2, \dots, Y_n\right) \tag{4}$$

Finally, a $1 \times 1$ convolution is applied to restore the channel dimension to $C_{out}$:

$$Z = \mathrm{Conv}_{1 \times 1}\left(F_{cat}\right), \quad Z \in \mathbb{R}^{B \times C_{out} \times H \times W} \tag{5}$$

In these equations, $n$ denotes the number of residual enhancement layers, $Z$ represents the final output feature of the module, and $C_{out}$ indicates the number of output channels.
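To make the above flow concrete, the sketch below assembles the three stages into a module skeleton in PyTorch, following Equations (1)–(5). The inner RFAConv is stubbed with a plain Conv-BN-SiLU layer (the receptive-field attention sketch above can be substituted for it), and the channel widths are illustrative; this is a structural sketch rather than the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1, s=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class BottleneckRFA(nn.Module):
    """3x3 convolution followed by an RFAConv stage, with a residual connection."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv = conv_bn_act(c, c, k=3)
        self.rfa = conv_bn_act(c, c, k=3)   # stand-in for RFAConv (see sketch above)
        self.add = shortcut

    def forward(self, x):
        y = self.rfa(self.cv(x))
        return x + y if self.add else y

class C2fRFAConv(nn.Module):
    """Channel compression -> split -> n residual-enhancement blocks -> concat -> 1x1 projection."""
    def __init__(self, c_in, c_out, n=2, e=0.5):
        super().__init__()
        self.c = int(c_out * e)                                # per-branch width, i.e. C_mid / 2
        self.cv1 = conv_bn_act(c_in, 2 * self.c, k=1)          # F = Conv_1x1(X)
        self.blocks = nn.ModuleList(BottleneckRFA(self.c) for _ in range(n))
        self.cv2 = conv_bn_act((1 + n) * self.c, c_out, k=1)   # Z = Conv_1x1(F_cat)

    def forward(self, x):
        f1, f2 = self.cv1(x).chunk(2, dim=1)                   # split F into F1, F2
        ys, y = [f1], f2
        for block in self.blocks:                              # Y_i = B_i(Y_{i-1}), Y_0 = F2
            y = block(y)
            ys.append(y)
        return self.cv2(torch.cat(ys, dim=1))                  # F_cat = Concat(F1, Y_1..Y_n)
```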
2.4. Four-Detect-ASFF Module
In multi-scale object detection, shallow features contain rich spatial details that facilitate precise localization of small objects, while deep features possess stronger semantic representation and are better suited to recognizing large objects. Traditional feature fusion methods (e.g., FPN [19], PAN [20]) achieve semantic enhancement and information transfer to a certain extent; however, their fusion schemes are fixed and lack dynamic modeling capability, making it difficult to adapt to different object scales.
To address these issues, this study integrates an adaptive feature fusion module based on four detection branches. This module extends the core concept of Adaptively Spatial Feature Fusion (ASFF) proposed by Liu et al. [21] by introducing an additional P2 branch specifically designed for small object detection. The P2 branch directly exploits high-resolution, low-level features from the backbone network to capture finer-grained spatial information, thereby enhancing the perception of small and complex-shaped objects.
The working principle of Four-Detect-ASFF is shown in Figure 5. This module establishes a fully connected cross-scale information interaction pathway between multi-scale feature maps (Level 1–Level 4). First, feature maps from different levels are aligned and re-calibrated through an adaptive spatial weighting mechanism to ensure spatial consistency during cross-scale information fusion. Subsequently, a hierarchical enhancement mechanism is introduced to reinforce the complementary relationship between the detailed information contained in shallow features and the semantic information in deep features. After fusion, each level of features enters an independent prediction branch to achieve fine-grained detection of objects at different scales.
In each Four-Detect-ASFF module, shallow high-resolution features are processed by downsampling, while deep low-resolution features are aligned to the target scale by upsampling. Subsequently, each feature map is processed through a dedicated convolution layer for channel alignment and fed into a lightweight attention branch to obtain the fusion weight for each path. These weights are normalized using the Softmax function and applied in a weighted summation to achieve dynamic feature fusion, effectively alleviating the issues of information redundancy or loss commonly seen in traditional fusion methods. The fused features at each scale are then passed to independent prediction branches to achieve refined multi-scale object detection.
For the $l$-th output layer, its fused feature $y^{l}$ (where $l \in \{1, 2, 3, 4\}$) is calculated as follows:

$$y^{l} = \mathrm{Conv}\!\left(\sum_{k} \alpha_{k}^{l} \cdot \mathrm{Resize}_{k \to l}\!\left(x^{k}\right)\right) \tag{6}$$

where $x^{k}$ denotes the input feature map of the $k$-th layer, with $k$ corresponding to the four input scales (P2, P3, P4, and P5) with strides of 4, 8, 16, and 32, respectively; $\mathrm{Resize}_{k \to l}(\cdot)$ denotes transforming the spatial resolution of the feature map at layer $k$ to the same size as layer $l$; $\alpha_{k}^{l}$ is the weight parameter, corresponding to the attention weight of input $k$ in the fusion at layer $l$, satisfying the normalization constraint $\sum_{k} \alpha_{k}^{l} = 1$, $\alpha_{k}^{l} \ge 0$; and $\mathrm{Conv}(\cdot)$ denotes the standard convolution layer applied after fusion, used for further feature extraction and channel integration.
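As an illustration of the fusion rule in Equation (6), the sketch below implements adaptive weighting for a single output level in PyTorch: each source is resized to the target resolution, per-pixel weights are produced by light 1 × 1 convolutions and softmax-normalised across the four sources, and the weighted sum is refined by a standard convolution. The channel count, layer names, and the use of nearest-neighbour interpolation for resizing are simplifications for readability; the actual module typically downsamples shallow features with strided convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFLevel(nn.Module):
    def __init__(self, channels=128, num_inputs=4):
        super().__init__()
        # one weight-generating branch per input scale
        self.weight_convs = nn.ModuleList(
            nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_inputs)
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, feats, target_idx):
        """feats: list of four maps (B, C, H_k, W_k) for P2..P5; target_idx selects level l."""
        h, w = feats[target_idx].shape[-2:]
        # Resize_{k->l}: align every source to the spatial size of level l
        aligned = [F.interpolate(f, size=(h, w), mode="nearest") for f in feats]
        # alpha_k^l: per-pixel logits -> softmax over the source dimension
        logits = torch.cat([wc(f) for wc, f in zip(self.weight_convs, aligned)], dim=1)
        alpha = logits.softmax(dim=1)                        # (B, 4, H_l, W_l), sums to 1
        fused = sum(alpha[:, k:k + 1] * aligned[k] for k in range(len(aligned)))
        return self.fuse(fused)

# Example: four inputs with strides 4, 8, 16, 32 and 128 channels each.
# feats = [torch.randn(1, 128, s, s) for s in (160, 80, 40, 20)]
# out_p3 = ASFFLevel()(feats, target_idx=1)   # fused features for the P3 head
```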
2.5. Slim-Neck Module
The Slim-neck module based on GSConv was proposed by Li et al. [22]. Its main innovation is to combine the lightweight convolution idea of GSConv with Ghost feature generation and Channel Shuffle technology, which reduces redundant computation and parameter count while retaining key feature representation capability. It also integrates the multi-branch feature aggregation of VoV-GSCSP with the cross-stage fusion mechanism of CSP, achieving efficient multi-scale feature interaction and information complementation. This design aims to reduce the computational overhead of the neck network and improve the model's inference speed.
The structure of GSConv is shown in Figure 6. It alleviates the loss of semantic information caused by spatial compression of the feature map (i.e., reduction in width and height) and channel expansion by concatenating, in parallel, standard convolution downsampling and depth-wise separable convolution (DWConv). The standard convolution is mainly responsible for extracting local spatial features, while DWConv enhances inter-channel expressiveness while keeping the parameter count low. Their synergy not only improves the integrity and richness of the features but also optimizes the feature propagation path, helping retain more key semantic information in a lightweight structure and providing stronger support for subsequent feature fusion and object detection.
As shown in Figure 7, VoV-GSCSP integrates the GS bottleneck module with channel splitting and reorganization strategies to enable efficient information flow and feature interaction, thereby reducing inter-layer dependency and redundant transmission. This design improves inference speed and computational efficiency while maintaining detection accuracy, making it particularly suitable for deployment on unmanned surface vehicles (USVs) with limited space and computing resources. Specifically, Figure 7a illustrates the overall framework of VoV-GSCSP: the input features are first processed by a convolutional layer, then extracted and compressed through the GSConv-based bottleneck structure (GS bottleneck), followed by feature concatenation and convolutional output. Figure 7b presents the details of the GS bottleneck, where a two-stage GSConv is applied to process features in parallel, with subsequent channel alignment and fusion. Combined with the Cross Stage Partial (CSP) structure, this design reduces computational cost and enhances cross-channel feature interaction, thereby achieving model lightweighting.
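The following compact PyTorch sketches follow the Slim-neck design described above in spirit, covering GSConv, the GS bottleneck, and VoV-GSCSP, while simplifying details such as activation choices and expansion ratios; they are illustrative rather than the reference implementation.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1, s=1, groups=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, padding=k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class GSConv(nn.Module):
    """Half of the output channels come from a standard conv, half from a depth-wise
    conv applied to that result; a channel shuffle then mixes the two groups."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_half = c_out // 2
        self.dense = conv_bn_act(c_in, c_half, k, s)
        self.cheap = conv_bn_act(c_half, c_half, 5, 1, groups=c_half)  # depth-wise

    def forward(self, x):
        y = self.dense(x)
        y = torch.cat([y, self.cheap(y)], dim=1)
        # channel shuffle: interleave the dense and cheap halves
        b, c, h, w = y.shape
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

class GSBottleneck(nn.Module):
    """Two stacked GSConvs with a shortcut, as in Figure 7b."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(GSConv(c, c, 3, 1), GSConv(c, c, 3, 1))

    def forward(self, x):
        return x + self.body(x)

class VoVGSCSP(nn.Module):
    """CSP-style split: one branch passes through GS bottlenecks, the other through a
    1x1 conv; both are concatenated and projected, as in Figure 7a."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hid = c_out // 2
        self.cv1 = conv_bn_act(c_in, c_hid, 1)
        self.cv2 = conv_bn_act(c_in, c_hid, 1)
        self.gsb = nn.Sequential(*(GSBottleneck(c_hid) for _ in range(n)))
        self.cv3 = conv_bn_act(2 * c_hid, c_out, 1)

    def forward(self, x):
        return self.cv3(torch.cat([self.gsb(self.cv1(x)), self.cv2(x)], dim=1))
```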
2.6. Loss Function Improvement
The regression loss function is a key factor in the training and optimization process of object detection. The YOLOv8 model uses the Complete Intersection over Union (CIoU) as the default boundary regression loss function, which comprehensively considers three factors: overlapping area, center point distance, and width-to-height ratio difference, to enhance regression performance.
The CIoU loss is expressed as follows:

$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \alpha v \tag{7}$$

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \qquad \alpha = \frac{v}{(1 - IoU) + v}$$

In Equation (7), $\rho\left(b, b^{gt}\right)$ is the Euclidean distance between the center point $b$ of the predicted box and the center point $b^{gt}$ of the ground truth box, $c$ represents the diagonal length of the minimum bounding box containing the predicted box and the ground truth box, $\alpha$ is a balancing weight parameter that dynamically adjusts the influence of the width and height terms on the total loss based on the Intersection over Union (IoU) value, and $v$ is the aspect ratio penalty term, in which $w$, $h$ and $w^{gt}$, $h^{gt}$ denote the widths and heights of the predicted and ground truth boxes, respectively.
However, since the center distance term of CIoU is a smooth function, its gradient tends to saturate when the predicted box is far from the target and cannot provide strong regression guidance. In the task of detecting floating garbage on water surfaces, the model's initial predictions often exhibit large positional deviations due to reflections, water ripples, or interference from floating objects. In such cases, CIoU struggles to drive the predicted box to converge rapidly toward the target area, impacting convergence speed and detection accuracy.
To alleviate these issues, this paper adopts the Minimum Point Distance Intersection over Union (MPDIoU) proposed in [23] as the bounding box regression loss. This method measures the Euclidean distances between the corresponding boundary corner points of the predicted and ground truth boxes, providing an effective gradient direction even when the predicted box does not overlap with the target, thereby strengthening the constraint on boundary position. Compared with traditional IoU-based losses, MPDIoU provides stable and clear optimization signals while the bounding box is still far from the target, making it particularly suitable for addressing target boundary blurring caused by reflective interference and inaccurate initial localization of small targets. This helps the model focus on the true target region earlier in training, improving regression accuracy and convergence speed.
Figure 8 shows the geometric diagram of MPDIoU, where the red box represents the predicted box A, the yellow box represents the ground truth box B, and the blue dashed lines indicate the minimum Euclidean distances between the boundary points of the two boxes. Let $\left(x_{1}^{A}, y_{1}^{A}\right)$ and $\left(x_{2}^{A}, y_{2}^{A}\right)$ denote the coordinates of the top-left and bottom-right corners of box A, respectively, and $\left(x_{1}^{B}, y_{1}^{B}\right)$ and $\left(x_{2}^{B}, y_{2}^{B}\right)$ denote those of box B. The terms $d_{1}^{2}$ and $d_{2}^{2}$ denote the squared Euclidean distances between the top-left and the bottom-right corners of boxes A and B, respectively, used to measure the offset between the two boxes at their boundary positions; $w$ and $h$ denote the width and height of the input image:

$$d_{1}^{2} = \left(x_{1}^{B} - x_{1}^{A}\right)^{2} + \left(y_{1}^{B} - y_{1}^{A}\right)^{2}, \qquad d_{2}^{2} = \left(x_{2}^{B} - x_{2}^{A}\right)^{2} + \left(y_{2}^{B} - y_{2}^{A}\right)^{2} \tag{8}$$

The MPDIoU loss function, represented as $L_{MPDIoU}$, is formulated as follows:

$$MPDIoU = IoU - \frac{d_{1}^{2}}{w^{2} + h^{2}} - \frac{d_{2}^{2}}{w^{2} + h^{2}} \tag{9}$$

$$L_{MPDIoU} = 1 - MPDIoU \tag{10}$$

In the above equations, $IoU$ represents the intersection over union between the ground truth box and the predicted box, and $MPDIoU$ denotes the IoU based on the minimum point distance.
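The MPDIoU loss defined in Equations (8)–(10) can be implemented directly from the corner coordinates, as in the sketch below; the box format, tensor shapes, and example values are illustrative.

```python
import torch

def mpdiou_loss(pred, target, w_img, h_img, eps=1e-7):
    """pred, target: (N, 4) tensors of corner coordinates (x1, y1, x2, y2)."""
    # standard IoU
    inter_x1 = torch.max(pred[:, 0], target[:, 0])
    inter_y1 = torch.max(pred[:, 1], target[:, 1])
    inter_x2 = torch.min(pred[:, 2], target[:, 2])
    inter_y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (inter_x2 - inter_x1).clamp(min=0) * (inter_y2 - inter_y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared distances between matching corners, normalised by the image diagonal
    d1 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    d2 = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2
    diag = w_img ** 2 + h_img ** 2

    mpdiou = iou - d1 / diag - d2 / diag
    return (1.0 - mpdiou).mean()

# Example: a shifted prediction still yields a finite, informative loss even with
# zero overlap, so the gradient keeps pointing the box toward the target.
# pred = torch.tensor([[50., 50., 80., 80.]]); gt = torch.tensor([[100., 100., 130., 130.]])
# print(mpdiou_loss(pred, gt, w_img=640, h_img=640))
```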
4. Conclusions
Due to the high complexity of the water surface environment, object detection tasks face multiple challenges. First, strong lighting and mirror-like reflections can cause bright areas on the water surface, which can interfere with the model’s ability to identify object edges and textures. Therefore, in the improved model proposed in this article, the C2f module in the first layer of the backbone network is replaced with the C2f_RFAConv module, enhancing the model’s ability to represent spatial structure and focus on key regions. Second, considering that floating garbage on water surfaces is typically small in size, has limited pixel coverage, and is prone to being missed at long distances or under low-resolution conditions, the detection head is replaced with the Four-Detect-ASFF module to enhance multi-scale feature fusion capabilities, thereby improving small object detection performance. Third, the Slim-neck module is adopted in the neck structure to reduce the computational complexity of the model and ensure its lightweight nature. Finally, to address the issue that continuously changing water waves can cause target shape distortion or local occlusion, thereby weakening the model’s ability to perceive target boundaries, this study replaces the original CIoU with MPDIoU to improve the accuracy and convergence stability of bounding box regression. Experimental results show that the model achieves 94.5% and 58.6% on mAP@0.5 and mAP@0.5:0.95, respectively, representing improvements of 2.27% and 5.21% over the original YOLOv8.
5. Future Work
However, this study also has certain limitations: first, the robustness of the algorithm under different water environments, lighting conditions, and weather scenarios still requires further validation; second, although the lightweight design of the model improves inference speed, it may still face computational challenges on extremely resource-constrained embedded platforms; finally, the scale and diversity of the dataset remain insufficient, which limits the model’s applicability in more complex and diverse water surface environments.
In future research, we will further expand the categories and labeling system of the dataset to achieve more refined target annotation, thereby enhancing the model’s adaptability and generalization capabilities across diverse types of surface debris. Additionally, we will focus on further optimizing the model’s lightweight design to reduce computational and storage overhead, enabling more efficient real-time deployment on resource-constrained embedded devices and unmanned surface vehicle platforms. Concurrently, we plan to explore integrating temporal information with video object detection methods to enhance detection stability in dynamic water waves and continuous scenes, thereby providing more practical solutions for intelligent water environment monitoring and cleanup.