1. Introduction
Enteromorpha is a large floating green macroalga whose strong reproductive capacity can increase its biomass exponentially within a single reproductive cycle. Its excessive proliferation can form green tides, which inhibit the growth of benthic algae. Large-scale proliferation of Enteromorpha can also cause water eutrophication, leading to oxygen depletion and the death of marine organisms from hypoxia and deteriorating water quality. Severe green tide coverage can further have a profound impact on tourism, fisheries, and marine safety in coastal cities [1,2]. Therefore, rapidly and accurately monitoring the distribution of Enteromorpha and understanding its drift patterns is of significant importance for implementing effective control measures [3].
Yu et al. [
4] proposed a Fully Automated Green Tide Extraction Method (FAGTE) that uses multi-source satellite remote sensing data to achieve high-precision monitoring of green tides in the Yellow Sea, together with a method for merging detection results at different resolutions. They also applied the Gompertz and Logistic models to forecast green tide growth patterns, providing a basis for effective prevention and control strategies. Xu et al. [
5] put forward a semi-automated NDVI-based approach for extracting green tides. Remote sensing images from multiple satellites were used to extract green tide data from the Yellow Sea between 2008 and 2012, confirming the method's generality. Dong et al. [
6] proposed a boundary-assisted dual-path convolutional neural network (BADP-CNN) to address the issue of accurate boundary detection of Enteromorpha in high-spatial-resolution remote sensing images (HSRIs).
Currently, monitoring of Enteromorpha mainly relies on remote sensing data, which presents significant limitations in terms of temporal and spatial resolution, as well as data timeliness [
7]. This makes it difficult to detect small-scale or dispersed Enteromorpha colonies and to capture the details of Enteromorpha drift and growth. As a result, research on the living environment and drift patterns of Enteromorpha still faces considerable challenges.
In contrast, vision-based object detection techniques, particularly YOLO, offer significant advantages in these scenarios. YOLO’s deep learning architecture excels in detecting small-scale objects by analyzing high-resolution imagery and leveraging real-time processing capabilities [
8]. This allows YOLO to detect subtle or dispersed Enteromorpha colonies with higher accuracy, especially in areas where traditional methods fail to provide reliable results.
Meanwhile, with the development of deep learning technologies, significant advancements have been made in vision-based object detection methods. YOLO [
9] series object detection models, known for their outstanding real-time processing capabilities and detection accuracy, have been successfully deployed in numerous application scenarios, particularly in the field of marine object detection [
10].
Fu et al. [
11] proposed an improved YOLOv4 model for marine vessel detection. They incorporated the CBAM attention mechanism into the original model to enhance its ability to detect small objects, resulting in a 2.02% improvement in mAP@50. Bi et al. [
12] proposed an improved YOLOv7-based jellyfish detector, HD-YOLO. They established a jellyfish dataset and validated the effectiveness of HD-YOLO and related methods through comparative experiments. This approach provides a more accurate and faster detection method for jellyfish, along with a more versatile dataset. Wang et al. [
13] proposed the YOLO11-YX algorithm for marine litter detection. Based on YOLO11s, they introduced the SDown downsampling module, the C3SE feature extraction module, and the FAN feature fusion module. Experimental results demonstrated a 2.44% improvement in detection accuracy for marine litter, providing a further solution for marine litter detection. Jia et al. [
14] applied an improved YOLOv8 model to marine organism detection. They incorporated the InceptionNeXt module into the backbone network of YOLOv8 to enhance feature extraction. The SEAM attention module was added to the Neck network to improve the detection of overlapping objects. Additionally, the NWD loss was combined with CIoU, improving the recognition of small objects. Compared to the original model, the mAP increased by approximately 6.2%. Tian et al. [
15] replaced the C2f module in the YOLOv8n network with deformable convolutions and integrated the SimAM attention mechanism before the detection head. They also replaced the traditional CIoU loss function with WIoU. The improved algorithm significantly enhanced the detection accuracy of marine flexible organisms. Jiang et al. [
16] proposed a lightweight ship detection model, YOLOv7-Ship, which integrates the coordinate attention mechanism (CAM) and omni-dimensional dynamic convolution (ODConv). This model addresses the trade-off between detection accuracy and real-time performance in complex marine backgrounds. Wu et al. [
17] proposed an improved YOLOv7 model that incorporates the ECA attention mechanism to reduce the model’s focus on redundant information, providing a more accurate advantage for underwater object detection.
Therefore, we have developed an intelligent tracking unmanned vessel monitoring system for Enteromorpha, which integrates meteorological, biological, chemical, and other environmental elements. The object detection model is deployed on the unmanned vessel monitoring system, enabling real-time tracking of Enteromorpha drift paths and the collection of water quality parameters within its growth range. This is of significant importance for improving water quality and controlling Enteromorpha disasters.
This study focuses on Enteromorpha detection algorithms. We select YOLOv8n as the benchmark model and improve it to raise detection accuracy while reducing the false negative rate for Enteromorpha. These improvements provide technical support for the recognition system of the unmanned vessel monitoring platform. The contributions of this study are as follows:
- 1.
The advanced ConvNeXtv2 feature extraction module is introduced into the C2f module, improving the model's ability to extract edge features of Enteromorpha.
- 2.
To enhance the model’s focus on the Enteromorpha region and boost detection accuracy, the Neck network incorporates the ECA attention mechanism.
- 3.
The utilization of the EIoU loss function further optimizes the model’s bounding box localization ability, enhancing the recognition performance of Enteromorpha.
- 4.
An all-weather Enteromorpha image dataset was constructed, and the application value of the proposed method was verified through comparative experiments.
The remainder of this paper is structured as follows.
Section 2 introduces the construction of the dataset and provides a detailed description of the framework for improving the CEE-YOLOv8 model.
Section 3 describes the experimental setup, configuration of training parameters, and performance metrics used to evaluate the model.
Section 4 compares the CEE-YOLOv8 model with other models using various performance metrics.
Section 5 presents a visual comparison between CEE-YOLOv8 and the benchmark model YOLOv8n, obtaining the expected results and validating the practical utility of the proposed method.
Section 6 concludes this paper.
2. Data Collection and Proposed Methods
2.1. Data Collection
The dataset used in this study is an all-weather image dataset of Enteromorpha that was self-compiled. The images of Enteromorpha were collected through an independently developed online monitoring platform for Enteromorpha, with the collection site selected in Qingdao, as shown in
Figure 1.
To facilitate subsequent model training, the collected photographs were cropped to retain the Enteromorpha itself along with some background information. We used LabelImg 1.8.6 to annotate the images, generating YOLO-format .txt label files suitable for model training, as shown in
Figure 2. Finally, these images were randomly divided into training, validation, and test sets in a ratio of 7:2:1.
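As a concrete illustration of this split, the minimal Python sketch below partitions images and their matching .txt label files at a 7:2:1 ratio; the directory names, file extension, and random seed are illustrative assumptions, not details from the actual pipeline.

```python
import random
import shutil
from pathlib import Path

# Illustrative layout: images and LabelImg YOLO-format .txt labels side by side.
SRC = Path("enteromorpha")          # hypothetical source directory
DST = Path("enteromorpha_splits")   # hypothetical output directory

images = sorted(SRC.glob("*.jpg"))
random.seed(42)                     # fixed seed so the split is reproducible
random.shuffle(images)

n = len(images)
splits = {"train": images[: int(0.7 * n)],
          "val":   images[int(0.7 * n): int(0.9 * n)],
          "test":  images[int(0.9 * n):]}          # 7:2:1 ratio

for name, files in splits.items():
    for img in files:
        lbl = img.with_suffix(".txt")               # matching annotation file
        out = DST / name
        out.mkdir(parents=True, exist_ok=True)
        shutil.copy(img, out / img.name)
        shutil.copy(lbl, out / lbl.name)
```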
2.2. Benchmark: YOLOv8n
In this research, we selected YOLOv8n as the baseline model for improvement because its compact network architecture and good detection precision make it suitable for deployment on the Enteromorpha monitoring platform.
YOLOv8n consists of four components: Input, Backbone, Neck, and Head [
18], as demonstrated in
Figure 3.
The input layer integrates methods such as Mosaic data augmentation and adaptive anchor computation, which improve the detection of small objects and the model's robustness to occlusion. The Backbone is composed of a sequence of Conv, C2f, and SPPF modules. The redesigned C2f module incorporates residual connections and branching structures to establish a richer gradient flow, significantly enhancing the model's feature extraction capability [
19]. The SPPF module effectively merges local and global features at different scales through parallel maximum pooling operations, enabling multi-scale capture of contextual information for the target [
20]. The neck utilizes an FPN (Feature Pyramid Network) + PAN (Path Aggregation Network) structure, which achieves efficient fusion of features at different levels through both top-down and bottom-up pyramid structures [
21].
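To make the C2f design concrete, the following simplified PyTorch sketch reproduces its branch splitting, parallel bottleneck processing, and concatenation fusion; it abstracts away details of the Ultralytics implementation (e.g., fused Conv-BN-SiLU blocks and exact channel scaling) and is illustrative only.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Simplified residual bottleneck: two 3x3 convs plus a skip connection."""
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.SiLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.SiLU(),
        )

    def forward(self, x):
        return x + self.block(x)

class C2f(nn.Module):
    """Split the features, chain bottlenecks on one branch, and concatenate
    every intermediate output before a final 1x1 fusion (simplified)."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * self.c, 1)
        self.blocks = nn.ModuleList(Bottleneck(self.c) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * self.c, c_out, 1)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))     # branch splitting
        for m in self.blocks:
            y.append(m(y[-1]))                    # richer gradient flow paths
        return self.cv2(torch.cat(y, dim=1))      # concatenation fusion

x = torch.randn(1, 64, 80, 80)
print(C2f(64, 128)(x).shape)                      # torch.Size([1, 128, 80, 80])
```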
In the output stage, YOLOv8n completely abandons the conventional anchor-based mechanism and embraces a fully anchor-free detection method [
22]. Additionally, it employs a decoupled head design, separating the classification and regression tasks into independent parallel branches. In short, the overall architecture further improves detection accuracy and generalization capability over earlier YOLO versions.
2.3. C2f-ConvNeXtv2
ConvNeXtv2 is a pure convolutional vision model that integrates the Transformer design concept and the advantages of convolutional networks. It significantly improves the feature extraction ability while maintaining efficient inference [
23]. The network structure is shown in
Figure 4.
ConvNeXtv2 first performs convolution operations on the feature map through depth-wise convolution. The network uses a large 7 × 7 convolution kernel to expand the model's receptive field, enabling it to extract the edge features of Enteromorpha more comprehensively [24].
Layer Normalization is employed to eliminate negative effects, such as gradient explosion, caused by increased network depth. Furthermore, 1 × 1 point-wise convolutions are used to enhance the features of Enteromorpha edges, allowing better utilization of channel-wise information and thereby improving the model's expressive capability.
In ConvNeXtv2, the GELU activation function is selected to replace the ReLU function, as shown in the following equations.

$$\mathrm{ReLU}(x) = \max(0, x)$$

$$\mathrm{GELU}(x) = x \cdot \Phi(x)$$

Here, $\Phi(x)$ denotes the cumulative distribution function of the standard normal distribution, and its calculation formula is given as follows.

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^{2}/2}\, dt$$
The ReLU function maps all negative input values to zero, which leads to the complete loss of the information carried by those inputs. Additionally, ReLU is non-differentiable at $x = 0$, and its discontinuous derivative can destabilize gradients during training.
Compared with the ReLU function, even when the input is negative, GELU still outputs a small nonzero (negative) value. Negative input information is therefore not discarded entirely but retained in a weakened form and participates in subsequent calculations. Meanwhile, the GELU activation function is smoother and more continuous, providing better non-linear characteristics, enhancing gradient flow, and improving the speed and accuracy of neural network learning [25].
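As a quick numeric illustration of this behavior (a minimal PyTorch check, with output values rounded), ReLU zeroes every negative input while GELU keeps a small negative response:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -1.0, -0.5, 0.0, 1.0])
print(F.relu(x))   # tensor([0.0000, 0.0000, 0.0000, 0.0000, 1.0000])
print(F.gelu(x))   # approx. tensor([-0.0455, -0.1587, -0.1543, 0.0000, 0.8413])
```

The negative inputs are attenuated rather than removed, which is exactly the weak retention of negative information described above.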
The most important design in ConvNeXtv2 is the introduction of the GRN (Global Response Normalization) layer, which achieves inter-channel competition and global perception through purely mathematical constraints at negligible parameter cost (only two learnable affine parameters per layer), effectively addressing the feature redundancy of traditional convolutions [26].
The GRN layer first performs global feature aggregation on the feature map of each channel using the L2 norm. Specifically, given an input feature $X \in \mathbb{R}^{H \times W \times C}$, the spatial feature map $X_i \in \mathbb{R}^{H \times W}$ of each channel is aggregated by a global function $\mathcal{G}(\cdot)$. Using the L2 norm for feature aggregation effectively suppresses noise and enhances the model's generalization ability, resulting in a vector of aggregated values, as shown in the following formula.

$$\mathcal{G}(X) = gx = \left\{ \lVert X_{1} \rVert_{2}, \lVert X_{2} \rVert_{2}, \ldots, \lVert X_{C} \rVert_{2} \right\} \in \mathbb{R}^{C}$$

Then, a standard divisive normalization function $\mathcal{N}(\cdot)$ is applied to the aggregated values to obtain a set of normalized scores. The normalization operation is illustrated in the following formula.

$$\mathcal{N}\left(\lVert X_{i} \rVert_{2}\right) = \frac{\lVert X_{i} \rVert_{2}}{\sum_{j=1}^{C} \lVert X_{j} \rVert_{2}}$$

Finally, the original input responses are calibrated using the computed normalization scores, with learnable parameters $\gamma$ and $\beta$ and a residual connection stabilizing optimization, as illustrated by the following formula.

$$X_{i}' = \gamma \cdot X_{i} \cdot \mathcal{N}\left(\lVert X_{i} \rVert_{2}\right) + \beta + X_{i}$$
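A PyTorch sketch of the GRN layer, following the formulas above, is given below; note that the reference ConvNeXtv2 implementation divides by the per-channel mean of the aggregated norms rather than their sum, which differs only by the constant factor $C$.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization (channels-last layout, as used
    inside the ConvNeXtv2 block)."""
    def __init__(self, dim):
        super().__init__()
        # gamma and beta start at zero, so GRN initially acts as an
        # identity mapping through the residual term
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):                                    # x: (N, H, W, C)
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)    # L2 aggregation per channel
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)     # divisive normalization
        return self.gamma * (x * nx) + self.beta + x         # calibration + residual
```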
In this study, to enhance feature extraction capability, we propose a modified C2f module (named C2f-ConvNeXtv2) by replacing each Bottleneck unit within the original C2f structure with the ConvNeXtv2 module, while retaining the C2f’s core design of “branch splitting, parallel feature processing, and concatenation fusion”. The architecture of the C2f-ConvNeXtv2 module is illustrated in
Figure 5, where the original Bottleneck is entirely substituted by the ConvNeXtv2 module.
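As a minimal sketch of this substitution (reusing the GRN layer and the simplified C2f class from the earlier listings, with channel widths simplified relative to the actual model), the ConvNeXtv2 block and the resulting C2f-ConvNeXtv2 module can be written as follows.

```python
import torch
import torch.nn as nn

# Assumes the GRN and C2f classes from the earlier listings are in scope.

class ConvNeXtv2Block(nn.Module):
    """ConvNeXtv2 block: 7x7 depth-wise conv -> LayerNorm -> 1x1 expansion
    -> GELU -> GRN -> 1x1 projection, with a residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)    # 1x1 conv as Linear (channels-last)
        self.act = nn.GELU()
        self.grn = GRN(4 * dim)
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                         # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x).permute(0, 2, 3, 1)    # to channels-last for LN/Linear
        x = self.pwconv2(self.grn(self.act(self.pwconv1(self.norm(x)))))
        return shortcut + x.permute(0, 3, 1, 2)   # back to channels-first

class C2fConvNeXtv2(C2f):
    """C2f with every Bottleneck replaced by a ConvNeXtv2 block, keeping the
    branch splitting / parallel processing / concatenation fusion design."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__(c_in, c_out, n)
        self.blocks = nn.ModuleList(ConvNeXtv2Block(self.c) for _ in range(n))
```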
Subsequently, all C2f modules in the Backbone of YOLOv8n are completely replaced with the proposed C2f-ConvNeXtv2 modules. This full replacement ensures that the improved Backbone inherits the original C2f’s efficient gradient flow (via preserved branch splitting and residual mechanisms) while leveraging ConvNeXtv2’s advantages in capturing fine-grained textures [
27]. This enables the model to accurately distinguish subtle features between macroalgal communities and complex marine backgrounds, such as wave reflections and foam. The improved YOLOv8n model is illustrated in
Figure 6.
2.4. ECA Attention Mechanism in Neck
The ECA attention mechanism, proposed by Wang et al., is a lightweight channel attention module that captures inter-channel dependencies through local cross-channel interactions without the need for dimensionality reduction [
28]. This approach significantly enhances computational efficiency while maintaining accuracy. The structure diagram of the ECA module is shown in
Figure 7.
First, in the ECA attention module, the input feature map $X \in \mathbb{R}^{H \times W \times C}$ undergoes GAP (global average pooling) on each channel to obtain a set of channel descriptors $z \in \mathbb{R}^{C}$, as shown in the following formula.

$$z_{c} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{c}(i, j)$$

Here, $H$ represents the height of the input feature map, $W$ represents its width, $C$ represents the number of input channels, and $x_{c}(i, j)$ represents the value of the element at the $i$-th row and $j$-th column of the $c$-th channel of the input feature map.

The $C$ channel descriptors form a $C$-dimensional column vector $z$. Subsequently, a one-dimensional convolution with adaptive kernel size $k$ is applied to the compressed channel vector $z$, yielding a processed vector $y$; the formula is shown below.

$$y = \mathrm{C1D}_{k}(z)$$

The vector $y$ is mapped through the Sigmoid function to obtain a set of attention weights $\omega$, as shown in the formula below.

$$\omega = \sigma(y)$$

Finally, the attention weights $\omega$ are multiplied channel-wise with the input feature map $X$ to obtain the output feature map $\widetilde{X}$, as shown in the formula below.

$$\widetilde{X} = \omega \otimes X$$
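The following self-contained PyTorch sketch implements these steps; the adaptive rule for the kernel size $k$ follows the formulation of Wang et al., while the tensor shapes in the usage example are illustrative.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: GAP -> 1D conv across channels -> sigmoid.
    The kernel size k is chosen adaptively from the channel count C."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1                 # force an odd kernel size
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                          # x: (N, C, H, W)
        z = x.mean(dim=(2, 3))                     # global average pooling -> (N, C)
        y = self.conv(z.unsqueeze(1))              # local cross-channel interaction
        w = self.sigmoid(y).squeeze(1)             # attention weights -> (N, C)
        return x * w.view(x.size(0), -1, 1, 1)     # channel-wise recalibration

x = torch.randn(2, 256, 40, 40)
print(ECA(256)(x).shape)                           # torch.Size([2, 256, 40, 40])
```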
In this study, the ECA module is added after the C2f module in the Neck network of YOLOv8n, which effectively enhances the signal strength of important channels while suppressing secondary channels, allowing the model to focus more on regions relevant to the information of Enteromorpha [
29].
Notably, ECA helps address the multi-scale challenge of Enteromorpha detection by recalibrating channel attention weights to emphasize the feature channels that respond to crucial regions, especially small or scattered patches [30]. It also strengthens attention to edge details, which is critical for detecting subtle or fragmented colonies that baseline models may miss due to limited receptive fields or inadequate feature extraction.
Compared with SE or CBAM, ECA is better suited here because of its computational efficiency and effective channel-wise recalibration, which are key for multi-scale object detection in complex marine environments. The CEE-YOLOv8 network is illustrated in
Figure 8.
2.5. Loss Function
In this study, the original loss function CIoU [
31] of YOLOv8n was replaced with EIoU [
32], resulting in more accurate localization of the predicted bounding boxes. In object recognition tasks, the IoU (Intersection over Union) quantifies the degree of overlap between the ground truth and predicted boxes, with the calculation formula shown below.

$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}$$

Here, $A$ represents the area of the predicted box, and $B$ represents the area of the ground truth box. The formula for the IoU loss function is as follows.

$$L_{\mathrm{IoU}} = 1 - \mathrm{IoU}$$

CIoU, based on IoU, adds constraints on the center point distance and the aspect ratio. The formula is as follows.

$$L_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \alpha v$$

Here, $b$ represents the center point of the predicted box, while $b^{gt}$ represents the center point of the ground truth box; $\rho(\cdot)$ denotes the Euclidean distance; $c$ represents the diagonal length of the minimum enclosing rectangle that covers both the predicted and ground truth boxes; $\alpha$ denotes the weight coefficient; and $v$ measures the consistency of the aspect ratio. Their calculation formulas are as follows.

$$\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}, \qquad v = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2}$$

Compared to CIoU, EIoU directly optimizes the overlap, the center point distance, and the width and height differences of the bounding boxes. Because width and height are penalized separately rather than through a coupled aspect-ratio term, the gradients no longer stall when the predicted and ground truth boxes share an aspect ratio but differ in size; this also reduces the harmful gradient effects of low-quality anchor boxes during training and improves the accuracy of bounding box matching. The formula for the EIoU loss function is as follows.

$$L_{\mathrm{EIoU}} = 1 - \mathrm{IoU} + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \frac{\rho^{2}\left(w, w^{gt}\right)}{c_{w}^{2}} + \frac{\rho^{2}\left(h, h^{gt}\right)}{c_{h}^{2}}$$

Here, $w$ and $h$ represent the width and height of the predicted box, while $w^{gt}$ and $h^{gt}$ denote the width and height of the ground truth box. Additionally, $c_{w}$ and $c_{h}$ indicate the width and height of the minimum enclosing rectangle.
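For illustration, the self-contained PyTorch sketch below evaluates the EIoU loss for boxes given in (x1, y1, x2, y2) format; it is a simplified standalone version written from the formulas above, not the training code used in this study.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIoU loss for axis-aligned boxes in (x1, y1, x2, y2) format.
    Penalizes IoU, center distance, and width/height differences separately."""
    # intersection and union
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # width, height, and squared diagonal of the minimum enclosing rectangle
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # squared distance between box centers
    rho2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2
            + (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4

    # separate width and height penalty terms
    w_p, h_p = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w_t, h_t = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    return (1 - iou + rho2 / c2
            + (w_p - w_t) ** 2 / (cw ** 2 + eps)
            + (h_p - h_t) ** 2 / (ch ** 2 + eps))

pred = torch.tensor([[0.0, 0.0, 4.0, 4.0]])
gt = torch.tensor([[1.0, 1.0, 5.0, 3.0]])
print(eiou_loss(pred, gt))   # one loss value per box pair
```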
5. Discussion
5.1. Validation of Results
To better illustrate the performance differences between YOLOv8n and CEE-YOLOv8 visually, this study conducted inference on Enteromorpha detection using the trained models under the same experimental settings, with the outcomes depicted in
Figure 13 and
Figure 14. The results intuitively show that CEE-YOLOv8 achieves a markedly better balance between detection precision and recall.
Comparing the two sets of detection results, it can be observed that at night (3.jpg), the benchmark model YOLOv8n tends to confuse the sea surface background with Enteromorpha communities, misclassifying the ocean background as Enteromorpha. In contrast, CEE-YOLOv8 maintains a good degree of target discrimination in the night environment. Furthermore, when comparing 2.jpg and 8.jpg, it is evident that CEE-YOLOv8 demonstrates good recognition performance for Enteromorpha communities at a greater distance. However, YOLOv8n has a problem with missed detections, as it fails to identify all instances of Enteromorpha in the images. Additionally, in scenes containing two non-adjacent Enteromorpha communities (7.jpg and 9.jpg), CEE-YOLOv8 effectively achieves separate detection of targets without encountering issues of duplicate detections. Lastly, other test results further indicate that while YOLOv8n can detect Enteromorpha communities, it still exhibits lower confidence scores and reduced precision compared to the improved model, CEE-YOLOv8.
These differences can be attributed to targeted improvements to the model. The incorporation of the ConvNeXtv2 module significantly enhances the feature extraction capabilities of the C2f module, greatly improving the model’s ability to capture subtle edge features of Enteromorpha and expanding its perception range of target edges. This is particularly important given that the internal characteristics of Enteromorpha communities are often indistinct, and the edges of these communities can dissipate or settle over time. Such enhancements can improve the model’s feature extraction capacity for Enteromorpha of varying scales and forms, reducing missed detections and false positives caused by feature confusion, thus enhancing the robustness of Enteromorpha recognition in complex marine environments.
The integration of the ECA module further optimizes the model’s feature weighting for Enteromorpha, improving target discrimination in nighttime conditions and effectively reducing the false positive rate. The application of the EIoU loss function effectively increases the accuracy of bounding box localization, minimizing issues related to duplicate detections and boundary offsets. The synergistic effect of these three enhancements allows CEE-YOLOv8 to maintain stable detection performance even in challenging scenarios such as nighttime conditions, thereby demonstrating the effectiveness of the improvement strategies.
5.2. Limitations and Challenges
Although CEE-YOLOv8 has achieved significant improvements in Enteromorpha detection performance, its enhancement strategies still have certain limitations.
First, there is room to better balance computational efficiency and performance. While the introduction of ConvNeXtv2 and ECA has enhanced feature extraction, the added convolution operations increase the model's inference time to some extent, particularly affecting real-time performance on low-computing-power devices such as unmanned vessels.
Secondly, the ability to adapt to harsh weather conditions still falls short. Although advancements have been made in nighttime conditions, the model’s capacity to discern edge characteristics might diminish in highly intricate scenarios. Conditions like heavy fog, intense wave disruptions, or the dense entanglement of Enteromorpha with other algal species can contribute to an increased probability of false positives.
These limitations point to directions for future research, such as the integration of lightweight network designs, the introduction of multimodal feature fusion, or the construction of more diverse Enteromorpha datasets.
6. Conclusions
This study focuses on Enteromorpha recognition, an integral part of intelligent unmanned vessel tracking systems, which plays a vital role in monitoring Enteromorpha drift patterns and acquiring the environmental parameters necessary for Enteromorpha survival. The discussion primarily emphasizes methods to enhance the accuracy of Enteromorpha detection and improve the performance metrics of the algorithms used.
This research enhances the C2f component within the Backbone by incorporating the ConvNeXtv2 module, strengthening the model's capacity to extract edge feature information for Enteromorpha and thereby offering more detailed support for object detection. Integrating the ECA module into the Neck network heightens the model's emphasis on Enteromorpha regions, which in turn boosts cross-scene adaptability. Furthermore, the EIoU loss function improves the precision of bounding box localization by mitigating boundary offsets and duplicate detections. Collectively, these three strategies enable the model to stably capture subtle features in marine environments. Compared to the benchmark model YOLOv8n and other models (SSD, YOLOv5s, and YOLOv7-tiny), the proposed CEE-YOLOv8 demonstrates superior performance, achieving higher Precision, Recall, and mAP50-95. In addition, various quantitative metrics (P Curve, R Curve, PR Curve, and F1 Curve) and the final recognition results further validate the comprehensive performance advantages of the proposed model.
The performance advantages demonstrated by CEE-YOLOv8 support unmanned vessel monitoring of Enteromorpha drift patterns, providing reliable technical support for the precise acquisition of the environmental parameters essential for Enteromorpha survival. Notably, its stability in complex scenarios, such as nighttime conditions, meets the demand for highly robust detection models in marine ecological monitoring. In the future, the focus will be on further optimizing the structure of CEE-YOLOv8 through techniques such as pruning and quantization to balance detection accuracy and real-time performance, thereby enhancing its adaptability to unmanned vessel monitoring. Additionally, the dataset will be expanded to cover extreme lighting conditions, complex sea states, and mixed-species scenarios.