1. Introduction
Coral reefs are among the most important structural components of marine ecosystems. Although they cover less than 1% of the ocean’s surface area, they provide habitats and food resources for approximately 25% of marine organisms [1,2]. Common forms of coral abnormalities include coral bleaching, band disease, and white pox disease [3]. In recent years, the occurrence of abnormal coral phenomena has shown an increasing global trend, with coral bleaching becoming particularly severe, posing a major challenge to coral reef conservation at both national and international levels [4,5]. If such abnormal conditions persist over extended periods, they can lead to large-scale coral mortality, threaten coral-dependent organisms and food webs, and ultimately undermine the stability of marine ecosystems. Therefore, timely detection of abnormal corals, followed by appropriate artificial intervention measures based on their condition, can improve coral survival rates and help maintain marine ecological balance [6,7].
In response to these challenges, convolutional neural network (CNN)-based object detection has become the mainstream approach in underwater coral-related tasks. These methods fall into two broad categories: two-stage algorithms and one-stage algorithms. Representative two-stage algorithms include R-CNN (Region-Based CNN) [8] and Faster R-CNN [9], while typical one-stage algorithms include YOLO (You Only Look Once) [10] and EfficientDet [11]. These approaches have been widely applied to underwater object detection, including coral detection. Xin G. et al. employed an improved EfficientNet architecture to detect bleached corals [12]. González-Rivero et al. combined VGG-16 with a support vector machine (SVM) to detect corals and benthic coral reef communities [13]. Wang Lan et al. applied an improved deep data augmentation method, DeepSMOTE, together with transfer learning to detect reef-building corals [14]. Hua Mingzhu adopted an improved YOLOv5 model to classify and detect corals, reef fishes, and starfish [15]. Although these studies have made notable progress in coral-related detection tasks, existing methods are still insufficiently lightweight and show limited robustness when applied to diseased coral detection. In this work, robustness refers primarily to the model’s ability to maintain stable detection performance under varying underwater imaging conditions by suppressing background interference and enhancing the representation of pathological features.
Corals are characterized by high species diversity, complex and variable morphologies, and often dense spatial distributions. Interfering objects such as fish are also common in coral scenes, which further increases the difficulty of diseased coral detection. Moreover, due to underwater optical effects such as light attenuation and scattering, coral images captured underwater may suffer from severe color distortion and blurring. Factors such as water currents can also change the posture of targets, further complicating coral object detection [16]. In addition, the equipment deployed in underwater environments is typically constrained in computational capability and storage capacity [17]. Consequently, existing object detection models face a trade-off: methods with higher detection accuracy usually have high computational complexity and large model sizes, making them difficult to deploy underwater, whereas smaller and computationally cheaper methods that are easier to deploy often exhibit lower detection accuracy and robustness. It is therefore necessary to develop a lightweight underwater diseased coral detection algorithm that achieves a better balance among model efficiency, detection accuracy, and robustness.
2. Method
Underwater diseased coral detection faces practical challenges such as complex illumination conditions, low contrast, background interference, and limited computational resources in real deployment scenarios. Therefore, the proposed method follows a task-oriented design strategy, aiming to balance detection accuracy and computational efficiency.
To achieve more accurate, lightweight, and easily deployable underwater diseased coral object detection, this study proposes multiple improvements to the YOLO11 model, resulting in the CD-YOLO architecture. The overall network structure is illustrated in Figure 1.
The main architectural design choices of CD-YOLO are summarized below: (1) The core unit of ShuffleNetV2 (ShuffleNet Version 2) is improved by adopting a “dimension reduction followed by dimension expansion” strategy combined with residual connections. Based on this unit, a task-oriented lightweight network named CDShuffleNet (Coral Disease–ShuffleNet) is constructed by adapting and reorganizing existing lightweight design principles for underwater diseased coral detection; it replaces the backbone of YOLO11 to make the model lighter while also improving detection performance. (2) The downsampling convolution modules in the Neck are replaced with SPDConv (Space-to-Depth Convolution), which further improves detection accuracy while keeping the model lightweight. (3) Attention mechanisms are integrated: the C2PSA module is improved by incorporating the EMA (Efficient Multi-scale Attention) mechanism, and the SENetV2 (Squeeze-and-Excitation Network Version 2) attention mechanism is added to the Neck. These designs further reduce model size, strengthen feature representation capability, and improve the overall robustness of the model.
2.1. YOLO11 Model
YOLO11 is one of the recent versions in the YOLO series. It adopts improved Backbone, Neck, and Head architectures, which strengthen feature extraction and enable more accurate object detection in complex tasks. The model offers faster processing while maintaining a favorable balance between accuracy and efficiency, and it can be deployed in a variety of environments, including edge devices [18].
Owing to its favorable trade-off between detection accuracy and inference speed, YOLO11 is often selected as a baseline and further improved for resource-constrained object detection tasks [19]. Compared with YOLO12 and various Transformer-based models, YOLO11 has lower computational complexity and a smaller model size. This study therefore selects YOLO11 as the baseline model. Enhancing YOLO11 makes it possible not only to improve overall detection accuracy in underwater diseased coral detection, but also to further reduce the number of model parameters, computational cost, and model size, making the model more suitable for deployment.
2.2. Backbone Replacement with CDShuffleNet
Due to the limitations of underwater hardware conditions in diseased coral detection tasks, the detection model is required to maintain a balance between computational efficiency and feature representation capability. Therefore, a lightweight backbone architecture is adopted in this study.
This study improves the core unit module of the lightweight ShuffleNetV2 network by drawing on the characteristics of ShuffleNetV2 and multiple network design principles. The improved core unit modules are then used to construct a new network, CDShuffleNet, in which the number of unit modules at each stage is reduced compared with the original network. As a result, CDShuffleNet is lighter than the original YOLO11 backbone while maintaining comparable detection performance, making it suitable for object detection tasks in constrained underwater environments. The details are described as follows.
ShuffleNetV2 [20] is a lightweight and efficient deep neural network designed for computation-constrained environments such as mobile devices. Building on ShuffleNetV1, it emphasizes that FLOPs alone are an indirect measure of efficiency and that factors determining actual inference latency, such as memory access cost and the degree of parallelism, should also be considered. Accordingly, ShuffleNetV2 adopts a more uniform channel allocation strategy, reduces memory access cost, and introduces new computational units to improve parallel computing efficiency. Its main characteristics include: splitting the input channels into two branches, where one branch undergoes lightweight computation while the other is passed through unchanged; removing the group convolution (GConv) used in ShuffleNetV1; employing channel shuffle operations to ensure effective information exchange between the branches; and reducing the proportion of element-wise operations.
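The split, bypass, and shuffle steps described above can be illustrated with a minimal pure-Python sketch that operates on channel labels rather than tensors. The function names `channel_shuffle` and `shuffle_unit`, and the placeholder `branch_fn` standing in for the lightweight convolution branch, are illustrative assumptions and are not taken from the ShuffleNetV2 implementation.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups (reshape -> transpose -> flatten)."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    # View the flat list as a (groups, per_group) grid and read it column by column.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


def shuffle_unit(channels, branch_fn):
    """Stride-1 ShuffleNetV2-style unit acting on a list of channel labels."""
    half = len(channels) // 2
    bypass, work = channels[:half], channels[half:]    # channel split
    work = [branch_fn(c) for c in work]                # lightweight branch (placeholder)
    return channel_shuffle(bypass + work, groups=2)    # concatenate, then shuffle


# Upper-casing marks the channels that went through the computed branch.
out = shuffle_unit(["a0", "a1", "b0", "b1"], branch_fn=str.upper)
print(out)  # ['a0', 'B0', 'a1', 'B1']
```

After the shuffle, bypassed and processed channels alternate, so the next unit's split hands each branch a mixture of both.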
Residual connections and Bottleneck modules were first introduced in ResNet [21]. In traditional neural networks, each layer typically learns a mapping function H(x) directly, whereas a residual block learns the residual function F(x) = H(x) − x and produces F(x) + x through a skip connection. This mechanism effectively alleviates gradient vanishing and gradient explosion problems as network depth increases. The Bottleneck structure is designed to limit the growth in computational cost and parameter count caused by deeper networks. In YOLO11, the Bottleneck module consists of a CBS module (convolution, batch normalization, and SiLU activation) that first reduces the channel dimension, followed by another CBS module that restores it, with an optional residual connection. This “dimension reduction followed by dimension expansion” design reduces computational cost while maintaining the model’s representational capacity.
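To make the saving from “dimension reduction followed by dimension expansion” concrete, the following sketch counts convolution weights for a Bottleneck and for two full-width convolutions, and shows the residual output F(x) + x on a plain list. The channel count of 64, the reduction ratio of 0.5, and the 3 × 3 kernels are assumptions chosen for illustration; they are not values reported in this study.

```python
def conv_params(c_in, c_out, k):
    """Weights of a k x k convolution (bias omitted, as is usual before BatchNorm)."""
    return c_in * c_out * k * k


def bottleneck_params(c, ratio=0.5, k=3):
    """Two stacked k x k convolutions: c -> c * ratio -> c."""
    hidden = int(c * ratio)
    return conv_params(c, hidden, k) + conv_params(hidden, c, k)


def residual(x, f):
    """Residual connection: the block learns F(x), the output is F(x) + x."""
    return [fx + xi for fx, xi in zip(f(x), x)]


c = 64                                  # assumed channel count
plain = 2 * conv_params(c, c, 3)        # two full-width 3x3 convolutions
squeezed = bottleneck_params(c)         # reduce to 32 channels, then expand back
print(plain, squeezed)                  # 73728 36864

# If F(x) is zero, the block reduces to the identity mapping.
print(residual([1.0, 2.0], lambda v: [0.0] * len(v)))  # [1.0, 2.0]
```

Under these assumptions the Bottleneck needs half the weights of the full-width stack, and the skip connection lets a block fall back to the identity when its learned residual is small.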
The original core module of ShuffleNetV2 and the improved core module proposed in this study are illustrated in Figure 2. Modules with a stride of 1 are used for feature extraction, while modules with a stride of 2 are used for downsampling. In this work, the stride-2 modules are retained to ensure lightweight downsampling, and the stride-1 modules are redesigned as follows.
First, the channel splitting operation is removed, and a CBS module with a convolution kernel size of 1 is employed to compress the channel dimension. The compressed features are then divided into three branches, referred to as Branch 1, Branch 2, and Branch 3. Next, Branch 3 is sequentially processed by a CBS module with a kernel size of 3, a depthwise convolution (DWConv), and another CBS module with a kernel size of 3. In this process, the first CBS module performs channel compression (“dimension reduction”), while the second CBS module restores the channel dimension (“dimension expansion”). Subsequently, the processed Branch 3 is added element-wise to Branch 2 to form a residual connection. The output of this residual connection is then concatenated with Branch 1, followed by a channel shuffle operation to obtain the final output of the module.
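The channel bookkeeping of the redesigned stride-1 module can be traced with a short sketch. It rests on one possible reading of the description, namely that all three branches receive the same compressed feature map, and on assumed ratios of 0.5 for both the 1 × 1 compression and the inner reduction; neither the ratios nor the function name `improved_unit_channels` come from this study.

```python
def improved_unit_channels(c_in, squeeze=0.5, inner=0.5):
    """Trace channel counts through the redesigned stride-1 unit (assumed ratios)."""
    c_mid = int(c_in * squeeze)     # 1x1 CBS compresses the input
    branch1 = branch2 = c_mid       # assumption: every branch sees the compressed features
    c_low = int(c_mid * inner)      # first 3x3 CBS on Branch 3: dimension reduction
    c_dw = c_low                    # DWConv keeps the channel count unchanged
    branch3 = c_mid                 # second 3x3 CBS: dimension expansion
    if branch3 != branch2:
        raise ValueError("element-wise addition needs equal channel counts")
    added = branch2                 # Branch 3 + Branch 2 (residual connection)
    c_out = branch1 + added         # concatenation with Branch 1, then channel shuffle
    return {"compressed": c_mid, "reduced": c_low, "depthwise": c_dw, "output": c_out}


print(improved_unit_channels(64))
# {'compressed': 32, 'reduced': 16, 'depthwise': 16, 'output': 64}
```

With these ratios the module returns as many channels as it receives, and the depthwise convolution runs at a quarter of the input width.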
Compared with the original core module, the improved module removes the original Conv 1 × 1 + BN + ReLU combination and replaces it with CBS modules, with only the first CBS module using a 1 × 1 convolution kernel. By increasing the use of 3 × 3 convolutions and adopting the SiLU activation function, the feature learning capability of the module is effectively enhanced, although this also increases computational cost. Therefore, in contrast to the original module, the feature extraction branch is further compressed to a lower channel dimension before applying the DWConv from the original module and subsequently restoring the channel dimension. This design reduces computational cost while preserving sufficient feature representation capability, such as learning shape and color information. Moreover, since ShuffleNetV2 stacks a large number of core modules, deep networks are prone to gradient vanishing. To alleviate this issue, a residual connection is introduced before channel concatenation in the improved core module.
In addition, because the improved core modules exhibit stronger feature extraction capability, the ShuffleNetV2 network architecture is further adjusted by reducing the number of stacked core modules at each stage, thereby decreasing the overall network complexity and computational cost. The original ShuffleNetV2 architecture and the proposed CDShuffleNet architecture are illustrated in Figure 3.
CDShuffleNet is employed to replace the backbone network of YOLO11, aiming to reduce model complexity while preserving essential feature extraction capability under underwater conditions. Underwater illumination is uneven: light refraction and scattering cause brightness variations, while light attenuation often leads to color shifts. As a result, coral images are sometimes blurred, which hinders effective feature extraction. Moreover, abnormal coral regions vary in size and may exhibit low contrast with healthy corals, making them difficult to distinguish. Bleached corals differ from healthy corals mainly in color, whereas other types of diseased corals differ primarily in fine-grained details, such as patches and stripes.
By adopting the improved core unit modules, CDShuffleNet is able to learn color information as well as detailed features such as patches and stripes on corals more effectively than the original ShuffleNetV2 network. This strengthens the overall feature extraction capability of the model, enabling it to maintain favorable detection performance while remaining lightweight, and making it better suited to the deployment and detection requirements of underwater environments.
2.3. Incorporation of SPDConv
Underwater coral images often exhibit insufficient resolution, reduced contrast, and significant noise due to limitations of imaging equipment as well as light scattering and absorption effects in water. When coral targets are small in scale, sparsely distributed, or located at a considerable distance from the imaging device, the model’s ability to perceive fine-grained features and small objects is severely challenged.
Existing studies have demonstrated that, in scenarios with significant background interference or limited data quality, preserving and effectively processing informative features is critical to the performance of deep learning models. For example, Selvan et al. employed discrete wavelet transform (DWT) to remove noise from sensor signals and achieved an accuracy of 99.58% in a fetal health classification task using deep learning models [22]. This result further indicates that strengthening information preservation in complex environments has general significance for improving model performance and is also worthy of attention in underwater object detection tasks.
To address this issue, this study improves the downsampling module, as described below. SPDConv [23] is a downsampling convolution module. In conventional convolutional neural networks, strided convolutions and pooling operations are widely used to downsample feature maps, which enlarges the receptive field and reduces computational cost. However, when the input image resolution is low or the target objects are small, early-stage downsampling with these operations discards a substantial amount of fine-grained information, leading to significant performance degradation in small-object detection and low-resolution classification tasks. SPDConv instead performs downsampling by sampling the input feature map at a fixed stride to obtain several sub-feature maps (four when the stride is 2) and concatenating them along the channel dimension. Although the spatial resolution is reduced, the number of channels increases correspondingly, so that more feature information is preserved. A convolution with a stride of 1 is then applied to reduce the channel dimension. In this way, the loss of fine-grained information that could lead to insufficient feature learning is avoided, while the computational overhead of the downsampling module is not increased.
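The space-to-depth step can be sketched in pure Python on a single-channel map stored as nested lists. The function name `space_to_depth` and the ordering of the resulting sub-maps are illustrative assumptions; the subsequent stride-1 convolution that reduces the channel dimension is not shown.

```python
def space_to_depth(feature_map, scale=2):
    """Split one H x W map into scale*scale sub-maps of size (H/scale) x (W/scale)."""
    h, w = len(feature_map), len(feature_map[0])
    if h % scale or w % scale:
        raise ValueError("height and width must be divisible by scale")
    sub_maps = []
    for dy in range(scale):
        for dx in range(scale):
            # Keep every `scale`-th pixel, starting at offset (dy, dx).
            sub_maps.append([row[dx::scale] for row in feature_map[dy::scale]])
    return sub_maps  # to be stacked along the channel dimension


x = [[0, 1, 2, 3],
     [4, 5, 6, 7],
     [8, 9, 10, 11],
     [12, 13, 14, 15]]
channels = space_to_depth(x)
print(len(channels), channels[0])  # 4 [[0, 2], [8, 10]]
```

Each 4 × 4 input becomes four 2 × 2 sub-maps, and every input pixel appears in exactly one of them, which is what distinguishes this rearrangement from strided convolution or pooling.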
In YOLO11, the Neck still employs several conventional downsampling convolution modules. In this work, these modules are replaced with SPDConv. After this replacement, the model is able to maintain sufficient feature learning of coral details under low-resolution conditions or when targets are small. Consequently, the proposed improvement enhances the model’s capability for underwater diseased coral detection.
2.4. Fusion of Attention Mechanisms
After the aforementioned improvements, the model achieves a certain degree of lightweight design while maintaining relatively high detection accuracy. However, in practical scenarios, underwater coral detection remains highly challenging due to the presence of small and scattered pathological regions, such as localized bleaching and white pox disease occurring among numerous healthy corals. In addition, white pathological areas are easily confused with white sandy seabeds, and interference from water surface reflections, illumination variations, and marine organisms further increases the complexity of the detection task. As a result, conventional models often exhibit limited robustness in terms of stable feature discrimination and background suppression under complex underwater environments. To address these issues, this study integrates two attention mechanisms to further lighten the model while enhancing the representation of pathological region features, suppressing environmental interference, and improving overall robustness. The details are as follows.
Efficient Multi-scale Attention (EMA) [24] is an attention module designed for computer vision tasks that reduces the number of parameters and computational cost while preserving the key information of each channel. EMA enhances feature processing capability by reorganizing channel and batch dimensions, captures pixel-level relationships through cross-dimensional interactions, and improves feature representation via global information encoding and channel weight calibration. These designs enable EMA to achieve a favorable balance among the number of parameters, computational cost, and detection performance, thereby facilitating further model lightweighting. SENetV2 [25] is an improved version of the Squeeze-and-Excitation Network (SENet), whose core component is the Squeeze Aggregated Excitation (SaE) module. By combining the characteristics of the SENet and ResNeXt architectures, SaE adopts multi-branch fully connected layers for feature compression followed by scaling. This design enhances global information aggregation and feature representation capability, allowing SENetV2 to learn more complex channel relationships with negligible additional computational cost.
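The squeeze, multi-branch compression, and scaling steps of an SaE-style block can be sketched as follows. This is a simplified illustration under stated assumptions, not the SENetV2 implementation: the helper names, the number of branches, the omission of biases, and all weight values are hypothetical.

```python
import math


def dense(vec, weights):
    """Fully connected layer without bias: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def sae_channel_gates(feature_maps, branch_weights, expand_weights):
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(map(sum, fm)) / (len(fm) * len(fm[0])) for fm in feature_maps]
    # Aggregate: each branch compresses the descriptor; the outputs are concatenated.
    hidden = []
    for weights in branch_weights:
        hidden += [max(0.0, h) for h in dense(squeezed, weights)]  # ReLU
    # Excite: expand back to one gate per channel, squashed into (0, 1).
    return [1.0 / (1.0 + math.exp(-z)) for z in dense(hidden, expand_weights)]


def rescale(feature_maps, gates):
    """Scale every channel by its gate."""
    return [[[g * v for v in row] for row in fm] for fm, g in zip(feature_maps, gates)]


# Two 2x2 channels, two branches with one hidden unit each (hypothetical weights).
maps = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
branches = [[[1.0, 0.0]], [[0.0, 1.0]]]
expand = [[2.0, 0.0], [-2.0, 0.0]]
gates = sae_channel_gates(maps, branches, expand)
scaled = rescale(maps, gates)
```

With these weights the first channel receives a gate above 0.5 and the second a gate below 0.5, showing how the block can emphasize some channels and suppress others using only a few small fully connected layers.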
Accordingly, this study integrates the EMA mechanism into the C2PSA module to construct the C2PSA_EMA module. The C2PSA module is a component introduced in YOLO11 and serves as the final processing module in the Backbone. In this work, the Attention modules within all PSABlock units of the original C2PSA module are replaced with EMA, which is more lightweight than the original Attention module. By leveraging EMA’s cross-dimensional interaction and global information encoding capabilities, the model can learn the relationships between different regions of an image, enabling more effective discrimination between targets and background across diverse environments.
Furthermore, the SENetV2 mechanism is introduced into the Neck of YOLO11, aiming to strengthen channel-wise feature discrimination for small and visually ambiguous coral targets. By learning complex channel relationships and aggregating global information, SENetV2 enhances the model’s sensitivity to pathological features in terms of color (e.g., white regions) and shape (e.g., band-like or spot-like patterns). Its placement in the small-object detection layer enables more effective handling of small and dispersed white pox disease regions or localized bleaching mixed within healthy corals, thereby improving the overall robustness of the model.
Both attention modules are inherently lightweight. Moreover, EMA replaces the original Attention module rather than being added alongside it, and SENetV2 is added to only a single detection layer, so the resulting increase in parameters and computational cost is limited. The incorporation of these two attention mechanisms therefore allows the improved model to maintain its lightweight characteristics.
4. Conclusions
This study addresses the challenges of underwater coral target detection posed by hardware constraints and complex environments, which often lead to high rates of false positives and false negatives. To overcome these challenges, a lightweight yet high-performance underwater coral detection method, CD-YOLO, was proposed. Compared with YOLO11n, CD-YOLO achieves a 4.3 percentage point increase in mAP, while reducing parameter count, computational cost, and model size by 20.6%, 21.9%, and 18.9%, respectively. In comparison with several other lightweight models and related studies, the experimental results indicate that CD-YOLO achieves improved detection performance while maintaining lightweight characteristics. The cross-dataset experiments on the BHD Coral dataset further suggest preliminary adaptability under similar underwater detection conditions. Visualization experiments using heatmaps further illustrate the effectiveness of the proposed improvements.
Due to the lack of access to deployed underwater hardware platforms, real-device latency and energy consumption evaluations were not conducted in this study. Such evaluations will be considered an important part of future work when appropriate hardware conditions become available. Moreover, the conclusions of this study are subject to the limitation of dataset scale, and further validation on larger and more diverse real-world underwater datasets is required.
For future work, collecting a larger dataset that covers more complex underwater scenarios would allow model training to better reflect real-world conditions. Additionally, the Neck of the model still offers room for improvement, for example by adopting more efficient feature fusion networks or more advanced upsampling modules to further enhance detection performance.