1. Introduction
Coral reef ecosystems, known for their biodiversity and primary productivity, play critical roles in coastal protection, resource provision, and supporting marine life habitats, but are increasingly threatened by intensified human activities and global warming [1]. In particular, coral reefs are experiencing significant stress due to coral bleaching, pollution, and destructive fishing practices. While coral reefs can recover from bleaching events, prolonged exposure to these stressors has led to increased mortality, making effective monitoring and conservation efforts critically important. Furthermore, both natural and anthropogenic pressures are exacerbating their vulnerability. Conducting detailed surveys and explorations of coral reefs is essential to understanding their stability, dynamic evolution, and future trajectory [2]. Coral reef ecological surveys lay the foundation for the planning, management, development, utilization, and protection of coral reef resources; they also support the construction of coral reef projects and the maintenance of national security and territorial integrity [3]. Live Coral Cover (LCC) is the most crucial indicator of reef health and one of the key parameters in coral reef surveys [4]. Monitoring LCC provides direct insights into the health of coral reef ecosystems and serves as an essential basis for assessing the overall function and dynamics of the ecosystem, which is crucial for ecological preservation and management [5].
Currently, the primary method used to investigate LCC is the belt transect method [6]. This method typically involves placing a transect tape underwater; researchers, maintaining a vertical distance of about 30 cm from the tape, use underwater cameras to take continuous shots along the laid tape. The footage is later analyzed indoors with the help of computer magnification to calculate the total length of the transect occupied by live corals and thereby estimate the LCC. This method is time-consuming, labor-intensive, and requires expert interpretation [7,8]. Other methods also involve interpreting videos or photographs from underwater ecological surveys to calculate LCC. For example, Kenyon et al. (2006) used the towing method, in which divers are towed while filming the coral reefs, and later manually extracted key frames from the videos to calculate LCC based on the proportion of coral pixels in the images [9]. Jokiel et al. (2015) used the CRAMP RAT method to photograph corals, selecting random points on the images to calculate LCC based on the number of live coral points [10]. While these traditional methods are effective, they share the same costs in time, labor, and specialized expertise. Additionally, they are prone to human error, limited by the resolution of the images, and often produce results that vary with the researcher’s interpretation, leading to inconsistencies in data. Furthermore, such methods typically lack scalability, making them less suitable for large-scale or continuous monitoring efforts.
With the development of computer science, coral can now be identified and detected using technologies such as artificial intelligence and deep learning. Mahmood et al. (2016) combined manually extracted features with VGG (visual geometry group network) extracted features to identify coral reefs in the ocean [11]. Padma Priya used LIC-WM embedding in VGG for coral recognition. Sharan et al. (2021) used a CNN to recognize and classify corals from grayscale and RGB images [12]. Jiang et al. used the YOLOv5 model to identify corals [13]. These works mainly focus on coral detection. Image semantic segmentation, by contrast, aims to decompose the visual scene into different semantic category entities, achieving category prediction for each pixel in the image. Alonso et al. (2019) used an improved Deeplab v3 model to segment corals; Zhang et al. (2024) used their own designed network to segment corals; and King et al. (2018) used an improved Deeplab v2 model to segment corals. These studies indicate that deep learning methods can accurately segment corals in images, but most of them have not been applied to ecological surveys and monitoring [14,15,16]. Coverage and area can also be estimated through semantic segmentation: Ou et al. (2023) used the U-net network to measure vegetation coverage, and Shen et al. (2023) used the U-net network to measure grassland coverage [17,18]. These studies provide concrete inspiration for our work, demonstrating the successful use of deep learning models to quantify coverage areas in complex natural environments. Specifically, they illustrate the feasibility of adapting similar segmentation techniques for coral coverage estimation, with the potential to improve accuracy, scalability, and efficiency. Despite the success of these models in image segmentation tasks, applying deep learning directly to LCC interpretation for reef health monitoring remains an underexplored area.
In response to these challenges, a new method is proposed in this paper to enhance the efficiency of LCC interpretation by optimizing the PSPNet model through the integration of channel and spatial attention mechanisms, along with pixel shuffle modules. The contributions are outlined as follows:
- Proposed a deep learning-based automated method for efficient LCC estimation from underwater videos, overcoming the limitations of manual interpretation. This approach significantly reduces labor and time costs while enabling large-scale monitoring of coral reef health, making long-term ecosystem tracking more feasible. 
- Enhanced the semantic segmentation performance by integrating channel and spatial attention mechanisms with pixel shuffle modules. This optimization significantly improved the precision of the PSPNet model in segmenting corals and handling complex underwater environments, boosting both the efficiency and robustness of LCC estimation, particularly in noisy and challenging backgrounds. 
- Converted model outputs into LCC, a key indicator of coral reef ecosystem health. This transformation sets the study apart from other image recognition research and enables the model to provide direct and precise insights into the health status of coral reefs, significantly improving the relevance and utility of the results for coral reef monitoring and conservation efforts. 
  3. Methods
This section describes the proposed automatic interpretation method for estimating LCC. The approach involves three key stages: key frame extraction, modeling, and coverage estimation. Detecting and segmenting corals in underwater imagery presents significant challenges due to variations in lighting conditions, underwater visibility, and the intricate structure of coral reefs [13]. To identify the most appropriate model for LCC estimation, we evaluated four widely used semantic segmentation neural network models: PSPNet, Deeplab v3+, U-Net, and HRNet. Standard semantic segmentation accuracy metrics were employed to assess model performance, including mean Intersection over Union (mIoU) and mean Average Precision (mAP). The results of automatic LCC interpretation were then compared with manual interpretation data to assess the reliability and precision of the automated method.
  3.1. Key Frame Extraction
Taking videos shot using the transect line method as an example, divers set up transect tapes on the coral reefs and move along these tapes while filming. Due to limitations in the filming equipment, the motion state of divers, and the particularities of underwater operations, the resulting videos inevitably include blurry and static segments. Using a simple posterizeTime algorithm that extracts frames at fixed intervals would not eliminate blurry or repetitive images. Blurry images might lead the model to fail in recognizing corals, thereby reducing the final coverage estimate. Repetitive images would lead to redundant counting, particularly if the diver’s footage happens to linger over areas that are either barren or densely populated with corals, thus significantly impacting the final calculation of coverage. Therefore, we introduce a Laplacian variance threshold and the Structural Similarity Index (SSIM) to filter the extracted key frames. The Laplacian variance, a measure of image sharpness, is calculated by convolving a single-channel image with a 3 × 3 Laplacian kernel and taking the variance of the response.
In image processing, the Laplacian operator is typically used to detect edges and texture information in images. For sharp images, which contain richer edge and texture information, the variance of the image after Laplacian transformation is higher [
26]. We use a predefined threshold (set here at 100) to determine if a frame is blurry; if the Laplacian variance exceeds this threshold, the frame is considered sharp. Otherwise, it is deemed blurry and discarded. 
Figure 2 shows underwater coral images with different Laplacian variances.
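The sharpness test described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the authors' implementation: the function names `laplacian_variance` and `is_sharp` are hypothetical, the 4-neighbour Laplacian kernel is an assumption (the text only states that the kernel is 3 × 3), and border pixels are simply skipped. With OpenCV, the same quantity is commonly obtained as `cv2.Laplacian(gray, cv2.CV_64F).var()`.

```python
def laplacian_variance(gray):
    """Variance of the 3x3 Laplacian response of a grayscale image.

    `gray` is a list of rows of pixel intensities (0-255). Only interior
    pixels are evaluated, so the image must be at least 3x3.
    """
    h, w = len(gray), len(gray[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Assumed kernel: [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def is_sharp(gray, threshold=100.0):
    """A frame is kept as sharp only if its Laplacian variance exceeds the threshold."""
    return laplacian_variance(gray) > threshold


# Toy inputs: a flat patch has no edges, a checkerboard has many.
flat = [[128] * 5 for _ in range(5)]
checker = [[255 if (x + y) % 2 else 0 for x in range(5)] for y in range(5)]
print(is_sharp(flat), is_sharp(checker))
```

The threshold of 100 is the value quoted in the text; it is tied to the image resolution and bit depth, so it would likely need retuning for other footage.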
The SSIM index is a metric used to measure the similarity between two digital images. The core function of SSIM is defined as Formula (1):
$$S(x, y) = f\big(l(x, y),\ c(x, y),\ s(x, y)\big) \tag{1}$$
where $S(x, y)$ describes the similarity between the distorted signal and the original signal and acts as a distortion measure, $l(x, y)$ is the brightness comparison function, $c(x, y)$ is the contrast comparison function, $s(x, y)$ is the structure comparison function, and $f(\cdot)$ is the integration function; the three comparison functions are independent [27]. When one of the two images is undistorted and the other is distorted, their structural similarity can be regarded as an indicator of image quality [28].
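Formula (1) leaves the three comparison functions and the integration function abstract. The sketch below fills them in with the widely used definitions (luminance, contrast and structure terms stabilised by small constants and combined as a plain product); this choice is an assumption, since the text does not state which form was used. It also computes a single global SSIM over the whole image, whereas library implementations usually average SSIM over local windows. The name `ssim_global` is hypothetical.

```python
import math


def ssim_global(x, y, dynamic_range=255.0):
    """Single-window SSIM of two equally sized grayscale images.

    `x` and `y` are flat sequences of pixel intensities. Population
    (divide-by-n) statistics are used for simplicity.
    """
    n = len(x)
    c1 = (0.01 * dynamic_range) ** 2  # stabilises the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilises the contrast term
    c3 = c2 / 2                       # stabilises the structure term

    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

    l = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)  # brightness
    c = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)          # contrast
    s = (cov + c3) / (sd_x * sd_y + c3)                        # structure
    return l * c * s  # f(.) taken as a plain product (assumption)


frame_a = [0, 255, 0, 255]
frame_b = [255, 0, 255, 0]  # inverted copy of frame_a
print(ssim_global(frame_a, frame_a), ssim_global(frame_a, frame_b))
```

An identical pair scores (numerically) 1, while the inverted pair scores below zero because the structure term is driven by the negative covariance.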
We calculate the SSIM value between the current frame and previously saved frames. Then, based on a predefined threshold (set here at 0.9), we determine whether two frames are similar. If the SSIM value exceeds the threshold, the frames are considered duplicates; the latter is discarded. Otherwise, the frames are considered unique.
To ensure no coral is missed, we set the frame extraction frequency at one frame every 30 frames. The key frames extracted after computation of Laplacian variance and SSIM thresholds are shown in 
Figure 3. Depending on different ecological survey videos and computer hardware conditions, the frame extraction frequency and thresholds may vary.
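Putting the three rules together (fixed-interval sampling, blur rejection, duplicate rejection), the selection loop might look like the sketch below. It is an illustration under stated assumptions, not the authors' code: `select_key_frames` is a hypothetical name, the sharpness and similarity measures are passed in as callables so that any Laplacian-variance and SSIM implementation can be plugged in, and each candidate is compared against every previously kept frame (the text says "previously saved frames" without specifying how many).

```python
def select_key_frames(frames, sharpness, similarity,
                      step=30, blur_threshold=100.0, ssim_threshold=0.9):
    """Return the indices of frames kept as key frames.

    frames     : sequence of decoded frames (any type)
    sharpness  : callable(frame) -> float, e.g. Laplacian variance
    similarity : callable(frame_a, frame_b) -> float, e.g. SSIM
    """
    kept = []
    for idx in range(0, len(frames), step):
        frame = frames[idx]
        if sharpness(frame) <= blur_threshold:
            continue  # blurry frame: discard
        if any(similarity(frame, frames[k]) > ssim_threshold for k in kept):
            continue  # near-duplicate of a saved frame: discard the later one
        kept.append(idx)
    return kept


# Toy stand-ins: each "frame" is (sharpness value, scene label).
toy_frames = [(150, "a"), (50, "b"), (200, "a"), (120, "c")]
toy_sharpness = lambda f: f[0]
toy_similarity = lambda a, b: 1.0 if a[1] == b[1] else 0.0
print(select_key_frames(toy_frames, toy_sharpness, toy_similarity, step=1))
```

In the toy run, the second frame is rejected as blurry and the third as a duplicate of the first. For long videos, comparing only against the most recent kept frame would reduce cost at the expense of missing repeats that occur later in the dive.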
  3.2. Model Building
Zhao et al. (2017) introduced the PSPNet neural network model, a semantic segmentation network based on a spatial pyramid pooling module that effectively integrates global context information to enhance segmentation accuracy in complex scenes [
29]. Our approach is built upon and improves the PSPNet for segmenting coral images. The flowchart for coral image segmentation is depicted in 
Figure 4 and can be broadly divided into two components: the backbone network and the detection network.
The backbone network is primarily used for feature extraction from input images. In this study, we employ a backbone network based on ResNet-50 (Residual Network), which incorporates both spatial and channel attention mechanisms. These enhancements emphasize useful information while suppressing irrelevant features, improving the model’s feature extraction capabilities. The detection network is built upon the PSPNet architecture, with the addition of a Pixel Shuffle (pixel recombination) module to improve mask generation quality. The following sections provide a detailed explanation of both the backbone network and the detection network.
  3.2.1. Backbone Network
He et al. (2016) introduced the ResNet network, which effectively addressed the degradation problem associated with training deep neural networks by incorporating residual blocks, thereby enabling the successful training of deeper networks [30]. The outstanding performance and widespread adoption of ResNet have established it as a landmark model in computer vision. Common variants of ResNet include ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152, which differ in the number and configuration of residual blocks. While deeper layers allow for the capture of more complex features, they also lead to increased computational requirements and longer training times. To strike a balance between performance and computational efficiency, we utilize ResNet-50 as the backbone network.
Corals exhibit considerable diversity in species and morphology, with different species displaying both similarities and differences in their features and structures. Furthermore, the complexity of the marine environment can result in challenges when distinguishing corals from the seabed substrate, especially when they share similar colors and structures. Corals often coexist with other marine organisms such as seaweeds, anemones, urchins, and starfish, further complicating the segmentation task. Therefore, it is critical to accurately identify coral-specific features while minimizing attention to non-coral elements, thereby improving the model’s ability to extract relevant features. We have integrated spatial and channel attention mechanisms into the ResNet-50 model to achieve this.
Different channels may represent different features of an image, such as edges, textures, or colors. The channel attention mechanism weights each channel of the feature map, allowing the network to automatically focus on the channels most relevant to the current task. First, global average pooling $\mathrm{AvgPool}(F)$ and global max pooling $\mathrm{MaxPool}(F)$ operations are applied to the input feature map $F$, producing two descriptors of spatial size 1 × 1 that hold the average and maximum values of each channel, respectively. These pooled results are then passed through a shared, fully connected layer (MLP), implemented with two 1 × 1 convolutional layers. The first convolutional layer reduces the number of channels to 1/16th of the original, followed by a ReLU activation function, and the second convolutional layer restores the number of channels back to the original size. The outputs from these layers are then summed and processed by a sigmoid activation function, yielding the final weight for each channel. This weight is subsequently applied to the original feature map, producing an enhanced feature map $F'$. In short, the channel attention is computed as Formula (2):
$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^{c}_{avg})) + W_1(W_0(F^{c}_{max}))\big) \tag{2}$$
where $\sigma$ denotes the sigmoid activation function, which normalizes the output. The MLP weights $W_0$ and $W_1$ are shared for both inputs, $F^{c}_{avg}$ and $F^{c}_{max}$. After passing through $W_0$, the ReLU activation function is applied before the second weight, $W_1$, restores the original channel size.
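The channel attention computation can be traced on a tiny feature map without any deep learning framework. The sketch below is illustrative only: `channel_attention` is a hypothetical name, the shared MLP is written as two bias-free matrix products (equivalent to 1 × 1 convolutions applied to a 1 × 1 map), and the toy example uses two channels with a hidden width of one instead of the 1/16 reduction described in the text.

```python
import math


def channel_attention(feature, w0, w1):
    """Channel attention on a C x H x W feature map given as nested lists.

    w0 : (C // r) x C weights of the channel-reducing layer
    w1 : C x (C // r) weights of the channel-restoring layer
    Returns (per-channel weights, re-weighted feature map).
    """
    def mlp(v):
        # Shared MLP: reduce, ReLU, restore (no bias terms, by assumption)
        hidden = [max(0.0, sum(w * a for w, a in zip(row, v))) for row in w0]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w1]

    # Global average and max pooling over the spatial dimensions
    avg = [sum(sum(r) for r in ch) / (len(ch) * len(ch[0])) for ch in feature]
    mx = [max(max(r) for r in ch) for ch in feature]

    # Sum the two MLP outputs, then squash with a sigmoid
    logits = [a + m for a, m in zip(mlp(avg), mlp(mx))]
    weights = [1.0 / (1.0 + math.exp(-z)) for z in logits]

    scaled = [[[wt * v for v in row] for row in ch]
              for ch, wt in zip(feature, weights)]
    return weights, scaled


toy_feature = [[[1.0, 1.0], [1.0, 1.0]],
               [[0.0, 0.0], [0.0, 4.0]]]
toy_w0 = [[1.0, 1.0]]       # 2 channels -> 1 hidden unit
toy_w1 = [[1.0], [-1.0]]    # 1 hidden unit -> 2 channels
weights, scaled = channel_attention(toy_feature, toy_w0, toy_w1)
print(weights)
```

With these hand-picked weights the first channel is kept almost unchanged and the second is almost fully suppressed, which is the "emphasize some channels, mute others" behaviour the mechanism is meant to learn.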
The spatial attention mechanism focuses on the importance of different spatial locations within the feature map. Initially, global average pooling and global max pooling are applied to the input feature map across the channel dimension, generating two feature maps of dimensions $1 \times H \times W$, representing the average and maximum values. These feature maps are then concatenated along the channel dimension, forming a two-channel feature map. This is followed by a convolution operation that computes the attention weights for each spatial location. The resulting weights are used to enhance the original feature map by focusing on the most relevant spatial information. In short, the spatial attention is computed as Formula (3):
$$M_s(F) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{7 \times 7}([F^{s}_{avg};\ F^{s}_{max}])\big) \tag{3}$$
where $\sigma$ denotes the sigmoid function and $f^{7 \times 7}$ represents a convolution operation with a filter size of $7 \times 7$. The channel and spatial attention together form the Convolutional Block Attention Module (CBAM). As shown in Figure 5, we incorporated the CBAM between the max pooling layer and the average pooling layer.
  3.2.2. Detection Network
After feature extraction by the backbone network, the detection network predicts masks and their corresponding categories. The Pyramid Pooling Module captures global contextual information through multi-scale pooling. It pools the input feature map at different scales, then upsamples these results back to the original size and merges them with the original feature map to form a feature-rich contextual map. The feature map extracted through the pyramid pooling module is further processed by a convolution, and a 1 × 1 convolutional layer maps the feature map to the number of class channels to produce the final segmentation result. In the original PSPNet, the upsampling process used bilinear interpolation.
Upsampling is a technique used in computer vision and image processing to enhance image resolution. The most common approach is interpolation, which estimates new pixel values based on known ones. Nearest-neighbor interpolation is simple and fast but can produce jagged edges, while bilinear interpolation offers smoother results by using the linear relationship between four neighboring pixels. However, bilinear interpolation can lose high-frequency details, particularly at edges, impacting the accuracy of mask generation tasks. To address these shortcomings, we introduce a Pixel Shuffle layer as part of the upsampling process to preserve more features. As shown in Figure 6, the Pixel Shuffle layer is an efficient upsampling method that avoids introducing redundancy. Unlike convolution or pooling, which focus on feature extraction; padding, which adds potentially redundant zeros or ones; or interpolation, which estimates new pixel values based on neighboring pixels, Pixel Shuffle operates purely by rearranging the pixels in a feature map without any additional operations. This rearrangement-only approach enhances upsampling efficiency, reduces the likelihood of redundant pixel addition, and preserves the features in the feature map, thereby improving the quality of the generated masks.
The principle of Pixel Shuffle is given by Formulas (4) and (5):
$$C_{in} = C_{out} \times r^{2} \tag{4}$$
$$\mathrm{PS}:\ \mathbb{R}^{C_{in} \times H \times W} \rightarrow \mathbb{R}^{C_{out} \times rH \times rW} \tag{5}$$
where $C_{in}$ is the number of input channels, and $H$ and $W$ are the height and width of the feature map, respectively. $C_{out}$ represents the number of channels in the final output image, and $r$ is the upsampling factor. The input channels $C_{in}$ are divided into $C_{out}$ groups, with each group containing $r^{2}$ channels. These $r^{2}$ channels correspond to spatial information that will be rearranged in the output image. Specifically, each group of $r^{2}$ channels is reshaped into a small $r \times r$ grid and then reorganized into the final spatial dimension, enlarging the resolution of the feature map.
Pixel Shuffle enhances our model’s upsampling process by preserving more feature details and improving mask quality without introducing redundant computations, thereby increasing segmentation accuracy. 
Figure 7 illustrates its application in our work.
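The rearrangement performed by Pixel Shuffle can be made concrete on nested lists. The sketch below is a plain-Python illustration with a hypothetical function name; the channel ordering (channel `c * r * r + i * r + j` supplies the pixel at row offset `i` and column offset `j` of each output block) is the common sub-pixel convolution convention and is assumed here, since the text does not spell it out.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C_out * r * r) x H x W map into C_out x (H * r) x (W * r).

    `x` is a nested list indexed as x[channel][row][col]; `r` is the
    upsampling factor. No values are computed, only moved.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r):
        raise ValueError("channel count must be divisible by r * r")
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # row offset inside each r x r block
            for j in range(r):      # column offset inside each r x r block
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1 x 2 channels become one 2 x 4 channel when r = 2.
toy_input = [[[1, 5]], [[2, 6]], [[3, 7]], [[4, 8]]]
print(pixel_shuffle(toy_input, 2))
```

Each input position contributes one 2 × 2 block of the output, so every output value is an untouched input value; this is the sense in which no interpolated or padded pixels are introduced.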
  3.3. Evaluation Metrics
Semantic segmentation is classification at the pixel level and is commonly evaluated using metrics such as Pixel Accuracy (PA), Class Pixel Accuracy (CPA), Mean Pixel Accuracy (mPA), Intersection over Union (IoU), and Mean Intersection over Union (mIoU). In this study, Mean Average Precision (mAP) and mIoU are used as the primary evaluation metrics. The IoU represents the overlap ratio between the predicted mask and the label pixels and evaluates whether the foreground areas are predicted accurately. The mIoU is the arithmetic mean of the IoU values over all categories, reflecting the overall pixel overlap in the dataset, which makes it a crucial index for assessing the precision of image segmentation. The mPA represents the average proportion of correctly classified pixels within each category. The expression for mIoU is given by Formula (6):
$$\mathrm{mIoU} = \frac{1}{N}\sum_{i=1}^{N}\frac{|P_i \cap G_i|}{|P_i \cup G_i|} \tag{6}$$
where $N$ represents the total number of classes, $P_i$ refers to the predicted segmentation region for the $i$-th class, and $G_i$ denotes the ground truth region for the same class. The numerator, $|P_i \cap G_i|$, is the area of intersection between the predicted and actual regions, and the denominator, $|P_i \cup G_i|$, is the area of their union. By averaging the IoU values across all classes, mIoU reflects the overall accuracy of the segmentation model.
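As a worked example of the mIoU definition, the sketch below counts intersection and union pixels per class on flattened label maps. The function name `mean_iou` and the toy labels are illustrative; classes that appear in neither the prediction nor the ground truth are skipped, which is one common convention and an assumption here.

```python
def mean_iou(pred, truth, num_classes):
    """Per-class IoU and their mean for flat sequences of class ids."""
    per_class = {}
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, truth) if p == cls or t == cls)
        if union:  # skip classes absent from both prediction and truth
            per_class[cls] = inter / union
    if not per_class:
        raise ValueError("no class is present in prediction or ground truth")
    return sum(per_class.values()) / len(per_class), per_class


# Class 0 = background, class 1 = coral (toy 4-pixel image).
toy_pred = [0, 0, 1, 1]
toy_truth = [0, 1, 1, 1]
miou, per_class = mean_iou(toy_pred, toy_truth, num_classes=2)
print(miou, per_class)
```

Here the background IoU is 1/2 and the coral IoU is 2/3, giving an mIoU of 7/12.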
The expression for mAP is given by Formulas (7) and (8):
$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i \tag{7}$$
$$AP = \sum_{k} P(k)\,\Delta R(k) \tag{8}$$
In these formulas, $N$ is the total number of classes, and $AP_i$ represents the Average Precision (AP) for the $i$-th class. AP is computed by summing the product of precision ($P(k)$) and the change in recall $\Delta R(k)$ at each threshold $k$, capturing the trade-off between precision and recall. By averaging AP across all classes, mAP provides an overall assessment of the model’s performance.
These two metrics (mIoU and mAP) complement each other by providing insights into both pixel-level segmentation accuracy and class-level prediction performance. In this study, they collectively provide a comprehensive evaluation of the model’s capabilities, namely accurately delineating object boundaries while maintaining robust classification across all categories in coral health monitoring.
  3.4. Counting
In this approach, we apply the model to the extracted key frame images and render each prediction as a binary mask, replacing the background class with black pixels (0, 0, 0) and the coral class with white pixels (255, 255, 255), as shown in Figure 8. We then calculate the proportion of coral pixels in each image and average these proportions to approximate the area ratio of living coral in the ecological survey videos, using the area ratio as an indicator of living coral coverage. However, the accuracy of this method depends to some extent on the color contrast and the visibility of the coral in the images. If the coral and the background exhibit minimal color differences, or the coral lacks prominent three-dimensional structure in the images, the model’s ability to identify coral may be compromised. Consequently, particularly in complex underwater environments, the model may encounter limitations and require further improvements to enhance its ability to detect objects with low color contrast or limited three-dimensional features. Future research will focus on integrating additional dimensions of information, such as depth data or variations in lighting, to improve the model’s applicability.
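The counting step reduces to two averages, which the following sketch makes explicit. It assumes single-channel masks in which coral pixels are 255 and background pixels are 0; the function names and the toy masks are hypothetical and the numbers are not taken from the study.

```python
def coral_fraction(mask):
    """Share of coral (white, 255) pixels in one binary mask (rows of 0/255)."""
    total = sum(len(row) for row in mask)
    coral = sum(1 for row in mask for value in row if value == 255)
    return coral / total


def live_coral_cover(masks):
    """LCC in percent: mean coral fraction over all key-frame masks."""
    fractions = [coral_fraction(m) for m in masks]
    return 100.0 * sum(fractions) / len(fractions)


# Two toy 2 x 2 masks with 1 and 3 coral pixels respectively.
toy_masks = [
    [[255, 0], [0, 0]],
    [[255, 255], [255, 0]],
]
print(live_coral_cover(toy_masks))
```

Averaging per-frame fractions gives every key frame equal weight, which is why removing repeated frames beforehand matters: a lingering shot over bare substrate or dense coral would otherwise be counted several times.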
To sum up, this approach leverages the latest advances in computer vision to provide a more efficient and accurate method for assessing coral health, which is crucial for the ongoing conservation efforts and understanding of coral ecosystems.
  4. Experimental Results and Discussion
  4.1. Experimental Environment
The experimental platform operates under Ubuntu 18.04 with an RTX4090 GPU, using PyTorch 2.0.0, Python 3.8, and CUDA 11.8, as listed in Table 1.
In terms of training, we employed stochastic gradient descent (SGD) as the optimization technique, with a momentum of 0.9 and a weight decay of 0.0001. The initial learning rate was set to 0.01, and the model was trained for 300 epochs with an input size of 640 × 640 and a batch size of 8. Overfitting was mitigated using weight decay and early stopping based on validation performance.
  4.2. Analysis of the Experimental Results
The comparative analysis on the validation set shows the performance of the various models, as outlined in Table 2. For instance, our improved PSPNet achieved a mIoU of 89.51% and mAP of 94.47%. The segmentation effects of different models on the validation set are illustrated in Table 2, demonstrating the effectiveness of each from PSPNet to our improved model. This improvement can be attributed to the optimized network architecture and the integration of key techniques such as spatial and channel attention mechanisms and the Pixel Shuffle upsampling. Additionally, in our evaluation metrics, IoU0 represents the IoU for the background area, whereas IoU1 represents the IoU for the coral area; together they show the model’s performance in distinguishing between the background and coral regions. Table 3 presents the results of the ablation experiments.
To better evaluate our model, we selected three different transect video samples. The first transect has the highest coverage of live corals, mainly consisting of Rose Corals. The corals in the second transect are mainly Shore Corals, Honeycomb Corals, and Horn Honeycomb Corals. The third transect has the lowest coverage, including many bleached and dead Staghorn Corals. Due to the water body’s absorption of different light wavelengths, the colors in the footage exhibited biases, with the first and third transects appearing more yellow. 
Figure 9 compares the absolute errors between the interpretation results of different models and the manual interpretation for the three transects, while 
Figure 10 presents the LCC interpretation results from each model alongside the manual interpretation, with coverage 1–3 representing the names of the three transects. From these two figures, it is clear that our model achieves the best accuracy. The three transects correspond to the majority of underwater ecological survey scenarios for coral reefs. These nuances highlight the challenges and the effectiveness of our semantic segmentation approach in diverse and realistic underwater conditions.
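The comparison against manual interpretation comes down to per-transect absolute errors and their mean. A minimal sketch follows; the function names are illustrative and the LCC values are placeholders chosen for easy arithmetic, not results from this study.

```python
def absolute_errors(predicted, manual):
    """Per-transect absolute difference between automatic and manual LCC (%)."""
    return [abs(p - m) for p, m in zip(predicted, manual)]


def mean_absolute_error(predicted, manual):
    errors = absolute_errors(predicted, manual)
    return sum(errors) / len(errors)


# Hypothetical LCC values (%) for three transects; NOT the paper's data.
model_lcc = [52.0, 30.5, 12.0]
manual_lcc = [55.0, 28.5, 10.0]
print(absolute_errors(model_lcc, manual_lcc))
print(mean_absolute_error(model_lcc, manual_lcc))
```

Because the errors are expressed in percentage points of cover, a given error weighs more heavily on a low-cover transect than on a high-cover one, which is worth keeping in mind when comparing transects.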
The overall results indicate that our improved semantic segmentation model performs well in live coral segmentation and coverage interpretation under various ecological conditions. The gains in mIoU and in coral area delineation indicate that the model improves LCC interpretation and can contribute to coral reef ecological health monitoring.
  4.3. Discussion
Our model improved the accuracy of LCC estimation, contributing to ongoing efforts in coral reef health monitoring. The increasing frequency and intensity of global coral bleaching have led to a sharp decline in LCC, highlighting the urgent need for better monitoring methods [31]. By automating coral recognition and segmentation, our model provides a potentially more efficient alternative to traditional manual transect methods, particularly in addressing large-scale coral bleaching events.
However, the model presents certain technical limitations. Initial training and setup for users unfamiliar with deep learning techniques may take approximately 10–15 h. Additionally, the model faces challenges in accurately segmenting degraded or low-contrast corals, which is crucial in coral bleaching scenarios. In complex marine environments, interactions between corals and other organisms can also lead to false positives. While the model generally provides more consistent LCC estimates than manual techniques, a small but statistically significant difference remains. This suggests room for further error reduction, especially in challenging conditions.
From a cost-benefit perspective, the time required to learn and implement the model may be notable, but the potential long-term gains in accuracy and efficacy are expected to balance it. Future work will focus on refining the model’s ability to distinguish between healthy, bleached, and dead corals while improving its adaptability to a broader range of environments. Expanding the training dataset and increasing the diversity of coral types and conditions will further enhance the model’s robustness. Additionally, underwater imaging challenges such as light attenuation, distortion, and water turbidity must be addressed.
Underwater robots equipped with sensors offer a promising platform for coral imaging. Previous work by Mogstad et al. (2019) demonstrated the use of unmanned robots for coral imaging. Integrating hyperspectral imaging with our model could enhance coral detection and classification, providing a more comprehensive approach to reef health monitoring [32]. We also plan to explore 3D scanning technologies to create detailed coral reef models, allowing for more precise multi-scale surveys and further contributing to coral reef conservation efforts.
  5. Conclusions
This study introduces a deep learning-based automated method for efficiently estimating live coral cover (LCC) and optimizing the monitoring process. The method enhances semantic segmentation performance in complex underwater environments by improving the PSPNet model with channel and spatial attention mechanisms and incorporating pixel shuffle modules. The experimental results show that the proposed model achieves a mean Intersection over Union (mIoU) of 89.51% and a mean Average Precision (mAP) of 94.47%, surpassing other models such as Deeplab v3+, U-Net, and HRNet, while being more efficient than traditional manual interpretation. Additionally, the model exhibits a low mean absolute error of 4.17% in LCC estimation relative to manual interpretation, further confirming its precision.
The findings highlight substantial advancements in automating LCC estimation and coral image segmentation, reducing manual workload, and providing critical support for large-scale ecological monitoring. Consequently, the proposed method offers promising improvements in coral reef health assessments and holds great potential as a valuable tool for the protection and management of coral reef ecosystems. In the future, integrating this approach with remotely operated underwater vehicles (ROVs) could further enhance real-time monitoring efficiency, contributing to the long-term conservation and management of coral reefs.