Coral-YOLO: An Intelligent Optical Vision Sensing Framework for High-Fidelity Marine Habitat Monitoring and Forecasting
Abstract
1. Introduction
1. We identify and analyze a fundamental design limitation in modern object detectors: the lack of explicit cross-scale reasoning at the prediction stage, where conventional decoupled heads process multi-scale features through isolated computational streams.
2. We propose two core, original architectural solutions: the HAB-Head for holistic spatial reasoning and MCAttention for robust, stochastic feature learning.
3. We integrate these into a comprehensive, high-performance detection framework, Coral-YOLO, which our novel SFD-Conv and SD-Loss further enhance.
4. We provide compelling empirical evidence of our framework’s superiority through extensive experiments on a challenging, real-world time-series coral dataset, not only achieving state-of-the-art detection performance but also establishing the first robust framework for temporal bleaching prediction.
2. Related Work
2.1. Computer Vision in Marine Ecology
2.2. The Evolution of Real-Time Object Detection
2.3. Attention and Dynamic Mechanisms
3. The Proposed Coral-YOLO Framework
3.1. Overall Architecture
- A Redesigned Head for Holistic Prediction: To address the critical spatial reasoning deficit of conventional detectors, we completely replace the standard decoupled head. Our new Holistic Attention Block Head (HAB-Head) is explicitly engineered to realize our Principle of Holistic Prediction. Unlike a standard head that processes multi-scale features in isolated streams, the HAB-Head employs a deep, multi-path architecture that forces explicit interaction between features from different receptive fields within each prediction level. This deep contextual reasoning, detailed in Section 3.2, enables the model to leverage the full context of the scene (e.g., using a large colony to identify a small patch on its branch), dramatically improving performance on challenging, small, or partially occluded objects.
- Robust Feature Learning Blocks: To combat the feature robustness deficit caused by stochastic underwater visual variations, we introduce our Monte Carlo Attention (MCAttention) modules. These modules are seamlessly integrated into the network’s deep backbone and neck using a novel block structure, termed A2C2f_MoCA. Operating on our Principle of Stochastic Feature Learning, MCAttention deliberately introduces randomness into the attention mechanism’s context generation process only during training. As detailed in Section 3.3, this forces the model to learn intrinsic, invariant feature relationships that are robust to statistical shifts in the input data, resulting in a model that generalizes better to unseen environmental conditions. (A minimal illustrative sketch of this training/inference asymmetry follows this list.)
- An Adaptive Backbone Foundation: To ensure that the features supplied to the neck and head are of the highest possible quality, we enhance the backbone’s foundational layers. We replace the standard, static strided convolutions used for downsampling with our novel Stochastic Fourier Dynamic Convolution (SFD-Conv). Standard convolutions apply a fixed, learned kernel to all inputs, whereas SFD-Conv dynamically generates a unique, sample-specific downsampling kernel for each input image. As explained in Section 3.4.1, this mechanism allows the backbone to adapt its feature aggregation strategy based on the specific characteristics of each scene, creating a more flexible and data-driven foundation for the entire network.
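To make the training/inference asymmetry described above concrete, here is a minimal PyTorch sketch of an MCAttention-style channel-attention module: during training, the channel context is averaged over a random subset of spatial positions rather than the full global average pool (GAP); at inference, it reverts to deterministic GAP. The class name, the `sample_ratio` parameter, and the MLP sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MCAttention(nn.Module):
    """Sketch of a Monte Carlo channel-attention module.

    Training: the channel context is averaged over randomly sampled
    spatial positions (with replacement) instead of the full global
    average pool. Inference: reverts to deterministic GAP.
    """

    def __init__(self, channels: int, reduction: int = 16, sample_ratio: float = 0.25):
        super().__init__()
        self.sample_ratio = sample_ratio
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        flat = x.flatten(2)                                   # (B, C, H*W)
        if self.training:
            # Stochastic context: random spatial subset per sample.
            k = max(1, int(self.sample_ratio * h * w))
            idx = torch.randint(0, h * w, (b, 1, k), device=x.device)
            ctx = flat.gather(2, idx.expand(b, c, k)).mean(dim=2)
        else:
            # Deterministic context: plain global average pooling.
            ctx = flat.mean(dim=2)
        return x * self.fc(ctx).view(b, c, 1, 1)              # channel re-weighting
```

Because the stochastic path is active only under `self.training`, the module behaves as a standard deterministic attention block at test time, which is consistent with the design rationale given above.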
- (a) The Detection Stream (for Year t): The feature map from the current year, F(t), flows into a PANet-style neck. Our Monte Carlo Attention (MCAttention) modules are strategically integrated into the deeper layers of both the backbone and neck to learn robust, scale-invariant features. The resulting multi-scale feature pyramids are then processed by our novel Holistic Attention Block Head (HAB-Head), which replaces the standard decoupled head. The HAB-Head performs deep, cross-scale reasoning to generate the final, highly accurate bounding box detections for the current year, t.
- (b) The Forecasting Stream (for Year t + 1): In parallel, the sequence of high-level features, [F(t − 1), F(t)], is directed to the Temporal Forecasting Module. This module utilizes a ConvLSTM network to model the spatio-temporal dynamics of coral health changes. Based on the learned trajectory, it generates a probabilistic forecast map for the future state in Year t + 1. (An end-to-end sketch of the two streams follows this list.)
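Read together, the two streams admit a compact functional sketch. The submodule names (`backbone`, `neck`, `hab_head`, `forecaster`) and the dict-style pyramid access are assumptions about the interface, not the released code.

```python
import torch

def coral_yolo_forward(model, img_prev, img_curr):
    """Functional sketch of the dual-stream design (assumed interface).

    `model.backbone` / `model.neck` are assumed to yield a dict of
    pyramid levels ({"P3": ..., "P4": ..., "P5": ...}); `model.hab_head`
    and `model.forecaster` are the detection head and ConvLSTM module.
    """
    feats_prev = model.neck(model.backbone(img_prev))   # F(t-1) pyramid
    feats_curr = model.neck(model.backbone(img_curr))   # F(t)   pyramid

    # (a) Detection stream: holistic head on the current-year pyramid.
    detections_t = model.hab_head(feats_curr)

    # (b) Forecasting stream: ConvLSTM over the P4 feature sequence.
    seq = torch.stack([feats_prev["P4"], feats_curr["P4"]], dim=1)  # (B, 2, C, H, W)
    forecast_t1 = model.forecaster(seq)                 # probabilistic map for t+1

    return detections_t, forecast_t1
```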
3.2. Core Innovation I: The HAB-Head for Holistic Spatial Reasoning
3.2.1. The Holistic Attention Block (HAB): A Multi-Path Fusion Architecture
| Algorithm 1: The Forward Propagation of the Holistic Attention Block (HAB) |
|---|
| Require: input feature map X; Ensure: output feature map Y. // Define Module Components: Conv Block (foundational feature transformation); Deep Blocks (deep feature path); LGA₂ (Local-Global Attention unit with p = 2); LGA₄ (Local-Global Attention unit with p = 4); ECA (Efficient Channel Attention module); SAM (Spatial Attention Module). // Feature Extraction Stage: X is processed in parallel by the Conv Block, the Deep Blocks, and the two LGA units. // Fusion and Refinement Stage: the path outputs are combined by Dense Residual Fusion, then refined sequentially by ECA and SAM. return Y |
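To make Algorithm 1 concrete, the following is a minimal PyTorch sketch of one plausible wiring of the HAB. The helper names (`conv_bn_act`, `ECA`, `SAM`), the hidden sizes, and the exact residual pattern are our assumptions; the LGA units are sketched in Section 3.2.2 and default to identity here.

```python
import torch
import torch.nn as nn

def conv_bn_act(c):
    """3x3 Conv-BN-SiLU block standing in for the foundational transform."""
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False),
                         nn.BatchNorm2d(c), nn.SiLU(inplace=True))

class ECA(nn.Module):
    """Efficient Channel Attention: 1-D conv over the GAP channel context."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)
    def forward(self, x):
        w = x.mean(dim=(2, 3)).unsqueeze(1)                     # (B, 1, C)
        w = torch.sigmoid(self.conv(w)).transpose(1, 2).unsqueeze(-1)
        return x * w                                            # (B, C, 1, 1) broadcast

class SAM(nn.Module):
    """CBAM-style spatial attention over channel mean/max maps."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, 7, padding=3, bias=False)
    def forward(self, x):
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], 1)
        return x * torch.sigmoid(self.conv(s))

class HAB(nn.Module):
    """Holistic Attention Block: a plausible reading of Algorithm 1."""
    def __init__(self, c: int, lga2: nn.Module = None, lga4: nn.Module = None):
        super().__init__()
        self.base = conv_bn_act(c)                              # foundational transform
        self.deep = nn.Sequential(conv_bn_act(c), conv_bn_act(c))  # deep feature path
        self.lga2 = lga2 or nn.Identity()                       # LGA unit, p = 2
        self.lga4 = lga4 or nn.Identity()                       # LGA unit, p = 4
        self.eca, self.sam = ECA(), SAM()
        self.fuse = conv_bn_act(c)

    def forward(self, x):
        # Feature extraction stage: parallel context paths.
        f0 = self.base(x)
        f1 = self.deep(f0)
        a2, a4 = self.lga2(f0), self.lga4(f0)
        # Dense residual fusion of all paths.
        fused = self.fuse(x + f0 + f1 + a2 + a4)
        # Sequential attention refinement: channel, then spatial.
        return self.sam(self.eca(fused))
```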
3.2.2. The Local-Global Attention (LGA) Unit: Principled Contextual Fusion
- Local Encoding: The input feature map is first tokenized into a sequence of non-overlapping p × p patches. To create a compact and channel-rich representation for each patch, we first perform an unconventional mean-pooling operation across the channel dimension, transforming each (p, p, C) patch into a p²-dimensional vector. This vector is then passed through a two-layer MLP to encode it into a local patch representation, $t_i$.
- Attentional Filtering and Refinement: This stage forms the core of the LGA’s reasoning process. The encoded local patches first pass through a lightweight self-gating mechanism, which allows them to recalibrate based on their own feature distribution. Subsequently, the central filtering operation occurs. To determine the importance of each local patch, we compute a relevance mask, $M$. This mask acts as a gate, suppressing patches that are semantically irrelevant to the global scene context defined by a learnable global prompt $g$. The calculation is as follows: $M_i = \langle t_i, g \rangle / (\lVert t_i \rVert\,\lVert g \rVert + \epsilon)$, where $\epsilon$ is a small constant added for numerical stability to prevent division by zero. The filtered feature, $M_i \cdot t_i$, is then passed through an adaptive linear transformation, parameterized by a learnable matrix $W$, to obtain the final refined patch feature.
- Spatial Reconstruction: Finally, the sequence of refined patch features is reassembled into its original spatial grid, upsampled via bilinear interpolation, and passed through a final 1 × 1 convolution. This final step enables cross-patch channel mixing, resulting in the final output attention map. (A PyTorch sketch of the full unit follows.)
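The three stages above map onto a short module. The hidden dimension, the exact gating form (a non-negative clamp of the normalized inner product), and the class name are reconstructions, not the authors' released code; spatial dimensions are assumed divisible by p.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LGA(nn.Module):
    """Local-Global Attention unit (sketch of the three stages above)."""

    def __init__(self, channels: int, p: int = 2, dim: int = 64, eps: float = 1e-6):
        super().__init__()
        self.p, self.eps = p, eps
        self.mlp = nn.Sequential(nn.Linear(p * p, dim), nn.ReLU(inplace=True),
                                 nn.Linear(dim, dim))           # local encoding
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # self-gating
        self.prompt = nn.Parameter(torch.randn(dim))            # learnable global prompt g
        self.proj = nn.Linear(dim, dim)                         # learnable matrix W
        self.out = nn.Conv2d(dim, channels, kernel_size=1)      # cross-patch channel mixing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape                                    # h, w divisible by p
        p = self.p
        # Local encoding: mean over channels, then p x p patch tokens.
        pooled = x.mean(dim=1, keepdim=True)                    # (B, 1, H, W)
        tokens = F.unfold(pooled, kernel_size=p, stride=p)      # (B, p*p, N)
        t = self.mlp(tokens.transpose(1, 2))                    # (B, N, dim)
        # Attentional filtering: self-gating, then prompt-based relevance mask.
        t = t * self.gate(t)
        sim = (t * self.prompt).sum(-1)                         # (B, N) inner products
        mask = sim / (t.norm(dim=-1) * self.prompt.norm() + self.eps)
        t = self.proj(t * mask.clamp(min=0).unsqueeze(-1))      # non-negative gate + W
        # Spatial reconstruction: grid reshape, bilinear upsample, 1x1 conv.
        grid = t.transpose(1, 2).reshape(b, -1, h // p, w // p)
        return self.out(F.interpolate(grid, size=(h, w),
                                      mode="bilinear", align_corners=False))
```

Instances of this unit with p = 2 and p = 4 can be passed to the HAB sketch above as `lga2` and `lga4`.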

3.2.3. Architectural Instantiation
3.3. Core Innovation II: MCAttention for Stochastic Feature Learning
3.3.1. Motivation: From Deterministic Context Modeling to Stochastic Invariance
3.3.2. MCAttention: A Dual-Phase Stochastic Pooling Mechanism
3.3.3. Integration and Design Rationale
3.4. Supporting Architectural and Training Refinements
3.4.1. SFD-Conv: Frequency-Domain Adaptive Downsampling
- $\mathcal{F}_x(u, v)$ is the value of the frequency spectrum of sample $x$ at frequency coordinate (u, v).
- $\Omega$ is the set of pre-selected dominant low-frequency coordinates.
- $\{w_k\}$ is the small set of learnable basis weights.
- $\alpha_x$ is a random attention vector generated for each sample $x$.
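The following is a minimal PyTorch sketch of this mechanism under stated assumptions: $\Omega$ is simplified to the first few rFFT coefficients of the channel-averaged input, the per-sample kernel is a softmax-weighted mixture of a small learnable kernel bank, and the random attention vector $\alpha_x$ is realized as additive Gaussian noise applied only during training. None of these choices are claimed to match the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SFDConv(nn.Module):
    """Stochastic Fourier Dynamic Convolution (sketch).

    A strided convolution whose kernel is mixed per sample from a small
    bank of learnable bases, with mixing weights derived from dominant
    low-frequency FFT coefficients and perturbed stochastically in training.
    """

    def __init__(self, c_in, c_out, k=3, stride=2, n_basis=4, n_freq=8, noise=0.1):
        super().__init__()
        self.stride, self.noise, self.n_freq = stride, noise, n_freq
        # Small set of learnable basis kernels.
        self.bases = nn.Parameter(torch.randn(n_basis, c_out, c_in, k, k) * 0.02)
        # Maps |F_x(u, v)| on the selected coordinates to basis weights.
        self.to_weights = nn.Linear(n_freq, n_basis)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frequency spectrum of the channel-averaged input.
        spec = torch.fft.rfft2(x.mean(dim=1)).abs()             # (B, H, W//2+1)
        # Simplified Omega: the first n_freq (lowest-frequency) coefficients.
        omega = spec.flatten(1)[:, :self.n_freq]
        alpha = torch.softmax(self.to_weights(omega), dim=-1)   # (B, n_basis)
        if self.training:
            # Random attention vector: stochastic perturbation, training only.
            alpha = torch.softmax(alpha + self.noise * torch.randn_like(alpha), -1)
        # Sample-specific kernel as a weighted sum of the bases.
        kernels = torch.einsum("bn,noihw->boihw", alpha, self.bases)
        # Apply one kernel per sample via a grouped convolution.
        bsz, c_in, h, w = x.shape
        out = F.conv2d(x.reshape(1, bsz * c_in, h, w),
                       kernels.reshape(-1, c_in, *kernels.shape[-2:]),
                       stride=self.stride, padding=kernels.shape[-1] // 2,
                       groups=bsz)
        return out.reshape(bsz, -1, out.shape[-2], out.shape[-1])
```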
3.4.2. SD-Loss: Scale-Aware Dynamic Regression
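The framework summary (see the module-comparison table) describes SD-Loss as a CIoU variant whose distance and shape penalty terms receive dynamic, area-dependent weights. A hedged sketch is below; the specific weighting schedule (`w_dist = 2 − s`, `w_shape = 1 + s`, with `s` the normalized target scale) is purely illustrative, not the paper's formula.

```python
import math
import torch

def sd_loss(pred, target, img_area=640 * 640, eps=1e-7):
    """Scale-aware dynamic regression loss (sketch over (x1, y1, x2, y2) boxes)."""
    # Intersection over union.
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Normalized center distance (DIoU term) over the enclosing-box diagonal.
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    diag = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps
    dist = ((cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2) / diag

    # Aspect-ratio (shape) penalty from CIoU.
    v = (4 / math.pi ** 2) * (
        torch.atan((target[:, 2] - target[:, 0]) / (target[:, 3] - target[:, 1] + eps))
        - torch.atan((pred[:, 2] - pred[:, 0]) / (pred[:, 3] - pred[:, 1] + eps))) ** 2
    a = v / (1 - iou + v + eps)

    # Dynamic, area-dependent weights (assumed schedule).
    s = (area_t / img_area).clamp(0, 1).sqrt()       # 0 = tiny box, 1 = image-sized
    w_dist, w_shape = 2.0 - s, 1.0 + s               # small boxes stress distance
    return (1 - iou + w_dist * dist + w_shape * a * v).mean()
```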
3.5. Temporal Forecasting Module for Bleaching Prediction
3.5.1. Rationale and Method Selection
3.5.2. Architecture and Implementation Details
- Feature Extraction: For a given reef site, a sequence of two consecutive images, Y(t − 1) and Y(t), is passed through the pre-trained and frozen backbone of our Coral-YOLO model. We extract the feature maps from the P4 level of the neck, yielding a sequence of two high-level feature maps, [F(t − 1), F(t)], each with dimensions [B, 512, 40, 40]. Freezing the backbone is a crucial design choice: it ensures that the forecasting module learns from a stable, semantically rich feature space that has already been optimized for the coral detection task, thereby preventing catastrophic forgetting and accelerating convergence.
- Spatio-Temporal Encoding with ConvLSTM: The extracted feature sequence [F(t − 1), F(t)] is then processed by a ConvLSTM network. Our implementation uses a stack of two ConvLSTM layers, each with a hidden dimension of 256 and a 3 × 3 kernel. The network iteratively processes the sequence, updating its internal cell state C and hidden state H at each step. The final hidden state, H(t), from the last layer represents a rich, learned synthesis of the observed spatio-temporal dynamics of coral health changes between Year t − 1 and Year t.
- Future State Prediction: The final hidden state H(t) encapsulates all the necessary information to make a forecast. It is passed through a simple Prediction Head, which consists of a 1 × 1 convolution followed by a channel-wise Softmax activation function. This head generates the final probabilistic forecast map, P̂(t + 1), with dimensions [B, 4, 40, 40]. The four channels correspond to the predicted probabilities for each of our defined health states (Healthy, Sub-healthy, Bleached, Dead) at each spatial location, maintaining the 40 × 40 resolution throughout the forecasting process. (A minimal sketch of this pipeline follows.)
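Because the shapes are fully specified above (P4 features of [B, 512, 40, 40], two ConvLSTM layers with hidden dimension 256 and 3 × 3 kernels, and a 1 × 1 conv + softmax head over four health states), a compact PyTorch sketch is possible. The ConvLSTM cell follows Shi et al.; the class names and the `backbone_p4` helper in the usage comment are illustrative.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: one conv produces the four gates."""
    def __init__(self, c_in: int, c_hid: int, k: int = 3):
        super().__init__()
        self.c_hid = c_hid
        self.gates = nn.Conv2d(c_in + c_hid, 4 * c_hid, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class ForecastModule(nn.Module):
    """Two stacked ConvLSTM layers plus a 1x1 conv + softmax head (sketch)."""
    def __init__(self, c_in: int = 512, c_hid: int = 256, n_states: int = 4):
        super().__init__()
        self.cell1 = ConvLSTMCell(c_in, c_hid)
        self.cell2 = ConvLSTMCell(c_hid, c_hid)
        self.head = nn.Conv2d(c_hid, n_states, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, t, _, hh, ww = feats.shape                   # feats: (B, T, 512, 40, 40)
        z = feats.new_zeros(b, self.cell1.c_hid, hh, ww)
        h1 = c1 = h2 = c2 = z
        for step in range(t):                           # iterate over [F(t-1), F(t)]
            h1, c1 = self.cell1(feats[:, step], h1, c1)
            h2, c2 = self.cell2(h1, h2, c2)
        return torch.softmax(self.head(h2), dim=1)     # (B, 4, 40, 40) probabilities

# Usage: extract frozen P4 features for years t-1 and t, then forecast t+1.
# `backbone_p4` is a hypothetical helper returning the frozen P4 feature map.
# with torch.no_grad():
#     f_prev, f_curr = backbone_p4(img_prev), backbone_p4(img_curr)
# probs_next = ForecastModule()(torch.stack([f_prev, f_curr], dim=1))
```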

3.5.3. Training Strategy
4. Experiments
4.1. Experimental Setup
4.1.1. Datasets and Metrics
4.1.2. Implementation Details
Hyperparameter Selection and Guidelines for Adaptation
4.2. Core Quantitative Analysis
4.2.1. Ablation Study: Validating the Core Design Principles
Analysis of Component Complexity
- Validating the HAB-Head against the Spatial Reasoning Deficit: The most significant finding is the impact of our HAB-Head. Model B demonstrates a substantial +0.9% AP improvement (p < 0.001), with the gain disproportionately concentrated on small objects (APs: +1.1%). This statistically significant result provides strong evidence that the HAB-Head’s holistic reasoning mechanism is exceptionally effective at resolving contextual ambiguities of small, fragmented targets, directly validating our Principle of Holistic Prediction. The improvement is consistent across all three runs (std = 0.2%), indicating robust performance gains rather than random fluctuation.
- Validating MCAttention against the Feature Robustness Deficit: Model C yields a statistically significant +0.5% AP gain (p = 0.003). Unlike the HAB-Head, this improvement is more evenly distributed across scales (APs +0.5%, APm +0.3%, APl +0.2%), suggesting a global enhancement in feature quality. The minimal parameter increase (+0.6 M, 3% overhead) demonstrates that MCAttention acts as an efficient regularizer, learning robust, scale-agnostic representations with negligible computational cost.
- Isolated Contributions of Supporting Components: To address the complete decomposition of performance gains, we isolated the contributions of SFD-Conv (Model D) and SD-Loss (Model E). SFD-Conv provides a modest but statistically significant +0.3% AP improvement (p = 0.021), primarily enhancing the backbone’s adaptive downsampling capability. SD-Loss contributes +0.2% (p = 0.048), with its scale-aware weighting mechanism providing more stable gradients during training. While individually less impactful than the core components, their contributions are non-negligible and statistically verifiable.
- Synergy and Non-Linear Interaction Effects: The combination of HAB-Head and MCAttention (Model F) achieves +1.4% AP, demonstrating that the two components are fully complementary: their joint contribution (+1.4%) equals the sum of their individual effects (+0.9% + 0.5%) with no destructive interaction, and it carries higher statistical confidence (p < 0.001) than either component alone. Adding SFD-Conv to this combination (Model G) yields a +1.6% improvement, showing that the supporting components provide incremental yet meaningful refinements when the core architecture is already strong.
- The Fully Integrated Framework: Coral-YOLO (Model H) achieves the highest performance with +1.8% AP over the baseline (p < 0.001). The progression from Model G (+1.6%) to Model H (+1.8%) demonstrates that SD-Loss contributes an additional +0.2% even when all architectural components are present, confirming its value as a training-level optimization. Critically, the low standard deviation across runs (σ = 0.2% for AP) demonstrates that this performance gain is not due to fortunate random initialization but reflects genuine architectural improvements.
- Statistical Significance and Practical Relevance: All core components (Models B, C, F, G, H) achieve p < 0.01, providing strong statistical evidence against the null hypothesis that improvements are due to chance. Even the supporting components (Models D, E) achieve p < 0.05. The effect sizes, while seemingly modest in absolute terms (+0.2% to +0.9% for individual components), are substantial in the context of state-of-the-art object detection, where improvements beyond 0.5% AP are considered significant advancements. More importantly, the cumulative +1.8% AP gain with p < 0.001 represents a rare and meaningful leap in a mature field.
- Scale-Specific Impact Analysis: A particularly revealing finding is the differential impact across object scales. The HAB-Head’s disproportionate effect on small objects (the APs improvement accounts for 61% of the total AP gain) directly validates our hypothesis about cross-scale reasoning deficits in conventional heads. In contrast, MCAttention’s evenly distributed improvement across scales (APs, APm, and APl all improve, by +0.2% to +0.5%) confirms its role as a global feature regularizer. This divergence in impact patterns provides mechanistic evidence that the two components address fundamentally different architectural bottlenecks.
4.2.2. Comparison with State-of-the-Art Methods
4.2.3. Mechanistic Explanation for Superior Small-Object Detection
1. The “Spatial Reasoning Deficit” in Standard Decoupled Heads:
2. How the HAB-Head Solves This Deficit:
4.2.4. Temporal Forecasting Performance
4.3. Analysis of Model Generalization and Robustness
4.3.1. Zero-Shot Cross-Dataset Generalization
4.3.2. Robustness to Environmental Variations
4.4. Qualitative and Diagnostic Insights
4.4.1. Visual Evidence of Superiority
4.4.2. Diagnosing the Mechanism of the HAB-Head
5. Discussion and Conclusions
5.1. Principal Findings and Their Implications
5.2. Significance for Marine Ecology and Conservation
5.3. Limitations and Future Directions
5.4. Concluding Remarks
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Katiyar, K. Marine Resources: Plethora of Opportunities for Sustainable Future. Mar. Biomass Biorefin. Bioprod. Environ. Bioremediat. 2024, 367, 367–388. [Google Scholar]
- Spalding, M.; Burke, L.; Wood, S.A.; Ash, N.J.; Mills, D. Mapping the global value and distribution of coral reef tourism. Mar. Policy 2017, 82, 104–113. [Google Scholar] [CrossRef]
- Hughes, T.P.; Anderson, K.D.; Connolly, S.R.; Heron, S.F.; Kerry, J.T.; Lough, J.M.; Baird, A.H.; Baum, J.K.; Berumen, M.L.; Bridge, T.C.; et al. Spatial and temporal patterns of mass bleaching of corals in the Anthropocene. Science 2018, 359, 80–83. [Google Scholar] [CrossRef] [PubMed]
- Wang, B.; Hua, L.; Mei, H.; Tao, A.; Yang, Y.; Chen, Z. Impact of climate change on the dynamic processes of marine environment and feedback mechanisms: An overview. Arch. Comput. Methods Eng. 2024, 31, 3377–3408. [Google Scholar] [CrossRef]
- Hedley, J.D.; Roelfsema, C.M.; Chollett, I.; Harborne, A.R.; Heron, S.F.; Weeks, S.; Skirving, W.J.; Strong, A.E.; Eakin, C.M.; Christensen, T.R.L.; et al. Remote sensing of coral reefs for monitoring and management: A review. Remote Sens. 2016, 8, 118. [Google Scholar] [CrossRef]
- Jiao, L.; Zhang, F.; Liu, F.; Yang, S.; Li, L.; Feng, Z.; Qu, R. A survey of deep learning-based object detection. IEEE Access 2019, 7, 128837–128868. [Google Scholar] [CrossRef]
- Li, J.; Xu, W.; Deng, L.; Liu, Z.; Liu, Y.; Wang, J.; Zhao, C. Deep learning for visual recognition and detection of aquatic animals: A review. Rev. Aquac. 2023, 15, 409–433. [Google Scholar] [CrossRef]
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
- Elmezain, M.; Saoud, L.S.; Sultan, A.; Al-Jubouri, Q.; Al-Malla, F. Advancing underwater vision: A survey of deep learning models for underwater object recognition and tracking. IEEE Access 2025, 13, 17830–17867. [Google Scholar] [CrossRef]
- Islam, M.J.; Edge, C.; Xiao, Y.; Luo, P.; Mehtaz, M.; Morse, C.; Sattar, J. Semantic segmentation of underwater imagery: Dataset and benchmark. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 1769–1776. [Google Scholar]
- Jalal, A.; Salman, A.; Mian, A.; Shortis, M.; Shafait, F. Fish detection and species classification in underwater environments using deep learning with temporal information. Ecol. Inform. 2020, 57, 101088. [Google Scholar] [CrossRef]
- Chen, Q.; Beijbom, O.; Chan, S.; Crandall, D.J.; Dell, A.I.; Fifer, J. A new deep learning engine for CoralNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3693–3702. [Google Scholar]
- Zhang, H.; Li, M.; Zhong, J.; Liu, Y.; Zhou, H.; Li, Y. CNet: A novel seabed coral reef image segmentation approach based on deep learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 767–775. [Google Scholar]
- Zheng, Z.; Liang, H.; Hua, B.S.; Feng, Z.; Li, B.; Chan, S. Coralscop: Segment any coral image on this planet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 28170–28180. [Google Scholar]
- Hughes, T.P.; Barnes, M.L.; Bellwood, D.R.; Cinner, J.E.; Cumming, G.S.; Jackson, J.B.C.; Kleypas, J.; Van De Leemput, I.A.; Lough, J.M.; Morrison, T.H.; et al. Coral reefs in the Anthropocene. Nature 2017, 546, 82–90. [Google Scholar] [CrossRef]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics (Version 8.0.0). 2023. Available online: https://github.com/ultralytics/ultralytics/blob/main/docs/en/models/yolov8.md (accessed on 1 October 2025).
- Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 1–21. [Google Scholar]
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
- Jocher, G.; Qiu, J. Ultralytics YOLOv11, Version 11.0.0. 2024. Available online: https://github.com/ultralytics/ultralytics/blob/main/docs/en/models/yolo11.md (accessed on 1 October 2025).
- Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
- Lei, M.; Li, S.; Wu, Y.; Zhang, Y.; Gao, S. YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception. arXiv 2025, arXiv:2506.17733. [Google Scholar]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
- Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11030–11039. [Google Scholar]
- Yang, B.; Bender, G.; Le, Q.V.; Ngiam, J. CondConv: Conditionally parameterized convolutions for efficient inference. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): San Diego, CA, USA, 2019; Volume 32, pp. 1305–1314. [Google Scholar]
- Ghiasi, G.; Lin, T.Y.; Le, Q.V. DropBlock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): San Diego, CA, USA, 2018; Volume 31, pp. 10727–10737. [Google Scholar]
- Huang, G.; Sun, Y.; Liu, Z.; Sedra, D.; Weinberger, K.Q. Deep networks with stochastic depth. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 646–661. [Google Scholar]
- Chen, L.; Gu, L.; Li, L.; Liu, K.; Liu, P. Frequency Dynamic Convolution for Dense Image Prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10 June 2025; pp. 30178–30188. [Google Scholar]
- Yang, J.; Liu, S.; Wu, J.; Li, X.; Zhang, S. Pinwheel-shaped convolution and scale-based dynamic loss for infrared small target detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 9202–9210. [Google Scholar]
- Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
- Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): San Diego, CA, USA, 2015; Volume 28, pp. 802–810. [Google Scholar]
- Tzutalin. LabelImg: Graphical Image Annotation Tool. Available online: https://github.com/tzutalin/labelImg (accessed on 1 October 2024).
- Siebeck, U.E.; Marshall, N.J.; Klüter, A.; Hoegh-Guldberg, O. Monitoring coral bleaching using a colour reference card. Coral Reefs 2006, 25, 453–460. [Google Scholar] [CrossRef]
- Bouthillier, X.; Laurent, C.; Vincent, P. Unreproducible research is reproducible. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 725–734. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
- Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An empirical study of designing real-time object detectors. arXiv 2022, arXiv:2212.07784. [Google Scholar] [CrossRef]
- Treibitz, T.; Schechner, Y.Y.; Kunz, C.; Singh, H. Flat refractive geometry. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 51–65. [Google Scholar] [CrossRef]
- Wang, D.; Shelhamer, E.; Liu, S.; Olshausen, B.; Darrell, T. Tent: Fully test-time adaptation by entropy minimization. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
- Köhler, M.; Eisenbach, M.; Gross, H.M. Few-shot object detection: A comprehensive survey. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 11958–11978. [Google Scholar] [CrossRef]
- Liang, P.P.; Zadeh, A.; Morency, L.P. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Comput. Surv. 2024, 56, 264. [Google Scholar] [CrossRef]
- Kuzmin, A.; Nagel, M.; Van Baalen, M.; Beloborodov, D.; Zhemchuzhnikov, E.; Murzin, M. Pruning vs quantization: Which is better? Adv. Neural Inf. Process. Syst. 2023, 36, 62414–62427. [Google Scholar]
| Method/Reference | Key Merits | Identified Limitations/Research Gap |
|---|---|---|
| Patch-based Classification (e.g., CoralNet) | 1. Highly effective for large-scale coral coverage estimation. 2. Successfully automated a previously manual task. | 1. Lacks object-level detail; cannot delineate individual colonies. 2. Struggles with fine-grained morphological analysis or tracking. |
| Semantic Segmentation (e.g., CNet, CoralSCOP) | 1. Provides precise, pixel-level delineation of coral boundaries. 2. Enables detailed morphological studies. | 1. Often computationally expensive, compromising real-time performance. 2. Primary focus is on improving backbone networks, leaving the crucial role of the prediction head underexplored. |
| Modern YOLO Series (e.g., YOLOv9-v12 [19,20,21,22]) | 1. State-of-the-art speed-accuracy trade-off for real-time detection. 2. Incorporates advanced modules like PGI and attention. | 1. Fundamentally relies on a decoupled head architecture, which processes scales independently, leading to a spatial reasoning deficit. 2. Models are deterministic, struggling to handle stochastic underwater visual variations, leading to a feature robustness deficit. |
| Standard Channel Attention (e.g., SE-Net) | 1. Effectively recalibrates channel-wise feature responses. 2. Simple and efficient plug-in module. | Relies on a deterministic context (via Global Average Pooling), making it less robust to statistical shifts between training and test data. |
| Our Proposed Module | Architectural Purpose | Baseline Counterpart (YOLOv12-m) | Key Innovation/Nature of Change |
|---|---|---|---|
| HAB-Head (Holistic Attention Block Head) | Prediction Head (Classification and Regression) | Standard Decoupled Head | Complete Replacement. Moves from processing features in isolated, parallel streams to a deep, multi-path architecture that enforces holistic, cross-scale feature interaction before prediction. |
| MCAttention (Monte Carlo Attention) | Channel Attention and Feature Regularization | Standard Deterministic Attention (e.g., SE-Net style) | Novel Enhancement. Replaces a deterministic context-generation mechanism (like Global Average Pooling) with a stochastic pooling and sampling strategy during training to learn feature invariance. Reverts to being deterministic for inference. |
| SFD-Conv (Stochastic Fourier Dynamic Conv) | Downsampling Layer | Standard Strided Convolution (with fixed weights) | Enhancement. Replaces static, fixed-weight convolutions with a dynamic convolution where kernels are generated on-the-fly for each input sample in the frequency domain, enabling sample-wise adaptivity. |
| SD-Loss (Scale-aware Dynamic Loss) | Bounding Box Regression Loss | Standard CIoU Loss (with fixed penalty weights) | Enhancement. Modifies the standard CIoU loss by introducing dynamic weights for the distance and shape penalty terms. Weights are adjusted based on the target’s area, improving learning stability for multi-scale objects. |
| Metric | Value |
|---|---|
| Overall Cohen’s Kappa (classification) | 0.847 |
| Healthy class | 0.912 |
| Sub-healthy class | 0.758 |
| Bleached class | 0.881 |
| Dead class | 0.894 |
| Mean IoU (localization) | 0.832 ± 0.091 |
| Hyperparameter | Value |
|---|---|
| Optimizer and Scheduler | |
| Optimizer | AdamW |
| Initial Learning Rate (lr0) | 1 × 10⁻³ |
| Final Learning Rate Factor (lrf) | 0.01 |
| Scheduler | Cosine Annealing |
| Momentum | 0.937 |
| Weight Decay | 5 × 10⁻⁴ |
| Data and Augmentation | |
| Image Resolution | 640 × 640 |
| Batch Size | 16 |
| Augmentations | Mosaic, MixUp, HSV Jitter |
| Training and Loss | |
| Epochs | 300 |
| Warm-up Epochs | 5.0 |
| Loss Weights (box, cls, dfl) | 7.5, 0.5, 1.5 |
| ID | Model Configuration | Params (M) | GFLOPs | AP | AP50 | APs | ΔAP (vs. A) | p-Value |
|---|---|---|---|---|---|---|---|---|
| A | Baseline (YOLOv12-m) | 20.2 | 67.5 | 48.5 | 74.2 | 35.2 | - | - |
| B | +HAB-Head | 22.3 | 73.0 | 49.4 | 75.1 | 36.3 | +0.9 | <0.001 |
| C | +MCAttention | 20.8 | 68.7 | 49.0 | 74.7 | 35.7 | +0.5 | 0.003 |
| D | +SFD-Conv | 20.4 | 67.9 | 48.8 | 74.5 | 35.5 | +0.3 | 0.021 |
| E | +SD-Loss | 20.3 | 67.9 | 48.7 | 74.4 | 35.4 | +0.2 | 0.048 |
| F | +HAB-Head + MCA | 22.9 | 74.0 | 49.9 | 75.6 | 37.1 | +1.4 | <0.001 |
| G | +HAB-Head + MCA + SFD | 23.0 | 74.1 | 50.1 | 75.8 | 37.5 | +1.6 | <0.001 |
| H | Coral-YOLO (Full Model) | 23.1 | 74.2 | 50.3 | 76.1 | 37.8 | +1.8 | <0.001 |
| Model Configuration | Augmentation | AP (%) | Performance Gap (ΔAP) |
|---|---|---|---|
| Baseline (YOLOv12-m) | Basic | 45.2 | - |
| Coral-YOLO (Full Model) | Basic | 47.1 | +1.9 |
| Model | AP (%) | AP50 (%) | APs (%) | APm (%) | APl (%) | Params (M) | GFLOPs | FPS | Train Time (h) |
|---|---|---|---|---|---|---|---|---|---|
| Faster R-CNN (R-50) [39] | 44.8 | 70.5 | 30.5 | 47.1 | 59.6 | 41.5 | 180.0 | 68 | 42.3 |
| RTMDet-m [40] | 46.5 | 72.8 | 32.8 | 48.9 | 61.2 | 25.1 | 51.3 | 156 | 26.8 |
| YOLOv8-m [18] | 47.2 | 73.6 | 33.6 | 49.6 | 61.8 | 25.9 | 78.9 | 138 | 29.2 |
| YOLOv10-m [20] | 47.8 | 74.1 | 34.2 | 50.1 | 62.3 | 25.4 | 59.1 | 148 | 27.6 |
| YOLOv11-m [21] | 48.1 | 74.5 | 34.7 | 50.5 | 62.6 | 20.1 | 65.2 | 145 | 28.1 |
| YOLOv12-m (Baseline) [22] | 48.5 | 74.2 | 35.2 | 50.8 | 62.3 | 20.2 | 67.5 | 142 | 28.5 |
| Coral-YOLO (Ours) | 50.3 | 76.1 | 37.8 | 52.4 | 63.2 | 23.1 | 74.2 | 135 | 31.8 |
| Method | Description | PFA (%) ↑ | Macro-F1 (%) ↑ | MAE ↓ | Params (M) |
|---|---|---|---|---|---|
| Naive Baseline | Assumes Year 3 state = Year 2 state | 76.8 | 48.2 | 0.428 | 0 |
| Markov Chain | First-order transition probabilities from training set | 78.5 | 52.6 | 0.389 | <0.1 |
| LSTM-Object | LSTM on tracked object-level state sequences | 79.9 | 55.3 | 0.371 | 0.8 |
| ConvLSTM-Object | ConvLSTM on rasterized object state maps | 80.8 | 57.1 | 0.352 | 2.1 |
| Transformer-Seq | Temporal Transformer on feature sequences | 81.2 | 58.4 | 0.341 | 4.7 |
| ST-Transformer | Spatio-Temporal Transformer with 3D attention | 81.6 | 59.2 | 0.332 | 8.3 |
| Coral-YOLO Forecast (Ours) | ConvLSTM on frozen backbone features | 82.7 | 61.8 | 0.308 | 3.2 |
| Method | Healthy Prec. | Healthy Rec. | Sub-Healthy Prec. | Sub-Healthy Rec. | Bleached Prec. | Bleached Rec. | Dead Prec. | Dead Rec. |
|---|---|---|---|---|---|---|---|---|
| Naive Baseline | 84.2 | 91.3 | 32.1 | 18.7 | 71.5 | 68.2 | 88.3 | 94.7 |
| Markov Chain | 85.7 | 89.5 | 38.6 | 28.4 | 74.2 | 71.6 | 89.1 | 93.8 |
| ConvLSTM-Object | 87.3 | 90.8 | 42.5 | 35.9 | 76.8 | 74.3 | 90.5 | 94.2 |
| ST-Transformer | 88.1 | 91.2 | 45.7 | 39.1 | 78.3 | 75.9 | 91.2 | 94.8 |
| Coral-YOLO (Ours) | 89.5 | 92.1 | 51.3 | 44.6 | 80.7 | 78.2 | 92.3 | 95.1 |
| Year 1 → 2 Transition | Naive | Markov | ConvLSTM-Object | ST-Transformer | Ours | Test Samples |
|---|---|---|---|---|---|---|
| Healthy → Healthy | 92.3 | 93.1 | 93.8 | 94.2 | 95.1 | 3421 |
| Healthy → Sub-healthy | 45.2 | 58.7 | 64.3 | 68.9 | 73.6 | 1087 |
| Sub-healthy → Bleached | 52.8 | 61.5 | 67.2 | 71.3 | 76.8 | 892 |
| Bleached → Dead | 81.7 | 84.3 | 86.5 | 87.9 | 90.2 | 654 |
| Stable Bleached | 74.5 | 77.8 | 80.1 | 81.4 | 83.7 | 1203 |
| Recovery (any → Healthy) | 38.6 | 49.2 | 55.7 | 59.8 | 64.5 | 428 |
| Model | AP (%) on EILAT | ∆AP (vs. CR-Mix) | Retention Rate (%) |
|---|---|---|---|
| YOLOv11-m | 40.3 | −7.8 | 83.8% |
| YOLOv12-m (Baseline) | 41.2 | −7.3 | 85.0% |
| Coral-YOLO (Ours) | 44.6 | −5.7 | 88.7% |
| Condition | Category | YOLOv12-m | Coral-YOLO | ∆AP | Advantage Ratio |
|---|---|---|---|---|---|
| Water Clarity | Clear | 52.1 | 53.8 | +1.7 | 1.0× |
| Water Clarity | Moderate | 47.6 | 49.8 | +2.2 | 1.3× |
| Water Clarity | Turbid | 39.2 | 42.6 | +3.4 | 2.0× |
| Illumination | Good | 50.8 | 52.6 | +1.8 | 1.0× |
| Illumination | Poor | 43.2 | 46.9 | +3.7 | 2.1× |
| Challenging | All 3 Adverse | 35.6 | 40.8 | +5.2 | 3.1× |