Morphology-Adaptive YOLO for Underwater Crack Detection in Hydraulic Structures

Chen, Zhe; Zhou, Changning; Guo, Jingkun; Yin, Guangjun

doi:10.3390/w18101241

¹ College of Information Science and Engineering, Hohai University, Changzhou 213200, China

² State Key Laboratory of Hydrology-Water Resources and Hydraulic Engineering, Hohai University, Nanjing 210098, China

* Author to whom correspondence should be addressed.

Water 2026, 18(10), 1241; https://doi.org/10.3390/w18101241

Submission received: 23 April 2026 / Revised: 17 May 2026 / Accepted: 18 May 2026 / Published: 21 May 2026

(This article belongs to the Section Hydraulics and Hydrodynamics)

Highlights

What are the main findings?

MA-YOLO improves underwater crack detection on UCD, increasing mAP@0.5 and mAP@0.5:0.95 by 1.7 and 3.0 percentage points over YOLOv11, respectively.
BRF-SPPM captures multi-scale crack features with broader receptive fields.
MAM improves crack representation through morphology-adaptive attention.
X-Head enhances the detection of close-range magnified underwater cracks.
MA-YOLO improves detection accuracy while maintaining a lightweight computational design.

What are the implications of the main findings?

Morphology-aware modeling improves underwater hydraulic-structure inspection.
The method supports efficient visual monitoring of submerged infrastructure.
MA-YOLO shows potential for future ROV-based crack inspection.

Abstract

Accurate underwater crack detection is essential for the condition monitoring of hydraulic structures. However, reliable detection in underwater inspection imagery remains challenging because of low visibility, complex backgrounds, large-scale variation, and irregular crack morphology. To improve detection under these conditions, we develop MA-YOLO, a YOLOv11-based detector that adapts feature representation to underwater crack morphology. The proposed method integrates a broader receptive field spatial pyramid pooling module to enhance multi-scale feature extraction, a morphological attention module to improve the representation of irregular crack patterns, and an extra-large detection head to better detect magnified cracks in close-range underwater images. Experiments on the underwater crack dataset (UCD) show that MA-YOLO outperforms both conventional detectors and recent underwater object-specific detectors. Relative to YOLOv11, MA-YOLO increases mAP@0.5 from 91.2% to 92.9% and mAP@0.5:0.95 from 60.0% to 63.0%, while maintaining a lightweight architecture and real-time inference capability. The results demonstrate the effectiveness of morphology-adaptive feature modeling for image-based underwater crack detection and its potential for practical monitoring of submerged hydraulic structures.

Keywords: underwater crack detection; hydraulic structure; morphology-adaptive YOLO; morphological attention

1. Introduction

As critical components of water conservancy and hydropower projects, hydraulic structures—such as dams, embankments, and gates—play a vital role in irrigation, flood control, and power generation. During long-term service, these structures are continuously exposed to high hydrostatic pressure and high-velocity water flow, which may induce the initiation and development of underwater cracks. Such underwater cracks can propagate from the surface into the internal matrix, leading to significant degradation of structural integrity. Therefore, accurate underwater crack detection is of great importance for routine operation and maintenance, as well as for emergency inspection and rescue operations.

Despite the importance of underwater crack detection, existing deep learning-based methods still suffer from significant limitations in practical scenes. In practical underwater inspections, image data are commonly acquired by remotely operated vehicles (ROVs) or other mobile platforms and can be processed either offline or online. Offline processing is useful for final inspection reports, while online detection can provide immediate crack localization and support close-range reinspection during field operations. Therefore, lightweight real-time detection is considered a desirable capability for ROV-assisted hydraulic-structure inspection. However, directly applying standard YOLO architectures to underwater crack detection exposes two critical structural limitations. First, underwater cracks often exhibit pronounced multi-scale and large-scale morphological variations. Although enlarging the receptive field is necessary to capture such global characteristics, naively increasing the size of conventional isotropic (square) convolutions inevitably amplifies noise interference in feature representation, thereby causing severe feature degradation. Second, the annotation statistics of UCD indicate that underwater cracks exhibit considerable aspect-ratio variability, including strip-like, block-like, and tree-like patterns. Such morphological variability makes standard isotropic convolution less consistent with crack geometry and motivates morphology-adaptive feature modeling. To systematically address the above challenges, this paper proposes a novel morphology-adaptive model, termed MA-YOLO, which tailors layer-wise feature representation to the morphological characteristics of underwater cracks. 
Specifically, MA-YOLO is designed to address three key issues in underwater crack detection: (1) the multi-scale nature of underwater cracks, (2) their morphological variability, and (3) the image magnification problem introduced by close-range imaging in practical inspection scenes. To this end, three dedicated modules are developed. First, BRF-SPPM enlarges the receptive field to enhance multi-scale feature extraction, thereby preserving the integrity of underwater crack representations, while introducing wavelet convolution to strengthen the response to low-frequency morphological features and suppress high-frequency noise. Second, MAM adaptively emphasizes convolutional kernels whose morphological attributes are better aligned with morphological patterns of underwater cracks, thus improving the precision of feature representation. Third, X-Head is incorporated to improve detection accuracy for magnified cracks in close-range underwater images. Together, these three modules establish a coarse-to-fine feature representation framework for underwater crack detection and effectively address practical challenges in underwater inspection. Experimental results on UCD show that MA-YOLO outperforms both conventional detectors and state-of-the-art underwater object detection methods while maintaining real-time inference capability, demonstrating its strong potential for practical underwater crack detection applications.

The main contributions of this paper can be summarized as follows.

BRF-SPPM is proposed to enhance the completeness of underwater crack feature representation. Specifically, Switchable Atrous Convolution (SAConv) is employed to enlarge the receptive field and adapt to cracks of different scales. Wavelet convolution is further integrated to exponentially expand the effective receptive field without increasing the number of trainable parameters. By strengthening the low-frequency morphological response, the proposed module enables the model to capture the global morphological characteristics of underwater cracks while exhibiting strong robustness against high-frequency underwater background noise.
MAM is designed to improve the precision of underwater crack feature representation. By adaptively aligning feature representation with the morphological characteristics of underwater cracks, MAM enhances the consistency between learned features and crack morphology, thereby addressing the challenge of crack morphological variability.
X-Head is introduced to improve detection performance on magnified crack images, where crack instances may appear as enlarged morphologies due to close-range underwater imaging. This design improves the adaptability of the proposed method to practical underwater inspection scenes.

2. Related Work

This section reviews studies related to conventional underwater object detection, underwater crack-specific detection, and YOLO-series object detection models.

2.1. Application of Deep Learning in Conventional Underwater Object Detection

Underwater object detection has been widely studied because it supports tasks such as resource investigation, environmental observation, and marine monitoring. However, underwater environments pose unique challenges, such as light attenuation, scattering, color distortion, and suspended particles, all of which significantly degrade detection performance. To handle these degradations, many studies have adopted deep-learning detectors to improve recognition accuracy and robustness.

Most existing DL-based conventional underwater object detection methods are adapted from well-established detectors, which can be categorized into two-stage (e.g., Faster R-CNN [1] and Sparse R-CNN [2]) and one-stage frameworks (e.g., RetinaNet [3] and YOLO [4]). Building upon these frameworks, researchers have introduced various improvements tailored to underwater scenes. For example, Song et al. proposed Boosting R-CNN [5], a two-stage detector that corrects detection errors in the first stage and enhances feature representation in the second stage. Gao et al. proposed an augmented weighted bidirectional feature pyramid algorithm (AWBiFPN) with a consistent supervision module to effectively integrate multi-scale features [6]. Bhalla et al. (2025) introduced a pyramid visual Transformer into R-CNN and developed HydR-CNN [7], which employs multi-level feature extraction to improve detection under low-visibility conditions. Although two-stage detectors often achieve competitive accuracy, their region proposal and refinement stages usually introduce higher computational cost. Therefore, two-stage detectors remain useful for offline analysis, but may be less favorable for online inspection scenarios where response latency is important.

For efficiency-oriented underwater detection, one-stage detectors, especially YOLO-based models, have been widely explored. Liu et al. developed TC-YOLO by introducing self-attention and coordinate attention into the backbone and neck, together with an improved transfer label assignment strategy [8]. Feng proposed CEH-YOLO, which combines high-order deformable attention with an enhanced SPPF module to strengthen underwater feature representation [9]. Lian-Suo et al. proposed MTD-YOLOv5 [10], which employs grayscale equalization and multi-scale perceptual hybrid pooling to enhance underwater image contrast and capture latent object information. Cao et al. introduced BG-YOLO by jointly constructing an underwater image enhancement branch and a detection branch to guide feature learning [11]. To mitigate the impact of underwater lighting noise, Ma et al. proposed a weighted multi-error entropy YOLO network based on YOLOv8 [12]. Li et al. developed SU-YOLO by integrating spiking neural networks and integer-addition-based denoising [13], achieving improved efficiency and reduced computational overhead.

Although these detectors have achieved promising performance on conventional underwater objects, such as marine organisms and submersibles, they still lack the fine-grained morphological representation required for underwater crack detection. As a result, they are prone to severe false positives and missed detections in noisy and turbid underwater environments.

2.2. Application of Deep Learning in Underwater Crack-Specialized Detection

Compared with general underwater object detection, underwater crack detection is more limited by scarce dedicated datasets and irregular crack morphology. Li et al. used transfer learning with a deep residual network for crack classification and weakly supervised localization in concrete dams [14]. Chen et al. developed A-DCDNet [15], which combines statistical features and attention mechanisms to handle data imbalance. Cao proposed a large-scale crack detection method that integrates image stitching and segmentation [16]. Li et al. built a pixel-level framework for identifying and quantifying tunnel lining defects using machine vision and deep learning [17]. Huang et al. enhanced underwater crack images with white balance correction and bilateral filtering before applying an improved YOLOv9-OREPA detector [18]. To enhance underwater crack feature extraction, Shi et al. proposed CrackYOLO [19], which reduces parameter count while introducing redesigned skip connections and a feature fusion module to aggregate spatial and channel information; a genetic algorithm was further employed for hyperparameter optimization. To address data scarcity, Huang et al. utilized CycleGAN to synthesize realistic underwater dam crack images from in-air samples [20]. Lin et al. proposed a generative diffusion model [21], UWDM, for cross-domain underwater crack image enhancement and developed SDI-ASF-YOLO11 for crack detection. Guo et al. introduced CrackWave-R [22], which integrates the Discrete Wavelet Transform (DWT) to extract frequency-domain features for underwater crack detection.

However, existing crack-specific detectors still exhibit several notable limitations. Preprocessing strategies based on image enhancement, such as CycleGAN and bilateral filtering, may introduce artificial textures that distort the morphological representation of micro-cracks. More importantly, most existing methods mainly focus on image enhancement, generic feature fusion, conventional attention mechanisms, or dataset construction, but they do not explicitly encode the geometric morphology of underwater cracks, including extreme aspect ratios, tree-like branching, and large-scale variation under close-range imaging. This research gap limits their ability to distinguish true cracks from underwater textures and to completely localize magnified crack regions in hydraulic inspection scenes. Therefore, a detector that can explicitly adapt feature representation to crack morphology is still needed for underwater hydraulic-structure inspection.

2.3. YOLO Series

The last decade has witnessed the remarkable progress of object detection technology, with the YOLO series emerging as one of the most influential frameworks. YOLO, introduced in 2015, pioneered the one-stage detection paradigm, achieving unprecedented inference speed at the cost of reduced localization accuracy. YOLO9000 subsequently introduced multi-scale training and anchor boxes to improve small-object detection [23]. YOLOv3 adopted Darknet-53 and residual connections [24], further enhancing detection performance. YOLOv4 introduced CSPDarknet-53 and the CIoU loss to improve efficiency and bounding box regression accuracy [25], while YOLOv5 integrated the Focus module and Mosaic data augmentation. YOLOv6 optimized convolutional modules and attention mechanisms [26], and YOLOv7 introduced model reparameterization and an extended efficient layer aggregation network [27]. YOLOv8 further refined network architecture [28], loss functions, and label assignment strategies. YOLOv9 incorporated programmable gradient information and a generalized ELAN structure to reduce computational complexity [29], while YOLOv10 focused on end-to-end real-time optimization [30]. YOLOv11 [31], released in 2024, built upon YOLOv8 and reduced parameter count by approximately 20% at comparable accuracy. It introduced the C3k2 module for dynamic bottleneck control and the C2PSA attention mechanism to enhance the detection of small and occluded objects. More recently, YOLOv12 integrated attention mechanisms into a one-stage framework [32], and YOLOv13 employed hypergraphs and the FullPAD paradigm to model high-order feature relationships and improve adaptability to complex scenes [33].

While segmentation-based methods are widely used in crack analysis because they can provide fine-grained crack contours, they usually require pixel-level annotations and higher annotation costs. In this study, the primary goal is rapid crack localization and screening in underwater inspection; therefore, a bounding-box-based detector is more consistent with the task setting. Among one-stage detectors, YOLOv11 provides a favorable trade-off between accuracy and efficiency, making it suitable as the baseline architecture in this study.

2.4. Comparison to the State-of-the-Art Method

As discussed above, most existing underwater crack detection methods are adapted from conventional underwater or terrestrial object detectors, and their improvements mainly focus on image preprocessing, feature attention mechanisms, and dataset construction. Although baseline models such as YOLOv11 offer high computational efficiency, they still suffer from fundamental limitations when directly applied to underwater crack detection. Specifically, their fixed receptive fields are inadequate for capturing crack features with substantial morphological scale variation in practical engineering scenarios. Moreover, their inherently isotropic feature extraction is less consistent with the aspect-ratio variability and diverse morphology of underwater cracks, resulting in imprecise feature representation.

Therefore, existing approaches generally fail to explicitly model the intrinsic morphological characteristics of underwater cracks. In contrast, the proposed MA-YOLO is designed to directly address the underlying physical properties of underwater cracks. Specifically, BRF-SPPM is introduced to overcome the receptive-field limitation, MAM is designed to adapt feature representation to the anisotropic morphology of underwater cracks, and X-Head is incorporated to capture the global morphology of magnified crack images. Through these designs, MA-YOLO systematically compensates for the inherent deficiencies of the standard YOLOv11 baseline architecture and establishes stronger consistency between mathematical feature representation and the physical characteristics of underwater cracks. This principle forms the central motivation of this study and provides the methodological foundation for improving underwater crack detection performance.

3. Morphology-Adaptive YOLO

MA-YOLO is a one-stage architecture specifically designed for underwater crack detection. It introduces three novel modules—BRF-SPPM, MAM, and X-Head—to systematically address the challenges of multi-scale, morphology-variable underwater cracks, as well as practical issues encountered in underwater inspection. These modules collectively enhance the adaptability of MA-YOLO to the characteristics of underwater cracks in hydraulic structures, enabling accurate detection while maintaining real-time processing capability.

3.1. Overall Architecture

The overall architecture of MA-YOLO is shown in Figure 1. Based on YOLOv11, MA-YOLO consists of three main components, namely, a backbone network for feature extraction, a neck network for feature fusion and enhancement, and a detection head for final prediction. The proposed framework follows a coarse-to-fine detection paradigm, where the coarse-grained stage emphasizes the integrity of underwater crack representation, while the fine-grained stage focuses on detailed morphological characterization. Specifically, in the coarse-grained stage, BRF-SPPM integrates Switchable Atrous Convolution (SAConv) [34] and wavelet convolution [35] to enlarge the receptive field and more comprehensively capture multi-scale crack features. In the fine-grained stage, MAM extracts morphological priors of underwater cracks to guide morphological attention, thereby adaptively aligning feature fusion with variable crack morphology. In addition, X-Head is incorporated as an extra detection head to address magnified crack images, enabling effective detection in close-range underwater inspection scenes.

3.2. Broader Receptive Field-Spatial Pyramid Pooling Module (BRF-SPPM)

In the original YOLOv11, the SPPF module is primarily designed for feature representation in terrestrial images, where objects of interest usually appear at moderate scales and within relatively limited observation ranges. In contrast, underwater cracks exhibit substantial scale variation, and large crack instances are frequently encountered, as illustrated in Figure 2. Each image in Figure 2 is annotated with the corresponding bounding-box size under the original resolution of 1920 × 1080 pixels. The two crack images in the first row correspond to extremely large-scale cracks: the first exhibits a conventional morphology and occupies 13.5% of the original image area, whereas the second presents a tree-like morphology and accounts for up to 66% of the image area, covering nearly two-thirds of the entire image. By contrast, the two images in the second row depict small-scale cracks, each occupying approximately 0.6% of the original image area. These examples clearly highlight the severe scale variation in underwater cracks, under which a larger receptive field becomes necessary to preserve the integrity of feature representation.

To address this issue, BRF-SPPM is proposed, and its architecture is illustrated in Figure 3. Specifically, SAConv is introduced after each max-pooling layer and consists of two parallel convolutional branches with small and large dilation rates, respectively. This design enables adaptive feature representation for multi-scale underwater cracks, particularly for large-scale instances. More specifically, the branch with the smaller dilation rate is emphasized for small cracks, whereas the branch with the larger dilation rate is prioritized for large cracks. The computation process of SAConv is given as follows.

$$\mathrm{SAConv} = S(X_{in}^{SAC}) \cdot \mathrm{Conv}(X_{in}^{SAC}, ω, r) + (1 - S(X_{in}^{SAC})) \cdot \mathrm{Conv}(X_{in}^{SAC}, ω + Δω, 3r), \tag{1}$$

where $X_{in}^{SAC}$ denotes the input features, $ω$ is the convolution weight, $r$ is the dilation rate (a hyperparameter), $Δω$ denotes a trainable weight, and the switch function $S(\cdot)$ is implemented as an average pooling.
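The blending in Equation (1) can be illustrated with a minimal one-dimensional sketch in plain Python. It is not the authors' implementation: the real module operates on 2-D feature maps with learned weights, and here the switch values are simply passed in as an argument rather than computed by average pooling. The names `dilated_conv1d`, `effective_kernel_size`, and `saconv_1d` are hypothetical helpers introduced for illustration.

```python
def dilated_conv1d(signal, weights, rate):
    """Zero-padded 1-D cross-correlation with dilation `rate` (output keeps the input length)."""
    half = len(weights) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            j = i + (k - half) * rate  # taps are spaced `rate` samples apart
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def effective_kernel_size(k, rate):
    """Span (in samples) covered by a k-tap kernel with dilation `rate`."""
    return k + (k - 1) * (rate - 1)


def saconv_1d(signal, w, delta_w, rate, switch):
    """Eq. (1) in 1-D: blend a rate-r branch and a rate-3r branch.

    `switch` holds one value in [0, 1] per position (assumed given here).
    """
    small = dilated_conv1d(signal, w, rate)
    w_large = [a + b for a, b in zip(w, delta_w)]  # weight of the second branch: w + delta_w
    large = dilated_conv1d(signal, w_large, 3 * rate)
    return [s * a + (1.0 - s) * b for s, a, b in zip(switch, small, large)]


impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
w, delta_w = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
print(saconv_1d(impulse, w, delta_w, 1, [1.0] * 9))  # switch = 1: only the rate-r branch
print(saconv_1d(impulse, w, delta_w, 1, [0.0] * 9))  # switch = 0: only the rate-3r branch
print(effective_kernel_size(3, 1), effective_kernel_size(3, 3))  # 3 7
```

With a three-tap kernel, the rate-r branch spans 3 samples and the rate-3r branch spans 7, which is the sense in which the second branch suits larger cracks.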

Despite its adaptability, SAConv still exhibits two critical limitations when the spatial receptive field is naively enlarged using standard convolutions. First, obtaining a global receptive field generally entails a substantial increase in trainable parameters, thereby reducing computational efficiency. Second, particularly in underwater environments, directly enlarging an isotropic convolutional kernel aggravates background noise accumulation, as the network inevitably incorporates abundant localized high-frequency interference, such as suspended particles, which in turn suppresses discriminative underwater crack features. To address these limitations, we integrate the wavelet transform (WT) with depthwise separable convolution (DSConv) to construct WT-DSConv, which is incorporated into both the backend and the skip connections of BRF-SPPM. WT-DSConv exponentially expands the effective receptive field without introducing additional parameter overhead, thereby providing broader spatial context. Furthermore, by enhancing the low-frequency morphological response, it facilitates the extraction of the global morphological characteristics of large underwater cracks while naturally mitigating interference from high-frequency underwater background noise. Specifically, the original depthwise convolution in DSConv is replaced with wavelet transform convolution (WTConv), enabling large-scale feature representation through a three-stage pipeline composed of WT, small-kernel depthwise convolution, and inverse wavelet transform (IWT), as illustrated below:

$$Y = \mathrm{IWT}(\mathrm{Conv}(W, \mathrm{WT}(X_{in}^{WT}))), \tag{2}$$

where $X_{in}^{WT}$ represents the input feature of the wavelet convolution, and $W$ denotes the weight matrix of a $k \times k$ depthwise kernel.
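The WT, small-kernel convolution, IWT pipeline of Equation (2) can be sketched in one dimension as follows. This is a simplified illustration under stated assumptions: a single-level orthonormal Haar wavelet is assumed (the text does not name the wavelet), the signal is 1-D with even length, and the kernels are fixed rather than learned. All function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_wt(x):
    """One-level orthonormal Haar transform of an even-length list."""
    low = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return low, high


def haar_iwt(low, high):
    """Inverse of `haar_wt`: interleave the reconstructed sample pairs."""
    out = []
    for l, h in zip(low, high):
        out.extend([(l + h) / SQRT2, (l - h) / SQRT2])
    return out


def conv1d_same(x, w):
    """Zero-padded 1-D cross-correlation that keeps the input length."""
    half = len(w) // 2
    return [
        sum(w[k] * x[i + k - half] for k in range(len(w)) if 0 <= i + k - half < len(x))
        for i in range(len(x))
    ]


def wtconv_1d(x, w_low, w_high):
    """Eq. (2) in 1-D: Y = IWT(Conv(W, WT(X))), one small kernel per sub-band."""
    low, high = haar_wt(x)
    return haar_iwt(conv1d_same(low, w_low), conv1d_same(high, w_high))


impulse = [0.0] * 12
impulse[4] = 1.0
response = wtconv_1d(impulse, [1.0, 1.0, 1.0], [0.0, 1.0, 0.0])
touched = [i for i, v in enumerate(response) if abs(v) > 1e-12]
print(touched[0], touched[-1])  # first and last input position reached by the impulse
```

In this example a three-tap kernel applied to the low-frequency band reaches input samples 2 through 7, a span of six samples, because each coefficient in the wavelet domain summarizes two input samples. Cascading further decomposition levels would double the span again at each level, without enlarging the kernel.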

All modules within BRF-SPPM jointly produce coarse-grained feature representations of underwater cracks. Notably, all components in BRF-SPPM employ expanded receptive fields, effectively accommodating the inherent size variations in underwater cracks while incurring only minimal additional computational cost.

3.3. Morphological Attention Module (MAM)

Owing to the complex and diverse conditions in underwater environments, underwater cracks exhibit substantial morphological variability, appearing as straight lines, curved patterns, or even tree-like morphologies. Under such circumstances, a single convolutional module is often inadequate for modeling such complex morphological variations. To address this challenge, we propose a morphological attention mechanism that adaptively fuses multiple morphological features under the guidance of crack-specific priors. The detailed architecture of MAM is illustrated in Figure 4. Specifically, the WaveletModule first refines the features generated by the preceding BRF-SPPM through cross-channel interaction and wavelet convolution, thereby enhancing the representation of multi-scale morphological features. Subsequently, the StripModule is introduced to capture strip-like feature representations using two orthogonal strip convolutions with elongated kernels (i.e., 1 × N and N × 1). This design is particularly suitable for underwater cracks, which usually exhibit high-aspect-ratio morphologies.
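To make the directional selectivity of strip convolutions concrete, the following sketch applies a 1 × N and an N × 1 kernel to a toy image containing a horizontal line. It is only an illustration: the kernels here are fixed averaging filters, whereas the StripModule weights are learned, and how the two strip outputs are combined inside the module is not specified in the text. `conv2d_same` and `strip_kernels` are hypothetical helper names.

```python
def conv2d_same(image, kernel):
    """Zero-padded 2-D cross-correlation that preserves the input size."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - kh // 2, x + j - kw // 2
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def strip_kernels(n):
    """Return a 1 x N and an N x 1 averaging kernel (fixed weights, for illustration)."""
    horizontal = [[1.0 / n] * n]              # 1 x N
    vertical = [[1.0 / n] for _ in range(n)]  # N x 1
    return horizontal, vertical


# Toy 7 x 7 image with a horizontal "crack" along row 3.
image = [[1.0 if y == 3 else 0.0 for _ in range(7)] for y in range(7)]
k_h, k_v = strip_kernels(5)
resp_h = conv2d_same(image, k_h)
resp_v = conv2d_same(image, k_v)
print(round(resp_h[3][3], 3), round(resp_v[3][3], 3))  # 1.0 0.2
```

The kernel aligned with the line integrates the whole segment, while the orthogonal kernel sees only one line pixel, which is why elongated kernels match high-aspect-ratio cracks better than a square kernel of the same parameter count.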

Within MAM, the features extracted by the StripModule and the WaveletModule are adaptively fused under the guidance of morphological priors associated with underwater cracks. The morphological prior is derived from the feature map produced by the WaveletModule. Specifically, feature responses along two orthogonal directions are first computed to characterize the spatial distribution of underwater cracks, as follows:

$$\begin{array}{l} p(x) = \sum_{y=1}^{H} F(x, y), \\ q(y) = \sum_{x=1}^{W} F(x, y), \end{array} \tag{3}$$

where $p(x)$ and $q(y)$ denote the feature responses along the horizontal and vertical directions, respectively, and $F(x, y)$ denotes the feature map from the WaveletModule with width $W$ and height $H$.

Subsequently, the cumulative distribution function (CDF) is employed to identify continuous feature responses corresponding to underwater cracks. In complex underwater scenes, direct feature projections are inherently vulnerable to distortion induced by severe background noise. To enhance robustness against such perturbations, CDF is formulated as a statistical smoothing mechanism, which can be described as follows:

$$\begin{array}{l} C_p(x) = \sum_{i=1}^{x} \hat{p}(i), \\ C_q(y) = \sum_{j=1}^{y} \hat{q}(j), \end{array} \tag{4}$$

where $\hat{p}(i)$ and $\hat{q}(j)$ are the normalized feature responses, given by $\hat{p}(i) = \frac{p(i)}{\sum_{i'=1}^{W} p(i')}$ and $\hat{q}(j) = \frac{q(j)}{\sum_{j'=1}^{H} q(j')}$.

Given a quantile $γ$, the feature response boundaries corresponding to underwater cracks are determined as follows:

$$\begin{array}{l} x_L = \min\{x \mid C_p(x) \geq γ\}, \\ x_R = \min\{x \mid C_p(x) \geq 1 - γ\}, \\ y_L = \min\{y \mid C_q(y) \geq γ\}, \\ y_R = \min\{y \mid C_q(y) \geq 1 - γ\}, \end{array} \tag{5}$$

where $x_L$ and $x_R$ denote the left and right boundaries along the horizontal direction, while $y_L$ and $y_R$ represent the boundaries along the vertical direction. In the experimental implementation, the quantile $γ$ is set to 0.05, which effectively serves as a statistical low-pass filter: it suppresses the long-tail distribution caused by isolated high-frequency underwater noise while preserving the dominant integral mass associated with the morphology of underwater cracks. The width and height of an underwater crack, denoted as $W_{crack}$ and $H_{crack}$, can be computed as:

$$\begin{array}{l} W_{crack} = x_R - x_L, \\ H_{crack} = y_R - y_L. \end{array} \tag{6}$$
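Equations (3) to (6) amount to projecting the feature map onto each axis, normalizing, accumulating, and reading off two quantiles. A minimal sketch in plain Python follows. It assumes a non-negative feature map stored as a list of rows, so that `F[y][x]` corresponds to $F(x, y)$, and it uses 1-based indices to match the equations. The helper names and the small `eps` guard in the normalization are assumptions made for illustration.

```python
def projections(F):
    """Eq. (3): column sums p(x) and row sums q(y) of a feature map given as rows."""
    H, W = len(F), len(F[0])
    p = [sum(F[y][x] for y in range(H)) for x in range(W)]
    q = [sum(F[y][x] for x in range(W)) for y in range(H)]
    return p, q


def cdf(values, eps=1e-12):
    """Eq. (4): cumulative sum of the normalized responses (eps avoids division by zero)."""
    total = sum(values) + eps
    out, running = [], 0.0
    for v in values:
        running += v / total
        out.append(running)
    return out


def first_index_at_least(c, level):
    """Smallest 1-based index whose CDF value reaches `level` (falls back to the last index)."""
    for i, v in enumerate(c, start=1):
        if v >= level:
            return i
    return len(c)


def crack_extent(F, gamma=0.05):
    """Eqs. (5)-(6): quantile boundaries and the resulting (W_crack, H_crack)."""
    p, q = projections(F)
    c_p, c_q = cdf(p), cdf(q)
    x_l, x_r = first_index_at_least(c_p, gamma), first_index_at_least(c_p, 1 - gamma)
    y_l, y_r = first_index_at_least(c_q, gamma), first_index_at_least(c_q, 1 - gamma)
    return x_r - x_l, y_r - y_l


# Toy 10 x 20 map: a one-pixel-high horizontal strip covering 14 columns.
F = [[0.0] * 20 for _ in range(10)]
for x in range(2, 16):
    F[4][x] = 1.0
print(crack_extent(F))  # (13, 0)
```

For this strip the estimated width is 13 and the height is 0, so the extent is strongly elongated along the horizontal axis. The quantile trimming means that a few isolated noisy responses far from the strip would shift the boundaries only if they carried more than a fraction $γ$ of the total response.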

These orthogonal responses serve as priors for computing the weight parameters used in feature fusion:

$$α = σ\left(\ln \frac{\max(H_{crack}, W_{crack})}{\min(H_{crack}, W_{crack}) + ε} + ε\right), \tag{7}$$

where $σ(\cdot)$ denotes the sigmoid function and $ε$ is a small constant introduced for numerical stability. The weighting coefficient $α$ is based on the logarithm of the aspect ratio of underwater cracks. This formulation is particularly advantageous for feature fusion in MAM, as it enables adaptive integration of the features extracted by the WaveletModule and the StripModule, as follows:

$$Y_{out} = (1 - α) \cdot X_{in}^{S} + α \cdot X_{out}^{S}, \tag{8}$$

where $X_{in}^{S}$ denotes the features from the WaveletModule, $X_{out}^{S}$ corresponds to the output of the StripModule, and $Y_{out}$ represents the weighted output. As the aspect ratio increases, $α$ gradually approaches 1, thereby assigning a larger contribution to the strip-like convolution. In contrast, for tree-like or block-like cracks with aspect ratios close to 1, $α$ approaches 0.5, so that the two branches are weighted about equally and the large-scale receptive field provided by the WaveletModule contributes more than it does for elongated cracks.
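The behaviour of the fusion weight can be checked numerically. Because $σ(\ln r) = r / (1 + r)$, the coefficient in Equation (7) is approximately $r / (1 + r)$ for an aspect ratio $r \geq 1$ when $ε$ is negligible, which gives 0.5 at $r = 1$ and tends to 1 as $r$ grows. The sketch below reads Equation (7) as $σ(\ln(\cdot) + ε)$ and applies Equation (8) element-wise to flat lists. The function names, the value of `eps`, and the error raised for a degenerate zero-size extent are assumptions for illustration, not details given in the text.

```python
import math


def fusion_weight(h_crack, w_crack, eps=1e-6):
    """Eq. (7): sigmoid of the log aspect ratio (eps value is an assumed placeholder)."""
    longest, shortest = max(h_crack, w_crack), min(h_crack, w_crack)
    if longest <= 0:
        # The degenerate case is not specified in the text; this sketch rejects it.
        raise ValueError("at least one extent must be positive")
    z = math.log(longest / (shortest + eps)) + eps
    return 1.0 / (1.0 + math.exp(-z))


def fuse(x_wavelet, x_strip, alpha):
    """Eq. (8): Y = (1 - alpha) * X_in + alpha * X_out, applied element-wise."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(x_wavelet, x_strip)]


for h, w in [(10, 10), (5, 10), (1, 13), (1, 100)]:
    print((h, w), round(fusion_weight(h, w), 3))

print(fuse([1.0, 2.0], [3.0, 4.0], 0.5))  # [2.0, 3.0]
```

Since the ratio of the longer to the shorter side is never below 1 (up to $ε$), $α$ stays in the interval from about 0.5 to 1, so the strip branch never receives less weight than the wavelet branch under this formulation.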

Unlike the front-end BRF-SPPM, which focuses on preserving the integrity of underwater crack representation, the back-end MAM emphasizes the adaptive refinement of morphological features. As a result, it effectively addresses morphological variability and enables a progressive transition from coarse-grained to fine-grained feature representation for underwater cracks.

3.4. Extra-Large Detection Head (X-Head)

Close-range imaging is an effective strategy for mitigating optical degradation in complex underwater environments. Consequently, underwater inspections are often performed on magnified images, in which underwater cracks occupy a substantial portion of the image area, as illustrated in Figure 5. Under such conditions, standard YOLO-based detectors, as well as conventional terrestrial object detectors, become less effective. For instance, in YOLOv11, the P5 detection head is intended for large objects and is typically used to detect instances with resolutions of 32 × 32 pixels or larger. While this design is suitable for datasets such as COCO, where the maximum anchor size is approximately 512 pixels, it remains inadequate for practical underwater inspection scenarios. Preliminary statistics show that nearly 70% of underwater cracks exceed 32 × 32 pixels, and approximately 50% are larger than 64 × 64 pixels. The prevalence of such magnified cracks significantly reduces the number of valid anchors on the feature maps, thereby degrading regression stability and localization accuracy.

To address this limitation, we propose a novel X-Head module specifically designed for magnified underwater cracks larger than 64 × 64 pixels. Specifically, a stride-2 convolution is applied to the original P5 feature layer of YOLOv11 to further downsample the feature map and generate a new P6 feature layer. Owing to its spatial resolution of only 1/64 of the input image, the P6 feature map is able to capture the global morphological characteristics of magnified cracks more effectively at a lower resolution. The resulting P6 features are then concatenated with skip-connected backbone features and further refined by two successive C3k2 modules. By introducing X-Head, MA-YOLO improves the detection of magnified underwater cracks in close-range imaging and thereby enhances its applicability to diverse underwater inspection scenarios.
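The stride arithmetic behind the additional P6 level can be sketched as follows. The 640 × 640 input size and the 3 × 3 kernel with padding 1 are assumptions chosen for illustration (the text specifies only a stride-2 convolution on P5), and the function names are hypothetical.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Spatial size after a convolution (kernel and padding are assumed values)."""
    return (size + 2 * padding - kernel) // stride + 1


def pyramid_grid_sizes(input_size=640):
    """Grid side length at each detection level for a square input."""
    strides = {"P3": 8, "P4": 16, "P5": 32, "P6": 64}
    return {name: input_size // s for name, s in strides.items()}


def cells_spanned(object_side, stride):
    """Approximate number of grid cells an object of the given side length spans."""
    return object_side / stride


grids = pyramid_grid_sizes(640)
print(grids)                           # {'P3': 80, 'P4': 40, 'P5': 20, 'P6': 10}
print(conv_output_size(grids["P5"]))   # 10: a stride-2 convolution on P5 yields the P6 grid
print(cells_spanned(256, 32), cells_spanned(256, 64))  # 8.0 4.0
```

Under these assumptions, a crack 256 pixels wide spans 8 cells on P5 but only 4 on P6, so each P6 cell summarizes a larger part of the crack, which is consistent with the motivation for detecting magnified cracks at a coarser level.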

4. Experiments and Discussion

4.1. Experimental Datasets

To the best of our knowledge, no publicly available underwater dataset can adequately support DL-based underwater crack detection, mainly due to the scarcity of underwater crack samples. To address this limitation, we constructed a dedicated Underwater Crack Dataset (UCD) in this paper. Hydraulic engineering simulation experiments were conducted at the Dangtu Scientific Research and Technology Development Base of the Nanjing Hydraulic Research Institute, where damaged test blocks were used to reproduce diverse underwater crack patterns in hydraulic structures. The experimental base draws water directly from the mainstream of the Huai River, thereby preserving the natural hydrological characteristics of the test environment.

In addition, underwater crack imaging was performed at real-world hydraulic projects, namely the Xiaowan and Nuozhadu Hydropower Stations in Yunnan Province, China. During underwater imaging operations, crack images were extracted at 2 frames per second (FPS) to reduce redundancy during dataset construction. Overall, UCD comprises 1941 underwater crack samples with a resolution of 1920 × 1080 pixels, of which 30% were obtained from simulation experiments and the remaining 70% from field inspections. To better characterize the complexity and diversity of UCD, we performed a detailed statistical analysis of the annotated bounding boxes.
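As a rough illustration of sampling at 2 FPS, the sketch below selects which frame indices to keep from a recording. The source frame rate of 30 FPS and the function name are assumptions made for the example; the recording frame rate is not stated in the text.

```python
def sampled_frame_indices(total_frames: int, video_fps: float, target_fps: float = 2.0) -> list:
    """Indices of frames to keep so that roughly `target_fps` frames per second remain."""
    if video_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Keep every `step`-th frame; never use a step smaller than one frame.
    step = max(1, round(video_fps / target_fps))
    return list(range(0, total_frames, step))


# Hypothetical 10-second clip recorded at 30 FPS.
indices = sampled_frame_indices(total_frames=300, video_fps=30.0)
print(len(indices), indices[:4])
```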

(a) Scale Distribution: UCD exhibits substantial multi-scale variation. Specifically, small crack instances may occupy only 0.6% of the image area, whereas large tree-like cracks can cover up to 66%. Owing to the prevalence of close-range imaging, nearly 70% of the crack instances exceed 32 × 32 pixels, and approximately 50% are larger than 64 × 64 pixels. This heavy-tailed scale distribution highlights the necessity of the proposed BRF-SPPM and X-Head modules.

(b) Aspect Ratio Variability: The crack instances in UCD exhibit considerable aspect-ratio variation, ranging from elongated strip-like morphologies to block-like or tree-like patterns with aspect ratios close to 1. According to the annotation statistics, strip-like cracks account for approximately 80% of the dataset. This observation supports the need for morphology-adaptive modeling and justifies the adaptive feature fusion strategy adopted in MAM.

(c) Environmental Complexity: In addition to geometric diversity, UCD includes challenging real-world underwater conditions, such as high turbidity, severe light attenuation, non-uniform local illumination, and color distortion. These environmental variations may affect model generalization by reducing crack-edge contrast, introducing suspended-particle noise, causing color shift, and producing local shadows or overexposed regions. In this study, field images in UCD, HSV color-space jittering, mosaic augmentation, and morphology-adaptive feature modeling are used to improve robustness under such complex underwater conditions.
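The scale and aspect-ratio statistics in (a) and (b) could be computed from YOLO-format annotations along the following lines. This is a minimal sketch under stated assumptions: box size is measured at the original 1920 × 1080 resolution, "exceeds 32 × 32" is interpreted as box area above 32², and the strip-like threshold (long side at least three times the short side) is hypothetical, since the text does not give the exact criteria.

```python
def box_statistics(labels, img_w=1920, img_h=1080, strip_ratio=3.0):
    """Summarise YOLO-format boxes given as (class_id, cx, cy, w, h), normalised to [0, 1]."""
    n = len(labels)
    if n == 0:
        raise ValueError("no boxes given")
    over_32 = over_64 = strips = 0
    area_fractions = []
    for _cls, _cx, _cy, w, h in labels:
        pw, ph = w * img_w, h * img_h            # box size in pixels
        area = pw * ph
        over_32 += area > 32 * 32
        over_64 += area > 64 * 64
        # Aspect ratio as long side over short side; the threshold is an assumption.
        strips += max(pw, ph) / max(min(pw, ph), 1e-9) >= strip_ratio
        area_fractions.append(w * h)             # share of the image covered by the box
    return {
        "frac_over_32": over_32 / n,
        "frac_over_64": over_64 / n,
        "frac_strip_like": strips / n,
        "min_area_fraction": min(area_fractions),
        "max_area_fraction": max(area_fractions),
    }


# Toy annotations, not taken from UCD.
toy_labels = [
    (0, 0.5, 0.5, 0.02, 0.02),   # small crack
    (0, 0.5, 0.5, 0.50, 0.05),   # elongated strip-like crack
    (0, 0.5, 0.5, 0.80, 0.80),   # large close-range crack
    (0, 0.5, 0.5, 0.03, 0.06),   # medium crack
]
stats = box_statistics(toy_labels)
print(stats)
```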

These morphological and environmental variations together contribute to improved data diversity, thereby helping mitigate overfitting and enhancing the detector’s generalization ability in hydraulic engineering applications. The underwater crack samples in UCD were manually annotated using LabelImg. For each crack, a rectangular bounding box covering the entire crack region was drawn, and the annotations were saved in the YOLO format (txt file). The sample annotation process is presented in Figure 6. For the experimental evaluation, the dataset was split into training, validation, and test sets in a ratio of 7:2:1.
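A 7:2:1 split of the 1941 samples can be sketched as below. The text does not state how samples were assigned to each subset, so the seeded random shuffle, the seed value, and the function name are assumptions; a split grouped by recording session would be an alternative if consecutive frames are strongly correlated.

```python
import random


def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle items reproducibly and split them into train/val/test by the given ratios."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]            # the remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(1941))
print(len(train), len(val), len(test))
```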

4.2. Experimental Details and Evaluation Metrics

4.2.1. Experimental Settings

All models were trained and evaluated using 640 × 640-pixel inputs, while the original image aspect ratio was retained during resizing. During training, Mosaic augmentation was applied with a probability of 0.5, and HSV color-space jittering was also used, with hue, saturation, and value gains of 15 × 10⁻³, 0.7, and 0.4, respectively. The probability of 0.5 was selected as a moderate setting to balance augmentation strength and preservation of the original crack morphology, as overly frequent Mosaic augmentation may affect the spatial continuity of cracks and the illumination characteristics of underwater images. The batch size was fixed at 16, and the model was trained for 200 epochs. Stochastic gradient descent (SGD) was adopted for optimization, with a momentum of 0.937, a weight decay of 5 × 10⁻⁴, and an initial learning rate of 0.01. To improve training stability, a warmup strategy was used in the first 3 epochs, after which a cosine annealing scheduler was employed to gradually decay the learning rate throughout the remaining training process. Detailed specifications of the hardware and software environments used in the experiments are summarized in Table 1.
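The learning-rate policy described above (3 warmup epochs followed by cosine annealing over 200 epochs, starting from 0.01) can be sketched as follows. The linear ramp from zero during warmup and the final learning-rate fraction of 0.01 are assumptions, as the text does not specify them; the function name is hypothetical.

```python
import math


def learning_rate(epoch: float, total_epochs: int = 200, warmup_epochs: int = 3,
                  lr0: float = 0.01, final_fraction: float = 0.01) -> float:
    """Linear warmup followed by cosine annealing from lr0 down to lr0 * final_fraction."""
    if epoch < warmup_epochs:
        # Linear ramp from 0 to lr0 (the starting value of the ramp is an assumption).
        return lr0 * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    lr_min = lr0 * final_fraction
    return lr_min + (lr0 - lr_min) * cosine


schedule = [learning_rate(e) for e in range(201)]
print(schedule[0], schedule[3], schedule[200])
```

The schedule peaks at the initial learning rate once warmup ends and then decays monotonically for the remaining epochs.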

4.2.2. Evaluation Metrics

Model performance was evaluated using Precision, Recall, mAP@0.5, mAP@0.5:0.95, GFLOPs, and FPS. Precision and Recall reflect prediction reliability and detection completeness, respectively. mAP@0.5 measures detection accuracy at an IoU threshold of 0.5, whereas mAP@0.5:0.95 averages mAP over IoU thresholds from 0.5 to 0.95. GFLOPs and FPS are used to evaluate computational complexity and inference speed, respectively.
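The building blocks of these metrics can be illustrated with a short sketch. The 0.05 step between IoU thresholds follows the common COCO-style convention and is an assumption here, since the text gives only the range; the box format (x1, y1, x2, y2) and the function names are also chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# IoU thresholds 0.50, 0.55, ..., 0.95 used when averaging mAP@0.5:0.95 (step assumed).
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

pred, truth = (0, 0, 100, 100), (20, 0, 120, 100)
overlap = iou(pred, truth)
# Number of thresholds at which this prediction would still count as a true positive.
hits = sum(overlap >= t for t in IOU_THRESHOLDS)
print(overlap, hits, precision_recall(tp=8, fp=2, fn=4))
```

A prediction that passes at an IoU of 0.5 may fail at stricter thresholds, which is why mAP@0.5:0.95 is more sensitive to localization quality than mAP@0.5.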

4.3. Ablation Experiments and Analysis

To evaluate the effectiveness and contribution of each module in MA-YOLO, ablation experiments were conducted on UCD. The YOLOv11n model was adopted as the baseline for comparison. In Table 2, “-” indicates that the corresponding module is not included, whereas “√” denotes that the module is included.

(a) MA-YOLO: As summarized in Table 2, integrating all proposed modules into MA-YOLO yielded a significant improvement, with mAP@0.5 and mAP@0.5:0.95 increasing by 1.7% and 3.0%, respectively, compared to the baseline. This improvement can be attributed to the capability of MA-YOLO to adapt feature representation to the morphological characteristics of underwater cracks. More importantly, a favorable computational synergy is observed. Although incorporating BRF-SPPM and MAM alone increases the computational cost to 7.2 GFLOPs, the complete MA-YOLO reduces the overall complexity to 6.5 GFLOPs. This phenomenon can be explained by an architectural substitution mechanism. In conventional YOLO architectures, a substantial portion of the computation is concentrated in the high-resolution P3 detection head for small-object perception. After X-Head is introduced, the network topology is reorganized such that redundant dense computation in the high-resolution branches can be bypassed or reduced. Because the P6 feature map has significantly smaller spatial dimensions, the additional convolutional cost brought by X-Head is marginal. In contrast, the computation relieved from the high-resolution routing is considerably larger, leading to an overall reduction in total GFLOPs. Therefore, although the proposed modules introduce slight computational overhead, MA-YOLO still maintains a lightweight architecture with low GFLOPs and high inference speed. Specifically, when built on YOLOv11, MA-YOLO achieves 54 FPS, which satisfies the practical requirements of underwater inspection and demonstrates strong potential for real-time applications.

The training curves of the proposed MA-YOLO model are presented in Figure 7. Specifically, box loss measures the discrepancy between the predicted and ground-truth bounding boxes in terms of spatial localization and scale, classification loss (cls_loss) quantifies the difference between the predicted and ground-truth categories, and distribution focal loss (dfl_loss) is employed for bounding box regression to evaluate localization accuracy. The consistent decrease in all three losses during both training and validation indicates stable convergence throughout the training process, without evident signs of overfitting. In addition, the steady improvements in Precision, Recall, and mean Average Precision demonstrate the continuous enhancement of detection performance, suggesting that the proposed modules contribute to both training stability and robust detection capability.

(b) BRF-SPPM: As shown in Table 2, incorporating BRF-SPPM improves mAP@0.5 and mAP@0.5:0.95 by 0.3% and 1.5%, respectively. This gain can be attributed to the use of convolutional branches with different dilation rates, which allow the model to better adapt to cracks of varying scales. Moreover, wavelet convolution exponentially enlarges the effective receptive field, thereby improving the model’s ability to capture multi-scale crack features.

(c) MAM: Table 2 further shows that the introduction of MAM leads to notable improvements in detection accuracy, with mAP@0.5 and mAP@0.5:0.95 increasing by 0.5% and 2.0%, respectively. This gain can be attributed to the sensitivity of wavelet convolution to low-frequency components, which provides a morphological prior for adaptively fusing strip-like and large-scale feature representations while suppressing isolated high-frequency noise. Consequently, MA-YOLO exhibits stronger morphological adaptability and improved capability in capturing cracks with diverse morphological characteristics.

(d) X-Head: According to Table 2, the introduction of X-Head alone improves mAP@0.5 and mAP@0.5:0.95 by 0.8% and 1.3%, respectively, demonstrating a clear performance gain. Importantly, this improvement is achieved without increasing computational complexity: the cost decreases slightly from 6.4 to 6.3 GFLOPs, although the inference speed drops from 128 to 95 FPS. Since X-Head is specifically designed for magnified underwater images acquired through close-range imaging, the accuracy gain can be attributed to its enhanced ability to localize magnified cracks more precisely.
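The gains quoted in (a) to (d) can be checked directly against Table 2. The sketch below copies the table rows and computes each configuration's difference from the baseline; the dictionary layout and function name are chosen for illustration only.

```python
# Keys: (BRF-SPPM, MAM, X-Head); values: (mAP@0.5, mAP@0.5:0.95, GFLOPs, FPS) from Table 2.
ABLATION = {
    (0, 0, 0): (91.2, 60.0, 6.4, 128),
    (1, 0, 0): (91.5, 61.5, 6.9, 77),
    (0, 1, 0): (91.7, 62.0, 6.8, 85),
    (0, 0, 1): (92.0, 61.3, 6.3, 95),
    (1, 1, 0): (92.1, 61.7, 7.2, 61),
    (1, 0, 1): (92.0, 62.5, 6.4, 67),
    (0, 1, 1): (92.1, 62.6, 6.4, 70),
    (1, 1, 1): (92.9, 63.0, 6.5, 54),
}


def gains_over_baseline(table):
    """Differences in mAP@0.5, mAP@0.5:0.95 and GFLOPs relative to the baseline row."""
    base = table[(0, 0, 0)]
    return {
        cfg: tuple(round(row[i] - base[i], 1) for i in range(3))
        for cfg, row in table.items()
        if cfg != (0, 0, 0)
    }


gains = gains_over_baseline(ABLATION)
for cfg, delta in gains.items():
    print(cfg, delta)
```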

To qualitatively assess the contribution of each module in MA-YOLO, Gradient-weighted Class Activation Mapping (Grad-CAM) was employed to generate heatmaps on UCD, as illustrated in Figure 8. These heatmaps visualize the feature representations learned by MA-YOLO, including BRF-SPPM at the 11th layer, C2PSA at the 12th layer, MAM at the 13th layer, and the integration of BRF-SPPM and MAM. They are compared with the feature representations obtained from SPPF, which occupies the 11th layer of YOLOv11. This qualitative evaluation allows intuitive observation of each module’s contribution to feature learning.

First, comparison of the second and third columns in Figure 8 indicates that BRF-SPPM generates a more complete activation region for underwater cracks and exhibits enhanced sensitivity to discriminative features. This behavior is particularly beneficial during the coarse-grained phase for capturing overall morphological integrity, as it substantially reduces the false negative rate.

Second, the fourth and fifth columns in Figure 8 show that activation regions become more concentrated after MAM, clearly highlighting the specific morphological boundaries of underwater crack regions. Such refinement is advantageous in the fine-grained phase, significantly improving detection accuracy while reducing the false positive rate.

Third, as illustrated in Figure 8, the joint integration of BRF-SPPM and MAM not only facilitates comprehensive representation of underwater cracks but also highlights the most discriminative morphological features, thereby establishing a strong basis for high-quality detection.

Overall, MA-YOLO maintains a compact architecture, in which different layers perform distinct yet collaborative roles in coarse- and fine-grained detection. A clear coarse-to-fine progression can be observed across successive layers, demonstrating effective hierarchical feature refinement.
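For readers unfamiliar with how the heatmaps discussed above are formed, the following sketch shows the core Grad-CAM combination step on plain nested lists. It assumes the layer activations and the gradients of a chosen detection score with respect to them have already been obtained from an automatic-differentiation framework; which score was differentiated is not stated in the text, and the toy values are invented.

```python
def grad_cam(activations, gradients):
    """Combine feature maps A^k and gradients dy/dA^k (both [K][H][W]) into a Grad-CAM map."""
    k = len(activations)
    h, w = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of the gradients over the spatial dimensions.
    weights = [sum(sum(row) for row in gradients[c]) / (h * w) for c in range(k)]
    # Weighted sum over channels, followed by ReLU to keep positive evidence only.
    cam = [[max(0.0, sum(weights[c] * activations[c][i][j] for c in range(k)))
            for j in range(w)] for i in range(h)]
    # Normalise to [0, 1] so the map can be rendered as a heatmap.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy example with two channels on a 2 x 2 feature map.
acts = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(acts, grads)
print(heatmap)
```

In practice the low-resolution map is upsampled to the input size and overlaid on the image, which is how figures of this kind are typically rendered.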

4.4. Comparison Experiments and Analysis

To validate the performance of the proposed MA-YOLO, comprehensive comparative experiments were conducted on UCD against 11 state-of-the-art methods: seven popular detectors, i.e., Faster R-CNN, RetinaNet, YOLOv5n, YOLOv8n, YOLOv11n, YOLOv12n, and YOLOv13n, and four underwater object-specific detectors, i.e., Boosting R-CNN, GCC-Net [36], SU-YOLO, and LUOD-YOLO [37].

Among the compared methods, Faster R-CNN, Boosting R-CNN, and GCC-Net are two-stage detectors based on region proposal mechanisms, whereas the others are one-stage detectors with different architectural designs. The quantitative comparison results are reported in Table 3. Under identical training and evaluation settings, MA-YOLO outperforms all competing methods in both mAP@0.5 and mAP@0.5:0.95. In particular, relative to the YOLO-series models, including the YOLOv11n baseline, MA-YOLO improves mAP@0.5 and mAP@0.5:0.95 by an average of 1.9% and 2.5%, respectively. This gain comes with a speed-accuracy trade-off that favors accuracy while remaining within the real-time range. Compared with YOLOv11n, MA-YOLO improves mAP@0.5 from 91.2% to 92.9% and mAP@0.5:0.95 from 60.0% to 63.0%, while the FPS decreases from 128 to 54 because of the added morphology-adaptive modules. Nevertheless, MA-YOLO remains faster than representative two-stage detectors, including Faster R-CNN, Boosting R-CNN, and GCC-Net, which achieve 15, 37, and 4 FPS, respectively. Unlike existing improved YOLO-based underwater detectors, which mainly focus on generic feature fusion, attention enhancement, image degradation suppression, or lightweight design, MA-YOLO explicitly models underwater crack morphology through BRF-SPPM, MAM, and X-Head. These modules are designed for scale variation, aspect-ratio variability, and close-range magnified cracks, respectively. Therefore, MA-YOLO improves detection accuracy while retaining real-time potential for online underwater inspection.
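The averaged gains over the YOLO-series models and the speed trade-off can be recomputed from Table 3. In the sketch below, the rows are copied from the table, and the conversion from FPS to per-frame latency is added only for illustration; the function names are hypothetical.

```python
# (mAP@0.5, mAP@0.5:0.95, FPS) for the YOLO-series rows of Table 3.
TABLE3_YOLO = {
    "YOLOv5n": (90.5, 59.6, 82),
    "YOLOv8n": (91.4, 60.6, 148),
    "YOLOv11n": (91.2, 60.0, 128),
    "YOLOv12n": (90.5, 61.7, 84),
    "YOLOv13n": (91.3, 60.5, 56),
}
OURS = (92.9, 63.0, 54)


def mean_gap(rows, ours, column):
    """Difference between our value and the mean of the compared models, in points."""
    mean = sum(r[column] for r in rows.values()) / len(rows)
    return round(ours[column] - mean, 1)


def frame_latency_ms(fps):
    """Average time per frame implied by a throughput in frames per second."""
    return 1000.0 / fps


gap_map50 = mean_gap(TABLE3_YOLO, OURS, 0)
gap_map5095 = mean_gap(TABLE3_YOLO, OURS, 1)
latency_ours = frame_latency_ms(OURS[2])
latency_v11 = frame_latency_ms(TABLE3_YOLO["YOLOv11n"][2])
print(gap_map50, gap_map5095, round(latency_ours, 1), round(latency_v11, 1))
```

Expressed as latency, the drop from 128 to 54 FPS corresponds to roughly 7.8 ms versus 18.5 ms per frame on the hardware listed in Table 1.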

To further evaluate the effectiveness of MA-YOLO, we compared it with four representative underwater object detection methods, namely Boosting R-CNN, GCC-Net, SU-YOLO, and LUOD-YOLO. Overall, MA-YOLO achieves the best detection accuracy, surpassing these methods by an average of 2.6% in mAP@0.5 and 5.0% in mAP@0.5:0.95. Although the recently proposed SU-YOLO shows strong performance in general underwater object detection, it mainly relies on spike-based denoising while overlooking the distinctive geometric and morphological characteristics of underwater cracks, which likely accounts for its degraded performance in hydraulic inspection tasks. By contrast, two-stage underwater detectors such as Boosting R-CNN can achieve high detection accuracy, but their substantial computational overhead limits their suitability for real-time deployment on resource-constrained underwater platforms.

Figure 9 presents the qualitative underwater crack detection results of the compared methods and the proposed MA-YOLO on UCD. As observed in Figure 9, MA-YOLO produces more accurate bounding boxes that tightly enclose various underwater cracks. For small-scale cracks, as illustrated in the second column of Figure 9, only a few of the compared methods, i.e., RetinaNet, YOLOv13n, Boosting R-CNN, and GCC-Net, can roughly identify the crack locations, whereas the others suffer from missed or inaccurate detections. In contrast, MA-YOLO localizes these cracks most accurately. The third and fourth columns of Figure 9 present the results for underwater cracks with significant morphological variations. In these cases, MA-YOLO demonstrates a distinct advantage, as it provides the most compact and complete bounding boxes for such crack types. For extremely large-scale cracks observed at close range, as shown in the fifth column of Figure 9, most of the compared methods fail to detect the cracks completely, as their predicted bounding boxes do not cover the full crack morphology. By contrast, MA-YOLO fully encloses the entire crack regions, making it particularly well suited for close-range imaging scenes in practical engineering applications. In summary, MA-YOLO exhibits strong adaptability to underwater cracks of varying scales and morphologies, achieving superior detection accuracy while maintaining moderate computational complexity. These findings indicate that MA-YOLO is not only effective but also suitable for real-world underwater inspection tasks.

5. Conclusions

This paper presents a novel underwater crack detection framework, termed MA-YOLO, which substantially improves detection performance by jointly addressing both methodological and practical challenges. Based on YOLOv11, MA-YOLO introduces three dedicated modules, namely BRF-SPPM, MAM, and X-Head, to address the multi-scale nature and morphological variability of underwater cracks, as well as the image magnification introduced by close-range inspection in practical scenes. Together, these modules establish a coarse-to-fine feature representation framework for underwater crack detection. Specifically, BRF-SPPM captures multi-scale crack features by enlarging the receptive field and preserving the integrity of crack representation. MAM further refines the representation by adaptively fusing orthogonal strip-convolution features with BRF-SPPM features under the guidance of a robust statistical morphological prior. In addition, X-Head facilitates accurate localization of cracks in magnified underwater images, thereby improving the applicability of MA-YOLO in practical inspection scenes. Experimental results on UCD demonstrate that MA-YOLO achieves the best mAP@0.5 of 92.9%, outperforming the baseline by 1.7%, while remaining highly lightweight with only 6.5 GFLOPs. Moreover, it shows robust performance across different types of underwater cracks. Although the reported inference speed of 54 FPS was measured on high-performance hardware, the low computational complexity of MA-YOLO suggests strong potential for future deployment on resource-constrained ROV platforms. Overall, these results demonstrate that MA-YOLO is an efficient and high-precision solution for underwater crack detection in hydraulic structures.

Despite these promising results, several directions for future research remain. First, as illustrated in Figure 8, when underwater cracks exhibit extremely complex morphologies, the learned feature representation may fail to uniformly cover the entire crack region, which increases the risk of missed detections. Second, the scarcity of publicly available high-quality underwater crack datasets still limits comprehensive cross-dataset validation. Third, although MA-YOLO has effectively reduced the computational cost for underwater inspection, further model lightweighting is needed to ensure real-time inference and lower response latency on practical resource-constrained edge platforms, such as embedded systems deployed on ROVs. Future work will therefore focus on deploying MA-YOLO in hardware-in-the-loop physical environments to dynamically evaluate its operational robustness. In addition, incorporating other imaging modalities, such as acoustic imaging, for multi-modal fusion may provide a promising direction for achieving more robust and accurate detection in complex underwater engineering environments.

Author Contributions

Z.C.: Conceptualization, Methodology, Project administration, Funding acquisition, Writing—review and editing. C.Z.: Validation, Investigation, Experimental design, Writing—original draft. J.G.: Formal analysis, Resources. G.Y.: Investigation, Writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (Grant No. 2024YFC3212400) and the National Natural Science Foundation of China (Grant No. U23B20150).

Data Availability Statement

The UCD dataset and annotation files are available from the corresponding author upon reasonable request. Restrictions apply to part of the field images obtained from hydropower stations, which cannot be publicly released because of engineering-site data-use agreements.

Conflicts of Interest

The authors declare no conflicts of interest. The sponsors had no role in the design, execution, interpretation, or writing of the study.

References

1. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
2. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse R-CNN: End-to-end object detection with learnable proposals. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463.
3. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007.
4. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
5. Song, P.; Li, P.; Dai, L.; Wang, T.; Chen, Z. Boosting R-CNN: Reweighting R-CNN samples by RPN’s error for underwater object detection. Neurocomputing 2023, 530, 150–164.
6. Gao, J.; Geng, X.; Zhang, Y.; Wang, R.; Shao, K. Augmented weighted bidirectional feature pyramid network for marine object detection. Expert Syst. Appl. 2024, 237, 121688.
7. Bhalla, S.; Kushwaha, R.; Kumar, A. Hydr-CNN: Advancing underwater object detection using a multi-stage framework with hybrid R-CNN and pyramid vision transformer with augmented convolution. Eur. Phys. J. Plus 2025, 140, 556.
8. Liu, K.; Peng, L.; Tang, S. Underwater object detection using TC-YOLO with attention mechanisms. Sensors 2023, 23, 2567.
9. Feng, J.; Jin, T. CEH-YOLO: A composite enhanced YOLO-based model for underwater object detection. Ecol. Inform. 2024, 82, 102758.
10. Wei, L.S.; Huang, S.H.; Ma, L.Y. MTD-YOLOv5: Enhancing marine target detection with multi-scale feature fusion in YOLOv5 model. Heliyon 2024, 10, e26145.
11. Cao, R.; Zhang, R.; Yan, X.; Zhang, J. BG-YOLO: A bidirectional-guided method for underwater object detection. Sensors 2024, 24, 7411.
12. Ma, H.; Zhang, Y.; Sun, S.; Zhang, W.; Fei, M.; Zhou, H. Weighted multi-error information entropy based you only look once network for underwater object detection. Eng. Appl. Artif. Intell. 2024, 130, 107766.
13. Li, C.; Liu, W.; Gong, G.; Ding, X.; Zhong, X. SU-YOLO: Spiking neural network for efficient underwater object detection. Neurocomputing 2025, 644, 130310.
14. Li, Y.; Bao, T.; Xu, B.; Shu, X.; Zhou, Y.; Du, Y.; Wang, R.; Zhang, K. A deep residual neural network framework with transfer learning for concrete dams patch-level crack classification and weakly-supervised localization. Measurement 2022, 188, 110641.
15. Chen, B.; Zhang, H.; Li, Y.; Wang, S.; Zhou, H.; Lin, H. Quantify pixel-level detection of dam surface crack using deep learning. Meas. Sci. Technol. 2022, 33, 065402.
16. Cao, W.; Li, J. Detecting large-scale underwater cracks based on remote operated vehicle and graph convolutional neural network. Front. Struct. Civ. Eng. 2022, 16, 1378–1396.
17. Li, Y.; Bao, T.; Huang, X.; Wang, R.; Shu, X.; Xu, B.; Tu, J.; Zhou, Y.; Zhang, K. An integrated underwater structural multi-defects automatic identification and quantification framework for hydraulic tunnel via machine vision and deep learning. Struct. Health Monit. 2023, 22, 2360–2383.
18. Huang, X.; Liang, C.; Li, X.; Kang, F. An underwater crack detection system combining new underwater image-processing technology and an improved YOLOv9 network. Sensors 2024, 24, 5981.
19. Shi, P.; Shao, S.; Fan, X.; Xin, Y.; Zhou, Z.; Cao, P.; Li, X.; Zhu, S. CrackYOLO: Towards efficient dam crack detection for underwater scenes. Pattern Anal. Appl. 2024, 27, 105.
20. Huang, B.; Kang, F.; Li, X.; Zhu, S. Underwater dam crack image generation based on unsupervised image-to-image translation. Autom. Constr. 2024, 163, 105430.
21. Lin, C.; Liu, R.; Lin, W.; Zou, Y.; Wei, X.; Su, Y. Underwater dam crack image enhancement and crack detection based on improved diffusion model and SDI-ASF-YOLO11. Constr. Build. Mater. 2025, 492, 142861.
22. Guo, B.; Li, X.; Li, D. CrackWave R-convolutional neural network: A discrete wavelet transform and deep learning fusion model for underwater dam crack detection. Struct. Health Monit. 2025, 25, 14759217241308132.
23. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
24. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
25. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
26. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976.
27. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696.
28. Varghese, R.; Sambath, M. YOLOv8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6.
29. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616.
30. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458.
31. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
32. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524.
33. Lei, M.; Li, S.; Wu, Y.; Hu, H.; Zhou, Y.; Zheng, X.; Ding, G.; Du, S.; Wu, Z.; Gao, Y. YOLOv13: Real-time object detection with hypergraph-enhanced adaptive visual perception. arXiv 2025, arXiv:2506.17733.
34. Qiao, S.; Chen, L.C.; Yuille, A. DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution. arXiv 2020, arXiv:2006.02334.
35. Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet convolutions for large receptive fields. arXiv 2024, arXiv:2407.05848.
36. Dai, L.; Liu, H.; Song, P.; Liu, M. A gated cross-domain collaborative network for underwater object detection. Pattern Recognit. 2024, 149, 110222.
37. Lv, C.; Pan, W. LUOD-YOLO: A lightweight underwater object detection model based on dynamic feature fusion, dual path rearrangement and cross-scale integration. J. Real-Time Image Process. 2025, 22, 204.

Figure 1. MA-YOLO architecture.

Figure 2. Multi-scale underwater cracks.

Figure 3. BRF-SPPM architecture.

Figure 4. MAM architecture.

Figure 5. Close-range imaging of underwater cracks.

Figure 6. Image labeling and annotation.

Figure 7. MA-YOLO training results on UCD.

Figure 8. Heatmaps generated by Grad-CAM for the baseline and ablated MA-YOLO variants on representative underwater crack images.

Figure 9. Qualitative detection comparison between MA-YOLO and representative detectors on underwater crack images with small-scale, irregular, and close-range enlarged cracks.

Table 1. Experimental hardware and software configuration.

Experimental Environment	Version
GPU	RTX 3090
CUDA	V11.8
Python	V3.10
PyTorch	V2.6.0

Table 2. Ablation experiment of BRF-SPPM, MAM, X-Head and MA-YOLO on UCD. Bold values indicate the best performance.

BRF-SPPM	MAM	X-Head	mAP@0.5	mAP@0.5:0.95	GFLOPs	FPS
-	-	-	91.2	60.0	6.4	128
√	-	-	91.5	61.5	6.9	77
-	√	-	91.7	62.0	6.8	85
-	-	√	92.0	61.3	6.3	95
√	√	-	92.1	61.7	7.2	61
√	-	√	92.0	62.5	6.4	67
-	√	√	92.1	62.6	6.4	70
√	√	√	92.9	63.0	6.5	54

Table 3. Quantitative comparison. Bold values indicate the best performance.

Models	mAP@0.5	mAP@0.5:0.95	GFLOPs	FPS
Faster R-CNN	89.1	57.0	251.4	15
RetinaNet	91.9	59.7	127.2	17
YOLOv5n	90.5	59.6	4.1	82
YOLOv8n	91.4	60.6	8.7	148
YOLOv11n	91.2	60.0	6.4	128
YOLOv12n	90.5	61.7	6.5	84
YOLOv13n	91.3	60.5	6.4	56
Boosting R-CNN	91.0	59.5	181.2	37
GCC-Net	90.0	54.0	313.4	4
SU-YOLO	91.3	59.3	9.4	87
LUOD-YOLO	90.0	59.3	6.0	83
Ours	92.9	63.0	6.5	54

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

