Abstract
Detecting underwater structural defects is vital for hydraulic engineering safety. The diverse patterns of underwater structural defects, i.e., their morphology and scale characteristics, pose difficulties for feature representation during detection, and no single feature morphology is sufficient to fully characterize the diverse types of underwater defect patterns. This paper proposes a novel hybrid neural network to enhance the feature representation of underwater structural multi-defects, which in turn improves the accuracy and adaptability of underwater detection. Three types of convolution operations are combined to build the Hybrid Aggregation Network (HanNet), enhancing the morphological representation of diverse defects. Considering the scale differences among defects, the Multi-Scale Shared Feature Pyramid (MSFP) is proposed, facilitating adaptive representation for diverse sizes of structural defects. The defect detection module leverages Adaptive Spatial-Aware Attention (ASAA) at the backend, enabling selective enhancement of salient defect features. For model training and evaluation, we build, for the first time, an underwater structural multi-defect sonar image dataset containing a wide range of typical defect types. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods, significantly improving defect detection accuracy, and provides an effective solution for detecting diverse structural defects in complex underwater environments.
1. Introduction
Hydraulic structures (dams, gates, pipelines, etc.) are critical facilities for hydraulic safety and exploration [1]. Underwater structures are susceptible to the initiation and accelerated growth of defects—particularly cracks, holes, and spalling—due to high-velocity flow-induced erosion, severe stress concentrations, and harsh service conditions; if unaddressed, these defects pose serious safety risks and can lead to substantial economic losses and significant environmental damage. Automatic detection of underwater structural multi-defects has emerged as a crucial technology for inspecting hydraulic structures [2]. However, relative to terrestrial objects, underwater defects exhibit distinct characteristics. In particular, common structural defects—cracks, holes, and spalling—show pronounced intra-class variability in surface appearance, in-plane (planform) morphology, and subsurface (through-thickness) extent. Comprehensively and sufficiently representing such varied underwater structural multi-defects is challenging for underwater detection. Moreover, the low visibility of complex underwater environments further exacerbates the challenge of representing underwater defects.
To motivate the sensing modality adopted in this study, we first situate sonar within the broader landscape of non-destructive testing (NDT) methods for hydraulic structures. Common NDT techniques include ultrasonic testing and impact-echo, which provide material-level diagnostics but typically require contact or careful coupling and are difficult to scale to large underwater areas. Magnetic particle/dye-penetrant methods are limited to specific materials and surface-breaking defects on prepared surfaces. Optical/laser imaging can deliver high spatial detail in clear water but suffers in turbidity and low illumination. In contrast, sonar imaging is an active, stand-off modality that remains effective in turbid water, offers longer imaging ranges suitable for AUV/ROV surveys, and is intrinsically sensitive to geometric discontinuities (e.g., edges, cavities, delamination) that characterize cracks, holes, and spalling. Sonar’s limitations—speckle noise, clutter, and coarser spatial resolution—motivate the method proposed in this work, which strengthens morphological representation, enforces cross-scale consistency, and selectively emphasizes salient defect cues in noisy acoustic backgrounds. Guided by this rationale, we next detail the forward-looking sonar (FLS) acquisition setting and its operational constraints, which directly inform the design choices of our sonar-based structural-defect object detector.
Recently, autonomous underwater vehicles outfitted with an FLS imaging system have emerged as a crucial instrument for inspecting hydraulic structures [3] since they can largely address the issues caused by signal attenuation and scattering. Despite the long range of underwater visibility, the use of FLS still faces challenges for underwater defect detection [4,5,6,7]. First, the low spatial resolution of underwater sonar imaging makes it difficult to identify small and weak defects. For example, small-scale defects—such as holes and fine cracks—are often low-contrast and poorly resolved in sonar imagery. In addition, sonar acquisitions are contaminated by speckle and reverberation, which further reduce the SNR and impair the representational fidelity and separability of defect signatures. These limitations of underwater sonar imaging and the inherent variability of underwater structural defects jointly render underwater defect detection a highly challenging task. To address these challenges, efforts have been devoted to enhancing the feature representation of multi-defects. Given the monochromatic, texture-dominated appearance of sonar images, previous methods [8,9,10,11,12,13] largely rely on handcrafted low-level features, e.g., edges and textures. Although these features have clear semantic information, they can hardly reach promising detection accuracy due to their limited generalization. Deep learning provides a data-driven way to mitigate this issue by learning features that adapt to complex underwater scenes. Nonetheless, under limited data, any high-capacity model—not only deep networks—can overfit, biasing the detector toward a narrow range of defect appearances. In addition, supervised deep learning benefits from sufficient labeled data; in hydraulic inspections where annotations are scarce, model generalization is constrained unless more training data are curated.
Concerning three typical underwater structural defects, i.e., cracks, holes, and spalling, this paper comprehensively addresses the above issues from two perspectives. First, we conducted underwater inspection tests to increase the number of underwater sonar images of multi-defects, providing a comprehensive dataset for training defect detection models. Second, we propose a novel hybrid neural network that comprehensively represents typical underwater structural defects by adaptively fusing multi-type feature representations. The main contributions of this work are summarized as follows:
- (1)
- An underwater structural defect dataset is built by conducting underwater inspection tests in a real-world water body; it contains 2500 images, sufficient for training most deep learning models.
- (2)
- A hybrid neural network is proposed, which integrates multi-scale, multi-type convolutions to enrich feature types. It enhances the feature representation of different types of underwater structural defects.
- (3)
- The adaptive spatial-aware attention module is introduced, which dynamically adjusts the receptive field to selectively fuse discriminative features for backend defect detection.
The remainder of this paper is organized as follows. Section 2 reviews related work on underwater structural defect detection and underwater sonar image object detection. The project background and underwater sonar defect dataset (USDD) are introduced in Section 3. Section 4 presents the novel hybrid neural network in detail. Experimental results along with comprehensive analysis are reported in Section 5. Finally, Section 6 concludes this paper.
2. Related Work
This study addresses object detection of structural defects (crack, hole, spalling) in sonar imagery. Accordingly, we organize this section into two parts: (i) multimodal object detection for structural defects, which surveys methods across optical and sonar modalities; and (ii) sonar object detection for underwater structures, which focuses specifically on acoustic imaging methods and sonar-specific challenges. Both subsections review object detection; the distinction is breadth across modalities vs. sonar-specific depth.
2.1. Multimodal Object Detection for Structural Defects
Underwater structural defects have emerged as a vital cause of hydraulic engineering risk. The most efficient way to swiftly detect underwater defects is through underwater imaging technologies. Optical and sonar imaging are the two representative modalities evaluated in this domain. Owing to their high spatial resolution and acquisition efficiency, optical systems have been widely adopted for the inspection of hydraulic structures. However, underwater optical imaging suffers from serious light absorption and scattering effects, which in turn cause information attenuation and distortion in underwater images, leading to a significant decline in defect detection accuracy. To solve this problem, previous methods commonly employ various mathematical techniques to enhance optical image quality. In general, underwater image enhancement techniques fall into three categories: contrast compensation, color correction, and hybrid methods. For contrast compensation, simple histogram equalization proved highly effective in the early stage, while more complex mathematical methods, e.g., empirical mode decomposition and biological vision-inspired methods, were proposed later. All of these methods have been evaluated for underwater defect detection and have proven effective at suppressing background clutter. Color-correction approaches, in particular, show promising performance for enhancing imagery of underwater concrete structures and improving the visibility of defects. Due to the complementarity between contrast compensation and color correction, many hybrid methods have been proposed to comprehensively enhance the quality of underwater images [14,15]. Other traditional mathematical morphology technologies [16] have also been utilized in underwater defect detectors. The advantage of these mathematical models lies in their light weight; however, this type of model faces significant challenges in adapting to complex hydraulic engineering environments, particularly in turbid waters and high-flow-rate areas. Additionally, these conventional methods have low detection efficiency and are often impractical for real-world tasks. Recently, deep learning-based underwater image enhancement models have been utilized in underwater structural defect detection, demonstrating superior performance compared with traditional mathematical models [17,18]. The main advantage of these learning-based methods lies in their applicability to various underwater detection tasks, which have gained significant attention, primarily in underwater image enhancement and object detection applications. Typical deep learning models, e.g., convolutional neural networks (CNNs) and generative adversarial networks (GANs), have been used to enhance the quality of underwater defect images. They learn and map the relationship between underwater images and their high-quality counterparts, correcting issues such as blurriness, uneven illumination, and color distortion. Deep learning models have also been applied in developing defect detectors [19]. Typical deep learning-based detection models, e.g., Faster R-CNN, YOLO, and SSD, have been evaluated in underwater defect detection, locating underwater structural defects based on image feature representations. For example, Zhuang et al. [20] demonstrated that domain-tailored deep learning models improve structural damage-level recognition after earthquakes.
In the realm of underwater defect detection, sonar technology stands as a promising tool since it can overcome the issues caused by light attenuation and scattering. In contrast to underwater optical imaging technologies, underwater sonar imaging offers a longer imaging range and a better representation of geometric shape. In general, sonar imaging detection requires only speckle noise removal and is independent of complex image enhancement computation. Traditional sonar imaging-based underwater defect detection methods rely on visual observation and expert experience with structural defect types. However, underwater sonar imaging results are distinct from perceptual intuition, making it considerably difficult to accurately interpret defect features. To address these difficulties, deep learning models have been investigated for underwater defect detection. Deep learning-based sonar image detection enables autonomous extraction and learning of defect features, eliminating the need for handcrafted feature design. Methodologically, the deep learning architectures for processing underwater sonar images are similar to those used for underwater optical images [21]. For example, the aforementioned CNN and YOLO models have been evaluated in sonar imaging detection, demonstrating strong performance in underwater structural defect detection.
In conclusion, deep learning demonstrates promising performance in underwater structural defect detection, yet it currently faces challenges in practical applications due to the scarcity of underwater defect datasets. Moreover, hydraulic structure inspection is commonly conducted in the field, where the computational resources required by deep learning are difficult to deploy.
2.2. Sonar Object Detection for Underwater Structures
This subsection focuses on underwater sonar imaging detection models for general underwater objects. Recent developments of the YOLO architecture, e.g., PP-YOLOv2 [22] and YOLOv3 [23], have led to its widespread adoption in sonar image processing owing to its real-time processing capability. Based on the YOLO baseline architecture, numerous attention mechanisms and denoising modules have been introduced to strengthen feature representations, achieving a favorable accuracy–efficiency balance. In addition, hybrid architectures [24] have been proposed by combining YOLO with Transformer architectures [25]. These hybrid architectures have been deployed on remotely operated vehicles (ROVs) [26], demonstrating their performance in practice. To improve the efficiency of global context representation, Transformer architectures have also been utilized in sonar image processing, with many studies [27] demonstrating performance improvements using Vision Transformers (ViT-B/16) and Swin Transformers (Swin-T).
Complementary to these detector-centric advances, Tan et al. [28] demonstrated a deep learning-assisted high-resolution sonar pipeline alongside a task-specific dataset that effectively detects local damages in underwater structures, highlighting the importance of tailored data and feature design for acoustic imaging. From a deployment perspective, Zheng et al. [29] developed a real-time sonar-based crack detector for underwater concrete, emphasizing latency and resource constraints typical of field operations. To lighten the model complexity, pruning technologies, e.g., Batch Normalization [30] and DepGraph [31], have been developed in underwater sonar image detection.
In conclusion, underwater sonar image-based object detection has broad room for development, yet few efforts have been devoted to this field due to the small available datasets. Existing underwater detection techniques primarily adapt terrestrial object detection technologies, a strategy that is somewhat problematic. First, mature object detection models are primarily designed for processing color optical images, making them ill-suited to sonar image modalities. Second, existing object detection models are trained on terrestrial images and therefore suffer from problems caused by complex underwater imaging environments. Finally, the morphological differences between underwater and terrestrial objects degrade the performance of detection methods. These gaps call for detectors natively designed for acoustic imagery rather than direct reuse of color-image heuristics. Motivated by this, we develop a sonar-oriented architecture that emphasizes morphological fidelity, parameter-efficient multi-scale fusion, and robustness to speckle/clutter. Unlike prior detectors that rely on a single convolutional morphology, scale-independent pyramids, or generic attention, our pipeline explicitly targets sonar-specific challenges—morphological diversity, scale variation, and speckle/clutter—via HanNet (multi-morphology aggregation), MSFP (parameter-shared pyramid), and ASAA (sonar-aware spatial attention). Together with USDD (2500 images) and a unified evaluation protocol, this clarifies where the gains originate and supports reproducible comparisons.
3. Underwater Structural Defect Dataset
As aforementioned, most existing intelligent detection technologies are heavily reliant on data-driven training. To address this, we have developed a comprehensive underwater structural defect dataset specifically designed for model fine-tuning and parameter optimization. Unlike previous datasets that focus on a single defect type, the proposed dataset incorporates multiple defect types, providing sufficient samples to cover various typical defects, i.e., cracks, holes, and spalling. This enables it to accommodate diverse hydraulic inspection tasks. To this end, an engineering simulation scenario was established at the Dangtu Scientific Research and Technology Development Base of Nanjing Hydraulic Research Institute. The experimental water was collected from the Huai River mainstream to ensure natural hydrological characteristics. The experimental site included a reinforced test pool with dimensions of 11.0 m × 5.9 m × 4.2 m (length × width × depth), with 3.4 m below ground level and a 0.8 m high surrounding wall reinforced with carbon fiber cloth. Simulated hydraulic structures, including tunnels and dams, were constructed using C30 concrete. The underwater tunnel measures 6.0 m × 4.0 m × 3.3 m (length × width × height), with one side attached to the pool wall and the tunnel end located 1.3 m from the boundary. A 5.9 m × 5.0 m water space was reserved as the deployment area for the underwater sonar imaging platform. The interior walls, ceiling, and floor are embedded with typical defect patterns, i.e., cracks, holes, and spalling. By removing the ceiling panel, the tunnel scene can be converted into a dam inspection scenario. The tunnel provides a clear internal width of 3.0 m and height of 2.08 m, offering sufficient operational space for the platform and its manipulator arm. The experimental platform and the data collection process are presented in Figure 1. This facility offers a standardized and controlled environment for dataset acquisition.
Figure 1.
Experimental platform for underwater structural inspection.
To enrich the available underwater data for model training, an underwater sonar defect dataset (USDD) was constructed. The raw data used in this study were extracted from underwater inspection videos recorded at a hydraulic engineering testing facility. Video frames were sampled to generate static images for further processing. Specifically, frames were sampled at a rate of 0.1 frames per second to reduce redundancy and improve the diversity of the dataset. In total, 2500 images were obtained, each with a fixed resolution of 1920 × 1080 pixels. The images were manually annotated using the LabelImg v1.8.6 tool, where structural defects were enclosed using the largest possible horizontal bounding boxes. The annotations were saved in XML format following the PASCAL VOC standard, ensuring that each image contains at least one defect label. Following annotation, all images were normalized using the dataset's calculated mean and standard deviation values. The XML annotation files were subsequently converted into TXT format, making them compatible with object detection model training pipelines. The data labeling process is illustrated in Figure 2.
Figure 2.
Image annotation and format–conversion pipeline: (a) raw sonar image of tunnel lining; (b) manual bounding-box annotation in LabelImg; (c) conversion from JSON to Pascal VOC XML (class + box) and then to YOLO TXT (You Only Look Once) used by the detector.
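As a concrete illustration of the annotation-conversion step, the following is a minimal Python sketch mapping one PASCAL VOC XML file to a YOLO-style TXT file. The class list and the fixed 1920 × 1080 resolution follow the description above; file paths and function names are hypothetical and not part of the authors' released tooling.

```python
# Hedged sketch of the VOC-XML -> YOLO-TXT conversion step; names and paths
# are illustrative assumptions, not the authors' exact pipeline.
import xml.etree.ElementTree as ET

CLASSES = ["crack", "hole", "spalling"]  # the three defect categories in USDD

def voc_to_yolo(xml_path, txt_path, img_w=1920, img_h=1080):
    """Convert one PASCAL VOC annotation to YOLO format:
    'class_id x_center y_center width height', all normalized to [0, 1]."""
    root = ET.parse(xml_path).getroot()
    lines = []
    for obj in root.iter("object"):
        cls_id = CLASSES.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        xc = (xmin + xmax) / 2.0 / img_w      # normalized box center x
        yc = (ymin + ymax) / 2.0 / img_h      # normalized box center y
        w = (xmax - xmin) / img_w             # normalized box width
        h = (ymax - ymin) / img_h             # normalized box height
        lines.append(f"{cls_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    with open(txt_path, "w") as f:
        f.write("\n".join(lines))
```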
4. Proposed Method
4.1. Framework of Hybrid Neural Network
In this paper, we propose a single-stage detector—Hybrid Neural Network—for precise and real-time detection of multiple underwater structural defects. As shown in Figure 3, it follows a backbone–neck–head design that is aligned with the characteristics of sonar imagery. Sonar defect imagery poses three coupled challenges: (i) diverse morphologies (thin cracks, circular/irregular cavities, spalling edges), (ii) large scale variation with weak contrast, and (iii) speckle-induced clutter and shadows. To address these, the backbone employs HanNet blocks to reduce morphology bias and extract multi-scale representations; the neck adopts an MSFP with ASAA to enhance cross-scale consistency with compact parameters while emphasizing salient defect cues and suppressing clutter; and the decoupled detection head contains a classification branch for defect categories (crack, hole, spalling) and a regression branch that estimates bounding-box center coordinates, width, and height.
Figure 3.
Framework of Hybrid Neural Network (HNN).
4.2. Hybrid Aggregation Module (HanNet)
Aiming to enrich the feature types for defect detection, we have developed HanNet to enhance the feature extraction capacity of the backbone network, as illustrated in Figure 4. HanNet integrates three complementary convolutional operations, each designed to capture specific attributes of underwater structural defects: 1 × 1 bypass convolution serves to recalibrate channel-wise responses by emphasizing or suppressing specific patterns across channels. This is particularly useful for mitigating fluctuations in sonar signal intensity caused by material heterogeneity or uneven acoustic reflectivity in underwater structures, thereby facilitating more effective separation of defect-relevant signals from irrelevant background noise. Depthwise separable convolution (DSConv) focuses on efficient spatial filtering within each channel, enabling precise localization of fine-grained structural defects, e.g., narrow cracks or small-scale discontinuities. By decoupling spatial and channel filtering, DSConv effectively captures the linear or elongated morphological characteristics of underwater defects. The Star module achieves nonlinear high-dimensional feature interactions through kernelized pairwise multiplications, which implicitly model higher-order relationships between pixels and regions. This mechanism is particularly suited for modeling irregular and complex texture patterns of large-area spalling defects. By synergistically combining these convolutional components, HanNet achieves a multi-type feature extraction that is sensitive to the diverse attributes of underwater structural defects, e.g., fine cracks (linear, small-scale), holes (regular, mid-scale), and spalling (irregular, large-scale). The novel HanNet can be modeled as follows:
$$X_{\mathrm{mid}} = f_{1\times1}(X_{\mathrm{in}}), \qquad X_{1} = f_{\mathrm{conv}}(X_{\mathrm{mid}}), \quad X_{2} = f_{\mathrm{ds}}(X_{\mathrm{mid}}), \quad X_{3} = f_{\mathrm{star}}(X_{\mathrm{mid}}),$$
where $X_{\mathrm{in}}$ denotes the input feature map from the backbone network with a channel number of $c$; $X_{\mathrm{mid}}$ represents the intermediate feature map obtained by applying a convolution to the input feature, with the channel number expanded to enhance feature representation; and $X_{1}$, $X_{2}$, $X_{3}$ denote the feature maps generated by the different branches (standard convolution, depthwise separable convolution, and the Star module) applied to $X_{\mathrm{mid}}$, each with a channel number of $c$ to ensure dimensional consistency. Finally, we fuse and compress the information of the three feature types through a concatenation operation followed by a 1 × 1 convolution to generate the output $X_{\mathrm{out}}$, as follows:
$$X_{\mathrm{out}} = f_{1\times1}\big([X_{1};\, X_{2};\, X_{3}]\big),$$
where $[\,\cdot\,;\,\cdot\,]$ indicates channel-wise concatenation, which merges the outputs of the multiple branches, and $f_{1\times1}$ denotes a 1 × 1 convolution with an output channel number of $2c$, thereby enhancing representational capacity while keeping the computational cost under control.
Figure 4.
Hybrid Aggregation Network (HanNet).
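For concreteness, the following is a minimal PyTorch sketch of a HanNet block under the description above. The channel-expansion ratio of the intermediate projection (elided in the text), the exact Star-module form, and the omission of normalization/activation layers are simplifying assumptions, not the authors' exact implementation.

```python
# Hedged sketch of a HanNet block: 1x1 bypass + depthwise separable + Star
# branches, concatenated and fused by a 1x1 convolution (c -> 2c output).
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable conv: per-channel spatial filter + 1x1 channel mix."""
    def __init__(self, c):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 3, padding=1, groups=c)
        self.pw = nn.Conv2d(c, c, 1)
    def forward(self, x):
        return self.pw(self.dw(x))

class Star(nn.Module):
    """Star operation: element-wise product of two linear projections,
    modeling kernelized pairwise (higher-order) feature interactions."""
    def __init__(self, c):
        super().__init__()
        self.f1 = nn.Conv2d(c, c, 1)
        self.f2 = nn.Conv2d(c, c, 1)
    def forward(self, x):
        return self.f1(x) * self.f2(x)

class HanNetBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.expand = nn.Conv2d(c, c, 1)        # intermediate projection (expansion ratio unspecified in text; identity width assumed)
        self.bypass = nn.Conv2d(c, c, 1)        # channel recalibration branch
        self.dsconv = DSConv(c)                 # fine/elongated structures (cracks)
        self.star = Star(c)                     # irregular large-area textures (spalling)
        self.fuse = nn.Conv2d(3 * c, 2 * c, 1)  # concat (3c) -> 2c output, as in the text

    def forward(self, x):
        m = self.expand(x)
        y = torch.cat([self.bypass(m), self.dsconv(m), self.star(m)], dim=1)
        return self.fuse(y)
```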
4.3. Multi-Scale Shared Feature Pyramid (MSFP)
The Multi-Scale Shared Feature Pyramid (MSFP) module is proposed to address the insufficient feature representation in underwater defect detection, as illustrated in Figure 5. This module employs multi-scale dilated convolutional layers with varying dilation rates (e.g., 1×, 2×, 4×) to hierarchically aggregate features. The architecture enables simultaneous capture of localized fine details (small dilation rates) and global contextual patterns (large dilation rates). To keep the module parameter-efficient, MSFP shares the same convolutional kernels across all dilation rates, which is particularly advantageous for underwater defect detection scenarios where real-time processing and model compactness are prioritized; the shared kernels trade a small amount of feature diversity for substantial parameter savings. Additionally, 1 × 1 convolutional layers are incorporated to refine the channel dimensions and facilitate effective feature integration. Unlike spatial pyramid pooling, MSFP relies on convolution-based feature extraction, thus offering greater flexibility and avoiding the information loss caused by pooling operations. In summary, the MSFP module provides an effective multi-scale feature extraction and fusion mechanism that balances computational efficiency with enhanced discriminative capability, enabling more robust and accurate detection of underwater structural multi-defects.
Figure 5.
Multi-scale shared feature pyramid.
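The parameter-sharing idea can be sketched as follows: a single 3 × 3 weight tensor is reused at dilation rates 1, 2, and 4 via PyTorch's functional convolution, so additional scales add no parameters. The fusion layer and the initialization scheme are assumptions for illustration.

```python
# Hedged sketch of the shared-kernel multi-dilation idea behind MSFP.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDilatedPyramid(nn.Module):
    def __init__(self, c, rates=(1, 2, 4)):
        super().__init__()
        self.rates = rates
        self.weight = nn.Parameter(torch.empty(c, c, 3, 3))  # one shared 3x3 kernel
        nn.init.kaiming_normal_(self.weight)                  # assumed initialization
        self.fuse = nn.Conv2d(len(rates) * c, c, 1)           # 1x1 channel refinement

    def forward(self, x):
        # Same kernel at each dilation rate; padding = rate keeps spatial size.
        feats = [F.conv2d(x, self.weight, padding=r, dilation=r) for r in self.rates]
        return self.fuse(torch.cat(feats, dim=1))
```

Because the three dilated branches reuse one weight tensor, the parameter count is that of a single 3 × 3 layer plus the 1 × 1 fusion, regardless of how many rates are used.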
4.4. Adaptive Spatial-Aware Attention (ASAA)
Static receptive fields, constrained by uniform kernel sizes, fundamentally impair a detector's ability to capture the heterogeneous spatial characteristics of underwater structural defects. To solve this problem, the ASAA module is integrated into our hybrid architecture, as illustrated in Figure 6. The ASAA module introduces an adaptive feature amplification mechanism that dynamically enhances discriminative defect attributes while suppressing irrelevant noise.
Figure 6.
Adaptive spatial-aware attention module.
The ASAA module uses contextual feature fusion and spatial re-weighting refinement stages to sequentially learn from the backbone and neck features ($H$ and $N$). First, multi-scale local features from $H$ and global semantic features from $N$ are fused into a unified representation. The fine-grained structural cues (e.g., crack textures) and the contextual information (e.g., spatial layout of spalling regions) are integrated, which allows our model to accurately identify structural defects at different scales, as follows:
$$U = f_{pw}\big(\big[f_{dw}(H);\ f_{dc}(N);\ \mathrm{Avg}(N);\ \mathrm{Max}(N)\big]\big),$$
where $f_{dw}$ and $f_{dc}$ denote depthwise and dilated convolutions, $f_{pw}$ is a point-wise convolution, and $[\,\cdot\,;\,\cdot\,]$ indicates concatenation, while Avg and Max represent average and max pooling for global context extraction.
The fused feature map $U$ is then projected into two distinct attention channels to generate spatial attention features $A_{1}$ and $A_{2}$:
$$A_{1} = f_{pw}^{1}(U), \qquad A_{2} = f_{pw}^{2}(U),$$
where $A_{1}$ and $A_{2}$, respectively, represent the features highlighting local and global spatial attention regions.
In the spatial attention refinement stage, cross-channel interaction is employed to mutually enhance local precision and global context:
$$\tilde{A}_{1} = A_{1} \odot \sigma(A_{2}), \qquad \tilde{A}_{2} = A_{2} \odot \sigma(A_{1}).$$
The refined feature maps are then multiplicatively combined:
$$M = \tilde{A}_{1} \odot \tilde{A}_{2}.$$
Finally, an attention map is generated using a point-wise convolution:
$$A = \sigma\big(f_{pw}(M)\big).$$
This output encodes both precise local cues and enriched global context, enabling more accurate and robust defect localization in subsequent decoder stages.
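To make this flow concrete, below is a hedged PyTorch sketch of one plausible ASAA realization. The kernel sizes, dilation rate, channel-wise pooling for global context, and sigmoid gating are illustrative assumptions consistent with the equations above, not the authors' exact design.

```python
# Hedged sketch of ASAA: fuse backbone (H) and neck (N) features, split into
# two attention streams, cross-gate them, combine multiplicatively, and emit
# an attention map via a point-wise convolution.
import torch
import torch.nn as nn

class ASAA(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.f_dw = nn.Conv2d(c, c, 3, padding=1, groups=c)    # depthwise conv on H
        self.f_dc = nn.Conv2d(c, c, 3, padding=2, dilation=2)  # dilated conv on N (rate assumed)
        self.f_pw = nn.Conv2d(2 * c + 2, c, 1)                 # point-wise fusion
        self.proj1 = nn.Conv2d(c, c, 1)                        # local attention stream A1
        self.proj2 = nn.Conv2d(c, c, 1)                        # global attention stream A2
        self.out = nn.Conv2d(c, c, 1)                          # final attention map

    def forward(self, H, N):
        # Stage 1: contextual feature fusion with pooled global statistics of N.
        avg = torch.mean(N, dim=1, keepdim=True)               # channel-average context
        mx, _ = torch.max(N, dim=1, keepdim=True)              # channel-max context
        U = self.f_pw(torch.cat([self.f_dw(H), self.f_dc(N), avg, mx], dim=1))
        # Stage 2: cross-channel interaction between the two attention streams.
        A1, A2 = self.proj1(U), self.proj2(U)
        A1r = A1 * torch.sigmoid(A2)                           # global gates local
        A2r = A2 * torch.sigmoid(A1)                           # local gates global
        M = A1r * A2r                                          # multiplicative fusion
        return torch.sigmoid(self.out(M))                      # attention map A
```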
5. Experiments and Analysis
An ablation study was conducted to examine the intermediate representations in our proposed Hybrid Neural Network and to present the layerwise contribution of each module to defect detection. Moreover, our model was compared with other state-of-the-art methods in terms of both objective and subjective evaluations. These comparative experiments comprehensively validate the effectiveness and efficiency of our proposed network.
5.1. Experimental Environment
The experimental environment is set up as follows: the computer features an Intel Core i7-14700KF CPU and an Nvidia GeForce RTX 4070Ti GPU, along with 32 GB of DDR4 3200 MHz RAM. We used the PyTorch 1.10.2 deep learning framework, based on CUDA 11.3 and cuDNN 8.0, within the PyCharm 2021.3 integrated development environment.
5.2. Training Strategy
The dataset is partitioned into training/validation/test using a stratified 7:2:1 split at the image level to preserve per-class proportions (crack, hole, spalling). Unless otherwise noted, all methods are trained under a unified protocol: identical input resolution, the Adam optimizer with weight decay, a batch size of 4, 200 epochs, and cosine annealing of the learning rate with a 10-epoch warm-up. We fix the random seed (2024) for reproducibility and, where stated, report results as mean ± std over repeated runs. To make the sampling rationale explicit and reproducible, we next summarize the class statistics of USDD and quantify label skew for crack, hole, and spalling. These summaries motivate our use of multi-label stratified sampling and provide the basis for the 7:2:1 split.
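As a concrete illustration of this protocol, the following is a minimal PyTorch sketch of Adam with a 10-epoch linear warm-up followed by cosine annealing over 200 epochs. The placeholder model and the learning-rate/weight-decay values (elided in the text above) are assumptions for illustration only.

```python
# Hedged sketch of the stated schedule: linear warm-up, then cosine annealing.
import math
import torch

model = torch.nn.Conv2d(3, 16, 3)                # placeholder model, not the HNN
base_lr = 1e-3                                   # assumed value; elided in the text
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr, weight_decay=5e-4)

warmup_epochs, total_epochs = 10, 200

def lr_lambda(epoch):
    if epoch < warmup_epochs:                    # linear warm-up over 10 epochs
        return (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * t))   # cosine decay toward zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(total_epochs):
    # ... one training epoch over the USDD training split ...
    optimizer.step()                             # stand-in for the epoch's updates
    scheduler.step()
```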
We report the numbers of annotated instances for crack, hole, and spalling in Table 1 and quantify label skew using two complementary metrics.
$$\mathrm{IR} = \frac{n_{\max}}{n_{\min}}, \qquad H_{\mathrm{norm}} = -\frac{1}{\ln |\mathcal{C}|}\sum_{i \in \mathcal{C}} p_{i}\,\ln p_{i},$$
where $\mathrm{IR}$ denotes the Imbalance Ratio; $\mathcal{C}$ is the set of classes, $\mathcal{C} = \{\text{crack}, \text{hole}, \text{spalling}\}$, hence $|\mathcal{C}| = 3$; $i$ are class indices that iterate over the elements of $\mathcal{C}$; $n_{i}$ is the count for class $i$; $n_{\max}$ and $n_{\min}$ are the maximum and minimum class counts across $\mathcal{C}$; $p_{i}$ is the relative proportion of class $i$ (i.e., $p_{i} = n_{i} / \sum_{j \in \mathcal{C}} n_{j}$); and $H_{\mathrm{norm}}$ is the normalized entropy of the class distribution, taking values in $[0, 1]$, with $H_{\mathrm{norm}} = 1$ when all classes are equally represented ($p_{i} = 1/|\mathcal{C}|$ for all $i$) and $H_{\mathrm{norm}} = 0$ when all samples belong to a single class.
Table 1.
Per-class instance counts and imbalance metrics.
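For reproducibility, the following is a minimal sketch computing the two skew metrics defined above; the class counts shown are hypothetical placeholders, not the USDD values in Table 1.

```python
# Hedged sketch of the imbalance metrics: IR and normalized entropy.
import math

counts = {"crack": 1200, "hole": 800, "spalling": 500}  # hypothetical counts

n = list(counts.values())
IR = max(n) / min(n)                                    # imbalance ratio

total = sum(n)
p = [c / total for c in n]                              # class proportions p_i
H_norm = -sum(pi * math.log(pi) for pi in p) / math.log(len(n))  # in [0, 1]

print(f"IR = {IR:.3f}, H_norm = {H_norm:.3f}")
```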
We adopt stratified sampling to preserve per-class proportions across subsets, with a fixed seed 2024. We produce five independent splits and report mean ± std across runs. With 2500 images in total, the final partition sizes are 1750/500/250 for train/val/test, respectively. Our sonar images are class-imbalanced and some contain multiple defect types; random splitting can under-represent minority classes in val/test and inflate metric variance. Stratified splitting consistently reduces the standard deviation of mAP and per-class recall across the five runs (detailed numbers in Table 2). This practice is consistent with prior work that applies stratified sampling to imbalanced operational datasets [32].
Table 2.
Stratified 7:2:1 split (seed = 2024): per-class instance counts per subset.
Totals across subsets (instances) for crack, hole, and spalling, as well as the overall total, are given in Table 2; per-subset instance totals preserve the global proportions by design.
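A hedged sketch of one way to realize the image-level stratified split is given below. Grouping images by their exact set of defect labels is a simple approximation of multi-label stratification; all names are illustrative, not the authors' code.

```python
# Hedged sketch of a stratified 7:2:1 split with a fixed seed (2024).
import random
from collections import defaultdict

def stratified_split(image_labels, seed=2024, ratios=(0.7, 0.2, 0.1)):
    """image_labels: dict mapping image id -> list of defect classes present."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for img, labels in image_labels.items():
        groups[tuple(sorted(set(labels)))].append(img)  # group by label set
    train, val, test = [], [], []
    for _, imgs in sorted(groups.items()):
        rng.shuffle(imgs)
        n = len(imgs)
        a = round(ratios[0] * n)                        # ~70% of the group
        b = a + round(ratios[1] * n)                    # next ~20% of the group
        train += imgs[:a]
        val += imgs[a:b]
        test += imgs[b:]
    return train, val, test
```

Splitting each label-set group 7:2:1 guarantees that every class combination, including rare ones, appears in all three subsets in roughly global proportions.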
5.3. Ablation Experiment
We selected the feature outputs from the HanNet, MSFP, and ASAA network layers. The visualized results are shown in Figure 7, Figure 8 and Figure 9, respectively, and the quantitative results are reported in Table 3. As shown in Figure 7, feature maps generated by HanNet (Figure 7b) demonstrate enhanced boundary sharpness and improved defect discriminability. In contrast, the boundary features extracted by the baseline model (Figure 7a) exhibit noticeable blurriness, which significantly impairs the discriminability between distinct defect categories.
Figure 7.
Feature visualization. (a) Feature map output from the baseline model; (b) feature map output from the HanNet module.
Figure 8.
Feature visualization. (a) Feature map output from the baseline model; (b) feature map output from the proposed MSFP module.
Figure 9.
Feature visualization. (a) Feature map output from the baseline model; (b) feature map output from the ASAA module.
Table 3.
Results of the ablation experiment.
The comparative analysis between the baseline model (Figure 8a) and the proposed MSFP (Figure 8b) reveals a substantial enhancement in feature fusion. As illustrated in Figure 8, the results from the baseline model suffer from blurred feature representations, which degrades its ability to preserve fine-grained details. In contrast, the MSFP-generated feature maps exhibit improved clarity, with well-defined textures, coherent structures, and sharp boundaries.
The feature maps generated by the ASAA module (Figure 9b) exhibit a marked improvement in defect representation compared to the baseline. Without ASAA (Figure 9a), the feature map displays pronounced blurriness, spatially scattered activations with diffused distributions, and a lack of well-defined structural clusters. In contrast, the ASAA-enhanced feature map presents more continuous and sharply delineated high-response regions, reflecting targeted feature activation.
In summary, our proposed Hybrid Neural Network model is compact, with its layers being distinctive yet collaborative: some suppress noise while others strengthen feature representation. These modules collectively contribute to accurate, robust, and reliable underwater structural multi-defect detection.
5.4. Comparison Experiment
To systematically evaluate the proposed method under small-target and real-time constraints, we compare with eight representative detectors at a unified input resolution and a shared compute budget. Faster R-CNN [33] serves as a two-stage, accuracy-oriented reference with strong region-level recall on small objects but higher latency; RetinaNet [34] is a one-stage baseline whose Focal Loss mitigates severe foreground–background imbalance typical of sparse small defects. YOLOv8n [35] is a widely used lightweight real-time baseline within our FLOPs/parameter limits; YOLOv9s [36] introduces GELAN/PGI to enhance information flow and small-object recall at similar FLOPs. YOLOv10 [37,38] adopts an end-to-end, NMS-free design that improves the accuracy–latency Pareto frontier and has been validated on small, occluded PPE targets in safety-helmet monitoring with surveillance and body-worn cameras. YOLOv11 provides a contemporary real-time baseline under identical input resolution; Mamba-YOLO [39] integrates state-space modules to model long-range dependencies in linear time, potentially benefiting elongated crack patterns; and YOLOv12 [40] evaluates attention-centric refinements for tiny targets while maintaining FPS. Qualitative comparisons are shown in Figure 10, Figure 11, Figure 12 and Figure 13, and quantitative results in Table 4. From these results, Faster R-CNN demonstrated competitive hole/spalling detection (mAP50: 78.2/76.5) but exhibited critical limitations in crack detection (mAP50: 42.1), while RetinaNet outperformed it in crack detection (mAP50: 48.7) yet failed in large spalling detection (mAP50: 51.3). YOLOv8 achieved exceptional hole/spalling detection (mAP50: 83.4/81.2) but suffered from redundant detections and poor crack detection (mAP50: 45.9). YOLOv9 maintained strong hole/spalling performance (mAP50: 82.1/79.8) with marginal crack detection improvements (mAP50: 50.3), whereas YOLOv10 exhibited balanced hole detection (mAP50: 75.4) but significantly lagged in crack/spalling detection (mAP50: 43.2/52.7). YOLOv11 emerged as a strong competitor, excelling in crack/spalling detection (mAP50: 85.3/80.6) though slightly inferior in hole detection (mAP50: 78.9). Mamba-YOLO showcased superior crack/hole detection (mAP50: 86.2/82.5) but demonstrated critical deficiencies in spalling detection (mAP50: 54.1), while YOLOv12 delivered promising spalling/hole detection (mAP50: 83.7/81.5) yet struggled with fine cracks (mAP50: 47.8). The proposed method achieved state-of-the-art performance across all defect categories, with the highest mAP50 scores: 92.4 (cracks), 90.1 (holes), and 89.7 (spalling). This superiority stems from its innovative multi-scale feature fusion architecture and adaptive attention mechanisms, effectively addressing challenges in underwater structural multi-defects detection.
Figure 10.
Crack results on the USDD dataset. Detection results of Faster R-CNN, RetinaNet, YOLOv8n, YOLOv9, YOLOv10, YOLOv11, Mamba-YOLO, YOLOv12, and our proposed model on test images from the USDD dataset.
Figure 11.
Spalling results on the USDD dataset. Detection results of Faster R-CNN, RetinaNet, YOLOv8n, YOLOv9, YOLOv10, YOLOv11, Mamba-YOLO, YOLOv12, and our proposed model on test images from the USDD dataset.
Figure 12.
Hole results on the USDD dataset. Detection results of Faster R-CNN, RetinaNet, YOLOv8n, YOLOv9, YOLOv10, YOLOv11, Mamba-YOLO, YOLOv12, and our proposed model on test images from the USDD dataset.
Figure 13.
Spalling results on the USDD dataset. Detection results of Faster R-CNN, RetinaNet, YOLOv8n, YOLOv9, YOLOv10, YOLOv11, Mamba-YOLO, YOLOv12, and our proposed model on test images from the USDD dataset; additional qualitative examples under the same experimental settings as Figure 11.
Table 4.
Comparison of different methods on the USDD dataset.
To fairly assess the efficiency of the proposed detector under real-time constraints, we benchmark eight representative frameworks at a unified input size and report model complexity and throughput. As shown in Table 5, two-stage/large one-stage baselines are heavy and slow (e.g., Faster R-CNN: 41.40 M/173.6 G/32.6 FPS; RetinaNet: 36.52 M/182.5 G/51.4 FPS), whereas our model attains the best throughput at comparable computation (2.87 M params, 8.3 GFLOPs, 190.2 FPS), exceeding YOLOv8n (177.3 FPS) and YOLOv10 (187.3 FPS). Among the remaining baselines, YOLOv9s (7.16 M, 26.8 G) reaches 67.6 FPS; YOLOv11 (2.58 M, 6.3 G) delivers 93.8 FPS; Mamba-YOLO is relatively heavy (26.9 M, 74.2 G) and runs at 73.5 FPS; and YOLOv12 (2.55 M, 6.4 G) achieves 129.1 FPS. Overall, our detector offers the highest measured throughput without inflating FLOPs or parameter count, situating it on a favorable accuracy–efficiency frontier.
Table 5.
Complexity and speed comparison.
6. Conclusions
This paper presents a hybrid neural network for multi-defect detection in underwater environments. The architecture jointly integrates HanNet, MSFP, and ASAA modules, leveraging their complementary strengths to enable accurate detection of diverse defect types under challenging conditions. To address the challenges of underwater multi-defect detection, this paper introduces a comprehensive dataset specifically designed for underwater structural defects, i.e., Underwater Structural Defect Dataset (USDD), which serves as a foundational resource for data-driven model training. Experimental results on USDD demonstrate that our method outperforms existing methods, achieving superior detection accuracy and robustness.
Despite these promising results, there are areas for improvement in future work. The proposed model still experiences missed or erroneous detections for complex crack geometries, and its feature representation requires further enhancement for such cases. Furthermore, noise reduction for sonar-based detection is essential to improving defect detection quality and ensuring the accurate identification of defects. Lastly, future research should also explore methods for measuring defect attributes, e.g., length, width, and depth.
Author Contributions
Conceptualization, C.M. and Z.C.; methodology, C.M.; software, H.W.; validation, G.S.; formal analysis, Z.C.; investigation, H.W.; resources, G.S.; data curation, Z.C.; writing—original draft preparation, C.M.; writing—review and editing, Z.C.; visualization, G.S.; supervision, Z.C.; project administration, H.W.; funding acquisition, Z.C. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported in part by the National Key Research and Development Program of China under Grant 2024YFC3212400; in part by the National Natural Science Foundation of China under Grant U23B20150; in part by the Jiangsu Province Young Scientific and Technological Talents Promotion Plan under Grant JSTJ-2024-082.
Data Availability Statement
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
Conflicts of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- Hożyń, S. A Review of Underwater Mine Detection and Classification in Sonar Imagery. Electronics 2021, 10, 2943. [Google Scholar] [CrossRef]
- Demarco, K.J.; West, M.E.; Howard, A.M. Sonar-Based Detection and Tracking of a Diver for Underwater Human-Robot Interaction Scenarios. In Proceedings of the 2013 IEEE International Conference on Systems, Man, and Cybernetics, Manchester, UK, 13–16 October 2013. [Google Scholar]
- Hurtos, N.; Palomeras, N.; Nagappa, S.; Salvi, J. Automatic detection of underwater chain links using a forward-looking sonar. In Proceedings of the OCEANS, Bergen, Norway, 10–14 June 2013. [Google Scholar]
- Liu, Z.; Ye, X.; Guo, S.; Xing, H.; Li, Y. Profile Fitting-based Small Target Detection in Water for Side-scan Sonar Image. In Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA), Takamatsu, Japan, 8–11 August 2021. [Google Scholar]
- Karoui, I.; Quidu, I.; Legris, M. Automatic Sea-Surface Obstacle Detection and Tracking in Forward-Looking Sonar Image Sequences. IEEE Trans. Geosci. Remote Sens. 2015, 53, 4661–4669. [Google Scholar] [CrossRef]
- Abu, A.; Diamant, R. Enhanced Fuzzy-Based Local Information Algorithm for Sonar Image Segmentation. IEEE Trans. Image Process. 2019, 29, 445–460. [Google Scholar] [CrossRef]
- Zhou, T.; Si, J.; Wang, L.; Xu, C.; Yu, X. Automatic detection of underwater small targets using forward-looking sonar images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4207912. [Google Scholar] [CrossRef]
- Ercan, M.F.; Muhammad, N.I.; Sirhan, M.R.N. Underwater target detection using deep learning. In Proceedings of the TENCON 2022—IEEE Region 10 Conference (TENCON), Hong Kong, 1–4 November 2022; pp. 1–5. [Google Scholar]
- Deng, Y.; Tang, F.; Dong, W.; Ma, C.; Pan, X.; Wang, L.; Xu, C. Stytr2: Image style transfer with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11326–11336. [Google Scholar]
- Khosla, C.; Saini, B.S. Enhancing performance of deep learning models with different data augmentation techniques: A survey. In Proceedings of the 2020 International Conference on Intelligent Engineering and Management (ICIEM), London, UK, 17–19 June 2020; pp. 79–85. [Google Scholar]
- Niu, B.; Ma, Y.; Qi, Y. Style transfer of image target region based on deep learning. In Proceedings of the 6th International Symposium on Computer and Information Processing Technology (ISCIPT), Changsha, China, 11–13 June 2021; pp. 486–489. [Google Scholar]
- Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
- Swathi, S.; Rajalakshmi, M.; Senniappan, V. Deep learning: A detailed analysis of various image augmentation techniques. In Proceedings of the 2023 International Conference on Networking and Communications (ICNWC), Chennai, India, 5–6 April 2023; pp. 1–6. [Google Scholar]
- Shi, P.; Fan, X.; Ni, J.; Wang, G. A detection and classification approach for underwater dam cracks. Struct. Health Monit. 2016, 15, 541–554. [Google Scholar] [CrossRef]
- Xin, G.; Fan, X.; Shi, P.; Luo, C.; Ni, J.; Cao, Y. A fine extraction algorithm for image-based surface cracks in underwater dams. Meas. Sci. Technol. 2022, 34, 035402. [Google Scholar] [CrossRef]
- Fan, X.; Wu, J.; Shi, P.; Zhang, X.; Xie, Y. A novel automatic dam crack detection algorithm based on local-global clustering. Multimed. Tools Appl. 2018, 77, 26581–26599. [Google Scholar] [CrossRef]
- Zhang, J.; Zhang, J. An improved nondestructive semantic segmentation method for concrete dam surface crack images with high resolution. Math. Probl. Eng. 2020, 2020, 5054740. [Google Scholar] [CrossRef]
- Qi, Z.; Liu, D.; Zhang, J.; Chen, J. Micro-concrete crack detection of underwater structures based on convolutional neural network. Mach. Vis. Appl. 2022, 33, 74. [Google Scholar] [CrossRef]
- Fan, X.; Cao, P.; Shi, P.; Chen, X.; Zhou, X.; Gong, Q. An underwater dam crack image segmentation method based on multi-level adversarial transfer learning. Neurocomputing 2022, 505, 19–29. [Google Scholar] [CrossRef]
- Zhuang, X.; Tran, T.V.; Nguyen-Xuan, H.; Rabczuk, T. Deep learning-based post-earthquake structural damage level recognition. Comput. Struct. 2025, 315, 107761. [Google Scholar] [CrossRef]
- Li, Y.; Bao, T.; Huang, X.; Chen, H.; Xu, B.; Shu, X.; Zhou, Y.; Cao, Q.; Tu, J.; Wang, R.; et al. Underwater crack pixel-wise identification and quantification for dams via lightweight semantic segmentation and transfer learning. Autom. Constr. 2022, 144, 104600. [Google Scholar] [CrossRef]
- Wang, F.; Li, H.; Wang, K.; Su, L.; Li, J.; Zhang, L. An Improved Object Detection Method for Underwater Sonar Image Based on PP-YOLOv2. J. Sens. 2022, 2022, 5827499. [Google Scholar] [CrossRef]
- Wang, F.; Wang, X.; Zhou, J.; Liu, M. An underwater object detection method for sonar image based on YOLOv3 model. J. Electron. Inf. Technol. 2022, 44, 3419–3426. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–10. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Chen, R.; Zhan, S.; Chen, Y. Underwater target detection algorithm based on YOLO and Swin transformer for sonar images. In Proceedings of the OCEANS 2022, Hampton Roads, VA, USA, 17–20 October 2022; pp. 1–7. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
- Tan, H.; Zheng, L.; Ma, C.; Xu, Y.; Sun, Y. Deep learning-assisted high-resolution sonar detection of local damage in underwater structures. Autom. Constr. 2024, 164, 105479. [Google Scholar] [CrossRef]
- Zheng, L.; Tan, H.; Ma, C.; Ding, X.; Sun, Y. A real-time crack detection approach for underwater concrete structures using sonar and deep learning. Ocean Eng. 2025, 322, 120582. [Google Scholar] [CrossRef]
- Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2736–2744. [Google Scholar]
- Fang, G.; Ma, X.; Song, M.; Mi, M.B.; Wang, X. Depgraph: Towards any structural pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 16091–16101. [Google Scholar]
- Wang, S. A hybrid SMOTE and Trans-CWGAN for data imbalance in real operational AHU AFDD: A case study of an auditorium building. Energy Build. 2025, 348, 116447. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics; Ultralytics: Frederick, MD, USA, 2023. [Google Scholar]
- Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. Yolov9: Learning what you want to learn using programmable gradient information. In European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2024; pp. 1–21. [Google Scholar]
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
- Wang, S.; Park, S.; Kim, J.; Kim, J. Safety helmet monitoring on construction sites using YOLOv10 and advanced transformer architectures with surveillance and body-worn cameras. J. Constr. Eng. Manag. 2025, 151, 04025186. [Google Scholar] [CrossRef]
- Zhao, Z.; He, P. Yolo-mamba: Object detection method for infrared aerial images. Signal Image Video Process. 2024, 18, 8793–8803. [Google Scholar] [CrossRef]
- Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]